US11726757B2 - Processor for performing dynamic programming according to an instruction, and a method for configuring a processor for dynamic programming via an instruction - Google Patents
- Publication number: US11726757B2
- Authority: US (United States)
- Prior art keywords: matrix, instruction, states, recited, computed
- Legal status: Active, expires (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F9/3887—Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
- G06F8/451—Code distribution (compilation exploiting coarse-grain parallelism)
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30098—Register arrangements
- G06F9/30123—Organisation of register space according to context, e.g. thread buffers
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Definitions
- This application is directed, in general, to dynamic programming and, more specifically, to configuring processors to perform dynamic programming to, for example, create specialized execution cores designed to accelerate matrix computations, such as matrix computations of the inner loop of genomics for sequence alignment.
- Dynamic programming solves complex problems by decomposing them into subproblems that are iteratively solved.
- The subproblems are ordered, and the results of subproblems that appear earlier in the ordering are used to solve subproblems that appear later in the ordering.
- A matrix can be used to store the results of earlier subproblems for computing the later-appearing subproblems.
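As an illustration of this pattern (not taken from the patent), the following minimal Python sketch stores subproblem results in a matrix, where each later entry is computed from two earlier entries; the `grid_paths` problem is a hypothetical example chosen only to show the ordering.

```python
def grid_paths(m, n):
    """Count monotone lattice paths from the top-left to the bottom-right
    of an m-by-n grid, moving only right or down."""
    dp = [[0] * n for _ in range(m)]        # matrix of subproblem results
    for i in range(m):
        for j in range(n):
            if i == 0 or j == 0:
                dp[i][j] = 1                # base subproblems
            else:
                # a later subproblem reuses two earlier results
                dp[i][j] = dp[i - 1][j] + dp[i][j - 1]
    return dp[m - 1][n - 1]
```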
- Genomics is an area within the field of biology that is interested in the sequencing and analysis of an organism's genome. Genomics is directed to such areas as determining complete DNA sequences and performing genetic mapping to further the understanding of biological systems.
- A complete pipeline for performing assembly from reads involves seeding, filtering, alignment, consensus, and variant calling.
- The core operation in most genomics applications is sequence alignment, which can be reference-based or de novo.
- The alignment can be performed via dynamic programming using various algorithms that build an array, wherein each cell or element of the array represents a subproblem of the overall alignment problem, and only the current anti-diagonal of the array is stored at any given time.
- The Smith-Waterman algorithm is an example of an algorithm used for alignment.
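For reference, a textbook linear-gap Smith-Waterman score computation can be sketched in Python as follows; the scoring parameters (match +2, mismatch -1, gap -1) are illustrative choices, not values specified by the patent.

```python
def smith_waterman(R, Q, match=2, mismatch=-1, gap=-1):
    """Score of the best local alignment between reference R and query Q."""
    m, n = len(R), len(Q)
    # H[i][j] holds the best local alignment score ending at R[:i], Q[:j]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if R[i - 1] == Q[j - 1] else mismatch
            H[i][j] = max(0,                       # restart the local alignment
                          H[i - 1][j - 1] + s,     # match or substitution
                          H[i - 1][j] + gap,       # deletion
                          H[i][j - 1] + gap)       # insertion
            best = max(best, H[i][j])
    return best
```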
- A method of configuring a processor for dynamic programming according to an instruction includes: (1) receiving, by execution cores of the processor, an instruction that directs the execution cores to compute a set of recurrence equations employing a matrix, (2) configuring the execution cores, according to the set of recurrence equations, to compute states for elements of the matrix, and (3) storing the computed states for current elements of the matrix in registers of the execution cores, wherein the computed states are determined based on the set of recurrence equations and input data.
- In another aspect, a processor includes: (1) a memory configured to store input code including an instruction that specifies mathematical operations to compute a set of recurrence equations employing a matrix, and (2) at least one execution core configured to receive the instruction and input data, perform the mathematical operations on the input data to generate computed states, and store the computed states for current elements of the matrix in at least one register of the execution core.
- In yet another aspect, the disclosure provides a method of computing a modified Smith-Waterman algorithm employing an instruction for configuring a parallel processing unit (PPU).
- The method of computing includes: (1) receiving, by execution cores of the PPU, an instruction that directs the execution cores to compute a set of recurrence equations for the modified Smith-Waterman algorithm employing a matrix, (2) configuring the execution cores, according to the set of recurrence equations, to compute states for elements of the matrix in parallel and in swaths, and (3) computing the states for current elements of the matrix in swaths, wherein the computed states are determined based on the set of recurrence equations and input sequences.
- FIG. 1 illustrates a block diagram of an example of a processor for performing dynamic programming according to the principles of the disclosure
- FIG. 2 illustrates a block diagram of an example of a parallel processing unit constructed according to the principles of the disclosure
- FIG. 3 illustrates a block diagram of an example of a parallel processing unit having multiple multiprocessors
- FIG. 4 illustrates a diagram of an example of a matrix employed for computing a set of recurrence equations according to the principles of the disclosure
- FIG. 5 illustrates a diagram of an example of a matrix divided into swaths for computing a set of recurrence equations according to the principles of the disclosure
- FIG. 6 illustrates a flow diagram of an example of a method for configuring a processor, via an instruction, for dynamic programming
- FIG. 7 illustrates a flow diagram of an example of a method for computing, via an instruction, a modified Smith Waterman algorithm carried out according to the principles of the disclosure.
- The disclosure provides processors that are configured to perform dynamic programming according to an instruction.
- The instruction can be, or be part of an instruction set that is, abstract assembly statements, intermediate representation (IR) statements, or assembly language instructions associated with a specific instruction set architecture (ISA) for a specific processing unit.
- For example, the instruction can be a Parallel Thread Execution (PTX) instruction used with graphics processing units (GPUs) from Nvidia Corporation of Santa Clara, Calif., that is translated at install time to a target hardware instruction set.
- Alternatively, the instruction can be part of the target hardware instruction set that configures the native hardware of a processor without needing a translation.
- The instruction can then be backward compatible with abstract assembly statements, such as PTX, or another pseudo-assembly language.
- The instruction can be adapted to configure different types of processors, such as parallel processors or serial processors.
- Accordingly, the disclosure relates to configuring both a central processing unit (CPU) and a GPU to perform dynamic programming via an instruction.
- The disclosure provides a processor solution that improves on slower, software-only solutions and on dedicated hardware solutions, which require processing area that sits unused when dynamic programming is not being performed.
- For example, the instruction may be directed to accelerating gene and protein sequence alignments using dynamic programming that is faster than a software solution and does not waste hardware when alignments are not being performed.
- The disclosure provides an instruction that configures a processor for dynamic programming by computing a new anti-diagonal of a matrix each cycle.
- The elements of the anti-diagonal can be computed in swaths, which is advantageous when the matrix is large, for example, 100 by 100.
- A swath is a designated number of rows of a matrix that are computed in parallel (including substantially in parallel) during a cycle.
- The size, or number of rows, of a swath can correspond to the number of execution cores designated for computing elements of the matrix in parallel during a cycle.
- The swath size can be set by the instruction.
- FIGS. 4-5 provide examples of a matrix and swaths of 16 and 8 rows, respectively.
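To make the anti-diagonal ordering concrete, here is a hedged Python sketch (a software model, not the patent's hardware implementation) that computes a linear-gap Smith-Waterman score while keeping only the two most recent anti-diagonals: every cell on anti-diagonal k = i + j depends only on diagonals k-1 and k-2, so cells on the same diagonal are mutually independent.

```python
def sw_antidiagonal(R, Q, match=2, mismatch=-1, gap=-1):
    """Linear-gap Smith-Waterman score, sweeping anti-diagonal by
    anti-diagonal and retaining only the two most recent ones."""
    m, n = len(R), len(Q)
    prev2, prev1 = {}, {}   # states on anti-diagonals k-2 and k-1
    best = 0
    for k in range(2, m + n + 1):            # anti-diagonal index k = i + j
        curr = {}
        # every cell on this anti-diagonal is independent of the others,
        # so a parallel processor could assign one execution core per cell
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            j = k - i
            s = match if R[i - 1] == Q[j - 1] else mismatch
            curr[(i, j)] = max(0,
                               prev2.get((i - 1, j - 1), 0) + s,
                               prev1.get((i - 1, j), 0) + gap,
                               prev1.get((i, j - 1), 0) + gap)
            best = max(best, curr[(i, j)])
        prev2, prev1 = prev1, curr
    return best
```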
- The hardware of a processor can be configured, i.e., arranged and ordered, to implement an instruction for dynamic programming, for example via register-transfer level (RTL) design.
- A compiler can also be employed to generate the instruction for configuring the hardware logic for dynamic programming.
- Different instructions can be used to configure the hardware to compute different sets of recurrence equations via dynamic programming.
- Various sets of recurrence equations for the different instructions can be stored in libraries.
- The libraries can include optimized processes for computing the recurrence equations.
- A set of recurrence equations may include a single recurrence equation.
- FIG. 1 illustrates a block diagram of an example of a processor 100 for performing dynamic programming according to an instruction.
- The processor 100 can be a parallel processor, such as a GPU.
- FIGS. 2 and 3 provide examples of parallel processing units having multiprocessors with multiple execution cores that can be configured for dynamic programming based on an instruction.
- The processor 100 can also be a serial processor, such as a CPU. Accordingly, the instruction can be used to configure a Single-Instruction, Multiple-Data (SIMD) extension to approximate a parallel processing unit such as a GPU.
- Alternatively, the main CPU can be configured by the instruction instead of an extension of the CPU.
- The processor 100 uses the dynamic programming via the instruction to generate a result that can be provided as an output.
- The output can be a solution to a complex problem, such as sequencing. Accordingly, the output can include traceback pointers, a record of the high-scoring matrix element, or a combination of both.
- The output can be provided to a computing device for additional processing or reporting. For sequencing, the output can be provided for further use in a genomics pipeline.
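As one possible software analogue of such an output (hypothetical, not mandated by the patent), the sketch below records a traceback pointer for every cell of a linear-gap Smith-Waterman matrix together with the coordinates of the high-scoring element; the pointer labels 'diag', 'up', and 'left' are illustrative names.

```python
def sw_with_traceback(R, Q, match=2, mismatch=-1, gap=-1):
    """Linear-gap Smith-Waterman that also records, for each cell, which
    neighbor produced its score (a traceback pointer) and the coordinates
    of the high-scoring matrix element."""
    m, n = len(R), len(Q)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, None
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if R[i - 1] == Q[j - 1] else mismatch
            choices = [
                (0, None),                       # local alignment restart
                (H[i - 1][j - 1] + s, 'diag'),   # match/mismatch
                (H[i - 1][j] + gap, 'up'),       # deletion
                (H[i][j - 1] + gap, 'left'),     # insertion
            ]
            H[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return best, best_cell, ptr
```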
- The processor 100 includes a memory 110, an execution core 120, and a register 130.
- The processor 100 can include more than one of any of these components and can also include additional components typically included in a processor. Each of the components of the processor 100 can be connected via conventional connections employed in processors.
- The register 130 can be an internal register of the execution core 120.
- The memory 110 is configured to store input code including an instruction that specifies mathematical operations to compute a set of recurrence equations employing a matrix and generate computed states for current elements of the matrix.
- The computed states are states, or values, of matrix elements determined based on the set of recurrence equations and input data.
- The memory 110 may also contain the input data, such as sequences to be aligned.
- The instruction configures the execution core 120 to compute the states for the elements of the matrix in swaths.
- The execution core 120 is configured to receive the instruction and the data. Both the instruction and the data can be received from the memory 110. In some examples, the execution core 120 can receive the instruction from the memory 110 via an instruction cache. Additionally, the execution core 120 can receive the data from a register file that is separate from the memory 110. The execution core 120 is further configured to perform the mathematical operations on the input data to generate the computed states. For example, logic of the execution core 120 is arranged according to the set of recurrence equations to generate the computed states based on the input data. As such, the configuration of the logic of the execution core 120 varies depending on the set of recurrence equations for which dynamic programming is being performed. For the area of genomics, the below discussion regarding FIG. 7 provides an example of the logic configuration for computing a modified Smith-Waterman algorithm.
- The execution core 120 can be configured to generate computed states for current elements of the matrix.
- The current elements of the matrix are a subset of the matrix elements for which the computed states are generated in parallel.
- For example, the execution core 120 can generate computed states for current elements that form an anti-diagonal of the matrix.
- The execution core 120 is configured to send the computed states, such as those for the current matrix elements of the anti-diagonal, to the register 130 for storage.
- The execution core 120 is also configured to retrieve the computed states for the current elements when determining the computed states for the next elements of the matrix, e.g., a subsequent anti-diagonal.
- The next elements of the matrix are the elements computed in the cycle immediately after the current element states are computed.
- The execution core 120 can be one of a plurality of parallel execution cores, wherein at least some of the execution cores are configured to compute a state for a single matrix element each cycle.
- The execution cores can thereby compute the matrix elements in a systolic manner.
- The execution core 120 is further configured to generate a value associated with each element.
- The value may include the computed state, a "traceback pointer" indicating which previous value was used to compute it, or both.
- The value can be provided as an output and can be, for example, stored for further processing or analysis.
- The output value can be provided to a data storage, such as a shared memory or a register file of the processor 100 (neither shown in FIG. 1), for storage.
- The register 130 is configured to store the computed states generated by the execution core 120.
- For example, the register 130 can store computed states for current elements of the matrix, such as an anti-diagonal.
- The register 130 can provide the computed states for the current matrix elements to the execution core 120 for generating the computed states for the next matrix elements.
- More than one register may be used to store the computed states for the current matrix elements. For example, if the set of recurrence equations is a modified Smith-Waterman, then three registers can be used, as discussed below regarding FIG. 7. For a straight Smith-Waterman (no affine gap penalties), a single register can be used.
- As noted above, the register 130 can be an internal register of the execution core 120, such as an input register of the execution core 120.
- The register 130 can also be a state register in the arithmetic logic unit (ALU) of the processor 100. Being part of the execution core 120 reduces or eliminates latency and bandwidth concerns for communicating (i.e., transmitting and receiving) computed states between the execution core 120 and the register 130.
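To illustrate why three state values per matrix element arise under affine gap penalties, here is a hedged Gotoh-style Python sketch with three matrices H, E, and F, one per register in the analogy; the penalty values (gap open -3, gap extend -1) are illustrative assumptions, not the patent's parameters.

```python
def sw_affine(R, Q, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Smith-Waterman with affine gap penalties (Gotoh's formulation):
    three states per matrix element."""
    m, n = len(R), len(Q)
    NEG = float('-inf')
    H = [[0] * (n + 1) for _ in range(m + 1)]    # best local score at (i, j)
    E = [[NEG] * (n + 1) for _ in range(m + 1)]  # best score ending in a gap along j
    F = [[NEG] * (n + 1) for _ in range(m + 1)]  # best score ending in a gap along i
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # extend an open gap, or open a new one from H
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            s = match if R[i - 1] == Q[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```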
- FIG. 2 illustrates a block diagram of an example of a parallel processing unit (PPU) 200 constructed according to the principles of the disclosure.
- The PPU 200 can be part of a desktop computer, a laptop computer, a computer tablet or pad, a smartphone, or another type of computing device.
- The PPU 200 can also be part of a computing system having multiple PPUs that is employed, for example, in a data center.
- The PPU 200 can be coupled to a central processor that communicates with the PPU 200 to enable the PPU 200 to perform operations.
- The central processor can execute a driver kernel that implements an application programming interface (API) to enable an application or applications running on the central processor to schedule operations for execution on the PPU 200.
- An application can direct the driver kernel to generate one or more grids for execution.
- The PPU 200 can have a SIMD architecture where each thread block in a grid is concurrently executed on a different data set by different threads in the thread block.
- The PPU 200 includes an input/output (I/O) unit 210; interface, management, and distribution units (collectively referred to as IMD units 220); multiprocessors 230; an interconnect 240; an L2 cache 250; and a memory interface 260.
- The I/O unit 210 is configured to communicate with the central processor to transmit and receive commands, data, etc., collectively referred to as communications.
- The communications can be sent over a system bus connected to the central processor.
- The I/O unit 210 can be a conventional interface, such as a Peripheral Component Interconnect (PCI) interface; for example, it can be a PCI Express (PCIe) interface for communicating over a PCIe bus.
- The IMD units 220 are configured to route communications for the PPU 200, decode commands received via the I/O unit 210, and transmit the decoded commands to other components or units of the PPU 200 as directed by the commands.
- The IMD units 220 can also use pointers to select commands or command streams stored in a memory by the central processor.
- The IMD units 220 can further manage, select, and dispatch grids for execution by the multiprocessors 230.
- The PPU 200 includes multiprocessors 230 that can store an instruction as discussed above with respect to FIG. 1 and can include the combinational logic for performing the mathematical computations according to the recurrence equations denoted by the instruction.
- The multiprocessors 230 can also include, alongside the logic, registers that store the computed states, such as the computed states of a current anti-diagonal of a matrix. Register files can also be included in the multiprocessors 230 and used to store intermediate states of the last row of a swath for computing the first row of matrix elements in the next swath.
- The multiprocessors 230 are parallel processors that can concurrently execute a plurality of threads from a particular thread block.
- The multiprocessors 230 can be streaming multiprocessors (SMs) that have Compute Unified Device Architecture (CUDA) cores, also referred to as streaming processors (SPs), which execute the threads.
- CUDA is a general-purpose parallel computing architecture that leverages the parallel computing engines of GPUs available from Nvidia Corporation.
- Each of the multiprocessors 230 is connected to a level-two (L2) cache 250 via an interconnect 240.
- The interconnect 240 can be a crossbar or another type of interconnect network used for communicating within a processor.
- The L2 cache 250 is connected to one or more memory interfaces, represented by memory interface 260, and is shared by the multiprocessors 230 via the interconnect 240.
- The memory interface 260 can be configured to communicate with a memory device using a data bus for high-speed data transfer. For example, the memory interface 260 can communicate via 64- or 128-bit data buses.
- Different memory devices can be connected to the PPU 200 via the memory interface 260; the memory devices may be located off-chip of the PPU 200.
- For example, the memory interface 260 can be coupled to a Dynamic Random Access Memory, such as a Synchronous Dynamic Random Access Memory, that is external to the PPU 200. Data from the memory devices can be fetched and stored in the L2 cache 250.
- Each of the multiprocessors 230 can also include a dedicated L1 cache.
- FIG. 3 illustrates a block diagram of an example of a PPU 300 having a plurality of multiprocessors constructed according to the principles of the disclosure.
- One of the multiprocessors, 310A, is shown and discussed as representative of the other multiprocessors 310B, 310C, and 310D.
- The multiprocessor 310A includes an instruction cache 320, a data cache 330, an instruction fetch/dispatch unit 340, a register file 350, execution cores 360a to 360n, an interconnect 370, and a shared memory 380.
- The instruction cache 320 can store the instruction as discussed above with respect to FIG. 1.
- The L1 data cache 330 is configured to store data for processing.
- The data can include input data that is fetched for dynamic programming processing by the execution cores 360a to 360n.
- The instruction and data can be received via an I/O interface, such as the I/O unit 210 of FIG. 2.
- The data can also be retrieved from a memory via a memory interface, such as the memory interface 260.
- The instruction fetch/dispatch unit 340 is configured to fetch data from the L1 data cache 330 and provide the data to the execution cores 360a, 360b, 360n.
- The instruction fetch/dispatch unit 340 is also configured to fetch instructions from the instruction cache 320 and dispatch the instructions to the execution cores 360a, 360b, 360n for execution.
- The execution cores 360a, 360b, 360n are configured to receive instructions dispatched to them from the instruction cache 320, fetch data from the L1 data cache 330, execute the instructions employing the data, and write results back to memory.
- The execution cores 360a, 360b, 360n can write computed states of current matrix elements to the registers 362a, 362b, 362n to use for processing in the next cycle.
- The results or output from the completed processing can be written to, for example, the shared memory/L1 cache 380 or the L2 cache 250 of FIG. 2.
- The register file 350 includes registers that can be individually assigned to each of the execution cores 360a to 360n.
- For example, the register file 350 can include registers that store intermediate states for computed elements of a matrix that are the last row of a swath. More than one register can be assigned to any one of the execution cores 360a to 360n.
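The swath hand-off described here can be modeled in software. The following hedged Python sketch (illustrative names and parameters, not the patent's implementation) processes a linear-gap Smith-Waterman matrix in horizontal swaths of rows, carrying only the last row of each swath forward into the next one, which is the role the text assigns to the register file.

```python
def sw_in_swaths(R, Q, swath=4, match=2, mismatch=-1, gap=-1):
    """Linear-gap Smith-Waterman score computed in horizontal swaths of
    rows; only the last row of each swath is carried over to seed the
    first row of the next swath."""
    m, n = len(R), len(Q)
    boundary = [0] * (n + 1)   # last row of the previous swath
    best = 0
    for top in range(1, m + 1, swath):
        rows = list(range(top, min(top + swath - 1, m) + 1))
        # band[0] is the carried-over boundary row; the rest start at 0
        band = [boundary] + [[0] * (n + 1) for _ in rows]
        for bi, i in enumerate(rows, start=1):
            for j in range(1, n + 1):
                s = match if R[i - 1] == Q[j - 1] else mismatch
                band[bi][j] = max(0,
                                  band[bi - 1][j - 1] + s,   # match/mismatch
                                  band[bi - 1][j] + gap,     # deletion
                                  band[bi][j - 1] + gap)     # insertion
                best = max(best, band[bi][j])
        boundary = band[-1]    # hand the last row to the next swath
    return best
```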
- At least some of the execution cores 360a to 360n can include the combinational logic for performing the mathematical computations according to the recurrence equations denoted in the instruction.
- For example, each of the execution cores 360a to 360n can include the logic for performing the mathematical computations in parallel.
- Each of the execution cores 360a to 360n includes one or more registers, which are denoted in FIG. 3 as registers 362a, 362b, 362n.
- The registers 362a, 362b, 362n are configured to store the computed states of current matrix elements that are computed according to the instruction.
- For example, the registers 362a, 362b, 362n can store the computed states of a current anti-diagonal of the matrix.
- Unlike the registers of the register file 350, the registers 362a, 362b, 362n are located internal to their corresponding execution cores 360a, 360b, 360n.
- The internal registers 362a, 362b, 362n are proximate to the execution logic, reducing travel time and bandwidth concerns when storing computed states of the matrix elements and using them for computing according to the instruction.
- For example, the registers 362a, 362b, 362n can be input registers of the respective execution cores 360a, 360b, 360n.
- The interconnect 370 is configured to connect each of the execution cores 360a, 360b, 360n to the register file 350 and the shared memory/L1 cache 380.
- The interconnect 370 can be a crossbar configured to connect any of the execution cores 360a, 360b, 360n to any of the registers in the register file 350 or memory locations in the shared memory/L1 cache 380.
- The shared memory/L1 cache 380 is connected to, and configured to store data for, each of the execution cores 360a, 360b, 360n.
- The shared memory/L1 cache 380 can be a memory that is dedicated to the multiprocessor 310A.
- The shared memory/L1 cache 380 can be coupled to a shared L2 cache of a PPU, such as the L2 cache 250 of FIG. 2, wherein data from the L2 cache can be fetched and stored in the shared memory/L1 cache 380 for processing in the execution cores 360a, 360b, 360n.
- The shared memory 380 may store a value associated with an element of the matrix as each element is computed.
- The execution cores 360a, 360b, 360n can generate the value and provide the value as an output to be stored in the shared memory 380.
- Alternatively, the output can be provided to the register file 350 for storing.
- The value may include the computed state for the element, a "traceback pointer" indicating which previous element value was used for the computed state, or both.
- FIG. 4 illustrates a diagram of an example of a matrix 400 employed for computing a set of recurrence equations according to dynamic programming and the principles of the disclosure.
- the matrix 400 includes “m” rows and “n” columns of elements. A state of each element of the matrix 400 is computed according to a set of recurrence equations and input data.
- arrays R and Q are used as input data for FIG. 4 and U (i, j) is used to represent the set of recurrence equations for a Smith Waterman algorithm.
- R is a reference sequence that is compared to the query sequence Q and a minimum cost edit that matches R and Q is computed using U (i, j) and dynamic programming.
- the Smith Waterman algorithm can be a modified Smith Waterman that is discussed in more detail below regarding FIG. 7 .
- An anti-diagonal of the matrix 400 is an example of the computed states of current elements of the matrix 400 that are generated in parallel by execution cores, such as execution cores 360 a to 360 n , and stored in registers, such as register 130 and registers 362 a to 362 n .
- the computed states can be generated in parallel according to the operating capacity of the processing unit.
- the number of registers assigned for storing the computed states can correspond to the number of computed states that can be generated in parallel during a cycle.
- the computed states can be generated in swaths. For a swath of 16, the entire anti-diagonal of FIG. 4 can be computed in a single cycle with 16 execution cores.
- FIG. 5 provides an example of a matrix that is computed in swaths of 8.
- a computed state for each of the elements of the matrix 400 can depend only on the three elements of the matrix that are directly above, directly to the left, and directly above the element directly to the left, as well as R(j) and Q(i).
- U(i, j−1) corresponds to element 430
- U(i−1, j−1) corresponds to element 440
- U(i−1, j) corresponds to element 450 .
- a value associated with that element may be output. This value may include the element value U(i,j), a “traceback pointer” indicating which previous value of U was used to compute U(i,j), or both.
- the output may be stored in register file 350 , or in shared memory 380 .
- the elements of the last row of the swath are stored, such as in a register of the register file 350 or in memory 380 , and used for computing states of the first row of the next swath.
- the computed states of elements 440 and 450 are stored and used to generate the computed state of element 420 .
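Because each state depends only on its left, upper-left, and upper neighbors, every element on an anti-diagonal is independent and can be computed in parallel. A minimal Python sketch of this wavefront order (the recurrence is passed in as a callback; the function and parameter names are illustrative, not from the disclosure):

```python
def wavefront(m, n, u):
    """Evaluate an m x n dynamic-programming matrix anti-diagonal by
    anti-diagonal. `u` is a callback u(i, j, left, diag, up) standing in
    for the recurrence U(i, j); boundary states are taken as 0. All cells
    on one anti-diagonal are independent of each other, so a parallel
    processor could compute each anti-diagonal in a single step."""
    U = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(2, m + n + 1):            # anti-diagonal index: i + j
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            # pass U(i, j-1), U(i-1, j-1), U(i-1, j): the three neighbors
            U[i][j] = u(i, j, U[i][j - 1], U[i - 1][j - 1], U[i - 1][j])
    return U
```

With a trivial recurrence such as `max(left, diag, up) + 1`, every cell is still visited exactly once, in an order a parallel machine could exploit.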
- FIG. 5 illustrates a diagram of an example of a matrix 500 divided into swaths of 8 for computing the set of recurrence equations.
- the matrix 500 includes “m” rows and “n” columns of elements.
- a state of each element of the matrix 500 is computed according to a set of recurrence equations and input data.
- Arrays L and P are used as an example of input data for FIG. 5 and U (i, j) is used to represent the set of recurrence equations in, for example, the area of economics, engineering, genomics, communications, etc.
- a first anti-diagonal and a portion of a second anti-diagonal of the matrix 500 are shown.
- the first anti-diagonal is illustrated having some elements that are shaded to indicate the elements of the first anti-diagonal that are computed in a first swath.
- the remaining three elements of the first anti-diagonal are computed in the next cycle of a swath of 8.
- the remaining three elements of the first anti-diagonal can be computed in the same cycle as the five elements of the second anti-diagonal such that each available execution core, e.g., 8 in this instance, is being used each cycle.
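One way to picture the swath schedule above is as a static mapping from (cycle, lane) to matrix coordinates. The sketch below assumes lane l computes element (swath base + l, t − l) on cycle t, which is one plausible arrangement rather than the only one the disclosure permits:

```python
def swath_schedule(m, n, lanes=8):
    """One plausible static schedule for computing an m x n matrix in
    swaths of `lanes` rows: within a swath, lane l computes the element
    at (base + l, t - l) on cycle t, so every lane is busy each cycle
    once the wavefront fills. Returns, per swath, a list of
    (cycle, lane, row, col) work items."""
    schedule = []
    for base in range(0, m, lanes):
        rows = min(lanes, m - base)
        swath = []
        # cycle t sweeps the anti-diagonal row_offset + col = t of the swath
        for t in range(rows + n - 1):
            for l in range(rows):
                col = t - l
                if 0 <= col < n:
                    swath.append((t, l, base + l, col))
        schedule.append(swath)
    return schedule
```

For a 16×16 matrix with 8 lanes this yields two swaths of 128 work items each, covering every cell exactly once, with all 8 lanes busy on the steady-state cycles.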
- FIG. 6 illustrates a flow diagram of an example of a method 600 for configuring a processor, via an instruction, to perform dynamic programming.
- the processor can be a parallel processor or a serial processor.
- the instruction can be, for example, abstract assembly statements, IR statements, or assembly language instructions associated with a specific ISA for a specific processor or processing unit.
- the method begins in a step 605 .
- execution cores of the processor receive an instruction that directs the execution cores to compute a set of recurrence equations employing a matrix.
- the instruction can be fetched from an instruction cache via a fetch unit and provided to the execution cores.
- the execution cores can be processing cores of a CPU or a GPU.
- the recurrence equations can be from different fields including communications, genomics, engineering, and economics.
- the instruction can direct the execution cores to compute in swaths.
- the execution cores are configured, according to the set of recurrence equations, to compute states for elements of the matrix in a step 620 .
- the execution cores can be configured by arranging or organizing the logic of the execution cores according to the recurrence equations to generate the computed states employing the input data.
- the execution cores can be configured to compute the states for the elements of the matrix in swaths.
- One execution core can be configured to compute a state for a single matrix element.
- Multiple execution cores can be configured to compute states of multiple matrix elements in parallel, wherein each of the multiple execution cores computes a state for a single matrix element.
- registers of the execution cores are configured for storing the computed states for current elements of the matrix.
- One or more internal registers of each execution core can store the computed state from a current element of the matrix that is computed by that execution core.
- the computed states are determined based on the set of recurrence equations and input data.
- the input data can be, for example, sequences for alignment according to reference-based or de novo assembly.
- the computations can be done in swaths.
- Registers or other data storage locations external to the execution cores can be used to store intermediate states for computed elements of a matrix that are the last row of a swath.
- the method 600 ends in a step 640 .
- the processor is ready to process input data employing dynamic programming and the recurrence equations.
- an instruction can be directed to various recurrence equations characterizing the dynamic programming and used to arrange core logic for computations.
- Examples in the area of genomics include the Needleman-Wunsch algorithm and the Smith Waterman algorithm.
- S i,j =max{0,S i-1,j-1 +m[i,j],S i-1,j −D,S i,j-1 −D}
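As a concrete rendering of the Smith Waterman recurrence above, the following Python sketch computes the best local-alignment score; the match, mismatch, and gap-penalty values are illustrative assumptions, not values prescribed by the disclosure:

```python
def smith_waterman_score(R, Q, match=2, mismatch=-1, D=2):
    """Linear-gap Smith Waterman:
    S[i][j] = max{0, S[i-1][j-1] + m[i,j], S[i-1][j] - D, S[i][j-1] - D}.
    Returns the best local-alignment score between R and Q."""
    m, n = len(Q), len(R)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if Q[i - 1] == R[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j - 1] + sub,
                          S[i - 1][j] - D, S[i][j - 1] - D)
            best = max(best, S[i][j])
    return best
```

The clamp to 0 is what makes the alignment local: a region of mismatches resets the score instead of dragging it negative.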
- modified Smith Waterman is another example of an algorithm used for sequence alignment and will be discussed in more detail as an example of using an instruction for dynamic programming.
- a reference sequence R is compared to a query sequence Q and a minimum cost edit that matches R and Q is computed.
- the alignment assigns letters in R and Q to a single letter or a gap in the opposite sequence.
- the letters A, C, G, T are used, corresponding to the four nucleotide bases adenine, cytosine, guanine, and thymine.
- the alignment may be done to compute just the edit distance. For actual alignment the edit sequence itself is output.
- H(i,j) is the edit distance at position (i,j).
- I computes the running cost of an insertion
- D computes the running cost of a deletion
- W(r,q) is the cost of substituting character r for q
- o is the cost of opening an insert or delete
- e is the cost of extending the insert or delete
- source returns a two-bit code specifying which of the four arguments of max determined H(i,j)
- TB is an array of traceback pointers that identifies the minimum cost path through the matrix, such as the matrix 400 .
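Tying the definitions above together, the four recurrences can be sketched in Python; the scoring function W and the open/extend costs o and e are hypothetical values chosen for illustration, since the disclosure does not fix particular values:

```python
def modified_smith_waterman(R, Q, W=lambda r, q: 2 if r == q else -2,
                            o=-3, e=-1):
    """Affine-gap recurrences of the modified Smith Waterman algorithm:
    I and D track running insertion/deletion costs, H is the score at
    (i, j), and TB holds a 2-bit code recording which argument of max
    produced H(i,j) (0: zero, 1: I, 2: D, 3: diagonal; the encoding is
    a hypothetical convention)."""
    m, n = len(Q), len(R)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    I = [[0] * (n + 1) for _ in range(m + 1)]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    TB = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            I[i][j] = max(0, H[i][j - 1] + o, I[i][j - 1] + e)
            D[i][j] = max(0, H[i - 1][j] + o, D[i - 1][j] + e)
            diag = H[i - 1][j - 1] + W(R[j - 1], Q[i - 1])
            H[i][j] = max(0, I[i][j], D[i][j], diag)
            # source(H(i,j)): code of the first max argument that matches
            TB[i][j] = [0, I[i][j], D[i][j], diag].index(H[i][j])
    return H, I, D, TB
```

Each inner-loop iteration reads only the left, upper, and upper-left neighbors, which is exactly the dependency pattern the anti-diagonal parallelization exploits.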
- FIG. 7 provides an example of a method directed to computing a modified Smith Waterman algorithm using an instruction.
- FIG. 7 illustrates a flow diagram of an example of a method 700 to compute a modified Smith Waterman algorithm employing an instruction according to the principles of the disclosure.
- the method 700 creates genomic cores that are specialized execution units designed specifically for performing the matrix operations that are the core compute function used in genomics.
- the genomic cores accelerate the matrix computations of the inner loop of genomics, such as for sequence alignment.
- the method 700 begins in a step 705 .
- an instruction is received that directs the processor to compute a set of recurrence equations for a modified Smith Waterman algorithm employing a matrix.
- the set of recurrence equations for the modified Smith Waterman are I(i,j), D(i,j), H(i,j), and TB(i,j).
- the processor can be a parallel processing unit, such as a GPU, or a serial processor, such as a CPU. Multiple execution cores of the processor can be directed to compute the modified Smith Waterman algorithm.
- the processor is configured to compute states for elements of the matrix in a step 720 .
- Multiple execution cores of the processor can be configured to compute the element states in parallel.
- the hardware of each of the execution cores can be configured to compute a single matrix element per cycle.
- lane l computes elements of H, I, and D with coordinates (Ns+l, t−l) to (Ns+l+7, t−l−7).
- each lane, i.e., each SP, is configured by an instruction to compute the elements of H, I, and D.
- a step 730 input data is received for processing by the configured execution cores.
- the input data can be fetched from a data cache via an instruction fetch unit.
- the input data is gene sequences, such as R and Q for reference-based assembly. In other examples, the input data can be sequences for de novo assembly that do not include a reference sequence.
- the recurrence equations are computed by the configured execution cores employing the input data.
- the recurrence equations can be computed in parallel in a systolic manner by computing a diagonal (e.g., an anti-diagonal) of I, D, H, and TB simultaneously.
- positions (i, t−i) can be computed in a systolic manner.
- When the edge of the recurrence array is larger than the number of execution cores, the array can be computed in swaths, such as discussed above with respect to FIG. 5 .
- each element of R and Q can be 3 bits, encoding, for example, A, G, T, C, and N (unknown).
- elements of R and Q can be 5 bits to encode the amino acids.
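A packing of bases into 3-bit fields can be sketched as follows; the particular code assignments are hypothetical, since the disclosure only notes that 3 bits suffice for A, G, T, C, and N:

```python
# Hypothetical 3-bit codes for nucleotide bases (the disclosure only
# states that 3 bits suffice for A, G, T, C, and N).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
BASE_FROM_CODE = {v: k for k, v in BASE_CODE.items()}

def pack_bases(seq):
    """Pack a base string into an integer, 3 bits per base."""
    word = 0
    for i, b in enumerate(seq):
        word |= BASE_CODE[b] << (3 * i)
    return word

def unpack_bases(word, length):
    """Recover `length` bases from a packed integer."""
    return "".join(BASE_FROM_CODE[(word >> (3 * i)) & 0b111]
                   for i in range(length))
```

Packing sequences this densely is what keeps the per-cycle register reads of R and Q narrow enough to stay inside the execution cores.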
- GACT (Genome Alignment using Constant-memory Trace-back)
- GACT-X
- computed states for current elements of the matrix are stored in one or more registers of the execution cores.
- the computed states for the current elements of the matrix can be the state of H, I, and D and the state of the substitution matrix W which can be replicated for bandwidth.
- the register, or registers, can be state registers associated with the execution cores, such as internal registers of the execution cores denoted as registers 362 a , 362 b , 362 n in FIG. 3 . Accordingly, the computed states for current matrix elements can remain internal to the genomics core. For example, only the last diagonal of H, I, and D (eight elements per lane) needs to be retained. Each cycle, each lane reads R[Ns+l] to R[Ns+l+7] and Q[t−l] to Q[t−l−7] from the registers.
- An output value can be generated when computing the computed states and can be output with the traceback pointers.
- the traceback pointers TB[Ns+l, t−l] to TB[Ns+l+7, t−l−7] can be written back to a register file, such as register file 350 of FIG. 3 , and can ultimately be copied to shared memory, such as shared memory 380 .
- Step 740 is completed when each element of the matrix is computed.
- Results of computing the modified Smith Waterman algorithm are provided in a step 750 .
- the edit distance is the minimum of H(i,j) across the bottom row (i max ) and right column (j max ). If only the edit distance is needed, TB is not needed. TB is used to reconstruct the actual alignment by walking the path back from the minimum scoring boundary location.
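The traceback walk can be sketched as follows; the particular 2-bit code assignments are hypothetical, since the disclosure only specifies that source() yields a two-bit code identifying the winning max argument:

```python
def traceback(TB, i, j):
    """Walk traceback codes from (i, j) back toward the origin.
    Hypothetical 2-bit encoding: 0 stop, 1 up (deletion),
    2 left (insertion), 3 diagonal (match/substitution).
    Returns the edit operations in alignment order."""
    ops = []
    while i > 0 and j > 0 and TB[i][j] != 0:
        code = TB[i][j]
        if code == 3:
            ops.append("sub")
            i, j = i - 1, j - 1
        elif code == 1:
            ops.append("del")
            i -= 1
        else:  # code == 2
            ops.append("ins")
            j -= 1
    ops.reverse()
    return ops
```

Because each step only chases one pointer, the walk touches a single path through the matrix rather than recomputing any states.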
- the method 700 ends in a step 760 .
- the logic needed for performing the above processing can be part of the architecture of a GPU, such as described herein.
- I(i,j) is the max of 0, H(i,j−1)+o, and I(i,j−1)+e. Determining the max of these three components requires taking the computed H value for the matrix element of the same row and adjacent column to the left and adding the constant o. This requires an adder. An additional adder is needed to add e to the computed I from the same row, adjacent column to the left. Another adder, used as a comparator, and a multiplexer are needed to compare these two values and select the greater one. Another adder, used as a comparator, is then needed to compare the greater value to 0. The result of this comparison is then multiplexed out.
- an execution core can be configured to compute I (i,j) using four adders (two used as comparators) and two multiplexors.
- For H(i,j), an adder is needed to add H(i−1, j−1) and W(R[i],Q[j]), wherein W can be a table lookup for determining the cost of substituting Q for R. The lookup cost is added to the H value from the matrix element above and to the left. Three comparators are then needed to determine the greatest of the four components, and another multiplexor selects the output.
- the TB(i,j) can be stored to indicate how H(i, j) was computed and then used to trace back the path.
- TB(i,j) can be sent to a register file at each cycle. Accordingly, in one example, five adders, seven comparators, and a table lookup can be used for computing the recurrence equations of the modified Smith Waterman algorithm.
- the table for the table lookup can be stored in registers of a compute unit, such as the registers of the execution cores disclosed herein. Alternatively, a fixed table can be encoded in logic and used for the lookup.
- the table can have, for example, 16 entries of 16 bits each for a total of 256 bits.
- the logic can be configured differently.
- the instruction can be used to configure a SIMD extension to approximate a GPU.
- Using a vector extension as an example, a vector length of 128 is divided into 8 lanes of 16 bits, with the same computation per the set of recurrence equations being performed on each lane.
- 16 lanes of 16 bits can be created for simultaneous computing.
- Instead of an extension, the main CPU can be configured, wherein a 64-bit wide data path can be configured into 4 lanes for processing 16 bits in each of the 4 lanes.
- a complete pipeline for performing assembly from reads involves seeding, filtering, alignment, consensus, and variant calling.
- the Smith Waterman instruction processor can accelerate the filtering and alignment stages.
- a GPU with 80 SMs would have a speedup of 378,880×. Area can be traded off against performance by varying M.
- alignment is done to compute cost.
- the modified Smith Waterman instruction can simplify finding the minimum edit cost, and its position, by keeping the state. A subsequent instruction can then be used to query this hidden state.
- alignment is done and traceback is then performed from the minimum cost position to give the minimum cost alignment.
- For efficiently employing a GPU for traceback, pointer chasing from shared memory (with encoded pointers) may be employed to ensure traceback does not become a bottleneck for the pipeline.
- instructions can be directed to recurrence equations in other fields for characterizing dynamic programming and arranging core logic for computations.
- the traveling salesman problem algorithm can be used for determining the most efficient route for data travel between nodes. Below is an example using nodes instead of cities:
- f ⁇ ( i ; j 1 , j 2 , ⁇ ⁇ , j k ) min 1 ⁇ m ⁇ k ⁇ ⁇ d ij m + f ⁇ ( i ; j 1 , j 2 , ⁇ ⁇ , j m - 1 , j m + 1 , ⁇ ⁇ , j k ) ⁇ wherein f(i; j 1 , j 2 , . . .
- d ij is the distance between the ith and jth nodes.
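The recurrence can be rendered directly as a memoized Python function (a minimal sketch of the Bellman formulation; for n nodes it visits on the order of n·2^n subproblems rather than enumerating all (n−1)! routes):

```python
from functools import lru_cache

def shortest_tour(d):
    """Held-Karp/Bellman dynamic program for the recurrence above:
    f(i; S) = min over j in S of d[i][j] + f(j; S - {j}),
    with f(i; {}) = d[i][0]. Returns the length of a minimum tour
    starting and ending at node 0, given distance matrix d."""
    n = len(d)

    @lru_cache(maxsize=None)
    def f(i, unvisited):
        if not unvisited:
            return d[i][0]  # all nodes visited: return to the start
        return min(d[i][j] + f(j, tuple(k for k in unvisited if k != j))
                   for j in unvisited)

    return f(0, tuple(range(1, n)))
```

The memoization on (node, unvisited-set) is what collapses the factorial enumeration into the exponential-but-smaller dynamic program.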
- recurrence equations include the Viterbi algorithm that can be used as a decoding algorithm for communication systems, and the Longest Common Subsequence algorithm that can be used to differentiate between two examples of text.
- the instructions can be used to configure hardware logic of execution cores for processors, such as CPUs and GPUs.
- the GPUs can be embodied on a single semiconductor substrate, included in a system with one or more other devices such as additional GPUs, a memory, and a CPU.
- the GPUs may be included on a graphics card that includes one or more memory devices and is configured to interface with a motherboard of a computer.
- the GPUs may be integrated GPUs (iGPUs) that are co-located with a CPU on a single chip.
- the processors or computers can be part of GPU racks located in a data center.
- the GPU racks can be high-density (HD) GPU racks that include high performance GPU compute nodes and storage nodes.
- the high performance GPU compute nodes can be servers designed for general-purpose computing on graphics processing units (GPGPU) to accelerate deep learning applications.
- the GPU compute nodes can be servers of the DGX product line from Nvidia Corporation.
- Portions of disclosed embodiments may relate to computer storage products, such as a memory, with a non-transitory computer-readable medium that have program code thereon for performing various computer-implemented operations that embody a part of an apparatus, device or carry out the steps of a method set forth herein.
- Non-transitory used herein refers to all computer-readable media except for transitory, propagating signals. Examples of non-transitory computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and execute program code, such as ROM and RAM devices.
- Examples of program code include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
S i,j =max{S i-1,j-1 +m[i,j],S i-1,j −D,S i,j-1 −D}
wherein S i,j is the score of the best alignment for the prefix of length i of the first input and the prefix of length j of the second input, m[i, j] is the matching score for aligning Q[i] and R[j], and D is a penalty for an insertion or deletion during alignment.
S i,j =max{0,S i-1,j-1 +m[i,j],S i-1,j −D,S i,j-1 −D}
I(i,j)=max(0,H(i,j−1)+o,I(i,j−1)+e);
D(i,j)=max(0,H(i−1,j)+o,D(i−1,j)+e);
H(i,j)=max(0,I(i,j),D(i,j),H(i−1,j−1)+W(R[i],Q[j]));
TB(i,j)=source(H(i,j)).
Claims (25)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/811,068 US11726757B2 (en) | 2019-08-14 | 2020-03-06 | Processor for performing dynamic programming according to an instruction, and a method for configuring a processor for dynamic programming via an instruction |
CN202010640261.5A CN112395548A (en) | 2019-08-14 | 2020-07-06 | Processor for dynamic programming by instructions and method of configuring the processor |
DE102020118685.1A DE102020118685A1 (en) | 2019-08-14 | 2020-07-15 | PROCESSOR FOR PERFORMING DYNAMIC PROGRAMMING IN ACCORDANCE WITH AN INSTRUCTION AND PROCEDURE FOR CONFIGURING A PROCESSOR FOR DYNAMIC PROGRAMMING VIA AN INSTRUCTION |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962886893P | 2019-08-14 | 2019-08-14 | |
US16/811,068 US11726757B2 (en) | 2019-08-14 | 2020-03-06 | Processor for performing dynamic programming according to an instruction, and a method for configuring a processor for dynamic programming via an instruction |
Publications (2)
Publication Number | Publication Date |
---|---|
US20210048992A1 US20210048992A1 (en) | 2021-02-18 |
US11726757B2 true US11726757B2 (en) | 2023-08-15 |
Family
ID=74567220
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/811,068 Active 2040-10-15 US11726757B2 (en) | 2019-08-14 | 2020-03-06 | Processor for performing dynamic programming according to an instruction, and a method for configuring a processor for dynamic programming via an instruction |
Country Status (2)
Country | Link |
---|---|
US (1) | US11726757B2 (en) |
CN (1) | CN112395548A (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB202008299D0 (en) * | 2020-06-02 | 2020-07-15 | Imagination Tech Ltd | Manipulation of data in a memory |
CN114334008B (en) * | 2022-01-24 | 2022-08-02 | 广州明领基因科技有限公司 | FPGA-based gene sequencing accelerated comparison method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120239706A1 (en) * | 2011-03-18 | 2012-09-20 | Los Alamos National Security, Llc | Computer-facilitated parallel information alignment and analysis |
Patent Citations (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7917302B2 (en) * | 2000-09-28 | 2011-03-29 | Torbjorn Rognes | Determination of optimal local sequence alignment similarity score |
US20040024536A1 (en) * | 2000-09-28 | 2004-02-05 | Torbjorn Rognes | Determination of optimal local sequence alignment similarity score |
US20070076936A1 (en) * | 2005-09-30 | 2007-04-05 | Eric Li | Fast alignment of large-scale sequences using linear space techniques |
US20080086274A1 (en) * | 2006-08-10 | 2008-04-10 | Chamberlain Roger D | Method and Apparatus for Protein Sequence Alignment Using FPGA Devices |
WO2008090336A2 (en) * | 2007-01-24 | 2008-07-31 | Inventanet Ltd | Method and system for searching for patterns in data |
US20100138376A1 (en) * | 2007-01-24 | 2010-06-03 | Nicholas John Avis | Method and system for searching for patterns in data |
US20090300327A1 (en) * | 2008-05-27 | 2009-12-03 | Stillwater Supercomputing, Inc. | Execution engine |
US8688956B2 (en) * | 2008-05-27 | 2014-04-01 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US20140173192A1 (en) * | 2008-05-27 | 2014-06-19 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US20190227983A1 (en) * | 2008-05-27 | 2019-07-25 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US10289606B2 (en) * | 2008-05-27 | 2019-05-14 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US20170351642A1 (en) * | 2008-05-27 | 2017-12-07 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US9767071B2 (en) * | 2008-05-27 | 2017-09-19 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US9501448B2 (en) * | 2008-05-27 | 2016-11-22 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US20150356055A1 (en) * | 2008-05-27 | 2015-12-10 | Stillwater Supercomputing, Inc. | Execution engine for executing single assignment programs with affine dependencies |
US20150347702A1 (en) * | 2012-12-28 | 2015-12-03 | Ventana Medical Systems, Inc. | Image Analysis for Breast Cancer Prognosis |
US20150057946A1 (en) * | 2013-08-21 | 2015-02-26 | Seven Bridges Genomics Inc. | Methods and systems for aligning sequences |
US20160306921A1 (en) * | 2013-08-21 | 2016-10-20 | Seven Bridges Genomics Inc. | Methods and systems for detecting sequence variants |
US20150347678A1 (en) * | 2013-08-21 | 2015-12-03 | Seven Bridges Genomics Inc. | Methods and systems for detecting sequence variants |
US20200168295A1 (en) * | 2013-08-21 | 2020-05-28 | Seven Bridges Genomics Inc. | Methods and systems for detecting sequence variants |
US20180336314A1 (en) * | 2013-08-21 | 2018-11-22 | Seven Bridges Genomics Inc. | Methods and systems for detecting sequence variants |
US20180357367A1 (en) * | 2013-08-21 | 2018-12-13 | Seven Bridges Genomics Inc. | Methods and systems for aligning sequences |
US9195436B2 (en) * | 2013-10-14 | 2015-11-24 | Microsoft Technology Licensing, Llc | Parallel dynamic programming through rank convergence |
EP3058463B1 (en) | 2013-10-14 | 2018-08-22 | Microsoft Technology Licensing, LLC | Parallel dynamic programming through rank convergence |
US20150106783A1 (en) * | 2013-10-14 | 2015-04-16 | Microsoft Corporation | Parallel dynamic programming through rank convergence |
US20150197815A1 (en) * | 2013-10-18 | 2015-07-16 | Seven Bridges Genomics Inc. | Methods and systems for identifying disease-induced mutations |
US20190169695A1 (en) * | 2013-10-18 | 2019-06-06 | Seven Bridges Genomics Inc. | Methods and systems for detecting sequence variants |
US20150199473A1 (en) * | 2013-10-18 | 2015-07-16 | Seven Bridges Genomics Inc. | Methods and systems for quantifying sequence alignment |
US20200019196A1 (en) * | 2018-07-10 | 2020-01-16 | Toyota Jidosha Kabushiki Kaisha | Control apparatus for linear solenoid |
US11237575B2 (en) * | 2018-07-10 | 2022-02-01 | Toyota Jidosha Kabushiki Kaisha | Control apparatus for linear solenoid |
Non-Patent Citations (7)
Title |
---|
Alpern, Bowen, Larry Carter, and Kang Su Gatlin. "Microparallelism and high-performance protein matching." Supercomputing'95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing. IEEE, 1995. (Year: 1995). * |
Bellman; "Dynamic Programming Treatment of the Travelling Salesman Problem"; RAND Corporation; Mar. 1961; 3 pgs. |
Dimitrov, Martin, Mike Mantor, and Huiyang Zhou. "Understanding software approaches for GPGPU reliability." Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units. 2009 (Year: 2009). * |
Qasim, Syed Manzoor, Shuja Ahmad Abbasi, and Bandar Almashary. "A proposed FPGA-based parallel architecture for matrix multiplication." APCCAS 2008-2008 IEEE Asia Pacific Conference on Circuits and Systems. IEEE, 2008. (Year: 2008). * |
Sheneman, Luke. (2002). A Survey of Specialized Processor Architectures Applied to Biological Sequence Alignment. Available at <https://www.researchgate.net/publication/228379726_A_Survey_of_Specialized_Processor_Architectures_Applied_to_Biological_Sequence_Alignment> (Year: 2002). *
Wakatani, Akiyoshi. "Evaluation of P-Scheme/G Algorithm for Solving Recurrence Equations." International Journal of Modeling and Optimization 3.4 (2013): 311. (Year: 2013). * |
Yoshida, Norihiko. "A transformational approach to the derivation of hardware algorithms from recurrence equations." Conference on High Performance Networking and Computing: Proceedings of the 1988 ACM/IEEE conference on Supercomputing. vol. 12. No. 17. 1988. (Year: 1988). * |
Also Published As
Publication number | Publication date |
---|---|
CN112395548A (en) | 2021-02-23 |
US20210048992A1 (en) | 2021-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Fang et al. | swdnn: A library for accelerating deep learning applications on sunway taihulight | |
Huang et al. | Hardware acceleration of the pair-HMM algorithm for DNA variant calling | |
Stamatakis et al. | Exploring new search algorithms and hardware for phylogenetics: RAxML meets the IBM cell | |
US9489343B2 (en) | System and method for sparse matrix vector multiplication processing | |
Liu et al. | CUDA-BLASTP: accelerating BLASTP on CUDA-enabled graphics hardware | |
Di Blas et al. | The UCSC Kestrel parallel processor | |
Nagar et al. | A sparse matrix personality for the convey hc-1 | |
CN102640131A (en) | Unanimous branch instructions in a parallel thread processor | |
Yang et al. | An efficient parallel algorithm for longest common subsequence problem on gpus | |
Cali et al. | SeGraM: A universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping | |
Rucci et al. | OSWALD: OpenCL Smith–Waterman on Altera's FPGA for Large Protein Databases | |
US11726757B2 (en) | Processor for performing dynamic programming according to an instruction, and a method for configuring a processor for dynamic programming via an instruction | |
Ham et al. | Genesis: A hardware acceleration framework for genomic data analysis | |
Sampietro et al. | Fpga-based pairhmm forward algorithm for dna variant calling | |
Munekawa et al. | Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU | |
Malakonakis et al. | Exploring modern FPGA platforms for faster phylogeny reconstruction with RAxML | |
Castells-Rufas | GPU acceleration of Levenshtein distance computation between long strings | |
Huang et al. | Improving the mapping of Smith-Waterman sequence database searches onto CUDA-enabled GPUs | |
Ren et al. | Exploration of alternative GPU implementations of the pair-HMMs forward algorithm | |
Xu et al. | SLPal: Accelerating long sequence alignment on many-core and multi-core architectures | |
US8413151B1 (en) | Selective thread spawning within a multi-threaded processing system | |
Langarita et al. | Compressed sparse FM-index: Fast sequence alignment using large k-steps | |
KR20210084220A (en) | System and method for reconfigurable systolic array with partial read/write | |
Kässens et al. | Combining GPU and FPGA technology for efficient exhaustive interaction analysis in GWAS | |
Sebastião et al. | Hardware accelerator architecture for simultaneous short-read DNA sequences alignment with enhanced traceback phase |
Legal Events
Date | Code | Title | Description
---|---|---|---
2020-03-05 | AS | Assignment | Owner name: NVIDIA CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DALLY, WILLIAM JAMES;REEL/FRAME:052036/0167 |
| FEPP | Fee payment procedure | ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STPP | Information on status: patent application and granting procedure in general | DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | FINAL REJECTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | ADVISORY ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| STCF | Information on status: patent grant | PATENTED CASE |