CN107748674B - Information processing system oriented to bit granularity - Google Patents

Publication number
CN107748674B
Authority
CN
China
Prior art keywords
unit
module
data
vector
scalar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710804779.6A
Other languages
Chinese (zh)
Other versions
CN107748674A (en
Inventor
管武
梁利平
吴凯
任雁鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Microelectronics of CAS
Original Assignee
Institute of Microelectronics of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Microelectronics of CAS
Priority to CN201710804779.6A
Priority to PCT/CN2017/102482 (WO2019047281A1)
Publication of CN107748674A
Application granted
Publication of CN107748674B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G06F9/30029 Logical and Boolean instructions, e.g. XOR, NOT
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802 Instruction prefetching
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an information processing system oriented to bit granularity. The system includes: an instruction memory (IMEM) for storing a plurality of instructions; a control unit (Control Unit) for controlling parallel execution of the plurality of instructions, where the instruction execution modes include scalar execution and vector execution; a scalar execution unit (Scalar Unit) for performing parallel information processing of a plurality of scalar instructions; a vector execution unit (Vector Unit) for performing parallel information processing of a plurality of vector instructions; a data read-write unit (LS Unit) for reading and writing the instruction memory IMEM; and a data memory (DMEM) for storing, in parallel, the data read from the data read-write unit LS Unit. The invention solves the technical problem in the related art that information parallel computing platforms have high processing complexity.

Description

Information processing system oriented to bit granularity
Technical Field
The invention relates to the field of information parallel processing, in particular to an information processing system oriented to bit granularity.
Background
At present, parallel computing technology is developing rapidly, and a variety of parallel-computing information processing platforms have entered the market. TI's C64/C66 parallel processing platforms occupy the information processing market for communications, multimedia and similar applications. CEVA has used parallel vector computing technology to drive the rapid development of software radio. In addition, Tensilica's DPU, VeriSilicon's ZSP and Freescale's PowerPC have all stood out in the information processing market by adopting parallel computing platforms. Parallel computing technology is entering the information processing market at a revolutionary pace.
Among parallel computing technologies for information processing, bit-granularity parallel computing occupies an important position. Information processing mainly includes symbol-oriented and bit-oriented information processing. Symbol-oriented information processing mainly comprises byte operations, integer operations, complex-number operations and the like, and is currently the main field of parallel computing. Bit-granularity information processing mainly comprises source coding, channel coding, encryption coding and the like. These bit-granularity coding techniques are used throughout multimedia, communications and secure transmission. Meanwhile, as information processing rates rise and the number of modes grows, a bit-granularity parallel computing platform is urgently needed to deliver high-performance, general-purpose bit-granularity information processing.
Existing parallel information processing platforms mainly implement symbol-granularity information processing, and they have two main problems with bit-granularity parallel computing. First, bit-granularity parallel computing platforms are narrow in function. Researchers in China and abroad have developed various parallel processors for bit-granularity information processing, such as the parallel channel error-correcting-code processor designed by Niktash et al., the parallel encryption-coding processor designed by McGregor et al., the channel coding and decoding processor developed by the Institute of Microelectronics of the Chinese Academy of Sciences, and the general-purpose parallel vector cipher processor designed by the National University of Defense Technology. However, each of these is a dedicated parallel processor for one specialized field and serves a single function. Second, bit-granularity parallel computing platforms struggle to balance rate against resource requirements. There are currently two kinds of bit-granularity information processing platform. The first builds bit-granularity operations by combining and splicing symbol-granularity operations. This approach consumes more computation time, so its processing rate is limited. Taking TI's C66 platform as an example, a single core can complete 10G operations per second, but when decoding Turbo codes its throughput is only about 1 Mbps. The second uses dedicated coprocessors, which occupy more hardware resources. TI's C66 platform, for example, integrates 2 Turbo code coprocessors, 4 convolutional code coprocessors, 1 bit rate matching coprocessor and 1 security coprocessor; these coprocessors are numerous and occupy a large amount of resources. Both bit-oriented approaches cost either rate or resources, and neither can be generalized.
How to realize parallel computing for bit-granularity information processing, and how to build a high-speed, low-complexity parallel computing platform for bit-granularity information, are problems that urgently need to be solved.
No effective solution has yet been proposed for the technical problem in the related art that information parallel computing platforms have high processing complexity.
Disclosure of Invention
Embodiments of the invention provide a bit-granularity-oriented information processing system that at least solves the technical problem in the related art that information parallel computing platforms have high processing complexity.
According to an aspect of an embodiment of the present invention, there is provided a bit-granularity-oriented information processing system, including: an instruction memory (IMEM) for storing a plurality of instructions; a control unit (Control Unit) for controlling parallel execution of the plurality of instructions, where the instruction execution modes include scalar execution and vector execution; a scalar execution unit (Scalar Unit) for performing parallel information processing of a plurality of scalar instructions; a vector execution unit (Vector Unit) for performing parallel information processing of a plurality of vector instructions; a data read-write unit (LS Unit) for reading and writing the instruction memory IMEM; and a data memory (DMEM) for storing, in parallel, the data read from the data read-write unit LS Unit.
Further, the instruction memory IMEM comprises a plurality of memory blocks, which store a program of a plurality of instructions in parallel.
Further, the control unit (Control Unit) includes: a fetch unit (Fetch Align) for reading the plurality of instructions stored in the instruction memory IMEM; and an instruction dispatch unit (Dispatcher) for dispatching, in parallel, the instructions stored in the plurality of memory blocks to the scalar execution unit Scalar Unit, the vector execution unit Vector Unit and the data read-write unit LS Unit.
Furthermore, the fetch unit Fetch Align is configured to output an address to the instruction memory IMEM, read the instructions at the corresponding address in the plurality of memory blocks of the instruction memory IMEM to form an instruction group, and output the instruction group to the instruction dispatch unit Dispatcher. When no external input address is present and enabled, the fetch unit Fetch Align increments the address with the address self-adder and outputs the incremented address to the instruction memory IMEM; when an external input address is present and enabled, it outputs the external input address to the instruction memory IMEM.
Further, the instruction dispatch unit Dispatcher is configured to distribute the plurality of instructions in the instruction group, in parallel, to at least one of the scalar execution unit Scalar Unit, the vector execution unit Vector Unit and the data read-write unit LS Unit, and the instruction dispatch unit Dispatcher achieves arbitrary distribution of scalar instructions and vector instructions through the control program.
Furthermore, the system comprises a plurality of scalar execution units (Scalar Unit), which execute in parallel and perform bit-granularity information processing by being configured with different bit widths. Each scalar execution unit Scalar Unit comprises a scalar register file ACC-RF, register read-write logic ACC-RF port, a program control unit PCU, a plurality of computing units CU and a plurality of scalar logic units BMU.
Further, the scalar register file ACC-RF comprises a plurality of registers, including a plurality of first registers and a plurality of second registers. The scalar register file ACC-RF further comprises a plurality of write ports for writing several of the first registers in parallel, a plurality of first read ports for reading several of the first registers in parallel, and a plurality of second read ports for reading several of the second registers in parallel.
Further, the ACC-RF port includes a read multiplexing unit and a write multiplexing unit, where the read multiplexing unit is configured to perform multiplexing of read ports, and the write multiplexing unit is configured to perform multiplexing of write ports.
Further, the program control unit PCU includes a program control instruction decoding unit, a multi-channel loop address register and a multi-channel interrupt address register. When the input instruction is a loop instruction or a function call instruction, the program control instruction decoding unit generates a loop address or a return address from the input instruction and the current address and stores it in the memory. During execution, it compares the current address with the stored loop address or return address; if the loop end is reached and the loop count is greater than 0, or a return instruction is received, it outputs the loop start address or the return address as the jump address and asserts the jump enable. Multi-layer loops are implemented with a plurality of loop addresses, and multi-layer function calls are implemented with a plurality of return addresses.
Furthermore, the computing units CU comprise a decoding module, an arithmetic module ALU, a multiplication module Mul and an add-subtract module Add Sub. The decoding module outputs the signals for reading the scalar register file ACC-RF and the control parameters, and starts one of the arithmetic module ALU and the multiplication module Mul according to the control parameters. The arithmetic module ALU, when started, executes an operation according to the ACC-RF read signals output by the decoding module and outputs the operation result through the ACC-RF write signals, so that the result is written into the scalar register file ACC-RF. The multiplication module Mul, when started, executes a multiplication according to the ACC-RF read signals output by the decoding module, and either outputs the product through the ACC-RF write signals, so that it is written into the scalar register file ACC-RF, or outputs the product to the add-subtract module Add Sub. The add-subtract module Add Sub performs addition or subtraction on the multiplication result and outputs the operation result through the ACC-RF write signals, so that it is written into the scalar register file ACC-RF. During operation of the arithmetic module ALU, the multiplication module Mul and the add-subtract module Add Sub, the bit width of the input data and/or the output data is determined and configured by the bit width instruction.
Furthermore, each scalar logic unit BMU comprises a decoding module, a logic operation module Logic and a packing module Pack Unpack. The decoding module outputs the signals for reading the scalar register file ACC-RF and the control parameters, and starts one of the logic operation module Logic and the packing module Pack Unpack according to the control parameters. The logic operation module Logic, when started, executes an operation according to the ACC-RF read signals output by the decoding module and outputs the operation result through the ACC-RF write signals, so that the result is written into the scalar register file ACC-RF. The packing module Pack Unpack, when started, executes packing and unpacking operations according to the ACC-RF read signals output by the decoding module and outputs the operation result through the ACC-RF write signals, so that the result is written into the scalar register file ACC-RF. During operation of the logic operation module Logic and the packing module Pack Unpack, the bit width of the input data and/or the output data is determined and configured by the bit width instruction.
Furthermore, the logic operation module Logic can execute operations at various bit widths. The packing module Pack Unpack splits data when executing a packing operation and merges data when executing an unpacking operation, and it controls the bit width of the data through these packing and unpacking operations.
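The splitting and merging of data at a configurable bit width described above can be illustrated with a minimal Python sketch. The function names `split_fields` and `merge_fields`, the least-significant-first field order, and the use of plain integers are assumptions made for illustration; the patent does not specify an encoding or an instruction format.

```python
def split_fields(word, width, count):
    """Split `word` into `count` fields of `width` bits, least significant first."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]


def merge_fields(fields, width):
    """Merge fields of `width` bits back into one word (inverse of split_fields)."""
    mask = (1 << width) - 1
    word = 0
    for i, field in enumerate(fields):
        # Each field is truncated to the configured width before being placed.
        word |= (field & mask) << (i * width)
    return word
```

With `width=1` the same two helpers move between a word and its individual bits, which is the bit-granularity case; with `width=4` or `width=8` they operate on nibbles or bytes.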
Furthermore, the vector execution unit (Vector Unit) comprises a vector register file VCC-RF, vector register read-write logic VCC-RF port, a plurality of vector calculation units VCU and a plurality of vector logic units VBMU. The plurality of vector logic units VBMU execute in parallel, and each vector logic unit VBMU can process vector data of different bit widths.
Further, the vector register file VCC-RF comprises a plurality of registers, including a plurality of first registers and a plurality of second registers. The vector register file VCC-RF further comprises a plurality of write ports for writing several of the first registers in parallel, a plurality of first read ports for reading several of the first registers in parallel, and a plurality of second read ports for reading several of the second registers in parallel.
Further, the vector register read-write logic VCC-RF port includes a read multiplexing unit and a write multiplexing unit, where the read multiplexing unit is configured to perform multiplexing of a read port, and the write multiplexing unit is configured to perform multiplexing of a write port.
Further, the vector calculation units VCU include a decoding module and an arithmetic module ALU group. The decoding module outputs the signals for reading the vector register file VCC-RF and the control parameters, and starts the arithmetic module ALU group according to the control parameters. The arithmetic module ALU group, when started, executes an operation according to the VCC-RF read signals output by the decoding module and outputs the operation result through the VCC-RF write signals, so that the result is written into the vector register file VCC-RF. During operation of the arithmetic module ALU group, the bit width and the vector length of the input data and/or the output data are determined and configured by a bit width instruction.
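As a rough illustration of how a configured bit width and vector length could shape a lane-wise operation in the ALU group, consider the Python sketch below. The function name `vcu_add`, the choice of addition as the operation, and wrap-around on overflow are all assumptions; the patent states only that bit width and vector length are configured by a bit width instruction.

```python
def vcu_add(va, vb, width, length):
    """Lane-wise add of the first `length` lanes, each truncated to `width` bits."""
    if len(va) < length or len(vb) < length:
        raise ValueError("operand shorter than configured vector length")
    mask = (1 << width) - 1
    # Each lane is independent, modeling one ALU of the ALU group.
    # Overflow wraps modulo 2**width (an assumption, not stated in the patent).
    return [(x + y) & mask for x, y in zip(va[:length], vb[:length])]
```

With `width=1` the same routine degenerates to a lane-wise XOR, since addition modulo 2 is XOR.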
Furthermore, each vector logic unit VBMU comprises a decoding module and a logic operation packing module Logic Pack group. The decoding module outputs the signals for reading the vector register file VCC-RF and the control parameters, and starts the logic operation packing module Logic Pack group according to the control parameters. The logic operation packing module Logic Pack, when started, executes an operation according to the VCC-RF read signals output by the decoding module and outputs the operation result through the VCC-RF write signals, so that the result is written into the vector register file VCC-RF. During operation of the logic operation packing module Logic Pack, the bit width and the vector length of the input data and/or the output data are determined and configured by a bit width instruction.
Furthermore, the logic operation packing module Logic Pack can execute operations at various bit widths; it splits data when executing a packing operation and merges data when executing an unpacking operation, and it controls the bit width of the data through these packing and unpacking operations.
Furthermore, the data read-write unit (LS Unit) is used for reading and writing the data storage unit DMEM, and comprises a plurality of data read-write modules LSU and a data address register PTR-RF. The data read-write modules LSU execute scalar reads and writes and vector reads and writes in parallel, and the data read-write unit LS Unit reads and writes in parallel through a plurality of ports.
Furthermore, each data read-write module LSU comprises a decoding module, an address generating module, a read storage module and a write storage module. The decoding module starts the address generating module, the read storage module or the write storage module according to the input instruction. The address generating module generates the data address from the address in the data address register PTR-RF and outputs it to the data storage unit DMEM. The read storage module generates a read enable signal to the data storage unit DMEM, reads the data at the data address, and writes it into the scalar register file ACC-RF or the vector register file VCC-RF, where the scalar register file ACC-RF is the register file in the scalar execution unit Scalar Unit and the vector register file VCC-RF is the register file in the vector execution unit Vector Unit. The write storage module generates a write enable signal to the data storage unit DMEM and writes the data input from the scalar register file ACC-RF or the vector register file VCC-RF into the data storage unit DMEM. During reads and writes by the data read-write module LSU, the bit width and the vector length of the input data and/or the output data are determined and configured by a bit width instruction.
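The read and write paths described above can be sketched behaviorally in Python. Everything named here is hypothetical: the class `LoadStoreUnit`, the base-plus-offset addressing, the default of four address registers, and the modeling of DMEM as a flat list of words are illustrative assumptions rather than details taken from the patent.

```python
class LoadStoreUnit:
    """Behavioral model of one LSU with its address registers (illustrative only)."""

    def __init__(self, dmem_size, n_ptrs=4):
        self.dmem = [0] * dmem_size   # stands in for the data memory DMEM
        self.ptr_rf = [0] * n_ptrs    # stands in for the address registers PTR-RF

    def _address(self, ptr, offset):
        # Address generation: base from PTR-RF plus an offset (assumed scheme).
        return self.ptr_rf[ptr] + offset

    def load(self, ptr, offset=0, length=1, width=8):
        """Read `length` elements, each truncated to `width` bits."""
        base = self._address(ptr, offset)
        mask = (1 << width) - 1
        return [self.dmem[base + i] & mask for i in range(length)]

    def store(self, values, ptr, offset=0, width=8):
        """Write each value, truncated to `width` bits, to consecutive addresses."""
        base = self._address(ptr, offset)
        mask = (1 << width) - 1
        for i, value in enumerate(values):
            self.dmem[base + i] = value & mask
```

A scalar access is the `length=1` case; a vector access uses a longer `length`, so one model covers both kinds of read and write.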
Further, the data address register PTR-RF of the data read/write Unit LS Unit includes a plurality of address registers for storing addresses of read data.
Further, the data memory DMEM includes a plurality of memory blocks for storing the information processing data in parallel, wherein the information processing data is data read from the data read/write module LSU.
In an embodiment of the invention, the instruction memory IMEM stores a plurality of instructions; the control unit (Control Unit) controls parallel execution of the plurality of instructions, where the instruction execution modes include scalar execution and vector execution; the scalar execution unit (Scalar Unit) performs parallel information processing of a plurality of scalar instructions; the vector execution unit (Vector Unit) performs parallel information processing of a plurality of vector instructions; the data read-write unit (LS Unit) reads and writes the instruction memory IMEM; and the data memory DMEM stores, in parallel, the data read from the data read-write unit LS Unit. This solves the technical problem in the related art that information parallel computing platforms have high processing complexity, and achieves the technical effect of reducing that processing complexity.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of an alternative bit-granularity oriented information processing system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative fetch unit in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative instruction dispatch unit, according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative scalar register file according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an alternative scalar register read and write logic, according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an alternative program controller according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative scalar calculation unit in accordance with embodiments of the present invention;
FIG. 8 is a schematic diagram of an alternative scalar logic unit, according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an alternative vector register file according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of an alternative vector register read-write unit, according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of an alternative vector operation unit according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an alternative vector logic unit in accordance with embodiments of the present invention;
FIG. 13 is a diagram of an alternative data read/write unit according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The present application provides an embodiment.
FIG. 1 is a schematic diagram of an alternative bit-granularity-oriented information processing system according to an embodiment of the present invention. As shown in FIG. 1, the system includes:
an instruction memory (IMEM) for storing a plurality of instructions; a control unit (Control Unit) for controlling parallel execution of the plurality of instructions, where the instruction execution modes include scalar execution and vector execution; a scalar execution unit (Scalar Unit) for performing parallel information processing of a plurality of scalar instructions; a vector execution unit (Vector Unit) for performing parallel information processing of a plurality of vector instructions; a data read-write unit (LS Unit) for reading and writing the instruction memory IMEM; and a data memory (DMEM) for storing, in parallel, the data read from the data read-write unit LS Unit.
As an alternative embodiment, the instruction memory IMEM comprises a plurality of memory blocks, which store programs of a plurality of instructions in parallel.
As an alternative embodiment, the control unit (Control Unit) includes: a fetch unit (Fetch Align) for reading the plurality of instructions stored in the instruction memory IMEM; and an instruction dispatch unit (Dispatcher) for dispatching, in parallel, the instructions stored in the plurality of memory blocks to the scalar execution unit Scalar Unit, the vector execution unit Vector Unit and the data read-write unit LS Unit.
As an optional implementation, the fetch unit Fetch Align is configured to output an address to the instruction memory IMEM, read the instructions at the corresponding address in the plurality of memory blocks of the instruction memory IMEM to form an instruction group, and output the instruction group to the instruction dispatch unit Dispatcher. When no external input address is present and enabled, the fetch unit Fetch Align increments the address with the address self-adder and outputs the incremented address to the instruction memory IMEM; when an external input address is present and enabled, it outputs the external input address to the instruction memory IMEM.
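The address selection of the fetch unit can be sketched as a small behavioral model in Python. The class name `FetchAlign`, the modeling of IMEM as one list per memory block, and an increment step of 1 are assumptions made for illustration; the patent does not give these details.

```python
class FetchAlign:
    """Behavioral model of the fetch unit's address selection (illustrative only)."""

    def __init__(self, imem_banks, start=0):
        self.imem_banks = imem_banks  # one list of instructions per memory block
        self.addr = start

    def step(self, ext_addr=None, ext_enable=False):
        # An enabled external address (for example a jump target) overrides
        # the self-incremented address.
        if ext_enable and ext_addr is not None:
            self.addr = ext_addr
        # Read the same address in every block to form one instruction group.
        group = [bank[self.addr] for bank in self.imem_banks]
        # Address self-adder; a step of 1 is an assumption.
        self.addr += 1
        return group
```

Because every memory block is read at the same address, the width of an instruction group equals the number of blocks, which is what allows several instructions to be issued together.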
As an alternative implementation, the instruction dispatch unit Dispatcher is configured to distribute the plurality of instructions in an instruction group, in parallel, to at least one of the scalar execution unit Scalar Unit, the vector execution unit Vector Unit and the data read-write unit LS Unit, and the instruction dispatch unit Dispatcher achieves arbitrary distribution of scalar instructions and vector instructions through a control program.
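A minimal Python sketch of this routing step follows. It assumes, purely for illustration, that each slot of an instruction group carries a tag naming its target unit; the tag values, the `(unit, opcode)` tuple layout and the function name `dispatch` are hypothetical and do not come from the patent.

```python
from collections import defaultdict

# Hypothetical unit tags standing in for Scalar Unit, Vector Unit and LS Unit.
UNITS = ("scalar", "vector", "ls")


def dispatch(group):
    """Route each (unit, opcode) slot of one instruction group to its target unit."""
    issued = defaultdict(list)
    for unit, opcode in group:
        if unit not in UNITS:
            raise ValueError(f"unknown unit: {unit}")
        issued[unit].append(opcode)
    return dict(issued)
```

Scalar and vector slots may appear in any position and in any mix within a group, which is one reading of the arbitrary distribution described above.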
As an optional implementation, the system includes a plurality of scalar execution units (Scalar Unit), which execute in parallel and perform bit-granularity information processing by being configured with different bit widths. Each scalar execution unit Scalar Unit includes a scalar register file ACC-RF, register read-write logic ACC-RF port, a program control unit PCU, a plurality of computing units CU and a plurality of scalar logic units BMU.
As an alternative embodiment, the scalar register file ACC-RF comprises a plurality of registers, including a plurality of first registers and a plurality of second registers. The scalar register file ACC-RF further comprises a plurality of write ports for writing several of the first registers in parallel, a plurality of first read ports for reading several of the first registers in parallel, and a plurality of second read ports for reading several of the second registers in parallel.
As an optional implementation, the ACC-RF port includes a read multiplexing unit and a write multiplexing unit, where the read multiplexing unit is configured to perform multiplexing of a read port, and the write multiplexing unit is configured to perform multiplexing of a write port.
As an alternative implementation, the program control unit PCU includes a program control instruction decoding unit, a multi-channel loop address register, and a multi-channel interrupt address register. When the input instruction is a loop instruction or a function call instruction, the program control instruction decoding unit generates a loop address or a return address according to the input instruction and the current address and stores it into the memory. While instructions are running, the current address is compared with the loop address or the return address in the memory; if the loop tail is reached and the loop count is greater than 0, or a return instruction is received, the loop head address or the return address is output as a jump address and the jump enable is asserted. A multi-layer loop is implemented by a plurality of loop addresses, and a multi-layer function call is implemented by a plurality of return addresses.
As an alternative embodiment, the plurality of computing units CU includes a decoding module, an arithmetic module ALU, a multiplication module Mul, an Add-subtract module Add Sub; the decoding module is used for outputting signals and control parameters of a reading scalar register file ACC-RF and controlling one of the arithmetic module ALU and the multiplication module Mul to start according to the control parameters; the arithmetic module ALU is used for executing operation according to the signal of the reading scalar register file ACC-RF output by the decoding module when the operation module is started, outputting an operation result through the signal of the writing scalar register file ACC-RF and writing the operation result into the scalar register file ACC-RF; the multiplication module Mul is used for executing multiplication operation according to the signals of the reading scalar register file ACC-RF output by the decoding module when the decoding module is started, outputting the result of the multiplication operation through the signals of the writing scalar register file ACC-RF, writing the result into the scalar register file ACC-RF or outputting the result of the multiplication operation to the addition and subtraction module Add Sub; the addition and subtraction module Add Sub adds and subtracts the multiplication result, outputs the operation result through writing a signal of the scalar register file ACC-RF, and writes the operation result into the scalar register file ACC-RF; during the operation of the arithmetic module ALU, the multiplication module Mul and the Add-subtract module Add Sub, the bit width of the input data and/or the output data is determined and configured by the bit width instruction.
As an optional implementation, each scalar Logic unit BMU includes a decoding module, a Logic operation module Logic, and a packing module Pack Unpack; the decoding module is used for outputting signals and control parameters of a reading scalar register file ACC-RF and controlling one of the Logic operation module Logic and the packing module Pack Unpack to start according to the control parameters; the Logic operation module Logic is used for executing operation according to signals of the reading scalar register file ACC-RF output by the decoding module when the Logic operation module is started, outputting operation results through signals of the writing scalar register file ACC-RF and writing the operation results into the scalar register file ACC-RF; the packing module Pack Unpack is used for executing packing operation and unpacking operation according to signals of the reading scalar register file ACC-RF output by the decoding module when the decoding module is started, outputting operation results through signals of the writing scalar register file ACC-RF and writing the operation results into the scalar register file ACC-RF; during the operation process of the Logic operation module Logic and the Pack Unpack module, the bit width of the input data and/or the output data is determined and configured through the bit width instruction.
As an optional implementation manner, the Logic operation module Logic can perform operations with various bit widths, and the packing module Pack Unpack splits data when performing packing operations and merges data when performing unpacking operations, where the packing module Pack Unpack controls the bit width of data through packing operations and unpacking operations.
As an optional implementation, the Vector execution Unit Vector Unit includes a Vector register file VCC-RF, a Vector register read-write logic VCC-RF port, a plurality of Vector calculation units VCU, and a plurality of Vector logic units VBMU; the plurality of vector logic units VBMUs are configured to execute in parallel, and each vector logic unit VBMU can process vector data with different bit widths.
As an alternative implementation, the vector register file VCC-RF comprises a plurality of registers, wherein the plurality of registers comprises a plurality of first registers and a plurality of second registers, the vector register file VCC-RF further comprises a plurality of write ports for writing in parallel to a plurality of the plurality of first registers, a plurality of first read ports for reading in parallel to a plurality of the plurality of first registers, and a plurality of second read ports for reading in parallel to a plurality of the plurality of second registers.
As an optional implementation, the vector register read-write logic VCC-RF port includes a read multiplexing unit and a write multiplexing unit, where the read multiplexing unit is configured to perform multiplexing of a read port, and the write multiplexing unit is configured to perform multiplexing of a write port.
As an alternative embodiment, the plurality of vector calculation units VCU includes a decoding module and an arithmetic module ALU group; the decoding module is used for outputting a signal and a control parameter of a read vector register file VCC-RF and controlling the starting of an arithmetic module ALU group according to the control parameter; the arithmetic module ALU is used for executing operation according to the signal of the read vector register file VCC-RF output by the decoding module when starting, outputting an operation result through the signal of the write vector register file VCC-RF and writing the operation result into the vector register file VCC-RF; during the operation of the arithmetic module ALU group, the bit width and the vector length of input data and/or output data are determined and configured through a bit width instruction.
As an optional implementation, each vector Logic unit VBMU comprises a decoding module and a Logic operation packing module Logic Pack group; the decoding module is used for outputting a signal for reading the vector register file VCC-RF and a control parameter, and for controlling the starting of the Logic operation packing module Logic Pack group according to the control parameter; the Logic operation packing module Logic Pack is used for executing an operation, when started, according to the signal for reading the vector register file VCC-RF output by the decoding module, outputting the operation result through the signal for writing the vector register file VCC-RF, and writing the operation result into the vector register file VCC-RF; during operation of the Logic operation packing module Logic Pack, the bit width and the vector length of input data and/or output data are determined and configured through a bit width instruction.
As an optional implementation manner, the Logic Pack module Logic Pack can perform operations with various bit widths, split data when performing a packing operation, and merge data when performing an unpacking operation, where the Logic Pack module Logic Pack controls a bit width of data through a packing operation and an unpacking operation.
As an optional implementation manner, the data reading and writing Unit LS Unit is used for reading and writing the data storage Unit DMEM, and includes a plurality of data reading and writing modules LSU and a data address register PTR-RF; the data reading and writing modules LSU are used for executing scalar reading and writing and vector reading and writing in parallel; the data read-write Unit LS Unit performs reading and writing in parallel through a plurality of ports.
As an optional implementation manner, each data reading and writing module LSU includes a decoding module, an address generating module, a reading storage module and a writing storage module; the decoding module starts an address generating module, a reading storage module or a writing storage module according to an input instruction; the address generating module generates an address of data according to the address in the data address register PTR-RF and outputs the address to a data storage unit DMEM; the reading and storing module generates a reading enabling signal to a data storing Unit DMEM, reads data corresponding to the address of the data and writes the data into a Scalar register file ACC-RF or a Vector register file VCC-RF, wherein the Scalar register file ACC-RF is a file in a Scalar execution Unit Scalar Unit, and the Vector register file VCC-RF is a file in a Vector execution Unit Vector Unit; the write storage module generates a write enable signal to a data storage unit DMEM, and data input by a scalar register file ACC-RF or a vector register file VCC-RF are written into the data storage unit DMEM; in the process of reading and writing of the data reading and writing module LSU, the bit width and the vector length of input data and/or output data are determined and configured through a bit width instruction.
As an alternative embodiment, the data address register PTR-RF of the data read/write Unit LS Unit includes a plurality of address registers for storing addresses of read data.
As an alternative implementation, the data memory DMEM includes a plurality of memory blocks, and the plurality of memory blocks are used for storing the information processing data in parallel, where the information processing data is data read from the data read/write module LSU.
The following describes a specific structure and processing method of an alternative bit-granularity-oriented information processing system with reference to fig. 1 as follows:
the system comprises an instruction memory IMEM, a Control Unit, a Scalar execution Unit, a Vector execution Unit, a data read-write Unit LS Unit and a data memory DMEM.
The instruction memory IMEM stores an information processing instruction group program in parallel via a plurality of memory blocks, and reads out a specific instruction via a Fetch/Align unit. The internal structure is shown in the following table:
Table 1: Internal memory structure of the instruction memory IMEM
N×1 | N×1 | N×1 | N×1
As shown in FIG. 1, the Control Unit includes an instruction fetch unit Fetch Align and an instruction dispatch unit Dispatcher. The instruction fetch unit Fetch Align reads instructions from the instruction memory IMEM, and the instruction dispatch unit Dispatcher dispatches them in parallel to the Scalar execution Unit Scalar Unit, the Vector execution Unit Vector Unit and the data read-write Unit LS Unit, with arbitrary distribution of scalar and vector instructions under program control. The Control Unit realizes parallel execution of multiple instructions by dispatching very long instruction words; through program control, scalar execution and vector execution are combined, realizing high-speed, high-performance parallel processing oriented to bit granularity.
The Scalar execution Unit Scalar Unit is used for finishing parallel information processing of Scalar data, and as shown in fig. 1, the Scalar execution Unit Scalar Unit comprises a Scalar register file ACC-RF, a register read-write logic ACC-RF port, a program control Unit PCU, 2 calculation units CU and 2 Scalar logic units BMU; multiple scalar units can be executed in parallel; the scalar unit can realize bit granularity information processing through bit width configuration; the scalar execution rate is improved by parallel processing of a plurality of scalar processing units.
The Vector execution Unit Vector Unit is used for completing parallel information processing of vector data. As shown in fig. 1, the Vector execution Unit Vector Unit includes a Vector register file VCC-RF, a Vector register read-write logic VCC-RF port, 2 Vector calculation units VCU, and 2 Vector logic units VBMU; multiple vector units can be executed in parallel; the vector execution unit implements vectorized bit execution through multiple bit widths; the vector processing rate is improved by parallel processing in a plurality of vector modules.
The data reading and writing Unit LS Unit is used to realize reading and writing of data from the data storage Unit DMEM, as shown in fig. 1, the data reading and writing Unit LS Unit includes: 2 data read-write modules LSU, data address register PTR-RF; a plurality of units can be executed in parallel, and each unit can be used for scalar reading and writing and can also be used for vector reading and writing; and high-speed data storage is realized by parallel vector reading and writing of a plurality of ports.
The data memory DMEM stores information processing data in parallel in a plurality of memory blocks, and specific data is read out of it by the data reading and writing module LSU; two ports can be used for reading and writing simultaneously. The structure of the data memory DMEM is shown in the following table:
Table 2: Internal memory structure of the data memory DMEM
N×1 | N×1 | N×1 | N×1 | N×1 | N×1 | N×1 | N×1
The Fetch unit Fetch Align in the Control Unit outputs an address to the instruction memory IMEM, reads out the instructions at that address in the plurality of blocks of the IMEM to form an instruction group, and outputs the instruction group to the instruction dispatch unit Dispatcher. As an alternative implementation, the process of reading instructions by the Fetch unit Fetch Align may be as shown in fig. 2: when no external address is input, the fetch address is incremented by the self-adder; when the PCU inputs an external address and it is enabled, the external address provided by the PCU is output. The instruction group forms a very long instruction word, which realizes instruction-level parallelism.
As shown in fig. 3, the instruction dispatch unit Dispatcher in the Control Unit dispatches the multiple instructions (INS0, INS1, INS2, INS3) in the instruction group in parallel to at least one of the Scalar execution Unit Scalar Unit, the Vector execution Unit Vector Unit, and the data read-write Unit LS Unit, with arbitrary distribution of scalar and vector instructions under program control. The unit realizes parallel execution of multiple instructions by dispatching very long instruction words (as shown in FIG. 3, the instructions are distributed to the parallel operations Operation0, Operation1, Operation2, ..., Operation11); through program control, scalar execution and vector execution are combined, realizing high-speed, high-performance parallel processing oriented to bit granularity.
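As a rough illustration of the dispatch step, the sketch below routes each instruction of a group to the unit named in its tag. The tag-based encoding, the unit names in `UNITS` and the function `dispatch` are assumptions made for this example; the application does not specify how an instruction identifies its target unit.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical unit tags; the real instruction encoding is not given.
UNITS = ("scalar", "vector", "ls")


def dispatch(group: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Route every (unit, operation) pair of one instruction group."""
    issued: Dict[str, List[str]] = defaultdict(list)
    for unit, operation in group:
        if unit not in UNITS:
            raise ValueError("unknown target unit: " + unit)
        # All instructions of a group are issued in the same cycle,
        # so each unit may receive zero, one or several operations.
        issued[unit].append(operation)
    return dict(issued)
```

A group such as `[("scalar", "add"), ("vector", "vxor"), ("ls", "load"), ("scalar", "pack")]` yields two scalar operations, one vector operation and one load in the same issue step, which is the kind of free mixing of scalar and vector work the paragraph describes.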
The Scalar execution Unit Scalar Unit comprises a Scalar register file ACC-RF, a register read-write logic ACC-RF port, a program control Unit PCU, 2 calculation units CU and 2 Scalar logic units BMU; multiple scalar units can be executed in parallel; the scalar unit can realize bit granularity information processing through bit width configuration; the scalar execution rate is improved by parallel processing of a plurality of scalar processing units.
The Scalar register file ACC-RF of the Scalar execution Unit Scalar Unit is shown in FIG. 4. The register file has 32 registers in total, of which the first 28 are data registers with a bit width of 40 bits and the last 4 are prediction registers (pred) with a bit width of 4 bits. There are 4 write ports, so 4 of the 32 registers can be written simultaneously; there are 8 read ports, so 8 of the 32 registers can be read simultaneously; and there are 4 prediction register read ports, so the 4 prediction registers can be read simultaneously. Reading and writing through multiple ports in parallel realizes parallel input and output of the register file.
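The register counts, widths and port limits above can be captured in a short behavioural model. This is a sketch only: the class `ScalarRegFile` and its methods are hypothetical, register indices 0 to 27 are assumed to be the data registers and 28 to 31 the prediction registers, and the four dedicated prediction read ports are not modelled separately.

```python
from typing import Dict, List


class ScalarRegFile:
    NUM_DATA = 28     # data registers, 40 bits each
    NUM_PRED = 4      # prediction registers, 4 bits each
    READ_PORTS = 8
    WRITE_PORTS = 4

    def __init__(self) -> None:
        self.regs = [0] * (self.NUM_DATA + self.NUM_PRED)

    def _width(self, index: int) -> int:
        return 40 if index < self.NUM_DATA else 4

    def read(self, indices: List[int]) -> List[int]:
        # At most 8 registers can be read in one cycle.
        if len(indices) > self.READ_PORTS:
            raise ValueError("more reads than read ports")
        return [self.regs[i] for i in indices]

    def write(self, updates: Dict[int, int]) -> None:
        # At most 4 registers can be written in one cycle.
        if len(updates) > self.WRITE_PORTS:
            raise ValueError("more writes than write ports")
        for index, value in updates.items():
            # Truncate to the width of the target register.
            self.regs[index] = value & ((1 << self._width(index)) - 1)
```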
As shown in fig. 5, the register read-write logic ACC-RF port of the Scalar execution Unit Scalar Unit includes a read multiplexing unit and a write multiplexing unit; the read multiplexing unit multiplexes the 11 read ports of the LSU, CU, BMU, and PCU onto the 8 read ports; the write multiplexing unit multiplexes the 6 write ports of the LSU, CU, and BMU onto the 4 write ports.
As shown in fig. 6, the program control Unit PCU of the Scalar execution Unit Scalar Unit includes a program control instruction decoding unit, an 8-channel loop address register, and a 3-channel interrupt address register. When the input PCU instruction is a loop instruction or a function call instruction, the program control instruction decoding unit generates a loop address or a return address according to the input PCU instruction and the current PC address and stores it into a memory. During operation, the PC address is continuously compared with the loop address or the return address in the memory; if the PC address reaches the loop tail and the loop count is greater than 0, or a return instruction is received, the loop head address or the return address is output as the jump address and the jump enable is asserted. Multiple loop addresses implement multi-layer loops; multiple return addresses implement multi-level function calls.
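The loop handling can be illustrated with a minimal model of the address comparison. The sketch assumes a loop instruction that encodes a body length and an iteration count, and a loop stack of at most 8 entries to mirror the 8-channel loop address register; the class `PCU` and its method names are hypothetical, and function calls and interrupts are left out.

```python
from typing import List, Tuple


class PCU:
    MAX_LOOPS = 8  # mirrors the 8-channel loop address register

    def __init__(self) -> None:
        # Each entry is [head address, tail address, remaining repeats].
        self.loops: List[List[int]] = []

    def begin_loop(self, pc: int, length: int, count: int) -> None:
        """Record a loop whose body is the `length` instructions after pc."""
        if len(self.loops) >= self.MAX_LOOPS:
            raise RuntimeError("loop nesting too deep")
        self.loops.append([pc + 1, pc + length, count - 1])

    def next_pc(self, pc: int) -> Tuple[bool, int]:
        """Return (jump_enable, next address) for the current PC."""
        while self.loops and pc == self.loops[-1][1]:
            entry = self.loops[-1]
            if entry[2] > 0:
                # Loop tail reached with repeats left: jump to the head.
                entry[2] -= 1
                return True, entry[0]
            # Loop finished: drop it and check any enclosing loop.
            self.loops.pop()
        return False, pc + 1
```

For a loop instruction at address 0 with a two-instruction body repeated three times, the executed addresses are 0, 1, 2, 1, 2, 1, 2, 3. Keeping the entries on a stack is what allows nested loops in this model.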
As shown in FIG. 7, each of the 2 computing units CU of the Scalar execution Unit Scalar Unit includes a decoding module, an arithmetic module ALU, a multiplication module Mul, and an add/subtract module Add/Sub. The decoding module outputs the signal for reading the ACC-RF register file and a control parameter; one of ALU and Mul is started according to the control parameter. When the ALU is started, it performs four 8-bit, two 16-bit, or one 32-bit operations on the ACC-RF input data obtained by the decoded read of the ACC-RF, outputs the operation result through the write ACC-RF0 signal, and writes it into the ACC-RF register file. When Mul is started, it performs four 8-bit multiplications on the ACC-RF input data obtained by the decoded read of the ACC-RF, and either outputs the operation result through the write ACC-RF0 signal and writes it into the ACC-RF register file, or outputs the multiplication result to the Add/Sub module. The Add/Sub module adds or subtracts the multiplication results to realize four 8-bit or two 16-bit multiply-accumulate/multiply-accumulate-subtract operations, outputs the operation result through the write ACC-RF1 signal, and writes it into the ACC-RF register file. During operation, the pred-RF input obtained by reading the pred-RF supplies indication bits that mark which bytes of the input/output data are valid; through the indication of the pred-RF, various bit widths are enabled, realizing configurable bit width at bit granularity.
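The lane-wise behaviour of the ALU can be illustrated with an addition on a 32-bit word split into four 8-bit, two 16-bit or one 32-bit lanes, followed by a byte-valid mask. This is a sketch under assumptions: only the low 32 bits of a register are modelled, the function name `packed_add` is hypothetical, and the choice that a byte whose indication bit is 0 keeps its previous value (`old`) is an interpretation of the byte-valid indication, not something the text states.

```python
def packed_add(a: int, b: int, lane_bits: int,
               pred: int = 0xF, old: int = 0) -> int:
    """Add two 32-bit words lane by lane, then apply a 4-bit byte mask."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16 or 32")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Masking the sum keeps carries from crossing into the next lane.
        result |= (lane_sum & lane_mask) << shift
    out = 0
    for byte in range(4):
        # Bit i of pred selects whether byte i takes the new result.
        source = result if (pred >> byte) & 1 else old
        out |= source & (0xFF << (8 * byte))
    return out
```

The same operands give different results depending on the configured width: `packed_add(0x0000FFFF, 0x00000001, 8)` is `0x0000FF00`, with 16-bit lanes it is `0x00000000`, and as a single 32-bit lane it is `0x00010000`.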
As shown in fig. 8, each of the 2 Scalar logic units BMU of the Scalar execution Unit Scalar Unit includes a decoding module, a Logic operation module Logic and a packing module Pack/Unpack. The decoding module outputs the signal for reading the ACC-RF register file and a control parameter; the logic module or the packing module is started according to the control parameter. When the logic module is started, it performs four 8-bit, two 16-bit, or one 32-bit logic operations on the ACC-RF input data obtained by the decoded read of the ACC-RF, outputs the operation result through the write ACC-RF0 signal, and writes it into the ACC-RF register file. When the packing module is started, it performs four 8-bit, two 16-bit, or one 32-bit packing and unpacking operations on the ACC-RF input data obtained by the decoded read of the ACC-RF, outputs the operation result through the write ACC-RF0 signal, and writes it into the ACC-RF register file. During operation, the pred-RF input obtained by reading the pred-RF supplies indication bits that mark which bytes of the input/output data are valid; through the indication of the pred-RF, various bit widths are enabled, realizing configurable bit width at bit granularity.
The logic operation of the scalar logic unit BMU can realize the operation of any bit width; the packing operation can divide the data into any bit width and can also combine the data with any bit width together; by packing and unpacking, finer control of bit granularity is achieved.
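Splitting a word into fields of arbitrary bit width and merging such fields back can be sketched as follows. The function names `split_fields` and `merge_fields` and the least-significant-field-first ordering are assumptions made for this example; they are deliberately neutral about which direction the application calls packing and which unpacking.

```python
from typing import List


def split_fields(word: int, widths: List[int]) -> List[int]:
    """Split a word into fields of the given widths, lowest bits first."""
    fields = []
    for width in widths:
        fields.append(word & ((1 << width) - 1))
        word >>= width
    return fields


def merge_fields(fields: List[int], widths: List[int]) -> int:
    """Merge fields of the given widths back into one word."""
    word = 0
    shift = 0
    for value, width in zip(fields, widths):
        if value >> width:
            raise ValueError("field value does not fit its width")
        word |= value << shift
        shift += width
    return word
```

For example, the 9-bit value `0b101101101` split with widths 1, 4 and 4 gives the fields 1, 6 and 11, and merging those fields with the same widths restores the original value.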
The Vector execution Unit Vector Unit completes information processing of vector data, and comprises a Vector register file VCC-RF, a Vector register read-write logic VCC-RF port, 2 Vector calculation units VCU and 2 Vector logic units VBMU; multiple vector units can be executed in parallel; the vector execution unit implements vectorized bit execution through multiple bit widths; the vector processing rate is improved by parallel processing in a plurality of vector modules.
The Vector register file VCC-RF of the Vector execution Unit Vector Unit is shown in fig. 9. It has 32 registers in total, of which the first 28 are data registers VCC with a bit width of N×40 bits and the last 4 are prediction registers vpred with a bit width of N×4 bits. There are 4 write ports, so 4 of the 32 registers can be written simultaneously; there are 8 read ports, so 8 of the 32 registers can be read simultaneously; and there are 4 prediction register read ports, so the 4 prediction registers can be read simultaneously. Reading and writing through multiple ports in parallel realizes parallel input and output of the register file.
As shown in fig. 10, the Vector register read-write logic VCC-RF port of the Vector execution Unit Vector Unit includes a read multiplexing unit and a write multiplexing unit; the read multiplexing unit multiplexes the 10 read ports of the LSU, the CU and the BMU onto the 8 read ports; the write multiplexing unit multiplexes the 6 write ports of the LSU, CU, and BMU onto the 4 write ports.
As shown in fig. 11, each of the 2 Vector computing units VCU of the Vector execution Unit Vector Unit includes a decoding module and an arithmetic module ALU group. The decoding module outputs the signal for reading the VCC-RF register file and a control parameter; the ALU group is started according to the control parameter. When the ALU group is started, it performs N groups of four 8-bit, two 16-bit, or one 32-bit operations on the VCC-RF input data obtained by the decoded read of the VCC-RF, outputs the operation result through the write VCC-RF signal, and writes it into the VCC-RF register file. During operation, the vpred-RF input obtained by reading the vpred-RF supplies indication bits that mark which bytes of the input/output data are valid; through the indication of the vpred-RF, various bit widths and vector lengths are enabled, realizing configurable bit width and vector length at bit granularity.
As shown in fig. 12, each of the 2 Vector Logic units VBMU of the Vector execution Unit Vector Unit includes a decoding module and a Logic operation module Logic/Pack group. The decoding module outputs the signal for reading the VCC-RF register file and a control parameter; the logic module group is started according to the control parameter. When the logic module group is started, it performs N groups of four 8-bit, two 16-bit, or one 32-bit logic operations on the VCC-RF input data obtained by the decoded read of the VCC-RF, outputs the operation result through the write VCC-RF0 signal, and writes it into the VCC-RF register file. During operation, the vpred-RF input obtained by reading the vpred-RF supplies indication bits that mark which bytes of the input/output data are valid; through the indication of the vpred-RF, various bit widths and vector lengths are enabled, realizing configurable bit width and vector length at bit granularity.
The logic operation of the vector logic unit VBMU can realize the operation of any group of any bit width; the packing operation can divide the data into any group of data with any bit width, and can also combine the data with any bit width; by packing and unpacking, finer control of bit granularity is achieved.
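How per-byte indication bits can configure both the effective bit width and the effective vector length of a vector logic operation is sketched below with a bitwise XOR over N elements. The function `vector_xor`, the 32-bit element size and the rule that masked-off bytes keep their previous value are assumptions made for illustration only.

```python
from typing import List


def vector_xor(va: List[int], vb: List[int],
               vpred: List[int], old: List[int]) -> List[int]:
    """XOR two vectors of 32-bit elements under a 4-bit mask per element."""
    out = []
    for a, b, pred, previous in zip(va, vb, vpred, old):
        byte_mask = 0
        for i in range(4):
            if (pred >> i) & 1:
                byte_mask |= 0xFF << (8 * i)
        # Enabled bytes take the XOR result; the others keep the old value.
        kept = previous & ~byte_mask & 0xFFFFFFFF
        out.append(((a ^ b) & byte_mask) | kept)
    return out
```

Setting an element's mask to `0x0` removes that element from the operation, which shortens the effective vector length, while a mask such as `0x3` restricts the operation to the low 16 bits of that element.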
As shown in fig. 13, each of the 2 data read/write modules LSU of the data read/write Unit LS Unit includes an instruction decoding module, an address generation module, a read storage module, and a write storage module. The instruction decoding module starts the address generation module, the read storage module or the write storage module according to the input instruction; the address generation module generates a data address according to the address in the PTR-RF and outputs it to the data memory DMEM; the read storage module generates a read enable signal to the data memory DMEM, reads the data corresponding to the data address, and writes the data into the ACC-RF or VCC-RF; the write storage module generates a write enable signal to the data memory DMEM and writes the data input from the ACC-RF or VCC-RF into the data memory DMEM; two ports can be used for reading and writing simultaneously. During reading and writing, the pred/vpred-RF input obtained by reading the pred/vpred-RF supplies indication bits that mark which bytes of the input/output data are valid; through the indication of the pred/vpred-RF, various bit widths and vector lengths are enabled, realizing configurable bit width and vector length at bit granularity.
The data address register PTR-RF of the data read-write Unit LS Unit comprises 4 address registers that hold the addresses used for reading and writing data.
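The load and store path can be sketched as a small model in which the address comes from one of the 4 PTR-RF registers and a 4-bit mask selects which bytes of a word are written. The class `LSU`, the word-addressed 32-bit `dmem` list and the helper `byte_mask` are hypothetical simplifications; banking of the DMEM and vector accesses are not modelled.

```python
from typing import List


def byte_mask(pred: int) -> int:
    """Expand a 4-bit byte-valid mask into a 32-bit bit mask."""
    mask = 0
    for i in range(4):
        if (pred >> i) & 1:
            mask |= 0xFF << (8 * i)
    return mask


class LSU:
    def __init__(self, dmem_words: int = 16) -> None:
        self.dmem: List[int] = [0] * dmem_words  # word-addressed data memory
        self.ptr: List[int] = [0] * 4            # PTR-RF: 4 address registers

    def load(self, ptr_idx: int) -> int:
        # Address generation: take the address from the selected pointer.
        return self.dmem[self.ptr[ptr_idx]]

    def store(self, ptr_idx: int, value: int, pred: int = 0xF) -> None:
        addr = self.ptr[ptr_idx]
        mask = byte_mask(pred)
        # Only bytes whose indication bit is set are overwritten.
        kept = self.dmem[addr] & ~mask & 0xFFFFFFFF
        self.dmem[addr] = (value & mask) | kept
```

Storing `0xAABBCCDD` with mask `0b0011` over a word that holds `0x11223344` leaves `0x1122CCDD` in memory, so a narrower write does not disturb the neighbouring bytes.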
The application provides a low-complexity parallel computing method and device for bit-granularity information processing, including a low-complexity parallel computing structure for bit-granularity information processing, a parallel information processing method, parallel and vector processing instructions, and a device. The device realizes parallel execution of multiple instructions by dispatching very long instruction words; bit-granularity information processing is realized through the configurable bit width of the scalar unit; the scalar execution rate is improved by parallel processing in a plurality of scalar processing units; vectorized bit execution is realized through vector execution units supporting multiple bit widths; the vector processing rate is improved by parallel processing in a plurality of vector modules; high-speed data storage is realized through parallel vector reading and writing over a plurality of ports; and through program control, scalar execution and vector execution are combined, realizing high-speed, high-performance parallel processing oriented to bit granularity. The device can perform bit-level operations such as source coding, channel coding and encryption coding in parallel and in vectorized form at high speed, allowing bit-granularity information processing to move from dedicated processing devices to a general-purpose parallel processor.
Compared with the prior art, the method has the following advantages:
a parallel computing structure, a parallel information processing method, parallel and vector processing instructions, and a device for bit-granularity information processing are realized, providing a parallel computing platform oriented to bit granularity and addressing the lack of a general-purpose parallel processor for bit-granularity information.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.
The above-mentioned apparatus may comprise a processor and a memory, and the above-mentioned units may be stored in the memory as program units, and the processor executes the above-mentioned program units stored in the memory to implement the corresponding functions.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The order of the embodiments of the present application described above does not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways.
The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program code.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (21)

1. A bit-granularity oriented information processing system, comprising:
an instruction memory IMEM for storing a plurality of instructions;
a control unit Control Unit for controlling parallel execution of the plurality of instructions, wherein the instruction execution modes comprise scalar execution and vector execution;
a scalar execution unit Scalar Unit for performing parallel information processing of a plurality of scalar instructions;
a vector execution unit Vector Unit for performing parallel information processing of a plurality of vector instructions;
a data read/write unit LS Unit for reading and writing the instruction memory IMEM;
a data memory DMEM for storing, in parallel, the data read from the data read/write unit LS Unit;
wherein the vector execution unit Vector Unit comprises a plurality of vector calculation units VCU, and each vector calculation unit VCU comprises a decoding module and an arithmetic module ALU group; the decoding module is used for outputting a signal for reading a vector register file VCC-RF and a control parameter, and for controlling the starting of the arithmetic module ALU group according to the control parameter; the arithmetic module ALU group is used for, when started, executing an operation according to the signal for reading the vector register file VCC-RF output by the decoding module, outputting the operation result together with a signal for writing the vector register file VCC-RF, and writing the operation result into the vector register file VCC-RF; during operation of the arithmetic module ALU group, the bit width and the vector length of input data and/or output data are determined and configured through a bit width instruction.
2. The system according to claim 1, characterized in that said instruction memory IMEM comprises a plurality of memory blocks storing in parallel a program of said plurality of instructions.
3. The system of claim 2, wherein the Control Unit comprises:
an instruction fetch unit Fetch Align for reading the plurality of instructions stored in the instruction memory IMEM; and
an instruction dispatch unit Dispatcher for dispatching, in parallel, the instructions stored in the plurality of memory blocks to the Scalar execution Unit Scalar Unit, the Vector execution Unit Vector Unit, and the data reading and writing Unit LS Unit.
4. The system according to claim 3, wherein the instruction fetch unit Fetch Align is configured to output an address to the instruction memory IMEM, read the instructions at the corresponding address in the plurality of memory blocks of the instruction memory IMEM to compose an instruction group, and output the instruction group to the instruction dispatch unit Dispatcher; wherein, when no external input address is present and enabled, the instruction fetch unit Fetch Align increments the address by an address self-adder and outputs the incremented address to the instruction memory IMEM, and when an external input address is present and enabled, the instruction fetch unit Fetch Align outputs the external input address to the instruction memory IMEM.
5. The system according to claim 4, wherein the instruction dispatch Unit Dispatcher is configured to dispatch multiple programs in the instruction group to at least one of the Scalar execution Unit Scalar Unit, the Vector execution Unit Vector Unit, and the data read/write Unit LS Unit in parallel, and the instruction dispatch Unit Dispatcher implements random dispatch of Scalar instructions and Vector instructions through a control program.
6. The system according to claim 1, wherein the system comprises a plurality of scalar execution units Scalar Unit, which execute in parallel and perform information processing at bit granularity by configuring different bit widths, wherein each scalar execution unit Scalar Unit comprises a scalar register file ACC-RF, register read-write logic ACC-RF port, a program control unit PCU, a plurality of computation units CU, and a plurality of scalar logic units BMU.
7. The system of claim 6, wherein the scalar register file ACC-RF comprises a plurality of registers, wherein the plurality of registers comprises a first plurality of registers and a second plurality of registers, wherein the scalar register file ACC-RF further comprises a plurality of write ports for writing a plurality of the first plurality of registers in parallel, a first plurality of read ports for reading a plurality of the first plurality of registers in parallel, and a second plurality of read ports for reading a plurality of the second plurality of registers in parallel.
8. The system of claim 6, wherein the ACC-RF port comprises a read multiplexing unit and a write multiplexing unit, wherein the read multiplexing unit is configured to perform multiplexing of read ports, and wherein the write multiplexing unit is configured to perform multiplexing of write ports.
9. The system of claim 6, wherein the program control unit PCU comprises a program control instruction decoding unit, a multi-channel loop address register, and a multi-channel interrupt address register, wherein the program control instruction decoding unit is used for: decoding an input instruction when the input instruction is a loop instruction or a function call instruction; generating a loop address or a return address according to the input instruction and the current address, and storing the loop address or the return address in a memory; comparing the address with the loop address or the return address in the memory during execution of the input instruction; and, if the loop tail is reached and the loop count is greater than 0, or a return instruction is received, outputting the loop head address or the return address as a jump address and setting the jump enable to be valid; wherein multi-level loops are implemented through a plurality of loop addresses, and multi-level function calls are implemented through a plurality of return addresses.
10. The system according to claim 6, wherein the plurality of computation units CU comprise a decoding module, an arithmetic module ALU, a multiplication module Mul, and an addition and subtraction module Add Sub; the decoding module is used for outputting a signal for reading the scalar register file ACC-RF and a control parameter, and for controlling one of the arithmetic module ALU and the multiplication module Mul to start according to the control parameter; the arithmetic module ALU is used for, when started, executing an operation according to the signal for reading the scalar register file ACC-RF output by the decoding module, outputting the operation result together with a signal for writing the scalar register file ACC-RF, and writing the operation result into the scalar register file ACC-RF; the multiplication module Mul is used for, when started, executing a multiplication operation according to the signal for reading the scalar register file ACC-RF output by the decoding module, and either outputting the multiplication result together with a signal for writing the scalar register file ACC-RF and writing the result into the scalar register file ACC-RF, or outputting the multiplication result to the addition and subtraction module Add Sub; the addition and subtraction module Add Sub performs addition and subtraction on the multiplication result, outputs the operation result together with a signal for writing the scalar register file ACC-RF, and writes the operation result into the scalar register file ACC-RF; during operation of the arithmetic module ALU, the multiplication module Mul, and the addition and subtraction module Add Sub, the bit width of the input data and/or the output data is determined and configured through the bit width instruction.
11. The system of claim 6, wherein each scalar logic unit BMU comprises a decoding module, a logic operation module Logic, and a packing module Pack Unpack; the decoding module is used for outputting a signal for reading the scalar register file ACC-RF and a control parameter, and for controlling one of the logic operation module Logic and the packing module Pack Unpack to start according to the control parameter; the logic operation module Logic is used for, when started, executing an operation according to the signal for reading the scalar register file ACC-RF output by the decoding module, outputting the operation result together with a signal for writing the scalar register file ACC-RF, and writing the operation result into the scalar register file ACC-RF; the packing module Pack Unpack is used for, when started, executing a packing operation and an unpacking operation according to the signal for reading the scalar register file ACC-RF output by the decoding module, outputting the operation result together with a signal for writing the scalar register file ACC-RF, and writing the operation result into the scalar register file ACC-RF; during operation of the logic operation module Logic and the packing module Pack Unpack, the bit width of input data and/or output data is determined and configured through a bit width instruction.
12. The system of claim 11, wherein the Logic operation module Logic is capable of performing operations of various bit widths, and the packing module Pack Unpack splits data when performing a packing operation and merges data when performing an unpacking operation, wherein the packing module Pack Unpack controls a bit width of data through the packing operation and the unpacking operation.
13. The system of claim 1, wherein the Vector execution Unit Vector Unit further comprises a Vector register file VCC-RF, a Vector register read-write logic VCC-RF port, and a plurality of Vector logic units VBMU; the plurality of vector logic units VBMUs are configured to execute in parallel, and each of the vector logic units VBMUs can process vector data with different bit widths.
14. The system of claim 13, wherein said vector register file VCC-RF comprises a plurality of registers, wherein said plurality of registers comprises a first plurality of registers and a second plurality of registers, wherein said vector register file VCC-RF further comprises a plurality of write ports for writing a plurality of said first plurality of registers in parallel, a first plurality of read ports for reading a plurality of said first plurality of registers in parallel, and a second plurality of read ports for reading a plurality of said second plurality of registers in parallel.
15. The system according to claim 13, wherein the vector register read-write logic VCC-RF port comprises a read multiplexing unit and a write multiplexing unit, wherein the read multiplexing unit is configured to perform multiplexing of read ports, and wherein the write multiplexing unit is configured to perform multiplexing of write ports.
16. The system according to claim 13, wherein each of the vector logic units VBMU comprises a decoding module and a logic operation packing module Logic Pack group; the decoding module is used for outputting a signal for reading the vector register file VCC-RF and a control parameter, and for controlling the logic operation packing module Logic Pack group to start according to the control parameter; the logic operation packing module Logic Pack is used for, when started, executing an operation according to the signal for reading the vector register file VCC-RF output by the decoding module, outputting the operation result together with a signal for writing the vector register file VCC-RF, and writing the operation result into the vector register file VCC-RF; during operation of the logic operation packing module Logic Pack, the bit width and the vector length of input data and/or output data are determined and configured through a bit width instruction.
17. The system of claim 16, wherein the Logic packing module Logic Pack is capable of performing operations of multiple bit widths, splitting data when performing packing operations, and combining data when performing unpacking operations, wherein the Logic packing module Logic Pack controls a bit width of data through the packing operations and the unpacking operations.
18. The system according to claim 1, wherein the data read/write unit LS Unit is configured to read and write the data memory DMEM, and comprises a plurality of data read/write modules LSU and a data address register PTR-RF; the data read/write modules LSU are used for executing scalar reads and writes and vector reads and writes in parallel; and the data read/write unit LS Unit performs reading and writing in parallel through a plurality of ports.
19. The system according to claim 18, wherein each of the data read/write modules LSU comprises a decoding module, an address generating module, a read storage module, and a write storage module; the decoding module starts the address generating module, the read storage module, or the write storage module according to an input instruction; the address generating module generates an address of data according to the address in the data address register PTR-RF and outputs the address to the data memory DMEM; the read storage module generates a read enable signal to the data memory DMEM, reads the data corresponding to the address of the data, and writes the data into a scalar register file ACC-RF or a vector register file VCC-RF, wherein the scalar register file ACC-RF is a register file in the scalar execution unit Scalar Unit, and the vector register file VCC-RF is a register file in the vector execution unit Vector Unit; the write storage module generates a write enable signal to the data memory DMEM, and writes data input from the scalar register file ACC-RF or the vector register file VCC-RF into the data memory DMEM; and during reading and writing by the data read/write module LSU, the bit width and the vector length of the input data and/or the output data are determined and configured through a bit width instruction.
20. The system as claimed in claim 18, wherein the data address register PTR-RF of the data read/write Unit LS Unit comprises a plurality of address registers for storing addresses of read data.
21. The system according to claim 18, characterized in that said data memory DMEM comprises a plurality of memory blocks for storing in parallel information processing data, wherein said information processing data are data read from said data read write module LSU.
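For illustration only: the bit-width-configurable packing behaviour recited in claims 11, 12 and 17 (a packing operation splits data, an unpacking operation merges data) can be sketched in software as follows. The function names `pack_split` and `unpack_merge`, the parameters `total_bits` and `field_bits`, and the least-significant-field-first ordering are assumptions made for this sketch; the claims do not specify them, and the claimed modules are hardware units configured through a bit width instruction rather than software routines.

```python
from typing import List


def pack_split(word: int, total_bits: int, field_bits: int) -> List[int]:
    """Split a total_bits-wide word into fields of field_bits each.

    Assumption for this sketch: the least-significant field comes first.
    """
    if field_bits <= 0 or total_bits % field_bits != 0:
        raise ValueError("total_bits must be a positive multiple of field_bits")
    if not 0 <= word < (1 << total_bits):
        raise ValueError("word does not fit in total_bits")
    mask = (1 << field_bits) - 1
    # Shift each field down to bit 0 and mask off the higher bits.
    return [(word >> shift) & mask for shift in range(0, total_bits, field_bits)]


def unpack_merge(fields: List[int], field_bits: int) -> int:
    """Merge fields of field_bits each back into one word (inverse of pack_split)."""
    if field_bits <= 0:
        raise ValueError("field_bits must be positive")
    mask = (1 << field_bits) - 1
    word = 0
    for index, field in enumerate(fields):
        if not 0 <= field <= mask:
            raise ValueError("field does not fit in field_bits")
        # Place each field at its own bit offset.
        word |= field << (index * field_bits)
    return word


if __name__ == "__main__":
    # The same 8-bit word viewed at two different configured bit widths.
    print(pack_split(0xB4, 8, 2))
    print(pack_split(0xB4, 8, 4))
    print(hex(unpack_merge(pack_split(0xB4, 8, 2), 2)))
```

Here the `field_bits` argument plays the role that the bit width instruction plays in the claims: the same stored word is interpreted at a different granularity without changing the underlying data.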
CN201710804779.6A 2017-09-07 2017-09-07 Information processing system oriented to bit granularity Active CN107748674B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710804779.6A CN107748674B (en) 2017-09-07 2017-09-07 Information processing system oriented to bit granularity
PCT/CN2017/102482 WO2019047281A1 (en) 2017-09-07 2017-09-20 Bit-oriented granularity information processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710804779.6A CN107748674B (en) 2017-09-07 2017-09-07 Information processing system oriented to bit granularity

Publications (2)

Publication Number Publication Date
CN107748674A CN107748674A (en) 2018-03-02
CN107748674B true CN107748674B (en) 2021-08-31

Family

ID=61255614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710804779.6A Active CN107748674B (en) 2017-09-07 2017-09-07 Information processing system oriented to bit granularity

Country Status (2)

Country Link
CN (1) CN107748674B (en)
WO (1) WO2019047281A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109672454A (en) * 2019-01-22 2019-04-23 上海无线通信研究中心 High speed viterbi coding method and its receiver under a kind of DSP little endian mode
CN110780921B (en) * 2019-08-30 2023-09-26 腾讯科技(深圳)有限公司 Data processing method and device, storage medium and electronic device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359310A (en) * 2007-07-31 2009-02-04 英特尔公司 Providing an inclusive shared cache among multiple core-cache clusters
CN101436122A (en) * 2008-11-25 2009-05-20 中国科学院微电子研究所 Optimizing method and apparatus for implementing instruction parallel execution
CN101833441A (en) * 2010-04-28 2010-09-15 中国科学院自动化研究所 Parallel vector processing engine structure
CN102541774A (en) * 2011-12-31 2012-07-04 中国科学院自动化研究所 Multi-grain parallel storage system and storage
CN103793201A (en) * 2012-10-30 2014-05-14 英特尔公司 Instruction and logic to provide vector compress and rotate functionality
CN105335127A (en) * 2015-10-29 2016-02-17 中国人民解放军国防科学技术大学 Scalar operation unit structure supporting floating-point division method in GPDSP
CN106209121A (en) * 2016-07-15 2016-12-07 中国科学院微电子研究所 A kind of communications baseband SoC chip of multimode multinuclear
CN107094369A (en) * 2014-09-26 2017-08-25 英特尔公司 Instruction and logic for providing SIMD SM3 Cryptographic Hash Functions

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7506135B1 (en) * 2002-06-03 2009-03-17 Mimar Tibet Histogram generation with vector operations in SIMD and VLIW processor by consolidating LUTs storing parallel update incremented count values for vector data elements
US7451293B2 (en) * 2005-10-21 2008-11-11 Brightscale Inc. Array of Boolean logic controlled processing elements with concurrent I/O processing and instruction sequencing
CN102200964B (en) * 2011-06-17 2013-05-15 孙瑞琛 Parallel-processing-based fast Fourier transform (FFT) device and method thereof
CN102629238B (en) * 2012-03-01 2014-10-29 中国人民解放军国防科学技术大学 Method and device for supporting vector condition memory access
CN102750133B (en) * 2012-06-20 2014-07-30 中国电子科技集团公司第五十八研究所 32-Bit triple-emission digital signal processor supporting SIMD
US9658851B2 (en) * 2013-08-30 2017-05-23 Think Silicon Sa Device and method for approximate memoization
US11275590B2 (en) * 2015-08-26 2022-03-15 Huawei Technologies Co., Ltd. Device and processing architecture for resolving execution pipeline dependencies without requiring no operation instructions in the instruction memory

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359310A (en) * 2007-07-31 2009-02-04 英特尔公司 Providing an inclusive shared cache among multiple core-cache clusters
CN101436122A (en) * 2008-11-25 2009-05-20 中国科学院微电子研究所 Optimizing method and apparatus for implementing instruction parallel execution
CN101833441A (en) * 2010-04-28 2010-09-15 中国科学院自动化研究所 Parallel vector processing engine structure
CN102541774A (en) * 2011-12-31 2012-07-04 中国科学院自动化研究所 Multi-grain parallel storage system and storage
CN103793201A (en) * 2012-10-30 2014-05-14 英特尔公司 Instruction and logic to provide vector compress and rotate functionality
CN107094369A (en) * 2014-09-26 2017-08-25 英特尔公司 Instruction and logic for providing SIMD SM3 Cryptographic Hash Functions
CN105335127A (en) * 2015-10-29 2016-02-17 中国人民解放军国防科学技术大学 Scalar operation unit structure supporting floating-point division method in GPDSP
CN106209121A (en) * 2016-07-15 2016-12-07 中国科学院微电子研究所 A kind of communications baseband SoC chip of multimode multinuclear

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
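LLVM-based instruction parallel scheduling and implementation; Qu Qiuwen (屈秋雯); Microelectronics & Computer (《微电子学与计算机》); 20131105; full text *
Parallel bit coprocessor for LTE-A broadband communication; Guan Wu (管武) et al.; Application of Electronic Technique (《电子技术应用》); 20150106; full text *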

Also Published As

Publication number Publication date
CN107748674A (en) 2018-03-02
WO2019047281A1 (en) 2019-03-14

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant