WO2022165718A1 - Interface controller, data transmission method, and system-on-chip - Google Patents
Interface controller, data transmission method, and system-on-chip
- Publication number: WO2022165718A1 (application PCT/CN2021/075307)
- Authority: WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
Definitions
- the present application relates to the field of interface technologies, and in particular, to an interface controller, a data transmission method, and a system-on-chip.
- in a system-on-chip (system on chip, SoC), the general process of applying a certain algorithm to a data frame is to pass the data frame through different processors (usually, a processor may contain one or more intellectual property cores (IP cores), where one IP core is one logic unit) whose joint processing forms an algorithm link pipeline.
- the processors used in the algorithm link may be pure hardware processors or software-programmable processors, such as a central processing unit (CPU), a digital signal processor (DSP), an image signal processor (ISP), a video encoder (video codec), a neural network processor (neural processing unit, NPU), a graphics processing unit (GPU), or a display subsystem (display sub system, DSS).
- data interaction between processors is usually achieved through memory sharing.
- for example, as shown in FIG. 1, after the previous-stage processor finishes processing the current data frame, it puts the frame into shared memory (at least one of a cache or DDR) and notifies the next-stage processor to continue processing; the next-stage processor then reads the data frame from the shared memory and performs subsequent processing.
- in this process, the algorithm link is executed serially between the previous-stage processor and the next-stage processor. Therefore, for periodic, real-time, delay-sensitive algorithm links (for example, game super-resolution at 30 FPS (30 frames per second)), the processing of one frame of data must be completed within a unit cycle, as shown in FIG. 2.
- the previous-stage processor occupies the processing time T1 of one frame in the unit cycle, and the next-stage processor is idle during this time.
- only the remaining time period T2 of the unit cycle is left for the next-stage processor. Since T2 may be very short, this method places high demands on the performance of the next-stage processor. Moreover, when the data frame is too large to be placed in the on-chip cache (SoC cache) when shared memory is used, it needs to be placed in external memory (for example, double data rate dynamic random access memory (DDR)), so that the overall performance is limited by the bandwidth and power-consumption specifications of the memory. In short, under the current SoC architecture, when processors cooperate to execute algorithms, the manner of data transmission affects the improvement of overall performance.
- the present application provides an interface controller, a data transmission method and a system-on-chip, which can improve the overall performance of the system.
- a data transmission method is provided.
- the data transmission method is applied to an interface controller, and the interface controller is connected to the first processor and the second processor.
- the data transmission method includes: acquiring a plurality of data slices in a data frame, where the data frame is generated by the first processor; writing the plurality of data slices into a buffer; acquiring a first data slice according to a target data slice among the plurality of data slices, where the first data slice includes at least the target data slice; and transmitting the first data slice to a second processor, where the second processor is used for processing the first data slice.
- in this solution, the interface controller can store, in the buffer, the multiple data slices of the data frame processed by the first processor, and can acquire the first data slice according to the target data slice among the multiple data slices, where the first data slice includes at least the target data slice; the first data slice is then sent to the second processor, and the second processor processes the corresponding first data slice.
- in the above process, because the data frame is divided into multiple data slices, the size of each transmission is reduced relative to the data frame, and the data is transmitted to the second processor through the buffer of the interface controller; this avoids using shared memory to transmit data between the first processor and the second processor.
- moreover, because the interface controller can transmit each first data slice to the second processor as soon as it is acquired and notify the second processor to process it, the interface controller can transmit other data slices to the second processor while the second processor processes the current first data slice, which effectively forms a computing-link pipeline. Since the data frame is divided into data slices, the data of each data slice can be transmitted to the second processor through the interface controller as soon as the first processor finishes processing it, and the second processor starts processing it, thereby winning more processing time for the second processor. In this solution, as long as the first processor has processed the data of one data slice, the slice can be transmitted to the second processor for processing through the interface controller, so the performance requirements on the second processor are relatively low, and the delay of the entire computing link can be reduced, thereby improving the overall performance of the system.
- dividing the data frame into a plurality of data slices may be performed by the first processor or the interface controller.
- for example, when the first processor is a GPU-like processor, since a GPU usually renders an image slice by slice, the first processor divides the data frame into multiple data slices; in this case, acquiring the multiple data slices in the data frame specifically includes: receiving the multiple data slices transmitted by the first processor.
- for a first processor that does not divide the data frame into data slices, the division can be performed by the interface controller. For example, before acquiring the multiple data slices in the data frame, the method further includes: receiving the data frame transmitted by the first processor; and acquiring the multiple data slices in the data frame includes: dividing the data frame into multiple data slices.
- acquiring the first data slice according to the target data slice among the multiple data slices includes: reading, in the buffer, the target data slice and the data slices adjacent to the edges of the target data slice; and generating the first data slice according to the target data slice and its edge-adjacent data slices, where the first data slice covers the target data slice and has overlapping areas with the data slices adjacent to the edges of the target data slice.
- the second processor may be a neural network processor NPU; the data transmitted between the GPU and the NPU is an image.
- in addition, when the neural network in the NPU performs calculation on a data slice, the width and height of the data slice on the NPU input side are shrunk according to the number of layers and the stride of the neural network before being output; therefore, the width (height) of a data slice on the NPU output side is usually smaller than the width (height) of the data slice on the input side. To ensure that the data slices on the NPU output side can completely cover the entire data frame, the data slices input to the NPU need a larger width and height.
- hence, to ensure that the data slice obtained after the second processor processes the first data slice has the same size as the target data slice, the first data slice needs to be larger than the target data slice; that is, besides covering the target data slice, the first data slice also needs overlapping areas with the data slices adjacent to the edges of the target data slice, where the width of the overlapping area is determined by the number of layers and the stride of the convolutional network of the NPU, as illustrated below.
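- as a worked illustration (assuming unpadded, “valid” convolutions, which the present application does not specify), for N stacked layers of k×k convolution with stride 1 the output size shrinks as:

```latex
% Illustrative shrinkage under N stacked k x k, stride-1, unpadded convolutions
W_{\mathrm{out}} = W_{\mathrm{in}} - N\,(k-1), \qquad
H_{\mathrm{out}} = H_{\mathrm{in}} - N\,(k-1)
```

- for k = 3 this gives a shrinkage of 2N pixels per dimension, i.e., an overlap (halo) of N pixels on each side of the target data slice.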
- in a possible implementation, before acquiring the first data slice according to the target data slice among the multiple data slices, the method further includes: determining that the data slices adjacent to the edges of the target data slice are stored in the buffer.
- since the first data slice needs to be calculated according to the target data slice and the data slices adjacent to its edges, it is necessary to ensure that the edge-adjacent data slices of the target data slice are stored in the buffer.
- in a possible implementation, the method further includes: generating a bitmap table, and after determining that a data slice has been acquired, setting the bit at the position corresponding to that data slice in the bitmap table to 1.
- determining that the edge-adjacent data slices of the target data slice are stored in the buffer includes: determining that the bits around the position corresponding to the target data slice in the bitmap table are all 1. Through the bitmap table, the fact that the data slice corresponding to each bit has been stored in the buffer can be recorded.
- in a possible implementation, the target data slice includes 2n data slices, where n is a positive integer. If the granularity of the target data slice is 1x1, that is, it contains only one data slice, then each calculation for a target data slice requires acquiring at least the eight other data slices adjacent to its edges (except for target data slices located at the edge of the data frame), and the data in the overlapping areas between the first data slice and these other data slices is repeatedly transmitted many times. To reduce repeated transmission of data in the overlapping areas, the target data slice provided by the embodiments of the present application may include 2n data slices.
- for example, one calculation for a target data slice of 2x2 granularity requires at most 12 data slices adjacent to the edges of the target data slice, so that for each data slice within the 2x2-granularity target data slice it is on average sufficient to provide three edge-adjacent data slices to perform the calculation of the first data slice, which reduces repeated transmission of data in the overlapping areas.
- the interface controller needs to acquire the information of the data frame in order to divide the data frame into multiple data slices, or in order to determine the position of a data slice in the data frame. Therefore, the method further includes: acquiring the information of the data frame, where the information of the data frame includes the base address, width, height, and number of channels; and dividing the data frame into multiple data slices according to the information of the data frame.
- in a possible implementation, the interface controller is further connected to the cache of the system-on-chip; when it is determined that the buffer is full, the data slices in the buffer are stored into the cache.
- in this way, when the buffer is full, the interface controller can store data slices from the buffer into the system cache; in addition, the interface controller can also store the data slices into the DDR. When the buffer has remaining space, the slices are read back from the system-on-chip cache or the DDR. Thus, if the buffer has enough space, the data transmission takes place only in the buffer; if the buffer space is insufficient, the data slices can be stored into the system-on-chip cache or the DDR to avoid interrupting the data transmission.
- an interface controller is provided, where the interface controller is connected to a first processor and a second processor, and the interface controller includes: an obtaining unit configured to acquire a plurality of data slices in a data frame, where the data frame is generated by the first processor; a processing unit configured to write the plurality of data slices acquired by the obtaining unit into a buffer, and to acquire a first data slice according to a target data slice among the plurality of data slices in the buffer, where the first data slice includes at least the target data slice; and a sending unit configured to transmit the first data slice obtained by the processing unit to the second processor, where the second processor is used to process the first data slice.
- the obtaining unit is specifically configured to receive the multiple data slices transmitted by the first processor, where the first processor is configured to divide the data frame into the multiple data slices.
- the obtaining unit is further configured to receive the data frame transmitted by the first processor; the obtaining unit is specifically configured to divide the data frame into the multiple data slices .
- the processing unit is specifically configured to read, in the buffer, the target data slice and the data slices adjacent to the edges of the target data slice, and to generate the first data slice according to the target data slice and its edge-adjacent data slices, where the first data slice covers the target data slice and has overlapping areas with the data slices adjacent to the edges of the target data slice.
- the second processor is a neural network processor NPU.
- the processing unit is further configured to determine that the data slices adjacent to the edge of the target data slice are stored in the buffer.
- the processing unit is further configured to generate a bitmap table, and after it is determined to acquire the data slice, set a bit of the corresponding position of the data slice in the bitmap table to 1;
- the processor is specifically configured to determine that the bits around the corresponding position of the target data slice in the bitmap table are all 1.
- the target data slice includes 2n data slices, where n is a positive integer.
- the obtaining unit is further configured to acquire information of the data frame, where the information of the data frame includes the base address, width, height, and number of channels; the processing unit is specifically configured to divide the data frame into multiple data slices according to the information of the data frame.
- the interface controller is further connected to the cache of the system-on-chip; the processing unit is specifically configured to store the data slices in the buffer into the cache through the sending unit when it is determined that the buffer is full.
- an interface controller comprising a buffer and one or more processors for invoking computer instructions to perform the method as described above.
- a computer-readable storage medium is provided, a computer program is stored on the computer-readable storage medium, and when the computer program runs on a computer, the computer is made to execute the above-mentioned method.
- a system on a chip is provided, comprising a first processor, a second processor, and the above interface controller, where the interface controller is connected to the first processor and the second processor, and the interface controller is used to execute the above-mentioned data transmission method.
- the interface controller when the second processor processes the first data slice, is further configured to transmit other data slices to the second processor.
- the first processor is further configured to process other data when the second processor processes the first data slice to form a pipeline.
- FIG. 1 is a schematic structural diagram of a system-on-chip provided by an embodiment of the present application
- FIG. 2 is a schematic diagram of an algorithm link pipeline between processors provided by an embodiment of the present application.
- FIG. 3 is a schematic structural diagram of a system-on-chip provided by another embodiment of the present application.
- FIG. 4a is a schematic structural diagram of a system-on-chip provided by another embodiment of the present application.
- FIG. 4b is a schematic flowchart of a data transmission method provided by an embodiment of the present application.
- FIG. 5 is a schematic structural diagram of a data slice array of a data frame according to an embodiment of the present application.
- FIG. 6 is a schematic structural diagram of a data slice provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of a processing sequence of data slices of a data frame according to an embodiment of the present application.
- FIG. 8 is a schematic diagram of a division sequence of data slices of a data frame according to an embodiment of the present application.
- FIG. 9 is a schematic structural diagram of a system-on-chip provided by another embodiment of the present application.
- FIG. 10 is a schematic diagram of the size of a data slice on the NPU output side and a data slice on the NPU input side provided by an embodiment of the present application;
- FIG. 11 is a schematic structural diagram of a data slice array of a data frame according to another embodiment of the present application.
- FIG. 12 is a schematic structural diagram of a bitmap provided by an embodiment of the present application.
- FIG. 13 is a schematic structural diagram of a bitmap provided by another embodiment of the application.
- FIG. 14 is a schematic structural diagram of a bitmap provided by another embodiment of the present application.
- FIG. 15 is a schematic structural diagram of a bitmap provided by yet another embodiment of the present application.
- FIG. 16 is a schematic structural diagram of a bitmap provided by another embodiment of the present application.
- FIG. 17 is a schematic structural diagram of a bitmap provided by another embodiment of the present application.
- FIG. 18a is a schematic structural diagram of a bitmap provided by yet another embodiment of the present application.
- FIG. 18b is a schematic structural diagram of a bitmap provided by another embodiment of the present application.
- FIG. 19 is a schematic diagram of an algorithm link pipeline between processors provided by another embodiment of the present application.
- FIG. 20 is a schematic flowchart of a data transmission method provided by another embodiment of the present application.
- FIG. 21 is a schematic structural diagram of an interface controller according to an embodiment of the present application.
- the terms “first”, “second”, etc. are only used for convenience of description, and should not be understood as indicating or implying relative importance or implying the number of indicated technical features.
- a feature defined as “first”, “second”, etc. may expressly or implicitly include one or more of that feature.
- plural means two or more.
- a plurality of processing units refers to two or more processing units.
- the term “connection” should be understood in a broad sense.
- for example, a connection may be a fixed connection, a detachable connection, or an integral connection; it may be a direct connection or an indirect connection through an intermediary.
- words such as “exemplary” or “for example” are used to represent examples, illustrations or illustrations. Any embodiments or designs described in the embodiments of the present application as “exemplary” or “such as” should not be construed as preferred or advantageous over other embodiments or designs. Rather, use of words such as “exemplary” or “such as” is intended to present the related concepts in a specific manner.
- the artificial neural network (ANN) mentioned in the embodiments of this application is referred to as a neural network (NN) or a neural-like network for short.
- the artificial neural network may include neural networks such as a convolutional neural network (CNN), a deep neural network (DNN), a time delay neural network (TDNN), and a multilayer perceptron (MLP).
- FIG. 3 is a schematic structural diagram of a system-on-chip provided by an embodiment of the present application. It should be understood that the following description will only take the system-on-chip as an example, and the actual solution can also be applied to other types of chips or devices.
- the system-on-chip 100 may include a processor 110, a processor 111, an interface controller 112, a memory 120, a communication line 130, a cache 140, and peripheral circuits (not shown in the figure). It should be noted that, in addition to the devices shown in FIG. 3, the system-on-chip 100 may further include at least one communication interface 150.
- the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the system-on-chip 100 .
- the system-on-chip 100 may include more or fewer components than shown, or some components may be combined, or some components may be split, or the components may be arranged differently.
- the illustrated components may be implemented in hardware or a combination of software and hardware.
- the processor 110 is the computing core and the control unit of the system-on-chip 100 .
- the processor 110 may include a central processing unit (CPU), a digital signal processor (DSP), an image signal processor (ISP), a video encoder (video codec), a neural network processor (neural processing unit, NPU), a graphics processing unit (GPU), a DSS, or an application-specific integrated circuit (ASIC).
- the processor may be in the form of pure hardware or hardware capable of loading software programs.
- the processor 110 may include one or more processor cores (cores), such as core 0 and core 1 in FIG. 3 .
- the system-on-a-chip 100 may include multiple processors, such as the processor 110, the processor 111, and the interface controller 112 in FIG. 3.
- processors may be a single-core processor (ie, the processor includes one core) or a multi-core processor (ie, the processor includes multiple cores).
- the memory 120, for example a static random access memory (SRAM), may exist independently of the processor 110 and be connected to the processor 110 through the communication line 130.
- the instructions stored in the memory 120 are executed under the control of the processor 110.
- the interface controller 112 may execute the instructions stored in the memory 120, thereby implementing the data transmission methods provided in the following embodiments of the present application.
- the instructions in the embodiments of the present application may also be referred to as application code, which is not specifically limited in the embodiments of the present application.
- Communication line 130 may include a path to communicate information between the components described above.
- the communication interface 150 uses any transceiver-like device for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
- the cache 140 is a high-speed memory located between the processor 110 and the memory 120.
- the cache 140 may be used to store instructions or data that the processor 110 has just used or uses cyclically; if the processor 110 needs to use the instructions or data again, they can be called directly from the cache 140.
- the instructions or data stored in the cache 140 are a small part of those in the memory 120, but this small part is about to be accessed by the processor 110 in a short time; when the processor 110 calls the instructions or data, accessing the memory 120 can be avoided and the instructions or data can be called directly from the cache 140, thereby speeding up reading.
- the interface controller 112 may be a single-core or multi-core processor.
- the interface controller 112 is further provided with a buffer.
- the buffer is also called a buffer register.
- the buffer is mainly used to compensate for the data accumulation caused by the difference in data processing speed between the devices at both ends of the buffer.
- the embodiments of the present application provide a data transmission method, which is applied to the interface controller 112 .
- the interface controller 112 is connected to the first processor 110 and the second processor 111; the connections may be implemented through a communication line.
- the data transmission method includes the following steps:
- the interface controller 112 acquires a plurality of data slices in the data frame, where the data frame is generated by the first processor.
- the data frame may be in the form of image, text, audio or video, and the like.
- the plurality of data slices (slices or tiles) in the data frame may be in the form of a data slice array.
- the data frame may be a frame of image, and the data frame of the image may be a three-dimensional data array (3D data cube) including pixel data on each channel.
- the x direction in the data frame is the width W of the image frame (width, which can be the number of pixels in a row in the x direction), and the y direction in the data frame is the height H of the image frame (height, which can be the number of pixels in a column in the y direction); taking a 4K image as an example, its resolution is 4096×2160, so the width W of the data frame of the image is 4096 and the height H is 2160.
- the z direction in the data frame is the channel C of the image frame (channel; for example, the data frame of an image with RGB (red, green, blue) as the three primary colors contains three channels: an R channel, a G channel, and a B channel).
- the data frame of the image can be divided into multiple data slices on the x-y plane. An embodiment of the present application also provides a structure of a data slice, as shown in FIG. 6: a data slice also contains data of three channels; the difference from the whole-frame data is that the height and width of a data slice are only a fraction of those of the whole frame. It can be understood that one channel of each data slice contains pixel data distributed in several rows and several columns of an array. In addition, two-dimensional data arrays also exist for other data.
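- as a minimal sketch (the type and field names below are illustrative and not defined by the present application; a planar, row-major layout with one byte per sample is assumed), a data frame and a data slice within it could be described as follows:

```c
#include <stdint.h>

/* Hypothetical descriptor of a data frame (3D data cube): W x H pixels, C channels. */
typedef struct {
    uint64_t base;      /* base address of the frame in memory          */
    uint32_t width;     /* W: pixels per row, e.g. 4096 for a 4K frame  */
    uint32_t height;    /* H: rows per channel, e.g. 2160               */
    uint32_t channels;  /* C: e.g. 3 for an R/G/B image                 */
} frame_info;

/* A data slice covers slice_w x slice_h pixels of every channel. */
typedef struct {
    uint32_t x0, y0;           /* top-left pixel of the slice in the frame */
    uint32_t slice_w, slice_h; /* a fraction of the frame width/height     */
} slice_info;

/* Offset (in samples) of pixel (x, y) of channel c under the assumed layout. */
static uint64_t pixel_offset(const frame_info *f, uint32_t c, uint32_t x, uint32_t y) {
    return (uint64_t)c * f->width * f->height + (uint64_t)y * f->width + x;
}
```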
- the color image can also be converted into a single-channel grayscale image after being grayed (or binarized), and the data frame of such a grayscale image can also be in the form of a two-dimensional data array.
- the first processor 110 may divide the data frame into multiple data slices, or the interface controller 112 provided in the embodiment of the present application may divide the data frame into multiple data slices.
- the interface controller 112 may directly receive the multiple data slices transmitted by the first processor 110 .
- the first processor 110 may divide the data frame into multiple data slices, process each data slice, and then send the data slices to the interface controller 112 in a certain order. For example, refer to FIG. 7 .
- the data slices in the data slice array can be processed in a zigzag scanning manner and then sent to the interface controller 112 in sequence; or the first processor 110 can also send the processed data slices to the interface controller 112 out of sequence.
- the interface controller 112 may directly receive the data frame transmitted by the first processor 110; and then divide the data frame into a plurality of data slices.
- for example, the CPU may configure registers for the first processor 110 and the interface controller 112 to specify whether the first processor 110 or the interface controller 112 divides the data frame into multiple data slices.
- the interface controller 112 needs to obtain the information of the data frame in order to divide the data frame into a plurality of data slices, or to determine the position of the data slice in the data frame.
- the information of the data frame includes the base address, width, height and number of channels of the data frame.
- the information of the data frame may be notified to the interface controller 112 by the CPU in the manner of configuring registers for the interface controller 112.
- the first processor 110 may send the data of a data frame to the interface controller 112 through burst events, where one burst event may send a data frame, a data slice, or data of any other length.
- if the interface controller 112 divides the data frame into multiple data slices, then when a burst event sends data of any length (for example, a data frame or other data of any length), the burst event can simultaneously carry the write address and the data length of the data; as long as it is determined, according to the write address and the data length, that the length of the transmitted data exceeds one data slice, the interface controller 112 divides the received data into multiple data slices according to the information of the data frame. For example, as shown in FIG. 8, the interface controller 112 can sequentially divide several consecutive rows of data into one data slice according to the base address, width, height, and number of channels of the data frame.
- if the first processor 110 divides the data frame into multiple data slices, then when a burst event sends one data slice, the burst event can simultaneously carry the write address and the data length of that data slice; the interface controller 112 can then determine the position of the data slice in the data frame according to the write address of the data slice, the data length, and the above-mentioned information of the data frame, as sketched below.
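- a minimal sketch of that position calculation (assuming the planar, row-major, one-byte-per-sample layout introduced above, and slices of slice_w x slice_h pixels; all names are illustrative):

```c
#include <stdint.h>

typedef struct { uint32_t i, j; } slice_coord;

/* Map a burst's write address to slice coordinates (i, j) in the slice array.
 * Assumes a planar, row-major frame layout with 1 byte per sample -- details
 * the present application leaves open. */
static slice_coord slice_of_address(uint64_t base, uint32_t width, uint32_t height,
                                    uint64_t write_addr,
                                    uint32_t slice_w, uint32_t slice_h) {
    uint64_t off   = write_addr - base;
    uint64_t plane = (uint64_t)width * height;
    uint64_t in_ch = off % plane;                  /* offset within one channel   */
    uint32_t y     = (uint32_t)(in_ch / width);    /* pixel row within the frame  */
    uint32_t x     = (uint32_t)(in_ch % width);    /* pixel column                */
    slice_coord sc = { y / slice_h, x / slice_w }; /* slice row i, slice column j */
    return sc;
}
```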
- the interface controller 112 writes the multiple data slices into the buffer. When it is determined that the buffer is full, the interface controller 112 can store the data slices in the buffer into the cache of the system-on-chip; in addition, the interface controller 112 can also store the data slices into the DDR. When the buffer has remaining space, the slices can be read back from the cache of the system-on-chip or the DDR. In this way, if the buffer has enough space, the data transmission is carried out only in the buffer; if the buffer space is insufficient, the data slices can be stored into the cache of the system-on-chip or the DDR to avoid interrupting the data transmission, as sketched below.
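- a sketch of such a spill policy (every function named here is an illustrative stand-in; the present application does not define this interface):

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { LOC_BUFFER, LOC_SOC_CACHE, LOC_DDR } slice_location;

/* Keep a slice in the controller's buffer when possible; overflow first to
 * the SoC cache and then to DDR, so transmission is never interrupted. */
slice_location store_slice(const void *slice, uint32_t len) {
    extern bool buffer_has_room(uint32_t len);
    extern bool soc_cache_has_room(uint32_t len);
    extern void push_to_buffer(const void *s, uint32_t len);
    extern void push_to_soc_cache(const void *s, uint32_t len);
    extern void push_to_ddr(const void *s, uint32_t len);

    if (buffer_has_room(len))    { push_to_buffer(slice, len);    return LOC_BUFFER; }
    if (soc_cache_has_room(len)) { push_to_soc_cache(slice, len); return LOC_SOC_CACHE; }
    push_to_ddr(slice, len);
    return LOC_DDR;
}
```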
- the interface controller 112 acquires a first data slice according to a target data slice among the multiple data slices, where the first data slice includes at least the target data slice.
- the first data slice includes a target data slice, and may optionally further include at least a portion of a data slice adjacent to an edge of the target data slice.
- the interface controller 112 may directly read any target data slice in the buffer and send it to the second processor 111 in the subsequent process.
- alternatively, the target data slices may be read in a certain order (e.g., zigzag) and sent to the second processor 111 in the subsequent process.
- the premise is that the target data slice has been written in the buffer through the processing of steps 101 and 102 .
- the interface controller 112 may generate a bitmap in advance before receiving a data slice or data frame for the first time; the bitmap may be preconfigured by default, and in the initial state (for example, before the interface controller 112 begins to receive data frames or data slices), all bits of the bitmap are set to 0.
- after a data slice is completely acquired, the bit in the bitmap corresponding to that data slice is set to 1.
- the interface controller divides the data frame into data slices, it can directly map each data slice to a bit of the bitmap according to the position of the data slice in the data frame.
- in addition, the interface controller 112 can also receive, while acquiring a data slice, the position information of the data slice in the data frame sent by the first processor 110 (for example, the coordinates of the data slice, or the write address and data length of the data slice), so that the data slice can be put in one-to-one correspondence with the bits in the bitmap according to its position information.
- the data slice corresponding to the bit set to 1 in the bitmap is the data slice completely transmitted to the interface controller 112 .
- a “1” in the bitmap table indicates that the corresponding data slice is ready, and a “0” indicates that the data slice has not been completely output from the first processor 110 to the interface controller 112, as sketched below.
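- a minimal sketch of such a readiness bitmap (the names and the word-per-row layout are illustrative choices, not defined by the present application):

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_ROWS 16        /* rows of the slice array; one 32-bit word per row */

static uint32_t ready[MAX_ROWS];   /* bit j of ready[i] maps to slice (i, j) */

static void mark_ready(int i, int j) { ready[i] |= (1u << j); }       /* slice fully buffered */
static bool is_ready(int i, int j)   { return (ready[i] >> j) & 1u; } /* 1 = ready, 0 = not   */
static void reset_all(int rows)      { for (int i = 0; i < rows; i++) ready[i] = 0; }
```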
- the data transmitted between the GPU and the NPU is an image.
- when the neural network in the NPU performs calculation on a data slice, the width and height of the data slice on the NPU input side are shrunk according to the number of layers and the stride of the neural network before being output; therefore, the width (height) of a data slice on the NPU output side is usually smaller than the width (height) of the data slice on the input side.
- as shown in FIG. 10, the width W1 of the data slice on the NPU input side > the width W2 of the data slice on the output side, and the height H1 of the data slice on the input side > the height H2 of the data slice on the output side.
- for example, suppose the stride in both the W and H directions is 1, the neural network has N layers in total, and each layer is a 3x3 convolution.
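- under those example parameters, and additionally assuming unpadded (“valid”) convolutions, the shrinkage can be computed as follows (a sketch; the present application does not fix the padding mode):

```c
/* Output side length after N stacked 3x3, stride-1, unpadded convolutions:
 * each layer trims 1 pixel from every border, so W2 = W1 - 2N and
 * H2 = H1 - 2N; equivalently, the required overlap is N pixels per side. */
static int output_side(int input_side, int n_layers) {
    return input_side - 2 * n_layers;
}
```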
- if the data slices shown in FIG. 5 are used directly as the input of the neural network, there will be gaps between the data slices output after processing by the NPU, which affects the quality of the output image.
- step 103 may specifically be: after determining that the data slices adjacent to the edges of the target data slice are stored in the buffer (combined with the above bitmap, that is, after determining that the bits around the position corresponding to the target data slice in the bitmap table are all 1), reading the target data slice and its edge-adjacent data slices in the buffer; and generating the first data slice according to the target data slice and the edge-adjacent data slices, where the first data slice covers the target data slice and has overlapping areas with the data slices adjacent to the edges of the target data slice.
- the first data slice includes a target data slice and a part of a plurality of data slices adjacent to the edge of the target data slice, and this part is the overlapping area.
- the width of the overlapping area is determined by the number of layers and stride of the NPU's convolutional network.
- the first data slice corresponding to the target data slice (i, j) covers the target data slice (i, j), and has overlapping areas with the data slices (i-1, j-1), (i-1, j), (i-1, j+1), (i, j-1), (i, j+1), (i+1, j-1), (i+1, j), and (i+1, j+1); that is, the first data slice includes a part of each of the data slices (i-1, j-1), (i-1, j), (i-1, j+1), (i, j-1), (i, j+1), (i+1, j-1), (i+1, j), and (i+1, j+1), as shown in FIG. 11.
- for a target data slice located at the edge of the data frame, a first data slice that satisfies the size of the data slice on the NPU input side can be generated by padding. For example, for the data slice in the upper-left corner of the data frame, the left and upper parts of its corresponding first data slice can be filled with 0, or filled directly with copies of the pixel data in that data slice, as sketched below.
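- a single-channel sketch of assembling a first data slice with its halo and zero padding (it treats the buffered slices as one contiguous frame for clarity; names and layout are illustrative):

```c
#include <stdint.h>

/* Copy the S x S target slice (i, j) plus a halo of `ovl` pixels from its
 * edge-adjacent slices into dst; positions outside the frame are zero-padded.
 * dst must hold (S + 2*ovl) * (S + 2*ovl) samples. */
void build_first_slice(uint8_t *dst, const uint8_t *frame,
                       int frame_w, int frame_h,
                       int i, int j, int S, int ovl) {
    int side = S + 2 * ovl;
    int y0 = i * S - ovl;                /* top-left of the expanded region */
    int x0 = j * S - ovl;
    for (int y = 0; y < side; y++) {
        for (int x = 0; x < side; x++) {
            int fy = y0 + y, fx = x0 + x;
            dst[y * side + x] =
                (fy < 0 || fy >= frame_h || fx < 0 || fx >= frame_w)
                    ? 0                  /* zero padding at the frame border */
                    : frame[(uint64_t)fy * frame_w + fx];
        }
    }
}
```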
- optionally, the first data slice in FIG. 11 may also include all of the multiple adjacent data slices, or even more.
- in the above description, the data slices adjacent to the edges of the target data slice are one data slice adjacent to the target data slice in each direction; in fact, the number of adjacent data slices in each direction may alternatively be multiple, which depends on the ratio between the heights H1 and H2 and the ratio between the widths W1 and W2.
- that is, the first data slice may also include multiple data slices adjacent to the edges of the target data slice in each direction, expanding outward from the target data slice to cover a larger area; this embodiment does not limit the extent to which the first data slice is extended on the basis of the target data slice.
- in addition, the data slices output by the first processor 110 may be out of order, instead of being output in the general order of zigzag scanning.
- the following description is given in the case of out-of-order (it can be understood that sequential output can be classified as out-of-order output).
- for data slices output out of order, the embodiments of the present application may use a greedy algorithm to process, in a given order (e.g., zigzag scan order), the data slices that can already be processed. For example, if only one target data slice can be processed, that is, the target data slice and the data slices adjacent to its edges have already been stored in the buffer of the interface controller 112, then the calculation of the first data slice corresponding to that target data slice can be started. If there are multiple target data slices whose corresponding first data slices can be calculated at the same time, the first data slices corresponding to these target data slices are calculated sequentially in zigzag scanning order.
- then step 104 can be performed to transmit the first data slice to the NPU, and the NPU processes the first data slice.
- the greedy algorithm can be used to calculate in order (for example, zigzag scan order).
- the size of the data slice is 256 ⁇ 256, and the size of one bitmap is 16 ⁇ 8 bits.
- the target data slice (i, j) and its adjacent data slices are all stored to the interface controller buffer.
- Cur_stat[i, j] = flag[i-1, j-1] & flag[i-1, j] & flag[i-1, j+1] & flag[i, j-1] & flag[i, j] & flag[i, j+1] & flag[i+1, j-1] & flag[i+1, j] & flag[i+1, j+1]; where flag[i-1, j-1] indicates that the bit of the data slice with coordinates (i-1, j-1) has been marked as 1, and Cur_stat[i, j] indicates that the data slices with coordinates (i, j), (i-1, j-1), (i-1, j), (i-1, j+1), (i, j-1), (i, j+1), (i+1, j-1), (i+1, j), and (i+1, j+1) have all been stored in the buffer of the interface controller.
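- a sketch of this readiness test and of the greedy pass over the slice array (flag_at, done_at, and compute_first_slice are illustrative stand-ins; a complete version would also skip neighbour positions outside the data frame):

```c
#include <stdbool.h>

extern bool flag_at(int i, int j);   /* 1 <=> slice (i, j) is buffered          */
extern bool done_at(int i, int j);   /* 1 <=> first data slice already computed */
extern void compute_first_slice(int i, int j);

/* Cur_stat[i, j]: target (i, j) and its eight edge-adjacent slices buffered. */
bool cur_stat(int i, int j) {
    for (int di = -1; di <= 1; di++)
        for (int dj = -1; dj <= 1; dj++)
            if (!flag_at(i + di, j + dj))
                return false;
    return true;
}

/* Greedy pass in zigzag (row-by-row) order: expand every ready, undone target. */
void greedy_pass(int rows, int cols) {
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            if (!done_at(i, j) && cur_stat(i, j))
                compute_first_slice(i, j);
}
```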
- in addition, the interface controller 112 may also maintain a bitmap of the target data slices for which the first data slice has been calculated, where a bit of 1 in the bitmap indicates that the first data slice has been calculated for the corresponding target data slice, and a bit of 0 indicates that it has not.
- for example, after the first data slice corresponding to a target data slice is calculated, the bitmap shown in FIG. 14 may be updated to the bitmap shown in FIG. 15. In this way, by maintaining and updating the bitmaps provided in FIG. 14 and FIG. 15, it can be determined for which target data slice in FIG. 13 the corresponding first data slice still needs to be calculated.
- the granularity of the target data slice calculated above is 1x1. In that case, each calculation for a target data slice requires acquiring at least the eight other data slices adjacent to its edges (except for target data slices located at the edge of the data frame), and the data in the overlapping areas between the first data slice and these other data slices is repeatedly transmitted multiple times.
- to reduce repeated transmission of data in the overlapping areas, the target data slice provided by the embodiments of the present application may include 2n data slices.
- for example, the granularity of the target data slice may be 1x2, 2x2, or other sizes.
- take a target data slice with 2x2 granularity as an example.
- for instance, the target data slice includes data slices (1,1), (1,2), (2,1), and (2,2).
- when the data slices (1,3), (2,3), (3,1), (3,2), and (3,3) around the target data slice are stored in the interface controller's buffer, the calculation of the first data slice corresponding to the target data slice is performed.
- one calculation for a target data slice of 2x2 granularity requires at most 12 data slices adjacent to the edges of the target data slice (since the example 2x2-granularity target data slice is located in the upper-left corner, only the 5 edge-adjacent data slices are shown in FIG. 13); thus, for each data slice within a 2x2-granularity target data slice, it is on average sufficient to provide three edge-adjacent data slices to perform the calculation of the first data slice, which reduces repeated transmission of data in the overlapping areas.
- suppose the target data slice has 2x2 granularity and specifically includes data slices (i, j), (i+1, j), (i, j+1), and (i+1, j+1); the first data slice can be calculated when the target data slice and its adjacent other data slices (i-1, j-1), (i-1, j), (i-1, j+1), (i-1, j+2), (i, j-1), (i, j+2), (i+1, j-1), (i+1, j+2), (i+2, j-1), (i+2, j), (i+2, j+1), and (i+2, j+2) have all been stored in the buffer of the interface controller.
- Cur_stat[(i, j) & (i+1, j) & (i, j+1) & (i+1, j+1)] = flag[i-1, j-1] & flag[i-1, j] & flag[i-1, j+1] & flag[i-1, j+2] & flag[i, j-1] & flag[i, j+2] & flag[i+1, j-1] & flag[i+1, j+2] & flag[i+2, j-1] & flag[i+2, j] & flag[i+2, j+1] & flag[i+2, j+2]; where flag[i-1, j-1] indicates that the bit of the data slice with coordinates (i-1, j-1) has been marked as 1.
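- generalising the two checks above to a g x g target block anchored at (i, j) (a sketch; border handling for blocks at the frame edge is again omitted):

```c
#include <stdbool.h>

extern bool flag_at(int i, int j);   /* 1 <=> slice (i, j) is buffered */

/* All g*g target slices plus the one-slice-wide ring around them must be
 * buffered; g = 1 gives the 8-neighbour check, g = 2 the 12-neighbour one. */
bool cur_stat_block(int i, int j, int g) {
    for (int di = -1; di <= g; di++)
        for (int dj = -1; dj <= g; dj++)
            if (!flag_at(i + di, j + dj))
                return false;
    return true;
}
```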
- optionally, the interface controller 112 may also maintain a bitmap representing the target data slices for which the first data slice can be calculated, where each entry of the bitmap is a pair (x, y): x being 1 indicates that the 2x2-granularity target data slice and the other data slices adjacent to its edges have all been stored in the buffer of the interface controller 112, and x being 0 indicates that they have not all been stored; y being 0 indicates that the first data slice has not yet been calculated for the corresponding target data slice, and y being 1 indicates that it has been calculated.
- for example, when the first data slice has not yet been calculated for the target data slices (i, j), (i+1, j), (i, j+1), and (i+1, j+1), the target data slice is correspondingly represented as (1, 0) in FIG. 18a.
- after the first data slice is calculated, the target data slice can be correspondingly updated to (1, 1) in FIG. 18b.
- in this way, it can be determined for which target data slice in FIG. 17 the corresponding first data slice still needs to be calculated.
- the interface controller 112 transmits the first data slice to the second processor 111, where the second processor is used for processing the first data slice. Specifically, the interface controller 112 may also send a first processing instruction corresponding to the first data slice to the second processor 111, where the second processor 111 processes the first data slice according to the first processing instruction.
- the interface controller 112 can store the multiple data slices in the data frame processed by the first processor 110 in the buffer, and can acquire the first data slice according to the target data slice in the multiple data slices , wherein the first data slice includes at least the target data slice, and then the first data slice is sent to the second processor 111, and the second processor 111 processes the corresponding first data slice.
- in the above process, since the data frame is divided into multiple data slices, the size of the data transmitted each time is reduced relative to the data frame, and the data is transmitted to the second processor 111 through the buffer of the interface controller 112, so transmitting data between the first processor 110 and the second processor 111 through shared memory can be avoided.
- moreover, since the interface controller 112 can transmit each first data slice to the second processor 111 as soon as it is acquired and notify the second processor 111 to process it, the interface controller 112 can transmit other data slices to the second processor 111 while the second processor 111 processes the current first data slice, which effectively forms a computing-link pipeline. Since the data frame is divided into data slices, the data of each data slice can be transmitted to the second processor 111 through the interface controller 112 as soon as the first processor 110 finishes processing it, and the second processor 111 starts processing it, thereby winning more processing time for the second processor 111; compared with the prior art, in which the algorithm is executed serially in each processor, the performance requirement on the second processor 111 in this solution is lower.
- for example, if the first processor 110 divides the data frame into multiple data slices, as shown in FIG. 19, the first processor and the second processor need to complete the processing of one frame of data within a unit cycle. As long as the first processor 110 finishes processing the data of one data slice (at time t1), the data slice can be transmitted to the second processor 111 through the interface controller 112 and processed by the second processor 111. In this way, the second processor 111 can start to process the data slices transmitted by the interface controller 112 right after the first processor 110 finishes processing the first data slice at t1 (that is, the second processor 111 starts its processing period T2 for the data frame within the processing period T1 of the first processor), thereby obtaining more processing time for the second processor 111 and reducing the performance requirements on the second processor 111. Meanwhile, the first processor 110 can continue to process other data, and the interface controller 112 can transmit other data slices to the second processor 111, which effectively forms a computing-link pipeline.
- if the interface controller 112 divides the data frame into multiple data slices, as shown in FIG. 8, then as long as the first processor 110 starts to process the data frame and outputs the data of the data frame to the interface controller 112, the interface controller 112 can start to divide the data frame into slices; after the interface controller 112 acquires a data slice, the data slice can be transmitted to the second processor 111 and processed by the second processor 111. In this way, the second processor 111 can likewise start processing the data slices transmitted by the interface controller 112 after t1, which obtains more processing time for the second processor 111, reduces the performance requirements on the second processor 111, and can reduce the delay of the entire computing link, thereby improving the overall performance of the system.
- the data transmission method provided by the embodiment of the present application is described below with reference to FIG. 20 , taking the first processor as the GPU and the second processor as the NPU as an example.
- in the initial state, the interface controller sets all bits in the bitmap to 0.
- the initial state may be a reset (reset) after the end of the last data frame transmission.
- the end of the data frame usually carries a data end indication.
- the bitmap may be reset to the initial state according to the data end indication.
- in step 201, the interface controller determines whether the GPU has data output. If so, it determines whether the GPU outputs a data slice or a data frame. If the GPU outputs a data frame, the data frame is divided into multiple data slices and step 202 is executed; if the GPU outputs a data slice, step 202 is executed directly. In step 202, the interface controller sets the corresponding bit in the bitmap to 1 according to the data slice. In step 203, the interface controller determines, according to the bitmap, whether there is a target data slice for which the corresponding first data slice can be calculated. If so, it calculates the first data slice and executes step 204.
- step 204 the interface controller transmits the first data slice to the NPU, and sends a first processing instruction to the NPU to notify the NPU to process the first data slice.
- in step 205, it is determined whether all data slices have been processed; if so, the process ends; otherwise, step 203 is performed again.
- the above steps 201-205 introduce the basic logic process of a method for data transmission between a GPU and an NPU using an interface controller provided by the present application, and the specific methods in each process can refer to the descriptions in the above-mentioned steps 101-106 .
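- the control flow of the above steps can be sketched as follows (every function is an illustrative stand-in for controller logic that the present application describes only in prose):

```c
#include <stdbool.h>

extern void reset_bitmap_all_zero(void);            /* initial state: all bits 0 */
extern bool gpu_has_output(void);                   /* step 201                  */
extern bool output_is_full_frame(void);
extern void split_frame_into_slices(void);
extern void mark_slice_ready_in_bitmap(void);       /* step 202                  */
extern bool find_ready_target(int *i, int *j);      /* step 203                  */
extern void send_first_slice_to_npu(int i, int j);  /* step 204: data plus a
                                                       processing instruction    */
extern bool all_slices_done(void);                  /* step 205                  */

void frame_loop(void) {
    reset_bitmap_all_zero();
    while (!all_slices_done()) {
        if (gpu_has_output()) {
            if (output_is_full_frame())
                split_frame_into_slices();
            mark_slice_ready_in_bitmap();
        }
        int i, j;
        while (find_ready_target(&i, &j))
            send_first_slice_to_npu(i, j);
    }
}
```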
- the above-mentioned interface controller includes corresponding hardware structures and/or software modules for executing each function.
- the present application can be implemented in hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
- the interface controller may be divided into functional modules according to the foregoing method examples.
- each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation. The following description will be given by taking as an example that each function module is divided corresponding to each function.
- FIG. 21 is a schematic diagram of a logical structure of an interface controller 300 provided by an embodiment of the present application.
- the interface controller 300 can implement the data transmission method provided by the embodiment of the present application.
- the interface controller 300 may be a hardware structure, a software module, or a hardware structure plus a software module.
- the interface controller 300 includes an obtaining unit 301, a processing unit 302, a sending unit 303, and a buffer.
- the obtaining unit 301 may be configured to perform the above-mentioned step 101 .
- the processing unit 302 may be used to perform the above-mentioned steps 102 and 103 .
- the sending unit 303 may be configured to perform the above-mentioned step 104 .
- all relevant contents of the steps involved in the foregoing method embodiments can be cited in the functional descriptions of the corresponding functional units, which will not be repeated here.
- in the above-mentioned embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof.
- when a software program is used for the implementation, it may be realized in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, all or part of the processes or functions described in the embodiments of the present application are generated.
- the computer may be a general purpose computer, a special purpose computer, a computer network, network equipment, user equipment, or other programmable apparatus.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave).
- the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, integrating one or more available media.
- the available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., digital video disc (DVD)), or semiconductor media (e.g., solid state disk (SSD)), etc.
Abstract
Embodiments of the present application provide an interface controller, a data transmission method, and a system-on-chip, relating to the field of interface technologies and capable of improving the overall performance of the system. The data transmission method includes: acquiring a plurality of data slices in a data frame, where the data frame is generated by a first processor; writing the plurality of data slices into a buffer; acquiring a first data slice according to a target data slice among the plurality of data slices, where the first data slice includes at least the target data slice; and transmitting the first data slice to a second processor, where the second processor is used for processing the first data slice.
Description
本申请涉及接口技术领域,尤其涉及一种接口控制器、数据传输方法及片上系统。
在片上系统(system on chip,SoC,或者称作系统级芯片)中,对数据帧实现某种算法的处理的一般过程是:将数据帧分别通过不同的处理器(通常,处理器可以包含一个或多个知识产权保护核(intellectual property core,IP核),其中一个IP核即一个逻辑单元)的联合处理形成算法链路流水线(pipeline)。其中,算法链路用到的处理器可以是纯硬件的处理器或者可软件编程的处理器,例如:中央处理器(central processing unit,CPU)、数字信号处理器(digital signal processor,DSP)、图像信号处理器(image signal processor,ISP)、视频编码器(video codec)、神经网络处理器(neural processing unit,NPU)、图形处理器(graphics processing unit,GPU)、或显示子系统(display sub system,DSS)等。
一般,处理器之间的数据交互通常都是通过内存共享实现。例如,参照图1所示,前一级处理器对当前数据帧处理完后,放入共享内存(缓存器或DDR中至少一个),并通知后一级处理器继续处理,后一级处理器从共享内存读取该数据帧,并进行后续处理。可见,在上述过程中算法链路在前一级处理器和后一级处理器之间是串行执行的,因此对于周期性实时处理,且延时敏感的算法链路(例如30FPS(30帧每秒)的游戏超分),要求一帧数据的处理要在单位周期内完成,如图2所示。前一级处理器处理并输出一帧数据会占据单位周期中一帧的处理时间T1,此时后一级处理器都处在空闲状态。当一帧数据处理完时,单位周期剩下的时间段T2让后一级处理器处理。由于剩下的时间段T2可能会很短,因此这种方法对后一级处理器的性能提出了很高的要求。并且,通常当数据帧太大时,在使用共享内存时,无法放到片内的缓存器(SoCcache),因此需要放在外部存储器(例如,双倍速率动态随机存储器(double data rate dynamic random access memory,DDR)上,从而会导致存储器的带宽和功耗规格限制引起的整体的性能。总之,目前的SoC架构下处理器协同执行算法时,数据传输的方式会影响整体的性能提升。
Summary
The present application provides an interface controller, a data transmission method, and a system on chip, which can improve overall system performance.
According to a first aspect, a data transmission method is provided. The data transmission method is applied to an interface controller, and the interface controller is connected to a first processor and a second processor. The data transmission method includes: acquiring multiple data slices of a data frame, where the data frame is generated by the first processor; writing the multiple data slices into a buffer (buffer); obtaining a first data slice according to a target data slice among the multiple data slices, where the first data slice includes at least the target data slice; and transmitting the first data slice to the second processor, where the second processor is configured to process the first data slice. In this solution, the interface controller can store, in the buffer, the multiple data slices of the data frame processed by the first processor, obtain a first data slice according to a target data slice among them, where the first data slice includes at least the target data slice, and then send the first data slice to the second processor, which processes it. Because the data frame is divided into multiple data slices, each transmission is smaller than a whole frame, and the data reaches the second processor through the interface controller's buffer, so transferring data between the first processor and the second processor through shared memory can be avoided. Furthermore, as soon as the interface controller obtains one first data slice, it can transmit it to the second processor and notify the second processor to process it; while the second processor processes that first data slice, the interface controller further transmits other data slices to the second processor, which effectively forms a computation-link pipeline. Because the data frame is divided into data slices, each data slice can be transmitted through the interface controller to the second processor as soon as the first processor finishes processing it, and the second processor can start processing immediately, gaining more processing time for the second processor. Compared with the prior art, in which the algorithm's data frame executes serially across the processors, in this solution the performance requirements on the second processor are lower and the latency of the entire computation link is reduced, thereby improving overall system performance.
In a possible implementation, dividing the data frame into multiple data slices may be performed by the first processor or by the interface controller. For example, when the first processor is a GPU-like processor, since a GPU usually renders an image by data slices, the first processor divides the data frame into multiple data slices, and acquiring the multiple data slices of the data frame specifically includes: receiving the multiple data slices transmitted by the first processor. Of course, for a first processor that does not divide the data frame into data slices, the division may be performed by the interface controller; for example, before acquiring the multiple data slices of the data frame, the method further includes: receiving the data frame transmitted by the first processor; and acquiring the multiple data slices of the data frame includes: dividing the data frame into the multiple data slices.
In a possible implementation, obtaining the first data slice according to the target data slice among the multiple data slices includes: reading, from the buffer, the target data slice and the data slices adjacent to the edges of the target data slice; and generating the first data slice according to the target data slice and the data slices adjacent to its edges, where the first data slice covers the target data slice and has overlap regions with the data slices adjacent to the edges of the target data slice. Exemplarily, the second processor may be a neural network processing unit NPU, and the data transmitted between the GPU and the NPU is an image. When the neural network in the NPU computes on a data slice, the width and height of the data slice on the NPU input side are shrunk, according to the number of layers and the stride (stride) of the neural network, before being output; therefore the width (height) of a data slice on the NPU output side is usually smaller than that on the input side. To ensure that the output-side data slices completely cover the entire data frame, the data slices input to the NPU need a larger width and height. Thus, to ensure that the data slice obtained after the second processor processes the first data slice has the same size as the target data slice, the first data slice must be larger than the target data slice: besides covering the target data slice, it must also have overlap regions with the data slices adjacent to the edges of the target data slice, where the width of the overlap region is determined by the number of layers and the stride of the NPU's convolutional network.
In a possible implementation, before obtaining the first data slice according to the target data slice among the multiple data slices, the method further includes: determining that the data slices adjacent to the edges of the target data slice are stored in the buffer. When the first data slice is to be computed from the target data slice and the data slices adjacent to its edges, it must be ensured that those adjacent data slices are stored in the buffer.
In a possible implementation, the method further includes: generating a bitmap table and, upon determining that a data slice has been acquired, setting the bit at the position corresponding to that data slice in the bitmap table to 1; and determining that the data slices adjacent to the edges of the target data slice are stored in the buffer includes: determining that the bits surrounding the position corresponding to the target data slice in the bitmap table are all 1. Through this bitmap table (bitmap), it can be recorded that the data slice corresponding to each bit has been stored in the buffer.
In a possible implementation, the target data slice includes 2n data slices, where n is a positive integer. If the granularity of the target data slice (slice) is 1x1, i.e., it contains only one data slice, then each computation for a target data slice requires acquiring at least the eight other data slices adjacent to its edges (except for target data slices located at the edges of the data frame), so the data in the overlap regions between the first data slice and those other data slices is transmitted repeatedly many times. To reduce this repeated transmission of overlap-region data, the target data slice provided in the embodiments of the present application may include 2n data slices. For example, one computation for a 2x2-granularity target data slice requires at most the 12 data slices adjacent to its edges, so on average each data slice in a 2x2-granularity target data slice only needs its three edge-adjacent data slices before the first data slice can be computed, reducing the repeated transmission of overlap-region data.
In a possible implementation, in order to divide the data frame into multiple data slices, or to determine the position of a data slice within the data frame, the interface controller needs to obtain the information of the data frame. Therefore the method further includes: obtaining the information of the data frame, where the information of the data frame includes a base address, a width, a height, and a number of channels; and dividing the data frame into the multiple data slices according to the information of the data frame.
In a possible implementation, the interface controller is further connected to the cache of the system on chip; when it is determined that the buffer is full, the data slices in the buffer are stored into the cache. Thus, when the buffer is full, the interface controller can store the data slices in the buffer into the system cache; in addition, the interface controller can also store them into DDR, and read them back from the cache of the system on chip or from DDR when the buffer has free space. In this way, if the buffer has enough space, data transfer takes place only in the buffer; if the buffer's space is insufficient, the data slices can be stored into the cache of the system on chip or DDR to avoid interrupting the data transmission.
According to a second aspect, an interface controller is provided. The interface controller is connected to a first processor and a second processor, and the interface controller includes: an acquisition unit configured to acquire multiple data slices of a data frame, where the data frame is generated by the first processor; a processing unit configured to write the multiple data slices acquired by the acquisition unit into a buffer, and further configured to obtain a first data slice according to a target data slice among the multiple data slices in the buffer, where the first data slice includes at least the target data slice; and a sending unit configured to transmit the first data slice obtained by the processing unit to the second processor, where the second processor is configured to process the first data slice.
In a possible implementation, the acquisition unit is specifically configured to receive the multiple data slices transmitted by the first processor, where the first processor is configured to divide the data frame into the multiple data slices.
In a possible implementation, the acquisition unit is further configured to receive the data frame transmitted by the first processor; the acquisition unit is specifically configured to divide the data frame into the multiple data slices.
In a possible implementation, the processing unit is specifically configured to read, from the buffer, the target data slice and the data slices adjacent to the edges of the target data slice, and to generate the first data slice according to the target data slice and the data slices adjacent to its edges, where the first data slice covers the target data slice and has overlap regions with the data slices adjacent to the edges of the target data slice.
In a possible implementation, the second processor is a neural network processing unit NPU.
In a possible implementation, the processing unit is further configured to determine that the data slices adjacent to the edges of the target data slice are stored in the buffer.
In a possible implementation, the processing unit is further configured to generate a bitmap table and, upon determining that a data slice has been acquired, set the bit at the position corresponding to that data slice in the bitmap table to 1; the processing unit is specifically configured to determine that the bits surrounding the position corresponding to the target data slice in the bitmap table are all 1.
In a possible implementation, the target data slice includes 2n data slices, where n is a positive integer.
In a possible implementation, the acquisition unit is further configured to obtain the information of the data frame, where the information of the data frame includes a base address, a width, a height, and a number of channels; the processing unit is specifically configured to divide the data frame into the multiple data slices according to the information of the data frame.
In a possible implementation, the interface controller is further connected to the cache of the system on chip; the processing unit is specifically configured to, when it is determined that the buffer is full, store the data slices in the buffer into the cache through the sending unit.
According to a third aspect, an interface controller is provided, including a buffer and one or more processors, where the processors are configured to invoke computer instructions to perform the method described above.
According to a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program runs on a computer, the computer is caused to perform the method described above.
According to a fifth aspect, a system on chip is provided, including a first processor, a second processor, and the interface controller described above, where the interface controller is connected to the first processor and the second processor and is configured to perform the data transmission method described above.
In a possible implementation, while the second processor processes the first data slice, the interface controller is further configured to transmit other data slices to the second processor. Optionally, while the second processor processes the first data slice, the first processor further processes other data, so as to form a pipeline.
For the technical effects brought by any possible implementation of the second to fifth aspects, reference may be made to the technical effects of the different implementations of the first aspect described above, which are not repeated here.
FIG. 1 is a schematic structural diagram of a system on chip according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an algorithm-link pipeline between processors according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a system on chip according to another embodiment of the present application;
FIG. 4a is a schematic structural diagram of a system on chip according to yet another embodiment of the present application;
FIG. 4b is a schematic flowchart of a data transmission method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a data slice array of a data frame according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a data slice according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the processing order of the data slices of a data frame according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the division order of the data slices of a data frame according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a system on chip according to yet another embodiment of the present application;
FIG. 10 is a schematic diagram of the sizes of a data slice on the NPU output side and a data slice on the NPU input side according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a data slice array of a data frame according to another embodiment of the present application;
FIG. 12 is a schematic structural diagram of a bitmap according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a bitmap according to another embodiment of the present application;
FIG. 14 is a schematic structural diagram of a bitmap according to yet another embodiment of the present application;
FIG. 15 is a schematic structural diagram of a bitmap according to still another embodiment of the present application;
FIG. 16 is a schematic structural diagram of a bitmap according to another embodiment of the present application;
FIG. 17 is a schematic structural diagram of a bitmap according to yet another embodiment of the present application;
FIG. 18a is a schematic structural diagram of a bitmap according to still another embodiment of the present application;
FIG. 18b is a schematic structural diagram of a bitmap according to another embodiment of the present application;
FIG. 19 is a schematic diagram of an algorithm-link pipeline between processors according to another embodiment of the present application;
FIG. 20 is a schematic flowchart of a data transmission method according to another embodiment of the present application;
FIG. 21 is a schematic structural diagram of an interface controller according to an embodiment of the present application.
In the embodiments, the terms "first", "second", and the like are used merely for ease of description and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, unless otherwise stated, "multiple" means two or more; for example, multiple processing units means two or more processing units.
In addition, in the embodiments of the present application, "upper", "lower", "left", and "right" are not limited to the orientations in which components are schematically placed in the drawings. It should be understood that these directional terms may be relative concepts used for relative description and clarification, and may change accordingly with changes in the orientations in which the components are placed in the drawings. In the drawings, the thicknesses of layers and regions are exaggerated for clarity, and the size ratios between the parts in the figures do not reflect actual size ratios.
In the embodiments of the present application, unless otherwise explicitly specified and limited, the term "connected" should be understood broadly; for example, a "connection" may be a fixed connection, a detachable connection, or an integral connection, and may be a direct connection or an indirect connection through an intermediate medium.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application should not be construed as preferable or advantageous over other embodiments or designs. Rather, the use of such words is intended to present related concepts in a concrete manner.
The artificial neural network (artificial neural network, ANN) mentioned in the embodiments of the present application, referred to as a neural network (neural network, NN) or neural-like network for short, is, in the fields of machine learning and cognitive science, a mathematical or computational model that imitates the structure and function of a biological neural network (such as the central nervous system of animals, in particular the brain) and is used to estimate or approximate functions. Artificial neural networks may include neural networks such as convolutional neural networks (convolutional neural network, CNN), deep neural networks (deep neural network, DNN), time delay neural networks (time delay neural network, TDNN), and multilayer perceptrons (multilayer perceptron, MLP).
The technical solutions provided in the present application can be applied to electronic devices including various types of chips, such as systems on chip, for example mobile phones, mobile terminals, personal computers (personal computer, PC), servers, laptops, tablets, in-vehicle computers, smart cameras, smart watches, and embedded devices. The embodiments of the present application place no particular limitation on the specific form of the electronic device.
The following describes in detail, with reference to the accompanying drawings, an interface controller, a data transmission method, and a system on chip provided by the embodiments of the present application. FIG. 3 is a schematic structural diagram of a system on chip according to an embodiment of the present application. It should be understood that the following description takes a system on chip merely as an example; the actual solution may also be applied to other types of chips or devices. As shown in FIG. 3, the system on chip 100 may include a processor 110, a processor 111, an interface controller 112, a memory 120, a communication line 130, a cache 140, at least one communication interface 150, and peripheral circuits (not shown in the figure). It should be noted that the system on chip 100 may further include devices other than those shown in FIG. 3. It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the system on chip 100. In other embodiments of the present application, the system on chip 100 may include more or fewer components than shown, some components may be combined or split, or the components may be arranged differently. The illustrated components may be implemented in hardware or in a combination of software and hardware.
The processor (processor) 110 is the computational core and control core (control unit) of the system on chip 100. The processor 110 may include a central processing unit (central processing unit, CPU), a digital signal processor (digital signal processor, DSP), an image signal processor (image signal processor, ISP), a video codec (video codec), a neural network processing unit (neural processing unit, NPU), a graphics processing unit (graphics processing unit, GPU), a DSS, or an application-specific integrated circuit (application-specific integrated circuit, ASIC). In the embodiments of the present application, the processor may take the form of pure hardware or of hardware capable of loading a software program. In a specific implementation, as an embodiment, the processor 110 may include one or more processor cores (core), such as core 0 and core 1 in FIG. 3. In a specific implementation, as an embodiment, the system on chip 100 may include multiple processors, such as the processor 110, the processor 111, and the interface controller 112 in FIG. 3. Each of these processors may be a single-core (single-core) processor (i.e., a processor including one core) or a multi-core (multi-core) processor (i.e., a processor including multiple cores).
The memory 120 may exist independently of the processor 110 and be connected to the processor 110 through the communication line 130. In one example, the memory 120 may be used to store instructions and data, including instructions for executing the solutions of the present application; the memory 120 is, for example, a static random access memory (SRAM). Execution of the instructions is controlled by the processor 110. In the embodiments of the present application, the interface controller 112 may execute the instructions stored in the memory 120, thereby implementing the data transmission method provided in the following embodiments. Optionally, the instructions in the embodiments of the present application may also be referred to as application program code, which is not specifically limited here.
The communication line 130 may include a path that transfers information between the above components. The communication interface 150, using any transceiver-like apparatus, is used for communicating with other devices or communication networks, such as Ethernet, a radio access network (radio access network, RAN), or a wireless local area network (wireless local area networks, WLAN).
The cache (cache) 140 is a high-speed memory located between the processor 110 and the memory 120; its capacity is smaller than that of the memory 120, but it exchanges data with the processor 110 faster. The cache 140 may be used to hold instructions or data that the processor 110 has just used or uses cyclically; if the processor 110 needs them again, it can fetch them directly from the cache 140. For example, the instructions or data held in the cache 140 are a small portion of the memory 120, but a portion that the processor 110 is about to access in a short time; when the processor 110 fetches instructions or data, it can bypass the memory 120 and fetch directly from the cache 140, speeding up reads.
As described above, the interface controller 112 may be a single-core or multi-core processor. In addition, in the embodiments of the present application, the interface controller 112 is further provided with a buffer. A buffer, also called a buffer register, is mainly used, when data is transferred through a buffer between two devices, to absorb the data accumulation caused by the gap in data-processing speed between the devices at its two ends.
Based on the above system on chip, an embodiment of the present application provides a data transmission method applied to the interface controller 112. As shown in FIG. 4a, the interface controller 112 is connected to the first processor 110 and the second processor 111; in combination with the above description of the system on chip shown in FIG. 3, the interface controller 112 may be connected to the first processor 110 and the second processor 111 through communication lines. As shown in FIG. 4b, the data transmission method includes the following steps:
101. The interface controller 112 acquires multiple data slices of a data frame, where the data frame is generated by the first processor. In step 101, the data frame may take the form of an image, text, audio, or video, and the multiple data slices (slice or tile) of the data frame may take the form of a data slice array. Taking the first processor being a GPU as an example, the data frame may be one frame of an image, and the image's data frame may be a three-dimensional data array (3D data cube) including pixel data on each channel. As shown in FIG. 5, in an xyz rectangular coordinate system, the x direction of the data frame is the width W of the image frame (width, where the width may be the number of pixels in one row along x), and the y direction is the height H of the image frame (height, where the height may be the number of pixels in one column along y). Taking a 4K image with a resolution of 4096×2160 as an example, the width W of the image's data frame is 4096 and the height H is 2160. The z direction of the data frame is the channel C of the image frame (channel; for example, the data frame of an image using RGB (red, green, blue) as the three primary colors contains three channels: an R channel, a G channel, and a B channel). The image's data frame may be divided into multiple data slices in the xy plane. An embodiment of the present application also provides the structure of a data slice, as shown in FIG. 6: for a data slice obtained by dividing the image's data frame, the slice also contains the data of all three channels; the difference from the whole frame is that the height and width of a data slice are only a fraction of those of the whole frame. It can be understood that one channel of each data slice contains pixel data distributed in several rows and several columns. In addition, other kinds of data may take the form of a two-dimensional data array. Of course, a color image may also be converted to grayscale (or binarized) into a single-channel grayscale image, and the data frame of such a grayscale image may likewise take the form of a two-dimensional data array.
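To make the slice layout concrete, the following is a minimal Python sketch of dividing a (channels, height, width) frame into slices on the xy plane; the NumPy representation and the 256×256 slice size are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def split_into_slices(frame: np.ndarray, slice_h: int, slice_w: int):
    """Divide a (C, H, W) data frame into a 2D grid of data slices.

    Returns a dict mapping grid coordinates (i, j) -> slice array,
    where i indexes rows (y direction) and j indexes columns (x direction).
    """
    c, h, w = frame.shape
    slices = {}
    for y in range(0, h, slice_h):
        for x in range(0, w, slice_w):
            # Each slice keeps all C channels, like the whole frame, but
            # only a fraction of the frame's height and width.
            slices[(y // slice_h, x // slice_w)] = frame[:, y:y + slice_h, x:x + slice_w]
    return slices

# Example: a 4K RGB frame (3 channels, 2160 x 4096) split into 256x256 slices.
frame = np.zeros((3, 2160, 4096), dtype=np.uint8)
tiles = split_into_slices(frame, 256, 256)
print(len(tiles))  # 16 columns x 9 rows = 144 (the bottom row is a partial slice)
```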
In the embodiments of the present application, the data frame may be divided into multiple data slices by the first processor 110 or by the interface controller 112 provided herein. For example, if the first processor 110 divides the data frame into multiple data slices, then in step 101 the interface controller 112 may directly receive the multiple data slices transmitted by the first processor 110. The first processor 110 may divide the data frame into multiple data slices, process each slice, and then send the slices to the interface controller 112 in a certain order; for example, as shown in FIG. 7, the data slices in the data slice array may be processed and sent to the interface controller 112 one by one in zigzag scan order, or the first processor 110 may send the processed data slices out of order. In another manner, the interface controller 112 may directly receive the data frame transmitted by the first processor 110 and then divide it into multiple data slices. It should be noted that in the SoC system, the CPU may configure registers for the first processor 110 and the interface controller 112 to indicate whether the division into data slices is performed by the first processor 110 or by the interface controller 112. In some examples, in order to divide the data frame into multiple data slices, or to determine the position of a data slice within the data frame, the interface controller 112 needs to obtain the information of the data frame, which includes the base address, width, height, and number of channels of the data frame; this information may be communicated to the interface controller 112 by the CPU configuring the interface controller 112's registers. Specifically, the first processor 110 may send the data of the data frame to the interface controller 112 through burst (burst) events, where one burst event may send one data frame, one data slice, or data of any other length. Thus, if the interface controller 112 performs the division, then when a burst event sends data of arbitrary length (for example, one data frame or data of any other length), the burst event may also carry the write address and data length of the data; as long as the write address and data length indicate that the data sent exceeds one data slice, the interface controller 112 divides the received data into multiple data slices according to the information of the data frame. For example, as shown in FIG. 8, for a first processor 110 that outputs the data of the data frame row by row in zigzag scan order, the interface controller 112 may divide consecutive rows of data into one data slice according to the base address, width, height, and number of channels of the data frame. If the first processor 110 performs the division, then when a burst event sends one data slice, the burst event may also carry the write address and data length of the data slice; the interface controller 112 can then determine the position of the data slice within the data frame according to the write address, the data length, and the above information of the data frame.
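As an illustration of the last point, a write address carried by a burst can be mapped back to a slice coordinate as in the sketch below; the row-major, per-channel-plane layout starting at the frame base address and the field names are assumptions made for illustration.

```python
def slice_position(write_addr: int, base_addr: int, frame_w: int,
                   slice_h: int, slice_w: int, bytes_per_pixel: int = 1):
    """Map the write address of a burst to (row, col) slice grid coordinates.

    Assumes one channel plane laid out row-major from base_addr; for a
    multi-channel frame the same mapping would apply within each plane.
    """
    offset = (write_addr - base_addr) // bytes_per_pixel
    y, x = divmod(offset, frame_w)      # pixel coordinates within the frame
    return y // slice_h, x // slice_w   # grid coordinates of the slice

# A burst landing 300 rows and 700 pixels into a 4096-wide frame belongs
# to slice (1, 2) when the frame is cut into 256x256 slices.
print(slice_position(300 * 4096 + 700, 0, 4096, 256, 256))  # (1, 2)
```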
102. The interface controller 112 writes the multiple data slices into the buffer. When it is determined that the buffer is full, the interface controller 112 may store the data slices in the buffer into the system cache; in addition, the interface controller 112 may also store them into DDR, and read them back from the cache of the system on chip or from DDR when the buffer has free space. In this way, if the buffer has enough space, data transfer takes place only in the buffer; if the buffer's space is insufficient, the data slices may be stored into the cache of the system on chip or DDR to avoid interrupting the data transmission.
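The overflow handling in step 102 can be pictured with the toy model below; the class name, the dictionary-based backing stores, and the eviction policy are illustrative assumptions, not the embodiment's implementation.

```python
class SliceBuffer:
    """Toy model of the interface controller's buffer, spilling slices to
    the SoC cache or DDR when the buffer itself is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.local = {}    # slices held in the buffer itself
        self.spilled = {}  # slices spilled to the SoC cache / DDR

    def write(self, coord, data, backing="soc_cache"):
        if len(self.local) < self.capacity:
            self.local[coord] = data               # enough space: stay on-buffer
        else:
            self.spilled[coord] = (backing, data)  # buffer full: spill out

    def read(self, coord):
        if coord in self.local:
            return self.local[coord]
        backing, data = self.spilled.pop(coord)    # fill back in when needed
        if len(self.local) < self.capacity:
            self.local[coord] = data
        return data
```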
103. The interface controller 112 obtains a first data slice according to a target data slice among the multiple data slices, where the first data slice includes at least the target data slice. For example, the first data slice includes the target data slice and may optionally further include at least a part of the data slices adjacent to the edges of the target data slice. In step 103, the interface controller 112 may read any target data slice directly from the buffer and send it to the second processor 111 in the subsequent process, or read target data slices in a certain order (for example, zigzag order) and send them in the subsequent process, provided that the target data slice has already been written into the buffer through the processing of steps 101 and 102.
In the embodiments of the present application, before first receiving data slices or a data frame, the interface controller 112 may pre-generate a bitmap table (bitmap); the bitmap may be preconfigured by default, and in the initial state (for example, before the interface controller 112 starts receiving the data frame or data slices) all bits of the bitmap are set to 0. After a data slice sent by the first processor 110 is received in the manner described above, or after the interface controller 112 locally divides the data frame into data slices, the bit at the position corresponding to the data slice in the bitmap is set to 1. When the interface controller divides the data frame into data slices, each data slice can be mapped directly to one bit of the bitmap according to its position within the data frame. When the data slices are sent by the first processor 110, the interface controller 112 may, while acquiring a data slice, also receive the position information of the data slice within the data frame sent by the first processor 110 (for example, the coordinates of the data slice, or its write address and data length), so that the data slices can be mapped one-to-one to the bits of the bitmap according to their position information. In this way, the data slices whose bits are set to 1 in the bitmap are those that have been completely transmitted to the interface controller 112. As shown in FIG. 12, a "1" in the bitmap table indicates that the corresponding data slice is ready, and a "0" indicates that the data slice has not yet been completely output from the first processor 110 to the interface controller 112.
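The arrival bitmap might be modelled as below; the class and method names are illustrative and not taken from the embodiment.

```python
class ArrivalBitmap:
    """Tracks which data slices have fully arrived at the interface
    controller: bit 1 = slice ready, bit 0 = not yet fully received."""

    def __init__(self, rows: int, cols: int):
        self.rows, self.cols = rows, cols
        self.bits = [[0] * cols for _ in range(rows)]  # all 0 in the initial state

    def mark_ready(self, i: int, j: int):
        self.bits[i][j] = 1  # set when slice (i, j) is completely received

    def neighbours_ready(self, i: int, j: int) -> bool:
        """True if slice (i, j) and every in-frame slice around it are ready."""
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if 0 <= ni < self.rows and 0 <= nj < self.cols:
                    if not self.bits[ni][nj]:
                        return False
        return True
```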
In addition, as shown in FIG. 9, when the first processor is a GPU and the second processor is an NPU, the data transmitted between the GPU and the NPU is an image. Moreover, when the neural network in the NPU computes on a data slice, the width and height of the data slice on the NPU input side are shrunk according to the number of layers and the stride (stride) of the neural network before being output; therefore the width (height) of a data slice on the NPU output side is usually smaller than that on the input side. As shown in FIG. 10, the input-side slice width W1 > the output-side slice width W2, and the input-side slice height H1 > the output-side slice height H2. Specifically, assuming the stride in both the W and H directions is 1 and the neural network has N layers in total, each a 3x3 convolution, then W2 = W1 − N×1 and H2 = H1 − N×1. Thus, if the data slices shown in FIG. 5 were used directly as the neural network's input, there would be gaps between the data slices output after NPU processing, degrading the output image. To avoid this, the output-side data slices must completely cover the entire data frame, i.e., the size (W2 and H2) of the NPU output-side data slices must equal the size of the data slices shown in FIG. 5, which requires the data slices input to the neural network to have a larger width and height than those in FIG. 5. For example, step 103 may specifically be: after determining that the data slices adjacent to the edges of the target data slice are stored in the buffer (in combination with the bitmap described above, i.e., after determining that the bits surrounding the position corresponding to the target data slice in the bitmap are all 1), reading the target data slice and its edge-adjacent data slices from the buffer, and generating the first data slice according to the target data slice and its edge-adjacent data slices, where the first data slice covers the target data slice and has overlap regions with the data slices adjacent to its edges. Referring to FIG. 11, the first data slice includes the target data slice together with parts of the multiple data slices adjacent to its edges; those parts are the overlap regions, whose width is determined by the number of layers and the stride of the NPU's convolutional network. As shown in FIG. 11, the first data slice corresponding to target data slice (i,j) covers target data slice (i,j) and has overlap regions with data slices (i-1,j-1), (i-1,j), (i-1,j+1), (i,j-1), (i,j+1), (i+1,j-1), (i+1,j), and (i+1,j+1); that is, the first data slice includes parts of those data slices. It should also be noted that, for data slices located at the edges of the data frame, some directions have no adjacent data slices; for example, data slice (i-1,j-1) has no adjacent slices to its left or above. When computing the first data slice corresponding to data slice (i-1,j-1), a first data slice satisfying the NPU input-side size can be generated by padding (padding); for example, the parts of the first data slice to the left of and above (i-1,j-1) may be filled with zeros or with copies of the pixel data in data slice (i-1,j-1).
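The padded halo read described above might be sketched as follows. The zero-padding choice and the NumPy layout are assumptions (the embodiment also allows replicating edge pixels), and the halo width is left as a parameter since it is set by the NPU network's layer count and stride.

```python
import numpy as np

def first_data_slice(frame: np.ndarray, i: int, j: int,
                     slice_h: int, slice_w: int, halo: int):
    """Build the first data slice for target slice (i, j): the target slice
    plus a halo of `halo` pixels taken from the edge-adjacent slices,
    zero-padded wherever the frame has no neighbour in that direction."""
    c, h, w = frame.shape
    y0, x0 = i * slice_h - halo, j * slice_w - halo
    y1, x1 = i * slice_h + slice_h + halo, j * slice_w + slice_w + halo
    out = np.zeros((c, y1 - y0, x1 - x0), dtype=frame.dtype)  # padding = 0
    ys, xs = max(y0, 0), max(x0, 0)                           # clip to the frame
    ye, xe = min(y1, h), min(x1, w)
    out[:, ys - y0:ye - y0, xs - x0:xe - x0] = frame[:, ys:ye, xs:xe]
    return out
```

Processing this enlarged slice through the shrinking network then yields an output slice of exactly slice_h × slice_w, so adjacent output slices tile the frame without gaps.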
Alternatively, in FIG. 11 the first data slice may also include the entirety of the edge-adjacent data slices, or even more. For example, in FIG. 11 the data slices adjacent to the edges of the target data slice are one slice in each direction along its edges; in practice, the number of adjacent data slices in each direction may alternatively be more than one, depending on the ratio between the heights H1 and H2 and the ratio between the widths W1 and W2. It can be understood that the first data slice may include, in each direction, multiple data slices adjacent to the edges of the target data slice, i.e., the first data slice expands outward from the target data slice by multiple data slices to cover a larger range; this embodiment does not limit the extent to which the first data slice expands beyond the target data slice.
Because the first processor 110 may output data slices out of order rather than in the usual zigzag scan order, the following explanation considers the out-of-order case (it can be understood that in-order output is a special case of out-of-order output). The embodiments of the present application may use a greedy algorithm to process, in order (for example, zigzag scan order), whichever out-of-order data slices are ready for processing. For example, if only one target data slice can be processed, i.e., that target data slice and the data slices adjacent to its edges have all been stored in the buffer of the interface controller 112, the computation of the first data slice corresponding to that target data slice can begin. If multiple target data slices are simultaneously ready, their corresponding first data slices are computed one by one in zigzag scan order.
Specifically, referring to the bitmap provided in FIG. 12, once all the data slices (tile) adjacent to the edges of a target data slice (tile) are ready (bit = 1), the first data slice to be input to the NPU can be computed; the dashed box in FIG. 12 is the computed first data slice. Step 104 can then be performed: the first data slice is transmitted to the NPU, which processes it. When several such slices satisfy the above condition simultaneously, a greedy algorithm can compute them in order (for example, zigzag scan order).
Specifically, taking the data frame of a 4K image as an example, with a data slice size of 256×256, one bitmap is 16x8 bits. As shown in FIG. 13, target data slice (i,j) and its edge-adjacent data slices (i-1,j-1), (i-1,j), (i-1,j+1), (i,j-1), (i,j+1), (i+1,j-1), (i+1,j), and (i+1,j+1) have all been stored in the interface controller's buffer. The start condition that triggers the computation of the first data slice is then: Cur_stat[i,j] = flag[i-1,j-1] & flag[i-1,j] & flag[i-1,j+1] & flag[i,j-1] & flag[i,j+1] & flag[i+1,j-1] & flag[i+1,j] & flag[i+1,j+1], where flag[i-1,j-1] indicates that the bit of the data slice at coordinates (i-1,j-1) has been marked as 1, and Cur_stat[i,j] indicates that the data slices at coordinates (i,j), (i-1,j-1), (i-1,j), (i-1,j+1), (i,j-1), (i,j+1), (i+1,j-1), (i+1,j), and (i+1,j+1) have all been stored in the interface controller's buffer.
In addition, as shown in FIG. 14, the interface controller 112 may also maintain a bitmap of target data slices whose first data slices have already been computed, where a bit of 1 indicates that the first data slice has been computed for the corresponding target data slice, and a bit of 0 indicates that it has not. Before the computation of the first data slice corresponding to target data slice (i,j) in FIG. 13 is performed, the computations for data slices (i-2,j-2), (i-2,j-1), (i-2,j), and (i-1,j) have been completed. After this computation of the first data slice corresponding to target data slice (i,j) finishes, the bitmap of FIG. 14 can be updated to the bitmap shown in FIG. 15. By maintaining and updating the bitmaps of FIG. 14 and FIG. 15 in this way, it can be determined which target data slice in the current FIG. 13 should next have its corresponding first data slice computed.
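Putting the two bitmaps together, a greedy scheduler for the 1x1-granularity case might look as follows; the function names are illustrative, `ready` and `computed` stand for the FIG. 12/13 and FIG. 14/15 bitmaps, and the sketch also checks the target's own bit, which the Cur_stat formula above leaves implicit.

```python
def runnable_targets(ready, computed, rows, cols):
    """Yield, in zigzag (row-major) order, every target slice (i, j) whose
    Cur_stat condition holds: the target and all its in-frame edge-adjacent
    slices are ready, and its first data slice has not been computed yet."""
    for i in range(rows):
        for j in range(cols):
            if computed[i][j]:
                continue
            ok = all(
                ready[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
                if 0 <= i + di < rows and 0 <= j + dj < cols
            )
            if ok:
                yield i, j

def schedule(ready, computed, rows, cols, compute_first_slice):
    # Greedy pass: each time new slices arrive, compute every target that
    # has become runnable and mark it in the "already computed" bitmap.
    for i, j in runnable_targets(ready, computed, rows, cols):
        compute_first_slice(i, j)
        computed[i][j] = 1
```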
The granularity of the target data slice (slice) computed above is 1x1. Thus, each computation for a target data slice requires acquiring at least the eight other data slices adjacent to its edges (except for target data slices located at the edges of the data frame), so the data in the overlap regions between the first data slice and those other data slices is transmitted repeatedly many times. To reduce this repeated transmission of overlap-region data, the target data slice provided in the embodiments of the present application may include 2n data slices; for example, the granularity of the target data slice (slice) may be 1x2, 2x2, or other sizes.
The following takes a target data slice (slice) granularity of 2x2 as another embodiment. As shown in FIG. 16, once the target data slice consisting of data slices (1,1), (1,2), (2,1), and (2,2), together with the data slices surrounding it, namely (1,3), (2,3), (3,1), (3,2), and (3,3), have all been stored in the interface controller's buffer, the first data slice corresponding to the target data slice is computed. Compared with the 1x1-granularity target data slice of FIG. 13, which requires all eight of its edge-adjacent data slices before the corresponding first data slice can be computed, a 2x2-granularity target data slice requires at most the 12 data slices adjacent to its edges (since the example 2x2-granularity target data slice is located at the top-left corner, FIG. 16 shows only the 5 data slices adjacent to its edges). On average, each data slice in a 2x2-granularity target data slice therefore only needs its three edge-adjacent data slices before the first data slice can be computed, reducing the repeated transmission of overlap-region data. A generalized sketch of this condition follows the next example.
In another example, for the bitmap provided in FIG. 17, the target data slice has 2x2 granularity and specifically includes data slices (i,j), (i+1,j), (i,j+1), and (i+1,j+1); the target data slice and the other data slices adjacent to its edges, namely (i-1,j-1), (i-1,j), (i-1,j+1), (i-1,j+2), (i,j-1), (i,j+2), (i+1,j-1), (i+1,j+2), (i+2,j-1), (i+2,j), (i+2,j+1), and (i+2,j+2), have all been stored in the interface controller's buffer. The start condition that triggers the computation of the first data slice is then: Cur_stat[(i,j)&(i+1,j)&(i,j+1)&(i+1,j+1)] = flag[i-1,j-1] & flag[i-1,j] & flag[i-1,j+1] & flag[i-1,j+2] & flag[i,j-1] & flag[i,j+2] & flag[i+1,j-1] & flag[i+1,j+2] & flag[i+2,j-1] & flag[i+2,j] & flag[i+2,j+1] & flag[i+2,j+2], where flag[i-1,j-1] indicates that the bit of the data slice at coordinates (i-1,j-1) has been marked as 1.
In addition, as shown in FIG. 18a, the interface controller 112 may also maintain a bitmap indicating which target data slices can have their first data slices computed. In the (x,y) entries of this bitmap, x = 1 indicates that the 2x2-granularity target data slice and the other data slices adjacent to its edges have been stored in the buffer of the interface controller 112, while x = 0 indicates that they have not all been stored; y = 0 indicates that the first data slice has not yet been computed for the corresponding target data slice, and y = 1 indicates that it has. For the bitmap provided in FIG. 17 above, when the first data slice has not yet been computed for target data slices (i,j), (i+1,j), (i,j+1), and (i+1,j+1), the target data slice is represented as (1,0) in FIG. 18a; after this computation of the corresponding first data slice completes, the entry can be updated to (1,1) as in FIG. 18b. By maintaining and updating the bitmaps of FIG. 18a and FIG. 18b in this way, it can be determined which target data slice in the current FIG. 17 should have its corresponding first data slice computed.
Further, 104. The interface controller 112 transmits the first data slice to the second processor 111, where the second processor is configured to process the first data slice. Specifically, the interface controller 112 may also send the second processor 111 a first processing indication corresponding to the first data slice, and the second processor 111 processes the first data slice according to the first processing indication.
In this solution, the interface controller 112 can store, in the buffer, the multiple data slices of the data frame processed by the first processor 110, obtain a first data slice according to a target data slice among them, where the first data slice includes at least the target data slice, and then send the first data slice to the second processor 111, which processes it. Because the data frame is divided into multiple data slices, each transmission is smaller than a whole frame, and the data reaches the second processor 111 through the buffer of the interface controller 112, so transferring data between the first processor 110 and the second processor 111 through shared memory can be avoided. Moreover, as soon as the interface controller 112 obtains one first data slice, it can transmit it to the second processor 111 and notify it to process the slice; while the second processor processes that first data slice, the interface controller 112 further transmits other data slices to the second processor 111, effectively forming a computation-link pipeline. Because the data frame is divided into data slices, each data slice can be transmitted through the interface controller 112 to the second processor 111 as soon as the first processor 110 finishes processing it, and the second processor 111 starts processing immediately, gaining more processing time for the second processor 111. Compared with the prior art, in which the algorithm executes serially across the processors, this solution places lower performance requirements on the second processor 111. For example, if the first processor 110 divides the data frame into multiple data slices, then, as shown in FIG. 19, the first processor and the second processor must complete the processing of one frame within a unit cycle; within the first processor's processing period T1, once the first processor 110 has finished data slice t1, the slice can be transmitted to the second processor 111 through the interface controller 112. Each time the interface controller 112 obtains a data slice, it can transmit it to the second processor 111 for processing, so the second processor 111 can begin processing the slices transmitted by the interface controller 112 right after the first processor 110 finishes data slice t1 (that is, the second processor 111 starts its processing period T2 for the data frame within the first processor's processing period T1), which gains more processing time for the second processor 111 and lowers the performance requirements on it. Furthermore, while the second processor 111 processes t1, the first processor 110 can continue processing other data, and the interface controller 112 further transmits other data slices to the second processor 111, effectively forming a computation-link pipeline. In addition, if the interface controller 112 divides the data frame into multiple data slices, then, as shown in FIG. 8, as soon as the first processor 110 starts processing the data frame and outputting its data to the interface controller 112, the interface controller 112 can begin slicing the frame; once the interface controller 112 has obtained one data slice, it can transmit the slice to the second processor 111 for processing. In this way the second processor 111 can likewise start processing the slices transmitted by the interface controller 112 after t1, gaining more processing time for the second processor 111, lowering the performance requirements on it, and reducing the latency of the entire computation link, thereby improving overall system performance.
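The timing argument above can be made concrete with a toy two-stage latency calculation; the frame times and slice count below are invented for illustration and are not taken from the embodiment.

```python
def serial_latency(t1: float, t2: float) -> float:
    # Frame-granularity handover: the second processor starts only after
    # the first finishes the whole frame (T1 + T2 in sequence).
    return t1 + t2

def pipelined_latency(t1: float, t2: float, k: int) -> float:
    """Two-stage pipeline over k slices: the first slice flows through both
    stages, and the remaining k-1 slices hide behind the slower stage."""
    a, b = t1 / k, t2 / k
    return a + b + (k - 1) * max(a, b)

# One frame: T1 = 20 ms on the first processor, T2 = 12 ms on the second.
print(serial_latency(20, 12))         # 32.0 ms, frame-at-a-time handover
print(pipelined_latency(20, 12, 16))  # 20.75 ms with 16 slices in flight
```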
The following describes the data transmission method provided by the embodiments of the present application with reference to FIG. 20, taking the first processor being a GPU and the second processor being an NPU as an example. In the initial state, the interface controller sets all bits of the bitmap to 0. The initial state may be restored by a reset (reset) after the previous data frame's transmission ends; for example, the tail of a data frame usually carries an end-of-data indication, and the bitmap may be reset to the initial state according to that indication.
In step 201, the interface controller determines whether the GPU has data output. If so, it determines whether the GPU is outputting data slices or a data frame. If it determines that the GPU is outputting a data frame, it divides the data frame into multiple data slices and performs step 202; if it determines that the GPU is outputting data slices, it performs step 202 directly. In step 202, the interface controller sets the corresponding bits in the bitmap to 1 according to the data slices. In step 203, the interface controller determines from the bitmap whether there is a target data slice whose corresponding first data slice can be computed; if so, it computes the first data slice and performs step 204. In step 204, the interface controller transmits the first data slice to the NPU and sends the NPU a first processing indication to notify the NPU to process the first data slice. In step 205, it determines whether all data slices have been processed; if so, the process ends, and if not, step 203 is performed. The above steps 201-205 describe the basic logical flow of the method, provided by the present application, for transmitting data between a GPU and an NPU through an interface controller; for the specific manner of each step, reference may be made to the descriptions of steps 101-104 above.
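The control flow of steps 201-205 can be summarized in the sketch below; `gpu_output()`, `split_frame()`, `find_runnable()`, `compute_first_slice()`, and `send_to_npu()` are hypothetical helpers standing in for the hardware interactions, and the loop structure is one plausible reading of the flowchart rather than the embodiment's implementation.

```python
def controller_loop(bitmap, gpu_output, split_frame, send_to_npu,
                    find_runnable, compute_first_slice, total_slices):
    done = 0
    while done < total_slices:
        # Step 201: poll the GPU; a whole frame is split locally, while a
        # slice output by the GPU is taken as-is.
        item = gpu_output()
        slices = split_frame(item) if item.is_frame else [item]
        # Step 202: mark the newly arrived slices in the bitmap.
        for s in slices:
            bitmap.mark_ready(*s.coord)
        # Steps 203-204: for every target slice that has become computable,
        # build its first data slice and hand it to the NPU together with
        # a first processing indication.
        for coord in find_runnable(bitmap):
            send_to_npu(compute_first_slice(coord))
            done += 1
        # Step 205: loop until all slices of the frame have been processed.
```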
The foregoing mainly describes the solutions provided by the embodiments of the present application. It can be understood that, in order to implement the above functions, the interface controller includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The embodiments of the present application may divide the interface controller into functional modules according to the above method examples; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiments of the present application is schematic and merely a logical functional division; other divisions are possible in actual implementations. The following description takes the division of functional modules corresponding to each function as an example.
FIG. 21 is a schematic diagram of the logical structure of an interface controller 300 provided by an embodiment of the present application; the interface controller 300 can implement the data transmission method provided by the embodiments of the present application. The interface controller 300 may be a hardware structure, a software module, or a hardware structure plus a software module. As shown in FIG. 21, the interface controller 300 includes an acquisition unit 301, a processing unit 302, a sending unit 303, and a buffer. The acquisition unit 301 may be configured to perform step 101 described above; the processing unit 302 may be configured to perform steps 102 and 103 described above; and the sending unit 303 may be configured to perform step 104 described above. All relevant details of the steps involved in the foregoing method embodiments can be found in the functional descriptions of the corresponding functional units and are not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the above methods can be completed by hardware related to program instructions, and the program may be stored in a computer-readable storage medium, such as a ROM, a RAM, or an optical disc. An embodiment of the present application further provides a computer-readable storage medium, which may include a memory. For explanations and beneficial effects of the relevant content in any of the interface controllers 300 provided above, reference may be made to the corresponding method embodiments provided above, which are not repeated here.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (digital subscriber line, DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, integrating one or more available media. The available media may be magnetic media (for example, floppy disks, hard disks, magnetic tapes), optical media (for example, digital video discs (digital video disc, DVD)), or semiconductor media (for example, solid state disks (solid state disk, SSD)).
Although the present application is described herein with reference to the embodiments, in the course of implementing the claimed application, those skilled in the art can, by viewing the drawings, the disclosure, and the appended claims, understand and effect other variations of the disclosed embodiments. In the claims, the word "comprising" (comprising) does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to good effect.
Although the present application is described with reference to specific features and the embodiments thereof, it is obvious that various modifications and combinations may be made without departing from the spirit and scope of the present application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined by the appended claims and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. Obviously, those skilled in the art may make various changes and variations to the present application without departing from its spirit and scope. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include them.
Claims (14)
- A data transmission method, characterized by comprising: acquiring multiple data slices of a data frame, wherein the data frame is generated by a first processor; writing the multiple data slices into a buffer; obtaining a first data slice according to a target data slice among the multiple data slices, wherein the first data slice comprises at least the target data slice; and transmitting the first data slice to a second processor, wherein the second processor is configured to process the first data slice.
- The data transmission method according to claim 1, characterized in that the acquiring multiple data slices of a data frame comprises: receiving the multiple data slices transmitted by the first processor, wherein the first processor is configured to divide the data frame into the multiple data slices.
- The data transmission method according to claim 1, characterized in that, before the acquiring multiple data slices of a data frame, the method further comprises: receiving the data frame transmitted by the first processor; and the acquiring multiple data slices of a data frame comprises: dividing the data frame into the multiple data slices.
- The data transmission method according to any one of claims 1 to 3, characterized in that the obtaining a first data slice according to a target data slice among the multiple data slices comprises: reading, from the buffer, the target data slice and the data slices adjacent to the edges of the target data slice; and generating the first data slice according to the target data slice and the data slices adjacent to the edges of the target data slice, wherein the first data slice covers the target data slice and the first data slice has overlap regions with the data slices adjacent to the edges of the target data slice.
- The data transmission method according to any one of claims 1 to 4, characterized in that the second processor is a neural network processing unit NPU.
- The data transmission method according to claim 4, characterized in that the target data slice comprises 2n of the data slices, wherein n is a positive integer.
- An interface controller, characterized in that the interface controller comprises: an acquisition unit configured to acquire multiple data slices of a data frame, wherein the data frame is generated by a first processor; a processing unit configured to write the multiple data slices acquired by the acquisition unit into a buffer, the processing unit being further configured to obtain a first data slice according to a target data slice among the multiple data slices in the buffer, wherein the first data slice comprises at least the target data slice; and a sending unit configured to transmit the first data slice obtained by the processing unit to a second processor, wherein the second processor is configured to process the first data slice.
- The interface controller according to claim 7, characterized in that the acquisition unit is specifically configured to receive the multiple data slices transmitted by the first processor, wherein the first processor is configured to divide the data frame into the multiple data slices.
- The interface controller according to claim 7, characterized in that the acquisition unit is further configured to receive the data frame transmitted by the first processor; and the acquisition unit is specifically configured to divide the data frame into the multiple data slices.
- The interface controller according to any one of claims 7 to 9, characterized in that the processing unit is specifically configured to read, from the buffer, the target data slice and the data slices adjacent to the edges of the target data slice, and to generate the first data slice according to the target data slice and the data slices adjacent to the edges of the target data slice, wherein the first data slice covers the target data slice and the first data slice has overlap regions with the data slices adjacent to the edges of the target data slice.
- The interface controller according to any one of claims 7 to 10, characterized in that the second processor is a neural network processing unit NPU.
- The interface controller according to claim 10, characterized in that the target data slice comprises 2n of the data slices, wherein n is a positive integer.
- A system on chip, characterized by comprising a first processor, a second processor, and an interface controller, wherein the interface controller is connected to the first processor and the second processor, and the interface controller is configured to perform the data transmission method according to any one of claims 1 to 6.
- The system on chip according to claim 13, characterized in that, while the second processor processes the first data slice, the interface controller is further configured to transmit other data slices to the second processor.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/075307 WO2022165718A1 (zh) | 2021-02-04 | 2021-02-04 | Interface controller, data transmission method, and system on chip |
CN202180077291.2A CN116508000A (zh) | 2021-02-04 | 2021-02-04 | Interface controller, data transmission method, and system on chip |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/075307 WO2022165718A1 (zh) | 2021-02-04 | 2021-02-04 | Interface controller, data transmission method, and system on chip |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022165718A1 true WO2022165718A1 (zh) | 2022-08-11 |
Family
ID=82740784
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/075307 WO2022165718A1 (zh) | 2021-02-04 | 2021-02-04 | 一种接口控制器、数据传输方法及片上系统 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116508000A (zh) |
WO (1) | WO2022165718A1 (zh) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090125538A1 (en) * | 2007-11-13 | 2009-05-14 | Elemental Technologies, Inc. | Video encoding and decoding using parallel processors |
US20090310686A1 (en) * | 2008-06-17 | 2009-12-17 | Do-hyoung Kim | Distributed decoding device of sequential parallel processing scheme and method for the same |
WO2017203096A1 (en) * | 2016-05-27 | 2017-11-30 | Picturall Oy | A computer-implemented method for reducing video latency of a computer video processing system and computer program product thereto |
CN111465919A (zh) * | 2017-12-28 | 2020-07-28 | 深圳市大疆创新科技有限公司 | 用于支持可移动平台环境中的低延迟的系统和方法 |
-
2021
- 2021-02-04 WO PCT/CN2021/075307 patent/WO2022165718A1/zh active Application Filing
- 2021-02-04 CN CN202180077291.2A patent/CN116508000A/zh active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090125538A1 (en) * | 2007-11-13 | 2009-05-14 | Elemental Technologies, Inc. | Video encoding and decoding using parallel processors |
US20090310686A1 (en) * | 2008-06-17 | 2009-12-17 | Do-hyoung Kim | Distributed decoding device of sequential parallel processing scheme and method for the same |
WO2017203096A1 (en) * | 2016-05-27 | 2017-11-30 | Picturall Oy | A computer-implemented method for reducing video latency of a computer video processing system and computer program product thereto |
CN111465919A (zh) * | 2017-12-28 | 2020-07-28 | 深圳市大疆创新科技有限公司 | 用于支持可移动平台环境中的低延迟的系统和方法 |
Also Published As
Publication number | Publication date |
---|---|
CN116508000A (zh) | 2023-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
- TWI765979B (zh) | Method of operating a neural network device | |
- WO2016011886A1 (zh) | Method and apparatus for decoding an image | |
- JP6654048B2 (ja) | Method of operating a UFS device, method of operating a UFS host, and method of operating a system including them | |
US8661440B2 (en) | Method and apparatus for performing related tasks on multi-core processor | |
- JP2015531524A (ja) | Memory access control module and related methods | |
- WO2021082969A1 (zh) | Inter-core data processing method and system, system on chip, and electronic device | |
US20060161720A1 (en) | Image data transmission method and system with DMAC | |
- CN112005251B (zh) | Arithmetic processing device | |
- WO2019084788A1 (zh) | Computing apparatus and circuit for a neural network, and related methods | |
- JP2006318178A (ja) | Data transfer arbitration apparatus and data transfer arbitration method | |
US7793012B2 (en) | Information processing unit, system and method, and processor | |
US20200379928A1 (en) | Image processing accelerator | |
- CN112235579A (zh) | Video processing method, computer-readable storage medium, and electronic device | |
US20220113944A1 (en) | Arithmetic processing device | |
- WO2021136433A1 (zh) | Electronic device and computer system | |
- CN118394681A (zh) | Inter-core communication method and system, computer device, readable storage medium, and product | |
- WO2022165718A1 (zh) | Interface controller, data transmission method, and system on chip | |
US20070208887A1 (en) | Method, apparatus, and medium for controlling direct memory access | |
US20170124695A1 (en) | Data processing apparatus | |
US6771271B2 (en) | Apparatus and method of processing image data | |
- JP4839489B2 (ja) | Descriptor control method, direct memory transfer apparatus, and program | |
US20180095929A1 (en) | Scratchpad memory with bank tiling for localized and random data access | |
- WO2022027172A1 (zh) | Data processing apparatus, method, and system, and neural network accelerator | |
- WO2021092941A1 (zh) | Computation method and apparatus for a region-of-interest pooling layer, and neural network system | |
US20220318955A1 (en) | Tone mapping circuit, image sensing device and operation method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21923743 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202180077291.2 Country of ref document: CN |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21923743 Country of ref document: EP Kind code of ref document: A1 |