CN110149802A - Compiler for translating between a virtual image processor instruction set architecture (ISA) and target hardware having a two-dimensional shift array structure - Google Patents


Info

Publication number
CN110149802A
CN110149802A (application CN201680020203.4A)
Authority
CN
China
Prior art keywords
data
channel
memory
instruction
high level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201680020203.4A
Other languages
Chinese (zh)
Other versions
CN110149802B (en)
Inventor
Albert Meixner (阿尔伯特·迈克斯纳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Publication of CN110149802A
Application granted
Publication of CN110149802B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/441 Register allocation; Assignment of physical memory space to logical memory space
    • G06F8/443 Optimisation
    • G06F8/4441 Reducing the execution time required by the program code
    • G06F8/445 Exploiting fine grain parallelism, i.e. parallelism at instruction level
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/451 Code distribution
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/3001 Arithmetic instructions
    • G06F9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G06F9/30043 LOAD or STORE instructions; Clear instruction
    • G06F9/30134 Register stacks; shift registers
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/345 Addressing or accessing of multiple operands or results
    • G06F9/3887 Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Abstract

A method is described that includes translating higher-level program code, comprising higher-level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates of an orthogonal coordinate system, into lower-level instructions that target a hardware architecture having an array of execution lanes and a shift register array structure that is able to shift data along two different axes. The translating includes replacing the higher-level instructions having the instruction format with lower-level shift instructions that shift data within the shift register array structure.

Description

Compiler for translating between a virtual image processor instruction set architecture (ISA) and target hardware having a two-dimensional shift array structure
Technical field
The field of the invention pertains generally to image processing and, more specifically, to a compiler for translating between a virtual image processing instruction set architecture (ISA) and target hardware having a two-dimensional shift array structure.
Background
Image processing typically involves the processing of pixel values that are organized into an array. Here, a spatially organized two-dimensional array captures the two-dimensional nature of images (additional dimensions may include time (e.g., a sequence of two-dimensional images) and data type (e.g., colors)). In a typical scenario, the arrayed pixel values are provided by a camera that has generated a still image or a sequence of frames to capture images of motion. Traditional image processors typically fall on one of two extremes.
A first extreme performs image processing tasks as software programs executing on a general-purpose processor or general-purpose-like processor (e.g., a general-purpose processor with vector instruction enhancements). Although the first extreme typically provides a highly versatile application software development platform, its use of finer-grained data structures combined with the associated overhead (e.g., instruction fetch and decode, handling of on-chip and off-chip data, speculative execution) ultimately results in larger amounts of energy being consumed per unit of data during execution of the program code.
A second, opposite extreme applies fixed-function, hard-wired circuitry to much larger blocks of data. The use of larger (as opposed to finer-grained) blocks of data applied directly to custom-designed circuits greatly reduces power consumption per unit of data. However, the use of custom-designed fixed-function circuitry generally results in a processor that is able to perform only a limited set of tasks. As such, the widely versatile programming environment (associated with the first extreme) is lacking in the second extreme.
A technology platform that provides for both highly versatile application software development opportunities and improved power efficiency per unit of data remains a desirable yet missing solution.
Summary of the invention
A method is described that includes translating higher-level program code, comprising higher-level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates of an orthogonal coordinate system, into lower-level instructions that target a hardware architecture having an array of execution lanes and a shift register array structure that is able to shift data along two different axes. The translating includes replacing the higher-level instructions having the instruction format with lower-level shift instructions that shift data within the shift register array structure.
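The substitution described above can be illustrated with a minimal sketch. The mnemonics (SHIFT_LEFT, READ_LOCAL, etc.) are hypothetical, since the text only requires that a coordinate-addressed load be replaced with low-level shift instructions for the shift register array; this is one plausible lowering, not the patent's actual compiler:

```python
def lower_load(dx, dy):
    """Translate a high-level LOAD addressed by an (dx, dy) offset from each
    lane's own position into low-level shift instructions for a 2D
    shift-register array.  Each SHIFT moves the whole data plane one step,
    so afterward every lane finds the wanted neighbor value in its own
    local register."""
    ops = []
    # To read the pixel dx steps to the RIGHT, the plane shifts LEFT dx
    # times (and vice versa); same reasoning for the vertical axis.
    for _ in range(abs(dx)):
        ops.append("SHIFT_LEFT" if dx > 0 else "SHIFT_RIGHT")
    for _ in range(abs(dy)):
        ops.append("SHIFT_UP" if dy > 0 else "SHIFT_DOWN")
    ops.append("READ_LOCAL")  # value now sits in the lane's own register
    return ops

print(lower_load(1, 0))   # ['SHIFT_LEFT', 'READ_LOCAL']
print(lower_load(-1, 2))  # ['SHIFT_RIGHT', 'SHIFT_UP', 'SHIFT_UP', 'READ_LOCAL']
```

Because every execution lane shifts in lock-step, one shift sequence serves all output pixels at once, which is what makes the substitution profitable on SIMD-like hardware.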
Detailed description of the invention
The following description and drawings are used to illustrate embodiments of the invention. In the figures:
Fig. 1 shows various components of a technology platform;
Fig. 2a shows an embodiment of application software built from kernels;
Fig. 2b shows an embodiment of the structure of a kernel;
Fig. 3 shows an embodiment of the operation of a kernel;
Figs. 4a, 4b and 4c depict various aspects of the memory model of a virtual processor for developing kernel threads in a high-level application software development environment;
Fig. 5a shows an embodiment of a thread written with load instructions having a position-relative format;
Fig. 5b shows images having different pixel densities;
Fig. 6 shows an embodiment of an application software development and simulation environment;
Fig. 7 shows an embodiment of an image processor hardware architecture;
Figs. 8a, 8b, 8c, 8d and 8e depict the parsing of image data into line groups through overlapping stencils, the parsing of a line group into sheets, and operations performed on a sheet;
Fig. 9a shows an embodiment of a stencil processor;
Fig. 9b shows an embodiment of an instruction word of a stencil processor;
Fig. 10 shows an embodiment of a data computation unit within a stencil processor;
Figs. 11a through 11k depict an example of the use of a two-dimensional shift array and an execution lane array to determine a pair of adjacent output pixel values for overlapping stencils;
Fig. 12 shows an embodiment of a unit cell for an integrated execution lane array and two-dimensional shift array;
Fig. 13 pertains to a first operation performed by a sheet generator;
Fig. 14 pertains to a second operation performed by a sheet generator;
Fig. 15 pertains to a third operation performed by a sheet generator;
Fig. 16 pertains to a fourth operation performed by a sheet generator;
Fig. 17 pertains to a fifth operation performed by a sheet generator;
Fig. 18 pertains to a sixth operation performed by a sheet generator;
Fig. 19 shows an embodiment of a sheet generator;
Fig. 20 pertains to a first operation performed by a compiler;
Fig. 21 pertains to a second operation performed by a compiler;
Fig. 22 pertains to a third operation performed by a compiler;
Figs. 23a, 23b, 23c and 23d pertain to a fourth operation performed by a compiler;
Fig. 24 shows an embodiment of a computing system.
Specific embodiment
I. Introduction
The description below describes numerous embodiments concerning a new image processing technology platform that provides a widely versatile application software development environment that uses larger blocks of data (e.g., the line groups and sheets described further below) to provide for improved power efficiency.
1.0 Application software development environment
A. Application and structure of kernels
Fig. 1 shows a high-level view of an image processor technology platform that includes a virtual image processing environment 101, the actual image processing hardware 103, and a compiler 102 for translating higher-level code written for the virtual processing environment 101 into object code that the actual hardware 103 physically executes. As described in more detail below, the virtual processing environment 101 is widely versatile in terms of the applications that can be developed and is tailored for easy visualization of an application's constituent processes. Upon completion of the program code development effort by the developer 104, the compiler 102 translates the code that was written within the virtual processing environment 101 into object code that targets the actual hardware 103.
Fig. 2 a shows the example of structure and form that the application software being written in virtual environment can be taken.Such as Fig. 2 a institute Show, it is contemplated that program code handles one or more frames of input image data 201, to carry out to input image data 201 Certain integral transformation.Pass through the program code operated with the chronological order expressed by developer to input image data The operation of 202 one or more kernels converts to realize.
For example, as shown in Figure 2 a, carrying out integral transformation by handling each input picture with the first kernel K1 first.So Afterwards, the output image generated by kernel K1 is operated by kernel K2.Then, by kernel K3_1 or K3_2 to by interior Each of the output image that core K2 is generated is operated.Then, the output by kernel K4 to being generated by kernel K3_1/K3_2 Image is operated.Kernel K3_1 and K3_2 can be identical kernel, it is intended to be added by implementing parallel processing in the K3 stage Fast disposed of in its entirety or they can be different kernel (for example, kernel K3_1 to the first certain types of input picture into Row operation, and kernel K3_2 operates second of different types of input picture).
In this way, biggish general image processing sequence can take image processing pipeline or directed acyclic graph (DAG) Form, and develop environment and may be configured to that the performance of program code just developed like this is presented to developer is practical.It is interior Core can individually be developed by developer and/or can by the entity that provides any basic technology (such as actual signal processor is hard Part and/or its design) and/or by third party (for example, developing the supplier for the kernel software that environment is write) offer.In this way, pre- It will include kernel " library " that phase, which nominally develops environment, and developer freely " temporarily connects " kernel library by various modes, with complete At the overall procedure of its large-scale development.It is expected that become this library a part some basic core contents may include provide with The kernel of one or more of lower primary image processing task: convolution, denoising, color space conversion, edge and Corner Detection, Sharpening, white balance, gamma correction, tone mapping, matrix multiple, image registration, pyramid construction, wavelet transformation, piecemeal are discrete Cosine and Fourier transformation.
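The K1 → K2 → {K3_1, K3_2} → K4 sequence of Fig. 2a can be sketched as a plain function pipeline. Only the dataflow shape follows the text; the kernels' arithmetic here is an illustrative stand-in, since the actual kernels (convolution, denoising, etc.) are not specified per stage:

```python
# Stand-in kernels: each takes an image (list of rows) and returns a new one.
def K1(img):   return [[p + 1 for p in row] for row in img]   # illustrative transform
def K2(img):   return [[p * 2 for p in row] for row in img]
def K3_1(img): return [[p - 1 for p in row] for row in img]
def K3_2(img): return [[p - 1 for p in row] for row in img]   # same kernel run in parallel
def K4(a, b):  # combines the two K3 outputs element-wise
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def pipeline(frame):
    """DAG of Fig. 2a: K2's output fans out to K3_1 and K3_2, then K4 joins."""
    out2 = K2(K1(frame))
    return K4(K3_1(out2), K3_2(out2))

print(pipeline([[0, 1], [2, 3]]))  # [[2, 6], [10, 14]]
```

In a real development environment the kernels would be drawn from the library described above and "hooked up" in the same fan-out/fan-in shape.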
Fig. 2 b shows the example plot of the structure for the kernel 203 that developer is contemplated that.As shown in Figure 2 b, kernel 203 It can be considered as the parallel thread (" thread ") 204 of several program codes, each is grasped in corresponding floor processor 205 Make, wherein each processor 205 is for the specific position in output array 206 (in the output image that such as kernel is just generating Specific pixel location).For the sake of simplicity, three processors and corresponding thread are only shown in figure 2b.In various embodiments, often The output array position of a description can have the application specific processor and corresponding thread of own.I.e. it is capable to for output Each pixel in array distributes individual processor and thread.In alternative method, identical thread be can be generated more than defeated The data of pixel and/or two different threads (for example, under certain limited cases) can cooperate with generation for identical out The data of output pixel.
As detailed below, in various embodiments, in actual bottom hardware, the array and corresponding line in channel are executed Journey unanimously works (for example, in a manner of single-instruction multiple-data (SIMD) class), to generate " line-group " of the frame for currently just handling A part output image data.Line-group is the continuous and sizable part of picture frame.In various embodiments, it develops Person may realize that hardware operates line-group, or exploitation environment can present it is abstract, wherein there are individual processors And thread, such as each pixel in output frame (for example, the application specific processor of each pixel in output frame by own It is generated with thread).Anyway, in various embodiments, developer understands that kernel includes for the independent of each output pixel Whether thread (makes output array visualization be entire output frame or a portion).
As described in more detail below, in an embodiment, the processors 205 that are presented to the developer in the virtual environment have an instruction set architecture (ISA) that not only supports standard (e.g., RISC) opcodes but also includes specially formatted data access instructions that permit the developer to easily visualize the pixel-by-pixel processing that is being performed. The ability to easily define/visualize any input array location, in combination with an entire ISA of traditional mathematical and program control opcodes, allows for an extremely versatile programming environment that essentially permits an application developer to ideally define any desired function to be performed on image surfaces of any size. For example, ideally, any mathematical operation can be readily programmed to be applied to any stencil size.
With respect to the data access instructions, in an embodiment, the ISA of the virtual processors (the "virtual ISA") includes a special data load instruction and a special data store instruction. The data load instruction is able to read from any location within the input array of image data. The data store instruction is able to write to any location within the output array of image data. The latter instruction allows multiple instances of the same processor to be easily dedicated to different output pixel locations (each processor writes to a different pixel in the output array). As such, for example, the stencil size itself (e.g., expressed as a width of pixels and a height of pixels) can be made an easily programmable feature. Visualization of the processing operations is further simplified with each of the special load and store instructions having a special instruction format whereby target array locations are specified simply as X and Y coordinates.
Regardless, by instantiating a separate processor for each of multiple locations in the output array, the processors can execute their respective threads in parallel so that, for example, the respective values for all locations in the output array can be produced concurrently. It is worthwhile to note that many image processing routines typically perform the same operations on different pixels of the same output image. As such, in one embodiment of the development environment, each processor is presumed to be identical and to execute the same thread program code. Thus, the virtualized environment can be viewed as a type of two-dimensional (2D) SIMD processor composed of, e.g., a 2D array of identical processors, each executing identical code in lock-step.
Fig. 3 shows a more detailed example of the processing environment for two virtual processors that are processing identical code for two different pixel locations in an output array. Fig. 3 shows an output array 304 that corresponds to an output image being generated. Here, a first virtual processor is processing the code of thread 301 to generate an output value at location X1 of the output array 304, and a second virtual processor is processing the code of thread 302 to generate an output value at location X2 of the output array 304. Again, in various embodiments, the developer would understand there to be a separate processor and thread for each pixel location in the output array 304 (for simplicity, Fig. 3 only shows two of them). However, the developer in various embodiments only needs to develop code for one processor and thread (because of the SIMD-like nature of the machine).
As is known in the art, an output pixel value is often determined by processing the pixels of an input array that include and surround the corresponding output pixel location. For example, as can be seen from Fig. 3, location X1 of the output array 304 corresponds to location E of the input array 303. The stencil of input array 303 pixel values that would be processed to determine output value X1 would therefore correspond to input values ABCDEFGHI. Similarly, the stencil of input array pixels that would be processed to determine output value X2 would correspond to input values DEFGHIJKL.
Fig. 3 shows an example of corresponding virtual environment program code for a pair of threads 301, 302 that could be used to calculate output values X1 and X2, respectively. In the example of Fig. 3, both bodies of code are identical and average a stencil of nine input array values to determine a corresponding output value. The only difference between the two threads is the variables that are called up from the input array and the location of the output array that is written to. Specifically, the thread that writes to output location X1 operates on stencil ABCDEFGHI, while the thread that writes to output location X2 operates on stencil DEFGHIJKL.
As can be seen from the respective program code of the pair of threads 301, 302, each virtual processor includes at least internal registers R1 and R2 and supports at least the following instructions: 1) a LOAD instruction from the input array into R1; 2) a LOAD instruction from the input array into R2; 3) an ADD instruction that adds the contents of R1 and R2 and places the result in R2; 4) a DIV instruction that divides the value within R2 by the immediate operand 9; and 5) a STORE instruction that stores the contents of R2 into the thread's dedicated output array location. Again, although only two output array locations and only two threads and corresponding processors are depicted in Fig. 3, conceivably every location in the output array could be assigned a virtual processor and corresponding thread that performs these functions. In various embodiments, in keeping with the SIMD-like nature of the processing environment, the multiple threads execute in isolation from one another. That is, there is no thread-to-thread communication between the virtual processors (one SIMD lane is prevented from crossing into another SIMD lane).
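The thread program just described (LOADs, ADDs, a DIV by the immediate 9, a STORE) can be sketched as a small simulation. The input values and the loop that stands in for the unrolled LOAD/ADD sequence are illustrative, not the patent's actual code:

```python
# Simulation of the two Fig. 3 threads: each "virtual processor" averages a
# 3x3 stencil of the input array into its own output position.  The input
# layout mimics rows A B C ... / D E F ... / G H I ... with example numbers.
input_array = [
    [1, 2, 3, 4],    # A B C J-column ...
    [4, 5, 6, 7],    # D E F K-column ...
    [7, 8, 9, 10],   # G H I L-column ...
]

def thread(cx):
    """One thread: accumulate nine LOADed stencil values in r2, DIV by 9."""
    r2 = 0
    for dy in range(3):
        for dx in range(3):
            r1 = input_array[dy][cx + dx]  # LOAD via (x, y) coordinates
            r2 = r2 + r1                   # ADD r1 into the accumulator
    return r2 / 9                          # DIV by immediate operand 9

# X1's thread averages stencil ABCDEFGHI; X2's averages DEFGHIJKL.
output = [thread(0), thread(1)]
print(output)  # [5.0, 6.0]
```

Note that the two threads run the identical function body; only the stencil origin (`cx`) differs, which mirrors the SIMD-like single-program, per-pixel model described above.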
B. The memory model of the virtual processors
In various embodiments, a pertinent feature of the virtual processors is their memory model. As is understood in the art, a processor reads data from memory, operates on that data, and writes new data back into memory. A memory model is the perspective, or view, that a processor has of the manner in which data is organized in memory. Figs. 4a through 4c pertain to an embodiment of the memory model for the virtual processors of the development environment. A simplified environment involving only three virtual processors and corresponding threads 401 is used for purposes of example. As described in more detail below, the memory model of the virtual processors takes care to preserve SIMD semantics while, at the same time, providing for scalar operations and a private intermediate-value memory space for each virtual processor.
As shown in Fig. 4a, in an embodiment, the memory region that each virtual processor operates out of is organized into six different partitions based on the type of information that is stored. Specifically, there exist: 1) a private scratchpad region 402; 2) a global input data array region 403; 3) a global output data array region 404; 4) a global look-up table information region 405; 5) a global atomic statistics region 406; and 6) a global constant table information region 407.
The partitioning as depicted in Fig. 4a attempts to visualize those memory regions that are shared, or "global", among the virtual processors, in keeping with the SIMD-like nature of the overall processing environment. Likewise, Fig. 4a also attempts to visualize other memory regions that are not shared between virtual processors, i.e., that are "private" to a particular virtual processor. Specifically, as shown in Fig. 4a, all of the memory partitions are global with the exception of the scratchpad region 402, which is private to each virtual processor. As described further below, several of the different memory regions also have different memory addressing schemes.
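The six partitions and their global/private split can be summarized in a small data structure. This is a sketch paraphrased from the text; the key names and the access/addressing labels are illustrative, not a defined API:

```python
# Sketch of the six-partition memory model of Fig. 4a.
MEMORY_MODEL = {
    "scratchpad":     {"scope": "private", "access": "read/write",
                       "addressing": "linear"},            # per-thread intermediates
    "input_array":    {"scope": "global",  "access": "read",
                       "addressing": "position-relative (x, y)"},
    "output_array":   {"scope": "global",  "access": "write",
                       "addressing": "position-relative (x, y)"},
    "lookup_tables":  {"scope": "global",  "access": "read",
                       "addressing": "linear index"},
    "atomic_stats":   {"scope": "global",  "access": "read/write",
                       "addressing": "linear"},
    "constant_table": {"scope": "global",  "access": "read",
                       "addressing": "linear"},
}

# Only the scratchpad is private; everything else is shared SIMD-style.
private = [name for name, p in MEMORY_MODEL.items() if p["scope"] == "private"]
print(private)  # ['scratchpad']
```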
With respect to the scratchpad region 402, it is not uncommon for intermediate information to be temporarily stored over the course of execution of a sophisticated image processing algorithm (e.g., the information is written, then read back and used at a later time). Additionally, it is not uncommon for such information to be different across threads (different input values may effect different intermediate values). The memory model therefore includes a per-processor private scratchpad region 402 for the storage of such intermediate information by each virtual processor's corresponding thread. In an embodiment, the scratchpad region for a particular processor is accessed 409 by that processor through a typical (e.g., linear) random access memory address, and it is a read/write region of memory (i.e., a virtual processor is able to both read information from, and write information into, its private memory). Embodiments of the virtual processor ISA instruction format for accessing the scratchpad region are discussed in more detail further below.
The input array portion 403 contains the set of input data that is called into 408 the set of threads in order to produce output data. In a typical situation, the input array corresponds to an image (e.g., a frame) or a section of an image that each thread operates on or within. The input image may be a true input, such as the pixel information provided by a camera, or some form of intermediate image, such as the information provided by a previous kernel in a larger overall image processing sequence. Virtual processors typically do not compete for the same input data items because they operate on different pixel locations of the input image data during a same cycle.
In an embodiment, a novel memory addressing scheme is used to define which particular input values are to be called in from the input array 403. Specifically, a "position relative" addressing scheme is used that defines the desired input data with X, Y coordinates rather than a traditional linear memory address. As such, the load instruction of the virtual processors' ISA includes an instruction format that identifies a specific memory location within the input array with an X component and a Y component. A two-dimensional coordinate system is thereby used to address memory for input values read from the input array 403.
The use of a position-relative memory addressing approach permits the region of an image that a virtual processor is operating on to be more readily identifiable to a developer. As mentioned above, the ability to easily define/visualize any input array location, in combination with an entire ISA of traditional mathematical and program control opcodes, allows for an extremely versatile programming environment that essentially permits an application developer to readily define, ideally, any desired function to be performed on image surfaces of any size. Various instruction format embodiments for instructions that adopt the position relative addressing scheme, as well as embodiments of other features of the supported ISA, are described in more detail further below.
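What a position-relative load means can be shown with a minimal sketch, assuming a simple row-major image; the function and parameter names are hypothetical, not the patent's mnemonics:

```python
def load_relative(image, lane_x, lane_y, dx, dy):
    """LOAD R1, (dx, dy): read the input pixel at the lane's own position
    plus a two-dimensional offset, instead of computing a linear address."""
    return image[lane_y + dy][lane_x + dx]

image = [[10, 11, 12],
         [13, 14, 15],
         [16, 17, 18]]

# A lane responsible for output pixel (1, 1) reads its right neighbor
# and the pixel directly above it:
print(load_relative(image, 1, 1, 1, 0))   # 15
print(load_relative(image, 1, 1, 0, -1))  # 11
```

The developer reasons purely in image coordinates; the mapping to linear memory addresses (or, on the actual hardware, to shift operations) is left to the compiler.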
The output array 404 contains the output image data that the threads are responsible for generating. The output image data may be final image data, such as the actual image data that is presented on a display after the overall image processing sequence, or it may be intermediate image data that a subsequent kernel of the overall image processing sequence uses as its input image data information. Again, virtual processors typically do not compete for the same output data items because they write to different pixel locations of the output image data during the same cycle.
In an embodiment, the position-relative addressing scheme is also used for writes to the output array. As such, the ISA of each virtual processor includes a store instruction whose instruction format defines a targeted write location in memory as a two-dimensional X, Y coordinate rather than a traditional random access memory address. More details concerning embodiments of the position-relative instructions of the virtual ISA are provided further below.
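The position-relative load/store model described above can be illustrated with a minimal Python sketch. This is a hypothetical illustration only (the class and method names are invented for the example, not part of the virtual ISA): each virtual processor owns an assigned (X, Y) output position and addresses the input and output arrays by two-dimensional coordinates rather than by linear addresses.

```python
# Hypothetical sketch of position-relative addressing: each virtual
# processor is assigned an (x, y) output position and addresses the
# input/output arrays with X,Y coordinates instead of linear addresses.

class VirtualProcessor:
    def __init__(self, x, y, input_array, output_array):
        self.x, self.y = x, y       # the thread's assigned output position
        self.inp = input_array      # 2D list standing in for an input line group
        self.out = output_array     # 2D list standing in for an output line group

    def load(self, dx, dy):
        """A LOAD with an (X+dx, Y+dy) position-relative operand."""
        return self.inp[self.y + dy][self.x + dx]

    def store(self, value):
        """A STORE to the thread's own (X, Y) output position."""
        self.out[self.y][self.x] = value

# Three threads on the same row, as in Fig. 4a, each reading a different
# pixel of the same input data during the same simulated cycle.
inp = [[10, 20, 30, 40],
       [50, 60, 70, 80]]
out = [[0] * 4 for _ in range(2)]
threads = [VirtualProcessor(x, 0, inp, out) for x in (1, 2, 3)]
for t in threads:
    t.store(t.load(-1, 0))  # copy the left neighbor into the output position
```

Note how the threads never contend for the same data item, mirroring the observation above that virtual processors operate on different pixel locations during the same cycle.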
Fig. 4a also shows each virtual processor performing a lookup 410 into a lookup table 411 that is kept within the lookup table memory region 405. Lookup tables are often used by image processing tasks, for example, to obtain filter or transform coefficients for different array locations, or to implement complex functions (e.g., gamma curves, sine, cosine), where the lookup table provides the function output for an input index value. Here, it is expected that SIMD image processing sequences will often perform a lookup into the same lookup table during the same clock cycle. As such, like the input and output array memory regions 403, 404, the lookup table region 405 is globally accessible by any virtual processor. Fig. 4a likewise shows each of the three virtual processors effectively looking up information from the same lookup table 411 kept in the lookup table memory region 405.
In an embodiment, because index values are typically used to define a desired lookup table entry, the lookup table information region is accessed with a normal linear accessing scheme. In an embodiment, the lookup region of memory is read-only (i.e., a processor cannot change information in a lookup table and is only permitted to read information from it). For simplicity, Fig. 4a shows only one lookup table residing within the lookup table region 405, but the virtual environment permits multiple, different lookup tables to be resident during simulated runtime. Embodiments of the virtual ISA instruction format for instructions that perform lookups into a lookup table are provided further below.
Fig. 4b shows each of the three virtual processors writing 413 to the atomic statistics region 406. It is not uncommon for image processes to "update" or make a moderate change to output information. The updated information may then be used by other downstream processes that make use of it. Examples of such updates or moderate changes include a simple addition of a fixed offset to output data, a simple multiplication of a multiplicand with output data, or a minimum or maximum comparison of output data against some threshold.
In these sequences, as seen in Fig. 4b, output data that has just been calculated by an individual thread can be operated upon, and the result written to the atomic statistics region 406. Depending on the implementation semantics, the output data operated on by the atomic act may be kept internally by the processor or called up from the output array; Fig. 4b shows the latter 412. In various embodiments, the atomic acts that can be performed on the output data include add, multiply, minimum, and maximum. In an embodiment, the atomic statistics region 406 is accessed with the position-relative addressing scheme (as with input and output array accesses), given that updates to output data would logically be organized into a two-dimensional array identical to the output data itself. Embodiments of the virtual ISA instruction format for performing an atomic act on output data and writing the result into the statistics region 406 are described in more detail further below.
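The atomic statistics updates above can be sketched in a few lines of Python. This is a hedged illustration, not the actual hardware or simulator behavior: the opcode names follow those listed later in the document (STAT_ADD, STAT_MUL, STAT_MIN, STAT_MAX), while the function and table shapes are invented for the example.

```python
# Hypothetical sketch of atomic statistics updates: each thread folds its
# output value into a 2D statistics table with one of the supported
# atomic operations (add, multiply, minimum, maximum).

STAT_OPS = {
    "STAT_ADD": lambda old, new: old + new,
    "STAT_MUL": lambda old, new: old * new,
    "STAT_MIN": min,
    "STAT_MAX": max,
}

def atomic_update(stats, x, y, value, opcode):
    """Apply one atomic update at position-relative coordinates (x, y)."""
    stats[y][x] = STAT_OPS[opcode](stats[y][x], value)

stats = [[0, 0],
         [0, 0]]
# Three threads accumulating into the same statistics cell:
for v in (3, 5, 2):
    atomic_update(stats, 0, 0, v, "STAT_ADD")
# A thresholding-style max update on a neighboring cell:
atomic_update(stats, 1, 0, 7, "STAT_MAX")
```

The read-modify-write shape of `atomic_update` is the essential point: the statistics region is updated in place rather than simply overwritten, which is why atomicity matters when many threads target it.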
Fig. 4c shows each of the virtual processors reading 414 a constant value from a constant lookup table 415 within the constant memory region 407. Here, for example, it is expected that different threads may need the same constant or other value on the same clock cycle (e.g., a particular multiplier to be applied against an entire image). Thus, as shown in Fig. 4c, accesses into the constant lookup table 415 return the same scalar value to each of the virtual processors. Because lookup tables are typically accessed with an index value, in an embodiment, the constant lookup table memory region is accessed with a linear random access memory address. In an embodiment, the constant region of memory is read-only (i.e., a processor cannot change information in a constant table and is only permitted to read information from it). For simplicity, Fig. 4c shows only a single constant lookup table 415 in the constant region 407. As threads may make use of more than one such table, the memory region 407 is configured to be large enough to hold as many constant tables as are needed/used.
C. Virtual Processor ISA
As alluded to above in multiple instances, the virtual processor ISA may include a number of pertinent features. Some of these are described at length immediately below.
In various embodiments, the instruction format of each virtual processor's ISA uses a relative positioning approach to define an X, Y coordinate for each of the following: 1) a LOAD instruction that reads input image data from the input array memory region; 2) a STORE instruction that writes output image data to the output array; and 3) an atomic update to the statistics region of memory.
The ability to readily define any input array location, in combination with an entire ISA of traditional data access, mathematical, and program control opcodes, allows for an extremely versatile programming environment that essentially permits an application developer to define, ideally, any desired function to be performed on image surfaces of any size. For example, ideally, any mathematical operation can be readily programmed to be applied to any stencil size.
In an embodiment, instructions for loads/stores from/to the input/output arrays have the following format:
[OPCODE] LINEGROUP_(name)[((X*XS+X0)/XD); ((Y*YS+Y0)/YD); Z]

where [OPCODE] is the specific type of operation (LOAD from the input array, STORE to the output array) and LINEGROUP_(name) is the name assigned to a particular section of a particular image (e.g., a line group for a frame of image data) within the input or output array memory region. Here, because different line groups are operated on separately, the different line groups are given different names so that they can be uniquely identified/accessed (e.g., LINEGROUP_1, LINEGROUP_2, etc.). Line groups of the same name may exist in both the input array memory region and the output array memory region. The origin of any line group may be, for example, its lower-left corner within its appropriate memory region.
In the case of instructions that perform updates on the atomic statistics table, in an embodiment, the instruction format takes on the following similar structure:
[OPCODE] STATS_(name)[((X*XS+X0)/XD); ((Y*YS+Y0)/YD); Z]

The notable difference is that the input operand information defines a position within a particular statistics table (STATS_(name)) rather than a particular line group within the input or output array. As with line groups, different names are given to different statistics tables so that a thread can uniquely operate on different statistics tables over the course of its operation. The [OPCODE] specifies the particular atomic act to be performed (e.g., STAT_ADD, STAT_MUL, STAT_MIN, STAT_MAX).
For either input/output array accesses or atomic statistics table accesses, the Z operand of the instruction defines which channel of the named line group or statistics table the instruction targets. Here, typically, a single image has multiple channels. For example, video images typically have a red channel (R), a green channel (G), and a blue channel (B) for the same frame of the video stream. In a sense, a complete image can be viewed as separate R, G, and B channel images stacked on top of one another. The Z operand defines which of these the instruction targets (e.g., Z=0 corresponds to the red channel, Z=1 corresponds to the blue channel, and Z=2 corresponds to the green channel). Each line group and statistics table is therefore structured to include the content of each channel for the particular image being processed.
The (X*XS+X0)/XD operand defines the X position within the named line group or statistics table that the instruction targets, and the (Y*YS+Y0)/YD operand defines the Y position within the named line group or statistics table that the instruction targets. The XS and XD terms for the X position and the YS and YD terms for the Y position are used for scaling between input and output images having different pixel densities. Scaling is described in more detail further below.
In the simplest case, there is no scaling between the input and output images, and the X and Y components of the instruction format simply take the form X+X0 and Y+Y0, where X0 and Y0 are positional offsets relative to the position of the thread. A thread is viewed as being assigned to the position within the output array line group that its output value is written to. A corresponding, same position is readily identifiable in the input array line group and in any statistics table.
As an example, if a thread is assigned a specific X, Y position in an output array LINEGROUP_1, the instruction

LOAD LINEGROUP_1[(X-1);(Y-1);Z]

would load from LINEGROUP_1 of the input array the value that is one pixel location to the left and one pixel location down from the same X, Y position.
A simple blur kernel that averages the pixel values for the X, Y position along with its left and right neighbors can therefore be written in pseudocode as depicted in Fig. 5a. As observed in Fig. 5a, the position ((X);(Y)) corresponds to the position of the virtual processor that is writing to the output array. In the above pseudocode, LOAD corresponds to the opcode for a load from the input array and STORE corresponds to the opcode for a store to the output array. Note that a LINEGROUP_1 exists in the input array and a LINEGROUP_1 exists in the output array.
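The Fig. 5a blur kernel can be rendered as a minimal Python sketch, one function call per thread. This is a hedged approximation of the pseudocode (the function name and integer division are choices made for the example; the virtual ISA's DIV semantics are not specified here), but it follows the same LOAD/ADD/DIV/STORE sequence.

```python
# Hypothetical Python rendering of the Fig. 5a blur kernel: each thread
# averages its own pixel with its left and right neighbors and stores
# the result at its assigned (X, Y) output position.

def blur_thread(inp, x, y):
    r1 = inp[y][x - 1]           # LOAD LINEGROUP_1[(X-1);(Y);0]
    r2 = inp[y][x]               # LOAD LINEGROUP_1[(X);(Y);0]
    r3 = inp[y][x + 1]           # LOAD LINEGROUP_1[(X+1);(Y);0]
    return (r1 + r2 + r3) // 3   # ADD, ADD, DIV by 3

inp = [[3, 6, 9, 12]]
out = [[0] * 4]
for x in (1, 2):                 # interior output positions only
    out[0][x] = blur_thread(inp, x, 0)  # STORE LINEGROUP_1[(X);(Y);0]
```

Only interior positions are computed here, consistent with the document's practice of ignoring image border regions for simplicity.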
Fig. 5b depicts scaled images for the purpose of explaining the scaling features of the relative positioning load and store instruction formats. Down-sampling refers to the transformation of a higher-resolution image into a lower-resolution image by providing, in the output image, fewer than all of the pixels that exist in the input image. Up-sampling refers to the transformation of a lower-resolution image into a higher-resolution image by creating, in the output image, more pixels than exist in the input image.
For example, referring to Fig. 5b, if image 501 represents the input image and image 502 represents the output image, down-sampling will be performed because there are fewer pixels in the output image than in the input image. Here, for each pixel in the output image, the pertinent pixels in the input image that determine the output value for an output pixel progress "farther away" from the output pixel location as one moves along either axis of the output image. For example, for a 3:1 down-sampling ratio, the first pixel in the output image along either axis corresponds to the first, second, and third pixels along the same axis in the input image, the second output pixel corresponds to the fourth, fifth, and sixth input pixels, and so on. Thus the first output pixel has a pertinent pixel at the third location while the second output pixel has a pertinent pixel at the sixth location.
As such, the XS and YS multiplicand terms in the relative positioning instruction format are used to implement down-sampling. If the blur pseudocode of Fig. 5a were rewritten for 3:1 down-sampling along both axes, the code would become:
R1 <= LOAD LINEGROUP_1[((3X)-1); 3(Y); 0]
R2 <= LOAD LINEGROUP_1[3(X); 3(Y); 0]
R3 <= LOAD LINEGROUP_1[((3X)+1); 3(Y); 0]
R2 <= ADD R1, R2
R2 <= ADD R2, R3
R2 <= DIV R2, 3
STORE LINEGROUP_1[(X); (Y); (0)]; R2
By contrast, in the case of 1:3 up-sampling (e.g., image 502 is the input image and image 501 is the output image), the XD and YD divisors would be used to create three output pixels for every input pixel along either axis. As such, the blur code would be rewritten as:
R1 <= LOAD LINEGROUP_1[(X-1)/3; (Y)/3; 0]
R2 <= LOAD LINEGROUP_1[(X)/3; (Y)/3; 0]
R3 <= LOAD LINEGROUP_1[(X+1)/3; (Y)/3; 0]
R2 <= ADD R1, R2
R2 <= ADD R2, R3
R2 <= DIV R2, 3
STORE LINEGROUP_1[(X); (Y); (0)]; R2
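The (X*XS+X0)/XD coordinate mapping that underlies both rewrites above can be checked with a short Python sketch. The helper name and defaults are invented for the example; the arithmetic is exactly the operand form from the instruction format, with integer division standing in for the /XD term.

```python
# Hypothetical sketch of the (X*XS+X0)/XD coordinate mapping: the XS/YS
# multiplicands realize down-sampling (several input pixels per output
# pixel) and the XD/YD divisors realize up-sampling.

def map_coord(x, xs=1, x0=0, xd=1):
    """Map an output coordinate to an input coordinate: (X*XS + X0) // XD."""
    return (x * xs + x0) // xd

# 3:1 down-sampling: output pixels 0, 1, 2 draw from input pixels 0, 3, 6.
down = [map_coord(x, xs=3) for x in range(3)]

# 1:3 up-sampling: output pixels 0..5 draw from input pixels 0,0,0,1,1,1,
# i.e., each input pixel produces three output pixels along the axis.
up = [map_coord(x, xd=3) for x in range(6)]
```

The `down` list reproduces the stride-3 progression described for Fig. 5b, and the `up` list shows how the divisor causes three consecutive output positions to share one input pixel.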
In various embodiments, the instruction format for instructions that access the private, constant, and lookup portions of memory includes an operand that also takes the form a*b+c, where a is a base position, b is a scaling term, and c is an offset. Here, however, a linear addressing approach is taken, in which the a*b+c value essentially corresponds to a linear index applied to the targeted table. Each of these instructions also includes, in the opcode, an identifier of the memory region being accessed. For example, an instruction that performs a lookup from the lookup table memory region may be expressed as

LOAD LKUP_(name)[(A*B+C)]
where LOAD is the opcode that identifies a load operation and LKUP_(name) specifies the name of the lookup table in the lookup table memory region being accessed. Again, multiple lookup tables may be used by a thread, and so a naming scheme is used to identify the appropriate one of the more than one lookup table that exists in the memory region.
A similar format with similarly minded opcodes may be utilized for instructions that target the constant and private memory regions (e.g., LOAD CNST_(name)[(A*B+C)]; LOAD PRVT_(name)[(A*B+C)]). In an embodiment, lookup table accesses and constant table accesses are read-only (a processor cannot change the data that has been placed there); as such, no STORE instructions exist for these memory regions. In an embodiment, the private region of memory is read/write, and as such a store instruction exists for that region (e.g., STORE PRVT[(A*B+C)]).
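The contrast between this linear A*B+C addressing and the two-dimensional position-relative scheme can be shown with a small Python sketch. The function name and the toy gamma table are assumptions made for the example, not part of the virtual ISA.

```python
# Hypothetical sketch of the linear A*B+C addressing used for the lookup
# table, constant, and private memory regions, in contrast to the
# two-dimensional position-relative scheme used for the I/O arrays.

def lkup_load(table, a, b=1, c=0):
    """Model of LOAD LKUP_(name)[(A*B+C)]: a plain linear index."""
    return table[a * b + c]

gamma_table = [0, 1, 4, 9, 16, 25]   # toy lookup table: f(i) = i*i
value = lkup_load(gamma_table, a=2, b=2, c=1)   # index 2*2+1 = 5
```

Because the same index is typically broadcast to all threads in a SIMD step, every virtual processor performing this lookup in the same cycle would receive the same result, consistent with the globally accessible lookup region described earlier.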
In various embodiments, each virtual processor includes general-purpose registers that can contain integer, floating-point, or fixed-point values. Additionally, the general-purpose registers may contain data values of configurable bit widths, such as 8, 16, or 32 bit values. Thus, the image data at each pixel location in an input array or output array can have a data size of 8, 16, or 32 bits. Here, a virtual processor can be configured with an execution mode that establishes the bit size and the numerical format of the values within the general-purpose registers. Instructions may also specify immediate operands (input operands whose values are expressed directly in the instruction itself rather than found in a specified register). Immediate operands can also have configurable 8, 16, or 32 bit widths.
In an extended embodiment, each virtual processor is also capable of operating in a scalar mode or a SIMD mode internal to itself. That is, the data within a specific array location may be viewed as a scalar value or as a vector having multiple elements. For example, a first configuration may establish scalar operation of 8 bits, where each image array position holds a scalar 8-bit value. By contrast, another configuration may establish parallel/SIMD operation of 32 bits, where each image array position is assumed to hold four 8-bit values for a total data size of 32 bits per array position.
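The four-lanes-in-a-word configuration just described can be sketched in Python. This is a hedged illustration of the packing idea only; the helper names, lane ordering, and wraparound-on-overflow behavior are assumptions for the example, not specified by the document.

```python
# Hypothetical sketch of per-processor scalar vs. SIMD configurations:
# the same 32-bit register value may be treated as one scalar or as four
# packed 8-bit elements.

def pack4x8(vals):
    """Pack four 8-bit values into one 32-bit word (lowest lane first)."""
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0xFF) << (8 * i)
    return word

def unpack4x8(word):
    """Recover the four 8-bit lanes from a 32-bit word."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def simd_add4x8(a, b):
    """Lane-wise 8-bit add with per-lane wraparound (no cross-lane carry)."""
    return pack4x8([(x + y) & 0xFF
                    for x, y in zip(unpack4x8(a), unpack4x8(b))])

w = simd_add4x8(pack4x8([1, 2, 3, 250]), pack4x8([1, 1, 1, 10]))
```

The key design point the sketch captures is that a lane overflow (250 + 10) wraps within its own 8-bit lane instead of carrying into the neighboring lane, which is what distinguishes a SIMD add from a plain 32-bit scalar add on the same register bits.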
In various embodiments, each virtual processor also includes registers for holding predicate values. A single predicate value is often only one bit in length and expresses the outcome of an opcode that performs a true/false or greater-than/less-than test on existing data. Predicate values are used, for example, to determine branch directions through the code during execution (and are therefore used as operands in conditional branch instructions). Predicate values can also be expressed as immediate operands in an instruction.
In various embodiments, each virtual processor includes registers for holding scalar values. Here, scalar values are stored into and read from the partition space of the memory model that is reserved for constants (as discussed above with respect to Fig. 4c). Here, each virtual processor of a group of virtual processors that are processing the same image uses the same scalar value from the constant memory space. In extended embodiments, scalar predicates also exist. These are values kept in register space that meet the definition of both a predicate and a scalar.
In various embodiments, each virtual processor is designed as a RISC-like instruction set whose supported arithmetic instruction opcodes include any workable combination of the following: 1) ADD (addition of operands A and B); 2) SUB (subtraction of operands A and B); 3) MOV (move operand from one register to another); 4) MUL (multiply operands A and B); 5) MAD (multiply operands A and B and add C to the result); 6) ABS (return the absolute value of operand A); 7) DIV (divide operand B by operand A); 8) SHL (shift operand A to the left); 9) SHR (shift operand A to the right); 10) MIN/MAX (return the larger of operands A and B); 11) SEL (select a specified byte of operand A); 12) AND (return the logical AND of operands A and B); 13) OR (return the logical OR of operands A and B); 14) XOR (return the logical exclusive OR of operands A and B); 15) NOT (return the logical inverse of operand A).
The instruction set also includes standard predicate operations such as: 1) SEQ (return 1 if A equals B); 2) SNE (return 1 if A does not equal B); 3) SLT (return 1 if A is less than B); 4) SLE (return 1 if A is less than or equal to B). Control flow instructions are also included, such as JMP (jump) and BRANCH, each of which may include a nominal variable or a predicate as an operand.
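The predicate instructions just listed can be modeled in a few lines. This sketch is an assumption-laden illustration (the dictionary-of-lambdas structure and the `branch_taken` helper are invented for the example); it only demonstrates the documented semantics that each comparison yields a one-bit value which a conditional branch can then consume as an operand.

```python
# Hypothetical model of the predicate opcodes: each comparison produces a
# 1-bit true/false value that can later steer a conditional branch.

PRED_OPS = {
    "SEQ": lambda a, b: 1 if a == b else 0,   # return 1 if A equals B
    "SNE": lambda a, b: 1 if a != b else 0,   # return 1 if A != B
    "SLT": lambda a, b: 1 if a < b else 0,    # return 1 if A < B
    "SLE": lambda a, b: 1 if a <= b else 0,   # return 1 if A <= B
}

def branch_taken(opcode, a, b):
    """A BRANCH that uses the predicate result as its operand."""
    return PRED_OPS[opcode](a, b) == 1

p = PRED_OPS["SLT"](3, 7)   # predicate register now holds 1
```
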
D. Application Software Development and Simulation Environment
Fig. 6 depicts an application software development and simulation environment 601. As discussed above with respect to Fig. 2, a developer may develop a comprehensive image processing function (e.g., an image processing pipeline in which each stage of the pipeline performs a dedicated image processing task, some other DAG-prescribed set of routines, etc.) by arranging kernels in a strategic sequence that is consistent with the overall intended image transformation. Kernels may be called up from a library 602 and/or the developer may develop one or more custom kernels.
Kernels within the library 602 may be provided by a third-party vendor of kernels and/or a provider of any underlying technology (e.g., a vendor of a hardware platform that includes the targeted hardware image processor, or a vendor of the targeted hardware image processor itself (e.g., provided as a design thereof or as actual hardware)).
In the case of custom-developed kernels, in many situations the developer need only write the program code for a single thread 603. That is, the developer need only write program code that determines a single output pixel value by referencing input pixel values relative to the output pixel location (e.g., with the aforementioned position-relative memory access instruction format). Upon satisfaction of the operation of the single thread 603, the development environment may then automatically instantiate multiple instances of the thread code on respective virtual processors to effect a kernel on an array of processors that operate on an image surface area. The image surface area may be a section of an image frame (such as a line group).
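This write-once, replicate-everywhere pattern can be sketched briefly in Python. This is a hedged, toy model of the development environment's behavior (the `run_kernel` helper and the inversion kernel are inventions for the example): the developer supplies only the per-pixel function, and the environment instantiates it once per output position.

```python
# Hypothetical sketch of replicating single-thread kernel code across a
# processor array: the developer writes one per-pixel function, and the
# environment instantiates it at every output position of a region.

def run_kernel(thread_fn, inp, width, height):
    """Instantiate thread_fn once per output position, as the development
    environment would instantiate virtual processors over a line group."""
    return [[thread_fn(inp, x, y) for x in range(width)]
            for y in range(height)]

def invert_thread(inp, x, y):
    # The only code the developer writes: one output pixel, referenced
    # relative to the thread's own (x, y) output position.
    return 255 - inp[y][x]

out = run_kernel(invert_thread, [[0, 128], [255, 1]], 2, 2)
```
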
In various embodiments, the custom thread program code is written in the object code of the virtual processor ISA (or in a higher-level language that is compiled down to virtual processor ISA object code). Execution of the custom kernel's program code may be simulated in a simulated runtime environment that includes a virtual processor accessing a memory organized according to the memory model. Here, software models (object-oriented or otherwise) of a virtual processor 604 and of a memory 605 that incorporates the memory model are instantiated.
The virtual processor model 604 then simulates execution of the thread code 603. Upon satisfaction of the performance of a thread, its larger kernel, and any larger function to which the kernel belongs, the whole is compiled into the actual object code of the underlying hardware. The entirety of the simulation environment 601 may be implemented as software that runs on a computer system (e.g., a workstation) 606.
2.0 Hardware Architecture Embodiments
A. Image Processor Hardware Architecture and Operation
Fig. 7 shows an embodiment of an architecture 700 for an image processor implemented in hardware. The image processor may be targeted, for example, by a compiler that converts program code written for a virtual processor within a simulated environment into program code that is actually executed by the hardware processor. As shown in Fig. 7, the architecture 700 includes a plurality of line buffer units 701_1 through 701_M interconnected to a plurality of stencil processor units 702_1 through 702_N and corresponding sheet generator units 703_1 through 703_N through a network 704 (e.g., a network-on-chip (NOC) including an on-chip switch network, an on-chip ring network, or another kind of network). In an embodiment, any line buffer unit may connect to any sheet generator and corresponding stencil processor through the network 704.
In an embodiment, program code is compiled and loaded onto a corresponding stencil processor 702 to perform the image processing operations earlier defined by a software developer (program code may also be loaded onto the stencil processor's associated sheet generator 703, e.g., depending on design and implementation). In at least some instances, an image processing pipeline may be realized by loading a first kernel program for a first pipeline stage into a first stencil processor 702_1, loading a second kernel program for a second pipeline stage into a second stencil processor 702_2, and so on, where the first kernel performs the functions of the first stage of the pipeline, the second kernel performs the functions of the second stage of the pipeline, etc., and additional control flow methods are installed to pass output image data from one stage of the pipeline to the next.
In other configurations, the image processor may be realized as a parallel machine having two or more stencil processors 702_1, 702_2 operating the same kernel program code. For example, a highly dense and high-data-rate stream of image data may be processed by spreading frames across multiple stencil processors, each of which performs the same function.
In yet other configurations, essentially any DAG of kernels may be loaded onto the hardware processor by configuring respective stencil processors with their own respective kernels of program code and by configuring appropriate control flow hooks into the hardware, per the DAG design, to direct output images from one kernel to the input of the next.
As a general flow, frames of image data are received by a macro I/O unit 705 and passed, frame by frame, to one or more of the line buffer units 701. A particular line buffer unit parses its frame of image data into a smaller region of image data, referred to as a "line group," and then passes the line group through the network 704 to a particular sheet generator. A complete or "full" singular line group may be composed, for example, of the data of multiple contiguous complete rows or columns of a frame (for simplicity, this specification will mainly refer to contiguous rows). The sheet generator further parses the line group of image data into a smaller region of image data, referred to as a "sheet," and presents the sheet to its corresponding stencil processor.
In the case of an image processing pipeline or DAG flow having a single input, generally, input frames are directed to the same line buffer unit 701_1, which parses the image data into line groups and directs the line groups to the sheet generator 703_1, whose corresponding stencil processor 702_1 is executing the code of the first kernel in the pipeline/DAG. Upon completion of operations by the stencil processor 702_1 on the line groups it processes, the sheet generator 703_1 sends output line groups to a "downstream" line buffer unit 701_2 (in some use cases, the output line group may instead be sent back to the same line buffer unit 701_1 that earlier sent the input line groups).
One or more "consumer" kernels, which represent the next stage/operation in the pipeline/DAG executing on their own respective other sheet generators and stencil processors (e.g., sheet generator 703_2 and stencil processor 702_2), then receive from the downstream line buffer unit 701_2 the image data generated by the first stencil processor 702_1. In this manner, a "producer" kernel operating on a first stencil processor has its output data forwarded to a "consumer" kernel operating on a second stencil processor, where the consumer kernel performs the next set of tasks after the producer kernel, consistent with the design of the overall pipeline or DAG.
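The producer/consumer handoff through a line buffer can be sketched with a toy two-stage pipeline in Python. The stage functions and the use of a plain queue are assumptions made purely for illustration; the sketch only models the dataflow (producer writes line groups into a downstream buffer, consumer drains them), not the hardware units themselves.

```python
# Hypothetical sketch of the producer/consumer dataflow: a line buffer
# queues line groups between a producer kernel on one stencil processor
# and a consumer kernel on the next.
from collections import deque

def producer_kernel(line_group):   # stage 1 of the toy pipeline
    return [p * 2 for p in line_group]

def consumer_kernel(line_group):   # stage 2 of the toy pipeline
    return [p + 1 for p in line_group]

line_buffer = deque()              # stands in for the downstream line buffer unit
frame = [[1, 2], [3, 4]]           # a frame already parsed into line groups

for group in frame:                # producer side: fill the line buffer
    line_buffer.append(producer_kernel(group))

# consumer side: drain the line buffer in arrival order
result = [consumer_kernel(line_buffer.popleft()) for _ in range(len(frame))]
```

The FIFO discipline of the queue mirrors the fact that line groups flow downstream in order, with the buffer decoupling the two kernels' execution.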
A stencil processor 702 is designed to simultaneously operate on multiple overlapping stencils of image data. The multiple overlapping stencils and the internal hardware processing capacity of the stencil processor effectively determine the size of a sheet. Here, within a stencil processor 702, arrays of execution lanes operate in unison to simultaneously process the image data surface area covered by the multiple overlapping stencils.
As described in more detail below, in various embodiments, sheets of image data are loaded into a two-dimensional register array structure within the stencil processor 702. The use of sheets and of the two-dimensional register array structure is believed to effectively provide for power consumption improvements by moving a large amount of data into a large amount of register space, e.g., as a single load operation, with processing tasks then performed directly on the data by an execution lane array immediately thereafter. Additionally, the use of an execution lane array and corresponding register array provides for different stencil sizes that are easily programmable/configurable.
Figs. 8a through 8e illustrate, at a high level, embodiments of the parsing activity of a line buffer unit 701, the finer-grained parsing activity of a sheet generator unit 703, and the stencil processing activity of the stencil processor 702 that is coupled to the sheet generator unit 703.
Fig. 8a depicts an embodiment of an input frame of image data 801. Fig. 8a also depicts an outline of three overlapping stencils 802 (each having a dimension of 3 pixels x 3 pixels) that a stencil processor is designed to operate over. The output pixel for which each stencil respectively generates output image data is highlighted in solid black. For simplicity, the three overlapping stencils 802 are depicted as overlapping only in the vertical direction. It is pertinent to recognize that, in actuality, a stencil processor may be designed to have overlapping stencils in both the vertical and horizontal directions.
Because of the vertically overlapping stencils 802 within the stencil processor, as observed in Fig. 8a, there exists a wide band of image data within the frame that a single stencil processor can operate over. As discussed in more detail below, in an embodiment, the stencil processor processes data within its overlapping stencils in a left-to-right fashion across the image data (and then repeats for the next set of lines, in top-to-bottom order). Thus, as the stencil processor continues forward with its operation, the number of solid black output pixel blocks grows to the right horizontally. As discussed above, a line buffer unit 701 is responsible for parsing, from an incoming frame, a line group of input image data that is sufficient for the stencil processor to operate over for an extended number of upcoming cycles. An exemplary depiction of a line group is illustrated as shaded region 803. In an embodiment, as described further below, the line buffer unit 701 can comprehend different dynamics for sending/receiving a line group to/from a sheet generator. For example, according to one mode, referred to as "full group," the complete full-width lines of image data are passed between a line buffer unit and a sheet generator. According to a second mode, referred to as "virtually tall," a line group is passed initially with a subset of the full-width rows. The remaining rows are then passed sequentially in smaller (less than full-width) pieces.
With the line group 803 of the input image data having been defined by the line buffer unit and passed to the sheet generator unit, the sheet generator unit further parses the line group into finer sheets that are more precisely fitted to the hardware limitations of the stencil processor. More specifically, as described in more detail further below, in an embodiment, each stencil processor consists of a two-dimensional shift register array. The two-dimensional shift register array essentially shifts image data "beneath" an array of execution lanes, where the pattern of the shifting causes each execution lane to operate on data within its own respective stencil (that is, each execution lane processes its own stencil of information to generate an output for that stencil). In an embodiment, sheets are surface areas of input image data that "fill" or are otherwise loaded into the two-dimensional shift register array.
Thus, as observed in Fig. 8b, the sheet generator parses an initial sheet 804 from the line group 803 and provides it to the stencil processor (here, the sheet of data corresponds to the shaded region generally identified by reference number 804). As observed in Figs. 8c and 8d, the stencil processor operates on the sheet of input image data by effectively moving the overlapping stencils 802 in a left-to-right fashion over the sheet. As of Fig. 8d, the number of pixels for which an output value could be calculated from the data within the sheet is exhausted (no other pixel positions can have an output value determined from the information within the sheet). For simplicity, the border regions of the image have been ignored.
As observed in Fig. 8e, the sheet generator then provides a next sheet 805 for the stencil processor to continue operations on. Note that the initial positions of the stencils as they begin operation on the next sheet are the next progression to the right from the point of exhaustion on the first sheet (as depicted previously in Fig. 8d). With the new sheet 805, the stencils will simply continue moving to the right as the stencil processor operates on the new sheet in the same manner as with the processing of the first sheet.
Note that there is some overlap between the data of the first sheet 804 and the data of the second sheet 805 owing to the border regions of the stencils that surround an output pixel location. The overlap could be handled simply by the sheet generator re-transmitting the overlapping data twice. In an alternate implementation, to feed the next sheet to the stencil processor, the sheet generator may proceed to send only new data to the stencil processor, and the stencil processor reuses the overlapping data from the previous sheet.
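The re-transmission approach for overlapping sheet borders can be sketched as a simple interval computation. This is a hypothetical model only: the `halo` parameter, the function name, and the specific widths are invented for the example, and real sheet sizing would be dictated by the hardware dimensions discussed below.

```python
# Hypothetical sketch of parsing a line group into sheets with
# overlapping borders: consecutive sheets share `halo` columns so that
# stencils spanning a sheet boundary still see all of their input pixels.

def parse_into_sheets(line_group_width, sheet_width, halo):
    """Return (start, end) column ranges of consecutive sheets; each new
    sheet starts `halo` columns before the previous one ended."""
    sheets, start = [], 0
    while start + sheet_width <= line_group_width:
        sheets.append((start, start + sheet_width))
        start += sheet_width - halo   # re-send `halo` overlapping columns
    return sheets

sheets = parse_into_sheets(line_group_width=10, sheet_width=4, halo=2)
```

With a halo of 2, each sheet's last two columns are re-sent as the next sheet's first two columns, modeling the "transmit the overlapping data twice" option described above.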
B. Design and operation of the stencil processor
Fig. 9 a shows the embodiment of template processor framework 900.As illustrated in fig. 9, template processor includes that data calculate list Member 901, scalar processor 902 and associated memory 903 and I/O unit 904.Data Computation Unit 901 includes executing Channel array 905, two-dimensional shift array structure 906 and independent random access memory associated with the specific row or column of array Device 907.
The I/O unit 904 is responsible for loading "input" sheets of data received from the sheet generator into the data computation unit 901 and for storing "output" sheets of data from the stencil processor into the sheet generator. In one embodiment, loading sheet data into the data computation unit 901 entails parsing a received sheet into rows/columns of image data and loading the rows/columns of image data into the two-dimensional shift register structure 906 or into the respective random access memories 907 of the rows/columns of the execution lane array (described in more detail below). If the sheet is initially loaded into memories 907, the individual execution lanes within the execution lane array 905 may then load sheet data from the random access memories 907 into the two-dimensional shift register structure 906 when appropriate (e.g., as a load instruction just prior to operating on the sheet's data). Upon completion of the loading of a sheet of data into the register structure 906 (whether directly from the sheet generator or from memories 907), the execution lanes of the execution lane array 905 operate on the data and eventually "write back" the finished data as a sheet, either directly to the sheet generator or into the random access memories 907. If the latter, the I/O unit 904 subsequently fetches the data from the random access memories 907 to form an output sheet, which is then forwarded to the sheet generator.
The scalar processor 902 includes a program controller 909 that reads the instructions of the stencil processor's program code from scalar memory 903 and issues the instructions to the execution lanes in the execution lane array 905. In one embodiment, a single same instruction is broadcast to all execution lanes within the array 905 to effect SIMD-like behavior from the data computation unit 901. In one embodiment, the instruction format of the instructions read from scalar memory 903 and issued to the execution lanes of the execution lane array 905 includes a very-long-instruction-word (VLIW) type format that includes more than one opcode per instruction. In a further embodiment, the VLIW format includes both an ALU opcode that directs a mathematical function performed by each execution lane's ALU (which, as described below, may in one embodiment specify more than one traditional ALU operation) and a memory opcode (that directs a memory operation for a specific execution lane or set of execution lanes).
Term " executing channel " refers to the set for being able to carry out one or more execution units of instruction (for example, can hold The logic circuit of row instruction).However, in various embodiments, executing channel can include the more many places in addition to execution unit Manage device class function.For example, executing channel can also include solving to the instruction received in addition to one or more execution units The logic circuit of code, or in the case where more class MIMD design, the logic circuit including extracting reconciliation code instruction.About Class MIMD method can in the embodiment of various alternatives although being largely described to centralized program control method herein To realize more distributed method (e.g., including program code and cyclelog in each execution channel of array 905).
The combination of an execution lane array 905, a program controller 909, and a two-dimensional shift register structure 906 provides a widely adaptable/configurable hardware platform for a broad range of programmable functions. For example, given that each execution lane is able to perform a wide variety of functions and is able to readily access input image data proximate to any output array location, application software developers are able to program kernels having a wide range of different functional capabilities as well as dimensions (e.g., stencil size).
Apart from acting as a data store for image data being operated on by the execution lane array 905, the random access memories 907 may also keep one or more look-up tables, such as any of the look-up tables held in the look-up table component of the virtual processing memory described above in Section 1.0. In various embodiments, one or more scalar look-up tables may also be instantiated within the scalar memory 903. The one or more scalar look-up tables may be any of the scalar look-up tables held in the scalar look-up table component of the memory model described above in Section 1.0.
A scalar look-up involves passing the same data value from the same look-up table at the same index to each of the execution lanes within the execution lane array 905. In various embodiments, the VLIW instruction format described above is expanded to also include a scalar opcode that directs a look-up operation performed by the scalar processor into a scalar look-up table. The index specified for use with the opcode may be an immediate operand or may be fetched from some other data storage location. Regardless, in one embodiment, a look-up from a scalar look-up table within scalar memory essentially involves broadcasting the same data value to all execution lanes within the execution lane array 905 during the same clock cycle. Additional details concerning the use and operation of look-up tables are provided further below.
Fig. 9 b summarizes the embodiment of VLIW instruction word as discussed above.As shown in figure 9b, VLIW instruction word format includes The field individually instructed for three: the 1) scalar instruction 951 executed by scalar processor;2) by each in execution channel array The ALU instruction 952 that a ALU is broadcast and executed in a manner of SIMD;And 3) the memory broadcast and executed in a manner of the SIMD of part Instruction 953 (for example, if executing the same random access memory of execution channels share in channel array along same a line, comes One execution channel of every row from not going together is practical to be executed instruction, and (format of memory instructions 953 may include that identification comes The operand executed instruction from which execution channel of every row)).
A field 954 for one or more immediate operands is also included. Which of the instructions 951, 952, 953 use which immediate operand information may be identified in the instruction format. Each of the instructions 951, 952, 953 also includes its own respective input operand and resultant information (e.g., local registers for ALU operations, and a local register and a memory address for memory access instructions). In one embodiment, the scalar instruction 951 is executed by the scalar processor before the execution lanes within the execution lane array execute either of the other two instructions 952, 953. That is, the execution of the VLIW word includes a first cycle during which the scalar instruction 951 is executed, followed by a second cycle during which the other instructions 952, 953 may be executed (note that in various embodiments instructions 952 and 953 may be executed in parallel).
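The three-slot word can be modeled as a small record. Field names below are illustrative (the patent does not fix an encoding); the point is only that one word carries a scalar slot, a broadcast ALU slot, a partial-SIMD memory slot, and shared immediates:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class VLIWWord:
    scalar: str                       # field 951: scalar processor, first cycle
    alu: str                          # field 952: broadcast to every lane's ALU
    mem: str                          # field 953: partial-SIMD memory operation
    immediates: Tuple[int, ...] = ()  # field 954: shared immediate operands

NOOP = "NOOP"
# An ordinary word: scalar slot idle, lanes add and store.
word = VLIWWord(scalar=NOOP, alu="ADD R1, R1, R2", mem="ST R1, [M]")
print(word.alu)  # → ADD R1, R1, R2
```

The two-phase execution described above would then dispatch `word.scalar` on the first cycle and `word.alu`/`word.mem` together on the second.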
In one embodiment, the scalar instructions executed by the scalar processor include commands issued to the sheet generator to load sheets from, or store sheets into, the memories or 2D shift register of the data computation unit. Here, the sheet generator's operation can depend on the operation of the line buffer unit or on other variables that prevent a pre-runtime comprehension of the number of cycles it will take the sheet generator to complete any command issued by the scalar processor. As such, in one embodiment, any VLIW word whose scalar instruction 951 corresponds to, or otherwise causes, a command to be issued to the sheet generator also includes no-operation (NOOP) instructions in the other two instruction fields 952, 953. The program code then enters a loop of NOOP instructions for instruction fields 952, 953 until the sheet generator completes its load/store to/from the data computation unit. Here, upon issuing a command to the sheet generator, the scalar processor may set a bit of an interlock register that the sheet generator resets upon completion of the command. During the NOOP loop the scalar processor monitors the interlock bit. When the scalar processor detects that the sheet generator has completed its command, normal execution begins again.
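The NOOP/interlock handshake can be sketched as a toy model. The assumption of a fixed, program-invisible command latency is purely for illustration; in the real design the latency depends on the line buffer unit and is unknowable ahead of time:

```python
class SheetGeneratorModel:
    """Completes a command after `latency` ticks, then clears the interlock."""
    def __init__(self, latency):
        self.latency = latency
        self.remaining = 0
        self.interlock = None

    def command(self, interlock):
        interlock[0] = 1             # scalar processor sets the bit on issue
        self.interlock = interlock
        self.remaining = self.latency

    def tick(self):
        if self.remaining:
            self.remaining -= 1
            if self.remaining == 0:
                self.interlock[0] = 0  # sheet generator resets it when done

interlock = [0]
sg = SheetGeneratorModel(latency=3)
sg.command(interlock)
noop_cycles = 0
while interlock[0]:                  # fields 952/953 hold NOOPs meanwhile
    sg.tick()
    noop_cycles += 1
print(noop_cycles)  # → 3
```

The scalar processor's loop body does no useful work; it only polls the bit, which is exactly the NOOP-loop behavior the text describes.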
Fig. 10 shows an embodiment of a data computation component 1001. As observed in Fig. 10, the data computation component 1001 includes an execution lane array 1005 that is logically positioned "above" a two-dimensional shift register array structure 1006. As discussed above, in various embodiments a sheet of image data provided by the sheet generator is loaded into the two-dimensional shift register 1006. The execution lanes then operate on the sheet data from the register structure 1006.
The execution lane array 1005 and the shift register structure 1006 are fixed in position relative to one another. However, the data within the shift register array 1006 shifts in a strategic and coordinated fashion so as to cause each execution lane in the execution lane array to process a different stencil within the data. As such, each execution lane determines the output image value for a different pixel in the output sheet being generated. From the architecture of Fig. 10 it should be clear that overlapping stencils are not only arranged vertically but also horizontally, as the execution lane array 1005 includes vertically adjacent execution lanes as well as horizontally adjacent execution lanes.
Some notable architectural features of the data computation unit 1001 include the shift register structure 1006 having wider dimensions than the execution lane array 1005. That is, there is a "halo" of registers 1009 outside the execution lane array 1005. Although the halo 1009 is shown as existing on two sides of the execution lane array, depending on the implementation the halo may exist on fewer (one) or more (three or four) sides of the execution lane array 1005. The halo 1009 serves to provide "spill-over" space for data that spills outside the bounds of the execution lane array 1005 as the data is shifting "beneath" the execution lanes 1005. As a simple case, a 5×5 stencil centered on the right edge of the execution lane array 1005 will need four halo register locations further to the right when the stencil's leftmost pixels are processed. For ease of drawing, Fig. 10 shows the registers on the right side of the halo as only having horizontal shift connections and the registers on the bottom side of the halo as only having vertical shift connections when, in a nominal embodiment, registers on either side (right, bottom) would have both horizontal and vertical connections.
Additional spill-over room is provided by random access memories 1007 that are coupled to each row and/or each column in the array, or portions thereof (e.g., a random access memory may be assigned to a "region" of the execution lane array that spans 4 execution lane rows and 2 execution lane columns; for simplicity, the remainder of the application refers mainly to row- and/or column-based allocation schemes). Here, if an execution lane's kernel operations require it to process pixel values outside of the two-dimensional shift register array 1006 (which some image processing routines may require), the plane of image data is able to spill over further, e.g., from the halo region 1009 into a random access memory 1007. For example, consider a 6×6 stencil where the hardware includes a halo region of only four storage elements to the right of an execution lane on the right edge of the execution lane array. In this case, the data would need to be shifted further to the right off the right edge of the halo 1009 to fully process the stencil. Data that is shifted outside the halo region 1009 would then spill over into random access memory 1007. Other applications of the random access memories 1007 and the stencil processor of Fig. 3 are provided further below.
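One way to read the numbers in this paragraph is that a full left-to-right traversal of a stencil shifts the data by the stencil width minus one, which matches both examples (5×5 needs four halo registers; a 6×6 stencil against a four-deep halo overflows by one column). The arithmetic below is offered as an interpretation under that assumption, not a hardware rule:

```python
def halo_columns_needed(stencil_width):
    """Halo columns a full stencil traversal needs past the array edge,
    under the width-minus-one interpretation of the text's examples."""
    return stencil_width - 1

def spill_columns(stencil_width, halo_depth):
    """Columns that overflow the halo and must spill into RAM 1007."""
    return max(0, halo_columns_needed(stencil_width) - halo_depth)

print(halo_columns_needed(5))  # → 4  (the 5x5 example)
print(spill_columns(6, 4))     # → 1  (the 6x6 example with a 4-deep halo)
```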
Figs. 11a through 11k demonstrate a working example of the manner in which image data is shifted within the two-dimensional shift register array "beneath" the execution lane array, as alluded to above. As observed in Fig. 11a, the data contents of the two-dimensional shift array are depicted in a first array 1107 and the execution lane array is depicted by a frame 1105. Also, two neighboring execution lanes 1110 within the execution lane array are simplistically depicted. In this simplistic depiction 1110, each execution lane includes a register R1 that can accept data from the shift register, accept data from an ALU output (e.g., to behave as an accumulator across cycles), or write output data into an output destination.
Each execution lane also has available, in a local register R2, the contents "beneath" it in the two-dimensional shift array. Thus, R1 is a physical register of the execution lane while R2 is a physical register of the two-dimensional shift register array. The execution lane includes an ALU that can operate on operands provided by R1 and/or R2. As will be described in more detail further below, in one embodiment the shift register is actually implemented with multiple (a "depth" of) storage/register elements per array location, but the shifting activity is limited to one plane of storage elements (e.g., only one plane of storage elements can shift per cycle). Figs. 11a through 11k depict one of these deeper register locations as being used to store the resultant X from the respective execution lanes. For illustrative ease, the deeper resultant register is drawn alongside rather than beneath its counterpart register R2.
Figs. 11a through 11k focus on the calculation of two stencils whose central positions are aligned with the pair of execution lane positions 1111 depicted within the execution lane array. For ease of illustration, the pair of execution lanes 1110 are drawn as horizontal neighbors when, in fact, according to the following example, they are vertical neighbors.
As observed initially in Fig. 11a, the execution lanes are centered on their central stencil locations. Fig. 11b shows the object code executed by both execution lanes. As observed in Fig. 11b, the program code of both execution lanes causes the data within the shift register array to shift down one position and shift right one position. This aligns both execution lanes with the upper left-hand corner of their respective stencils. The program code then causes the data located at their respective locations (in R2) to be loaded into R1.
As observed in Fig. 11c, the program code next causes the pair of execution lanes to shift the data within the shift register array one unit to the left, which causes the value to the right of each execution lane's respective position to be shifted into each execution lane's position. The value in R1 (the previous value) is then added to the new value that has shifted into the execution lane's position (in R2). The resultant is written into R1. As observed in Fig. 11d, the same process described above for Fig. 11c is repeated, which causes the resultant R1 to now include the value A+B+C in the upper execution lane and F+G+H in the lower execution lane. At this point both execution lanes have processed the upper row of their respective stencils. Note the spill-over into a halo region on the left side of the execution lane array (if one exists on the left-hand side), or into random access memory if a halo region does not exist on the left-hand side of the execution lane array.
As observed in Fig. 11e, the program code next causes the data within the shift register array to shift one unit up, which causes both execution lanes to be aligned with the right edge of the middle row of their respective stencils. Register R1 of both execution lanes currently includes the summation of the stencil's top row and the middle row's rightmost value. Figs. 11f and 11g demonstrate continued progress moving leftwise across the middle row of both execution lanes' stencils. The accumulative addition continues such that, at the end of the processing of Fig. 11g, both execution lanes include the summation of the values of the top row and the middle row of their respective stencils.
Fig. 11h shows another shift to align each execution lane with its corresponding stencil's lowest row. Figs. 11i and 11j show continued shifting to complete processing over the course of both execution lanes' stencils. Fig. 11k shows additional shifting to align each execution lane with its correct position in the data array and write the resultant thereto.
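The walk of Figs. 11a through 11k can be reproduced in software. The toy model below is an illustration, not the hardware: wrap-around indexing stands in for the halo/spill-over storage, and a plain nested list stands in for the R1 registers. It shifts a data plane beneath a fixed grid of "lanes" and accumulates a 3×3 sum in every lane simultaneously:

```python
def shift(plane, dx, dy):
    """Shift every value of the plane by (dx, dy); wrap-around edges stand
    in for the halo and random access memory spill-over of the hardware."""
    h, w = len(plane), len(plane[0])
    return [[plane[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]

def stencil_sum_3x3(plane):
    h, w = len(plane), len(plane[0])
    data = shift(plane, +1, +1)          # Fig. 11b: down one, right one ->
    r1 = [row[:] for row in data]        # every lane loads its upper-left value
    walk = [(-1, 0), (-1, 0),            # Figs. 11c-11d: across the top row
            (0, -1), (+1, 0), (+1, 0),   # Figs. 11e-11g: middle row, leftward
            (0, -1), (-1, 0), (-1, 0)]   # Figs. 11h-11j: bottom row
    for dx, dy in walk:
        data = shift(data, dx, dy)
        for y in range(h):
            for x in range(w):
                r1[y][x] += data[y][x]   # ADD the newly shifted-in value to R1
    return r1

plane = [[4 * y + x for x in range(4)] for y in range(4)]
print(stencil_sum_3x3(plane)[1][1])  # → 45  (sum of the 3x3 block around (1, 1))
```

Every lane executes the identical shift/add sequence, which is the SIMD point of the figures: one instruction stream, one different stencil per lane.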
In the example of Figs. 11a through 11k, note that the object code for the shift operations may include an instruction format that identifies the direction and magnitude of the shift expressed in (X, Y) coordinates. For example, the object code for a shift up by one location may be expressed as SHIFT 0, +1. As another example, a shift to the right by one location may be expressed as SHIFT +1, 0. In various embodiments, shifts of larger magnitude may also be specified in object code (e.g., SHIFT 0, +2). Here, if the 2D shift register hardware only supports shifts by one location per cycle, the instruction may be interpreted by the machine to require multiple cycles of execution, or the 2D shift register hardware may be designed to support shifts by more than one location per cycle. Embodiments of the latter are described in more detail further below.
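On hardware that only shifts one position per cycle, a larger-magnitude SHIFT would be expanded into unit steps. A minimal sketch of that interpretation (the mnemonic and (X, Y) convention follow the text; the X-before-Y expansion order is an assumption):

```python
def expand_shift(dx, dy):
    """Break SHIFT dx, dy into one-position-per-cycle micro-shifts."""
    sx = (dx > 0) - (dx < 0)  # sign of the X component
    sy = (dy > 0) - (dy < 0)  # sign of the Y component
    return [(sx, 0)] * abs(dx) + [(0, sy)] * abs(dy)

print(expand_shift(0, 2))   # → [(0, 1), (0, 1)]   SHIFT 0, +2 takes two cycles
print(expand_shift(-1, 0))  # → [(-1, 0)]          a unit shift takes one
```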
Fig. 12 shows another, more detailed depiction of the unit cell for the array execution lane and shift register structure (registers in the halo region do not include a corresponding execution lane). In one embodiment, the execution lane and the register space associated with each location in the execution lane array are implemented by instantiating the circuitry observed in Fig. 12 at each node of the execution lane array. As observed in Fig. 12, the unit cell includes an execution lane 1201 coupled to a register file 1202 consisting of four registers R2 through R5. During any cycle, the execution lane 1201 may read from or write to any of registers R1 through R5. For instructions requiring two input operands, the execution lane may retrieve both operands from any of R1 through R5.
In one embodiment, the two-dimensional shift register structure is implemented by permitting, during a single cycle, the contents of any one (only) of registers R2 through R4 to be shifted "out" to one of its neighbors' register files through output multiplexer 1203, and by having the contents of any one (only) of registers R2 through R4 replaced with content that is shifted "in" from a corresponding one of its neighbors through input multiplexers 1204, such that shifts between neighbors are in a same direction (e.g., all execution lanes shift left, all execution lanes shift right, etc.). Although it may be common for a same register to have its contents shifted out and replaced with content shifted in on the same cycle, the multiplexer arrangement 1203, 1204 permits different shift source and shift target registers within the same register file during the same cycle.
As depicted in Fig. 12, note that during a shift sequence an execution lane shifts content out from its register file 1202 to each of its left, right, top, and bottom neighbors. In conjunction with the same shift sequence, the execution lane also shifts content into its register file from a particular one of its left, right, top, and bottom neighbors. Again, the shift-out target and the shift-in source should be consistent with the same shift direction for all execution lanes (e.g., if the shift-out is to the right neighbor, the shift-in should be from the left neighbor).
Although in one embodiment the content of only one register per execution lane is permitted to be shifted per cycle, other embodiments may permit the content of more than one register to be shifted in/out. For example, the content of two registers may be shifted out/in during the same cycle if a second instance of the multiplexer circuitry 1203, 1204 observed in Fig. 12 is incorporated into the design of Fig. 12. Of course, in embodiments where the content of only one register is permitted to be shifted per cycle, shifts from multiple registers may take place between mathematical operations by consuming more clock cycles for shifts between the mathematical operations (e.g., the contents of two registers may be shifted between math operations by consuming two shift operations between the math operations).
If less than all the content of an execution lane's register file is shifted out during a shift sequence, note that the content of each execution lane's non-shifted-out registers remains in place (does not shift). As such, any non-shifted content that is not replaced with shifted-in content persists locally in the execution lane across the shifting cycle. The memory unit ("M") observed in each execution lane is used to load/store data from/to the random access memory space associated with the execution lane's row and/or column within the execution lane array. Here, the M unit acts as a standard M unit in that it is often used to load/store data that cannot be loaded/stored from/to the execution lane's own register space. In various embodiments, the primary operations of the M unit are to write data from a local register into memory and to read data from memory and write it into a local register.
With respect to the ISA opcodes supported by the ALU unit of the hardware execution lane 1201, in various embodiments the mathematical opcodes supported by the hardware ALU are integrally tied with (e.g., substantially the same as) the mathematical opcodes supported by a virtual execution lane (e.g., ADD, SUB, MOV, MUL, MAD, ABS, DIV, SHL, SHR, MIN/MAX, SEL, AND, OR, XOR, NOT). As described just above, memory access instructions can be executed by the execution lane 1201 to fetch/store data from/to its associated random access memory. Additionally, the hardware execution lane 1201 supports shift operation instructions (right, left, up, down) to shift data within the two-dimensional shift register structure. As described above, program control instructions are largely executed by the scalar processor of the stencil processor.
C. Operation and design of the sheet generator
Figs. 13 through 18 pertain to special considerations and/or operations of the sheet generator. As described above, the sheet generator is responsible for generating sheets of information for processing by a corresponding stencil processor. In order to impart wide-ranging versatility/programmability into the design of the overall processor, the sheet generator may, in some circumstances, need to perform additional operations when preparing an input sheet, rather than just parsing appropriate sections from a received line group.
For example, in some cases the program code will call for simultaneously processing multiple channels of the same image. For example, many video images have a red (R) channel, a blue (B) channel, and a green (G) channel. In one embodiment, the sheet generator is implemented with a processor having associated memory, with program code that executes out of the memory.
As observed in Fig. 13, in response to a detection from the application software of a kernel's need to simultaneously process data from different channels (which may be hinted from the compiler), the program code executed by the sheet generator will proceed to form separate sheets along different "planes" (i.e., form a different sheet from each channel) and load them into the data computation unit together. That is, the sheet generator will generate an R sheet, a B sheet, and a G sheet for the same section of the array and load all three sheets into the computation unit. The execution lanes within the execution lane array are then free to operate on the R, G, and B sheets as needed (e.g., by storing the R sheet in one layer of the register file, the G sheet in another layer of the register file, and the B sheet in yet another layer of the register file).
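Per-channel sheet formation amounts to de-interleaving. A pure-Python sketch, under the hypothetical layout assumption that each array position holds an (R, G, B) triple:

```python
def channel_sheets(pixels):
    """Split an array of (R, G, B) positions into one sheet per plane."""
    return tuple([[px[c] for px in row] for row in pixels]
                 for c in range(3))

pixels = [[(10, 20, 30), (11, 21, 31)],
          [(12, 22, 32), (13, 23, 33)]]
r, g, b = channel_sheets(pixels)
print(g)  # → [[20, 21], [22, 23]]
```

Each returned sheet has the same array geometry as the input, which is what lets the three sheets occupy three register-file layers at matching positions.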
Fig. 14 pertains to sheet generation for multi-dimensional input images. Here, although many input images are in the form of a simple array, in some cases each location of the array will correspond to a multi-dimensional data construct. As an illustrative example, Fig. 14 depicts an image where each array location contains 27 different values corresponding to the different segments of a 3×3×3 cube. Here, where each array location has a multi-dimensional data construct, the sheet generator will "unroll" the input array to form a separate sheet for each data construct dimension. Thus, as observed in Fig. 14, the sheet generator will generate 27 sheets (one for each cube segment), where each array location of each sheet, across all the sheets, contains a scalar value (one cube segment). The 27 sheets are then loaded into the stencil processor. The program code executed by the execution lanes within the execution lane array then operates on the 27 sheets with an understanding of the manner in which the multi-dimensional input array has been unrolled.
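Unrolling generalizes the channel case to one sheet per element of the per-position construct. A sketch (illustrative only; each position holds an N-value sequence, N = 27 for the 3×3×3 cube of Fig. 14):

```python
def unroll(array):
    """Form one scalar sheet per element of the per-position data construct."""
    n = len(array[0][0])
    return [[[position[k] for position in row] for row in array]
            for k in range(n)]

# 2x2 array, each position holding a 27-value construct:
array = [[tuple(range(27)) for _ in range(2)] for _ in range(2)]
sheets = unroll(array)
print(len(sheets), sheets[26][0][0])  # → 27 26
```

The kernel code's "understanding of the unrolling" corresponds to knowing that sheet k at position (y, x) holds element k of the construct at (y, x).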
Fig. 15 pertains to a technique used to permit the execution lanes within the execution lane array to handle different data bit widths. Here, as is understood in the art, greater dynamic range is achieved by increasing the bit width of the data values (a 16-bit value can express values with greater dynamic range than an 8-bit value can). In one embodiment, the stencil processors are expected to operate on images having different bit widths, such as 8-, 16-, or 32-bit pixel values. As such, according to one approach, the execution lanes themselves are 32-bit machines in the sense that, internally, the execution lanes can handle 32-bit operands.
However, to reduce the size and complexity of the two-dimensional shift register, the individual storage elements of the registers within each execution lane's register file are limited to 8 bits. In the case of 8-bit image data there is no issue, because an entire sheet of data can fit in one register of the register file. By contrast, in the case of 16- or 32-bit operands, the sheet generator generates multiple sheets to appropriately express the input operand data set.
For example, as depicted in Fig. 15, in the case of 16-bit input operands the sheet generator will generate a HI half sheet and a LO half sheet. The HI half sheet contains the upper 8 bits of each data item at its correct array location. The LO half sheet contains the lower 8 bits of each data item at its correct array location. 16-bit operations are then performed by loading both sheets into the stencil processor and informing the execution lane hardware (e.g., via an immediate value in the program code) that 16-bit operation is to take place. Here, as just one possible mode of operation, both the HI and LO sheets are loaded into two different registers of each execution lane's register file.
The execution lane units are able to internally construct the correct operands by first reading from one of the register file locations and appending to it the data read from the other register file location. Similarly, in the write direction, the execution lane units must perform two writes. Specifically, a first write of the lower 8 bits to a first register of the register file containing the LO sheet, and then a second write of the upper 8 bits to a second register of the register file containing the HI sheet.
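The HI/LO split and the two-step reconstruction can be sketched directly (illustrative helpers; the hardware does this with register reads and writes, not list arithmetic):

```python
def split_16bit(sheet):
    """Form the HI and LO half sheets of a 16-bit sheet (8-bit elements)."""
    hi = [[v >> 8 for v in row] for row in sheet]
    lo = [[v & 0xFF for v in row] for row in sheet]
    return hi, lo

def join_16bit(hi, lo):
    """Read one register, append the other: the lane-internal reconstruction."""
    return [[(h << 8) | l for h, l in zip(hr, lr)] for hr, lr in zip(hi, lo)]

sheet = [[0x1234, 0xBEEF], [0x0080, 0xFF00]]
hi, lo = split_16bit(sheet)
print(hi[0], lo[0])                 # → [18, 190] [52, 239]
print(join_16bit(hi, lo) == sheet)  # → True
```

The write path mirrors `split_16bit`: the lane writes the low byte to the LO register and the high byte to the HI register as two separate operations.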
Recalling the discussion of Fig. 12, in various embodiments the content of only one register is permitted to be shifted per cycle. As such, in order to move 16-bit data values around the two-dimensional shift register structure, each shift sequence (between mathematical operations) consumes two cycles, rather than the one cycle consumed in the case of 8-bit data values. That is, in the nominal case of 8-bit data values, all data can be shifted between locations in a single cycle. By contrast, in the case of 16-bit data values, two 8-bit values (the HI half sheet and the LO half sheet) have to be shifted per shift register shift operation. In one embodiment, in the 32-bit case, the same principles apply except that four sheets, rather than two, are created to represent the entire image data. Likewise, as many as four cycles may need to be consumed per shift sequence.
Fig. 16 pertains to situations where the image processor "up-samples" the input image data from a lower density resolution to a higher density resolution. Here, the stencil processors are responsible for generating more output values per unit area of an image than the input image contains. The sheet generator handles the up-sampling problem by repeating the same data value across a sheet, so that the sheet's data value density corresponds to the up-sampled (higher density) output image. That is, for example, as observed in Fig. 16, in the case where the execution lane array density corresponds to 4:1 up-sampling in view of the density of the output image (four output pixels for every input pixel), the sheet generator manufactures a sheet with four identical values for every input value.
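The value-replication scheme for up-sampling is a two-axis repeat. A sketch, assuming a symmetric 2×2 replication for the 4:1 case of Fig. 16:

```python
def upsample(sheet, fx, fy):
    """Repeat each input value fx times across and fy times down, so the
    sheet's value density matches the up-sampled output image."""
    return [[v for v in row for _ in range(fx)]
            for row in sheet for _ in range(fy)]

print(upsample([[1, 2]], 2, 2))  # → [[1, 1, 2, 2], [1, 1, 2, 2]]
```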
Fig. 17 pertains to the reverse situation of "down-sampling". In the case of down-sampling, the sheet generator will generate more sheets than for a lower density input image. Specifically, if the input image has a factor of S higher resolution in one direction (e.g., X) and a factor of T higher resolution in the other direction (e.g., Y), the sheet generator will generate S*T sheets from an initial, denser initial sheet. This effectively assigns more input pixels to any particular output pixel.
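The S*T sheets for down-sampling can be produced by phase-striding the dense input. This pairing of phases to sheets is an assumption for illustration; the patent only fixes the sheet count:

```python
def phase_sheets(image, s, t):
    """Split an input that is S x denser in X and T x denser in Y into S*T
    sheets, one per (x, y) phase, so each output pixel sees all S*T inputs."""
    return [[row[px::s] for row in image[py::t]]
            for py in range(t) for px in range(s)]

image = [[0, 1, 2, 3], [4, 5, 6, 7]]
sheets = phase_sheets(image, s=2, t=1)
print(len(sheets), sheets[0], sheets[1])
# → 2 [[0, 2], [4, 6]] [[1, 3], [5, 7]]
```

Stacking the S*T phase sheets gives every lane all the input pixels that contribute to its output pixel, which is the stated goal of the scheme.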
Fig. 18 pertains to situations where the mathematical operations performed by the execution lanes within the execution lane array require an image data surface area larger than the dimensions of the two-dimensional shift register structure. As observed in Fig. 18, the sheet to be loaded into the two-dimensional shift register structure for processing corresponds to the shaded region 1801 of an input frame. The mathematical operations that will calculate output values for the array locations within the shaded region, however, require values within the frame bounded by the dashed border 1802 observed in Fig. 18. Thus, there exists a large "support region" outside the surface area of the two-dimensional shift register structure that will be included in the operations.
Under these conditions, the sheet generator will not only load the sheet corresponding to the shaded region 1801 into the stencil processor, but will also load the three (unshaded) neighboring sheets into the data computation unit. The program code executed by the execution lanes will call in and move out sheets to/from random access memory as needed, and/or store some or all of the sheets in the deeper registers of the two-dimensional shift register array.
Figure 19 provides an embodiment of a hardware design 1900 for the sheet generator. As shown in Figure 19, in one embodiment the sheet generator is implemented as a computing system having a processor/controller 1901 that executes program code stored in memory 1902 to perform sheet generator tasks, such as any of the tasks described above with reference to Figures 13 through 18. The sheet generator also includes an I/O unit 1903 for receiving/sending line groups from/to the network and receiving/sending sheets from/to the sheet generator's associated stencil processor.
A pertinent feature of the sheet generator is its configuration space 1904, which may be implemented as separate register space within the sheet generator (as shown in Figure 19), within the processor/controller 1901, and/or within memory 1902. The configuration space 1904 lends itself to the wide adaptability and programmability of the overall platform. Here, settings made in the configuration space 1904 may include, for example, pertinent image features and dimensions such as frame size, line group size, sheet size, the pixel resolution of the input image, the pixel resolution of the output image, etc. The program code within memory 1902 then uses the information within the configuration space as input variables so as to operate correctly on sheets of the correct size, etc.
Alternatively, or in some combination, the wide adaptability and programmability of the overall platform may be realized by loading custom program code into memory 1902 for a particular application and/or image size. Here, for example, a compiler is able to easily refer to the X, Y coordinates of the position relative addressing scheme and/or any of the frame size and line group size to readily determine sheet sizes, sheet boundaries, etc., and thereby customize an otherwise generic program code template into a software program that is specific to the image processing task at hand. Likewise, any such translation and actual use of the relative positioning or other image dimensions may be entered into the configuration space 1904, whereupon program code resident on the sheet generator determines sheet boundaries, sheet sizes, etc.
D. Implementation Embodiments
The hardware designs discussed above may be embodied within a semiconductor chip and/or as a description of a circuit design for eventual targeting toward a semiconductor manufacturing process. In the latter case, such circuit descriptions may take the form of higher/behavioral level circuit descriptions (e.g., a VHDL description) or lower level circuit descriptions (e.g., a register transfer level (RTL) description, a transistor level description or a mask description) or various combinations thereof. Circuit descriptions are typically embodied on a computer readable storage medium (such as a CD-ROM or other type of storage technology).
3.0 Compiler Routines
Figures 20 through 23a-d pertain to embodiments of special routines performed by a compiler that translates program code written in the virtual ISA environment discussed at length above in Section 1.0 into object code that is executed by the image processor hardware discussed at length above in Section 2.0.
Figure 20 depicts a first compilation process in which memory load instructions of the virtual processor, as written for example in the "position relative" format of the virtual ISA, are converted into object code shift instructions for a target hardware platform having a two-dimensional shift array structure, such as the stencil processor described above. Here, recall that the position relative instruction format identifies a data item to be loaded with the X, Y coordinates of the input array portion of the memory model.
However, as described at length in Section 2.0 above for the hardware architecture, loading data into the execution lanes for execution is in many cases accomplished by shifting the data values within the two-dimensional shift array. After one or more appropriate shifts of the data within the shift register, sought-for data becomes aligned with the same array location of the execution lane that needs the data item. The data can then be directly executed upon by the execution lane.
As such, part of the compilation process entails converting a sequence of virtual processor load instructions into shift instructions for the two-dimensional shift array. For the sake of illustration, Figure 20 shows, in the same figure, both a virtual code sequence 2001 similar to the virtual code sequence presented in Figure 3 and an object code sequence 2002 similar to the object code sequence presented in Figures 11a-k, for evaluating the average over a same stencil.
Comparing code sequences 2001 and 2002, note that the load instructions of the virtual code 2001 have essentially been replaced by shift instructions in the object code 2002. Here, the compiler is able to "comprehend" both the data field that the virtual code 2001 is calling upon and the structure and operation of the two-dimensional shift array. By comprehending these features, the compiler is able to determine the number of shifts and the sheet direction needed to align the data item identified in a virtual code load instruction with the execution lane of the execution lane array.
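The load-to-shift conversion can be sketched for a single position-relative load. The function below is an assumed illustration (the mnemonics and sign convention are hypothetical, not taken from the patent): data that lies to a lane's left is brought into the lane by shifting array contents right, hence the sign flip.

```python
def load_to_shifts(dx, dy):
    """Translate a position-relative load at offset (dx, dy) into a
    sequence of unit shift instructions for the 2D shift array."""
    ops = []
    step_x = 'SHIFT_RIGHT' if dx < 0 else 'SHIFT_LEFT'
    ops += [step_x] * abs(dx)
    step_y = 'SHIFT_DOWN' if dy < 0 else 'SHIFT_UP'
    ops += [step_y] * abs(dy)
    return ops

# a load of the pixel one to the left and one above becomes two shifts
print(load_to_shifts(-1, -1))  # ['SHIFT_RIGHT', 'SHIFT_DOWN']
```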
Figure 21 shows a related process in which the compiler reorders data load sequences so as to reduce or minimize the number of shifts needed to load the required data into its respective execution lane within the hardware. As an example, Figure 21 presents another version 2101 of the virtual code sequence 2001 of Figure 20. Unlike the virtual code sequence 2001 of Figure 20, which processes the stencil in a left-to-right row order (ABC-FGH-KLM), the virtual code sequence 2101 of Figure 21, if its data were accessed in that same order by the two-dimensional shift array structure, would be extremely inefficient in terms of the order in which data is accessed.
Specifically, the virtual code sequence 2101 of Figure 21 would require a maximum, or near maximum, number of shifts of the data values in order to access the data according to the specified sequence (A-M-C-K-B-L-F-G-E). When faced with virtual code having a data access sequence that is inefficient from the perspective of the two-dimensional shift array, the compiler will reorder the sequence of data accesses so as to keep the number of shifts between mathematical operations at a minimum (e.g., one shift between mathematical operations). As such, if the virtual code sequence 2101 of Figure 21 were presented to the compiler, the compiler would nevertheless produce object code like the object code sequence 2002 of Figure 20, which orders the data accesses in a back-and-forth winding sequence (A-B-C-H-G-F-K-L-M). Note that the back-and-forth winding sequence of the object code is more efficient than (and different from) the sequence of the original virtual code 2001 of Figure 20, whose load operations follow a row order (ABC-FGH-KLM).
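The benefit of the winding order can be checked by summing the Manhattan shift distance between successive accesses of a 3x3 stencil. This is an illustrative cost model under assumed unit-step shifts, not the patent's cost function:

```python
def total_shifts(order, coords):
    """Sum of Manhattan shift distances between successive accesses."""
    cost, pos = 0, coords[order[0]]
    for name in order[1:]:
        nxt = coords[name]
        cost += abs(nxt[0] - pos[0]) + abs(nxt[1] - pos[1])
        pos = nxt
    return cost

# 3x3 stencil labeled as in the text: A B C / F G H / K L M
coords = {n: (i % 3, i // 3) for i, n in enumerate('ABCFGHKLM')}
row_order   = list('ABCFGHKLM')  # left-to-right rows: costly row returns
snake_order = list('ABCHGFKLM')  # back-and-forth: one shift per step
print(total_shifts(row_order, coords),    # 12
      total_shifts(snake_order, coords))  # 8
```

The winding order pays exactly one shift per step; the row order pays three shifts at each row boundary.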
Although Figure 21 is directed primarily to the access order within a same sheet, the compiler may also sort data accesses based on sheet, so as to prevent loading/unloading of sheets between random access memory and the shift array. For example, if an image has 10 sheets, the sheets are numbered 1 through 10 and the accesses are ordered based on their respective sheet number (accesses to sheet 1 precede accesses to sheet 2, accesses to sheet 2 precede accesses to sheet 3, etc.). In one embodiment, the compiler also keeps accesses to a same channel together (e.g., accesses to channel G follow accesses to channel R, and accesses to channel B follow accesses to channel G).
Figure 22 pertains to another compiler operation that unrolls random access memory accesses so that, during runtime, there are no competing memory accesses in the actual hardware. Here, the procedure of Figure 22 constructs object code in view of the data being operated on by the virtual code and the physical limitations of the underlying machine. As discussed previously, each execution lane within the execution lane array has an associated register file (e.g., four registers per execution lane). As with most processors, an execution lane reads and/or writes data from/to the registers consistent with the object code instructions. As with most compilers, the compiler is cognizant of what data resides in what register and recognizes the physical limitations of the available register space.
As such, from time to time an execution lane may need a data item that is not in register space but is instead located in the random access memory that is associated with the execution lane's row and/or column in the execution lane array. Likewise, from time to time an execution lane may need to write a data item but there is no register space into which the data can be written (because all data currently within register space still has dependencies). In these circumstances, the compiler inserts memory load or memory store instructions into the object code (as opposed to register load or register store instructions) to fetch/write data from/to random access memory rather than register space.
Figure 22 depicts an embodiment of the hardware architecture showing a separate random access memory 2207_1 through 2207_R along each row of the array. From this architecture it can be seen that execution lanes along a same row of the execution lane array have access to a same random access memory. As drawn, each execution lane includes a memory unit for accessing its respective random access memory. Accordingly, when two different execution lanes on different rows execute a memory load instruction during a same cycle, the instructions do not compete with one another because they are directed to different random access memories.
By contrast, if execution lanes along a same row perform a memory access in a same cycle, the memory accesses will compete. Given that the execution lane array is intended to operate in a SIMD-like fashion, program code will naturally cause the execution lanes of the array (which includes both rows and columns) to issue memory access requests in a same cycle. Thus, competing memory accesses by execution lanes along a same row is a foreseeable hazard. Figure 22 shows a pair of threads 2201 executing on two different execution lanes on a same row. Given the SIMD-like nature of the machine, both execution lanes execute the same opcodes in the same cycle, including a pair of memory load instructions in the first two depicted cycles. Examining the addresses of the memory load instructions, note that all of the addresses are different. As such, the first memory load instructions of the two threads truly compete with one another, and the second memory load instructions of the two threads truly compete with one another.
As such, when the compiler imposes a memory load instruction into the object code, it also recognizes that the memory load instruction will impose conflicts on execution lanes that reside on a same row. In response, the compiler imposes sequential memory load instructions into the code, effectively unrolling the competing memory loads along a same row so that each execution lane is given its own reserved cycle for accessing the memory. In the example of Figure 22, note that the final object code 2202 includes a sequence of four sequential memory load instructions across four cycles, to ensure that the memory access of one execution lane does not interfere with the memory access of another execution lane along the same row.
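The unrolling step amounts to serializing the per-row loads, one reserved cycle per lane. The scheduler below is a hypothetical sketch of that idea (lane names, addresses, and the `MEM_LOAD` mnemonic are assumed for illustration):

```python
def unroll_row_loads(loads_by_lane):
    """Lanes on one row share a RAM, so serialize their loads:
    cycle c issues the load of lane c; the other lanes no-op that cycle."""
    schedule = []
    for cycle, (lane, addr) in enumerate(loads_by_lane.items()):
        schedule.append({'cycle': cycle, 'lane': lane, 'op': f'MEM_LOAD {addr}'})
    return schedule

# two lanes on the same row both want to load in the same cycle
sched = unroll_row_loads({'P1': 'A[0]', 'P2': 'A[7]'})
print([s['cycle'] for s in sched])  # [0, 1] -> one reserved cycle per lane
```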
Note that the approach of Figure 22 is particularly applicable to the look-up table portion of the memory model discussed above in Section 1.0. Here, recall that different execution lanes may use different indexes into a same look-up table to access different entries of the same look-up table in a same cycle. In one embodiment, the compiler instantiates a different copy of the same look-up table into each random access memory 2207_1 through 2207_R. Lookups can therefore be made into the local table copy by execution lanes on different rows during a same cycle. Such lookups do not compete, and the index of each lookup may be different. By contrast, lookups performed by execution lanes along a same row access the same look-up table in the same memory and need to be unrolled and performed sequentially. By being unrolled into sequential accesses, the index values are permitted to differ. In one embodiment, along with the opcode for the mathematical operation, the VLIW instruction format of the object code includes an opcode for the memory operation that further includes the identity of the execution lane along the row that is actually supposed to execute the instruction (the other execution lanes along the row treat it as a no-op).
In various embodiments, the compiler treats atomic update instructions (as described above with respect to the memory model of the virtual environment in Section 1.0) similarly to look-up tables. That is, memory space is reserved (e.g., per row) in the random access memories 2207_1 through 2207_R for the results of atomic instructions. Non-competing updates (e.g., by identically positioned execution lanes along different rows) are permitted to execute during a same cycle, whereas competing updates (e.g., by execution lanes along a same row) are unrolled into separate instructions. The compiler often implements atomic update instructions as read-modify-write instructions, in which recent resultant data residing in the register space of an execution lane is read, a mathematical operation is performed on such data, and the result is then written into the specially reserved atomic update table.
Figures 23a through 23d pertain to another situation in which different execution lanes desire access to indexed data in different channels, where there is a same offset between each execution lane's position and the position of the data it needs. For example, as shown in Figure 23a, execution lanes P1, P2 and P3 desire data items X, Y and Z. Notably, each of the data items X, Y and Z resides in a different channel and therefore on a different sheet (data item X is on the B sheet, data item Y is on the G sheet, and data item Z is on the R sheet). Additionally, each data item is located two spaces to the right of each execution lane's position.
Another sheet contains index data that specifies, for each execution lane, the index to be used when fetching the desired information. Here, the index value aligned with execution lane P1 indicates that execution lane P1 desires the data item in channel B that is two positions to the right (B, -2), the index value aligned with execution lane P2 indicates that execution lane P2 desires the data item in channel G that is two positions to the right (G, -2), and the index value aligned with execution lane P3 indicates that execution lane P3 desires the data item in channel R that is two positions to the right (R, -2).
In an embodiment, as alluded to above, the two-dimensional shift register structure includes multiple registers per array location, but only one register level is able to participate in shift activity during a same cycle. As such, aligning the correct data from the different channels would require shifting all three sheets separately in different cycles before the appropriate data could be provided to each execution lane.
A speed-up that the compiler can impose into the object code is shown in Figures 23b through 23d. As shown in Figure 23b, rather than shifting any of the R, G, B data sheets, the index sheet is shifted by the amount listed in the index values themselves (i.e., -2 = two spaces to the right). As such, the desired data values across the channels become aligned with the index values themselves. Then, as shown in Figure 23c, direct loads are performed by the execution lanes that have index values and data values in alignment (execution lanes P3, P4 and P5).
Ideally, all three of the R, G and B sheets are loaded into three of the different register levels of the two-dimensional shift register, and the index sheet is loaded into a fourth (shiftable) register level. If so, no memory accesses are needed, and the only shifts are the two rightward shifts of the index sheet. Execution lanes P3, P4 and P5 can then directly load data values X, Y and Z (e.g., each into its R1) by referencing the index value at its location and the correct channel/level in its local register file. As shown in Figure 23d, after the data values have been loaded, the newly loaded data values are shifted two positions to the left. At this point, all three execution lanes have their desired data.
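The Figures 23b-d trick can be simulated in one dimension: shift only the index sheet, load where the index and the data coincide, then shift the loaded values back. The helper, sheet contents, and lane layout below are all assumed for illustration:

```python
def gather_via_index_shift(sheets, index, offset):
    """Instead of shifting each channel sheet, shift only the index
    sheet by `offset`, load where index and data align, then shift
    the loaded values back into place."""
    n = len(index)
    shifted_idx = [None] * n
    for i, v in enumerate(index):             # shift the index sheet right
        if v is not None and 0 <= i + offset < n:
            shifted_idx[i + offset] = v
    loaded = [sheets[ch][i] if ch else None   # aligned lanes load directly
              for i, ch in enumerate(shifted_idx)]
    result = [None] * n
    for i, v in enumerate(loaded):            # shift loaded data back left
        if v is not None and 0 <= i - offset < n:
            result[i - offset] = v
    return result

sheets = {'R': list('rstuv'), 'G': list('ghijk'), 'B': list('bcdef')}
index = ['B', 'G', 'R', None, None]  # lanes P1..P3 each want data at +2
print(gather_via_index_shift(sheets, index, 2))
# ['d', 'j', 'v', None, None] -> each lane got its value two to the right
```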
In various embodiments, the compiler can also perform additional code improvements and pertinent functions. Some of these improvements and functions are described in more detail immediately below.
One additional code improvement includes constant folding. For example, if the compiler recognizes that two constants are being added (e.g., from a constant table), the compiler will simply assign the summed value as an immediate operand lower in the code, rather than permit the code to actually cause an execution lane to perform the addition of the pair of known values. For example, if the compiler observes an operation R1 <= ADD 9 5, the compiler will replace the operation with R1 <= 14 or an equivalent statement.
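A minimal constant-folding pass over this kind of assembly could look like the following. The instruction syntax is the one from the example above; the regular expression and helper are an illustrative sketch, not the compiler's actual pass:

```python
import re

def fold_constants(lines):
    """Replace 'Rn <= ADD a b' where a and b are integer literals
    with the precomputed sum; leave everything else untouched."""
    folded = []
    for line in lines:
        m = re.fullmatch(r'(R\d+) <= ADD (-?\d+) (-?\d+)', line)
        folded.append(f'{m.group(1)} <= {int(m.group(2)) + int(m.group(3))}'
                      if m else line)
    return folded

print(fold_constants(['R1 <= ADD 9 5', 'R2 <= ADD R1 3']))
# ['R1 <= 14', 'R2 <= ADD R1 3'] -> only the all-literal add is folded
```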
Another pertinent compiler feature includes comprehending the minimum and maximum values of all expressions. By so doing, it becomes easy to determine the number of channels in the input and output expressions. For example, an input expression having a min/max range of [-3/+2; -2/+3; [null]] is understood to be directed to a first channel and a second channel but not a third channel.
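The channel-counting consequence of min/max analysis can be sketched directly from the example's notation, with null ranges modeled as `None` (an assumed encoding, not the patent's representation):

```python
def live_channels(ranges):
    """A channel whose min/max range is null carries no data, so the
    expression is understood not to be directed to it."""
    return [i for i, r in enumerate(ranges) if r is not None]

# [-3/+2; -2/+3; [null]] -> the first and second channels only
print(live_channels([(-3, 2), (-2, 3), None]))  # [0, 1]
```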
Another pertinent compiler feature is comprehending the sizes of any of the line groups, sheets, their support regions (i.e., the "out-of-sheet" information needed for the mathematical operations performed on the output pixel locations that are encompassed by the input sheet pixel locations), and look-up tables (whether scalar or otherwise). Here, accesses to any of these structures are analyzed by the compiler to comprehend how large a sheet needs to be, how large a sheet's support region needs to be, and/or how large any look-up tables are. Once the sizes of these structures are known, their corresponding sizes are provided as hints from the compiler, e.g., as meta data. In the case of look-up tables, for example, the meta data is appended to or otherwise included with the look-up table. As such, the sheet generator's processor/controller and/or the scalar processor is able to confirm that sufficient space exists in a memory region before loading any such data structure into any memory. In an embodiment, the meta data includes the number of RAM entries used, the minimum and maximum offsets in both the X dimension and the Y dimension, whether any up-scaling or down-scaling is used, and how many channels are included. From these parameters, the line buffer unit (in the case of line groups) and the sheet generator's processor/controller and/or scalar processor can determine the appropriate sizes and allocate memory space accordingly.
Another compiler feature is the elimination of redundant clamps. Image access instructions as expressed in the virtual programming environment are often limited so as not to permit accesses beyond certain boundaries. For instance, an input array load instruction may be expressly forbidden from accessing regions outside some boundary. The compiler analyzes the memory accesses in the virtual environment and, if any such access does not actually offend the boundary, eliminates any expression in the code that defines the boundary. This plays into the hardware's ability to handle out-of-bounds expressions, with the result that the software does not need to manage/handle the expressions.
In an embodiment, the compiler constructs object code instructions for the stencil processor having a very long instruction word (VLIW) type format. The instructions are read from scalar memory and issued to all of the execution lanes of the execution lane array. In an embodiment, the VLIW instruction format includes more than one opcode per instruction. For example, the VLIW format includes an ALU opcode (that directs a mathematical function performed by each execution lane's ALU); a memory opcode (that directs a memory operation for a specific execution lane or set of execution lanes (e.g., identically positioned execution lanes along each of multiple rows of the processing array)); and a scalar opcode that directs the activity of the scalar processor.
As mentioned previously, whereas look-up tables are instantiated multiple times within the stencil processor (e.g., one per row of the execution lane array) to permit simultaneous, non-competing accesses to look-up table content, constant look-up tables are, by contrast, scalar in nature because the same value is broadcast from scalar memory to all execution lanes in a same cycle. In an embodiment, scalar look-ups are therefore defined in the scalar processor opcode field of the VLIW instruction word.
In various embodiments, the compiler may be implemented as a software program that is part of the development environment described above with reference to Figure 6. As such, an instance of the compiler may be loaded onto, or otherwise operated on, the same computing system (or computing system cluster) that supports the development environment.
4.0 Concluding Statements
From the preceding sections it is pertinent to recognize that the virtual environment described above in Section 1.0 may be instantiated on a computer system. Likewise, an image processor as described above in Section 2.0 may be embodied in hardware on a computer system (e.g., as part of a handheld device's system on chip (SOC) that processes data from the handheld device's camera).
It is pertinent to point out that the various image processor architecture features described above are not necessarily limited to image processing in the traditional sense and therefore may be applied to other applications that may (or may not) cause the image processor to be re-characterized. For example, if any of the various image processor architecture features described above were to be used in the creation and/or generation and/or rendering of animation, as opposed to the processing of actual camera images, the image processor may be characterized as a graphics processing unit. Additionally, the image processor architecture features described above may be applied to other technical applications such as video processing, vision processing, image recognition and/or machine learning. Applied in this manner, the image processor may be integrated with (e.g., as a co-processor to) a more general purpose processor (e.g., that is or is part of a CPU of a computing system), or may be a stand-alone processor within a computing system.
The hardware designs discussed above may be embodied within a semiconductor chip and/or as a description of a circuit design for eventual targeting toward a semiconductor manufacturing process. In the latter case, such circuit descriptions may take the form of higher/behavioral level circuit descriptions (e.g., a VHDL description) or lower level circuit descriptions (e.g., a register transfer level (RTL) description, a transistor level description or a mask description) or various combinations thereof. Circuit descriptions are typically embodied on a computer readable storage medium (such as a CD-ROM or other type of storage technology).
From the preceding sections it is pertinent to recognize that an image processor as described above may be embodied in hardware on a computer system (e.g., as part of a handheld device's system on chip (SOC) that processes data from the handheld device's camera). In cases where the image processor is embodied as a hardware circuit, note that the image data processed by the image processor may be received directly from a camera. Here, the image processor may be part of a discrete camera, or part of a computing system having an integrated camera. In the latter case, the image data may be received directly from the camera or from the computing system's system memory (e.g., the camera sends its image data to system memory rather than to the image processor). Note also that many of the features described in the preceding sections may be applicable to a graphics processor unit (which renders animation).
Figure 24 provides an exemplary depiction of a computing system. Many of the components of the computing system described below are applicable to a computing system having an integrated camera and associated image processor (e.g., a handheld device such as a smartphone or tablet computer). Those of ordinary skill in the art will be able to easily delineate between the two.
As observed in Figure 24, the basic computing system may include a central processing unit 2401 (which may include, e.g., a plurality of general purpose processing cores 2415_1 through 2415_N and a main memory controller 2417 disposed on a multi-core processor or applications processor), system memory 2402, a display 2403 (e.g., touchscreen, flat-panel), a local wired point-to-point link (e.g., USB) interface 2404, various network I/O functions 2405 (such as an Ethernet interface and/or cellular modem subsystem), a wireless local area network (e.g., WiFi) interface 2406, a wireless point-to-point link (e.g., Bluetooth) interface 2407 and a Global Positioning System interface 2408, various sensors 2409_1 through 2409_N, one or more cameras 2410, a battery 2411, a power management control unit 2424, a speaker and microphone 2413 and an audio coder/decoder 2414.
An applications processor or multi-core processor 2450 may include within its CPU 2401 one or more general purpose processing cores 2415, one or more graphical processing units 2416, a memory management function 2417 (e.g., a memory controller), an I/O control function 2418 and an image processing unit 2419. The general purpose processing cores 2415 typically execute the operating system and application software of the computing system. The graphics processing units 2416 typically execute graphics intensive functions, e.g., to generate graphics information that is presented on the display 2403. The memory control function 2417 interfaces with the system memory 2402 to write/read data to/from system memory 2402. The power management control unit 2424 generally controls the power consumption of the system 2400.
The image processing unit 2419 may be implemented according to any of the image processing unit embodiments described at length above in the preceding sections. Alternatively or in combination, the IPU 2419 may be coupled to either or both of the GPU 2416 and CPU 2401 as a co-processor thereof. Additionally, in various embodiments, the GPU 2416 may be implemented with any of the image processor features described at length above.
Each of the touchscreen display 2403, the communication interfaces 2404-2407, the GPS interface 2408, the sensors 2409, the camera 2410, and the speaker/microphone codec 2413, 2414 may be viewed as various forms of I/O (input and/or output) relative to the overall computing system, which, where appropriate, also includes an integrated peripheral device (e.g., the one or more cameras 2410). Depending on the implementation, various ones of these I/O components may be integrated on the applications processor/multi-core processor 2450, or may be located off the die of, or outside the package of, the applications processor/multi-core processor 2450.
In an embodiment, the one or more cameras 2410 include a depth camera capable of measuring depth between the camera and an object in its field of view. Application software, operating system software, device driver software and/or firmware executing on a general purpose CPU core (or other functional block having an instruction execution pipeline to execute program code) of an applications processor or other processor may perform any of the functions described above.
Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific hardware components that contain hardwired logic for performing the processes, or by any combination of programmed computer components and custom hardware components.
Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs and magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other types of media/machine-readable media suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (25)

1. A method, comprising:
translating higher level program code comprising higher level instructions into lower level instructions, the higher level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates of an orthogonal coordinate system, the lower level instructions targeting a hardware architecture having an array of execution lanes and a shift register array structure that is able to shift data along two different axes, the translating comprising replacing the higher level instructions having the instruction format with lower level shift instructions that shift data within the shift register array structure.
2. The method of claim 1, wherein the translating further comprises reordering a pixel access sequence articulated in the higher level instructions having the instruction format into a sequence that is conducive to shifting by the lower level shift instructions.
3. The method of claim 2, wherein the higher level instructions having the instruction format include load instructions.
4. The method of claim 2, wherein the higher level instructions having the instruction format include store instructions.
5. The method of claim 1, wherein the hardware architecture includes multiple memories coupled to the shift register array structure to receive data that spills over from shift operations of the shift register array structure.
6. according to the method described in claim 5, wherein, the compiler further executes following operation:
If realized in the hardware according to advanced specified access module, identifying be will lead in the high level instructions To those of one conflict specific in memory access high level instructions;And
The access module is reconstructed to avoid the conflict.
7. according to the method described in claim 6, wherein, the conflict includes carrying out two to the same memory in same period The secondary above access, and the reconstruct includes across the more than two successive access of expansion of multiple periods.
8. according to the method described in claim 7, wherein, the access is directed to identical look-up table.
9. according to the method described in claim 7, wherein, the access is due to identical atomic update.
10. according to the method described in claim 1, wherein, the compiler further executes following operation:
Identify that the different channels that execute of the execution channel array described in same period it is expected that difference of the processing from image is logical The corresponding data in road, wherein the data have offset relative to the corresponding position for executing channel;
Insertion program code is so that each index information table for executing the desired corresponding data in channel of identification shifts the offset Amount;
Insertion program code so that the different different sets for executing channel separate the offset from different execution interchannels, with It loads from the index information table and is loaded from the appropriate channel of the correspondence of each such load;And
Insertion program code is so that desired data are displaced to it corresponding one in different execution channels.
11. A machine readable storage medium having stored thereon program code that, when processed by a computing system, causes a method to be performed, the method comprising:
translating high level program code comprising high level instructions into low level instructions, the high level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates of an orthogonal coordinate system, the low level instructions being directed to a hardware architecture having an execution lane array and a shift register array structure that is able to shift data along two different axes, the translating comprising replacing the high level instructions having the instruction format with low level shift instructions that shift data within the shift register array structure.
12. The machine readable storage medium of claim 11, wherein the translating further comprises reordering a sequence of pixel accesses articulated in the high level instructions having the instruction format into a shift-convenient sequence for the low level shift instructions.
13. The machine readable storage medium of claim 12, further comprising reordering mathematical operation instructions so that input values are consumed consistently with the shift-convenient sequence.
14. The machine readable storage medium of claim 12, wherein the high level instructions having the instruction format comprise a load instruction and/or a store instruction.
15. The machine readable storage medium of claim 11, wherein the hardware architecture comprises multiple memories coupled to the shift register array structure to receive spillover data from shift operations of the shift register array structure.
16. The machine readable storage medium of claim 15, wherein the compiler further performs the following:
identifying those high level instructions that, if implemented in the hardware according to an access pattern as specified at the high level, would result in conflicting accesses to a particular one of the memories; and
restructuring the access pattern to avoid the conflict.
17. The machine readable storage medium of claim 16, wherein the conflict comprises more than two accesses to a same memory in a same cycle, and the restructuring comprises spreading the more than two consecutive accesses across multiple cycles.
18. The machine readable storage medium of claim 17, wherein the accesses are directed to a same look-up table.
19. The machine readable storage medium of claim 17, wherein the accesses are for a same atomic update.
20. A computing system comprising multiple processing cores and a machine readable medium, the machine readable medium comprising program code that, when processed by the multiple processing cores, causes a method to be performed, the method comprising:
translating high level program code comprising high level instructions into low level instructions, the high level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates of an orthogonal coordinate system, the low level instructions being directed to a hardware architecture having an execution lane array and a shift register array structure that is able to shift data along two different axes, the translating comprising replacing the high level instructions having the instruction format with low level shift instructions that shift data within the shift register array structure.
21. The computing system of claim 20, wherein the translating further comprises reordering a sequence of pixel accesses articulated in the high level instructions having the instruction format into a shift-convenient sequence for the low level shift instructions.
22. The computing system of claim 20, wherein the hardware architecture comprises multiple memories coupled to the shift register array structure to receive spillover data from shift operations of the shift register array structure, and wherein the compiler further performs the following:
identifying those high level instructions that, if implemented in the hardware according to an access pattern as specified at the high level, would result in conflicting accesses to a particular one of the memories; and
restructuring the access pattern to avoid the conflict.
23. The computing system of claim 22, wherein the conflict comprises more than two accesses to a same memory in a same cycle, and the restructuring comprises spreading the more than two consecutive accesses across multiple cycles.
24. The computing system of claim 23, wherein the accesses are directed to a same look-up table.
25. The computing system of claim 23, wherein the accesses are for a same atomic update.
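The lowering recited in claim 1 — replacing a coordinate-addressed pixel access with a sequence of shifts over the two-dimensional shift register array — can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the instruction names (`SHIFT_LEFT`, `READ_LOCAL`), the axis/direction conventions, and the `lower_load` helper are all assumptions introduced here for illustration.

```python
def lower_load(dx: int, dy: int) -> list[str]:
    """Sketch: lower a high level load at relative offset (dx, dy) into
    low level shift instructions for a 2D shift register array.

    Each SHIFT_* instruction moves the entire register plane one position
    along one axis, after which every execution lane reads the value that
    has arrived at its own local register.
    """
    ops: list[str] = []
    # Horizontal axis: a positive dx means the needed pixel lies to the
    # right, so the plane is shifted left toward each lane (and vice versa).
    ops += ["SHIFT_LEFT" if dx > 0 else "SHIFT_RIGHT"] * abs(dx)
    # Vertical axis, same convention.
    ops += ["SHIFT_UP" if dy > 0 else "SHIFT_DOWN"] * abs(dy)
    # Every execution lane then reads its own register in parallel.
    ops.append("READ_LOCAL")
    return ops
```

Under these assumed conventions, `lower_load(1, -2)` yields one left shift, two down shifts, and a local read; a load at offset `(0, 0)` lowers to a local read alone. The point of the claim is that no per-lane address arithmetic survives into the low level code: the (x, y) coordinates of the high level format are consumed entirely by the choice and count of shift instructions.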
CN201680020203.4A 2015-04-23 2016-03-28 Compiler for translating between virtual image processor Instruction Set Architecture (ISA) and target hardware with two-dimensional shift array architecture Active CN110149802B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US14/694,856 2015-04-23
US14/694,856 US9785423B2 (en) 2015-04-23 2015-04-23 Compiler for translating between a virtual image processor instruction set architecture (ISA) and target hardware having a two-dimensional shift array structure
PCT/US2016/024488 WO2016171846A1 (en) 2015-04-23 2016-03-28 Compiler for translating between a virtual image processor instruction set architecture (isa) and target hardware having a two-dimensional shift array structure

Publications (2)

Publication Number Publication Date
CN110149802A true CN110149802A (en) 2019-08-20
CN110149802B CN110149802B (en) 2021-04-16

Family

ID=55808845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680020203.4A Active CN110149802B (en) 2015-04-23 2016-03-28 Compiler for translating between virtual image processor Instruction Set Architecture (ISA) and target hardware with two-dimensional shift array architecture

Country Status (5)

Country Link
US (4) US9785423B2 (en)
EP (1) EP3286645A1 (en)
CN (1) CN110149802B (en)
GB (1) GB2554204B (en)
WO (1) WO2016171846A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158636A (en) * 2019-12-03 2020-05-15 中国人民解放军战略支援部队信息工程大学 Reconfigurable computing structure and routing addressing method and device of multiply-accumulate computing processing array
CN113032013A (en) * 2021-01-29 2021-06-25 成都商汤科技有限公司 Data transmission method, chip, equipment and storage medium

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2514618B (en) * 2013-05-31 2020-11-11 Advanced Risc Mach Ltd Data processing systems
US9785423B2 (en) 2015-04-23 2017-10-10 Google Inc. Compiler for translating between a virtual image processor instruction set architecture (ISA) and target hardware having a two-dimensional shift array structure
US9830150B2 (en) 2015-12-04 2017-11-28 Google Llc Multi-functional execution lane for image processor
US10313641B2 (en) 2015-12-04 2019-06-04 Google Llc Shift register with reduced wiring complexity
US10387988B2 (en) 2016-02-26 2019-08-20 Google Llc Compiler techniques for mapping program code to a high performance, power efficient, programmable image processing hardware platform
US10204396B2 (en) 2016-02-26 2019-02-12 Google Llc Compiler managed memory for image processor
US10380969B2 (en) 2016-02-28 2019-08-13 Google Llc Macro I/O unit for image processor
US10546211B2 (en) 2016-07-01 2020-01-28 Google Llc Convolutional neural network on programmable two dimensional image processor
US20180005346A1 (en) 2016-07-01 2018-01-04 Google Inc. Core Processes For Block Operations On An Image Processor Having A Two-Dimensional Execution Lane Array and A Two-Dimensional Shift Register
US20180007302A1 (en) 2016-07-01 2018-01-04 Google Inc. Block Operations For An Image Processor Having A Two-Dimensional Execution Lane Array and A Two-Dimensional Shift Register
US20180005059A1 (en) 2016-07-01 2018-01-04 Google Inc. Statistics Operations On Two Dimensional Image Processor
US11373266B2 (en) * 2017-05-05 2022-06-28 Intel Corporation Data parallelism and halo exchange for distributed machine learning
US10489199B2 (en) * 2017-05-12 2019-11-26 Google Llc Program code transformations to improve image processor runtime efficiency
US10467056B2 (en) * 2017-05-12 2019-11-05 Google Llc Configuration of application software on multi-core image processor
US10489878B2 (en) * 2017-05-15 2019-11-26 Google Llc Configurable and programmable image processor unit
US10915319B2 (en) * 2017-05-15 2021-02-09 Google Llc Two dimensional masked shift instruction
US10552131B2 (en) * 2017-10-16 2020-02-04 Microsoft Technology Licensing, Llc Barrier reduction during code translation
US11270201B2 (en) 2017-12-29 2022-03-08 Intel Corporation Communication optimizations for distributed machine learning
DE102018202398A1 (en) * 2018-02-16 2019-08-22 Siemens Aktiengesellschaft Simplified program creation for components of automation systems
CN112005213A (en) * 2018-02-27 2020-11-27 谷歌有限责任公司 Large lookup tables for image processors
JP7035751B2 (en) * 2018-04-12 2022-03-15 富士通株式会社 Code conversion device, code conversion method, and code conversion program
US10908906B2 (en) 2018-06-29 2021-02-02 Intel Corporation Apparatus and method for a tensor permutation engine
US11709911B2 (en) * 2018-10-03 2023-07-25 Maxim Integrated Products, Inc. Energy-efficient memory systems and methods
CN110058882B (en) * 2019-03-14 2023-01-06 深圳市比昂芯科技有限公司 OPU instruction set definition method for CNN acceleration
CN113867789A (en) * 2020-06-30 2021-12-31 上海寒武纪信息科技有限公司 Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN113867790A (en) * 2020-06-30 2021-12-31 上海寒武纪信息科技有限公司 Computing device, integrated circuit chip, board card and computing method
EP4244726A1 (en) * 2020-11-16 2023-09-20 Ascenium, Inc. Highly parallel processing architecture with compiler
US11847100B2 (en) 2020-11-19 2023-12-19 Alibaba Group Holding Limited Distributed file system servicing random-access operations
CN112860267B (en) * 2021-04-23 2021-07-30 武汉深之度科技有限公司 Kernel cutting method and computing device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4445177A (en) * 1981-05-22 1984-04-24 Data General Corporation Digital data processing system utilizing a unique arithmetic logic unit for handling uniquely identifiable addresses for operands and instructions
US5253308A (en) * 1989-06-21 1993-10-12 Amber Engineering, Inc. Massively parallel digital image data processor using pixel-mapped input/output and relative indexed addressing
WO1994009595A1 (en) * 1991-09-20 1994-04-28 Shaw Venson M Method and apparatus including system architecture for multimedia communications
WO2007071883A2 (en) * 2005-12-19 2007-06-28 Dxo Labs Digital data processing method and system
CN102937889A (en) * 2011-04-07 2013-02-20 威盛电子股份有限公司 Control register mapping in heterogenous instruction set architecture processor

Family Cites Families (79)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4377803A (en) * 1980-07-02 1983-03-22 International Business Machines Corporation Algorithm for the segmentation of printed fixed pitch documents
US5522082A (en) 1986-01-23 1996-05-28 Texas Instruments Incorporated Graphics display processor, a graphics display system and a method of processing graphics data with control signals connected to a central processing unit and graphics circuits
US4835712A (en) 1986-04-14 1989-05-30 Pixar Methods and apparatus for imaging volume data with shading
JP2554255B2 (en) 1987-03-23 1996-11-13 旭光学工業株式会社 Filtering device
EP0293701B1 (en) 1987-06-01 1994-08-10 Applied Intelligent Systems, Inc. Parallel neighborhood processing system and method
US4935894A (en) 1987-08-31 1990-06-19 Motorola, Inc. Multi-processor, multi-bus system with bus interface comprising FIFO register stocks for receiving and transmitting data and control information
JP3482660B2 (en) 1993-09-08 2003-12-22 ソニー株式会社 Image data processing apparatus and image data processing method
US5612693A (en) 1994-12-14 1997-03-18 International Business Machines Corporation Sliding window data compression using a toroidal bit shift register
KR100415417B1 (en) 1996-01-15 2004-04-17 지멘스 악티엔게젤샤프트 Image-processing processor
US6031573A (en) 1996-10-31 2000-02-29 Sensormatic Electronics Corporation Intelligent video information management system performing multiple functions in parallel
US5892962A (en) 1996-11-12 1999-04-06 Lucent Technologies Inc. FPGA-based processor
US6366289B1 (en) 1998-07-17 2002-04-02 Microsoft Corporation Method and system for managing a display image in compressed and uncompressed blocks
US6587158B1 (en) 1998-07-23 2003-07-01 Dvdo, Inc. Method and apparatus for reducing on-chip memory in vertical video processing
US7010177B1 (en) 1998-08-27 2006-03-07 Intel Corporation Portability of digital images
US6389164B2 (en) * 1998-09-23 2002-05-14 Xerox Corporation Image segmentation apparatus and method
US6529629B2 (en) * 1998-09-23 2003-03-04 Xerox Corporation Image segmentation apparatus and method
EP1164544B1 (en) 1999-03-16 2011-11-02 Hamamatsu Photonics K.K. High-speed vision sensor
JP3922859B2 (en) 1999-12-28 2007-05-30 株式会社リコー Image processing apparatus, image processing method, and computer-readable recording medium storing program for causing computer to execute the method
US6745319B1 (en) 2000-02-18 2004-06-01 Texas Instruments Incorporated Microprocessor with instructions for shuffling and dealing data
US6728862B1 (en) * 2000-05-22 2004-04-27 Gazelle Technology Corporation Processor array and parallel data processing methods
US6728722B1 (en) 2000-08-28 2004-04-27 Sun Microsystems, Inc. General data structure for describing logical data spaces
US6986025B2 (en) 2001-06-11 2006-01-10 Broadcom Corporation Conditional execution per lane
US7286717B2 (en) 2001-10-31 2007-10-23 Ricoh Company, Ltd. Image data processing device processing a plurality of series of data items simultaneously in parallel
KR100451584B1 (en) * 2001-12-20 2004-10-08 엘지전자 주식회사 Device for encoding and decoding a moving picture using of a wavelet transformation and a motion estimation
JP4146654B2 (en) 2002-02-28 2008-09-10 株式会社リコー Image processing circuit, composite image processing circuit, and image forming apparatus
US9170812B2 (en) 2002-03-21 2015-10-27 Pact Xpp Technologies Ag Data processing system having integrated pipelined array data processor
WO2003088033A1 (en) 2002-04-09 2003-10-23 University Of Rochester Multiplier-based processor-in-memory architectures for image and graphics processing
WO2004021176A2 (en) 2002-08-07 2004-03-11 Pact Xpp Technologies Ag Method and device for processing data
GB2398446B (en) 2003-02-12 2006-06-07 Snell & Wilcox Ltd Image processing
US20060044576A1 (en) 2004-07-30 2006-03-02 Kabushiki Kaisha Toshiba Apparatus for image processing
US7667764B2 (en) 2004-06-04 2010-02-23 Konica Minolta Holdings, Inc. Image sensing apparatus
JP4219887B2 (en) 2004-12-28 2009-02-04 富士通マイクロエレクトロニクス株式会社 Image processing apparatus and image processing method
US20100122105A1 (en) 2005-04-28 2010-05-13 The University Court Of The University Of Edinburgh Reconfigurable instruction cell array
US7882339B2 (en) 2005-06-23 2011-02-01 Intel Corporation Primitives to enhance thread-level speculation
JP2007067917A (en) 2005-08-31 2007-03-15 Matsushita Electric Ind Co Ltd Image data processing apparatus
US7602974B2 (en) 2005-10-21 2009-10-13 Mobilic Technology (Cayman) Corp. Universal fixed-pixel-size ISP scheme
US7861060B1 (en) * 2005-12-15 2010-12-28 Nvidia Corporation Parallel data processing systems and methods using cooperative thread arrays and thread identifier values to determine processing behavior
US7802073B1 (en) 2006-03-29 2010-09-21 Oracle America, Inc. Virtual core management
US20080111823A1 (en) 2006-11-13 2008-05-15 Faraday Technology Corp. Graphics processing system
EP1927949A1 (en) 2006-12-01 2008-06-04 Thomson Licensing Array of processing elements with local registers
US8321849B2 (en) 2007-01-26 2012-11-27 Nvidia Corporation Virtual architecture and instruction set for parallel thread computing
US20080244222A1 (en) 2007-03-30 2008-10-02 Intel Corporation Many-core processing using virtual processors
US8068114B2 (en) 2007-04-30 2011-11-29 Advanced Micro Devices, Inc. Mechanism for granting controlled access to a shared resource
JP4389976B2 (en) 2007-06-29 2009-12-24 ブラザー工業株式会社 Image processing apparatus and image processing program
JP2009021459A (en) 2007-07-13 2009-01-29 Fuji Xerox Co Ltd Method for driving surface emitting semiconductor laser and optical transmission module
CN101796821B (en) 2007-09-05 2012-12-12 国立大学法人东北大学 Solid state imaging element and drive method thereof
JP4917561B2 (en) 2008-03-18 2012-04-18 株式会社リコー Image processing device
KR101474478B1 (en) 2008-05-30 2014-12-19 어드밴스드 마이크로 디바이시즈, 인코포레이티드 Local and global data share
JP4999791B2 (en) 2008-06-30 2012-08-15 キヤノン株式会社 Information processing apparatus, control method thereof, and program
US8456480B2 (en) 2009-01-14 2013-06-04 Calos Fund Limited Liability Company Method for chaining image-processing functions on a SIMD processor
US8332794B2 (en) 2009-01-22 2012-12-11 Taiwan Semiconductor Manufacturing Company, Ltd. Circuits and methods for programmable transistor array
KR101572879B1 (en) 2009-04-29 2015-12-01 삼성전자주식회사 Dynamic parallel system and method for parallel application program
US20110055495A1 (en) 2009-08-28 2011-03-03 Qualcomm Incorporated Memory Controller Page Management Devices, Systems, and Methods
US8976195B1 (en) 2009-10-14 2015-03-10 Nvidia Corporation Generating clip state for a batch of vertices
US8436857B2 (en) 2009-10-20 2013-05-07 Oracle America, Inc. System and method for applying level of detail schemes
US8595428B2 (en) 2009-12-22 2013-11-26 Intel Corporation Memory controller functionalities to support data swizzling
US8749667B2 (en) 2010-08-02 2014-06-10 Texas Instruments Incorporated System and method for maintaining maximum input rate while up-scaling an image vertically
US8508612B2 (en) 2010-09-30 2013-08-13 Apple Inc. Image signal processor line buffer configuration for processing ram image data
US8797323B2 (en) 2011-01-18 2014-08-05 Intel Corporation Shadowing dynamic volumetric media
JP5875530B2 (en) 2011-01-31 2016-03-02 株式会社ソシオネクスト Program generating device, program generating method, processor device, and multiprocessor system
US9092267B2 (en) 2011-06-20 2015-07-28 Qualcomm Incorporated Memory sharing in graphics processing unit
US20130027416A1 (en) 2011-07-25 2013-01-31 Karthikeyan Vaithianathan Gather method and apparatus for media processing accelerators
JP5742651B2 (en) 2011-10-15 2015-07-01 コニカミノルタ株式会社 Image processing apparatus, linkage method, and linkage program
JP5746100B2 (en) 2011-12-27 2015-07-08 京セラドキュメントソリューションズ株式会社 Image forming apparatus
US9348676B2 (en) * 2012-01-05 2016-05-24 Google Technology Holdings LLC System and method of processing buffers in an OpenCL environment
FR2985523B1 (en) 2012-01-06 2014-11-28 Commissariat Energie Atomique POROUS ELECTRODE FOR PROTON EXCHANGE MEMBRANE
US8823736B2 (en) 2012-01-20 2014-09-02 Intel Corporation Graphics tiling architecture with bounding volume hierarchies
US10244246B2 (en) 2012-02-02 2019-03-26 Texas Instruments Incorporated Sub-pictures for pixel rate balancing on multi-core platforms
US9235769B2 (en) 2012-03-15 2016-01-12 Herta Security, S.L. Parallel object detection method for heterogeneous multithreaded microarchitectures
TWI520598B (en) 2012-05-23 2016-02-01 晨星半導體股份有限公司 Image processing apparatus and image processing method
US20140019486A1 (en) 2012-07-13 2014-01-16 Amitava Majumdar Logic Content Processing for Hardware Acceleration of Multi-Pattern Search
US9232139B2 (en) 2012-07-24 2016-01-05 Apple Inc. Image stabilization using striped output transformation unit
US9378181B2 (en) 2012-11-09 2016-06-28 Intel Corporation Scalable computing array
US9058673B2 (en) 2013-03-15 2015-06-16 Oracle International Corporation Image mosaicking using a virtual grid
US8954992B2 (en) 2013-03-15 2015-02-10 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Distributed and scaled-out network switch and packet processing
US9374542B2 (en) 2014-03-28 2016-06-21 Intel Corporation Image signal processor with a block checking circuit
US9684944B2 (en) 2015-01-16 2017-06-20 Intel Corporation Graph-based application programming interface architectures with node-based destination-source mapping for enhanced image processing parallelism
US9749548B2 (en) 2015-01-22 2017-08-29 Google Inc. Virtual linebuffers for image signal processors
US9785423B2 (en) * 2015-04-23 2017-10-10 Google Inc. Compiler for translating between a virtual image processor instruction set architecture (ISA) and target hardware having a two-dimensional shift array structure


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158636A (en) * 2019-12-03 2020-05-15 中国人民解放军战略支援部队信息工程大学 Reconfigurable computing structure and routing addressing method and device of multiply-accumulate computing processing array
CN111158636B (en) * 2019-12-03 2022-04-05 中国人民解放军战略支援部队信息工程大学 Reconfigurable computing structure and routing addressing method and device of computing processing array
CN113032013A (en) * 2021-01-29 2021-06-25 成都商汤科技有限公司 Data transmission method, chip, equipment and storage medium

Also Published As

Publication number Publication date
US11182138B2 (en) 2021-11-23
WO2016171846A1 (en) 2016-10-27
GB2554204A (en) 2018-03-28
GB201715795D0 (en) 2017-11-15
US20200201612A1 (en) 2020-06-25
US20190004777A1 (en) 2019-01-03
CN110149802B (en) 2021-04-16
US10095492B2 (en) 2018-10-09
EP3286645A1 (en) 2018-02-28
US10599407B2 (en) 2020-03-24
US20170242669A1 (en) 2017-08-24
US9785423B2 (en) 2017-10-10
GB2554204B (en) 2021-08-25
US20160313984A1 (en) 2016-10-27

Similar Documents

Publication Publication Date Title
CN110149802A (en) Compiler for being translated between the target hardware with two-dimensional shift array structure in Virtual Image Processor instruction set architecture (ISA)
US11196953B2 (en) Block operations for an image processor having a two-dimensional execution lane array and a two-dimensional shift register
CN107533750A (en) Virtual Image Processor instruction set architecture(ISA)With memory model and the exemplary goal hardware with two-dimensional shift array structure
KR102232723B1 (en) Core processor for block operation on image processor with two-dimensional execution lane array and two-dimensional shift register
CN107533751A (en) Line buffer unit for image processor
CN110192220A (en) The program code conversion of efficiency when improving image processor operation
JP6750022B2 (en) Macro I/O unit for image processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant