Processor architecture
TECHNICAL FIELD OF THE INVENTION
This invention relates to data processing in a multi-processor system, and in particular to pixel processing in fixed and mobile appliances, for example televisions, mobile phones or PDAs.
BACKGROUND TO THE INVENTION
In a system such as a mobile phone with an integrated camera, a still image or a video sequence is acquired with a camera sensor. The pixels acquired in the still image, or in each frame of the video sequence, may need to be processed to correct lens effects, perform zooming or smart colouring, or be filtered to produce a clean digital image. This digital image can then be encoded (e.g. compressed). The encoded image or video can then be stored, for example on a hard-disk drive or other storage device, or it can be transmitted.

In other devices, or in the same device, an image or video sequence can be loaded from a storage device, or received from a transmission channel. The received image or video may be encoded, and must therefore be decoded. The decoded image or video frames may then require some pixel processing to enhance the image quality. The resulting images are then displayed.

Pixel processing typically has higher computational demands than encoding or decoding, particularly in terms of the number of calculations required per second. For this reason, pixel processing is often tackled by an array of processors or data-path elements operating in parallel. Typically, different processors process different pixels in a frame at the same time. Therefore, the array of processors must access the pixels in parallel in order to achieve the required real-time performance. However, sensors and displays typically produce or receive pixels serially, that is, one pixel at a time. Therefore, the pixel processing array must interface with the rest of the system via serial writers and readers. Furthermore, where the pixel processing array interfaces with the encoder or decoder, this interface typically takes place through memory into which pixels are written or read serially.
Figure 1 illustrates a pixel processing system 2 that uses internal memories between a serial writer 4 and a serial reader 6 on the one hand, and a pixel processing array 8 on the other. In this system, both serial and parallel access to pixels is enabled. Typically, an image frame is divided into horizontal lines, with a number of pixels arranged side-by-side to form a line. In this way, for example, a Common
Intermediate Format (CIF) resolution video frame would comprise 288 lines, with each line having 352 pixels.

Pixels are received serially from a pixel source 10, such as a camera sensor or a storage device containing previously captured images, and are written by the serial writer 4 into a first local memory 12. Each memory address in the first local memory 12 corresponds to a pixel in a line of the image frame. The writing process is typically performed line by line for each frame. Therefore, the first local memory 12 will have as many addresses as there are pixels in a line for the given resolution. For example, for a CIF resolution frame, the first local memory 12 will have 352 addresses, as there are 352 pixels in a line.

When all of the addresses of the first local memory 12 are filled with the data relating to pixels in a line, the entire contents of the first local memory 12 are read out in parallel, and are loaded into a single address of a memory 14, known as a line memory. In the line memory 14, each address corresponds to a different line in the image frame. Therefore, each memory word must be large enough to accommodate all of the pixels in the line. In addition, the memory needs to have as many addresses as the number of lines in the frame that need to be concurrently visible to the pixel processing array 8. This number depends on the algorithm being executed by the pixel processing array 8, so the system hardware should be configured to allow for the worst case (i.e. where the largest number of lines needs to be concurrently available).

Once a line has been processed and is ready for display or storage, that line is read out from the line memory 14, and loaded into a second local memory 16. As with the first local memory 12, each address in the second local memory 16 corresponds to a single pixel, and the second local memory 16 must have as many addresses as there are pixels in a line. From this second local memory 16, the serial reader 6 reads out one pixel at a time to a pixel sink 18, which may be a display or storage device. However, a disadvantage with the system in Figure 1 is that two extra local memories, 12 and 16, are required to allow for serial pixel writing and reading.
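The serial-to-parallel staging just described can be illustrated, purely as a behavioural sketch in C rather than a description of the actual hardware, as follows; the type names, the choice of eight concurrently visible lines and the function name are assumptions made for the illustration only.

    #include <stdint.h>
    #include <string.h>

    #define PIXELS_PER_LINE 352   /* CIF line width, as in the example above            */
    #define LINES_VISIBLE   8     /* N concurrently visible lines; the worst case is    */
                                  /* algorithm dependent, 8 is only an example          */

    typedef uint16_t pixel_t;     /* holds one B-bit pixel (B = 10 or 12) per address   */

    static pixel_t first_local_mem[PIXELS_PER_LINE];          /* first local memory 12 */
    static pixel_t line_mem[LINES_VISIBLE][PIXELS_PER_LINE];  /* line memory 14        */

    /* Serial writer 4: one pixel per call, one memory address per pixel. */
    void serial_write(unsigned pixel_index, pixel_t value)
    {
        first_local_mem[pixel_index] = value;

        /* Once a whole line has been written, its contents are transferred in
         * parallel into a single address (here, one row) of the line memory 14. */
        if (pixel_index == PIXELS_PER_LINE - 1) {
            static unsigned next_line = 0;
            memcpy(line_mem[next_line], first_local_mem, sizeof first_local_mem);
            next_line = (next_line + 1) % LINES_VISIBLE;
        }
    }

The read-out path, through the second local memory 16 and the serial reader 6, mirrors this write path in the opposite direction.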
Figure 2 shows a more detailed view of the pixel processing array 8 and line memory 14 in Figure 1. A number, N, of lines from an image frame are concurrently stored in the line memory 14. Each address in the line memory 14 contains the data for all of the pixels in a line in the image frame. Each pixel comprises B bits of information. Typically, B is 10 or 12. It should be noted that some processing algorithms require lines from different frames to be concurrently stored, although this is not represented in the figure.

The pixel processing array 8 comprises a number of processors 20. Each processor 20 is responsible for M pixels in each frame line. In total, each processor 20 is responsible for M×N pixels. To access the data for a particular pixel, the processor 20 must access the address in the line memory 14 containing the data for that pixel. However, as the data for the required pixel is stored in a memory address together with the data for all of the other pixels in a line of the image, the processor 20 must retrieve the data for all of the pixels in that address, and then isolate the data for the required pixel from the other data, in order to perform a processing operation on that pixel. Therefore, due to this particular configuration of the line memory 14, a processor 20 performs a memory access and retrieves data relating to pixels for which it is not responsible. As each processor 20 can only access all of the pixels in a line word in parallel, several shift and masking operations need to be performed in order to process individual pixels. This translates into higher processing time and higher power dissipation, and also makes the array harder to program.

An alternative memory configuration is shown in Figure 3. Here, each processor 20 is responsible for all of the pixels in a frame line. Each pixel in the image is loaded into a different address of the memory 22. Therefore, instead of there being a different memory address for each line of pixels in the image as in Figure 2, the memory 22 in Figure 3 has M addresses. The total memory capacity is unchanged when compared with the system in Figure 2. The advantage of this arrangement is that the processors 20 can access individual pixels one at a time, without having to extract them from a larger parallel word containing several pixels. However, individual processors 20 cannot address the memory 22 independently. Instead, when the memory 22 is addressed, all processors 20 must operate on the pixels in that particular address. This limitation is acceptable for some pixel processing algorithms, but newer picture quality enhancement algorithms require different processors to concurrently and independently address pixels in different frame lines.
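The difference in pixel access between the two prior-art configurations can be illustrated with the following C sketch; the bit width B = 10, the representation of the wide line-memory word as an array of 32-bit chunks, and the function names are assumptions introduced only for this illustration.

    #include <stdint.h>

    #define B 10    /* bits per pixel (typically 10 or 12) */

    /* Figure 2 style: the whole line is one wide word; to reach pixel p the
     * processor must fetch the word and then shift and mask.                 */
    uint32_t get_pixel_packed(const uint32_t *line_word_bits, unsigned p)
    {
        unsigned bit   = p * B;
        unsigned chunk = bit / 32;
        unsigned shift = bit % 32;
        uint32_t mask  = (1u << B) - 1u;
        uint32_t lo    = line_word_bits[chunk] >> shift;

        if (shift + B <= 32)
            return lo & mask;
        /* the B-bit field straddles a 32-bit chunk boundary */
        return (lo | (line_word_bits[chunk + 1] << (32 - shift))) & mask;
    }

    /* Figure 3 style: one pixel per address, so a single read returns the pixel. */
    uint32_t get_pixel_per_address(const uint16_t *pixel_mem, unsigned p)
    {
        return pixel_mem[p];
    }

In the Figure 2 arrangement every pixel access involves this shift-and-mask step (and, in hardware, retrieval of the entire wide word), whereas in the Figure 3 arrangement a single narrow read suffices, at the cost of all processors 20 having to share one address.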
There is therefore a need for a flexible multi-processor system for pixel processing that allows different processors to have concurrent and independent access to pixels in memory. In addition, it is desirable that the system be adaptable to differing requirements, for example an increased number of pixels in a frame line (i.e. an increased resolution) or an increased frame rate, without the processing system having to be completely redesigned.
SUMMARY OF THE INVENTION
There is therefore provided an image processing system comprising: first and second memories, each memory having a plurality of addresses, each address adapted to store the data for a single pixel from an image; means for writing data for a plurality of pixels to the first and second memories, the data being written such that each address contains the data for a single pixel; first and second processors connected to the first and second memories respectively, the processors being adapted to access the data for a single pixel stored in an address of their respective memory, process the data according to a processing algorithm and write the processed data back to an address in their respective memory; and means for reading data for the plurality of pixels from the first and second memories.

Therefore, each processor can address its own memory (and specific pixels within that memory) at the same time that other processors perform a memory access. Therefore, the present invention allows different processors to work concurrently on different lines or portions of an image frame, allows the time required to process a pixel to be reduced, and hence provides higher flexibility and programmability for arbitrary processing algorithms.

Preferably, the first and second processors are interconnected, so that the first processor can request that the second processor provide the data for a pixel stored in the second memory to the first processor, and vice versa. In an alternative embodiment, the first processor is connected to the second memory, such that the first processor can access the data for a pixel stored in an address of the second memory.

In one embodiment, the first and second memories are multi-port memories to allow concurrent access by the means for writing data, the connected processor and the means for reading data. Alternatively, the first and second memories have wrappers to allow concurrent access by the means for writing data, the connected processor and the means for reading data.
Preferably, the wrapper operates with a faster clock than the clock of the image processing system, thereby allowing the memories to be accessed several times during a single system clock cycle. Alternatively, the wrapper assigns a priority level to concurrently received access requests from the means for writing data, the connected processor or the means for reading data, and accesses the address specified in the request with the highest priority.

Preferably, the means for writing data and the means for reading processed data comprise first and second routers connected to the first and second memories respectively, the first router being connected to the second router. Preferably, the first router is connected to a serial writer that receives serialised pixel data from a pixel source. Preferably, the second router is connected to a serial reader. Preferably, the serial writer or reader considers the first and second memories as a single memory, the number of addresses in the single memory being equal to the sum of the number of addresses in each of the first and second memories. Preferably, the serial writer or reader accesses an address in one of the first or second memories by specifying the identity of one of the first or second routers, and the memory address in the memory connected to the specified router.

According to an alternative embodiment, the means for writing data comprises first and second routers connected to the first and second memories respectively, the first router being connected to the second router. Preferably in this embodiment, the first router is connected to a serial writer that receives serialised pixel data from a pixel source. Preferably in this embodiment, the means for reading processed data comprises third and fourth routers connected to the first and second memories respectively, the third router being connected to the fourth router. Preferably, the third router is connected to a serial reader.

According to a third alternative embodiment, the means for writing data comprises a serial writer connected to the first and second memories. Preferably, the means for reading processed data comprises a serial reader connected to the first and second memories. Preferably, the serial writer or reader accesses an address in one of the first or second memories by specifying the identity of one of the first or second memories, and the required memory address.
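Purely as an illustrative behavioural model of the arbitrating wrapper described above, and not a definitive hardware implementation, concurrent requests could be serialised in C as follows; the priority ordering, the memory size and the names used are assumptions made for the sketch.

    #include <stdint.h>
    #include <stdbool.h>

    typedef enum { SRC_SERIAL_WRITER = 0, SRC_PROCESSOR = 1, SRC_SERIAL_READER = 2 } source_t;

    typedef struct {
        bool     pending;
        bool     is_write;
        unsigned address;
        uint16_t data;            /* valid for write requests only */
    } request_t;

    static uint16_t pixel_mem[1024];   /* one pixel per address; illustrative size only */

    /* Of all requests pending in this system clock cycle, the one from the
     * highest-priority source is serviced; the others must be retried.  Here
     * the serial writer is (arbitrarily) given the highest priority.          */
    int wrapper_cycle(request_t req[3], uint16_t *read_data)
    {
        static const source_t priority[3] = { SRC_SERIAL_WRITER, SRC_PROCESSOR, SRC_SERIAL_READER };

        for (int i = 0; i < 3; i++) {
            request_t *r = &req[priority[i]];
            if (!r->pending)
                continue;
            if (r->is_write)
                pixel_mem[r->address] = r->data;
            else
                *read_data = pixel_mem[r->address];
            r->pending = false;
            return (int)priority[i];   /* identity of the granted source */
        }
        return -1;                     /* no request pending this cycle  */
    }

A wrapper of the first kind described above would instead service up to three pending requests in turn within one system clock cycle, by clocking the memory faster than the rest of the system.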
According to a second aspect of the present invention, there is provided a portable device comprising an image processing system as described above.

Therefore, the present invention allows serial data devices to access the memories to read or write data as if the memories were unified into a single memory space, whilst still allowing each processor to independently access individual pixels stored in the memories. Furthermore, the same memories (the first and second) are used for both serial and parallel access, which means that the silicon area required by the processing system is reduced compared to the system in Figure 3. Finally, unlike the system shown in Figure 2, a system according to the present invention can be easily modified to address higher resolution images (i.e. more pixels per line, and/or more lines) simply by increasing the size (i.e. number of addresses) of the memories connected to the processors. Therefore, no changes to the number of processors, the processor architecture or the interface with the connected memories are required.

There is therefore provided a multi-processor system for processing data that is easier and quicker to design and implement than alternative single-processor solutions.
BRIEF DESCRIPTION OF THE DRAWINGS
For a better understanding of the invention, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the accompanying drawings, in which:
Figure 1 is a block diagram of a conventional pixel processing system with internal memories;
Figure 2 shows the configuration of an internal memory of the pixel processing system in Figure 1;
Figure 3 shows an alternative internal memory configuration for the pixel processing system in Figure 1;
Figure 4 shows a processing system according to a first embodiment of the present invention;
Figure 5 shows an alternative embodiment of part of the processing system shown in Figure 4;
Figure 6 is a block diagram of part of the processing system shown in Figure 4;
Figure 7 is a block diagram of another part of the processing system shown in Figure 4;
Figure 8 is a block diagram of an alternative processing system according to the first embodiment of the present invention;
Figure 9 is a block diagram of a processor according to the invention; and
Figure 10 is a block diagram of the communication mechanism between processors.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Figure 4 shows a processing system according to a first embodiment of the present invention. The processing system 30 comprises a number of processors 32a, 32b, 32c, 32d, 32e and 32f, each connected to respective memories 34a, 34b, 34c, 34d, 34e and 34f. Each processor can access the data stored in its respective memory. Herein, 'access' or 'accessing' means retrieving data from, or writing data to, a memory. Each of the memories 34a-34f has a number of addresses, and each address can store a number of bits. As the processing system 30 is used for processing image data, each memory address stores the information for one pixel in the image. It will be appreciated that there may be more or fewer addresses in the memory than there are pixels in a line.

Since each processor has a memory associated therewith, a processor can access the data stored in its memory independently of the other processors. Furthermore, as the memories store information for pixels in one or more line portions of an image, each processor can work on a line of the image, or a portion of the image, independently of the other processors. This allows the algorithm designer maximum flexibility in designing the processing algorithm.

As pixel data is usually provided from a pixel source 36 serially (i.e. one pixel at a time), a serial writer 38 is provided to write the pixel data into the appropriate memories 34a-34f. To allow the serial writer 38 to do this, each memory 34a-34f is connected to a respective router 40a, 40b, 40c, 40d, 40e or 40f, and each of these routers is connected to at least one of the other routers. The serial writer 38 is connected to router 40a. In this illustrated embodiment, the routers 40a-40f are connected in a chain with the serial writer 38 at one end of the chain, and a serial reader 42 (connected to a pixel sink 44, for example a display) at the other end.
However, it will be appreciated that the routers do not have to be connected in a chain and may be connected in other ways, as long as a communication path exists between the serial writer 38 and serial reader 42 and each of the routers.

The routers 40a-40f allow the serial writer 38 to treat the memories 34a-34f as a single 'combined' memory space. The number of addresses in this single memory is equal to the sum of the number of addresses in each of the memories 34a-34f. The serial writer 38 provides a memory address in the single memory space to the first router in the chain, 40a, and this router passes the memory address on to each of the other routers. Each router then determines whether its associated memory contains the address specified by the serial writer 38. If a router determines that its associated memory does contain the specified address, a memory access is performed on that memory, and the appropriate data is written to the specified address. The serial reader 42 may request a memory access in the same way, although, of course, the data is read out from the specified address and passed to the serial reader 42.

As described above, to access a memory address in one of the memories 34a-34f, the serial writer 38 or serial reader 42 provides an address in the single memory space. Preferably, this address will comprise two parts. The first part will identify one of the memories 34a-34f as the memory to be accessed (in other words, it identifies the router connected to the memory that is to be accessed), and the second part will identify the particular address in the memory that should be accessed. As described, the address in the single memory space is provided to each of the routers 40a-40f. Each router examines the address and determines whether it is the router identified in the first part. If a router determines that it is the router identified in the first part of the address, the router forwards the second part of the address to its respective memory (along with the data to be written into this memory address, if the access has been requested by the serial writer 38). The memory address specified in the second part of the address is then accessed. If a router determines that it is not the router identified in the first part of the memory address, the router does not pass the second part of the address to its respective memory.

It will be noted that in this illustrated embodiment, each router 40a-40f is shown having two separate connections to its respective memory 34a-34f. These two separate connections represent the routers 40a-40f being able to access their memories to read and write data.
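The two-part addressing just described can be illustrated with the following C sketch; the field widths, the linked-list representation of the router chain and the function names are assumptions introduced purely for the illustration, not features of the described hardware.

    #include <stdint.h>

    #define ADDR_BITS 9   /* local address bits; assumes up to 512 pixels per memory */

    typedef struct router {
        unsigned       id;            /* value matched against the first part (ROUTER_ID) */
        uint16_t      *memory;        /* local memory, one pixel per address              */
        struct router *next;          /* next router in the chain (NULL at the end)       */
    } router_t;

    /* The serial writer forms a "global" address from the two parts described
     * above: the upper bits identify the router, the lower bits the local address. */
    static inline unsigned global_address(unsigned router_id, unsigned local_addr)
    {
        return (router_id << ADDR_BITS) | local_addr;
    }

    /* Each router inspects the first part of the address; only the router whose
     * identity matches forwards the second part (and the data) to its memory.   */
    void route_write(router_t *chain, unsigned glob_addr, uint16_t data)
    {
        unsigned router_id  = glob_addr >> ADDR_BITS;
        unsigned local_addr = glob_addr & ((1u << ADDR_BITS) - 1u);

        for (router_t *r = chain; r != NULL; r = r->next) {
            if (r->id == router_id) {
                r->memory[local_addr] = data;   /* the memory access is performed here */
                return;
            }
            /* otherwise the request is simply passed on to the next router */
        }
    }

A read requested by the serial reader 42 would be routed in the same way, with the selected router returning the contents of the addressed location rather than writing to it.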
As shown, each processor must also be able to access its respective memory to read and write data. Therefore, each memory 34a-34f must allow for multi-port access. In one embodiment of the invention, the memories are multi-port memories with as many physical ports as access points required (e.g. three ports in Figure 4). However, multi-port memories have the disadvantage that, typically, they are quite large and hence consume a large silicon area, and also consume a large amount of power in comparison with single-port memories.

Therefore, in a preferred embodiment of the present invention, each memory 34a-34f is provided with a wrapper that controls access to the memory from the rest of the system 30. The wrapper allows several different sources to access the memory, without the memory being a true 'multi-port' memory. There are two main ways that the wrapper can facilitate this according to the invention. Firstly, the wrapper can be an 'asynchronous' wrapper, in that the wrapper clocks the memory with a clock that is faster than the rest of the system, therefore appearing to provide multi-port access to the memory. For example, in the system shown in Figure 4, the wrapper would clock the memory three times faster than the rest of the system, thus allowing a router and the processor to access the memory during a single clock cycle of the rest of the system. Alternatively, the wrapper may include an arbiter that serialises concurrently received memory access requests according to a predetermined priority scale, thereby allowing access to the memory by the source having the highest priority. Alternatively, the wrapper may be a combination of the above two types. For example, in a system where there are five access points to the memory (see Figure 5 below), the wrapper may clock the memory two times faster than the system clock, and prioritise the access requests if more than two access requests are received during any single system clock cycle.

Therefore, a system is described that allows processors of a processing system to independently access their memories, and to independently access specific pixels in those memories, whilst allowing the system to interface with external serial data devices.

However, it will be appreciated that many image processing algorithms require data from different frame lines to be combined during processing. As described above, each memory associated with a processor will contain the data relating to pixels in a portion of the image frame (a portion being a complete line of the
image, or an area covered by pixels from multiple lines in the image). Therefore, in a preferred embodiment of the invention, to allow the processors 32a-32f to combine or consider pixels from multiple regions of the frame during processing (and hence data from several different memories 34a-34f), the processors 32a-32f are interconnected. That is, each processor 32a-32f is connected to at least one other processor 32a-32f in the processing system 30. In a preferred embodiment, these connections are Data Path Connections, or DPCs. Data Path Connections preferably connect a processor 32a-32f to all of its immediate neighbours. If a first processor requires a pixel that is in the memory of a second, neighbouring, processor, the second processor is instructed to perform a memory access for that pixel, and then passes the retrieved data to the first processor via the DPC. Alternatively, if the second processor is not a direct neighbour of the first processor, the pixel can be passed to the first processor via multiple processors. In either case, several system clock cycles may be required to complete the pixel communication.

It will be appreciated that although only six processors and six memories are described and illustrated in the system shown in Figure 4, an actual processing system may incorporate many more, or fewer, processors. Of course, each extra processor will be connected to its own memory, which is in turn connected to its own router. This router will be connected to at least one other router in the system, so that a serial writer or reader can access the memory connected to the router. Although the serial writer 38 and serial reader 42 are shown as separate components from the routers 40a-40f, in an alternative embodiment of the invention the routers can be omitted from the system if the serial writer and reader are connected directly to each of the memories 34a-34f.

Figure 5 shows an alternative embodiment of the processing system shown in
Figure 4. In order to reduce the amount of data being passed through the processors, and hence increase the overall processing speed of the system, each memory 34a-34c, in addition to being connected to its respective processor 32a-32c, is connected directly to a sub-set of the neighbouring processors 32a-32c. For example, memory 34b, which in the embodiment described above was associated only with processor 32b, is also connected to processors 32a and 32c. Consequently, more access points to the memories are required, and these can be realised using the techniques described above.

Figure 6 shows the connections between the system components in more detail. Here, only processors 32a and 32b are shown, and they are connected to their
respective memories 34a and 34b. Routers 40a and 40b are connected to the memories 34a and 34b respectively. Although the connections between the processors are not shown, it will be appreciated that this is for ease of illustration, and that, in accordance with a preferred embodiment of the invention, they can be connected.

The memories 34a and 34b are shown as having three 'ports'. The routers 40a and 40b are connected to their respective 'Write Port' so that they can write data to the memory, and the processors 32a and 32b are connected to their respective 'Read/Write Port' so that they can write processed data and retrieve unprocessed data from the memory. As shown in Figure 4, the third port, the 'Read Port', will also be connected to router 40a or 40b, although these connections are omitted for ease of illustration. The connections between processors 32a and 32b and their respective memories 34a and 34b are shown as comprising one line for specifying the required address, one for reading and writing data and one for a control signal, but it will be appreciated that there may be other arrangements having greater or fewer lines between the processors 32a and 32b and their respective memories 34a and 34b.

In Figure 6, four lines are shown between the serial writer 38 and router 40a, and between router 40a and router 40b. Specifically, one line carries the router identity (ROUTER_ID), and corresponds to the first part of the address described above. The second line carries the address in the memory (ADDRESS), and corresponds to the second part of the address described above. The third line (DATA) carries the data that is to be written into the memory address. Finally, the fourth line carries memory access control bits (CONTROL). It will be appreciated that the use of four lines is for ease of illustration only, and that other arrangements may be used in practice.

As described above, a router, if it determines that it is the router identified in the ROUTER_ID, passes the ADDRESS to its respective memory, along with the data (DATA) to be written to the memory address specified in the ADDRESS and the memory access control bits (CONTROL). In this manner, the serial writer 38 produces a "global" pixel address, comprising a first part corresponding to the ROUTER_ID, and a second part corresponding to the specific pixel address (ADDRESS) in the local memory.

A processor, when writing data to an address in its respective memory, specifies the appropriate address (ADDRESS), provides memory access control bits (CONTROL) and the data to be written to the address (DATA). Alternatively, if the processor is retrieving information from its respective memory, it specifies the appropriate
address (ADDRESS) and provides the memory access control bits (CONTROL), and receives the data in the address on the DATA line.

Figure 7 also shows the connections between the system components in more detail. Here, only processors 32e and 32f are shown, and they are connected to their respective memories 34e and 34f as described above. Routers 40e and 40f are connected to the 'Read Ports' of memories 34e and 34f respectively. Router 40e is connected to router 40f, which is also connected to the serial reader 42. The system shown in Figure 7 functions in exactly the same way as the system shown in Figure 6, except that data is now read from the memories 34e and 34f by the routers 40e and 40f and passed to the serial reader 42. It will be appreciated that, when the routers are connected in a chain (as in Figure 4), each router receives two sets of the ROUTER_ID, ADDRESS and CONTROL signals: one from the serial writer 38, and the other from the serial reader 42. In the alternative embodiment mentioned above, where the serial writer and reader are connected directly to the memories, the first part of the address (ROUTER_ID) will be used by the writer or reader internally to identify the memory that the data is to be written to, or read from.

Figure 8 shows an alternative processing system according to the first embodiment of the present invention. This system is similar to the system shown in Figure 4, although here there are separate router chains for the serial writer 50 and serial reader 52. Here, each memory 54a and 54b is connected to two routers, routers 56a and 58a, and 56b and 58b, respectively. Routers 56a and 56b allow the serial writer 50 to write data into the memories as though they are a single memory space, whilst routers 58a and 58b allow the serial reader 52 to read data from the memories as though they are a single memory space. Processors 60a and 60b are connected to their respective memories 54a and 54b, so that they are able to independently access their pixel data. Again, the connections between the processors are not shown.

Figure 9 illustrates the preferred embodiment of a processor. Each processor comprises one or more operation issue slots (IS), with each issue slot comprising one or more functional units. The processor in Figure 9 comprises two arithmetic and logic units (ALU), a multiply-accumulate unit (MAC), and a plurality of load/store units for managing data communication through the Data Path Connections, as well as access to the memory associated with the processor. Issue slot IS1 comprises two functional units, an ALU and a
MAC. Functional units in a common issue slot share read ports from a register file (RF) and write ports to an interconnect network IN. In an alternative embodiment, a second interconnect network could be used between the register files and the operation issue slots. The functional units in an issue slot have access to at least one register file associated with that issue slot. In Figure 9, there is one register file (RF) associated with each issue slot. Alternatively, more than one issue slot can be connected to a single register file. Alternatively, multiple, independent register files can be connected to a single issue slot (e.g. a different RF for each separate read port of a functional unit in the issue slot).

A controller CT, which has access to an instruction memory IM, controls the functional units. A program counter PC determines the current instruction address in the instruction memory IM. The instruction indicated by the current address is loaded into an internal instruction register (IR) in the controller. The controller then controls the data-path elements (functional units, register files and interconnect network) to perform the operations specified by the instruction stored in the instruction register IR. To achieve this, the controller communicates with the functional units via an opcode-bus OB, which provides opcodes to the functional units; with the register files via an address-bus AB, which provides addresses for reading and writing registers in the register file; and with the interconnect network IN through a routing-bus RB, which provides routing information to the interconnect multiplexers.

Figure 10 shows the communication mechanism between processors according to the invention. Preferably, the data path connections between load/store units in different processors 68a and 68b go through blocking FIFOs 70. The FIFOs use control signals hold_w and hold_r. When a load/store unit in a processor 68a is trying to write (store) to a FIFO
70 that is full, the signal hold_w is activated, halting the entire processor 68a (controller and data-path) until another processor 68b reads (loads) at least one sample from the FIFO 70, freeing up a FIFO position. A clock-gating mechanism can preferably be used to halt the processor 68a with the hold_w signal. When a load/store unit in a processor 68b is trying to read (load) from a FIFO
70 that is empty, the hold_r signal is activated, halting the entire processor 68b (controller and data-path) until another processor 68a writes (stores) at least one sample to the FIFO 70. The processor 68b can then read the sample. Again, a clock-gating mechanism can be used to halt the processor with the hold_r signal.
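As a purely behavioural C sketch of the blocking semantics described above (the actual mechanism being the clock-gating hardware, not software polling), the hold_w and hold_r behaviour of the FIFO 70 could be modelled as follows; the FIFO depth and the function names are assumptions made for the illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define FIFO_DEPTH 4              /* illustrative depth only */

    typedef struct {
        uint32_t data[FIFO_DEPTH];
        unsigned count, head, tail;
        bool     hold_w;              /* asserted while the writing processor is halted */
        bool     hold_r;              /* asserted while the reading processor is halted */
    } dpc_fifo_t;

    /* Store from processor 68a: if the FIFO is full, hold_w is asserted and the
     * whole processor (controller and data-path) stalls until a position frees up. */
    bool fifo_store(dpc_fifo_t *f, uint32_t sample)
    {
        if (f->count == FIFO_DEPTH) {
            f->hold_w = true;         /* processor 68a is clock-gated / halted        */
            return false;             /* retry once the reader has freed a position   */
        }
        f->hold_w = false;
        f->data[f->tail] = sample;
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count++;
        f->hold_r = false;            /* a waiting reader may now proceed             */
        return true;
    }

    /* Load into processor 68b: if the FIFO is empty, hold_r is asserted and the
     * reading processor stalls until the other processor stores a sample.            */
    bool fifo_load(dpc_fifo_t *f, uint32_t *sample)
    {
        if (f->count == 0) {
            f->hold_r = true;         /* processor 68b is clock-gated / halted        */
            return false;
        }
        f->hold_r = false;
        *sample = f->data[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
        f->hold_w = false;            /* a waiting writer may now proceed             */
        return true;
    }

In this model, a false return value stands for the stall: the calling processor cannot make progress until the other side of the FIFO changes its state.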
This hold mechanism provides data-driven synchronisation of the communication through the DPCs, guaranteeing that no data is lost during communication.

To achieve a degree of parallelism equivalent to that of the described system using a single processor, a much larger SIMD or VLIW machine must be designed. This implies longer, more complex design cycles, usually slower clock frequencies, and back-end design complications (e.g. wire congestion). With the multi-processor solution, the design task is reduced to that of designing a single, considerably smaller processor. That design is then simply replicated to form the complete system.

As each processor has independent access to its own memory, the present invention allows different processors to concurrently work on different lines of an image frame, and this provides higher flexibility and programmability for arbitrary processing algorithms. In addition, as this access can occur at the same time that other processors access other memories, the present invention can be easily integrated into a system in which serial readers and writers are present, whilst allowing efficient and high-performance parallel access by the processors. As the same memories (the first and second) are used for both serial and parallel access, the silicon area required by the processing system is reduced compared to the system in Figure 3.

Finally, a system designed according to the present invention can be easily adapted for use in systems with different requirements, such as the number of pixels in a line, or the number of frames processed per second. To adapt the processing system to images having a larger number of pixels per line, the size of the memories connected to the processors can be increased. In contrast, to adapt a prior art system (such as that shown in Figure 2) to an image with a larger number of pixels per line, the memory size must be increased, and additional processors are required to perform the processing on the extra pixels.

There is therefore provided a flexible multi-processor system for pixel processing that allows different processors to have concurrent and independent access to the pixel data.