WO2000036562A9 - Digital camera using programmed parallel computer for image processing functions and control - Google Patents

Digital camera using programmed parallel computer for image processing functions and control

Info

Publication number
WO2000036562A9
Authority
WO
WIPO (PCT)
Prior art keywords
pixel data
scan line
pixel
processor
pixels
Prior art date
Application number
PCT/US1999/029718
Other languages
French (fr)
Other versions
WO2000036562A1 (en)
Inventor
Todd E Rockoff
Robert Lang
Murray Wallace
Original Assignee
Intensys Corp
Todd E Rockoff
Robert Lang
Murray Wallace
Priority date
Filing date
Publication date
Application filed by Intensys Corp, Todd E Rockoff, Robert Lang, Murray Wallace filed Critical Intensys Corp
Priority to AU23622/00A priority Critical patent/AU2362200A/en
Priority to JP2000588733A priority patent/JP2002532810A/en
Priority to EP99967322A priority patent/EP1141891A1/en
Publication of WO2000036562A1 publication Critical patent/WO2000036562A1/en
Publication of WO2000036562A9 publication Critical patent/WO2000036562A9/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/0007Image acquisition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules

Definitions

  • the present invention relates to digital cameras and, in particular, to digital cameras that employ programmed parallel computers to accomplish image manipulation functions.
  • Digital cameras promise significant advantages for amateurs and professionals alike with respect to functionality, reliability, convenience, and cost. For example, whereas exposed film typically must be chemically developed before the images can be viewed (a time-consuming and expensive process), a digital camera image may be viewed directly through an LCD on the camera, viewed on a computer, printed using a color printer, or shared via the Internet. Despite such advantages, digital cameras remain largely a novelty today rather than a mainstream consumer product, accounting for only a small percentage of all camera sales. The slow adoption of digital camera technology is attributable to the relatively high cost, and relatively low image quality, that characterize today's digital cameras.
  • FIG. 1 is a block diagram of the DCAM-101 "Single Chip for Digital Cameras" manufactured by LSI Logic of Milpitas, California.
  • Fig. 2 is a block diagram illustrating another (generic) digital camera image processing solution with separate hardware circuits for each of the several image processing functions. This fracturing of the digital imaging functions frustrates growth because products cannot readily exchange image data and product manufacturers enjoy only limited economies of scale.
  • the Fisher article discloses a scan line array processor (SLAP) architecture (although not for use in a digital camera) that consists of a linear array of processors (processing elements, or "PE's") controlled in a single-instruction, multiple-data (SIMD) fashion by a broadcast instruction.
  • SLAP scan line array processor
  • PE's processing elements
  • SIMD single-instruction, multiple data
  • Fig. 3 is a block diagram summarizing the SLAP topology. Scan line data is shifted serially into the stages of a pixel data shift register 302 (one stage for each pixel in the scan line), then transferred in parallel into PE's PE1 through PEP, which operate in parallel based on the broadcast instruction.
  • Fig. 4 is a block diagram illustrating the allocation of image data to PE's in a SLAP
  • Figs. 5A through 5E illustrate a three-stage pipeline operation in a SLAP.
  • the present invention is a digital camera apparatus.
  • the apparatus includes a sensor that generates image data, and the apparatus further includes a parallel processor to process the sensed image data.
  • Programmed parallel computing circuitry accomplishes compute-intensive image processing functions on the generated image data.
  • the present invention maximizes the ratio of performance to hardware cost in the digital imaging apparatus, while it also enables a great degree of functional flexibility and product diversity both within and across digital imaging product categories.
  • the programmed parallel computing structures are instruction- cached SIMD computers.
  • a parallel computer for processing image data processes the image data with fewer processing elements than pixels in a scan line of the image.
  • Figure 1 is a block diagram illustrating a conventional single-chip processor for use in a digital camera.
  • Figure 2 is a block diagram illustrating functions of a conventional digital image processing apparatus for a digital camera.
  • Figure 3 is a block diagram illustrating the topology of a conventional Scan Line Array Processor suitable for a wide variety of image computations.
  • Figure 4 is a diagram illustrating the allocation of image data to PE's in a conventional Scan Line Array Processor such as the Figure 3 SLAP.
  • Figures 5A through 5E illustrate the conventional operation of a three-stage image data pipeline of a Scan Line Array Processor.
  • Figure 6 is a functional block diagram of a digital imaging apparatus, in accordance with an embodiment of the invention, that incorporates a programmed parallel computer as applied to digital cameras.
  • Figure 7 is a functional block diagram of a single digital imaging chip with a single instruction-cached SIMD PE module having fewer PE's than scan line pixels.
  • Figure 8 is a functional block diagram of a single digital imaging chip with multiple instruction-cached SIMD PE modules, at least some of the modules having fewer PE's than the number of scan line pixels allocated to that module.
  • Figure 9 is a functional block diagram of a digital imaging system incorporating multiple instances of the instruction-cached SIMD chip, as would for example be used for a high-end digital video camera.
  • Figure 10 depicts one embodiment of an enhancement of the Scan Line Array Processor image data pipeline that accommodates fewer PE's than there are pixels per scan line.
  • Figure 11 depicts a second embodiment of an enhancement of the Scan Line Array Processor image data pipeline that accommodates fewer PE's than there are pixels per scan line.
  • image processing functions of a digital camera are performed by a programmed parallel computer.
  • digital imaging functions are scalable data-parallel. This property of imaging functions makes them amenable to efficient realization on programmed parallel computers incorporating dozens or as many as thousands of processing elements (PEs).
  • PEs processing elements
  • One measure of efficiency is the speedup over a single processor exhibited by a parallel computer.
  • the maximum efficiency of an N-PE parallel computer is N.
  • An embodiment of a digital camera employing a programmed parallel computer for image processing functions is shown in Fig. 6.
  • a sensor 604 such as a charge coupled device (CCD) which generates a plurality of analog signals corresponding to the image.
  • the analog signals are passed through A/D converter circuitry 606 to generate a digitized version of the image.
  • the digital image (pixel) data from the A/D converter circuitry 606 is provided to the input data port 612 of a parallel computer 608 via a multiplexer 610.
  • the digitized pixel data is operated upon by the parallel computer 608, which then provides the processed data to an image data output port 614 of the parallel computer 608.
  • the pixel data may be operated upon by the parallel computer 608 to control the digital camera itself such as to control image acquisition.
  • the output data port 614 of the parallel computer 608 and the input data port 612 of the parallel computer 608 are connected to a bus 616.
  • Various other circuits are also provided off the bus 616, including a microprocessor 618 (with associated ROM 620 and RAM 621), general purpose I/O circuitry 622 to external devices, serial I/O circuitry 624 to a personal computer serial port, an electrical interface 630 to a liquid crystal display 632, and an NTSC/PAL video digital-to-analog converter interface 634 to a television.
  • the bus 616 is also connected to a control/status port of the parallel computer 608 and to an electrical interface 636 for controlling the image sensor 604.
  • the parallel computer 608 also includes a memory controller 638 to interface to a DRAM 640 (or other RAM) and an inter-PE ("PE" is processing element) communication interface 642, to interface to a multi-chip inter-PE communication network.
  • the parallel computer 608 is comprised of a number of processing elements (PE's) connected by an inter-PE communication network.
  • PE's processing elements
  • the specific topology of the inter-PE communication network within the parallel computer 608, although usually considered an important characteristic of parallel computer architecture, is not central to the present aspect of the invention.
  • a topology that appears appropriate for the parallel computer 608 in the digital imaging application is a linear array, such as is illustrated by the Scan Line Array Processor (SLAP).
  • SLAP Scan Line Array Processor
  • the parallel computer 608 may implement a variety of image analysis, manipulation, and enhancement functions.
  • the set of functions provided, the size of the image, and the rate of function application are the principal discriminators among digital imaging products. Because the parallel computer 608 is programmed (e.g., as opposed to being a hardwired ASIC), not only are the image processing functions of the digital camera executed efficiently, but also development and upgrading of digital camera features is much simplified.
  • Imaging tasks which may be provided include compensation for image sensor characteristics (including resolution, aspect ratio, pixel shape, and others), compensation for image display characteristics (including resolution, aspect ratio, pixel shape, and others), color correction and color space conversion, improvement of picture quality, generation of enhanced viewfinder displays, compression and decompression for storage and/or communication, encryption and decryption for image communication, and others.
  • SLAP provides an image data shifter, one stage of which is provided per pixel along the horizontal image dimension.
  • the SLAP concept is well matched to the serial scan output characteristics of image sensors.
  • the SLAP concept creates an inexpensive three-stage image data pipeline wherein one scan line of output pixels is being shifted out, while a second scan line's output values are being calculated, while a third scan line of image sensor data is being shifted into the parallel computer for processing.
  • Pixel trimming requires known calibration values for individual pixels. Pixel trimming transforms sensed pixel values to compensate for imperfect response characteristics of individual elements in the sensor array. Calibration information is obtained by measuring the response to a known image (for example, such as could be provided on the inside of a lens cap). Pixel trimming for each pixel in the image is a function only of the sensed pixel value and calibration values for the corresponding image sensor element. b) Gamma correction
  • the response characteristic for an image sensor over the dynamic range differs from that of the human eye.
  • Gamma correction transforms measured pixel values non- linearly to maximize the subjective significance of the least-significant bit of the pixel value.
  • Gamma correction for each pixel in the image is a function only of the sensed pixel value and the desired shape of the response curve; the target response curve is common among all pixels and does not change from image to image.
  • Color space conversion: Image sensors typically present integer intensity values in each of the primary colors (RGB). Viewed from a linear algebra perspective, the "basis vectors" R, G, and B are not orthogonal. This observation means that changing the R-value of a pixel also changes the G and B values.
  • YCbCr space, where Y represents the pure luminance (brightness) of the pixel and Cb and Cr represent the pixel's position in the two-dimensional color plane.
  • Y, Cb, and Cr are orthogonal basis vectors.
  • the transformation from an RGB image to a YCbCr image requires multiplying a 3x1 vector by a 3x3 transformation matrix at each pixel in the image.
  • Color space conversion for each pixel is a function only of the sensed pixel value and the values in the transformation matrix; the transformation matrix values are fixed and common among all pixels.
  • Image Optimization (Scene Analysis and Manipulation): adjustments to the sensed image to improve the quality of output images. a) Over-Sampling (Digital Zoom)
  • the digital image stabilization for each pixel is a function only of the motion vector and the set of values of the limited-extent neighborhood of pixels centered at that pixel in a previous frame.
  • Advanced Functions, for example Blink Elimination and Framing: functions that allow an image to be captured at an ideal moment and in the ideal way are performed in the imaging device prior to the activation of the electronic shutter that commits the image to memory.
  • Such functions analyze various properties of a particular scene to determine how to capture the image. While the definition of such functions themselves is beyond the scope of this invention, it would appear to be the case that such functions require performing compute-intensive, scalable data-parallel computations at high rates.
  • JPEG has several modes of operation, some "loss-less" that preserve all of the original image sensor data and some "lossy" that remove some information, so that the restored compressed image is different from the original image.
  • the psycho-visual principle underlying the lossy modes of JPEG is that the human eye is less sensitive to high spatial frequency components of an image. In other words, when presented with a somewhat blotchy image, the human eye will emphasize edge information.
  • the lossy modes of the JPEG standard work by analyzing the spatial frequency spectrum of an image, then selectively removing resolution from the higher-frequency components, thereby allowing a more compact representation of the image.
  • This first step maps the line-by-line (raster) scan output from the image sensor to the 8x8-block representation suitable for JPEG operations.
  • raster-block conversion is achieved by accessing stored pixel values appropriately.
  • On-line raster-block conversion requires buffering 16 scan lines of pixel values; 8x8 blocks begin being made available on the converter's output only after 8 scan lines of raster data have been received. Such buffering is easily accomplished with a SLAP-style linear array computer.
  • DCT Block Discrete Cosine Transform
  • the DCT applied to 8x8 pixel blocks is akin to a signal processing function and is among the most calculation-intensive functions.
  • the DCT transforms the spatial representation of color values into a frequency representation.
  • the frequency representation is key to applying the psycho-physical principle of JPEG compression that the resolution of high-frequency information is not as important to the human eye as low- frequency information.
  • the 8x8 DCT is given by the following equation:
  • the block DCT is a separable transform. This means that an 8x8 DCT consists of eight 1-D DCTs on the columns and another eight 1-D DCTs on the rows. It is estimated that an eight-element 1-D DCT requires about 20 multiply/add steps. Therefore the number of multiply/add steps in an 8x8 DCT is given in the following equation:
  • the quantization step is where the JPEG algorithm judiciously removes information from the compressed image in an innocuous manner. Quantization is applied to each coefficient of each 8x8 pixel block given a set of quantization parameters Q(u,v) which is common for all blocks in the image.
  • the quantization algorithm is given as follows:
  • the DC (zero frequency) parameters F(0,0) of the DCT are coded differentially across the image. This step requires communication among immediately neighboring pixel blocks.
  • e) Entropy Coding: The quantized DCT coefficients are compactly represented, for example, by applying a Huffman code. Entropy coding has two steps, a first intra-block step wherein coefficients are assigned symbols, and an intra- and inter-block step wherein the symbols are translated into variable-length sequences of bits. The first step requires no communication among pixel blocks, whereas the second step requires communication among immediately neighboring pixel blocks.
  • MPEG is a compression standard commonly applied to video images.
  • the core of MPEG is identical to the JPEG algorithm, relying on quantization of frequency-domain information achieved via the DCT to remove information from a compressed image in an inconspicuous manner.
  • MPEG defines the following additional function that exploits the observation that a sequence of video images of a single scene share a great deal of common information.
  • the goal of motion estimation is, for a given block of pixels in a given video frame, to determine where that block of pixels "came from" in preceding frames and where it "goes to" in succeeding frames.
  • Blocks of pixels associated with a patch of a visible object in an image appear to move when either the object moves without occlusion or the camera moves.
  • Motion estimation generally operates on 64x64 macroblocks, attempting to calculate the minimum difference between a given macroblock in the current frame and neighboring macroblocks in neighboring frames.
  • Motion estimation requires only local communication among pixels, and its effectiveness is proportional to the processing power brought to bear. Therefore, MPEG compression appears to be one of those applications that is arbitrarily computationally demanding, in that no amount of processing power (affordable by a consumer in the next 20 years) would saturate the requirement.
  • Display Interface 630: Some conventional digital cameras include circuits for managing an LCD panel such as LCD panel 632. a) Under-Sampling
  • Anti-aliasing algorithms are applied to achieve "best looking" LCD panel images.
  • the foregoing discussion establishes the premise that many (if not all) of the digital imaging algorithms useful in conventional still and moving digital cameras are scalable and data-parallel. This observation is the basis for the assertion that the broad diversity of digital imaging products implement scalable data-parallel algorithms. Such algorithms are amenable to parallel implementation.
  • this invention enables the following valuable functional capabilities:
  • the parallel computer 608 is implemented as a SIMD computer with instruction cache, as described in U.S. Patent No. 5,511,212 ("the '212 patent").
  • the '212 patent is incorporated herein by reference in its entirety.
  • the '212 patent discloses one way to implement a SIMD computer to maximize the ratio of performance (as measured in aggregate pixel operations per second) to hardware cost (as measured in chip area).
  • a compact digital imaging product incorporates a microcontroller 618 (see Fig. 6) to regulate the various system functions.
  • the microcontroller 618 (sometimes called “microprocessor” or “embedded microprocessor”) serves as a system controller for the instruction-cached SIMD computer.
  • the microcontroller bus 616 serves as both the global instruction broadcast network and the response network.
  • one local controller 705 is provided per PE module, wherein each PE module incorporates a plurality of PEs.
  • the number of PE modules in the system depends upon such parameters as the total number of PEs required, the logical complexity of the PE, and the size of the isochronous region defined by the VLSI implementation technique used to realize the digital camera processing apparatus.
  • Figure 7 illustrates a single-module instruction-cached SIMD computer, while Figure 8 illustrates a multiple-module instruction-cached SIMD computer. (Where elements of the Figure 7 computer are duplicated in Figure 8, the multiple elements are designated with an appended "a" and "b".)
  • This apparatus may be implemented either in a single chip or in multiple chips.
  • the single chip would be appropriate for the still camera or the low-end video camera, whereas multiple instances of a single chip might be used in very high-performance cameras.
  • the apparatus is suitable for integration with low-cost CMOS image sensors
  • the pixel data shifter 702 is augmented with external interfaces (e.g., local external memory interface 704) to facilitate the creation of systems comprising multiple instances of this chip.
  • external interfaces e.g., local external memory interface 704
  • Each PE is specialized for image computations.
  • One appropriate PE would have a 16-bit ALU along with a 128-word register file and the context management and communication interface circuits needed for SIMD operation.
  • the embedded microprocessor 618 serves as the system controller for the instruction-cached SIMD computer.
  • the instruction-cached SIMD computer shown here assumes the linear array inter-PE communication topology depicted in Fig. 3, although the linear array topology is not a critical choice.
  • the pixel data shifter 702 has one stage per pixel of a scan line
  • the "scan line array processor" portion of Figs. 7 and 8, show in greater detail in Fig. 9, has less than one PE per pixel of a scan line.
  • each PE processes more than one pixel —a pixel "swath"-- of a scan line.
  • the pixel data shifter 902 is divided into the swaths
  • a parameter L (the number of pixels per scan line per PE) is configurable, so as to allow the width of the pixel swath assigned to each PE to be programmed according to the application.
  • each PE would be allocated an image swath that is 8 pixel blocks (64 pixels) wide.
  • 128KB of per-PE on-chip DRAM is needed to be able to store a megapixel image in this example.
  • Another embodiment, in which fewer PE's are provided to process a scan line than there are pixels, is illustrated in Fig. 11.
  • the pixel data shifter of this embodiment has a stage corresponding to each PE (as opposed to corresponding to each pixel as in the Fig. 7 and Fig. 8 embodiments). Each stage can hold one pixel. In most cases, the scan line width exceeds the number of PE's for that scan line, so each PE processes multiple pixels by either processing each received pixel before receiving another pixel, or by storing the pixels locally (i.e., locally accessible to the PE) until the required number of pixels arrive.
  • the pixel data shifter 1102 has an input scan line ordering buffer (SLOB) 1103 prepended to it, and an output SLOB 1104 appended to it.
  • SLOB scan line ordering buffer
  • Each SLOB 1103, 1104 has enough memory to hold at least two scan lines worth of pixels.
  • after the first scan line is saved in the memory of the input SLOB 1103, the input SLOB 1103 reorders it.
  • the second scan line is saved in the memory of the input SLOB 1103.
  • the pixels of the scan line are saved in consecutive memory locations and the memory is read "out of order" so that all neighboring pixels are provided to the same PE.
  • PE0 gets pixels numbered 0-3, PE1 gets pixels numbered 4-7, PE2 gets pixels numbered 8-11, and PE3 gets pixels numbered 12-15.
  • the input SLOB 1103 reorders the pixels so that the pixel data shifter 1102 carries the pixels in the order 0, 4, 8, 12; 1, 5, 9, 13; 2, 6, 10, 14; 3, 7, 11, 15, as this is the order in which the pixels are to be provided to the PE's. It should be noted that the "stride" is consistent for each PE: four. After the pixel data is processed by the PE's, it is reordered by the output SLOB 1104 at the output of the pixel data shifter 1102.
  • the "extra" pixels can be distributed to one or more PE's.
  • the N extra pixels are distributed one each to the first N PE's. For example, if four PE's are to process an eighteen-pixel scan line, PE0 gets pixels numbered 0-4, PE1 gets pixels numbered 5-9, PE2 gets pixels numbered 10-13, and PE3 gets pixels numbered 14-17. But the pixel data shifter 1102 carries the pixels in the order 0, 5, 10, 14; 1, 6, 11, 15; 2, 7, 12, 16; 3, 8, 13, 17; 4, 9, as this is the order in which the pixels are to be provided to the PE's. In this case, the stride is not consistent for each PE, as the stride is sometimes four and sometimes five. The last two pixels shifted in are received by two PE's while the other PE's receive nothing.
  • One objective of the present invention is to maximize the computing resource brought to bear on the computation. Doing so ordinarily requires placing as many PEs as may fit within the available chip area. As chip sizes increase and circuit geometries decrease, the diameter of an isochronous region becomes significantly smaller than the linear dimension of the chip. Therefore, given the history of VLSI scaling trends, it is expected that a digital imaging chip containing an instruction-cached SIMD computer would require multiple PE modules and therefore multiple instances of the local controller circuit. A difference from the single-controller chip needed to make the multi-controller chip work is the inclusion of a response arbitrator.
  • the response arbitrator connects the plurality of instruction-cached SIMD local controllers with the control/status port connecting to the microprocessor bus, allowing the embedded microprocessor to detect some/none conditions among the PE's.
  • An apparatus containing a plurality of these instruction-cached SIMD based digital imaging chips to create a system suitable for high-end video camera applications is depicted in Fig. 10. Note that the image data shift register is chained through the set of chips, and that the functions from the single-chip solution are assigned to individual chips in the set.
  • the inter-PE communication network topology is a parameter, although a preferred embodiment would be to extend the linear array topology used within the preferred embodiment of the single chip.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)
  • Television Signal Processing For Recording (AREA)
  • Color Television Image Signal Generators (AREA)
  • Advance Control (AREA)
  • Multi Processors (AREA)

Abstract

A digital camera apparatus includes a sensor that generates image data. The apparatus further includes a parallel processor to process the sensed image data. Programmed parallel computing circuitry accomplishes compute-intensive image processing functions on the generated image data. Using semiconductor-efficient programmed parallel computing structures, the ratio of performance to hardware cost is maximized in the digital imaging apparatus, while enabling a great degree of functional flexibility and product diversity both within and across digital imaging product categories. In particular embodiments, the programmed parallel computing structures are instruction-cached SIMD computers.

Description

DIGITAL CAMERA USING PROGRAMMED
PARALLEL COMPUTER FOR IMAGE PROCESSING FUNCTIONS AND CONTROL
Related Applications This application claims priority to U.S. provisional patent application number
60/112,410 filed December 15, 1998 by Todd E. Rockoff and entitled "I-Cached SIMD Based Digital Imaging Apparatus".
Technical Field
The present invention relates to digital cameras and, in particular, to digital cameras that employ programmed parallel computers to accomplish image manipulation functions.
Background
Digital cameras promise significant advantages for amateurs and professionals alike with respect to functionality, reliability, convenience, and cost. For example, whereas exposed film typically must be chemically developed before the images can be viewed (a time-consuming and expensive process), a digital camera image may be viewed directly through an LCD on the camera, viewed on a computer, printed using a color printer, or shared via the Internet. Despite such advantages, digital cameras remain largely a novelty today rather than a mainstream consumer product, accounting for only a small percentage of all camera sales. The slow adoption of digital camera technology is attributable to the relatively high cost, and relatively low image quality, that characterize today's digital cameras.
Digital processing of image sequences (video) is even more compute-intensive than the processing of still images. At 30 frames per second, processing of real-time digital video is many tens of times more computationally demanding than processing of still images. Although digital video today is used for teleconferencing and for consumer products including DVD and camcorders, the relatively low image quality and high costs associated with digital video today mean that its mainstream adoption is further in the future than even that of still digital photography. Similarly, for digital imaging applications in general, conventional solutions are ad hoc, usually consisting of expensive application-specific hardware circuits that perform specific compute-intensive image functions. For example, Fig. 1 is a block diagram of the DCAM-101 "Single Chip for Digital Cameras" manufactured by LSI Logic of Milpitas, California. As can be seen from Fig. 1, the DCAM-101 employs separate hardware circuits for gamma correction, color space conversion, and JPEG encoding and decoding. Fig. 2 is a block diagram illustrating another (generic) digital camera image processing solution with separate hardware circuits for each of the several image processing functions. This fracturing of the digital imaging functions frustrates growth because products cannot readily exchange image data and product manufacturers enjoy only limited economies of scale.
Furthermore, even if a digital camera image processing function were programmed (as opposed to being specific hardware circuits), and even if the digital image processing function were programmed for a parallel computer, conventional programmed image processors use one processing element per pixel in a scan line of an image. See, for example, Allan L. Fisher, Peter T. Highnam, and Todd E. Rockoff, "A Four-Processor Building Block for SIMD Processor Arrays", IEEE Journal of Solid State Circuits, vol. 25, No. 2, April, 1990, pp. 369-375. The Fisher article discloses a scan line array processor (SLAP) architecture (although not for use in a digital camera) that consists of a linear array of processors (processing elements, or "PE's") controlled in a single-instruction, multiple-data (SIMD) fashion by a broadcast instruction. Fig. 3 is a block diagram summarizing the SLAP topology. Scan line data is shifted serially into the stages of a pixel data shift register 302 (one stage for each pixel in the scan line), then transferred in parallel into PE's PE1 through PEP, which operate in parallel based on the broadcast instruction.
Fig. 4 is a block diagram illustrating the allocation of image data to PE's in a SLAP, and Figs. 5A through 5E illustrate a three-stage pipeline operation in a SLAP. With the conventional SLAP topology, unfortunately, as the resolution of the image being processed increases, so does the number of required processors, and a large number of processors is not well suited to a portable device such as a digital camera.
Summary
The present invention is a digital camera apparatus. The apparatus includes a sensor that generates image data, and the apparatus further includes a parallel processor to process the sensed image data. Programmed parallel computing circuitry accomplishes compute-intensive image processing functions on the generated image data. Using semiconductor-efficient programmed parallel computing structures, the present invention maximizes the ratio of performance to hardware cost in the digital imaging apparatus, while it also enables a great degree of functional flexibility and product diversity both within and across digital imaging product categories. In particular embodiments, the programmed parallel computing structures are instruction-cached SIMD computers.
In accordance with another aspect of the invention, a parallel computer for processing image data processes the image data with fewer processing elements than pixels in a scan line of the image.
Brief Description of the Figures
Figure 1 is a block diagram illustrating a conventional single-chip processor for use in a digital camera.
Figure 2 is a block diagram illustrating functions of a conventional digital image processing apparatus for a digital camera.
Figure 3 is a block diagram illustrating the topology of a conventional Scan Line Array Processor, suitable for a wide variety of image computations.
Figure 4 is a diagram illustrating the allocation of image data to PE's in a conventional Scan Line Array Processor such as the Figure 3 SLAP.
Figures 5A through 5E illustrate the conventional operation of a three-stage image data pipeline of a Scan Line Array Processor.
Figure 6 is a functional block diagram of a digital imaging apparatus, in accordance with an embodiment of the invention, that incorporates a programmed parallel computer as applied to digital cameras.
Figure 7 is a functional block diagram of a single digital imaging chip with a single instruction-cached SIMD PE module having fewer PE's than scan line pixels.
Figure 8 is a functional block diagram of a single digital imaging chip with multiple instruction-cached SIMD PE modules, at least some of the modules having fewer PE's than the number of scan line pixels allocated to that module.
Figure 9 is a functional block diagram of a digital imaging system incorporating multiple instances of the instruction-cached SIMD chip, as would for example be used for a high-end digital video camera.
Figure 10 depicts one embodiment of an enhancement of the Scan Line Array Processor image data pipeline that accommodates fewer PE's than there are pixels per scan line.
Figure 11 depicts a second embodiment of an enhancement of the Scan Line Array Processor image data pipeline that accommodates fewer PE's than there are pixels per scan line.
Detailed Description
In accordance with a broad aspect of the invention, image processing functions of a digital camera are performed by a programmed parallel computer. An important observation underlying this aspect of the invention, one that has heretofore not been fully exploited, is that digital imaging functions are scalable data-parallel. This property of imaging functions makes them amenable to efficient realization on programmed parallel computers incorporating dozens or as many as thousands of processing elements (PEs). One measure of efficiency is the speedup over a single processor exhibited by a parallel computer. The maximum efficiency of an N-PE parallel computer is N.
An embodiment of a digital camera employing a programmed parallel computer for image processing functions is shown in Fig. 6. Referring to Fig. 6, an image is focused through a lens 602 onto a sensor 604 such as a charge coupled device (CCD) which generates a plurality of analog signals corresponding to the image. The analog signals are passed through A/D converter circuitry 606 to generate a digitized version of the image. The digital image (pixel) data from the A/D converter circuitry 606 is provided to the input data port 612 of a parallel computer 608 via a multiplexer 610. The digitized pixel data is operated upon by the parallel computer 608, which then provides the processed data to an image data output port 614 of the parallel computer 608. Furthermore, the pixel data may be operated upon by the parallel computer 608 to control the digital camera itself, such as to control image acquisition.
The output data port 614 of the parallel computer 608 and the input data port 612 of the parallel computer 608 (via the multiplexer 610) are connected to a bus 616. Various other circuits are also provided off the bus 616, including a microprocessor 618 (with associated ROM 620 and RAM 621), general purpose I/O circuitry 622 to external devices, serial I/O circuitry 624 to a personal computer serial port, an electrical interface 630 to a liquid crystal display 632, and an NTSC/PAL video digital-to-analog converter interface 634 to a television. The bus 616 is also connected to a control/status port of the parallel computer 608 and to an electrical interface 636 for controlling the image sensor 604.
Finally, the parallel computer 608 also includes a memory controller 638 to interface to a DRAM 640 (or other RAM) and an inter-PE ("PE" is processing element) communication interface 642, to interface to a multi-chip inter-PE communication network. In general, the parallel computer 608 is comprised of a number of processing elements (PE's) connected by an inter-PE communication network. The specific topology of the inter-PE communication network within the parallel computer 608, although usually considered an important characteristic of parallel computer architecture, is not central to the present aspect of the invention. However, a topology that appears appropriate for the parallel computer 608 in the digital imaging application is a linear array, such as is illustrated by the Scan Line Array Processor (SLAP). Again, for background on SLAP, the reader is referred to Allan L. Fisher, Peter T. Highnam, and Todd E. Rockoff, "A Four-Processor Building Block for SIMD
Processor Arrays", IEEE Journal of Solid State Circuits, vol. 25, No. 2, April, 1990, pp. 369-375.
The parallel computer 608 may implement a variety of image analysis, manipulation, and enhancement functions. The set of functions provided, the size of the image, and the rate of function application are the principal discriminators among digital imaging products. Because the parallel computer 608 is programmed (e.g., as opposed to being a hardwired ASIC), not only are the image processing functions of the digital camera executed efficiently, but also development and upgrading of digital camera features is much simplified. Imaging tasks which may be provided include compensation for image sensor characteristics (including resolution, aspect ratio, pixel shape, and others), compensation for image display characteristics (including resolution, aspect ratio, pixel shape, and others), color correction and color space conversion, improvement of picture quality, generation of enhanced viewfinder displays, compression and decompression for storage and/or communication, encryption and decryption for image communication, and others.
As is discussed in the Background, SLAP provides an image data shifter, one stage of which is provided per pixel along the horizontal image dimension. The SLAP concept is well matched to the serial scan output characteristics of image sensors. The SLAP concept creates an inexpensive three-stage image data pipeline wherein one scan line of output pixels is being shifted out, while a second scan line's output values are being calculated, while a third scan line of image sensor data is being shifted into the parallel computer for processing. Some functions that may be implemented in a digital imaging apparatus with parallel computer 608 are listed below. The example functions are listed in the order in which they operate on the data, from the output of the sensor 604 to the processed image data at the processor output 614. 1) Pixel Data Correction: functions applied to digital pixel data received from the image sensor 604. a) Pixel trimming
Pixel trimming requires known calibration values for individual pixels. Pixel trimming transforms sensed pixel values to compensate for imperfect response characteristics of individual elements in the sensor array. Calibration information is obtained by measuring the response to a known image (for example, such as could be provided on the inside of a lens cap). Pixel trimming for each pixel in the image is a function only of the sensed pixel value and calibration values for the corresponding image sensor element.
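As an illustration of how naturally pixel trimming maps onto per-pixel parallel hardware, the following sketch applies a per-element gain and dark offset to each sensed value independently. The two-parameter calibration model and all names are assumptions made for this example, not details taken from the patent.

```python
# Minimal sketch of per-pixel trimming (illustrative only; the gain/offset
# calibration model is assumed, not specified by the patent). Each output
# pixel depends only on the sensed value and that pixel's own calibration
# constants, so every pixel can be processed independently on a SIMD PE array.

def trim_pixels(raw, gains, offsets, max_value=255):
    """raw, gains, offsets are flat lists of equal length (one entry per pixel)."""
    trimmed = []
    for value, gain, offset in zip(raw, gains, offsets):
        corrected = (value - offset) * gain                       # compensate element response
        trimmed.append(min(max(round(corrected), 0), max_value))  # clamp to the pixel range
    return trimmed

# Calibration values such as could be measured from a known image (e.g., a lens-cap target).
raw     = [12, 130, 255, 64]
offsets = [2, 3, 1, 2]                 # per-element dark level
gains   = [1.00, 0.98, 1.05, 1.02]     # per-element responsivity correction
print(trim_pixels(raw, gains, offsets))
```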
b) Gamma correction
The response characteristic for an image sensor over the dynamic range differs from that of the human eye. Gamma correction transforms measured pixel values non-linearly to maximize the subjective significance of the least-significant bit of the pixel value. Gamma correction for each pixel in the image is a function only of the sensed pixel value and the desired shape of the response curve; the target response curve is common among all pixels and does not change from image to image. c) Color space conversion
Image sensors typically present integer intensity values in each of the primary colors (RGB). Viewed from a linear algebra perspective, the "basis vectors" R, G, and B are not orthogonal. This observation means that changing the R-value of a pixel also changes the G and B values. A more efficient representation commonly exploited for image processing is based on YCbCr space, where Y represents the pure luminance (brightness) of the pixel and Cb and Cr represent the pixel's position in the two-dimensional color plane. Y, Cb, and Cr are orthogonal basis vectors. The transformation from an RGB image to a YCbCr image requires multiplying a 3x1 vector by a 3x3 transformation matrix at each pixel in the image. Color space conversion for each pixel is a function only of the sensed pixel value and the values in the transformation matrix; the transformation matrix values are fixed and common among all pixels.
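The two transforms just described are each a fixed, purely per-pixel operation, which is what makes them data-parallel. The sketch below applies a gamma lookup table and then a 3x3 RGB-to-YCbCr matrix multiply to a single pixel; the gamma value of 2.2 and the BT.601 matrix coefficients are common choices assumed for illustration, since the patent does not give specific values.

```python
# Gamma lookup table shared by all pixels: the target response curve does not
# change from image to image, so the table is computed once.
GAMMA = 2.2  # assumed value for illustration
GAMMA_LUT = [round(255 * (v / 255) ** (1 / GAMMA)) for v in range(256)]

# Assumed BT.601-style RGB -> YCbCr matrix; the patent only requires that some
# fixed 3x3 matrix be applied at every pixel.
RGB_TO_YCBCR = [
    [ 0.299,     0.587,     0.114   ],   # Y  (luminance)
    [-0.168736, -0.331264,  0.5     ],   # Cb (blue-difference chroma)
    [ 0.5,      -0.418688, -0.081312],   # Cr (red-difference chroma)
]

def convert_pixel(r, g, b):
    """Gamma-correct one RGB pixel, then convert it to (Y, Cb, Cr)."""
    rgb = (GAMMA_LUT[r], GAMMA_LUT[g], GAMMA_LUT[b])
    y  = sum(RGB_TO_YCBCR[0][i] * rgb[i] for i in range(3))
    cb = sum(RGB_TO_YCBCR[1][i] * rgb[i] for i in range(3)) + 128
    cr = sum(RGB_TO_YCBCR[2][i] * rgb[i] for i in range(3)) + 128
    return round(y), round(cb), round(cr)

print(convert_pixel(200, 120, 40))
```

The display-side conversion described later (converting YCbCr back to RGB for the LCD) is the same per-pixel operation using the inverse matrix.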
2) Image Optimization: Scene Analysis and Manipulation
Adjustments to the sensed image to improve the quality of output images. a) Over-Sampling (Digital Zoom)
When an image is desired at higher resolution than is available from the image sensor, it is possible to generate values for "pixels" lying between sensed pixels through a process of interpolation. Conventional digital cameras apply linear interpolation, whereby each pixel is replaced by a weighted average of its neighbors. An aspect of the present invention is to exploit the power of a parallel computer to apply higher-order interpolation algorithms (a sketch of the conventional linear case appears at the end of this subsection). b) Digital Image Stabilization
Digital image stabilization compensates for video camera motion when framing a static scene. The video camera motion creates offsets that can be compensated by shifting the pixels from frame to frame. The estimation of frame-to-frame motion is discussed below under the MPEG function. Given a motion vector, the digital image stabilization for each pixel is a function only of the motion vector and the set of values of the limited-extent neighborhood of pixels centered at that pixel in a previous frame. c) Advanced Functions, for example Blink Elimination and Framing
Functions that allow an image to be captured at an ideal moment and in the ideal way are performed in the imaging device prior to the activation of the electronic shutter that commits the image to memory. Such functions analyze various properties of a particular scene to determine how to capture the image. While the definition of such functions themselves is beyond the scope of this invention, it would appear to be the case that such functions require performing compute-intensive, scalable data-parallel computations at high rates.
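The sketch below fills in the over-sampling (digital zoom) case from subsection a) above using bilinear interpolation, one common form of the linear interpolation attributed to conventional cameras; a higher-order scheme would change only the weighting of neighboring pixels, not the per-pixel, neighborhood-limited structure. The names and zoom factor are illustrative.

```python
def bilinear_zoom(image, factor):
    """Upsample a 2-D list of pixel values by an integer 'factor'.
    Each output pixel is a weighted average of its four nearest source pixels."""
    src_h, src_w = len(image), len(image[0])
    dst_h, dst_w = src_h * factor, src_w * factor
    out = [[0] * dst_w for _ in range(dst_h)]
    for y in range(dst_h):
        for x in range(dst_w):
            fy, fx = y / factor, x / factor          # position in source coordinates
            y0, x0 = int(fy), int(fx)
            y1, x1 = min(y0 + 1, src_h - 1), min(x0 + 1, src_w - 1)
            wy, wx = fy - y0, fx - x0                # interpolation weights
            top    = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            out[y][x] = round(top * (1 - wy) + bottom * wy)
    return out

tiny = [[0, 100], [100, 200]]
for row in bilinear_zoom(tiny, 2):
    print(row)
```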
3) JPEG compression
JPEG is a widely accepted standard means of minimizing the size of the representation of an image, as measured in bits. JPEG has several modes of operation, some "loss-less" that preserve all of the original image sensor data and some "lossy" that remove some information, so that the restored compressed image is different from the original image. The psycho-visual principle underlying the lossy modes of JPEG is that the human eye is less sensitive to high spatial frequency components of an image. In other words, when presented with a somewhat blotchy image, the human eye will emphasize edge information. The lossy modes of the JPEG standard work by analyzing the spatial frequency spectrum of an image, then selectively removing resolution from the higher-frequency components, thereby allowing a more compact representation of the image. The majority of the computational work in JPEG compression is applied to 8x8 blocks of pixels, such that the intermediate results generated during JPEG compression that relate to a given pixel are determined as a function only of the 8x8 block of pixels within which that pixel resides. a) Raster-Block Conversion
This first step maps the line-by-line (raster) scan output from the image sensor to the 8x8-block representation suitable for JPEG operations.
Were the entire image stored in memory, raster-block conversion could be achieved simply by accessing stored pixel values appropriately. On-line raster-block conversion requires buffering 16 scan lines of pixel values; 8x8 blocks begin being made available on the converter's output only after 8 scan lines of raster data have been received. Such buffering is easily accomplished with a SLAP-style linear array computer.
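A minimal sketch of the raster-to-block step: scan lines arrive one at a time, and every time eight lines have been buffered a full row of 8x8 blocks can be emitted. The double buffering (16 lines) described above is collapsed to a single 8-line buffer here for brevity; names are illustrative.

```python
def raster_to_blocks(scan_lines, block=8):
    """Consume scan lines (lists of pixel values) and yield 8x8 blocks.
    A row of blocks becomes available each time 'block' lines have been buffered;
    in hardware a second 8-line buffer would fill while these blocks are consumed."""
    buffered = []
    for line in scan_lines:
        buffered.append(line)
        if len(buffered) == block:
            width = len(buffered[0])
            for x0 in range(0, width, block):
                yield [row[x0:x0 + block] for row in buffered]
            buffered = []

# Example: a 16-pixel-wide, 8-line strip produces two 8x8 blocks.
lines = [[y * 16 + x for x in range(16)] for y in range(8)]
print(sum(1 for _ in raster_to_blocks(lines)))   # -> 2
```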
b) Block Discrete Cosine Transform (DCT)
The DCT applied to 8x8 pixel blocks is akin to a signal processing function and is among the most calculation-intensive functions. The DCT transforms the spatial representation of color values into a frequency representation. The frequency representation is key to applying the psycho-physical principle of JPEG compression that the resolution of high-frequency information is not as important to the human eye as low-frequency information. The 8x8 DCT is given by the following equation:
F(u,v) = \frac{1}{4} C(u)\,C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)\, \cos\frac{(2x+1)u\pi}{16}\, \cos\frac{(2y+1)v\pi}{16}, \qquad C(w) = \begin{cases} 1/\sqrt{2} & w = 0 \\ 1 & \text{otherwise} \end{cases}

where f(x,y) is the pixel value at position (x,y) within the 8x8 block and F(u,v) is the transform coefficient at spatial frequency (u,v).
See Gregory K. Wallace, "The JPEG Still Picture Compression Standard," Communications of the ACM, vol. 34, no. 4, April 1991, pp. 30-44.
Similar to the 2-D FFT, the block DCT is a separable transform. This means that an 8x8 DCT consists of eight 1-D DCTs on the columns and another eight 1-D DCTs on the rows. It is estimated that an eight-element 1-D DCT requires about 20 multiply/add steps. Therefore the number of multiply/add steps in an 8x8 DCT is given in the following equation:
20\ \frac{\text{MACs}}{\text{1-D DCT}} \times 8\ \text{columns} \;+\; 20\ \frac{\text{MACs}}{\text{1-D DCT}} \times 8\ \text{rows} \;=\; 320\ \frac{\text{MACs}}{\text{2-D DCT}}
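The separability described above translates directly into code: apply an 8-point 1-D DCT to every row, then to every column of the result. The 1-D DCT below is the direct (brute-force) form rather than a fast factorization with roughly 20 multiply/add steps, so it illustrates the structure rather than the optimized operation count; it is an illustrative sketch, not the patent's implementation.

```python
import math

def dct_1d(samples):
    """Direct 8-point 1-D DCT (orthonormal scaling). A fast factorization would
    need only about 20 multiply/add steps instead of the direct form used here."""
    n = len(samples)
    out = []
    for u in range(n):
        c = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
        out.append(c * sum(s * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                           for x, s in enumerate(samples)))
    return out

def dct_8x8(block):
    """Separable 8x8 DCT: a 1-D DCT on every row, then on every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[y][x] for y in range(8)]) for x in range(8)]
    return [[cols[u][v] for u in range(8)] for v in range(8)]   # result[v][u]

flat = [[128] * 8 for _ in range(8)]      # a constant block ...
coeffs = dct_8x8(flat)
print(round(coeffs[0][0], 1))             # ... has all its energy in the DC term (~1024)
```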
c) Quantization
The quantization step is where the JPEG algorithm judiciously removes information from the compressed image in an innocuous manner. Quantization is applied to each coefficient of each 8x8 pixel block given a set of quantization parameters Q(u,v) which is common for all blocks in the image. The quantization algorithm is given as follows:
F_Q(u,v) = \operatorname{round}\!\left(\frac{F(u,v)}{Q(u,v)}\right)
This equation suggests that quantization requires one divide operation per pixel.
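A sketch of the quantization step exactly as the equation above describes it: one divide and one round per coefficient, using a table Q(u,v) shared by every block. The table used in the example is illustrative (coarser at higher frequencies), not one of the standard JPEG tables.

```python
def quantize(coeffs, qtable):
    """Quantize an 8x8 block of DCT coefficients: F_Q(u,v) = round(F(u,v) / Q(u,v))."""
    return [[round(coeffs[v][u] / qtable[v][u]) for u in range(8)] for v in range(8)]

def dequantize(quantized, qtable):
    """Decoder side: multiply back by the same table. The information lost to
    rounding is what makes the mode lossy."""
    return [[quantized[v][u] * qtable[v][u] for u in range(8)] for v in range(8)]

# Illustrative quantization table: larger steps (coarser) at higher spatial frequencies.
Q = [[16 + 4 * (u + v) for u in range(8)] for v in range(8)]
coeffs = [[1024 if (u, v) == (0, 0) else 30 - u - v for u in range(8)] for v in range(8)]
print(quantize(coeffs, Q)[0][:4])
```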
d) Differential Pulse-Code Modulation
The DC (zero frequency) parameters F(0,0) of the DCT are coded differentially across the image. This step requires communication among immediately neighboring pixel blocks. e) Entropy Coding
The quantized DCT coefficients are compactly represented, for example, by applying a Huffman code. Entropy coding has two steps: a first intra-block step wherein coefficients are assigned symbols, and an intra- and inter-block step wherein the symbols are translated into variable-length sequences of bits. The first step requires no communication among pixel blocks, whereas the second step requires communication among immediately neighboring pixel blocks.
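Both the differential coding of the DC terms and the first (symbol-assignment) stage of entropy coding are short, mostly local computations. The sketch below is simplified and illustrative: the size-category function follows the general JPEG idea of coding how many bits a value needs, but the actual Huffman tables and bit packing are omitted.

```python
def dpcm_dc(dc_values):
    """Code each block's DC coefficient as the difference from the previous block's
    DC value; only neighbor-to-neighbor communication is required."""
    diffs, previous = [], 0
    for dc in dc_values:
        diffs.append(dc - previous)
        previous = dc
    return diffs

def size_category(value):
    """Simplified symbol assignment: the number of bits needed to represent |value|
    (the 'size' part of a JPEG-style symbol; Huffman coding of the symbol is omitted)."""
    return 0 if value == 0 else abs(value).bit_length()

dc_per_block = [64, 66, 63, 63]          # DC terms of consecutive 8x8 blocks
diffs = dpcm_dc(dc_per_block)            # -> [64, 2, -3, 0]
print(diffs, [size_category(d) for d in diffs])
```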
4) MPEG compression
MPEG is a compression standard commonly applied to video images. The core of MPEG is identical to the JPEG algorithm, relying on quantization of frequency-domain information achieved via the DCT to remove information from a compressed image in an inconspicuous manner. MPEG defines the following additional function that exploits the observation that a sequence of video images of a single scene share a great deal of common information. a) Motion Estimation
The goal of motion estimation is, for a given block of pixels in a given video frame, to determine where that block of pixels "came from" in preceding frames and where it "goes to" in succeeding frames. Blocks of pixels associated with a patch of a visible object in an image appear to move when either the object moves without occlusion or the camera moves.
Motion estimation generally operates on 64x64 macroblocks, attempting to calculate the minimum difference between a given macroblock in the current frame and neighboring macroblocks in neighboring frames. The extent of the search, both spatially (within a single neighboring frame) and temporally (the number of frames searched), is limited by the available processing power.
Motion estimation requires only local communication among pixels, and its effectiveness is proportional to the processing power brought to bear. Therefore, MPEG compression appears to be one of those applications that is arbitrarily computationally demanding, in that no amount of processing power (affordable by a consumer in the next 20 years) would saturate the requirement.
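A minimal block-matching sketch of the search just described: for one block of the current frame, every candidate displacement within a small window of the previous frame is scored by the sum of absolute differences (SAD), and the cheapest displacement wins. The 4x4 block and the +/-2-pixel window are deliberately tiny illustrative values; a real MPEG encoder searches much larger blocks and windows, which is exactly why the function is so compute-hungry.

```python
def sad(current, previous, y0, x0, y1, x1, size):
    """Sum of absolute differences between a size x size block at (y0, x0) in the
    current frame and one at (y1, x1) in the previous frame."""
    return sum(abs(current[y0 + dy][x0 + dx] - previous[y1 + dy][x1 + dx])
               for dy in range(size) for dx in range(size))

def estimate_motion(current, previous, y0, x0, size=4, search=2):
    """Return the (dy, dx) displacement into 'previous' that best matches the block
    at (y0, x0) in 'current', searching +/- 'search' pixels in each direction."""
    h, w = len(previous), len(previous[0])
    best = (0, 0, float("inf"))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y1, x1 = y0 + dy, x0 + dx
            if 0 <= y1 <= h - size and 0 <= x1 <= w - size:
                cost = sad(current, previous, y0, x0, y1, x1, size)
                if cost < best[2]:
                    best = (dy, dx, cost)
    return best[:2]

# Synthetic test: the previous frame is the current frame shifted right by one pixel,
# so the block at (2, 2) is found one pixel to the right in the previous frame.
current  = [[(3 * y + 5 * x) % 17 for x in range(8)] for y in range(8)]
previous = [[current[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]
print(estimate_motion(current, previous, 2, 2))   # -> (0, 1)
```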
5) Display Interface 630
Some conventional digital cameras include circuits for managing an LCD panel such as LCD panel 632. a) Under-Sampling
Often the pixel resolution of an affordable LCD is lower than the image resolution. Conventional cameras under-sample the image by ignoring extra pixels in the output of the sensor 604 or perhaps performing a simple average. Under-sampling is a local algorithm that requires communication only among neighboring pixels. b) Color Space Conversion LCD panels do not take as input the YCbCr values that are convenient for image manipulation algorithms. Therefore, inverse transforms are performed to convert the pixel values back to RGB representation for display. This conversion requires multiplying a 3x1 vector by a 3x3 matrix at each pixel. c) Often, LCD displays are replete with aliasing ("jagged line") effects.
Anti-aliasing algorithms are applied to achieve "best looking" LCD panel images. The foregoing discussion establishes the premise that many (if not all) of the digital imaging algorithms useful in conventional still and moving digital cameras are scalable and data-parallel. This observation is the basis for the assertion that the broad diversity of digital imaging products implement scalable data-parallel algorithms. Such algorithms are amenable to parallel implementation.
Replacing a collection of fixed-function circuits with a programmed parallel computer improves the functional flexibility of the camera and minimizes the time required to develop additional functions.
In addition to providing a possibly cost-effective alternative to fixed-function circuits, this invention enables the following valuable functional capabilities:
1) Automatic sensor calibration for optimal image quality, by accommodating less-than-perfect image sensors, which are less expensive to manufacture than perfect image sensors, with no loss of image quality. Thus, the camera maker can reduce costs by incorporating relatively low-cost image sensors.
2) The ability to accommodate a wide range of sensor sizes and image formats. 3) The ability to allow the camera user to trade off compression ratio versus image resolution over a wide range of values.
4) The ability to apply interpolation algorithms from the realm of computer-graphics to achieve a relatively high quality digital zoom (versus the relatively low quality nearest-neighbor linear interpolation used in conventional digital cameras).
5) The ability to cost-effectively perform the functions for still cameras and video cameras. In the still camera application, the additional processing power that would otherwise be allocated to processing large volumes of data in the video camera is used instead to optimize image quality. One example is continually executing a background process of compression, decompression, and error calculation to empirically determine optimum quantization tables for the current shooting context.
6) To achieve best-looking LCD panel images.
7) To provide a general means to import any digital image file or stream formats and compression standards, such as would be required for a universal display device.
8) To provide a general means to export any digital image file or stream formats and compression standards, such as would be required for a universal image capture device. 9) To provide a general means for a product manufacturer to quickly accommodate rapidly evolving standards in a given digital imaging product category.
10) To provide a general means for a product manufacturer to add or subtract product functions in software, thus allowing a complete line of related digital imaging products to leverage a single digital imaging chip, for increased economy of scale in the manufacture of the product line.
In accordance with another aspect of the invention, embodiments of which are shown in Figs. 7 and 8, the parallel computer 608 is implemented as a SIMD computer with instruction cache, as described in U.S. Patent No. 5,511,212 ("the '212 patent"). The '212 patent is incorporated herein by reference in its entirety. The '212 patent discloses one way to implement a SIMD computer to maximize the ratio of performance (as measured in aggregate pixel operations per second) to hardware cost (as measured in chip area).
In general, a compact digital imaging product incorporates a microcontroller 618 (see Fig. 6) to regulate the various system functions. In accordance with this present aspect, the microcontroller 618 (sometimes called "microprocessor" or "embedded microprocessor") serves as a system controller for the instruction-cached SIMD computer. The microcontroller bus 616 serves as both the global instruction broadcast network and the response network. As disclosed in the '212 patent and as shown in Figs. 7 and 8, one local controller 705 is provided per PE module, wherein each PE module incorporates a plurality of PEs. The number of PE modules in the system depends upon such parameters as the total number of PEs required, the logical complexity of the PE, and the size of the isochronous region defined by the VLSI implementation technique used to realize the digital camera processing apparatus. Figure 7 illustrates a single-module instruction-cached SIMD computer, while
Figure 8 illustrates a multiple-module instruction-cached SIMD computer. (Where elements of the Figure 7 computer are duplicated in Figure 8, the multiple elements are designated with an appended "a" and "b".)
The following valuable capabilities are enabled: 1) This apparatus may be implemented either in a single chip or in multiple chips. The single chip would be appropriate for the still camera or the low-end video camera, whereas multiple instances of a single chip might be used in very high-performance cameras.
2) The apparatus maximizes the ratio of performance to hardware cost for any VLSI implementation technique used to realize the digital camera processing apparatus.
3) The apparatus is suitable for integration with low-cost CMOS image sensors.
Instruction-cached SIMD computers are well described in the '212 patent and, therefore, most portions of Figures 7 and 8 are not described in detail. The pixel data shifter 702 is augmented with external interfaces (e.g., local external memory interface 704) to facilitate the creation of systems comprising multiple instances of this chip. Each PE is specialized for image computations. One appropriate PE would have a 16-bit ALU along with a 128-word register file and the context management and communication interface circuits needed for SIMD operation.
The characteristic of most compute-intensive image functions (including all of those listed in the background section above) is that they entail generating an output image wherein each pixel is determined as a function of its spatial neighbors. Such a function is described as a (usually fairly brief) sequence of instructions applied at each pixel. In this case, the instruction stream broadcast to the array of PEs via the local instruction broadcast network 706 would exhibit significant repetition, because the common sequence of instructions is repeated at each pixel. Use of the SIMD instruction cache 708 is very effective in such circumstances. A linear array topology for inter-PE communication is well matched to the image data stream from a sensor device made with a serial output. However, the linear array topology is not a necessary choice; if the sensor were integrated with the processing apparatus in a single chip, as will be allowed by advanced semiconductor manufacturing technology, then the wider interface possible within the chip might favor a two-dimensional PE inter-communication network topology.
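The repetition noted above is easy to see in a toy software model of broadcast execution: the controller issues one short per-pixel instruction sequence over and over, and every PE applies it to its own swath of local data. Everything below (instruction names, register layout, swath widths) is invented for illustration and does not reflect the design of the '212 patent.

```python
class PE:
    """Toy processing element: a single accumulator plus a local pixel swath."""
    def __init__(self, swath):
        self.acc = 0
        self.swath = list(swath)

    def execute(self, op, *args):
        if op == "load":                  # load pixel args[0] of the swath
            self.acc = self.swath[args[0]]
        elif op == "addi":                # add an immediate broadcast with the instruction
            self.acc += args[0]
        elif op == "store":               # write the accumulator back to pixel args[0]
            self.swath[args[0]] = self.acc

def broadcast(pes, program):
    """SIMD control: each instruction is broadcast once and executed by all PEs in lockstep."""
    for instruction in program:
        for pe in pes:
            pe.execute(*instruction)

scan_line, swath_width = list(range(16)), 4
pes = [PE(scan_line[i:i + swath_width]) for i in range(0, len(scan_line), swath_width)]

# A brief per-pixel sequence (add a constant offset), repeated for every pixel of the
# swath -- this repetition is what an instruction cache close to the PEs exploits.
program = [ins for i in range(swath_width)
           for ins in (("load", i), ("addi", 10), ("store", i))]
broadcast(pes, program)
print([value for pe in pes for value in pe.swath])   # every pixel offset by 10
```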
Referring still to Figs. 7 and 8 (but also with reference to Fig. 6), the embedded microprocessor 618 serves as the system controller for the instruction-cached SIMD computer. The instruction-cached SIMD computer shown here assumes the linear array inter-PE communication topology depicted in Fig. 3, although the linear array topology is not a critical choice. In accordance with an embodiment of the invention, while the pixel data shifter 702 has one stage per pixel of a scan line, the "scan line array processor" portion of Figs. 7 and 8, shown in greater detail in Fig. 9, has less than one PE per pixel of a scan line. To put it another way, each PE processes more than one pixel (a pixel "swath") of a scan line. Using Fig. 9 as an example, the pixel data shifter 902 is divided into swaths
(904a through 904c) corresponding to the PE's (PE 1, PE 2, and PE P), and the pixel data of each swath are transferred into a corresponding swath buffer (906a through 906c, respectively). Then, each PE (PE 1 through PE P) operates on the pixels of the corresponding swath. In accordance with some embodiments, a parameter L (the number of pixels per scan line per PE) is configurable, so as to allow the width of the pixel swath assigned to each PE to be programmed according to the application. To understand the allocation of pixel data to PE's in the Fig. 7 and Fig. 8 embodiments, assume for example that there are 1024 pixels per sensor scan line and 16 PEs in the instruction-cached SIMD computer; in this case, each PE would be allocated an image swath that is eight 8-pixel blocks (64 pixels) wide. At 2 bytes per pixel, 128 KB of per-PE on-chip DRAM is needed to store a megapixel image in this example. Storing 16 such image frames on the chip, as might be needed for single-chip MPEG encoding, would require a total of 32 MB (256 Mb) of on-chip RAM. Another embodiment, in which fewer PE's are provided to process a scan line than there are pixels, is illustrated in Fig. 11. The pixel data shifter 1102 of the Fig. 11 embodiment has a stage corresponding to each PE (as opposed to a stage corresponding to each pixel, as in the Fig. 7 and Fig. 8 embodiments). Each stage can hold one pixel. In most cases, the scan line width exceeds the number of PE's for that scan line, so each PE processes multiple pixels, either by processing each received pixel before receiving another pixel or by storing the pixels locally (i.e., locally accessible to the PE) until the required number of pixels have arrived.
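The short Python sketch below reproduces the arithmetic of the 1024-pixel-per-line, 16-PE example given above for the Fig. 7 and Fig. 8 embodiments; the 1024-line frame height is an assumption implied by the phrase "megapixel image".

```python
# Reproduces the swath and memory arithmetic from the example above.
# The 1024-line frame height is an assumption implied by "megapixel image".

pixels_per_line = 1024
num_pes = 16
bytes_per_pixel = 2
lines_per_frame = 1024            # assumed: 1024 x 1024 is roughly one megapixel
frames_stored = 16                # e.g. for single-chip MPEG encoding

swath_width = pixels_per_line // num_pes          # 64 pixels per PE per line
per_pe_bytes = swath_width * lines_per_frame * bytes_per_pixel
total_bytes = per_pe_bytes * num_pes * frames_stored

print(swath_width)                 # 64
print(per_pe_bytes // 1024)        # 128 KB of per-PE DRAM for one frame
print(total_bytes // (1024 ** 2))  # 32 MB (256 Mb) on-chip for 16 frames
```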
Referring still to Fig. 11, the pixel data shifter 1102 has an input scan line ordering buffer (SLOB) 1103 prepended to it and an output SLOB 1104 appended to it. Each SLOB 1103, 1104 has enough memory to hold at least two scan lines' worth of pixels. After the first scan line is saved in the memory of the input SLOB 1103, the input SLOB 1103 reorders it. While that reordering takes place, the second scan line is saved in the memory of the input SLOB 1103. In one embodiment, the pixels of the scan line are saved in consecutive memory locations and the memory is read "out of order" so that all neighboring pixels are provided to the same PE.
For example, in one embodiment, if four PE's are to process a sixteen-pixel scan line, PE0 gets pixels numbered 0-3, PE1 gets pixels numbered 4-7, PE2 gets pixels numbered 8-11, and PE3 gets pixels numbered 12-15. The input SLOB 1103 reorders the pixels so that the pixel data shifter 1102 carries them in the order 0, 4, 8, 12; 1, 5, 9, 13; 2, 6, 10, 14; 3, 7, 11, 15, as this is the order in which the pixels are to be provided to the PE's. It should be noted that the "stride" is a consistent four for each PE. After the pixel data is processed by the PE's, it is reordered by the output SLOB 1104 at the output of the pixel data shifter 1102. Reordering is more complicated if the number of PE's does not divide evenly into the number of pixels per scan line. In this case, the "extra" pixels can be distributed to one or more PE's. In one embodiment, if there are N extra pixels, the N extra pixels are distributed one each to the first N PE's. For example, if four PE's are to process an eighteen-pixel scan line, PE0 gets pixels numbered 0-4, PE1 gets pixels numbered 5-9, PE2 gets pixels numbered 10-13, and PE3 gets pixels numbered 14-17. The pixel data shifter 1102 then carries the pixels in the order 0, 5, 10, 14; 1, 6, 11, 15; 2, 7, 12, 16; 3, 8, 13, 17; 4, 9, as this is the order in which the pixels are to be provided to the PE's. In this case, the stride is not consistent for each PE, as the stride is sometimes four and sometimes five. The last two pixels shifted in are received by two of the PE's while the other PE's receive nothing.
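The sketch below computes the reordered shift sequence for both examples above (sixteen and eighteen pixels across four PE's); it is a minimal model of one possible input SLOB policy, not a definitive implementation of the circuit.

```python
# A minimal sketch of the input SLOB reordering described above: pixels are
# written in scan-line order and read back "out of order" so that each PE
# receives its contiguous block of neighboring pixels.

def slob_order(num_pixels, num_pes):
    base, extra = divmod(num_pixels, num_pes)
    # Contiguous block boundaries; the first `extra` PEs get one extra pixel.
    sizes = [base + 1 if pe < extra else base for pe in range(num_pes)]
    starts, s = [], 0
    for size in sizes:
        starts.append(s)
        s += size
    # Interleave: round-robin one pixel per PE until every block is exhausted.
    order = []
    for i in range(max(sizes)):
        for pe in range(num_pes):
            if i < sizes[pe]:
                order.append(starts[pe] + i)
    return order

print(slob_order(16, 4))  # [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
print(slob_order(18, 4))  # [0, 5, 10, 14, 1, 6, 11, 15, 2, 7, 12, 16, 3, 8, 13, 17, 4, 9]
```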
One objective of the present invention is to maximize the computing resource brought to bear on the computation. Doing so ordinarily requires placing as many PEs as may fit within the available chip area. As chip sizes increase and circuit geometries decrease, the diameter of an isochronous region becomes significantly smaller than the linear dimension of the chip. Therefore, given the history of VLSI scaling trends, it is anticipated that a digital imaging chip containing an instruction-cached SIMD computer would require multiple PE modules and, therefore, multiple instances of the local controller circuit. One addition needed to make the multi-controller chip work, relative to the single-controller chip, is a response arbitrator. The response arbitrator connects the plurality of instruction-cached SIMD local controllers with the control/status port connecting to the microprocessor bus, allowing the embedded microprocessor to detect some/none conditions among the PE's. An apparatus containing a plurality of these instruction-cached SIMD-based digital imaging chips, creating a system suitable for high-end video camera applications, is depicted in Fig. 10. Note that the image data shift register is chained through the set of chips, and that the functions from the single-chip solution are assigned to various individual chips in the set. The inter-PE communication network topology is a parameter, although a preferred embodiment would extend the linear array topology used within the preferred embodiment of the single chip.
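As a hedged illustration, the fragment below shows one way a response arbitrator might reduce per-module responses into a single some/none status visible to the embedded microprocessor; the interface and data types are assumptions, since the text does not specify them.

```python
# One plausible reading of the response arbitrator: each local controller
# reports whether "some" of its PEs asserted a condition, and the arbitrator
# reduces these reports into a single some/none status presented to the
# embedded microprocessor over the control/status port. Interface is hypothetical.

def arbitrate(module_responses):
    """module_responses: list of bools, one per PE module (True = some PE responded)."""
    some = any(module_responses)
    return {"some": some, "none": not some}

# Example: three PE modules, only the second has a responding PE.
print(arbitrate([False, True, False]))   # {'some': True, 'none': False}
print(arbitrate([False, False, False]))  # {'some': False, 'none': True}
```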

Claims

WHAT IS CLAIMED IS:
1. A digital camera apparatus, comprising: a sensor to generate image data corresponding to an image; a processor with which to process the image data, the processor including a plurality of processor elements (PE's) connected by an inter-PE network, the PE's configured to operate in parallel to process the image; and a memory to hold the processed image.
2. The digital camera apparatus of claim 1, wherein: the processor is for processing an N-pixel scan line of image pixel data, the processor includes M PE's, wherein M < N; the processor includes a pixel data buffer via which the N pixels are provided to the PE's; the M PE's operate at least partly in parallel to process the N pixels of the scan line from the pixel data buffer, and at least some of the M PE's operate on more than one pixel of the scan line; whereby fewer than N processing elements are required to process the N-pixel scan line.
3. The digital camera apparatus of claim 2, wherein the processor further comprises: a local controller that outputs decoded instructions; and a local instruction broadcast network via which the decoded instructions are broadcast to the PE's for execution by the PE's operating in parallel on the pixels.
4. The digital camera apparatus of claim 3, and wherein: each PE of the processor includes an instruction cache, and the instruction caches are coupled to the local instruction broadcast network to receive the decoded instructions.
5. The digital camera apparatus of claim 3, wherein the pixel data buffer includes N stages, and at least some particular ones of the PE's are coupled to more than one of the N stages so as to be configured to receive more than one pixel of the scan line, such that the particular ones of the PE's operate on more than one pixel of the scan line.
6. The digital camera apparatus of claim 2, wherein the processor further comprises: a local buffer associated with at least some of the PE's and coupled to the pixel data buffer, the local buffer for a particular PE for temporarily holding the more than one pixel of the scan line operated on by that PE.
7. The digital camera apparatus of claim 2, wherein the pixel data buffer is a pixel data shifter that includes M stages; the processor further includes a scan line ordering circuit prepended to the pixel data shifter that reorders the input pixel data such that the input pixel data is presented to the pixel data shifter in an order other than the order in which the pixels are situated in the scan line; and each of the M stages of the pixel data shifter is configured to provide a pixel to one of the PE's as the reordered pixel data is shifted through the pixel data shifter.
8. The digital camera apparatus of claim 7, wherein: the scan line ordering circuit is a first scan line ordering circuit, and the processor further comprises a second scan line ordering circuit appended to the pixel data shifter that reorders the processed pixel data.
9. The digital camera apparatus of claim 8, wherein: the second scan line ordering circuit reorders the processed pixel data to correspond to the original order of the input pixel data in the scan line.
10. A digital image processor to process an N-pixel scan line of image pixel data, the processor comprising: a plurality (M) of processing elements (PE's), wherein M < N; a pixel data buffer via which the pixels are provided to the PE's, wherein the M PE's operate at least partly in parallel to process the pixels of the scan line from the pixel data buffer, and at least some of the PE's operate on more than one pixel of the scan line; whereby fewer than N processing elements are required to process the
N-pixel scan line.
11. The digital image processor of claim 10, and further comprising: a local controller that outputs decoded instructions; and a local instruction broadcast network via which the decoded instructions are broadcast to the PE's for execution by the PE's operating in parallel on the pixels.
12. The digital image processor of claim 11, wherein each PE includes an instruction cache, and the instruction caches are coupled to the local instruction broadcast network to receive the decoded instructions.
13. The digital image processor of claim 10, wherein the pixel data buffer includes N stages, and at least some particular ones of the PE's are coupled to more than one of the N stages so as to be configured to receive more than one pixel of the scan line, such that the particular ones of the PE's operate on more than one pixel of the scan line.
14. The digital image processor of claim 13, and further comprising: a local buffer associated with at least some of the PE's and coupled to the pixel data buffer, the local buffer for a particular PE for temporarily holding the more than one pixel of the scan line operated on by that PE.
15. The digital image processor of claim 10, wherein the pixel data buffer is a pixel data shifter that includes M stages; the digital image processor further includes a scan line ordering circuit prepended to the pixel data shifter that reorders the input pixel data such that the input pixel data is presented to the pixel data shifter in an order other than the order in which the pixels are situated in the scan line; and each of the M stages of the pixel data shifter is configured to provide a pixel to one of the PE's as the reordered pixel data is shifted through the pixel data shifter.
16. The digital image processor of claim 15, wherein: the scan line ordering circuit is a first scan line ordering circuit, and the digital image processor further comprises a second scan line ordering circuit appended to the pixel data shifter that reorders the processed pixel data.
17. The digital image processor of claim 16, wherein: the second scan line ordering circuit reorders the processed pixel data to correspond to the original order of the input pixel data in the scan line.
PCT/US1999/029718 1998-12-15 1999-12-15 Digital camera using programmed parallel computer for image processing functions and control WO2000036562A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU23622/00A AU2362200A (en) 1998-12-15 1999-12-15 Digital camera using programmed parallel computer for image processing functions and control
JP2000588733A JP2002532810A (en) 1998-12-15 1999-12-15 Programmable parallel computer for image processing functions and control
EP99967322A EP1141891A1 (en) 1998-12-15 1999-12-15 Digital camera using programmed parallel computer for image processing functions and control

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11241098P 1998-12-15 1998-12-15
US60/112,410 1998-12-15

Publications (2)

Publication Number Publication Date
WO2000036562A1 (en) 2000-06-22
WO2000036562A9 (en) 2000-12-07

Family

ID=22343754

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/029718 WO2000036562A1 (en) 1998-12-15 1999-12-15 Digital camera using programmed parallel computer for image processing functions and control

Country Status (6)

Country Link
EP (1) EP1141891A1 (en)
JP (1) JP2002532810A (en)
CN (1) CN1338090A (en)
AU (1) AU2362200A (en)
TW (1) TW429331B (en)
WO (1) WO2000036562A1 (en)

Family Cites Families (1)

US 5,511,212 A (Rockoff, Todd E.), priority date 1993-06-10, published 1996-04-23: Multi-clock SIMD computer and instruction-cache-enhancement thereof (cited by examiner)

Also Published As

Publication number Publication date
CN1338090A (en) 2002-02-27
EP1141891A1 (en) 2001-10-10
TW429331B (en) 2001-04-11
JP2002532810A (en) 2002-10-02
WO2000036562A1 (en) 2000-06-22
AU2362200A (en) 2000-07-03
