CN112019803A - Image signal processor, image processing apparatus, and neural network image processing system

Info

Publication number
CN112019803A
Authority
CN
China
Prior art keywords
unit
data
network
cnn
signal processor
Prior art date
Legal status
Pending
Application number
CN202010463164.3A
Other languages
Chinese (zh)
Inventor
朱红雷
曾万苇
商权
Current Assignee
Jinpupil Semiconductor Technology Shanghai Co ltd
Original Assignee
Jinpupil Semiconductor Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Jinpupil Semiconductor Technology Shanghai Co ltd
Priority to CN202010463164.3A
Publication of CN112019803A
Status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/18 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N 7/181 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a plurality of remote sources
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60R VEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R 11/00 Arrangements for holding or mounting articles, not otherwise provided for
    • B60R 11/04 Mounting of cameras operative during drive; Arrangement of controls thereof relative to the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules

Abstract

The invention provides an image signal processor, an image processing apparatus, and a neural network image processing system that can be used for vehicle-mounted ADAS and video surveillance. It provides a method for real-time processing and target recognition of streaming video, realized by combining a convolutional neural network (CNN) with a chain of image pre- and post-processing algorithms and techniques, so that recognition of dangerous targets for automated assisted driving is completed, security-video storage overhead is reduced, and the capability for big-data analysis of video is improved.

Description

Image signal processor, image processing apparatus, and neural network image processing system
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular, to an image signal processor, an image processing apparatus, and a neural network image processing system.
Background
In real life, image processing technology is widely applied in fields such as mobile-phone photography, robotics, computer vision, security monitoring, automated assisted driving, and the Internet of Things. With the development of artificial intelligence (AI), neural network technology, as a core technology, has been widely researched, and the combination of neural networks with image processing has become one of the hottest technical directions in recent years. Even so, its practical deployment remains far from ideal.
For example, in automated assisted driving of automobiles, processes such as dangerous-target recognition and big-data storage of security video still suffer from high cost and heavy analysis and computation load, and these problems remain to be solved.
Disclosure of Invention
The invention provides an image signal processor, an image processing apparatus, and a neural network image processing system that improve data analysis capability and reduce the analysis and computation load.
According to a first aspect of the present invention, there is provided an image signal processor comprising:
an execution functional unit, a data shuffling crossbar network, an internal storage unit, and a register file. A video input unit or an external data access unit acquires data and delivers it to the data shuffling crossbar network, which dispatches or distributes the data to some of the functional units or internal storage units; data processed by the functional units is delivered back to the crossbar network for the next dispatch or distribution. Several or all of the functional units cooperate with several or all of the storage units to process and access data simultaneously, so that pipelined operation on a real-time video stream can be completed. The register file is used to cache temporary data or firmware parameters used by the functional units. The data shuffling crossbar network is a full crossbar interconnect network.
Optionally, in the image signal processor, the execution functional units include basic operation units obtained by extracting the characteristics of the CNN algorithm, namely CNN functional units. After the convolution filter completes the calculation of a new neuron in one or more clock beats, an activation correction operation is performed:

ReLU_out = max(0, P)

where ReLU_out is the activation output and P is the input neuron. Four types of basic operation units are obtained: a convolution activation unit, a pooling unit, a fully connected unit, and an inter-layer feature map mapping unit.
Optionally, in the image signal processor, the convolution activation unit includes a first operation unit and a first storage unit. The first operation unit includes multiple groups of M × M first filters and correction operators; the first storage unit includes the same number of groups of first memories, each group containing M line buffers corresponding to the M vertical taps of the first filter, while the M horizontal taps are obtained through register pipelining, so that the multiply-add pipeline of the M × M convolution is completed in one or more clock beats. The control logic of each group of first memories reads M rows of data simultaneously into the convolution filter while completing the inter-line replacement storage operation. Each line buffer is implemented with a dual-port or single-port SRAM, and several clock delay beats are inserted between reading and writing to avoid read/write conflicts on the same storage address.
The pooling unit includes a second operation unit and a second storage unit. The second operation unit includes multiple groups of second filters corresponding one-to-one with the filtering and activation operations in the convolution activation unit; the second storage unit includes the same number of groups of second memories, each group containing M line buffers. Row 1 of the feature map output by the convolution activation unit is written into line buffer 1, row 2 into line buffer 2, and so on, with row S written into line buffer S; writing then becomes cyclic, i.e., row S+1 of the feature map is written into line buffer 1, ..., and row S+S into line buffer S. After each group of S rows is written, the control logic reads the S rows simultaneously and performs vertical stride-S downsampling, while horizontal downsampling is realized by register pipeline shifting.
Optionally, in the image signal processor, the data shuffling crossbar network is such that each operation unit can access any one of the storage units and, conversely, each storage unit can respond to any one of the operation units.
Optionally, the image signal processor can perform CNN algorithm feature extraction, including:
image information input;
feature map extraction over a multi-layer network, where combinations of the feature maps of one layer are mapped to the next layer to generate new feature maps; and
full connection and classification: the fully connected layer fully connects the neurons generated by the last layer of the multi-layer network, flattens the multi-dimensional feature map array into one row, and obtains the classification probabilities through matrix transformation and

Softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

where exp(x) is the exponential function.
The basic constituent units of the CNN forward-propagation network obtained by extracting the characteristics of the CNN algorithm include at least one of convolution, activation, pooling, flattening, classification, and feature map mapping.
Optionally, in the image signal processor, the basic operation units are integrated onto the data shuffling crossbar network. According to the different CNN firmware algorithms, either the image scaled down inside the processor is loaded to the CNN functional units, or the previous layer's network feature maps are loaded from the external main memory into the operation units and/or storage units. A combination of CNN functional units implements one layer of CNN network operation, and the operation result of each layer is stored to, or read back from, the external main memory to complete the iterative operation of a multi-layer CNN network.
Optionally, the image signal processor further includes an external data access unit, a bus interface, a video input unit, and a video output unit.
According to a second aspect of the present invention, there is provided an image processing apparatus comprising:
an algorithm module, an upper computer, an image signal processor, and a storage module. The image signal processor receives raw sensor data and sends video data to the algorithm module for CNN training; the upper computer sets the image processing and CNN parameters according to the training results, and the parameters are stored by the firmware or program storage module; the image signal processor invokes the stored parameters and continues to send video data to the algorithm module until the optimal image processing and CNN parameters are obtained. The image signal processor is the image signal processor described above.
Optionally, for the image processing apparatus, the algorithm module includes a CNN training module integrated in the upper computer, or the CNN training module includes an FPGA.
According to a third aspect of the present invention, there is provided an image processing system comprising a camera assembly and a host, the camera assembly comprising an image signal processor according to the first aspect of the present invention or comprising an image processing apparatus according to the second aspect of the present invention.
The image signal processor is easy to implement as a chip, and the training, learning, and back-propagation network of the convolutional neural network (CNN) is not realized by the image signal processor chip but is run cooperatively by software or an FPGA (field-programmable gate array), which can greatly reduce the analysis and computation load on the chip.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of an onboard ADAS image processing system in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a security monitoring image processing system in an embodiment of the invention;
FIG. 3 is a diagram of an image processing apparatus according to an embodiment of the present invention;
FIG. 4 is a flow chart of a CNN operation method in an embodiment of the present invention;
FIG. 5 is a first block diagram illustrating an exemplary architecture of an image signal processor according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of cross data access in one embodiment of the invention;
FIG. 7 is a second schematic diagram illustrating an architecture of an image signal processor according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a CNN forward direction identification process in an embodiment of the present invention;
FIG. 9 is a diagram illustrating the implementation of a CNN execution function unit according to an embodiment of the present invention;
fig. 10 is a schematic diagram of full connection mapping of CNN inter-layer feature maps in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Example 1
Embodiment 1 of the present invention provides an image processing system including a camera module and a host, where the camera module at least includes the image signal processor described below, or includes the image processing apparatus described below.
Wherein the image signal processor may include:
an execution functional unit, a data shuffling crossbar network, an internal storage unit, and a register file. A video input unit or an external data access unit acquires data and delivers it to the data shuffling crossbar network, which dispatches or distributes the data to some of the functional units or internal storage units; data processed by the functional units is delivered back to the crossbar network for the next dispatch or distribution. Several or all of the functional units cooperate with several or all of the storage units to process and access data simultaneously, so that pipelined operation on a real-time video stream can be completed. The register file is used to cache temporary data or firmware parameters used by the functional units. The data shuffling crossbar network is a full crossbar interconnect network.
Wherein the image processing apparatus may include:
an algorithm module, an upper computer, an image signal processor, and a storage module. The image signal processor receives raw sensor data and sends video data to the algorithm module for CNN training; the upper computer sets the image processing and CNN parameters according to the training results, and the parameters are stored by the firmware or program storage module; the image signal processor invokes the stored parameters and continues to send video data to the algorithm module until the optimal image processing and CNN parameters are obtained. The image signal processor is the image signal processor described above.
In particular, as shown in fig. 1, the application of the present invention in an automated assisted driving or autonomous vehicle (ADAS on board) is illustrated.
The cameras can be mounted at different angles. Video data and recognition/classification results pass through image processing stages such as data acquisition, restoration, enhancement, and recognition and are transmitted to the vehicle-mounted host, so that the master control end can store, display, and analyze the images and complete, prompt, or direct the corresponding actions. Each edge camera mainly consists of a lens, an image sensor, and the image signal processor; it transmits video to the host over a coaxial or differential video transmission line and transmits the recognition and classification results to the vehicle-mounted host over a transmission protocol such as the CAN bus.
Specifically, referring to fig. 2, the application of the present invention in video surveillance is illustrated. Cameras in multiple scenes provide video data to a back-end video storage device (DVR/NVR) and/or a cloud computing platform, and the monitoring information can be displayed on a monitor in real time. Since the position of a monitoring camera is fixed in most cases, the captured scene and angle are also fixed, producing a large number of redundant video frames. These redundant frames are mostly background pictures or incident-free scenes, yet they occupy most of the storage and computing resources. To save storage resources and improve data extraction and analysis efficiency, the present invention introduces a convolutional neural network (CNN) algorithm into the image processor chip of the edge camera, i.e., the image signal processor described above can be adopted. The CNN recognizes and classifies scenes of interest, such as video frames in which people, pets, vehicles, or moving objects enter the scene, and the processor transmits the video of those frames to the back-end equipment or computing platform, saving storage resources and improving the efficiency of data extraction, computation, and analysis.
Example 2
Embodiment 2 of the present invention provides an image processing apparatus. Implementation details of the present embodiment are specifically described below, and the following description is provided only for the sake of understanding and is not necessary for implementing the present embodiment.
An embodiment of the present invention provides an image processing apparatus, including:
an algorithm module, an upper computer, an image signal processor, and a storage module. The image signal processor receives raw sensor data and sends video data to the algorithm module for CNN training; the upper computer sets the image processing and CNN parameters according to the training results, and the parameters are stored by the firmware or program storage module; the image signal processor invokes the stored parameters and continues to send video data to the algorithm module until the optimal image processing and CNN parameters are obtained. The image signal processor is as follows:
an execution functional unit, a data shuffling crossbar network, an internal storage unit, and a register file. A video input unit or an external data access unit acquires data and delivers it to the data shuffling crossbar network, which dispatches or distributes the data to some of the functional units or internal storage units; data processed by the functional units is delivered back to the crossbar network for the next dispatch or distribution. Several or all of the functional units cooperate with several or all of the storage units to process and access data simultaneously, so that pipelined operation on a real-time video stream can be completed. The register file is used to cache temporary data or firmware parameters used by the functional units. The data shuffling crossbar network is a full crossbar interconnect network.
Example 3
Embodiment 3 of the present invention provides an image processing apparatus that may be further optimized on the basis of embodiment 2; descriptions of the same or similar parts are omitted. Implementation details of the present embodiment are specifically described below; the following description is provided only for ease of understanding and is not necessary for implementing the present embodiment.
The embodiment of the invention relates to a software and hardware cooperative neural network image processing device, wherein the system architecture is shown in fig. 3, and the CNN operation flow is shown in fig. 4.
The Image Signal Processor (ISP) runs firmware from the program storage space flash to start, receives original (raw) data sent by the sensor, restores the raw data into an image after processing, and sends out video data.
The video data is sent to the upper computer (Host), which realizes deep training and learning of the convolutional neural network by running the reverse training/learning part of the algorithm, consistent with the forward-propagating convolutional neural network in the ISP.
The CNN learning process is a continuously iterating closed negative-feedback loop: each time, the result error is back-propagated through the CNN network to correct the matrix coefficients of the initial convolution filters; the corrected filter coefficients enter the next round of training, and after repeated cycles the convolution filter matrix coefficients gradually approach the optimum.
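As a loose illustration of this closed-loop correction (a sketch, not the patent's firmware: the frame source, the single 3 × 3 filter, and the learning rate are arbitrary assumptions), the following NumPy fragment corrects a convolution filter frame by frame by back-propagating the output error:

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=(3, 3))   # stands in for the "ideal" filter the training approaches
filt = np.zeros((3, 3))            # initial convolution filter matrix coefficients
lr = 0.05

def conv_valid(img, k):
    """Plain valid convolution (correlation), matching the forward filtering."""
    m = k.shape[0]
    out = np.zeros((img.shape[0] - m + 1, img.shape[1] - m + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + m, x:x + m] * k)
    return out

for step in range(200):                                          # each pass models one training frame
    frame = rng.normal(size=(8, 8))
    err = conv_valid(frame, filt) - conv_valid(frame, target)    # forward result error
    grad = np.zeros_like(filt)
    for y in range(err.shape[0]):                                # back-propagate the error to the coefficients
        for x in range(err.shape[1]):
            grad += err[y, x] * frame[y:y + 3, x:x + 3]
    filt -= lr * grad / err.size                                 # corrected coefficients enter the next iteration

print("remaining coefficient error:", float(np.abs(filt - target).max()))
```

The printed residual shrinks toward zero, mirroring how the filter coefficients gradually approach the optimum over repeated cycles.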
The obtained optimal parameters are recorded and updated to the image signal processor (ISP) through Host configuration. The ISP can then work independently of the upper computer and other debugging equipment in a specific application scenario, sending out video data and recognition/classification results through processing stages such as acquisition, restoration, enhancement, and recognition, and sending a prompt signal to the inside or outside of the chip through the chip interrupt mechanism to trigger subsequent actions.
In the embodiment of the present invention, CNN training is completed by the GPU or CPU of the upper computer (Host); alternatively, CNN training can also be provided by an FPGA. The programmable logic in the FPGA only needs to implement the CNN forward and reverse networks, without implementing the other parts of the image processor, and the video images used for training are provided directly by the ISP chip. The FPGA uploads the training results to the upper computer through an interface such as a UART serial port or USB, and the Host downloads the parameters to the program storage space of the ISP through the compiling and burning environment.
Example 4
Embodiment 4 of the present invention provides an image signal processor. Implementation details of the present embodiment are specifically described below, and the following description is provided only for the sake of understanding and is not necessary for implementing the present embodiment. Referring to fig. 5, in this embodiment, the image signal processor includes:
an execution functional unit, a data shuffling crossbar network, an internal storage unit, and a register file. A video input unit or an external data access unit acquires data and delivers it to the data shuffling crossbar network, which dispatches or distributes the data to some of the functional units or internal storage units; data processed by the functional units is delivered back to the crossbar network for the next dispatch or distribution. Several or all of the functional units cooperate with several or all of the storage units to process and access data simultaneously, so that pipelined operation on a real-time video stream can be completed. The register file is used to cache temporary data or firmware parameters used by the functional units. The data shuffling crossbar network is a full crossbar interconnect network.
The image signal processor provided by the embodiment of the present invention can flexibly handle the uncertainty in the ordering of the image pre- and post-processing algorithm steps, and can also flexibly and conveniently adapt to the choice of CNN operation stage and image scaling stage, thereby meeting the neural network algorithm choices of different applications.
Example 5
Embodiment 5 of the present invention provides an image signal processor that may be further optimized on the basis of embodiment 4; descriptions of the same or similar parts are omitted. Implementation details of the present embodiment are specifically described below; the following description is provided only for ease of understanding and is not necessary for implementing the present embodiment. Fig. 5 to 7 can be referred to for the schematic diagrams of the present embodiment.
The image signal processor ISP comprises an execution function unit (Func), a data shuffle cross network (Crossbar), an internal storage unit (Mem), a register file (register bank), an external data access unit (Load/Store), a bus interface (BusIF), a main memory controller (DDRCtrl), a Video Input Unit (VIU), a Video Output Unit (VOU), a digital high-definition encoder (HEVC/H264/MJPEG), an analog high-definition encoder (AHD/TVI/CVI), a microcontroller (MCU/CPU), a plurality of peripheral interfaces and other components.
Specifically, the video input unit (VIU) or the external data access unit (Load/Store) acquires data and delivers it to the data shuffling crossbar network (Crossbar), which dispatches or distributes the data to some of the functional units (Func) or internal storage units (Mem). The processed data is delivered back to the crossbar network for the next dispatch or distribution, and several or all of the functional units cooperate with several or all of the internal storage units to process and access data simultaneously, completing pipelined (pipeline) operation on the real-time video stream. The register bank can be used to cache temporary data or firmware parameters of the functional units, such as the filter matrix coefficients of the convolutional neural network CNN.
The digital high-definition encoder, the analog high-definition encoder, and the data shuffling network (Crossbar) share the external data access unit (Load/Store). The Load/Store unit integrates a DMA function, so a large amount of image data can be transmitted and accessed in burst mode without involving the MCU or CPU.
The video output can be tailored and selected according to the application: for example, through an Ethernet port for the security monitoring field, through a coaxial interface for the vehicle-mounted ADAS field, or through BT1120/BT656 for video conferencing and similar fields.
The data shuffling crossbar network (Crossbar) provided by the embodiment of the present invention is used for data exchange between the ISP core functions and the storage. The network is a full crossbar network, as shown in fig. 6. The operation functional units FuncA, FuncB, and FuncC can each access any one of the memory units MemA, MemB, and MemC, and conversely the memory units MemA, MemB, and MemC can each respond to any one of the operation functional units FuncA, FuncB, and FuncC. In actual core operation, a partially connected network is often adopted, which the firmware can configure appropriately according to the flow and sequence of the image processing algorithm. A partially connected crossbar is only a subset of the fully connected shuffling network; either a fully connected or a partially connected crossbar can be chosen at implementation time and flexibly configured as needed.
The data shuffling crossbar network (Crossbar) of the image signal processor ISP provided by the embodiment of the present invention can very flexibly handle the uncertainty in the ordering of the image pre- and post-processing algorithm steps, and the functional units can be configured in a targeted manner.
For example, as shown in fig. 7, algorithm A performs spatial-domain denoising first and then image enhancement, while algorithm B performs image enhancement first and then spatial-domain denoising. In this case the data shuffling network can be configured as the algorithm A structure or the algorithm B structure according to the upper-level firmware algorithm.
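A minimal software sketch of this firmware-selectable ordering (the unit names and the routing-list form are assumptions for illustration, not the chip's register map) could look like this:

```python
# The crossbar is modeled as a routing list: firmware picks the order in which
# the same functional units are visited, without changing the units themselves.

def run_pipeline(frame, routing, units):
    data = frame
    for name in routing:
        data = units[name](data)      # each hop returns its result to the "crossbar"
    return data

units = {
    "spatial_denoise": lambda f: f,   # placeholder for the spatial-domain denoising unit
    "enhance": lambda f: f,           # placeholder for the image-enhancement unit
}

ALGORITHM_A = ["spatial_denoise", "enhance"]   # denoise first, then enhance
ALGORITHM_B = ["enhance", "spatial_denoise"]   # enhance first, then denoise

frame = [[0.0] * 4 for _ in range(4)]
out_a = run_pipeline(frame, ALGORITHM_A, units)
out_b = run_pipeline(frame, ALGORITHM_B, units)
```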
The convolutional neural network CNN, and especially the deep neural network DNN, often involves a large amount of data computation. This also makes the operation stage selectable, such as the RAW data stage, the RGB data stage, or the YUV data stage. The data shuffling crossbar network of the image signal processor ISP can very flexibly and conveniently adapt to the choice of CNN operation stage and the stage at which the image is scaled, so as to meet the neural network algorithm choices of different applications.
Example 6
Embodiment 6 of the present invention provides an image signal processor that may be further optimized on the basis of embodiment 4 or embodiment 5; descriptions of the same or similar parts are omitted. Implementation details of the present embodiment are specifically described below; the following description is provided only for ease of understanding and is not necessary for implementing the present embodiment. Fig. 5 to 10 can be referred to for the schematic diagrams of the present embodiment.
The image signal processor can perform CNN algorithm feature extraction, including:
image information input;
feature map extraction over a multi-layer network, where combinations of the feature maps of one layer are mapped to the next layer to generate new feature maps; and
full connection and classification: the fully connected layer fully connects the neurons generated by the last layer of the multi-layer network, flattens the multi-dimensional feature map array into one row, and obtains the classification probabilities through matrix transformation and

Softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

where exp(x) is the exponential function.
The basic constituent units of the CNN forward-propagation network obtained by extracting the characteristics of the CNN algorithm include at least one of convolution, activation, pooling, flattening, classification, and feature map mapping.
As shown in fig. 8, a simple two-layer CNN forward propagation network is taken as an example for illustration.
1) The input image resolution is 28x28. The luma component map is taken as an example; in general there are three channels, Y, U, V or R, G, B.
2) The first-layer network has 4 feature convolution kernels of 5x5. Each convolution kernel, also called a filter, is a 5x5 matrix with 25 coefficients, or 26 coefficients if the bias is included. After the first convolution, 4 frames of feature maps are generated, each of size 24 × 24 (24 = 28 - 5 + 1) neurons (the filter is applied only where it fully overlaps the image). Each frame of the feature map then undergoes activation correction, in which negative-valued neurons are replaced by 0, and enters the pooling layer.
3) The pooling layer performs 2x2 downsampling filtering: every 2x2 group of neurons is mapped to one new neuron, usually by 2x2 mean or maximum filtering. Pooling yields 4 frames of 12x12 feature maps, which enter the next layer of the network.
4) On entering the second-layer network, the feature maps of the previous layer must be mapped to new feature map combinations. The second-layer network has 8 feature convolution kernels, so combinations of the 4 frames of feature maps from the previous layer are mapped to the next layer to generate 8 new frames of feature maps, e.g., B1 = Σ(A1, A2, A3), B2 = Σ(A2, A3, A4), ..., B8 = Σ(A1, A2, A3, A4), where A1, A2, A3, A4 are the feature maps output by the previous layer and B1, B2, ..., B8 are the new feature maps generated by linear superposition of the combinations.
5) The second-layer network is again convolved, activated, and pooled, and the output feature maps enter the next layer. In this example there are two layers in total, so the next layer is the fully connected network.
6) The fully connected layer is responsible for fully connecting the neurons generated by the previous layer. The multi-dimensional feature map array is flattened into one row and converted into result categories through matrix multiplication to complete the final probability classification:

Softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

where exp(x) is the exponential function.
7) Through steps 1) to 6), the basic constituent units of the convolutional neural network forward-propagation network can be extracted: convolution, activation, pooling, flattening, classification, feature map mapping, and so on. A NumPy sketch of this forward pass, under assumed parameters, follows.
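The sketch below reproduces only the shapes of the walkthrough above; the filter values, the particular 4-to-8 map combinations, and the 10-class output are illustrative assumptions rather than the patent's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv2d(img, k, b=0.0):
    """Valid 2-D convolution: the filter is applied only where it fully overlaps the image."""
    m = k.shape[0]
    out = np.empty((img.shape[0] - m + 1, img.shape[1] - m + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + m, x:x + m] * k) + b
    return out

relu = lambda a: np.maximum(a, 0.0)           # activation correction, step 2)

def mean_pool(a, s=2):                        # 2x2 mean down-sampling, step 3)
    h, w = a.shape[0] // s, a.shape[1] // s
    return a[:h * s, :w * s].reshape(h, s, w, s).mean(axis=(1, 3))

image = rng.random((28, 28))                                   # 28x28 luma map, step 1)
k1 = rng.normal(size=(4, 5, 5))                                # 4 first-layer 5x5 kernels
maps1 = [mean_pool(relu(conv2d(image, k))) for k in k1]        # 4 maps of 12x12

combos = [(0, 1, 2), (1, 2, 3), (0, 2, 3), (0, 1, 3),          # assumed 4-to-8 combinations,
          (0, 1), (1, 2), (2, 3), (0, 1, 2, 3)]                # step 4): B_k = sum of a subset
k2 = rng.normal(size=(8, 5, 5))                                # 8 second-layer 5x5 kernels
maps2 = [mean_pool(relu(conv2d(sum(maps1[i] for i in c), k)))  # step 5): 8 maps of 4x4
         for c, k in zip(combos, k2)]

flat = np.concatenate([m.ravel() for m in maps2])              # step 6): flatten into one row
W = rng.normal(size=(10, flat.size))                           # fully connected to 10 classes
logits = W @ flat
prob = np.exp(logits - logits.max())
prob /= prob.sum()                                             # Softmax(x_i) = exp(x_i) / sum_j exp(x_j)
print("predicted class:", int(prob.argmax()), "probability:", float(prob.max()))
```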
In the present invention, the training, learning, and back-propagation network of the convolutional neural network CNN is not realized by the image signal processor chip but is run cooperatively by software or an FPGA.
The execution functional units comprise the basic operation units obtained by extracting the characteristics of the CNN algorithm, namely the CNN functional units. After the convolution filter completes the calculation of a new neuron in one or more clock beats, the activation correction operation is performed:

ReLU_out = max(0, P)

where ReLU_out is the activation output and P is the input neuron. Four types of basic operation units are obtained: a convolution activation unit, a pooling unit, a fully connected unit, and an inter-layer feature map mapping unit.
Referring to fig. 9, the convolution activation unit (CNN-ConvReLU) includes a first operation unit and a first storage unit. The first operation unit includes multiple groups of M × M first filters and correction operators; the first storage unit includes the same number of groups of first memories, each group containing M line buffers corresponding to the M vertical taps of the first filter, while the M horizontal taps are obtained through register pipelining, so that the multiply-add pipeline of the M × M convolution is completed in one or more clock beats. The control logic of each group of first memories reads M rows of data simultaneously into the convolution filter while completing the inter-line replacement storage operation. Each line buffer is implemented with a dual-port or single-port SRAM, and several clock delay beats are inserted between reading and writing to avoid read/write conflicts on the same storage address.
The numbers of filters and correction operators are fixed in the hardware implementation, i.e., the maximum number of features that the processor can extract per CNN layer is fixed. The algorithm of the convolution activation unit includes:
Convolution filtering process:

P(x, y) = Σ_i Σ_j I(x + i, y + j) · Conv_filter(i, j) + B, 0 ≤ i ≤ M - 1, 0 ≤ j ≤ M - 1

Activation correction process:

ConvReLU_out(x, y) = max(0, P(x, y))

where I represents each layer's input pixels or neurons, Conv_filter(i, j) represents the two-dimensional convolution filter matrix coefficients, B represents the bias, P(x, y) represents the neuron value obtained through one or more clock beats of multiply-add operations, ConvReLU_out(x, y) represents the convolution activation output, and (x, y) are the neuron coordinates of the resulting feature map.
The algorithm of the convolution activation unit further includes: the data in line buffer 1 is written point by point into line buffer 2, the data in line buffer 2 point by point into line buffer 3, and so on, with the data in line buffer M-1 written point by point into line buffer M, while the feature map row adjacent to the lower boundary of the convolution window is written into line buffer 1. This slides the convolution filter vertically over the feature map, while the neurons in the register file are pipelined every clock beat so that the convolution filter slides horizontally over the feature map.
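A behavioral software model of this sliding scheme (a sketch, not RTL; the 8 × 8 map, M = 3, and modeling the inter-line replacement as a buffer rotation are assumptions used only to check the data flow) is shown below; the streamed result is verified against direct convolution:

```python
import numpy as np

M = 3                                             # filter size (M x M)
rng = np.random.default_rng(2)
feature_map = rng.random((8, 8))
kernel = rng.normal(size=(M, M))
bias = 0.1

line_buffers = [feature_map[r].copy() for r in range(M)]   # pre-load the first M rows
out = np.zeros((8 - M + 1, 8 - M + 1))

for y in range(out.shape[0]):
    window = np.zeros((M, M))                     # register pipeline holding M horizontal taps
    for x in range(feature_map.shape[1]):         # one pixel per clock beat
        window = np.roll(window, -1, axis=1)      # shift the registers one tap to the left
        window[:, -1] = [line_buffers[r][x] for r in range(M)]   # read the M rows in parallel
        if x >= M - 1:                            # window full: one multiply-add result per beat
            out[y, x - M + 1] = np.sum(window * kernel) + bias
    if y + M < feature_map.shape[0]:              # inter-line replacement: rotate the buffers and
        line_buffers = line_buffers[1:] + [feature_map[y + M].copy()]   # load the row below the window

reference = np.array([[np.sum(feature_map[i:i + M, j:j + M] * kernel) + bias
                       for j in range(out.shape[1])] for i in range(out.shape[0])])
assert np.allclose(out, reference)                # the streamed result matches direct convolution
```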
The pooling unit includes a second operation unit and a second storage unit. The second operation unit includes multiple groups of second filters corresponding one-to-one with the filtering and activation operations in the convolution activation unit; the second storage unit includes the same number of groups of second memories, each group containing M line buffers. Row 1 of the feature map output by the convolution activation unit is written into line buffer 1, row 2 into line buffer 2, and so on, with row S written into line buffer S; writing then becomes cyclic, i.e., row S+1 of the feature map is written into line buffer 1, ..., and row S+S into line buffer S. After each group of S rows is written, the control logic reads the S rows simultaneously and performs vertical stride-S downsampling, while horizontal downsampling is realized by register pipeline shifting.
The filter is an S × S mean or maximum downsampling filter, where S is the sampling stride; when S = 2, one sample is taken every 2 points in the horizontal and vertical directions of the feature map. The algorithm of the pooling unit includes:
Mean pooling process:

Pool_out(xp, yp) = (1 / S²) · Σ_i Σ_j P(x + i, y + j), 0 ≤ i ≤ S - 1, 0 ≤ j ≤ S - 1

Maximum pooling process:

Pool_out(xp, yp) = max(P(x + i, y + j)), 0 ≤ i ≤ S - 1, 0 ≤ j ≤ S - 1

where P represents the input feature map neurons, S represents the step of the pooling downsampling, Pool_out represents the neuron obtained after downsampling, x and y are the feature map coordinates before sampling, and xp and yp are the feature map coordinates after sampling.
In the hardware implementation, both algorithms can be implemented simultaneously, and in the operation of a specific CNN algorithm one of them is selected through firmware configuration. The division in the mean operation can be realized by precision-normalized multiplication; the usable values of 1/S² are 1/4, 1/9, 1/16, 1/25, ..., one or more of which can be hardwired and offered as a selectable configuration. If only values of the form 1/2^n are taken, the division can also be implemented simply by shifting.
The CNN-Pool memory unit is similar to the CNN-ConvReLU memory unit and can be implemented with a single-port or dual-port SRAM; a dual-port SRAM, for example, allows read and write operations to proceed simultaneously, with several clock delay beats inserted between reading and writing to avoid conflicts on the same memory address.
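A software sketch of this streaming pooling (S = 2, the 6 × 6 test map, and the mode flag standing in for the firmware-selected operator are assumptions) is given below:

```python
import numpy as np

def pool_stream(rows, S=2, mode="mean"):
    """Pool a feature map streamed row by row through S cyclically written line buffers."""
    line_buffers = [None] * S
    pooled_rows = []
    for r, row in enumerate(rows):
        line_buffers[r % S] = np.asarray(row, dtype=float)   # circular write into buffer (r mod S)
        if r % S == S - 1:                                   # S rows buffered: read them simultaneously
            block = np.stack(line_buffers)                   # shape (S, width)
            w = (block.shape[1] // S) * S
            patches = block[:, :w].reshape(S, w // S, S)     # horizontal stride by register shifting
            if mode == "mean":
                pooled = patches.mean(axis=(0, 2))           # the 1/S^2 factor; a shift when S is a power of 2
            else:
                pooled = patches.max(axis=(0, 2))
            pooled_rows.append(pooled)
    return np.stack(pooled_rows)

fm = np.arange(36, dtype=float).reshape(6, 6)                # feature map arriving row by row
print(pool_stream(fm, S=2, mode="mean"))                     # 3x3 mean-pooled map
print(pool_stream(fm, S=2, mode="max"))                      # 3x3 max-pooled map
```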
The feature map output by CNN-Pool enters either the next-layer network or the fully connected layer, depending on the number of network layers configured by the CNN algorithm: if the CNN-Pool output is the output of the last layer, it enters the fully connected layer CNN-FC; otherwise it enters the inter-layer feature map mapping to generate the feature maps of the next layer. The inter-layer feature map mapping is a linear combination of the feature maps output by the upper layer with the convolution filtering of the lower layer; because the matrix coefficients of each layer's convolution filters are fixed, each filter's matrix coefficients can be extracted as a common factor, so only the input feature maps need to be combined linearly. To diversify the features as much as possible, each newly generated feature map is combined from a subset of the input feature maps. The algorithm of the inter-layer feature map mapping includes:
P(x, y) = Σ_k Σ_i Σ_j I_k(x + i, y + j) · Conv_filter(i, j) + B

P(x, y) = Σ_i Σ_j [ Σ_k I_k(x + i, y + j) ] · Conv_filter(i, j) + B

where I is the input feature map, P is the generated feature map, Conv_filter and B are the convolution filter matrix coefficients and bias, k indexes a subset of the input feature maps, and P(x, y) represents the neuron value obtained through one or more clock beats of multiply-add operations.
In the hardware implementation, a fully connected combination structure is built so that different partial mapping combinations can be configured according to the CNN algorithm, as shown in fig. 10. The maximum number of inter-layer feature map mappings equals the maximum number of features extracted per CNN layer. The mapped feature maps are written to the external main memory (DDR/SDRAM) through the external data access unit (Load/Store), or the feature maps mapped by the upper layer are read from the external main memory and loaded into the next layer's network operation; the processor configures the operation parameters of the corresponding network layer according to which layer of the cycle the algorithm is in.
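As a sketch of this mapping (the 0/1 selection matrix and the particular subsets are assumptions standing in for the firmware-configured combinations), only the selected input maps are summed before the next layer's fixed filter is applied:

```python
import numpy as np

def map_feature_maps(prev_maps, selection):
    """prev_maps: (num_in, H, W); selection: (num_out, num_in) 0/1 mask of the combinations."""
    return np.einsum("oi,ihw->ohw", np.asarray(selection, dtype=float),
                     np.asarray(prev_maps, dtype=float))

prev_maps = np.random.default_rng(3).random((4, 12, 12))     # 4 feature maps from the previous layer
selection = np.array([[1, 1, 1, 0],                          # B1 = A1 + A2 + A3
                      [0, 1, 1, 1],                          # B2 = A2 + A3 + A4
                      [1, 0, 1, 1],
                      [1, 1, 0, 1],
                      [1, 1, 0, 0],
                      [0, 0, 1, 1],
                      [1, 0, 0, 1],
                      [1, 1, 1, 1]])                         # B8 = A1 + A2 + A3 + A4
combined = map_feature_maps(prev_maps, selection)            # 8 combined maps for the next layer's filters
print(combined.shape)                                        # (8, 12, 12)
```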
The fully connected unit flattens the feature map matrix output by the pooling unit, i.e., the multi-dimensional matrix array is arranged in a fixed order and written point by point into the feature cache. If the feature matrix is

PP = [ p11 p12 p13; p21 p22 p23; p31 p32 p33 ],

each feature value is written into the cache as PP = (p11 p12 p13 p21 p22 p23 p31 p32 p33). The cached feature values are read out in FIFO order and enter the fully connected unit, which compresses them by pipelined matrix multiplication to obtain the matrix Yk, applies correction activation to obtain Yk', compresses again by matrix multiplication to obtain the result matrix Yn, performs the Softmax operation on Yn to obtain the recognition/classification result, writes the result into a register, and triggers an interrupt or status signal to start subsequent actions.
The operational expressions include:

Yk = Wk · Yt, Yk' = ReLU(Yk), Yn = Wn · Yk',

where Yt is a T × 1 matrix, Wk is a K × T matrix, Yk and Yk' are K × 1 matrices, Wn is an N × K matrix, and Yn is an N × 1 matrix. T values are read continuously from the feature cache according to the clock beats while the T corresponding CNN matrix coefficients are read continuously from the register file or RAM, and one row of the matrix Yk is obtained by multiply-accumulate; this cycle is performed K times to compute all rows of the single-column matrix Yk. The same operation applied in the second stage yields the single-column matrix Yn. Softmax can compare the data of the previous and current clock beats through a comparator: if the data of the current beat is larger, it and its corresponding index are registered and kept in place of the previous value, and the next comparison proceeds until all data have been compared. The feature value cache may be implemented by a single-port SRAM or a register file.
The above illustrates the extraction and implementation of the CNN execution functional units; after their functions are modularized, they can be attached to the data shuffling crossbar network of the image signal processor. According to the different CNN algorithm firmware, either the image scaled inside the ISP is loaded to the CNN execution functional units, or the previous layer's network feature maps are loaded from the external main memory into the operation units and/or storage units. A combination of CNN functional units implements one layer of CNN network operation, and the result of each layer is stored to, or read back from, the external main memory to complete the iterative operation of the multi-layer CNN network. The image resolution used by the CNN algorithm is usually small (e.g., 256 × 256), so the image scaling unit can use a multi-stage reduction operation and the CNN line buffer depth can be set correspondingly small, reducing the on-chip storage overhead.
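A driver-loop sketch of this layer-by-layer iteration (the dictionary stands in for the external DDR/SDRAM, and run_one_layer is a hypothetical stand-in for the on-chip convolution/activation/pooling pass, not the chip's firmware API) might read:

```python
def run_cnn(scaled_image, layer_configs, run_one_layer):
    """Iterate the fixed functional-unit combination over the configured CNN layers."""
    external_ddr = {"layer0": scaled_image}              # the scaled input image is loaded first
    for idx, cfg in enumerate(layer_configs):
        prev = external_ddr["layer%d" % idx]             # Load: previous layer's feature maps
        external_ddr["layer%d" % (idx + 1)] = run_one_layer(prev, cfg)   # Store: result to main memory
    return external_ddr["layer%d" % len(layer_configs)]

# Identity stand-ins, just to show the call pattern:
layers = [{"kernels": 4}, {"kernels": 8}]
result = run_cnn([[0.0] * 256 for _ in range(256)], layers,
                 run_one_layer=lambda fm, cfg: fm)
```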
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An image signal processor comprising:
an execution functional unit, a data shuffling crossbar network, an internal storage unit, and a register file, wherein a video input unit or an external data access unit acquires data and delivers it to the data shuffling crossbar network, which dispatches or distributes the data to some of the functional units or internal storage units, and data processed by the functional units is delivered back to the crossbar network for the next dispatch or distribution; several or all of the functional units cooperate with several or all of the storage units to process and access data simultaneously, so that pipelined operation on a real-time video stream can be completed; the register file is used to cache temporary data or firmware parameters used by the functional units; and the data shuffling crossbar network is a full crossbar interconnect network.
2. The image signal processor according to claim 1, wherein the execution functional units include basic operation units obtained by extracting the characteristics of the CNN algorithm, namely CNN functional units; after the convolution filter completes the calculation of a new neuron in one or more clock beats, an activation correction operation is performed:

ReLU_out = max(0, P),

where ReLU_out is the activation output and P is the input neuron; and four types of basic operation units are obtained: a convolution activation unit, a pooling unit, a fully connected unit, and an inter-layer feature map mapping unit.
3. The image signal processor according to claim 2, wherein the convolution activation unit includes a first operation unit and a first storage unit; the first operation unit includes multiple groups of M × M first filters and correction operators; the first storage unit includes the same number of groups of first memories, each group containing M line buffers corresponding to the M vertical taps of the first filter, while the M horizontal taps are obtained through register pipelining, so that the multiply-add pipeline of the M × M convolution is completed in one or more clock beats; the control logic of each group of first memories reads M rows of data simultaneously into the convolution filter while completing the inter-line replacement storage operation; each line buffer is implemented with a dual-port or single-port SRAM, and several clock delay beats are inserted between reading and writing to avoid read/write conflicts on the same storage address;
the pooling unit includes a second operation unit and a second storage unit; the second operation unit includes multiple groups of second filters corresponding one-to-one with the filtering and activation operations in the convolution activation unit; the second storage unit includes the same number of groups of second memories, each group containing M line buffers; row 1 of the feature map output by the convolution activation unit is written into line buffer 1, row 2 into line buffer 2, and so on, with row S written into line buffer S; writing then becomes cyclic, i.e., row S+1 of the feature map is written into line buffer 1, ..., and row S+S into line buffer S; after each group of S rows is written, the control logic reads the S rows simultaneously and performs vertical stride-S downsampling, while horizontal downsampling is realized by register pipeline shifting.
4. The image signal processor of claim 3, wherein in the data shuffling crossbar network each operation unit can access any one of the storage units and, conversely, each storage unit can respond to any one of the operation units.
5. The image signal processor of claim 4, wherein the image signal processor is capable of performing CNN algorithm feature extraction, including:
image information input;
feature map extraction over a multi-layer network, where combinations of the feature maps of one layer are mapped to the next layer to generate new feature maps; and
full connection and classification: the fully connected layer fully connects the neurons generated by the last layer of the multi-layer network, flattens the multi-dimensional feature map array into one row, and obtains the classification probabilities through matrix transformation and

Softmax(x_i) = exp(x_i) / Σ_j exp(x_j),

where exp(x) is the exponential function;
wherein the basic constituent units of the CNN forward-propagation network obtained by extracting the characteristics of the CNN algorithm include at least one of convolution, activation, pooling, flattening, classification, and feature map mapping.
6. The image signal processor of claim 5, wherein the basic operation units are integrated onto the data shuffling crossbar network; according to the different CNN firmware algorithms, either the image scaled down inside the processor is loaded to the CNN functional units, or the previous layer's network feature maps are loaded from an external main memory into the operation units and/or storage units; and a combination of the CNN functional units implements one layer of CNN network operation, with the operation result of each layer stored to, or read from, the external main memory to complete the iterative operation of a multi-layer CNN network.
7. The image signal processor according to any one of claims 1 to 6, characterized by further comprising: the device comprises an external data access unit, a bus interface, a video input unit and a video output unit.
8. An image processing apparatus comprising:
an algorithm module, an upper computer, an image signal processor, and a storage module, wherein the image signal processor receives raw sensor data and sends video data to the algorithm module for CNN training; the upper computer sets the image processing and CNN parameters according to the training results, and the parameters are stored by the firmware or program storage module; the image signal processor invokes the stored parameters and continues to send video data to the algorithm module until the optimal image processing and CNN parameters are obtained; and the image signal processor is the image signal processor according to any one of claims 1 to 7.
9. The image processing apparatus according to claim 8, wherein the algorithm module comprises a CNN training module integrated in the upper computer, or the CNN training module comprises an FPGA.
10. A neural network image processing system, comprising a camera assembly and a host, wherein the camera assembly comprises the image signal processor according to any one of claims 1 to 7, or comprises the image processing device according to claim 8 or 9.
CN202010463164.3A 2020-05-27 2020-05-27 Image signal processor, image processing apparatus, and neural network image processing system Pending CN112019803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010463164.3A CN112019803A (en) 2020-05-27 2020-05-27 Image signal processor, image processing apparatus, and neural network image processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010463164.3A CN112019803A (en) 2020-05-27 2020-05-27 Image signal processor, image processing apparatus, and neural network image processing system

Publications (1)

Publication Number Publication Date
CN112019803A 2020-12-01

Family

ID=73507192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010463164.3A Pending CN112019803A (en) 2020-05-27 2020-05-27 Image signal processor, image processing apparatus, and neural network image processing system

Country Status (1)

Country Link
CN (1) CN112019803A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102053816A (en) * 2010-11-25 2011-05-11 中国人民解放军国防科学技术大学 Data shuffling unit with switch matrix memory and shuffling method thereof
CN105022609A (en) * 2015-08-05 2015-11-04 浪潮(北京)电子信息产业有限公司 Data shuffling method and data shuffling unit
CN108701236A (en) * 2016-01-29 2018-10-23 快图有限公司 Convolutional neural networks
KR101935399B1 (en) * 2018-07-11 2019-01-16 주식회사 두원전자통신 Wide Area Multi-Object Monitoring System Based on Deep Neural Network Algorithm
CN110874632A (en) * 2018-08-31 2020-03-10 北京嘉楠捷思信息技术有限公司 Image recognition processing method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114066828A (en) * 2021-11-03 2022-02-18 深圳市创科自动化控制技术有限公司 Image processing method and system based on multifunctional bottom layer algorithm
CN114066828B (en) * 2021-11-03 2022-09-02 深圳市创科自动化控制技术有限公司 Image processing method and system based on multifunctional bottom layer algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201201