WO2017071176A1 - Image processing method and image processing apparatus - Google Patents


Info

Publication number
WO2017071176A1
WO2017071176A1 (PCT/CN2016/080997)
Authority
WO
WIPO (PCT)
Prior art keywords
processing unit
memory
image
partition
classification
Prior art date
Application number
PCT/CN2016/080997
Other languages
English (en)
French (fr)
Inventor
姚骏
汪涛
汪玉
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP16858624.6A (EP3352132B1)
Publication of WO2017071176A1
Priority to US15/964,045 (US10740657B2)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24317Piecewise classification, i.e. whereby each classification requires several discriminant rules
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/96Management of image or video recognition tasks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present invention relates to the field of image processing, and in particular, to an image processing method and an image processing apparatus.
  • the image partition model accepts the input image and divides it into regions of different sizes; the image classification model uses a convolutional neural network or another classification algorithm to extract the features of each region through a hierarchical structure and finally identify the target object.
  • the heterogeneous platform is generally used for image recognition.
  • a central processing unit (CPU) + graphics processing unit (GPU) heterogeneous platform is used for image recognition.
  • the GPU is an easy-to-program, high-performance processor. Unlike CPUs, which are primarily used for instruction interpretation and data calculation, GPUs are designed to perform complex mathematical and geometric calculations, primarily for graphics and image processing.
  • in the CPU+GPU heterogeneous platform used for image recognition, the CPU performs image partitioning and the GPU then performs image classification.
  • the memory of different types of processors is independent of each other.
  • the CPU has independent CPU memory
  • the GPU also has independent GPU memory (also called video memory). Therefore, when a heterogeneous platform is used for target detection and recognition, its heterogeneous processors (such as the CPU and the GPU) need to exchange data constantly, and this large volume of data interaction causes long delays that degrade the detection performance of the entire heterogeneous platform.
  • the present invention provides an image processing method for improving the performance of image processing.
  • the present invention also provides related image processing apparatus.
  • a first aspect of the present invention provides an image processing method suitable for use in an image processing apparatus.
  • the processing unit of the image processing apparatus comprises a partition processing unit and a classification processing unit
  • the memory comprises a first memory and a second memory
  • the partition processing unit can be a CPU, a DSP, a processing core, or another hardware circuit capable of performing image partitioning operations
  • the classification processing unit can be a GPU, an FPGA, or another hardware circuit capable of performing image classification operations.
  • the partition processing unit and the classification processing unit are heterogeneous processing units, and share the first memory and the second memory.
  • the first memory stores a first image to be processed
  • the partition processing unit acquires the first image from the first memory, partitions the first image to obtain a first partition result, and then saves the first partition result in the second memory.
  • the classification processing unit acquires the first partition result saved by the partition processing unit from the second memory, and acquires the first image from the first memory.
  • the classification processing unit then classifies the first image according to the first image and the first partition result to obtain a first classification result.
  • the first memory and the second memory are shared by the partition processing unit and the classification processing unit, so shared data such as the first image and the first partition result does not need to be transferred between the partition processing unit and the classification processing unit, thereby avoiding the delay caused by data transfer between the processing units, speeding up image processing, and improving image processing performance.
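As an illustrative (non-limiting) sketch of this data flow, the following Python model uses plain dictionaries to stand in for the shared first and second memories; `partition` and `classify` are hypothetical placeholder functions, not the patent's actual algorithms:

```python
# Illustrative model of the shared-memory data flow: both processing
# units read the shared memories directly, so no data is copied
# between per-processor memories.

first_memory = {}   # shared: holds source images
second_memory = {}  # shared: holds partition results

def partition(image):
    # Placeholder: split the image into candidate regions.
    return [image[i:i + 2] for i in range(0, len(image), 2)]

def classify(image, regions):
    # Placeholder: label each region.
    return [(region, "object" if region else "background") for region in regions]

# The first image to be processed is already in the first memory.
first_memory["img1"] = [1, 2, 3, 4]

# Partition processing unit: read from the first memory, write the
# partition result into the second memory.
second_memory["img1"] = partition(first_memory["img1"])

# Classification processing unit: read both shared memories directly.
result = classify(first_memory["img1"], second_memory["img1"])
```

The point of the sketch is only that both stages touch the same `first_memory` and `second_memory` objects; nothing is transferred between stage-private buffers.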
  • the processing unit of the image processing apparatus may further include a scheduling processing unit, configured to acquire the first image to be processed from outside the image processing apparatus, and save the acquired first image in the first memory.
  • the scheduling processing unit shares the first memory together with the partition processing unit and the classification processing unit because the scheduling processing unit needs to access the first memory.
  • each processing unit does not process a plurality of images in batches, but processes a plurality of images serially by means of a pipeline.
  • the scheduling processing unit performs the following operations: acquiring the second image to be processed, and saving the second image in the first memory.
  • the partition processing unit performs the following steps: acquiring the second image from the first memory, partitioning the second image to obtain the second partition result, and saving the second partition result in the second memory.
  • after obtaining the first classification result, the classification processing unit performs the following steps: acquiring the second image from the first memory, acquiring the second partition result from the second memory, and then classifying the second image according to the second image and the second partition result to obtain a second classification result.
  • the use of pipelines can reduce the memory capacity requirements of each processing unit and make full use of each processing unit.
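The pipeline described above can be sketched with Python threads and bounded queues standing in for the shared memories; the stage functions and queue sizes below are illustrative assumptions, not the patent's implementation (in the patent, the classification unit also reads the image itself from the first memory; here the partition stage passes it downstream for brevity):

```python
# Pipeline sketch: three stages run concurrently on different images,
# while each stage handles one image at a time (serial, not batched).
import queue
import threading

first_memory = queue.Queue(maxsize=2)   # images awaiting partitioning
second_memory = queue.Queue(maxsize=2)  # partition results awaiting classification
results = []

def scheduler(images):
    for img in images:
        first_memory.put(img)   # acquire image, save it in the first memory
    first_memory.put(None)      # end-of-stream marker

def partition_unit():
    while (img := first_memory.get()) is not None:
        second_memory.put((img, f"partition({img})"))
    second_memory.put(None)

def classification_unit():
    while (item := second_memory.get()) is not None:
        img, part = item
        results.append(f"classify({img}, {part})")

threads = [threading.Thread(target=scheduler, args=(["img1", "img2"],)),
           threading.Thread(target=partition_unit),
           threading.Thread(target=classification_unit)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The small `maxsize` on each queue mirrors the claim that pipelining reduces the memory capacity each processing unit needs: only a couple of items are ever buffered per stage.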
  • the scheduling processing unit can control startup, work, and suspension of each processing unit.
  • the scheduling processing unit may control the time point at which it starts to acquire the second image to be processed so that it is not earlier than the time when the partition processing unit starts the partitioning operation on the first image, to reduce the average power of the scheduling processing unit.
  • similarly, the scheduling processing unit may also control the time point at which the partition processing unit starts the partitioning operation on the second image so that it is not earlier than the time when the classification processing unit starts the classification operation on the first image, to reduce the average power of the partition processing unit.
  • the scheduling processing unit may suspend the partition processing unit when the occupancy of the second memory reaches the first preset occupancy rate or the second memory is full, so as to prevent the partition results calculated by the partition processing unit from backing up in the second memory.
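This occupancy check can be sketched as follows; the 80% threshold, the list-based memory model, and the function name are illustrative assumptions only:

```python
# Sketch of the backpressure rule: suspend the partition unit once the
# second memory's occupancy reaches a preset rate, or the memory is full.

PRESET_OCCUPANCY = 0.8  # illustrative "first preset occupancy rate"

def should_suspend_partition(second_memory, capacity):
    occupancy = len(second_memory) / capacity
    return occupancy >= PRESET_OCCUPANCY or len(second_memory) >= capacity

second_memory = ["r1", "r2", "r3", "r4"]  # four buffered partition results
print(should_suspend_partition(second_memory, capacity=5))   # 4/5 = 0.8 -> True
print(should_suspend_partition(second_memory, capacity=10))  # 0.4 -> False
```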
  • the image processing apparatus may further include a third memory, configured to save the classification result calculated by the classification processing unit. After obtaining the first classification result, the classification processing unit saves the first classification result to the third memory.
  • the scheduling processing unit is responsible for reading the first classification result from the third memory and outputting the first classification result to the outside of the image processing apparatus, such as writing the first classification result to the disk.
  • the third memory is shared by at least the scheduling processing unit and the classification processing unit.
  • the scheduling processing unit may suspend the classification processing unit when the occupancy of the third memory reaches the second preset occupancy rate or the third memory is full, so as to prevent the classification results calculated by the classification processing unit from backing up in the third memory.
  • the scheduling processing unit may also be responsible for adjusting the size or bandwidth of the memory in the image processing apparatus.
  • the scheduling processing unit may acquire a first duration in which the partition processing unit performs the partitioning operation on the first image, and a second duration in which the classification processing unit performs the classification operation on the first image. If the first duration is greater than the second duration, it indicates that the classification processing unit performs the classification operation faster than the partition processing unit performs the partitioning operation, and the scheduling processing unit may increase the size of the second memory, and/or the bandwidth of the second memory, and/or the size of the third memory, and/or the bandwidth of the third memory.
  • conversely, if the partition processing unit performs the partitioning operation at a faster rate than the classification processing unit performs image classification, the scheduling processing unit may reduce the size of the second memory, and/or reduce the bandwidth of the second memory, and/or increase the size of the third memory, and/or increase the bandwidth of the third memory.
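The duration comparison can be sketched as follows. This follows the size adjustments stated above; the `rebalance` function name, the megabyte units, and the fixed adjustment step are illustrative assumptions (the patent also allows bandwidth adjustments, omitted here for brevity):

```python
# Sketch: compare per-stage durations on the same image, then adjust
# the second and third memory sizes as described in the text.

def rebalance(first_duration, second_duration, sizes, step=64):
    """sizes maps memory name -> size in MB; returns adjusted sizes."""
    adjusted = dict(sizes)
    if first_duration > second_duration:
        # Classification outpaces partitioning: grow the second memory
        # (and, per the text, the third).
        adjusted["second"] += step
        adjusted["third"] += step
    elif first_duration < second_duration:
        # Partitioning outpaces classification: shrink the second
        # memory and grow the third.
        adjusted["second"] -= step
        adjusted["third"] += step
    return adjusted

sizes = {"second": 256, "third": 256}
print(rebalance(first_duration=12.0, second_duration=8.0, sizes=sizes))
# {'second': 320, 'third': 320}
```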
  • the image processing apparatus may further include a fourth memory and/or a fifth memory, wherein the fourth memory is used to save intermediate results of the partitioning operation performed by the partition processing unit, and the fifth memory is used to save intermediate results of the classification operation performed by the classification processing unit.
  • when the first duration is greater than the second duration, the scheduling processing unit may also increase the size of the fourth memory, and/or reduce the size of the fifth memory, and/or increase the bandwidth of the fourth memory, and/or reduce the bandwidth of the fifth memory.
  • otherwise, the scheduling processing unit may reduce the size of the fourth memory, and/or increase the size of the fifth memory, and/or reduce the bandwidth of the fourth memory, and/or increase the bandwidth of the fifth memory.
  • the scheduling processing unit may suspend the partition processing unit when the fourth memory usage reaches the third preset occupancy rate or the fourth memory is full. And/or, the scheduling processing unit may suspend the classification processing unit when the occupancy rate of the fifth memory reaches the fourth preset occupancy rate or the fifth memory is full.
  • the image processing apparatus may further include a sixth memory, configured to save the algorithm parameters used by the partition processing unit for the partitioning operation and by the classification processing unit for the classification operation, where the sixth memory is shared by at least the partition processing unit and the classification processing unit.
  • a second aspect of the present invention provides an image processing apparatus.
  • the processing unit of the image processing apparatus includes a partition processing unit and a classification processing unit.
  • the memory includes a first memory and a second memory. The partition processing unit can be a CPU, a DSP, a processing core, or another hardware circuit capable of performing image partitioning operations, and the classification processing unit can be a GPU, an FPGA, or another hardware circuit capable of performing image classification operations.
  • the partition processing unit and the classification processing unit are heterogeneous processing units, and share the first memory and the second memory.
  • the partition processing unit is configured to: obtain a first image to be processed from the first memory, partition the first image to obtain a first partition result, and save the first partition result in the second memory.
  • the classification processing unit is configured to: obtain the first partition result saved by the partition processing unit from the second memory, and acquire the first image from the first memory. The classification processing unit then classifies the first image according to the first image and the first partition result to obtain a first classification result. Since the first memory and the second memory are shared by the partition processing unit and the classification processing unit, shared data such as the first image and the first partition result need not be transferred between the partition processing unit and the classification processing unit, thereby avoiding the delay caused by data transfer between processing units, speeding up image processing, and improving image processing performance.
  • the processing unit of the image processing apparatus may further include a scheduling processing unit, configured to acquire the first image to be processed from outside the image processing apparatus, and save the acquired first image in the first memory.
  • the scheduling processing unit shares the first memory together with the partition processing unit and the classification processing unit because the scheduling processing unit needs to access the first memory.
  • each processing unit does not process multiple images in batches, but processes multiple images serially in a pipeline.
  • the scheduling processing unit is further configured to perform the following steps: acquiring the second image to be processed, and saving the acquired second image in the first memory.
  • the partition processing unit is further configured to perform the following steps: acquiring the second image from the first memory, partitioning the second image to obtain a second partition result, and saving the second partition result in the second memory.
  • the classification processing unit is further configured to: perform the following steps: acquiring the second image from the first memory, obtaining the second partition result from the second memory, and then, according to the second image and the second partition result The second image is classified to obtain a second classification result.
  • the startup time of each processing unit when using the pipeline mode can be controlled by the scheduling processing unit.
  • the scheduling processing unit is further configured to control start, work, and suspend of each processing unit.
  • the scheduling processing unit is configured to control the time point at which it starts to acquire the second image to be processed so that it is not earlier than the time point at which the partition processing unit starts the partitioning operation on the first image, to reduce the average power of the scheduling processing unit.
  • the scheduling processing unit is further configured to control a time point at which the partition processing unit starts the partitioning operation on the second image, not earlier than the time point at which the classification processing unit starts the classification operation on the first image, to reduce the partition processing unit. Average power.
  • the scheduling processing unit is further configured to suspend the partition processing unit when the occupancy of the second memory reaches the first preset occupancy rate or the second memory is full, so as to prevent the partition results calculated by the partition processing unit from backing up in the second memory.
  • the image processing apparatus may further include a third memory, configured to save the classification result calculated by the classification processing unit.
  • the classification processing unit is further configured to save the first classification result into the third memory.
  • the scheduling processing unit is further configured to read the first classification result from the third memory, and output the first classification result to the outside of the image processing apparatus, such as writing the first classification result to the disk.
  • the third memory is shared by at least the scheduling processing unit and the classification processing unit.
  • the scheduling processing unit is further configured to suspend the classification processing unit when the occupancy of the third memory reaches the second preset occupancy rate or the third memory is full, so as to prevent the classification results calculated by the classification processing unit from backing up in the third memory.
  • the scheduling processing unit is further configured to adjust a size or a bandwidth of the memory in the image processing apparatus.
  • the scheduling processing unit is further configured to acquire a first duration in which the partition processing unit performs the partitioning operation on the first image, and a second duration in which the classification processing unit performs the classification operation on the first image. If the first duration is greater than the second duration, it indicates that the classification processing unit performs the classification operation faster than the partition processing unit performs the partitioning operation, and the scheduling processing unit increases the size of the second memory, and/or the bandwidth of the second memory, and/or the size of the third memory, and/or the bandwidth of the third memory.
  • conversely, if the partition processing unit performs the partitioning operation at a faster rate than the classification processing unit performs image classification, the scheduling processing unit decreases the size of the second memory, and/or reduces the bandwidth of the second memory, and/or increases the size of the third memory, and/or increases the bandwidth of the third memory.
  • the image processing apparatus may further include a fourth memory and/or a fifth memory, wherein the fourth memory is used to save intermediate results of the partitioning operation performed by the partition processing unit, and the fifth memory is used to save intermediate results of the classification operation performed by the classification processing unit.
  • when the first duration is greater than the second duration, the scheduling processing unit is further configured to increase the size of the fourth memory, and/or reduce the size of the fifth memory, and/or increase the bandwidth of the fourth memory, and/or reduce the bandwidth of the fifth memory.
  • otherwise, the scheduling processing unit is further configured to reduce the size of the fourth memory, and/or increase the size of the fifth memory, and/or reduce the bandwidth of the fourth memory, and/or increase the bandwidth of the fifth memory.
  • the scheduling processing unit is further configured to suspend the partition processing unit when the occupancy of the fourth memory reaches the third preset occupancy rate or the fourth memory is full. And/or, the scheduling processing unit is further configured to suspend the classification processing unit when the occupancy of the fifth memory reaches the fourth preset occupancy rate or the fifth memory is full.
  • the image processing apparatus may further include a sixth memory, configured to save the algorithm parameters used by the partition processing unit for the partitioning operation and by the classification processing unit for the classification operation, where the sixth memory is shared by at least the partition processing unit and the classification processing unit.
  • FIG. 1 is a schematic diagram of an image target detection process.
  • FIG. 2 is a structural diagram of a CPU+GPU heterogeneous platform in the prior art.
  • FIG. 3 is a structural diagram of an image processing apparatus according to an embodiment of the present invention.
  • FIG. 4 is a schematic flow chart of an image processing method according to an embodiment of the present invention.
  • FIG. 5 is another structural diagram of an image processing apparatus according to an embodiment of the present invention.
  • FIG. 6 is another schematic flowchart of an image processing method according to an embodiment of the present invention.
  • FIG. 7 is another structural diagram of an image processing apparatus according to an embodiment of the present invention.
  • FIG. 8(a) is another schematic flowchart of an image processing method according to an embodiment of the present invention.
  • FIG. 8(b) is another schematic flowchart of an image processing method according to an embodiment of the present invention.
  • FIG. 9 is a structural diagram of an image processing apparatus in an application scenario of the present invention.
  • FIG. 10 is a schematic flowchart of an image processing method in an application scenario of the present invention.
  • Embodiments of the present invention provide an image processing method for improving performance of image processing. Embodiments of the present invention also provide related image processing devices, which will be separately described below.
  • the image partition model accepts the input image and divides the input image into regions of different sizes by partition operation.
  • the image classification model uses a convolutional neural network or other classification algorithm to perform classification operations, and finally identifies the target object.
  • heterogeneous platforms are generally used for image recognition.
  • the central processing unit (CPU) is the core of a system's operation and control; its function is mainly to interpret system instructions and process data in system software.
  • the GPU (graphics processing unit) is an easy-to-program, high-performance processor designed mainly for graphics and image processing.
  • the CPU+GPU heterogeneous platform is generally used in the current technology to realize the detection and recognition of targets in the image.
  • the heterogeneous platform refers to a platform in which two or more types of processors are integrated. For convenience of description, an example in which the CPU + GPU is used as a heterogeneous platform for image recognition is described in the embodiment of the present invention.
  • in the target detection and recognition process, the CPU first partitions the source image to be detected and writes the partition result into the CPU memory. Since the memories of different types of processors in the heterogeneous platform are not shared, the CPU needs to transfer the partition result to the GPU memory (also called video memory), and the GPU combines the source image and the partition result to classify the source image and obtain a classification result (i.e., the detection and recognition result of the target in the image).
  • the CPU continuously calculates partition results and writes them into the CPU memory, so it must continuously transfer the data in the CPU memory to the GPU memory. The large amount of data transfer between CPU memory and GPU memory causes long delays and slows down image processing, which in turn affects the target detection performance of the entire heterogeneous platform.
  • the present invention provides an image processing apparatus, and provides a corresponding image processing method based on the image processing apparatus.
  • the basic structure of the image processing apparatus provided by the present invention will be described below with reference to FIG. 3, which specifically includes:
  • the partition processing unit 301 is mainly used for performing image partitioning operations.
  • the partition processing unit 301 may be implemented by one or more processors, and each processor may be a CPU, a digital signal processor (DSP), or another type of processor.
  • the partition processing unit 301 may also be implemented by one or more cores of a processor, or by another hardware circuit capable of performing image partitioning operations, which is not limited in the embodiments of the present invention.
  • the classification processing unit 302 is mainly used to perform an image classification operation.
  • the classification processing unit 302 may be implemented by one or more processors, and each processor may be a GPU, a field-programmable gate array (FPGA), or another type of processor.
  • the classification processing unit 302 may also be implemented by one or more cores of a processor, or by another hardware circuit capable of performing image classification operations, which is not limited in the embodiments of the present invention.
  • the partition processing unit 301 and the classification processing unit 302 are heterogeneous processing units, that is, they are different types of processing units. For example, if the partition processing unit 301 is implemented by a CPU, the classification processing unit 302 cannot also be implemented by a CPU, but may employ a GPU, an FPGA, or another type of processing unit.
  • the first memory 303 is configured to store an image to be processed, that is, a source image.
  • the first memory 303 is connected to the partition processing unit 301 and the classification processing unit 302, and is shared by the partition processing unit 301 and the classification processing unit 302. That is, both the partition processing unit 301 and the classification processing unit 302 can directly access the first memory 303.
  • the second memory 304 is configured to store the result of the image partitioning operation performed by the partition processing unit 301.
  • the second memory 304 is connected to the partition processing unit 301 and the classification processing unit 302, and is shared by the partition processing unit 301 and the classification processing unit 302. That is, both the partition processing unit 301 and the classification processing unit 302 can directly access the second memory 304.
  • the present invention provides a corresponding image processing method.
  • the basic flow of the method will be explained by taking the processing procedure of the first image as an example. Please refer to FIG. 4:
  • 401: the first image to be processed is saved in the first memory, and the partition processing unit acquires the first image from the first memory;
  • 402: the partition processing unit partitions the first image to obtain a first partition result;
  • 403: the partition processing unit saves the first partition result in the second memory;
  • 404: the classification processing unit acquires, from the second memory, the first partition result saved by the partition processing unit;
  • after the first partition result has been acquired, it may be deleted from the second memory;
  • 405: the classification processing unit acquires the first image from the first memory;
  • after the first image has been acquired, it may be deleted from the first memory;
  • step 405 and steps 401 to 404 are in no particular order, and step 405 may also occur before any of steps 401 to 404;
  • 406: the classification processing unit classifies the first image according to the first image and the first partition result to obtain a first classification result.
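The steps of the method can be walked through end to end in a short Python sketch; the dictionaries, the `run_steps` helper, and the lambda stand-ins for partitioning and classification are illustrative assumptions only:

```python
# Illustrative walk-through of the method of FIG. 4 with plain dicts
# as the shared first and second memories.

def run_steps(first_memory, second_memory, partition, classify):
    image = first_memory["first_image"]          # acquire image from first memory
    part = partition(image)                      # partition it
    second_memory["first_partition"] = part      # save result in second memory
    part = second_memory.pop("first_partition")  # classifier reads the result
    # (popping models the optional deletion once the result is consumed)
    image = first_memory.pop("first_image")      # classifier reads the image
    return classify(image, part)                 # classify to get the result

first_memory = {"first_image": "IMG"}
result = run_steps(first_memory, {},
                   partition=lambda img: f"regions({img})",
                   classify=lambda img, p: f"label({img}, {p})")
print(result)  # label(IMG, regions(IMG))
```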
  • the first image needs to be shared by the partition processing unit and the classification processing unit, and the first partition result calculated by the partition processing unit needs to be used by the classification processing unit.
  • the first image is saved in the first memory, and the first memory is shared by the partition processing unit and the classification processing unit. Therefore, the first image does not need to be transferred from a memory of the partition processing unit into a memory of the classification processing unit.
  • the first partition result is saved in the second memory, and the second memory is also shared by the partition processing unit and the classification processing unit, so the first partition result does not need to be transferred from a memory of the partition processing unit into a memory of the classification processing unit either.
  • the method provided in this embodiment reduces the data handling operation between different processing units in the image processing process by storing the shared data in the shared memory, thereby avoiding data handling between the processing units.
  • the delay speeds up the speed of image processing and improves the performance of image processing.
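To make the flow of steps 401 to 406 concrete, the following Python sketch models the two shared memories as plain dictionaries and the two processing units as functions. The "partition" and "classification" computations are toy placeholders, not the algorithms of this disclosure; all names are illustrative assumptions.

```python
# Hypothetical sketch of steps 401-406. first_memory and second_memory stand in
# for the shared memories; the partition/classification logic is illustrative only.

first_memory = {}    # shared: images to be processed (step 401)
second_memory = {}   # shared: partition results (step 403)

def partition_unit(image_id):
    image = first_memory[image_id]                               # 401: read shared image
    regions = [image[i:i + 2] for i in range(0, len(image), 2)]  # 402: toy partitioning
    second_memory[image_id] = regions                            # 403: save to shared memory

def classification_unit(image_id):
    regions = second_memory.pop(image_id)   # 404: result may be deleted once read
    image = first_memory.pop(image_id)      # 405: image may be deleted once read
    return [len(r) for r in regions]        # 406: toy "classification" per region

first_memory["img1"] = [10, 20, 30, 40]
partition_unit("img1")
result = classification_unit("img1")
print(result)  # [2, 2]
```

Because both units read and write the same dictionaries, nothing is copied between per-unit private memories, which is the point of sharing the first and second memories.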
  • In step 401, the first image to be processed is stored in the first memory.
  • Optionally, the first image may be acquired by the partition processing unit and saved in the first memory.
  • Optionally, the embodiment of the present invention further introduces a scheduling processing unit, configured to acquire the first image to be processed before step 401 and save the acquired first image in the first memory. It can be understood that the scheduling processing unit shares the first memory with the partition processing unit and the classification processing unit.
  • The scheduling processing unit introduced in the preceding paragraph is used to acquire the first image from outside the image processing apparatus into the first memory. It can be understood that after step 406, the image processing apparatus needs to output the first classification result out of the image processing apparatus (for example, writing the first classification result to a disk); this operation may likewise be performed by the scheduling processing unit.
  • Optionally, the present invention may also introduce a third memory: after step 406, the classification processing unit saves the first classification result in the third memory, and the scheduling processing unit outputs the first classification result in the third memory from the image processing apparatus.
  • Another image processing apparatus provided by an embodiment of the present invention is shown in FIG. 5.
  • The partition processing unit 501, the classification processing unit 502, the first memory 503, and the second memory 504 are substantially the same as the partition processing unit 301, the classification processing unit 302, the first memory 303, and the second memory 304 shown in FIG. 3, and are not described again here.
  • The scheduling processing unit 505 and the third memory 506 are both optional modules; for the operations or functions they perform, refer to the discussion in the previous two paragraphs, and details are not described here again.
  • In current technology, image processing apparatuses generally process multiple images in batches. For example, the scheduling processing unit acquires 50 images in a batch; after the acquisition of the 50 images is completed, the partition processing unit partitions the 50 images in a batch and writes the results into the memory of the partition processing unit; after the 50 images are partitioned, the classification processing unit classifies the 50 images. It can be seen that this batch processing method requires each processing unit to have a large memory capacity, and while one processing unit operates, the other processing units are idle (for example, while the partition processing unit is batch-partitioning the 50 images, the classification processing unit is always idle).
  • To reduce the memory capacity requirement and make full use of each processing unit, the image processing apparatus of the present invention can process images in a pipelined manner. The following describes the pipeline mode, taking only the first image and the second image as an example, with reference to FIG. 6:
  • the horizontal axis represents time;
  • the vertical axis represents the image being processed.
  • After the scheduling processing unit acquires the first image and saves it into the first memory, it starts the operation of acquiring the second image and saving the second image into the first memory.
  • The scheduling processing unit may start the operation of acquiring the second image immediately after the operation of acquiring the first image is completed, or at some time after that operation is completed; there is no limitation here.
  • After the partition processing unit partitions the first image, obtains the first partition result, and saves the first partition result in the second memory, the following operations may be started under the control of the scheduling processing unit or spontaneously by the partition processing unit: acquiring the second image from the first memory, partitioning the second image to obtain a second partition result, and saving the second partition result in the second memory.
  • After the classification processing unit classifies the first image according to the first image and the first partition result and obtains the first classification result, the following operations may be started under the control of the scheduling processing unit or spontaneously by the classification processing unit: acquiring the second image from the first memory, acquiring the second partition result from the second memory, and classifying the second image according to the second image and the second partition result to obtain a second classification result. If the data processing apparatus includes the scheduling processing unit and the third memory, the classification processing unit further saves the second classification result in the third memory, and the second classification result is output from the third memory by the scheduling processing unit.
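The pipelining described above can be sketched with queues standing in for the shared memories and one worker thread per heterogeneous unit. The stage functions and string results below are illustrative assumptions, not the implementation of this disclosure.

```python
# Minimal pipeline sketch: scheduler acquires images, partition and
# classification units run concurrently and hand data through shared queues.
import queue
import threading

first_memory = queue.Queue()    # acquired source images
second_memory = queue.Queue()   # (image, partition result) pairs
third_memory = queue.Queue()    # classification results

def scheduling_unit(images):
    # acquire image N, then immediately start acquiring image N+1
    for img in images:
        first_memory.put(img)
    first_memory.put(None)      # end-of-stream marker

def partition_unit():
    while (img := first_memory.get()) is not None:
        second_memory.put((img, f"partition({img})"))
    second_memory.put(None)     # propagate end-of-stream

def classification_unit():
    while (item := second_memory.get()) is not None:
        img, part = item
        third_memory.put(f"classify({img}, {part})")

workers = [threading.Thread(target=partition_unit),
           threading.Thread(target=classification_unit)]
for w in workers:
    w.start()
scheduling_unit(["image1", "image2"])
for w in workers:
    w.join()

results = [third_memory.get() for _ in range(2)]
print(results)
```

While the classification unit works on image 1's partition result, the partition unit is already free to consume image 2, which is exactly the overlap the pipeline mode aims for.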
  • It should be noted that the scheduling processing unit in the embodiment of the present invention is only a processing unit abstracted by function.
  • In an actual product, the scheduling processing unit may be an independent processing unit, or may be the same processing unit as the partition processing unit or the classification processing unit (that is, the actual product may not include a physical scheduling processing unit, and the operations performed by the scheduling processing unit in the embodiment of the present invention may be performed by the partition processing unit and/or the classification processing unit).
  • For example, a DSP + single-core CPU + dual-core GPU may be used to construct the data processing apparatus provided by the embodiment of the present invention, where the CPU serves as the scheduling processing unit, the DSP serves as the partition processing unit, and the dual-core GPU serves as the classification processing unit.
  • For another example, a 4-core CPU and a dual-core GPU may be used to construct the data processing apparatus provided by the embodiment of the present invention, where the dual-core GPU serves as the classification processing unit and the first three cores of the 4-core CPU serve as the partition processing unit; the CPU cores, while playing the role of the partition processing unit, are also responsible for performing the operations that need to be performed by the scheduling processing unit.
  • As mentioned above, the scheduling processing unit is a functionally abstracted processing unit. In addition to the task scheduling discussed above, the present invention also provides further scheduling tasks for the scheduling processing unit, including memory tuning tasks and/or process control tasks. Specifically:
  • Memory tuning tasks: the scheduling processing unit can adjust parameters such as the size and bandwidth of the memories in the image processing apparatus according to the operating conditions of the image processing apparatus.
  • Generally, the partition calculation amount of an image is smaller than its classification calculation amount, so the image partitioning operation tends to be faster than the image classification operation.
  • If the partition processing unit performs the partitioning operation too fast, the classification processing unit cannot digest the partition results of the partition processing unit in time, causing partition results to back up in the second memory and wasting the fast computing capability of the partition processing unit. It can be understood that, for the same image, if the duration of partitioning by the partition processing unit equals the duration of classification by the classification processing unit, the rates at which the partition processing unit and the classification processing unit process images are exactly matched, and the classification processing unit can consume the partition results just in time.
  • The scheduling processing unit may acquire the first duration in which the partition processing unit performs the partitioning operation on the first image, and the second duration in which the classification processing unit performs the classification operation on the first image. If the first duration is greater than the second duration, that is, the partition processing unit takes longer to partition the first image than the classification processing unit takes to classify it, this indicates that the classification processing unit performs the classification operation faster than the partition processing unit performs the partitioning operation.
  • In this case, the scheduling processing unit can increase the size of the second memory, so that more partition results can be saved in the second memory and the classification processing unit does not need to spend a long time waiting for the partition results of the partition processing unit.
  • Correspondingly, increasing the second memory may cause the third memory to become smaller, so that when there is no space in the third memory for the classification results of the classification processing unit, the classification processing unit is forced to suspend; this limits the rate at which the classification processing unit performs image classification. The scheduling processing unit may also directly reduce the size of the third memory to limit the rate at which the classification processing unit performs image classification.
  • The scheduling processing unit may further increase the bandwidth of the second memory, so that the partition results of the partition processing unit can be quickly moved into the second memory, which helps increase the image partitioning rate of the partition processing unit.
  • Correspondingly, increasing the bandwidth of the second memory may make the bandwidth of the third memory smaller, thereby reducing the rate at which the classification processing unit saves classification results into the third memory and restricting the rate at which the classification processing unit performs image classification; the scheduling processing unit may also directly reduce the bandwidth of the third memory to limit the rate at which the classification processing unit performs image classification.
  • Conversely, if the first duration is less than the second duration, the partition processing unit performs the partitioning operation faster than the classification processing unit performs the classification operation.
  • In this case, the scheduling processing unit may reduce the size of the second memory, and/or reduce the bandwidth of the second memory, and/or increase the size of the third memory, and/or increase the bandwidth of the third memory, to limit the rate at which the partition processing unit performs partitioning operations.
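A minimal sketch of this tuning rule follows. The dictionary fields and the step sizes are placeholders chosen for illustration, not values prescribed by this disclosure.

```python
def tune_memories(mem, first_duration, second_duration, step_mb=50, step_bw=1):
    """Shift size/bandwidth between the second and third memories based on
    which unit is the bottleneck (a sketch, not the disclosed rule)."""
    if first_duration > second_duration:
        # classification is faster: buffer more partition results,
        # and throttle classification through the third memory
        mem["second_size_mb"] += step_mb
        mem["third_size_mb"] = max(0, mem["third_size_mb"] - step_mb)
        mem["second_bw"] += step_bw
        mem["third_bw"] = max(0, mem["third_bw"] - step_bw)
    elif first_duration < second_duration:
        # partitioning is faster: limit its output rate instead
        mem["second_size_mb"] = max(0, mem["second_size_mb"] - step_mb)
        mem["third_size_mb"] += step_mb
        mem["second_bw"] = max(0, mem["second_bw"] - step_bw)
        mem["third_bw"] += step_bw
    return mem

mem = {"second_size_mb": 200, "third_size_mb": 100, "second_bw": 4, "third_bw": 4}
tune_memories(mem, first_duration=30, second_duration=10)
print(mem["second_size_mb"], mem["third_size_mb"])  # 250 50
```

Here partitioning is slower than classification (30 vs. 10), so capacity and bandwidth shift toward the second memory, matching the rule in the text above.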
  • Optionally, the image processing apparatus may further include a fourth memory for holding intermediate results when the partition processing unit performs the partitioning operation, and/or a fifth memory for holding intermediate results when the classification processing unit performs the classification operation, as shown in FIG. 7.
  • When the first duration is greater than the second duration, the scheduling processing unit may increase the size of the fourth memory, and/or increase the bandwidth of the fourth memory, and/or reduce the size of the fifth memory, and/or reduce the bandwidth of the fifth memory, to reduce the rate at which the classification processing unit performs the classification operation.
  • Conversely, when the first duration is less than the second duration, the scheduling processing unit may reduce the size of the fourth memory, and/or reduce the bandwidth of the fourth memory, and/or increase the size of the fifth memory, and/or increase the bandwidth of the fifth memory.
  • Process control tasks: the starting, operation, and suspension of the partition processing unit and the classification processing unit can be controlled by the scheduling processing unit.
  • For example, when the occupancy rate of the second memory reaches a first preset occupancy rate, the scheduling processing unit can suspend the partition processing unit.
  • The scheduling processing unit may further start the partition processing unit after the partition results in the second memory are no longer backlogged (specifically, after the partition processing unit has been suspended for a first preset duration, or after the occupancy of the second memory decreases below a preset threshold, or under other set conditions, it can be considered that the partition results in the second memory are no longer backlogged).
  • Similarly, when the occupancy rate of the third memory reaches a second preset occupancy rate, the scheduling processing unit can suspend the classification processing unit, and may further restart the classification processing unit when the classification results in the third memory are no longer backlogged.
  • When the occupancy rate of the fourth memory reaches a third preset occupancy rate, the scheduling processing unit may suspend the partition processing unit; when the occupancy rate of the fifth memory reaches a fourth preset occupancy rate, the scheduling processing unit may suspend the classification processing unit.
  • The first preset occupancy rate to the fourth preset occupancy rate are preset values, for example 80%, 90%, or 100% (that is, the memory is full). Among the first to fourth preset occupancy rates, the values of any two or more preset occupancy rates may be the same, or the values may all be different; this is not limited here.
  • Alternatively, the suspension conditions may be expressed in terms of remaining space: when the remaining space of the second memory is less than a second preset size, the scheduling processing unit may suspend the partition processing unit.
  • The scheduling processing unit may further start the partition processing unit after the partition results in the second memory are no longer backlogged (specifically, after the partition processing unit has been suspended for a first preset duration, or after the occupancy of the second memory decreases below a preset threshold, or after the remaining space of the second memory is greater than a preset size, or under other set conditions, it can be considered that the partition results in the second memory are no longer backlogged).
  • When the remaining space of the third memory is less than a third preset size, the scheduling processing unit can suspend the classification processing unit, and may further restart the classification processing unit when the classification results in the third memory are no longer backlogged.
  • When the remaining space of the fourth memory is less than a fourth preset size, the scheduling processing unit may suspend the partition processing unit; when the remaining space of the fifth memory is less than a fifth preset size, the scheduling processing unit may suspend the classification processing unit.
  • The second preset size to the fifth preset size are preset values, and may be positive values or 0 (that is, the memory is full). Among the first preset size to the fifth preset size, the values of any two or more preset sizes may be the same, or the values may all be different; this is not limited here.
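The suspend/restart policy above can be sketched as a small state function. The 90% suspend threshold and 50% restart threshold below are example values consistent with the preset rates mentioned in the text, not values prescribed by it.

```python
def control_unit_state(running, used, capacity, suspend_at=0.90, restart_at=0.50):
    """Suspend a producer when its output memory is nearly full; restart it
    once the backlog has drained below a preset threshold (a sketch)."""
    occupancy = used / capacity
    if running and occupancy >= suspend_at:
        return False              # results are backlogged: suspend
    if not running and occupancy <= restart_at:
        return True               # backlog cleared: restart
    return running                # otherwise keep the current state

running = True
running = control_unit_state(running, used=180, capacity=200)  # 90% -> suspend
after_suspend = running
running = control_unit_state(running, used=150, capacity=200)  # 75% -> stay suspended
running = control_unit_state(running, used=90, capacity=200)   # 45% -> restart
print(after_suspend, running)  # False True
```

The same function applies unchanged to the partition unit watching the second (or fourth) memory and to the classification unit watching the third (or fifth) memory.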
  • In addition to controlling suspension and starting, the scheduling processing unit may further control the start times of the image acquisition operation, of the partitioning operation of the partition processing unit, and of the classification operation of the classification processing unit.
  • FIG. 6 introduced the pipelined image processing flow taking the first image and the second image as an example; the pipelined flow is introduced below with the first image, the second image, and the third image, referring first to FIG. 8(a).
  • The image processing apparatus in FIG. 8(a) sequentially processes the third image, the first image, and the second image, where the scheduling processing unit starts the operation of acquiring the first image immediately after acquiring the third image, and starts the operation of acquiring the second image immediately after completing the operation of acquiring the first image.
  • In this flow, the scheduling processing unit is always in the state of acquiring an image. However, at time T1 (the time at which the scheduling processing unit starts the operation of acquiring the second image), the partition processing unit is still partitioning the third image; it has not started partitioning the first image, let alone the second image. Therefore, at time T1 the scheduling processing unit only needs to have finished acquiring the first image, so that the partition processing unit can start partitioning the first image at any time; it is not necessary to prepare the second image at this point. Only at time T2 (the time at which the partition processing unit starts partitioning the first image) does the scheduling processing unit need to acquire the second image, so that the partition processing unit can start partitioning the second image at any time after completing the operation of partitioning the first image.
  • Between times T1 and T2, the scheduling processing unit may therefore temporarily stop acquiring the second image, letting the scheduling processing unit rest to reduce its average power. Accordingly, in the embodiment of the present invention, the scheduling processing unit may control the time point at which it starts the operation of acquiring the second image to be processed to be no earlier than the time point at which the partition processing unit starts the partitioning operation on the first image. Similarly, the scheduling processing unit may further control the time point at which the partition processing unit starts the partitioning operation on the second image to be no earlier than the time point at which the classification processing unit starts the classification operation on the first image, as shown in FIG. 8(b).
  • It should be noted that the classification processing unit may classify images by using a Convolutional Neural Network (CNN) or other algorithms, which is not limited in the present invention.
  • Optionally, the image processing apparatus provided by the present invention may further include a sixth memory for storing the algorithms and parameters used by the partition processing unit for the partitioning operation and the algorithms and parameters used by the classification processing unit for the classification operation.
  • The sixth memory is shared by the partition processing unit and the classification processing unit.
  • The present invention introduces the first to sixth memories of the image processing apparatus and defines which processing units each memory needs to be shared by. It should be pointed out that, in addition to being shared by the defined processing units, these memories may also be shared by processing units not so defined.
  • For example, the second memory needs to be shared by the partition processing unit and the classification processing unit, but the second memory may also be shared by the scheduling processing unit.
  • The third memory needs to be shared by the classification processing unit and the scheduling processing unit, but the third memory may also be shared by the partition processing unit.
  • The fourth memory, the fifth memory, and the sixth memory may likewise be shared by one, two, or more of the partition processing unit, the classification processing unit, and the scheduling processing unit.
  • The first memory to the sixth memory introduced in the present invention are logically divided. Although each is illustrated as a separate and independent block in the drawings, in physical form any two or more blocks of memory may be integrated into one.
  • For example, the entire image processing apparatus may have only one memory shared by all processing units, with the scheduling processing unit dividing the address space of the shared memory into six segments that serve as the first to sixth memories respectively.
  • The image processing apparatus provided by the present invention may also include more memories, and these additional memories may be shared by multiple processing units or used exclusively by a single processing unit; this is not limited here.
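As a sketch of the single-shared-memory variant, the scheduling unit could carve one address space into six logical segments. The sizes below reuse the 500 MB / 200 MB / 100 MB figures from the application scenario later in the text; the fourth to sixth sizes are invented for illustration.

```python
def divide_address_space(total_mb, sizes_mb):
    """Assign each logical memory a contiguous [start, end) range (in MB)
    within one physically shared memory."""
    assert sum(sizes_mb.values()) <= total_mb, "segments must fit in shared memory"
    segments, base = {}, 0
    for name, size in sizes_mb.items():
        segments[name] = (base, base + size)
        base += size
    return segments

layout = divide_address_space(2048, {
    "first": 500,    # source images (size from the FIG. 9 scenario)
    "second": 200,   # partition results
    "third": 100,    # classification results
    "fourth": 64,    # partition intermediates (illustrative size)
    "fifth": 64,     # classification intermediates (illustrative size)
    "sixth": 32,     # algorithms and parameters (illustrative size)
})
print(layout["second"])  # (500, 700)
```

Resizing a logical memory, as in the tuning tasks above, then amounts to recomputing this layout rather than adding or removing physical memory.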
  • The following describes an application scenario of the present invention. FIG. 9 shows an image processing apparatus consisting of a CPU+GPU heterogeneous system, where the CPU is a 4-PLUS-1 Cortex-A15 model including four ARM A15 operation cores and one ARM A15 low-power management core, and the GPU is a Kepler-architecture GPU including 192 CUDA cores.
  • The heterogeneous system uses three ARM A15 operation cores for image partitioning (that is, as the partition processing unit), one ARM A15 operation core for task scheduling (that is, as the scheduling processing unit), and the 192 GPU cores for image classification (that is, as the classification processing unit).
  • The CPU and the cores in the GPU exchange data through a shared 2 GB Double Data Rate 2 (DDR2) memory and form a hardware pipeline.
  • The image partitioning algorithm uses the EdgeBox algorithm, and the image classification algorithm uses a CNN.
  • The scheduling processing unit divides a 500 MB space of the 2 GB DDR2 as the first memory, for storing the source images acquired by the scheduling processing unit; divides a 200 MB space of the 2 GB DDR2 as the second memory, for storing the partition results of the partition processing unit; and divides a 100 MB space of the 2 GB DDR2 as the third memory, for storing the classification results of the classification processing unit.
  • Assume that a user uses the heterogeneous system shown in FIG. 9 to detect and recognize the targets in image A, image B, and image C.
  • The scheduling processing unit of the heterogeneous system first acquires image A and writes image A into the first memory. At time T4, the partition processing unit starts the operation of partitioning image A: specifically, it reads image A from the first memory, partitions image A, and writes the partition result of image A into the second memory; meanwhile, the scheduling processing unit starts the operation of acquiring image B and writing image B into the first memory. At time T5, the classification processing unit starts the classification operation on image A:
  • specifically, it reads image A from the first memory and the partition result of image A from the second memory, and classifies image A according to the partition result of image A to obtain the classification result.
  • Meanwhile, the partition processing unit starts the operation of partitioning image B: it reads image B from the first memory, partitions image B, and writes the partition result of image B into the second memory, while the scheduling processing unit starts the operation of acquiring image C and writes image C into the first memory.
  • Because the classification processing unit has not completed the classification operation on image A, the partition result of image A in the second memory cannot be completely consumed.
  • Nevertheless, the partition processing unit still continuously partitions image B and outputs the partition result of image B into the second memory, which causes the partition results of image A and image B to back up in the second memory.
  • Assume that at time T6 the occupancy rate of the second memory reaches 100%; the scheduling processing unit then suspends the partition processing unit.
  • When the backlog in the second memory clears, the scheduling processing unit starts the partition processing unit again, and the partition processing unit completes the partitioning operation on image B and saves the partition result into the second memory.
  • Subsequently, the classification processing unit starts classifying image B and writes the classification result into the third memory, while the partition processing unit starts the operation of partitioning image C and writes the partition result of image C into the second memory.
  • Finally, the classification processing unit starts the classification operation on image C and writes the classification result into the third memory.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • The apparatus embodiments described above are merely illustrative.
  • The division into units is only a logical function division.
  • In actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium.
  • The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. A device can read the instructions in the storage medium and perform all or part of the steps of the methods according to the embodiments of the present invention based on the read instructions.


Abstract

An embodiment of the present invention discloses an image processing apparatus for improving image processing performance. The image processing apparatus provided by the embodiment of the present invention comprises a heterogeneous partition processing unit and classification processing unit, a first memory for storing images to be processed, and a second memory for storing the results of the image partitioning operations performed by the partition processing unit. Both the first memory and the second memory are shared by the partition processing unit and the classification processing unit. By saving shared data in shared memory, the embodiment of the present invention reduces data-transfer operations between different processing units during image processing and avoids the delay caused by data transfer between processing units, thereby speeding up image processing and improving image processing performance. An embodiment of the present invention also provides a related image processing method.

Description

Image processing method and image processing apparatus — Technical Field
The present invention relates to the field of image processing, and in particular to an image processing method and an image processing apparatus.
Background Art
In the field of image processing, detection and recognition of targets in an image is generally implemented by a two-step operation of partitioning and classification, as shown in FIG. 1: an image partition model accepts an input image and divides the input image into regions of different sizes; an image classification model uses a convolutional neural network or another classification algorithm to continuously extract, through a hierarchical structure, the features of each region of the image, finally recognizing the target object.
During image target detection and recognition, the partitioning and classification operations place different performance requirements on the processor, so current technology generally uses a heterogeneous platform for image recognition, for example a Central Processing Unit (CPU) + Graphics Processing Unit (GPU) heterogeneous platform. A GPU is an easily programmable, high-performance processor. Unlike a CPU, which is mainly used for data computation and instruction interpretation, a GPU is designed to perform complex mathematical and geometric computations and is mainly used for graphics and image processing. To make full use of the computing performance of the CPU and the image processing performance of the GPU, a CPU+GPU heterogeneous platform is used for image recognition: the CPU first performs image partitioning, and the GPU then performs image classification.
However, in a heterogeneous platform, the memories of different types of processors are independent of each other; for example, in a CPU+GPU heterogeneous platform, the CPU has independent CPU memory and the GPU has independent GPU memory (also called video memory). Therefore, when a heterogeneous platform is used for target detection and recognition, the heterogeneous processors (such as the CPU and the GPU) need to continuously exchange data, and the large number of data exchange operations causes long delays and affects the detection performance of the entire heterogeneous platform.
Summary of the Invention
The present invention provides an image processing method for improving image processing performance. The present invention also provides a related image processing apparatus.
A first aspect of the present invention provides an image processing method applicable to an image processing apparatus. The processing units of the image processing apparatus include a partition processing unit and a classification processing unit, and its memories include a first memory and a second memory. The partition processing unit may be a CPU, a DSP, a processing core, or another hardware circuit capable of performing image partitioning operations; the classification processing unit may be a GPU, an FPGA, or another hardware circuit capable of performing image classification operations. The partition processing unit and the classification processing unit are heterogeneous processing units and share the first memory and the second memory. A first image to be processed is saved in the first memory; the partition processing unit acquires the first image from the first memory, partitions the first image to obtain a first partition result, and then saves the first partition result in the second memory. The classification processing unit acquires the first partition result saved by the partition processing unit from the second memory and acquires the first image from the first memory, and then classifies the first image according to the first image and the first partition result to obtain a first classification result. Unlike current technology, in which the memories of heterogeneous processing units are independent of each other, in the present invention the first memory and the second memory are shared by the partition processing unit and the classification processing unit, so shared data such as the first image and the first partition result need not be carried between the partition processing unit and the classification processing unit. This avoids the delay caused by data transfer between processing units, speeds up image processing, and improves image processing performance.
Optionally, the processing units of the image processing apparatus may further include a scheduling processing unit, configured to acquire the first image to be processed from outside the image processing apparatus and save the acquired first image in the first memory. Because the scheduling processing unit needs to access the first memory, it shares the first memory with the partition processing unit and the classification processing unit.
Optionally, the processing units do not process multiple images in batches but instead process them serially in a pipelined manner. Specifically, after saving the acquired first image in the first memory, the scheduling processing unit performs the following operations: acquiring a second image to be processed and saving the second image in the first memory. After obtaining the first partition result, the partition processing unit performs the following steps: acquiring the second image from the first memory, partitioning the second image to obtain a second partition result, and saving the second partition result in the second memory. After obtaining the first classification result, the classification processing unit performs the following steps: acquiring the second image from the first memory, acquiring the second partition result from the second memory, and then classifying the second image according to the second image and the second partition result to obtain a second classification result. The pipelined approach reduces the memory capacity required by each processing unit and makes full use of every processing unit.
Optionally, the scheduling processing unit may control the starting, operation, and suspension of each processing unit. Specifically, the scheduling processing unit may control the time point at which the scheduling processing unit starts the operation of acquiring the second image to be processed to be no earlier than the time point at which the partition processing unit starts the partitioning operation on the first image, so as to reduce the average power of the scheduling processing unit. Similarly, the scheduling processing unit may also control the time point at which the partition processing unit starts the partitioning operation on the second image to be no earlier than the time point at which the classification processing unit starts the classification operation on the first image, so as to reduce the average power of the partition processing unit.
Optionally, when the occupancy rate of the second memory reaches a first preset occupancy rate or the second memory is full, the scheduling processing unit may suspend the partition processing unit, so as to prevent the partition results calculated by the partition processing unit from backing up in the second memory.
Optionally, the image processing apparatus may further include a third memory for saving the classification results calculated by the classification processing unit. After obtaining the first classification result, the classification processing unit saves the first classification result in the third memory. The scheduling processing unit is responsible for reading the first classification result from the third memory and outputting the first classification result to the outside of the image processing apparatus, for example writing the first classification result to a disk. The third memory is shared at least by the scheduling processing unit and the classification processing unit.
Optionally, when the occupancy rate of the third memory reaches a second preset occupancy rate or the third memory is full, the scheduling processing unit may suspend the classification processing unit, so as to prevent the classification results calculated by the classification processing unit from backing up in the third memory.
Optionally, the scheduling processing unit may also be responsible for adjusting the size or bandwidth of the memories in the image processing apparatus. Specifically, the scheduling processing unit may acquire a first duration in which the partition processing unit performs the partitioning operation on the first image and a second duration in which the classification processing unit performs the classification operation on the first image. If the first duration is greater than the second duration, the classification processing unit performs classification faster than the partition processing unit performs partitioning, and the scheduling processing unit may increase the size of the second memory, and/or increase the bandwidth of the second memory, and/or reduce the size of the third memory, and/or reduce the bandwidth of the third memory. Alternatively, if the first duration is less than the second duration, the partition processing unit performs partitioning faster than the classification processing unit performs image classification, and the scheduling processing unit may reduce the size of the second memory, and/or reduce the bandwidth of the second memory, and/or increase the size of the third memory, and/or increase the bandwidth of the third memory.
Optionally, the image processing apparatus may further include a fourth memory and/or a fifth memory, where the fourth memory is used to save intermediate results when the partition processing unit performs the partitioning operation, and the fifth memory is used to save intermediate results when the classification processing unit performs the classification operation. When the first duration is greater than the second duration, the scheduling processing unit may also increase the size of the fourth memory, and/or reduce the size of the fifth memory, and/or increase the bandwidth of the fourth memory, and/or reduce the bandwidth of the fifth memory. Alternatively, when the first duration is less than the second duration, the scheduling processing unit may reduce the size of the fourth memory, and/or increase the size of the fifth memory, and/or reduce the bandwidth of the fourth memory, and/or increase the bandwidth of the fifth memory.
Optionally, when the occupancy rate of the fourth memory reaches a third preset occupancy rate or the fourth memory is full, the scheduling processing unit may suspend the partition processing unit. And/or, when the occupancy rate of the fifth memory reaches a fourth preset occupancy rate or the fifth memory is full, the scheduling processing unit may suspend the classification processing unit.
Optionally, the image processing apparatus may further include a sixth memory for saving the algorithm parameters used by the partition processing unit for the partitioning operation and the algorithm parameters used by the classification processing unit for the classification operation; the sixth memory is shared at least by the partition processing unit and the classification processing unit.
A second aspect of the present invention provides an image processing apparatus. The processing units of the image processing apparatus include a partition processing unit and a classification processing unit, and its memories include a first memory and a second memory. The partition processing unit may be a CPU, a DSP, a processing core, or another hardware circuit capable of performing image partitioning operations; the classification processing unit may be a GPU, an FPGA, or another hardware circuit capable of performing image classification operations. The partition processing unit and the classification processing unit are heterogeneous processing units and share the first memory and the second memory. The partition processing unit is configured to: acquire a first image to be processed from the first memory, partition the first image to obtain a first partition result, and save the first partition result in the second memory. The classification processing unit is configured to: acquire the first partition result saved by the partition processing unit from the second memory, acquire the first image from the first memory, and then classify the first image according to the first image and the first partition result to obtain a first classification result. Because the first memory and the second memory are shared by the partition processing unit and the classification processing unit, shared data such as the first image and the first partition result need not be carried between the partition processing unit and the classification processing unit; this avoids the delay caused by data transfer between processing units, speeds up image processing, and improves image processing performance.
Optionally, the processing units of the image processing apparatus may further include a scheduling processing unit, configured to acquire the first image to be processed from outside the image processing apparatus and save the acquired first image in the first memory. Because the scheduling processing unit needs to access the first memory, it shares the first memory with the partition processing unit and the classification processing unit.
Optionally, the processing units do not process multiple images in batches but process them serially in a pipelined manner. Specifically, after saving the acquired first image in the first memory, the scheduling processing unit is further configured to start performing the following steps: acquiring a second image to be processed and saving the acquired second image in the first memory. After obtaining the first partition result, the partition processing unit is further configured to start performing the following steps: acquiring the second image from the first memory, partitioning the second image to obtain a second partition result, and saving the second partition result in the second memory. After obtaining the first classification result, the classification processing unit is further configured to start performing the following steps: acquiring the second image from the first memory, acquiring the second partition result from the second memory, and then classifying the second image according to the second image and the second partition result to obtain a second classification result. In pipeline mode, the start time of each processing unit may be controlled by the scheduling processing unit.
Optionally, the scheduling processing unit is further configured to control the starting, operation, and suspension of each processing unit. Specifically, the scheduling processing unit is configured to control the time point at which the scheduling processing unit starts the operation of acquiring the second image to be processed to be no earlier than the time point at which the partition processing unit starts the partitioning operation on the first image, so as to reduce the average power of the scheduling processing unit. And/or, the scheduling processing unit may be further configured to control the time point at which the partition processing unit starts the partitioning operation on the second image to be no earlier than the time point at which the classification processing unit starts the classification operation on the first image, so as to reduce the average power of the partition processing unit.
Optionally, the scheduling processing unit is further configured to suspend the partition processing unit when the occupancy rate of the second memory reaches a first preset occupancy rate or the second memory is full, so as to prevent the partition results calculated by the partition processing unit from backing up in the second memory.
Optionally, the image processing apparatus may further include a third memory for saving the classification results calculated by the classification processing unit. After obtaining the first classification result, the classification processing unit is further configured to save the first classification result in the third memory. The scheduling processing unit is further configured to read the first classification result from the third memory and output the first classification result to the outside of the image processing apparatus, for example writing the first classification result to a disk. The third memory is shared at least by the scheduling processing unit and the classification processing unit.
Optionally, the scheduling processing unit is further configured to suspend the classification processing unit when the occupancy rate of the third memory reaches a second preset occupancy rate or the third memory is full, so as to prevent the classification results calculated by the classification processing unit from backing up in the third memory.
Optionally, the scheduling processing unit is further configured to adjust the size or bandwidth of the memories in the image processing apparatus. Specifically, the scheduling processing unit is further configured to acquire a first duration in which the partition processing unit performs the partitioning operation on the first image and a second duration in which the classification processing unit performs the classification operation on the first image. If the first duration is greater than the second duration, the classification processing unit performs classification faster than the partition processing unit performs partitioning, and the scheduling processing unit increases the size of the second memory, and/or increases the bandwidth of the second memory, and/or reduces the size of the third memory, and/or reduces the bandwidth of the third memory. Alternatively, if the first duration is less than the second duration, the partition processing unit performs partitioning faster than the classification processing unit performs image classification, and the scheduling processing unit reduces the size of the second memory, and/or reduces the bandwidth of the second memory, and/or increases the size of the third memory, and/or increases the bandwidth of the third memory.
Optionally, the image processing apparatus may further include a fourth memory and/or a fifth memory, where the fourth memory is used to save intermediate results when the partition processing unit performs the partitioning operation, and the fifth memory is used to save intermediate results when the classification processing unit performs the classification operation. When the first duration is greater than the second duration, the scheduling processing unit is further configured to increase the size of the fourth memory, and/or reduce the size of the fifth memory, and/or increase the bandwidth of the fourth memory, and/or reduce the bandwidth of the fifth memory. Alternatively, when the first duration is less than the second duration, the scheduling processing unit is further configured to reduce the size of the fourth memory, and/or increase the size of the fifth memory, and/or reduce the bandwidth of the fourth memory, and/or increase the bandwidth of the fifth memory.
Optionally, the scheduling processing unit is further configured to suspend the partition processing unit when the occupancy rate of the fourth memory reaches a third preset occupancy rate or the fourth memory is full. And/or, the scheduling processing unit is further configured to suspend the classification processing unit when the occupancy rate of the fifth memory reaches a fourth preset occupancy rate or the fifth memory is full.
Optionally, the image processing apparatus may further include a sixth memory for saving the algorithm parameters used by the partition processing unit for the partitioning operation and the algorithm parameters used by the classification processing unit for the classification operation; the sixth memory is shared at least by the partition processing unit and the classification processing unit.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an image target detection flow;
FIG. 2 is a structural diagram of a CPU+GPU heterogeneous platform in current technology;
FIG. 3 is a structural diagram of an image processing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of an image processing method according to an embodiment of the present invention;
FIG. 5 is another structural diagram of an image processing apparatus according to an embodiment of the present invention;
FIG. 6 is another schematic flowchart of an image processing method according to an embodiment of the present invention;
FIG. 7 is another structural diagram of an image processing apparatus according to an embodiment of the present invention;
FIG. 8(a) is another schematic flowchart of an image processing method according to an embodiment of the present invention;
FIG. 8(b) is another schematic flowchart of an image processing method according to an embodiment of the present invention;
FIG. 9 is a structural diagram of an image processing apparatus in an application scenario of the present invention;
FIG. 10 is a schematic flowchart of an image processing method in an application scenario of the present invention.
具体实施方式
本发明实施例提供了一种图像处理方法,用于提升图像处理的性能。本发明实施例还提供了相关的图像处理装置,以下将分别进行描述。
在图像处理领域,图像中目标的检测识别一般由分区和分类两步操作来实现,如图1所示:图像分区模型接受输入的图像,并通过分区操作把输入的图像划分成大小不同的区域;图像分类模型采用卷积神经网络或其它分类算法进行分类操作,最终识别出目标物体。现阶段的技术中一般采用异构平台进行图像识别。
中央处理器(CPU,Central Processing Unit)是系统的运算和控制核心,它的功能主要是解释系统指令以及处理系统软件中的数据。图形处理器(GPU,Graphics Processing Unit)是一种易编程、高性能的处理器,能够执行复杂的数学和几何计算,一般用于系统的图形图像处理。为了充分发挥CPU与GPU各自的优点,现阶段的技术中一般采用CPU+GPU异构平台来实现图像中目标的检测识别。其中,异构平台指的是集成了两种或多种类型的处理器的平台,为便于说明,本发明实施例中仅采用CPU+GPU作为异构平台进行图像识别的示例进行说明。
现阶段技术所采用的CPU+GPU异构平台的基本结构请参阅图2。在目标检测识别过程中,CPU先对待检测的源图像进行分区,并将分区结果写入CPU内存中。由于异构平台中不同类型的处理器之间内存不共享,因此CPU需要将分区结果搬运至GPU内存(也称为显存),GPU再结合源图像和该分区结果对源图像进行分类,得到分类结果(即图像中目标的检测识别结果)。在图像处理过程中,CPU会不断的计算得到分区结果并写入CPU内存中,因此CPU需要源源不断的将CPU内存中的数据搬运到GPU内存中。CPU内存与GPU内存之间大量的数据搬运会造成较长的时延,拖慢图像处理速度,进而影响整个异构平台的目标检测性能。
为了解决现阶段的技术中CPU+GPU异构平台的目标检测性能不足的问题,本发明提供了一种图像处理装置,并在该图像处理装置的基础上提供了相应的图像处理方法。下面将首先结合图3来介绍本发明提供的图像处理装置的基本结构,具体包括:
分区处理单元301,主要用于进行图像分区操作。该分区处理单元301具体可以由一个或多个处理器来担任,处理器可以为CPU、数字信号处理器(DSP,Digital Signal Processor)或其他类型处理器中的一种或多种。分区处理单元301也可以由处理器中的一个或多个核来担任,或由其它能够实现图像分区操作的硬件电路来担任,本发明实施例中不做限定。
分类处理单元302,主要用于进行图像分类操作。该分类处理单元302具体可以由一个或多个处理器来担任,处理器可以为GPU、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他类型处理器中的一种或多种。分类处理单元302也可以由处理器中的一个或多个核来担任,或由其它能够实现图像分类操作的硬件电路来担任,本发明实施例中不做限定。
其中,分区处理单元301和分类处理单元302是异构处理单元,即分区处理单元301和分类处理单元302为不同类型的处理单元。例如,若分区处理单元301由CPU来担任,则分类处理单元302不能由CPU来担任,但可以采用GPU、FPGA或其它类型的处理单元来担任。
第一内存303,用于存放待处理的图像,即源图像。该第一内存303与分区处理单元301和分类处理单元302均连接,并由分区处理单元301与分类处理单元302共享,即分区处理单元301与分类处理单元302都可以直接访问第一内存303中的数据。
第二内存304,用于存放分区处理单元301进行图像分区操作的结果。该第二内存304与分区处理单元301和分类处理单元302均连接,并由分区处理单元301与分类处理单元302共享,即分区处理单元301与分类处理单元302都可以直接访问第二内存304中的数据。
在图3所示的图像处理装置的基础上,本发明提供了相应的图像处理方法,下面将以第一图像的处理过程为例解释该方法的基本流程,请参阅图4:
401、第一内存中保存有待处理的第一图像,分区处理单元从第一内存中获取该第一图像;
402、分区处理单元对第一图像进行分区,得到第一分区结果;
403、分区处理单元将第一分区结果保存在第二内存中;
404、分类处理单元从第二内存中获取分区处理单元保存的第一分区结果。可选的,在分类处理单元获取了第一分区结果之后,第一分区结果可以从第二内存中删除;
405、分类处理单元从第一内存中获取第一图像。可选的,分类处理单元从第一内存中获取了第一图像之后,第一图像可以从第一内存中删除。其中,步骤405与步骤401至404没有特定的先后顺序,步骤405也可以位于步骤401至404中任一步骤之前;
406、分类处理单元根据第一图像与第一分区结果,对第一图像进行分类,得到第一分类结果。
从图4所示的流程可以看出,第一图像需要被分区处理单元和分类处理单元所共用,且分区处理单元计算得到的第一分区结果需要被分类处理单元所使用。与现阶段的技术中异构的处理单元之间的内存相互独立不同,本发明实施例中,第一图像被保存在第一内存中,而第一内存被分区处理单元和分类处理单元所共享,因此分区处理单元无需从分区处理单元的内存中搬运该第一图像到分类处理单元的内存中。同理的,第一分区结果被保存在第二内存中,而第二内存也被分区处理单元和分类处理单元所共享,因此分区处理单元无需从分区处理单元的内存中搬运该第一分区结果到分类处理单元的内存中。综上所述,本实施例提供的方法通过将共用的数据保存在共享的内存中,减少了图像处理过程中不同处理单元之间的数据搬运操作,进而避免了处理单元之间数据搬运所造成的时延,加快了图像处理的速度,提高了图像处理的性能。
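上述"通过共享内存避免处理单元间数据搬运"的数据流,可以用如下Python草图加以示意(纯属示意性假设,并非本发明的实际实现:用线程模拟异构处理单元,用进程内共享对象模拟第一内存与第二内存,其中的分区规则与分类规则均为假设):

```python
import threading
import queue

# 用进程内共享对象模拟共享内存:第一内存保存源图像,第二内存保存分区结果
first_memory = {}              # 第一内存:image_id -> 源图像(此处用数值列表示意)
second_memory = queue.Queue()  # 第二内存:分区结果队列,被两个处理单元共享

def partition_unit(image_ids):
    # 分区处理单元:直接读取第一内存,分区结果直接写入第二内存,全程无跨内存搬运
    for iid in image_ids:
        image = first_memory[iid]
        regions = [image[i:i + 2] for i in range(0, len(image), 2)]  # 假设性分区规则
        second_memory.put((iid, regions))

def classify_unit(count, results):
    # 分类处理单元:直接从第一内存取源图像、从第二内存取分区结果
    for _ in range(count):
        iid, regions = second_memory.get()
        _image = first_memory[iid]  # 分类时同样直接访问第一内存中的源图像
        results[iid] = ["target" if sum(r) > 5 else "background" for r in regions]  # 假设性分类规则

first_memory["img1"] = [1, 2, 3, 4]
results = {}
p = threading.Thread(target=partition_unit, args=(["img1"],))
c = threading.Thread(target=classify_unit, args=(1, results))
p.start(); c.start(); p.join(); c.join()
```

两个线程全程只操作同一份 first_memory 与 second_memory,对应本实施例中"共用的数据保存在共享的内存中、无需在处理单元之间搬运"的思路。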
步骤401中提到,第一内存中保存有待处理的第一图像。其中,该第一图像可以由分区处理单元来获取并保存在第一内存中。但是可替换的,本发明实施例还引入了调度处理单元,用于在步骤401之前获取该待处理的第一图像,并将获取的该第一图像保存在第一内存中。可以理解的,调度处理单元应和分区处理单元与分类处理单元一同共享第一内存。
上一段的论述中引入了调度处理单元,并采用调度处理单元从图像处理装置之外获取第一图像到第一内存中。可以理解的,在步骤406之后,图像处理装置需要将第一分类结果输出到该图像处理装置外部(如将第一分类结果写入磁盘)。可替换的,该操作仍可以由调度处理单元来执行,具体的:本发明还可以引入第三内存,步骤406之后,分类处理单元将第一分类结果保存在第三内存中。调度处理单元将第三内存中的第一分类结果从图像处理装置中输出。
结合上两段的论述,本发明实施例提供的又一种图像处理装置请参阅图5。其中,分区处理单元501、分类处理单元502、第一内存503、第二内存504与图3所示的分区处理单元301、分类处理单元302、第一内存303、第二内存304基本相同,此处不做赘述。调度处理单元505与第三内存506均为可选模块,具体执行的操作或功能可参照上两段中的论述,此处不做赘述。
现阶段的技术中,图像处理装置一般会批量处理多幅图像。例如,调度处理单元批量获取50幅图像。待该50幅图像获取完成后,分区处理单元对这50幅图像进行批量分区,并将结果写入分区处理单元的内存中。待该50幅图像分区完成后,分类处理单元再对这50幅图像进行分类。可以看出,这种批量处理的方式要求每个处理单元有较大的内存容量,且在某个处理单元工作时,其它处理单元处于闲置状态(例如在分区处理单元对该50幅图像进行批量分区的过程中,分类处理单元一直处于闲置状态)。可替换的,本发明中图像处理装置可以采用流水线的方式来对图像进行处理,以降低对内存容量的要求,并充分利用每个处理单元。下面仅以第一图像和第二图像为例对该流水线方式进行说明:
请参阅图6,图6中横轴方向表示时间,纵轴方向表示处理的图像。从图6中可以看出,调度处理单元获取第一图像并将第一图像保存到第一内存中后,启动获取第二图像并将第二图像保存到第一内存中的操作。其中,调度处理单元可以在完成了获取第一图像的操作之后立刻启动获取第二图像的操作,也可以在完成了获取第一图像的操作之后的某个时刻再启动获取第二图像的操作,此处不做限定。同理的,分区处理单元在对第一图像进行分区,得到第一分区结果并将第一分区结果保存在第二内存中后,可以由调度处理单元控制或由分区处理单元自发控制启动如下操作:对第二图像进行分区,得到第二分区结果,并将第二分区结果保存在第二内存中。分类处理单元在根据第一图像与第一分区结果,对第一图像进行分类,得到第一分类结果后,可以由调度处理单元控制或由分类处理单元自发控制启动如下操作:从第一内存中获取第二图像,从第二内存中获取第二分区结果,并根据第二图像与第二分区结果,对第二图像进行分类得到第二分类结果。若图像处理装置中包括调度处理单元和第三内存,则分类处理单元还将第二分类结果保存在第三内存中,并由调度处理单元将第二分类结果从第三内存中输出。
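图6所示的各处理单元启动时刻的流水线约束,可以用如下Python草图推算(仅为示意性假设,并非本发明的实际实现;函数名与时长参数均为假设):

```python
def pipeline_schedule(n_images, t_fetch, t_part, t_cls):
    """推算流水线中各阶段的启动时刻(均为示意):
    fetch[i]/part[i]/cls[i] 分别为第 i 幅图像的获取、分区、分类启动时刻。"""
    fetch = [0.0] * n_images
    part = [0.0] * n_images
    cls = [0.0] * n_images
    for i in range(n_images):
        if i > 0:
            # 调度处理单元获取第 i 幅的时刻,不早于第 i-1 幅启动分区的时刻
            fetch[i] = max(fetch[i - 1] + t_fetch, part[i - 1])
        # 分区须待本幅获取完成;若 i>0,还须待上一幅分区完成,
        # 且不早于上一幅启动分类的时刻
        part[i] = fetch[i] + t_fetch
        if i > 0:
            part[i] = max(part[i], part[i - 1] + t_part, cls[i - 1])
        # 分类须待本幅分区完成;若 i>0,还须待上一幅分类完成
        cls[i] = part[i] + t_part
        if i > 0:
            cls[i] = max(cls[i], cls[i - 1] + t_cls)
    return fetch, part, cls
```

例如 pipeline_schedule(3, 1, 2, 2) 给出三幅图像重叠执行的时间线:稳定阶段三个处理单元同时工作,而非批量方式下的轮流闲置。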
需要指出的是,本发明实施例中的调度处理单元只是一种从功能上抽象出的处理单元,在实际产品中,调度处理单元可以是独立的处理单元,也可以与分区处理单元或分类处理单元是同一个处理单元(即实际产品也可以不包括实体的调度处理单元,本发明实施例中调度处理单元所执行的操作可以交由分区处理单元和/或分类处理单元来执行)。例如,可以采用一个DSP+单核CPU+双核GPU来搭建本发明实施例提供的图像处理装置,其中CPU担任调度处理单元的角色,DSP担任分区处理单元的角色,双核GPU担任分类处理单元的角色。又例如,可以采用一个4核CPU与双核GPU来搭建本发明实施例提供的图像处理装置,其中双核GPU担任分类处理单元的角色,4核CPU的前3个核担任分区处理单元的角色,第4个核负责担任分区处理单元的角色的同时,还负责执行调度处理单元所需要执行的操作。
上一段中提到,调度处理单元是一种从功能上抽象出的处理单元。可选的,本发明还为调度处理单元引入了更多的调度任务,包括内存调整任务和/或流程控制任务,具体的:
一、内存调整任务。调度处理单元可以根据图像处理装置的运行情况,调整图像处理装置中内存的大小、带宽等参数。
例如,一般情况下,一幅图像的分区计算量要小于分类计算量,因此图像分区操作往往要快于图像分类操作。但是若分区处理单元进行分区操作过快,会导致分类处理单元不能及时消化掉分区处理单元的分区结果,造成分区结果在第二内存中积压,使得分区处理单元的快速计算性能被浪费。可以理解的,对于同一幅图像,若分区处理单元进行分区的时长与分类处理单元进行分类的时长等长,则说明分区处理单元与分类处理单元处理图像的速率恰好相同,分类处理单元能够恰好及时的消化掉分区处理单元的分区结果,不会造成分区结果的积压,也不会造成分区处理单元或分类处理单元的性能的浪费。因此,调度处理单元可以获取分区处理单元执行对第一图像进行分区操作的第一时长,并获取分类处理单元执行对第一图像进行分类操作的第二时长。若第一时长大于第二时长,即分区处理单元对第一图像进行分区的时长长于分类处理单元对第一图像进行分类的时长,则说明分类处理单元进行分类操作的速率要快于分区处理单元进行分区操作的速率,调度处理单元可以增大第二内存的大小,这样第二内存中能保存更多的分区结果,分类处理单元不需要花费较长的时间等待分区处理单元的分区结果。且由于图像处理装置的总内存一定,因此增大第二内存可以使得第三内存变小,这样当分类处理单元的分类结果在第三内存中没有空间存放时,分类处理单元会被迫挂起,这样就达到了限制分类处理单元进行图像分类的速率的目的。可替换的,调度处理单元也可以直接减小第三内存的大小来限制分类处理单元进行图像分类的速率。和/或,调度处理单元还可以增大第二内存的带宽,使得分区处理单元的分区结果能够快速搬运至第二内存,有利于提升分区处理单元的图像分区速率。且由于图像处理装置的总带宽一定,因此增大第二内存的带宽可以使得第三内存的带宽变小,这样就降低了分类处理单元将分类结果保存到第三内存的速率,限制了分类处理单元进行图像分类的速率。可替换的,调度处理单元还可以直接减小第三内存的带宽来限制分类处理单元进行图像分类的速率。基于类似的理由,若第一时长小于第二时长,即分区处理单元对第一图像进行分区的时长短于分类处理单元对第一图像进行分类的时长,则说明分区处理单元进行分区操作的速率要快于分类处理单元进行图像分类的速率,调度处理单元可以减小第二内存的大小,和/或减小第二内存的带宽,和/或增大第三内存的大小,和/或增大第三内存的带宽,以限制分区处理单元进行分区操作的速率。
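以第二内存、第三内存为例,上述根据第一时长与第二时长调整内存大小与带宽的策略可以概括为如下Python草图(仅为示意性假设,并非本发明的实际实现;调整步长与字段名均为假设值):

```python
def adjust_memory(t_partition, t_classify, mem, step=0.1):
    """根据分区时长与分类时长的相对大小,调整第二/第三内存的大小与带宽。
    mem 的键名(m2_size 等)与 step 步长均为示意用的假设值。"""
    if t_partition > t_classify:
        # 分类快于分区:增大第二内存及其带宽,减小第三内存及其带宽
        mem["m2_size"] *= 1 + step
        mem["m2_bw"] *= 1 + step
        mem["m3_size"] *= 1 - step
        mem["m3_bw"] *= 1 - step
    elif t_partition < t_classify:
        # 分区快于分类:反向调整,以限制分区处理单元进行分区操作的速率
        mem["m2_size"] *= 1 - step
        mem["m2_bw"] *= 1 - step
        mem["m3_size"] *= 1 + step
        mem["m3_bw"] *= 1 + step
    return mem
```

两个分支正好对应正文中"第一时长大于第二时长"与"第一时长小于第二时长"两种情形下对第二、第三内存的相反调整方向。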
又例如,图像处理装置还可以包括用于保存分区处理单元进行分区操作时的中间结果的第四内存,和/或用于保存分类处理单元进行分类操作时的中间结果的第五内存,如图7所示。若第一时长大于第二时长,则说明分区处理单元的速率要慢于分类处理单元的速率,调度处理单元可以通过增大第四内存的大小、和/或增大第四内存的带宽、和/或减小第五内存的大小、和/或减小第五内存的带宽,来提升分区处理单元进行分区操作的速率和/或降低分类处理单元进行分类操作的速率。同理,若第一时长小于第二时长,则调度处理单元可以减小第四内存的大小,和/或减小第四内存的带宽、和/或增大第五内存的大小、和/或增大第五内存的带宽。
二、流程控制任务。分区处理单元和分类处理单元的启动、工作、挂起均可以由调度处理单元来控制。
例如,若第二内存的占用率达到第一预置占用率,则说明分区处理单元的分区结果在第二内存中有积压,调度处理单元可以挂起分区处理单元。调度处理单元还可以待第二内存中的分区结果不再积压时(具体的,在分区处理单元挂起第一预置时长后,或第二内存的占用率降低到预设的阈值以下后,或在其它设定的条件下,就可以认为第二内存中的分区结果不再积压),再启动分区处理单元。同理的,若第三内存的占用率达到第二预置占用率,则说明分类处理单元的分类结果在第三内存中有积压,调度处理单元可以挂起分类处理单元。调度处理单元还可以待第三内存中的分类结果不再积压时,再启动分类处理单元。同理的,若图7中的第四内存的占用率达到第三预置占用率,则调度处理单元可以挂起分区处理单元。若图7中的第五内存的占用率达到第四预置占用率,则调度处理单元可以挂起分类处理单元。其中,第一预置占用率至第四预置占用率均为预置的数值,例如可以为80%或90%,也可以为100%(即内存已满)。第一预置占用率至第四预置占用率中,任意两个或更多个预置占用率的数值可以相同,第一预置占用率至第四预置占用率的数值也可以各不相同,此处不做限定。
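上述基于占用率的挂起与启动控制,可以用如下Python草图示意(仅为示意性假设,并非本发明的实际实现;类名与阈值数值均为假设):

```python
class SchedulerSketch:
    """调度处理单元对分区处理单元的挂起/启动控制草图(阈值为假设值)。"""

    def __init__(self, capacity, suspend_ratio=0.9, resume_ratio=0.5):
        self.capacity = capacity            # 第二内存容量
        self.suspend_ratio = suspend_ratio  # 第一预置占用率(示意值)
        self.resume_ratio = resume_ratio    # 认为"不再积压"的恢复阈值(示意值)
        self.partition_suspended = False

    def on_memory_usage(self, used):
        # 占用率达到预置占用率时挂起分区处理单元;回落到恢复阈值以下再启动
        ratio = used / self.capacity
        if not self.partition_suspended and ratio >= self.suspend_ratio:
            self.partition_suspended = True
        elif self.partition_suspended and ratio <= self.resume_ratio:
            self.partition_suspended = False
        return self.partition_suspended
```

挂起与恢复使用两个不同的阈值,对应正文中"达到预置占用率时挂起、待分区结果不再积压时再启动"的迟滞式控制。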
又例如,若第二内存的剩余空间小于第二预置大小,则说明分区处理单元的分区结果在第二内存中有积压,导致第二内存中的剩余空间不足,调度处理单元可以挂起分区处理单元。调度处理单元还可以待第二内存中的分区结果不再积压时(具体的,在分区处理单元挂起第一预置时长后,或第二内存的占用率降低到预设的阈值以下后,或在第二内存的剩余空间大于预设的大小后,或在其它设定的条件下,就可以认为第二内存中的分区结果不再积压),再启动分区处理单元。同理的,若第三内存的剩余空间小于第三预置大小,则说明分类处理单元的分类结果在第三内存中有积压,调度处理单元可以挂起分类处理单元。调度处理单元还可以待第三内存中的分类结果不再积压时,再启动分类处理单元。同理的,若图7中的第四内存的剩余空间小于第四预置大小,则调度处理单元可以挂起分区处理单元。若图7中的第五内存的剩余空间小于第五预置大小,则调度处理单元可以挂起分类处理单元。其中,第二预置大小至第五预置大小均为预置的数值,可以为正值,也可以为0(即内存已满)。第二预置大小至第五预置大小中,任意两个或更多个预置大小的数值可以相同,第二预置大小至第五预置大小的数值也可以各不相同,此处不做限定。
又例如,调度处理单元还可以控制获取图像操作、分区处理单元进行分区操作以及分类处理单元进行分类操作的起始时间。具体的,图6以第一图像和第二图像为例来介绍流水线方式的图像处理流程,下面以第一图像、第二图像和第三图像来介绍流水线方式的图像处理流程。首先请参阅图8(a)。图8(a)中图像处理装置依次对第三图像、第一图像、第二图像进行处理,其中,调度处理单元在获取了第三图像之后,立刻启动获取第一图像的操作,并在完成了获取第一图像的操作后,立刻启动获取第二图像的操作。在整个图像处理流程中,调度处理单元一直处于获取图像的操作状态。但是,在时刻T1处(调度处理单元启动获取第二图像的操作的时刻点),分区处理单元还在对第三图像进行分区,并没有开始对第一图像开始分区,更没有开始对第二图像进行分区。因此时刻T1时调度处理单元只需要完成获取第一图像,就能够使得分区处理单元随时启动对第一图像进行分区,并不需要在此时就准备好第二图像。只有在时刻T2处(分区处理单元开始对第一图像进行分区的时刻点),调度处理单元才需要获取第二图像,以便分区处理单元在完成对第一图像进行分区的操作后随时启动对第二图像进行分区。在T1~T2时间段内,调度处理单元可以暂时停止获取第二图像,使得调度处理单元得到休息,以降低调度处理单元的平均功率。因此,本发明实施例中,调度处理单元可以控制调度处理单元启动获取待处理的第二图像操作的时刻点,不早于分区处理单元启动对第一图像进行分区操作的时刻点。同理的,调度处理单元还可以控制分区处理单元启动对第二图像进行分区操作的时刻点,不早于分类处理单元启动对第一图像进行分类操作的时刻点,如图8(b)所示。
本发明中,分区处理单元进行图像分区的算法有很多,如Edge Box、BING等,本发明中不做限定。分类处理单元可以采用卷积神经网络(CNN,Convolutional Neural Network)或其它算法对图像进行分类,本发明中不做限定。可选的,本发明提供的图像处理装置还可以包括第六内存,用于存放分区处理单元进行分区操作的算法和参数,以及分类处理单元进行分类操作的算法和参数。该第六内存由分区处理单元与分类处理单元所共享。
本发明介绍了图像处理装置的第一内存至第六内存,并限定了这些内存需要能够被哪些处理单元所共享。需要指出的是,这些内存除了可以被限定的处理单元所共享,也可以被非限定的处理单元所共享。例如,本发明中限定了第二内存需要被分区处理单元和分类处理单元所共享,但是第二内存同时也可以被调度处理单元所共享;本发明中限定了第三内存需要被分类处理单元和调度处理单元所共享,但是第三内存同时也可以被分区处理单元所共享;同理的,第四内存、第五内存、第六内存也可以被分区处理单元、分类处理单元、调度处理单元中的一个、两个或多个处理单元所共享。
本发明中介绍的第一内存至第六内存均为逻辑上的划分,虽然在附图中各自以分开独立的形式示意,但其实际形态中,任意的两块或更多块的内存均可以集成为一体。例如,整个图像处理装置可以只有一块由各处理单元所共享的内存,调度处理单元将该共享的内存的地址段划分为6块,分别担任第一内存至第六内存的角色。
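将一整块共享内存的地址段划分为第一内存至第六内存的做法,可以用如下Python草图示意(仅为示意性假设,并非本发明的实际实现;函数名与份额比例均为假设):

```python
def split_address_space(total_bytes, shares):
    """按份额把一块共享内存划分成若干连续地址段,返回 {名称: (起, 止)}。
    shares 中的比例仅为示意用的假设值。"""
    total_share = sum(shares.values())
    segments, offset = {}, 0
    for name, share in shares.items():
        size = total_bytes * share // total_share
        segments[name] = (offset, offset + size)  # 半开区间 [起始地址, 结束地址)
        offset += size
    return segments

# 示意:把一块 2GB 的共享内存划成六段,分别担任第一内存至第六内存的角色
layout = split_address_space(2 * 1024 ** 3, {
    "mem1": 5, "mem2": 2, "mem3": 1, "mem4": 1, "mem5": 1, "mem6": 1,
})
```

各段地址连续且互不重叠,对应正文中"调度处理单元将共享内存的地址段划分为6块"的逻辑划分方式。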
除了本发明中介绍的第一内存至第六内存,本发明所提供的图像处理装置还可以包括更多的内存,所述更多的内存可以由多个处理单元所共享,也可以由某个处理单元所专用,此处不做限定。
为了便于理解上述实施例,下面将以上述实施例的一个具体应用场景为例进行描述。
图9中所示的是一个CPU+GPU异构系统构成的图像处理装置,其中的CPU为4-PLUS-1 Cortex A15型号,包括4个ARM A15运算核和1个ARM A15低功耗管理核。其中的GPU为Kepler GPU型号,包括192个GPU CUDA核。该异构系统采用3个ARM A15运算核来进行图像分区(即担任分区处理单元),1个ARM A15运算核来进行任务调度(即担任调度处理单元),192个GPU核进行图像分类(即担任分类处理单元)。CPU与GPU中的核通过2GB Double Data Rate 2(DDR2)的共享内存进行数据交互并构成硬件流水线。图像分区算法采用边缘框(EdgeBox)算法,图像分类算法采用CNN。
调度处理单元在2GB DDR2中划分出500MB的空间作为第一内存,用于保存调度处理单元获取的源图像,并在2GB DDR2中划分出200MB的空间作为第二内存,用于存放分区处理单元的分区结果,还在2GB DDR2中划分出100MB的空间作为第三内存,用于存放分类处理单元的分类结果。
用户采用图9所示的异构系统来检测识别图像A、图像B与图像C中的目标,具体流程请参阅图10。在T3时刻,该异构系统的调度处理单元先获取图像A并将图像A写入第一内存中;在T4时刻,分区处理单元启动对图像A进行分区的操作,具体的,从第一内存中读取图像A,对图像A进行分区并将图像A的分区结果写入第二内存,同时调度处理单元启动获取图像B的操作并将图像B写入第一内存;在T5时刻,分类处理单元启动对图像A的分类操作,具体的,从第一内存中读取图像A以及从第二内存中读取图像A的分区结果,根据图像A的分区结果对图像A进行分类并将分类结果写入第三内存,同时分区处理单元启动对图像B进行分区的操作,从第一内存中读取图像B,对图像B进行分区并将图像B的分区结果写入第二内存,且同时调度处理单元启动获取图像C的操作并将图像C写入第一内存。
其中,在T5~T6时间段内,分类处理单元并没有完成对图像A的分类操作,因此不能完全消耗第二内存中图像A的分区结果。但在T5~T6时间段内,分区处理单元仍在不停的对图像B进行分区,并向第二内存中输出图像B的分区结果,这就导致了图像A和图像B的分区结果在第二内存中积压。假设在T6时刻时,第二内存的占用率达到100%,于是调度处理单元挂起分区处理单元。
T7时刻,第二内存的剩余空间大于50MB,于是调度处理单元再次启动分区处理单元,分区处理单元完成对图像B的分区操作,并将分区结果保存到第二内存中。T8时刻,分类处理单元启动对图像B进行分类,并将分类结果写入第三内存,同时分区处理单元启动对图像C进行分区的操作,并将图像C的分区结果写入第二内存。T9时刻,分类处理单元启动对图像C的分类操作,并将分类结果写入第三内存中。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,该存储介质包括:U盘、移动硬盘、内存(可以包括只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)等)、磁碟或者光盘等各种可以存储程序代码的介质。本发明中的调度处理单元、分区处理单元和分类处理单元均可以读取该存储介质中的指令,并根据读取的指令执行本发明各个实施例所述方法的全部或部分步骤。
以上所述,以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。

Claims (20)

  1. 一种图像处理方法,适用于图像处理装置,其特征在于,所述图像处理装置包括分区处理单元和分类处理单元,所述图像处理装置还包括第一内存与第二内存,其中,所述分区处理单元和所述分类处理单元是异构处理单元,所述分区处理单元与所述分类处理单元共享所述第一内存与所述第二内存,所述方法包括:
    所述分区处理单元从所述第一内存中获取待处理的第一图像,对所述第一图像进行分区,得到第一分区结果,并将所述第一分区结果保存在所述第二内存中;
    所述分类处理单元从所述第一内存中获取所述第一图像,从所述第二内存中获取所述第一分区结果,然后根据所述第一图像与所述第一分区结果,对所述第一图像进行分类,得到第一分类结果。
  2. 根据权利要求1所述的图像处理方法,其特征在于,所述图像处理装置还包括调度处理单元,所述调度处理单元与所述分区处理单元、所述分类处理单元共享所述第一内存,所述方法在所述分区处理单元从所述第一内存中获取待处理的第一图像之前还包括:
    所述调度处理单元获取待处理的所述第一图像,并将获取的所述第一图像保存在所述第一内存中。
  3. 根据权利要求2所述的图像处理方法,其特征在于,所述调度处理单元在将获取的所述第一图像保存在所述第一内存中之后,启动执行如下步骤:获取待处理的第二图像,并将获取的所述第二图像保存在所述第一内存中;
    所述分区处理单元在得到所述第一分区结果后,启动执行如下步骤:从所述第一内存中获取所述第二图像,对所述第二图像进行分区,得到第二分区结果,并将所述第二分区结果保存在所述第二内存中;
    所述分类处理单元在得到所述第一分类结果之后,启动执行如下步骤:从所述第一内存中获取所述第二图像,从所述第二内存中获取所述第二分区结果,然后根据所述第二图像与所述第二分区结果,对所述第二图像进行分类,得到第二分类结果。
  4. 根据权利要求3所述的图像处理方法,其特征在于,所述调度处理单元启动获取待处理的所述第二图像操作的时刻点,不早于所述分区处理单元启动对所述第一图像进行分区操作的时刻点;
    和/或,所述分区处理单元启动对所述第二图像进行分区操作的时刻点,不早于所述分类处理单元启动对所述第一图像进行分类操作的时刻点。
  5. 根据权利要求2至4中任一项所述的方法,其特征在于,所述方法还包括:
    若所述第二内存的占用率达到第一预置占用率,则所述调度处理单元挂起所述分区处理单元。
  6. 根据权利要求2至5中任一项所述的方法,其特征在于,所述图像处理装置还包括第三内存,所述调度处理单元与所述分类处理单元共享所述第三内存,所述方法还包括:
    所述分类处理单元将所述第一分类结果保存在所述第三内存中;
    所述调度处理单元从所述第三内存中获取所述第一分类结果,并将所述第一分类结果从所述图像处理装置输出。
  7. 根据权利要求6所述的方法,其特征在于,所述方法还包括:
    若所述第三内存的占用率达到第二预置占用率,则所述调度处理单元挂起所述分类处理单元。
  8. 根据权利要求2至7中任一项所述的方法,其特征在于,所述方法还包括:
    所述调度处理单元获取:所述分区处理单元执行对所述第一图像进行分区操作的第一时长,以及所述分类处理单元执行对所述第一图像进行分类操作的第二时长;
    若所述第一时长大于所述第二时长,则所述调度处理单元增大所述第二内存的大小,和/或增大所述第二内存的带宽;
    和/或,
    若所述第一时长小于所述第二时长,则所述调度处理单元减小所述第二内存的大小,和/或减小所述第二内存的带宽。
  9. 根据权利要求6或7所述的方法,其特征在于,所述方法还包括:
    所述调度处理单元获取:所述分区处理单元执行对所述第一图像进行分区操作的第一时长,以及所述分类处理单元执行对所述第一图像进行分类操作的第二时长;
    若所述第一时长大于所述第二时长,则所述调度处理单元减小所述第三内存的大小,和/或减小所述第三内存的带宽;
    和/或,
    若所述第一时长小于所述第二时长,则所述调度处理单元增大所述第三内存的大小,和/或增大所述第三内存的带宽。
  10. 根据权利要求2至7中任一项所述的方法,其特征在于,所述图像处理装置还包括:第四内存,用于保存所述分区处理单元进行分区操作时的中间结果;和/或,第五内存,用于保存所述分类处理单元进行分类操作时的中间结果;
    所述方法还包括:
    所述调度处理单元获取:所述分区处理单元执行对所述第一图像进行分区操作的第一时长,以及所述分类处理单元执行对所述第一图像进行分类操作的第二时长;
    若所述第一时长大于所述第二时长,则所述调度处理单元增大所述第四内存的大小,和/或减小所述第五内存的大小、和/或增大所述第四内存的带宽,和/或减小所述第五内存的带宽;
    若所述第一时长小于所述第二时长,则所述调度处理单元减小所述第四内存的大小,和/或增大所述第五内存的大小、和/或减小所述第四内存的带宽,和/或增大所述第五内存的带宽。
  11. 一种图像处理装置,其特征在于,包括分区处理单元、分类处理单元、第一内存与第二内存,其中,所述分区处理单元和所述分类处理单元是异构处理单元,所述分区处理单元与所述分类处理单元共享所述第一内存与所述第二内存;
    所述分区处理单元用于:从所述第一内存中获取待处理的第一图像,对所述第一图像进行分区,得到第一分区结果,并将所述第一分区结果保存在所述第二内存中;
    所述分类处理单元用于:从所述第一内存中获取所述第一图像,从所述第二内存中获取所述第一分区结果,然后根据所述第一图像与所述第一分区结果,对所述第一图像进行分类,得到第一分类结果。
  12. 根据权利要求11所述的图像处理装置,其特征在于,所述图像处理装置还包括调度处理单元,所述调度处理单元与所述分区处理单元、所述分类处理单元共享所述第一内存;
    所述调度处理单元用于:获取待处理的所述第一图像,并将获取的所述第一图像保存在所述第一内存中。
  13. 根据权利要求12所述的图像处理装置,其特征在于,所述调度处理单元在将获取的所述第一图像保存在所述第一内存中之后,还用于启动执行如下步骤:获取待处理的第二图像,并将获取的所述第二图像保存在所述第一内存中;
    所述分区处理单元在得到所述第一分区结果后,还用于启动执行如下步骤:从所述第一内存中获取所述第二图像,对所述第二图像进行分区,得到第二分区结果,并将所述第二分区结果保存在所述第二内存中;
    所述分类处理单元在得到所述第一分类结果之后,还用于启动执行如下步骤:从所述第一内存中获取所述第二图像,从所述第二内存中获取所述第二分区结果,然后根据所述第二图像与所述第二分区结果,对所述第二图像进行分类,得到第二分类结果。
  14. 根据权利要求13所述的图像处理装置,其特征在于,所述调度处理单元启动获取待处理的所述第二图像操作的时刻点,不早于所述分区处理单元启动对所述第一图像进行分区操作的时刻点;
    和/或,所述分区处理单元启动对所述第二图像进行分区操作的时刻点,不早于所述分类处理单元启动对所述第一图像进行分类操作的时刻点。
  15. 根据权利要求12至14中任一项所述的装置,其特征在于,所述调度处理单元还用于:
    若所述第二内存的占用率达到第一预置占用率,则挂起所述分区处理单元。
  16. 根据权利要求12至15中任一项所述的装置,其特征在于,所述图像处理装置还包括第三内存,所述调度处理单元与所述分类处理单元共享所述第三内存;
    所述分类处理单元还用于:将所述第一分类结果保存在所述第三内存中;
    所述调度处理单元还用于:从所述第三内存中获取所述第一分类结果,并将所述第一分类结果从所述图像处理装置输出。
  17. 根据权利要求16所述的装置,其特征在于,所述调度处理单元还用于:
    若所述第三内存的占用率达到第二预置占用率,则挂起所述分类处理单元。
  18. 根据权利要求12至17中任一项所述的图像处理装置,其特征在于,所述调度处理单元还用于:
    获取:所述分区处理单元执行对所述第一图像进行分区操作的第一时长,以及所述分类处理单元执行对所述第一图像进行分类操作的第二时长;
    若所述第一时长大于所述第二时长,则增大所述第二内存的大小,和/或增大所述第二内存的带宽;
    和/或,
    若所述第一时长小于所述第二时长,则减小所述第二内存的大小,和/或减小所述第二内存的带宽。
  19. 根据权利要求16或17所述的图像处理装置,其特征在于,所述调度处理单元还用于:
    获取:所述分区处理单元执行对所述第一图像进行分区操作的第一时长,以及所述分类处理单元执行对所述第一图像进行分类操作的第二时长;
    若所述第一时长大于所述第二时长,则所述调度处理单元减小所述第三内存的大小,和/或减小所述第三内存的带宽;
    和/或,
    若所述第一时长小于所述第二时长,则所述调度处理单元增大所述第三内存的大小,和/或增大所述第三内存的带宽。
  20. 根据权利要求12至17中任一项所述的装置,其特征在于,所述图像处理装置还包括:第四内存,用于保存所述分区处理单元进行分区操作时的中间结果;和/或,第五内存,用于保存所述分类处理单元进行分类操作时的中间结果;
    所述调度处理单元还用于:
    获取:所述分区处理单元执行对所述第一图像进行分区操作的第一时长,以及所述分类处理单元执行对所述第一图像进行分类操作的第二时长;
    若所述第一时长大于所述第二时长,则增大所述第四内存的大小,和/或减小所述第五内存的大小、和/或增大所述第四内存的带宽,和/或减小所述第五内存的带宽;
    若所述第一时长小于所述第二时长,则减小所述第四内存的大小,和/或增大所述第五内存的大小、和/或减小所述第四内存的带宽,和/或增大所述第五内存的带宽。
PCT/CN2016/080997 2015-10-30 2016-05-04 一种图像处理方法与图像处理装置 WO2017071176A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16858624.6A EP3352132B1 (en) 2015-10-30 2016-05-04 Image processing method and image processing apparatus
US15/964,045 US10740657B2 (en) 2015-10-30 2018-04-26 Image processing method and image processing apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510728119.5A CN106651748B (zh) 2015-10-30 2015-10-30 一种图像处理方法与图像处理装置
CN201510728119.5 2015-10-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/964,045 Continuation US10740657B2 (en) 2015-10-30 2018-04-26 Image processing method and image processing apparatus

Publications (1)

Publication Number Publication Date
WO2017071176A1 true WO2017071176A1 (zh) 2017-05-04

Family

ID=58629792

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/080997 WO2017071176A1 (zh) 2015-10-30 2016-05-04 一种图像处理方法与图像处理装置

Country Status (4)

Country Link
US (1) US10740657B2 (zh)
EP (1) EP3352132B1 (zh)
CN (1) CN106651748B (zh)
WO (1) WO2017071176A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239570A (zh) * 2017-06-27 2017-10-10 联想(北京)有限公司 数据处理方法及服务器集群
CN107341127B (zh) * 2017-07-05 2020-04-14 西安电子科技大学 基于OpenCL标准的卷积神经网络加速方法
US10867399B2 (en) 2018-12-02 2020-12-15 Himax Technologies Limited Image processing circuit for convolutional neural network
TWI694413B (zh) * 2018-12-12 2020-05-21 奇景光電股份有限公司 影像處理電路
US11341745B1 (en) 2019-11-14 2022-05-24 Lockheed Martin Corporation Unresolved object target detection using a deep neural network
CN113362219B (zh) * 2021-07-02 2023-08-11 展讯通信(天津)有限公司 一种图像数据处理方法及装置
CN116680042A (zh) * 2022-02-22 2023-09-01 华为技术有限公司 一种图像处理的方法及相关装置和系统

Citations (2)

Publication number Priority date Publication date Assignee Title
CN203930824U (zh) * 2014-04-17 2014-11-05 超威半导体产品(中国)有限公司 具有结合的cpu和gpu的芯片器件,相应的主板和计算机系统
US20140333635A1 (en) * 2013-05-10 2014-11-13 Nvidia Corporation Hierarchical hash tables for simt processing and a method of establishing hierarchical hash tables

Family Cites Families (11)

Publication number Priority date Publication date Assignee Title
US7003660B2 (en) 2000-06-13 2006-02-21 Pact Xpp Technologies Ag Pipeline configuration unit protocols and communication
US20050050305A1 (en) 2003-08-28 2005-03-03 Kissell Kevin D. Integrated mechanism for suspension and deallocation of computational threads of execution in a processor
US8108863B2 (en) 2005-12-30 2012-01-31 Intel Corporation Load balancing for multi-threaded applications via asymmetric power throttling
US8468532B2 (en) 2006-06-21 2013-06-18 International Business Machines Corporation Adjusting CPU time allocated to next thread based on gathered data in heterogeneous processor system having plurality of different instruction set architectures
CN101369315A (zh) * 2007-08-17 2009-02-18 上海银晨智能识别科技有限公司 人脸检测方法
US8397241B2 (en) 2008-11-13 2013-03-12 Intel Corporation Language level support for shared virtual memory
US8615637B2 (en) 2009-09-10 2013-12-24 Advanced Micro Devices, Inc. Systems and methods for processing memory requests in a multi-processor system using a probe engine
WO2012096988A2 (en) * 2011-01-10 2012-07-19 Rutgers, The State University Of New Jersey Method and apparatus for shape based deformable segmentation of multiple overlapping objects
CN103166995B (zh) 2011-12-14 2016-08-10 华为技术有限公司 一种视频传输方法及装置
US9235769B2 (en) 2012-03-15 2016-01-12 Herta Security, S.L. Parallel object detection method for heterogeneous multithreaded microarchitectures
US9129161B2 (en) * 2013-05-31 2015-09-08 Toyota Jidosha Kabushiki Kaisha Computationally efficient scene classification

Non-Patent Citations (1)

Title
See also references of EP3352132A4 *

Also Published As

Publication number Publication date
EP3352132A4 (en) 2018-10-24
CN106651748B (zh) 2019-10-22
EP3352132A1 (en) 2018-07-25
US20180247164A1 (en) 2018-08-30
CN106651748A (zh) 2017-05-10
US10740657B2 (en) 2020-08-11
EP3352132B1 (en) 2022-09-28

Similar Documents

Publication Publication Date Title
WO2017071176A1 (zh) 一种图像处理方法与图像处理装置
JP7382925B2 (ja) ニューラルネットワークアクセラレーションのための機械学習ランタイムライブラリ
US10540584B2 (en) Queue management for direct memory access
US9996386B2 (en) Mid-thread pre-emption with software assisted context switch
US20170109214A1 (en) Accelerating Task Subgraphs By Remapping Synchronization
US10402223B1 (en) Scheduling hardware resources for offloading functions in a heterogeneous computing system
US20190324939A1 (en) Processor core to coprocessor interface with fifo semantics
JP7053713B2 (ja) 低電力コンピュータイメージング
CN111190735B (zh) 一种基于Linux的片上CPU/GPU流水化计算方法及计算机系统
US8180998B1 (en) System of lanes of processing units receiving instructions via shared memory units for data-parallel or task-parallel operations
CN111274025A (zh) 用于在ssd中加速数据处理的系统和方法
JP2019525324A (ja) メモリ要求仲裁
TW202109286A (zh) 純函數語言神經網路加速器系統及結構
US20230153157A1 (en) Inter-node communication method and device based on multiple processing nodes
CN107148619B (zh) 用于多线程图形流水线的自由排序线程模型
US20120246656A1 (en) Scheduling of tasks to be performed by a non-coherent device
US10877926B2 (en) Method and system for partial wavefront merger
US20240069965A1 (en) Systems and methods for executing compute functions
US12001365B2 (en) Scatter and gather streaming data through a circular FIFO
US11907144B1 (en) Early semaphore update
US11875247B1 (en) Input batching with serial dynamic memory access
US10423424B2 (en) Replicated stateless copy engine
US20220012201A1 (en) Scatter and Gather Streaming Data through a Circular FIFO
US20230112420A1 (en) Kernel optimization and delayed execution
WO2018049821A1 (zh) 请求源响应的仲裁方法、装置及计算机存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16858624

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2016858624

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE