CN112148668A - Data storage method and device based on on-chip cache, and storage medium - Google Patents

Data storage method and device based on on-chip cache, and storage medium

Info

Publication number
CN112148668A
Authority
CN
China
Prior art keywords
filter
operator
chip cache
row
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010970132.2A
Other languages
Chinese (zh)
Other versions
CN112148668B (en)
Inventor
范丹枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202010970132.2A
Publication of CN112148668A
Application granted
Publication of CN112148668B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/781: On-chip cache; Off-chip memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/10: Address translation
    • G06F 12/1009: Address translation using page tables, e.g. page table structures
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a data storage method and device based on on-chip cache, and a storage medium. The method comprises: acquiring a line operator set by a target object for each stage of filter in a multi-stage filter, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image; allocating, on a target device, an on-chip cache matching the size of the line operator for the line operator; and calling the line operators of the multi-stage filter according to a preset time sequence to obtain an execution result, and storing an intermediate result corresponding to an intermediate filter of the multi-stage filter to the on-chip cache. This technical scheme solves the problem in the related art that storing all intermediate outputs of a multi-stage filter in DRAM causes frequent DRAM reads and writes: by keeping intermediate results on chip, the temporary DRAM read-write overhead of intermediate results is eliminated as a whole, the load on the memory system is reduced, and the running speed is improved from the system perspective.

Description

Data storage method and device based on on-chip cache, and storage medium
Technical Field
The invention relates to the technical field of image processing, and in particular to a data storage method and device based on on-chip cache, and a storage medium.
Background
Image filters are the most basic class of operators in computer vision: for any pixel, a neighborhood of a certain size is taken as input and one output is generated. The scope of such operators is therefore very wide and covers all operators with local characteristics. In terms of usage, filtering algorithms generally serve as preprocessing for more complex algorithms such as feature analysis and extraction, and multiple filters usually need to be used in cascade. The design and performance optimization of filtering algorithms is consequently an important part of algorithm design and bears on the effectiveness and real-time performance of the whole system.
In the prior art, filter optimization mostly focuses on the logic of a specific filtering algorithm and its computation on a specific platform, and research on the computation of individual operators is relatively thorough. On the memory-architecture side, common optimizations focus on improving the memory-access friendliness of an algorithm, and on using the characteristics of the algorithm and the platform to preload data from Dynamic Random Access Memory (DRAM) into the on-chip cache in advance, so as to avoid miss costs.
The multi-stage filter problem can be described as follows: one input image (in any representation) passes through any number of filters in sequence to produce one output image (in any representation), with the input image and the output image located in dynamic random access memory (DRAM). The outputs between the filters, other than the final output, are not required to be produced in any particular form and can be handled according to platform optimization needs. If the intermediate filtering stages are simply combined as independent filters without special design, the output of each intermediate stage is written to DRAM and then loaded from DRAM again as the input of the next stage. The DRAM writes and reads generated at every stage put considerable pressure on the memory system; where the memory read-write bandwidth of the whole system is limited, these read-write operations accumulate into an operational bottleneck and degrade the performance of every program that accesses DRAM.
For the problems in the related art that storing all outputs of a multi-stage filter in DRAM leads to frequent DRAM reads and writes, puts great pressure on the DRAM, and reduces the running speed, no effective technical solution has yet been proposed.
Disclosure of Invention
The embodiments of the invention provide a data storage method and device based on on-chip cache, and a storage medium, so as to at least solve the problems in the related art that storing all outputs of a multi-stage filter in DRAM leads to frequent DRAM read-write operations, puts great pressure on the DRAM, and reduces the running speed.
An embodiment of the invention provides a data storage method based on on-chip cache, comprising: acquiring a line operator set by a target object for each stage of filter in a multi-stage filter, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image; allocating, on a target device, an on-chip cache matching the size of the line operator for the line operator; and calling the line operators of the multi-stage filter according to a preset time sequence to obtain an execution result, and storing an intermediate result, corresponding to an intermediate filter of the multi-stage filter, in the execution result to the on-chip cache.
In an optional embodiment, allocating, on the target device, the on-chip cache matching the size of the line operator for the line operator comprises: determining a target filter corresponding to the line operator; acquiring p inputs of the next-stage filter of the target filter and the address ranges of the on-chip caches corresponding to the p inputs, wherein p is a positive integer; and allocating, on the target device and according to the address ranges, the on-chip cache matching the size of the line operator for the line operator.
In an optional embodiment, the on-chip cache is a ring cache, and storing the intermediate result, corresponding to an intermediate filter of the multi-stage filter, in the execution result to the on-chip cache comprises: in the case that the line operator corresponds to the n-th line of output, storing the intermediate result in the (n % L)-th line of the ring cache, wherein n and L are both positive integers, L is the size of the on-chip cache in lines, and n % L denotes the remainder of n divided by L.
In an optional embodiment, after allocating, on the target device, the on-chip cache matching the size of the line operator for the line operator, the method further comprises: applying to the target device for an on-chip cache of the size corresponding to the line operator, so that the on-chip cache stores the intermediate result of the line operator.
In an optional embodiment, calling the line operators of the multi-stage filter according to a preset time sequence to obtain an execution result comprises: for each stage of filter in the multi-stage filter, establishing a mapping table from the p inputs of that stage of filter to the on-chip cache corresponding to that stage, wherein p is an integer; traversing the multi-stage filter in sequence according to the preset time sequence, running the line operator of the current filter in the case that the data corresponding to the p inputs of the current filter have been generated by the preceding-stage filter, and incrementing the line count of the current filter; traversing the multi-stage filter again in the case that the multi-stage filter has been completely traversed but the line count of the final-stage filter has not reached the total number of lines; and determining that the scheduled execution of the multi-stage filter is completed in the case that the multi-stage filter has been completely traversed and the line count of the final-stage filter has reached the total number of lines.
In an optional embodiment, the method further comprises: looking up, in the mapping table, the array of on-chip cache addresses input to the line operator of the current filter, using the line numbers corresponding to the p inputs as the index; looking up, in the mapping table, the array of on-chip cache addresses output by the line operator of the current filter, using the output line number of the current filter as the index; and storing the intermediate result in the on-chip cache determined from the array of on-chip cache addresses output by the line operator.
According to another embodiment of the invention, there is also provided a data storage device based on on-chip cache, comprising: an acquisition module, configured to acquire a line operator set by a target object for each stage of filter in a multi-stage filter, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image; a setting module, configured to allocate, on a target device, an on-chip cache matching the size of the line operator for the line operator; and a processing module, configured to call the line operators of the multi-stage filter according to a preset time sequence to obtain an execution result, and store an intermediate result corresponding to an intermediate filter of the multi-stage filter to the on-chip cache.
In an optional embodiment, the setting module is further configured to determine a target filter corresponding to the line operator; acquire p inputs of the next-stage filter of the target filter and the address ranges of the on-chip caches corresponding to the p inputs, wherein p is a positive integer; and use the address ranges as the size of the on-chip cache set on the target device for the line operator.
In an optional embodiment, the processing module is further configured to, in the case that the line operator corresponds to the n-th line of output, store the intermediate result in the (n % L)-th line of the ring cache, wherein n and L are both positive integers, L is the size of the on-chip cache in lines, and n % L denotes the remainder of n divided by L.
In an optional embodiment, the processing module is further configured to apply to the target device for an on-chip cache of the size corresponding to the line operator, so that the on-chip cache stores the intermediate result of the line operator.
In an optional embodiment, the processing module is further configured to: for each stage of filter in the multi-stage filter, establish a mapping table from the p inputs of that stage of filter to the on-chip cache corresponding to that stage, wherein p is an integer; traverse the multi-stage filter in sequence according to a preset time sequence, run the line operator of the current filter in the case that the data corresponding to the p inputs of the current filter have been generated by the preceding-stage filter, and increment the line count of the current filter; traverse the multi-stage filter again in the case that the multi-stage filter has been completely traversed but the line count of the final-stage filter has not reached the total number of lines; and determine that the scheduled execution of the multi-stage filter is completed in the case that the multi-stage filter has been completely traversed and the line count of the final-stage filter has reached the total number of lines.
In an optional embodiment, the processing module is further configured to look up, in the mapping table, the array of on-chip cache addresses input to the line operator of the current filter, using the line numbers corresponding to the p inputs as the index; look up, in the mapping table, the array of on-chip cache addresses output by the line operator of the current filter, using the output line number of the current filter as the index; and store the intermediate result in the on-chip cache determined from the array of on-chip cache addresses output by the line operator.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, a line operator set by a target object for each stage of filter in a multi-stage filter is acquired, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image; an on-chip cache matching the size of the line operator is allocated for the line operator on a target device; and the line operators of the multi-stage filter are called according to a preset time sequence to obtain an execution result, and an intermediate result, corresponding to an intermediate filter of the multi-stage filter, in the execution result is stored to the on-chip cache. That is, the intermediate results of the multi-stage filter are stored in the on-chip cache allocated for it in advance rather than in DRAM, which eliminates the temporary DRAM read-write overhead of intermediate results, reduces the load on the memory system, and improves the running speed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of the hardware structure of a computer terminal for a data storage method based on on-chip cache according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data storage method based on on-chip cache according to an embodiment of the invention;
FIG. 3 is a flow diagram illustrating a manner in which operators of a multi-stage filter are constructed and scheduled, according to an alternative embodiment of the present invention;
fig. 4 is a block diagram of a data storage device based on on-chip cache according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description, claims, and drawings of the present invention are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, such that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method provided by the embodiments of the application can be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, fig. 1 is a block diagram of the hardware structure of a computer terminal for a data storage method based on on-chip cache according to an embodiment of the present invention. As shown in fig. 1, the computer terminal may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only illustrative and does not limit the structure of the computer terminal. For example, the computer terminal may also include more or fewer components than shown in fig. 1, or have a different configuration with equivalent or greater functionality than that shown in fig. 1. The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the data storage method based on on-chip cache in the embodiments of the present invention; the processor 102 implements the above method by running the computer programs stored in the memory 104, thereby executing various functional applications and data processing. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the above network may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used for communicating with the internet wirelessly.
According to an embodiment of the present invention, a data storage method based on on-chip cache is provided, which may be applied to the above computer terminal. Fig. 2 is a flowchart of the data storage method based on on-chip cache according to an embodiment of the present invention; as shown in fig. 2, the method includes:
step S202, acquiring a line operator set by a target object for each stage of filter in a multi-stage filter, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image;
step S204, allocating, on a target device, an on-chip cache matching the size of the line operator for the line operator;
step S206, calling the line operators of the multi-stage filter according to a preset time sequence to obtain an execution result, and storing an intermediate result, corresponding to an intermediate filter of the multi-stage filter, in the execution result to the on-chip cache.
Through the above steps, a line operator set by a target object for each stage of filter in a multi-stage filter is acquired, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image; an on-chip cache matching the size of the line operator is allocated for the line operator on a target device; and the line operators of the multi-stage filter are called according to a preset time sequence to obtain an execution result, and an intermediate result, corresponding to an intermediate filter of the multi-stage filter, in the execution result is stored to the on-chip cache. That is, the intermediate results of the multi-stage filter are stored in the on-chip cache allocated for it in advance rather than in DRAM, which eliminates the temporary DRAM read-write overhead of intermediate results, reduces the load on the memory system, and improves the running speed.
There are various ways of setting the size of the on-chip cache on the target device in step S204. In an optional embodiment, the following scheme may be used: determining a target filter corresponding to the line operator; acquiring p inputs of the next-stage filter of the target filter and the address ranges of the on-chip caches corresponding to the p inputs, wherein p is a positive integer; and using the address ranges as the size of the on-chip cache set on the target device for the line operator.
That is, after the target filter corresponding to the line operator is determined, the p inputs of the next-stage filter of the target filter and the address ranges of the on-chip caches corresponding to the p inputs are acquired, and the size of the on-chip cache is set for the line operator on the target device according to those address ranges.
Optionally, the on-chip cache is a ring cache, and storing the intermediate result, corresponding to an intermediate filter of the multi-stage filter, in the execution result to the on-chip cache includes: in the case that the line operator corresponds to the n-th line of output, storing the intermediate result in the (n % L)-th line of the ring cache, wherein n and L are both positive integers, L is the size of the on-chip cache in lines, and n % L denotes the remainder of n divided by L.
To make the output of each stage convenient, the on-chip cache adopts a ring-cache structure: when the line operator is determined to correspond to the n-th line of output, the intermediate result is stored in the (n % L)-th line of the ring cache. For example, assume the size of the on-chip cache is L cache lines; for any filter operator, when it is scheduled to compute the n-th line of output, the result is actually written into the (n % L)-th line of the ring cache. This is equivalent to folding the intermediate output into L cache lines in total, with each cache line as wide as the original output, so the task becomes determining the number of cache lines L for each stage.
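As an illustration only, the line-folding rule above can be captured in a small helper. The following is a minimal C++ sketch under assumed names (RingLineCache is not from the patent; an ordinary vector stands in for a reserved on-chip cache region):

#include <cstddef>
#include <cstdint>
#include <vector>

struct RingLineCache {
    size_t num_lines;   // L: minimum number of cache lines for this stage
    size_t line_bytes;  // width of one output line, including any alignment
    std::vector<uint8_t> storage;  // stands in for an on-chip cache region

    RingLineCache(size_t L, size_t width)
        : num_lines(L), line_bytes(width), storage(L * width) {}

    // Address of the ring line that holds (or will receive) output line n.
    uint8_t* line(size_t n) { return storage.data() + (n % num_lines) * line_bytes; }
};

Writing output line n through line(n) automatically overwrites the oldest line once n reaches L, which is exactly the folding behavior described above.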
Optionally, after allocating, on the target device, the on-chip cache matching the size of the line operator for the line operator, the method further includes: applying to the target device for an on-chip cache of the size corresponding to the line operator, so that the on-chip cache stores the intermediate result of the line operator.
After the cache size for the line operator has been determined on the on-chip cache of the target device, the memory space required by an on-chip cache of that size needs to be applied for from the target device, so that the intermediate result of the line operator can conveniently be stored in the on-chip cache and the temporary DRAM read-write overhead of the intermediate result is reduced.
Optionally, calling the line operators of the multi-stage filter according to a preset time sequence to obtain an execution result includes: for each stage of filter in the multi-stage filter, establishing a mapping table from the p inputs of that stage of filter to the on-chip cache corresponding to that stage, wherein p is an integer; traversing the multi-stage filter in sequence according to the preset time sequence, running the line operator of the current filter in the case that the data corresponding to the p inputs of the current filter have been generated by the preceding-stage filter, and incrementing the line count of the current filter; traversing the multi-stage filter again in the case that the multi-stage filter has been completely traversed but the line count of the final-stage filter has not reached the total number of lines; and determining that the scheduled execution of the multi-stage filter is completed in the case that the multi-stage filter has been completely traversed and the line count of the final-stage filter has reached the total number of lines.
That is to say, when the line operators of the multi-stage filter are called according to the preset time sequence, a mapping table from the inputs of each stage of filter to the on-chip cache corresponding to that stage is first established for every stage, in order to improve calling efficiency. Then, while traversing the multi-stage filter, whenever the data corresponding to the inputs of the current filter are found to have been generated by the preceding-stage filter, the line operator of the current filter is run directly and its line count is incremented. After the multi-stage filter has been completely traversed, if the line count of the final-stage filter has not reached the total number of lines, the multi-stage filter is traversed again; scheduling of the multi-stage filter is determined to be complete only when the line count of the final-stage filter reaches the total number of lines.
For example, the current line of every filter may first be set to 0. Starting from the first-stage filter, each stage is scanned in turn, and it is checked whether the preceding-stage output lines on which that stage's operator depends for computing its current line have been generated. If so, the operator is run once and the current line number is incremented; if not, the current scan of the multi-stage filter is stopped and a new scan cycle is started again from the first-stage filter. This process repeats until all lines of the final-stage output have been generated, at which point the whole computation ends. A sketch of this loop follows.
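The following is a minimal C++ sketch of that scheduling loop, under the simplifying assumption of a single linear chain in which each stage needs predecessor lines down to produced + below (clamped to the image bottom) before computing its next line; Stage, below, and schedule are illustrative names, not from the patent:

#include <algorithm>
#include <cstdio>
#include <vector>

// Each stage must see predecessor lines up to (produced + below), clamped to
// the image bottom, before it can compute its own line `produced`. Stage 0
// reads the input image, which is fully present in DRAM.
struct Stage {
    int below;         // look-ahead into the previous stage's output
    int produced = 0;  // output lines generated so far
};

void schedule(std::vector<Stage>& stages, int H /* image height */) {
    while (stages.back().produced < H) {
        for (size_t i = 0; i < stages.size(); ++i) {
            Stage& s = stages[i];
            if (s.produced >= H) continue;  // this stage already finished
            int avail = (i == 0) ? H : stages[i - 1].produced;
            int need  = std::min(H - 1, s.produced + s.below) + 1;
            if (avail < need) break;        // dependency missing: rescan from stage one
            // the stage's line operator would run here on line s.produced
            ++s.produced;
        }
    }
}

int main() {
    std::vector<Stage> stages = {{0}, {1}, {2}};  // e.g. 1x1, 3x3, 5x5 vertical extents
    schedule(stages, 8);
    std::printf("final stage produced %d lines\n", stages.back().produced);
}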
Optionally, the method further includes: looking up, in the mapping table, the array of on-chip cache addresses input to the line operator of the current filter, using the line numbers corresponding to the p inputs as the index; looking up, in the mapping table, the array of on-chip cache addresses output by the line operator of the current filter, using the output line number of the current filter as the index; and storing the intermediate result in the on-chip cache determined from the array of on-chip cache addresses output by the line operator.
Because a mapping table from the p inputs of each stage of filter to the on-chip cache corresponding to that stage has been established for every stage of the multi-stage filter, the array of on-chip cache addresses input to the line operator of the current filter can then be looked up in the mapping table using the corresponding input line numbers, and likewise the output addresses using the output line number.
To better understand the above data storage method based on on-chip cache, the following design, cache allocation, and execution scheduling method for a multi-stage filter is provided in connection with an optional embodiment of the invention. It includes: defining a line operator for each stage of the multi-stage filter, where the line operator is a function that processes and outputs one line of image data at a time; calculating, from the input characteristics of the line operator, the minimum number of cache lines required for that stage's input, and allocating an on-chip cache of the corresponding size; and having the execution scheduler cyclically call each stage's line operator with the correct timing, setting the correct on-chip cache address for each operator's output on every call, with the final-stage line operator writing the computation result to the correct DRAM address.
Optionally, defining a line operator includes: designing, for each stage of filter according to its function, a computation function as the line operator. The design assumes that the filter is responsible only for computing and outputting one line of image data; the input data addresses it depends on and the output data address are both supplied by the execution scheduler, which guarantees their correctness. The line operator also reports memory-access characteristics such as the required input-data neighborhood size and the single-element sizes of its input and output data, providing the reference for cache allocation and execution scheduling.
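For the common single-input case, such an interface might be sketched in C++ as follows; every name here (RowOperatorTraits, RowOperatorFn) is an assumption for illustration rather than taken from the patent:

#include <cstddef>
#include <cstdint>

// Memory-access characteristics a stage reports to the scheduler.
struct RowOperatorTraits {
    int rows_above;         // input lines needed above the current output line
    int rows_below;         // input lines needed below it
    size_t in_elem_bytes;   // single-element size of the input data
    size_t out_elem_bytes;  // single-element size of the output data
};

// The operator computes exactly one output line n. in_rows[k] points at input
// line (n - rows_above + k); all addresses are resolved by the scheduler.
using RowOperatorFn = void (*)(const uint8_t* const* in_rows, uint8_t* out_row,
                               int n, int width);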
Optionally, performing minimum on-chip cache calculation and allocation for each stage of filter includes the following. For each stage of filter, further assume that its line operator has p inputs, i.e., its preceding-stage filter has p outputs, where p is a positive integer. Determine, from the operator's memory-access characteristics, the relative line-number range that each of the p inputs depends on in a single execution, and calculate from these p relative ranges and their relationship the minimum number of cache lines L for each input. Calculate the single-line size of each of the p inputs from the memory-access characteristics, then combine it with the minimum number of cache lines L to obtain the on-chip cache size for each input, and allocate the on-chip caches accordingly. Performing this calculation and allocation for every stage of filter yields all the required caches.
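Under the assumption that one input's dependency range in a single run spans relative lines [-above, +below], the sizing rule can be sketched as follows (names illustrative, not from the patent):

#include <cstddef>

struct LineRange { int above; int below; };  // e.g. a 3x3 kernel needs {1, 1}

// A window of (above + below + 1) lines must stay resident at once, so that
// is the minimum ring size L for this input.
inline int minCacheLines(LineRange r) { return r.above + r.below + 1; }

inline size_t cacheBytes(LineRange r, size_t line_bytes) {
    return static_cast<size_t>(minCacheLines(r)) * line_bytes;
}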
Optionally, the execution scheduler cyclically calls each stage's line operator with the correct timing to obtain the final computation result of the multi-stage filter, as follows. For each stage of filter, according to the cache information computed during cache allocation, establish a mapping table from each line of its p inputs (i.e., the p outputs of the preceding-stage filter; the total number of lines is the height H of the image to be processed) to an on-chip cache address. Clear the line count c of every filter, then traverse the filters in sequence: if the p input data within the dependent line range of the current filter have all been generated by the preceding filter (the input of the first filter is the image to be processed, so all of its lines are considered generated by default), schedule the line operator of the current filter and increment its line count c; otherwise, restart the traversal of the stages. After all stages have been traversed, if the line count c of the final filter has not reached the total number of lines H, restart the traversal; once it reaches H, the execution scheduler stops.
The flow for scheduling a filter's line operator is as follows (skipped if the current filter's line count c has already reached the total number of lines H): use all the line numbers that the p inputs of the current filter depend on at line count c as indices into the mapping table, and take the resulting array of on-chip cache addresses as the input of the line operator; use the output line number of the current filter as an index into the mapping table, and take the resulting on-chip cache address as the operator's output; then call the line operator function and store the computation result at the required cache address, where it serves as input to the next-stage line operator. A sketch of this dispatch step follows.
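The dispatch can be sketched as below, assuming per-stage address tables in which the ring folding has already been applied; StageTables, RowOp, and dispatchRow are assumed names, and boundary clamping of line numbers near the image edges is omitted for brevity:

#include <cstdint>
#include <utility>
#include <vector>

struct StageTables {
    // One address table per input of this stage, indexed by absolute line number.
    std::vector<std::vector<uint8_t*>> input_rows;
    std::vector<uint8_t*> output_rows;  // indexed by this stage's output line
};

using RowOp = void (*)(const std::vector<const uint8_t*>& ins, uint8_t* out, int n);

// dep[p] = {first, last} relative line range that input p depends on.
void dispatchRow(const StageTables& t, RowOp op, int n,
                 const std::vector<std::pair<int, int>>& dep) {
    std::vector<const uint8_t*> ins;
    for (size_t p = 0; p < t.input_rows.size(); ++p)
        for (int r = n + dep[p].first; r <= n + dep[p].second; ++r)
            ins.push_back(t.input_rows[p][r]);  // ring folding is already in the table
    op(ins, t.output_rows[n], n);  // result lands where the next stage will read it
}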
The process of computing the on-chip cache address mapping table is as follows: for each stage of filter, according to the minimum number of cache lines L of each of its p inputs obtained in the cache calculation step, and following the usage pattern of a ring buffer, map the n-th line of each input (0 <= n < total number of lines H) to the (n % L)-th line of that filter's on-chip cache, and store the head address of that line in the mapping table. For the input of the first-stage filter and the output of the final-stage filter, the addresses are the actual DRAM addresses and need no mapping.
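A sketch of building one such per-line address table follows; buildRowAddressTable is an assumed name, and the DRAM endpoints (first-stage input, final-stage output) keep their real addresses by passing is_dram_endpoint = true:

#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint8_t*> buildRowAddressTable(uint8_t* base, size_t line_bytes,
                                           int L, int H, bool is_dram_endpoint) {
    std::vector<uint8_t*> table(H);
    for (int n = 0; n < H; ++n) {
        // Intermediate stages fold line n into ring line n % L; the DRAM
        // input/output images are addressed directly.
        int line = is_dram_endpoint ? n : (n % L);
        table[n] = base + static_cast<size_t>(line) * line_bytes;
    }
    return table;
}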
The flow of the above data storage method based on on-chip cache is explained below with reference to several alternative embodiments, which are not intended to limit the technical solution of the embodiments of the present invention.
An optional embodiment of the present invention provides an operator construction and scheduling method for a multi-stage filter, which optimizes filters that exist as preprocessing in a computer vision system. Owing to the input and output data characteristics of a filter, an output pixel is generated from an input-image neighborhood of a certain size, and the whole image is traversed to generate all pixels, so the data access of a filter has a local characteristic.
Fig. 3 is a schematic flow chart of an operator construction and scheduling manner of a multistage filter according to an alternative embodiment of the present invention, as shown in fig. 3, specifically including the following steps:
step 1: a user (corresponding to the target object in the embodiments of the present invention) specifies the line operators and the input/output data characteristics of the multi-stage filter;
since this optional embodiment optimizes a multi-stage filter that exists as preprocessing in a computer vision system, the multi-stage filter is used to process the target image. The multi-stage filter problem can thus be described as one input image (in any representation) passing through the several filters in sequence to produce one output image (in any representation), with the input image and the output image located in DRAM; the outputs between the filters, other than the final output, are not required in any particular form and can be handled according to platform optimization needs.
In this optional embodiment, the filter operators are constructed as line operators: a single run of an operator outputs one line of data, and this is the minimum unit of filter computation; the scheduling logic schedules the line operator of each stage of filter to produce the total output of the multi-stage filter. The main reason for choosing this operator granularity is that, with a line as the unit of data processing, the amount of data an operator processes at a time usually matches the order of magnitude of current devices' on-chip caches: for a CPU system it is small enough to fit in cache memory, typically comparable to the size of an L1 cache, and for a digital signal processing (DSP) system, configuring direct memory access (DMA) control in units of lines is generally friendly both to the use of control resources and to the SRAM size. Another reason is that computing devices such as CPUs and DSPs generally support single instruction multiple data (SIMD) operations, which can be exploited effectively within one-line computation to accelerate multiple pixels in parallel; and since load/store instructions occur during the computation, processing data in line units allows the load instructions for upcoming data to be interleaved into the current computation instructions in advance, hiding their load latency.
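For illustration, an assumed (not from the patent) 3x3 box-filter line operator is shown below: it computes one output line from three input lines, and its flat inner loop over x is exactly the per-pixel pattern that SIMD units vectorize well:

#include <cstdint>

void boxFilterRow(const uint8_t* above, const uint8_t* cur, const uint8_t* below,
                  uint8_t* out, int width) {
    // Interior columns only; boundary handling (x = 0 and x = width - 1) is
    // left to the caller for brevity.
    for (int x = 1; x + 1 < width; ++x) {
        int sum = 0;
        for (int dx = -1; dx <= 1; ++dx)
            sum += above[x + dx] + cur[x + dx] + below[x + dx];
        out[x] = static_cast<uint8_t>(sum / 9);
    }
}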
Optionally, since the intermediate outputs of the multi-stage filter may be invisible to the user and need not be written to DRAM, they are non-essential outputs; in a computer vision system an intermediate output is logically equivalent to an intermediate virtual image (in any representation). By specifying the line operators and input/output data characteristics of the multi-stage filter, and with sound logical handling of each stage, the user can therefore eliminate the DRAM reads and writes for these outputs.
Step 2: calculating the size and the form of each level of minimum intermediate cache according to the line operator of the multi-level filter appointed by a user and the input and output data characteristics, and distributing intermediate annular cache at a proper position;
this optional embodiment uses the on-chip cache to store the data of the intermediate results. The goal is to design, for any stage's intermediate output, the minimum cache that is just large enough to establish the computation pipeline of the filter line operators at every stage.
Optionally, a ring buffer structure may be adopted when the outputs of each stage are stored in the on-chip cache. Specifically, assume the size of the on-chip cache is L cache lines; for any filter operator, when it is scheduled to compute the n-th line of output, the result is actually written into the (n % L)-th line of the ring cache. This is equivalent to folding all intermediate outputs, originally as tall as the image, into L cache lines in total, with each cache line as wide as the original output (including any necessary alignment and other extensions that improve computation efficiency on each platform). The task at this point is to determine the number of cache lines L for each stage.
Optionally, for a certain intermediate cache, assume the preceding filter A has p outputs and the following filter B has p inputs, and that the access neighborhoods of filter B's inputs are known, i.e., the line-number range of each input that filter B's line operator depends on in a single execution is known. From this, the line-number range over all p inputs that filter B depends on in a single run can be calculated, and the size of that range is the number of lines of each intermediate cache. When line operator B runs next, the first line from the previous run is no longer needed, because the dependent neighborhood shifts down by one line in order, so it can be overwritten with the newest line of operator A's output. For each piece of intermediate data, this analysis gives the number of cache lines of every input and output; the cache line width is then determined by the operator logic, and the space is pre-allocated.
And step 3: running a row operator scheduling flow, calling a row operator according to a preset time sequence, and calculating final image output;
during the operation of the whole multi-stage filter, each operator is scheduled as follows: set the current line of every filter to 0; starting from the first-stage filter, scan each stage in turn and check whether the preceding-stage output lines that the stage's operator depends on for computing its current line have been generated; if so, run the operator once and increment the current line number; if not, stop the current scan of the multi-stage filter and start a new scan from the first-stage filter. This process repeats until all lines of the final-stage output have been generated, and the whole computation ends. Note that boundary problems and the various alignment requirements of SIMD computation must be considered in the computation flow.
The operator construction and scheduling scheme of this optional embodiment is applicable to any computing device with an on-chip cache. In the CPU case, the on-chip caches are the various levels of cache memory; at use time, only memory of the intermediate-cache sizes calculated above needs to be requested. After each intermediate ring cache has been filled once during operation, the intermediate caches, being small, can usually stay resident in cache memory, so subsequent computation never touches DRAM and all reads and writes hit only cache memory, avoiding a large amount of memory-access cost. For a digital signal processing system, the intermediate ring caches normally reside in the on-chip cache; apart from the extra DMA transfers needed by the first-stage and final-stage filters, the other intermediate caches stay in the on-chip cache throughout and introduce no additional DRAM read-write overhead. The temporary DRAM read-write loss of intermediate results is thereby eliminated as a whole, the running speed is improved from the system perspective, the load on the memory system is reduced, and the inherent defect that a single algorithm cannot be optimized holistically is overcome.
In summary, this optional embodiment of the invention provides a holistic optimization solution for multi-stage filters. Compared with existing optimizations for a single filter algorithm, it eliminates the DRAM read-write overhead of intermediate temporary results as a whole, increases the running speed and reduces the load of the memory system from the system perspective, and overcomes the inherent defect that a single algorithm cannot be optimized holistically.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a data storage device based on on-chip cache is further provided. The device is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of the data storage device based on on-chip cache according to an embodiment of the present invention. As shown in fig. 4, the device includes:
an acquisition module 40, configured to acquire a line operator set by a target object for each stage of filter in a multi-stage filter, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image;
a setting module 42, configured to allocate, on a target device, an on-chip cache matching the size of the line operator for the line operator;
and a processing module 44, configured to call the line operators of the multi-stage filter according to a preset time sequence to obtain an execution result, and store an intermediate result corresponding to an intermediate filter of the multi-stage filter to the on-chip cache.
Through the above device, a line operator set by a target object for each stage of filter in a multi-stage filter is acquired, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image; an on-chip cache matching the size of the line operator is allocated for the line operator on a target device; and the line operators of the multi-stage filter are called according to a preset time sequence to obtain an execution result, and an intermediate result, corresponding to an intermediate filter of the multi-stage filter, in the execution result is stored to the on-chip cache. That is, the intermediate results of the multi-stage filter are stored in the on-chip cache allocated for it in advance rather than in DRAM, which eliminates the temporary DRAM read-write overhead of intermediate results, reduces the load on the memory system, and improves the running speed.
Optionally, the setting module is further configured to determine a target filter corresponding to the line operator; acquire p inputs of the next-stage filter of the target filter and the address ranges of the on-chip caches corresponding to the p inputs, wherein p is a positive integer; and use the address ranges as the size of the on-chip cache set on the target device for the line operator.
That is, after the target filter corresponding to the line operator is determined, the p inputs of the next-stage filter of the target filter and the address ranges of the on-chip caches corresponding to the p inputs are acquired, and the size of the on-chip cache is set for the line operator on the target device according to those address ranges.
Optionally, the processing module is further configured to, in the case that the line operator corresponds to the n-th line of output, store the intermediate result in the (n % L)-th line of the ring cache, wherein n and L are both positive integers, L is the size of the on-chip cache in lines, and n % L denotes the remainder of n divided by L.
To make the output of each stage convenient, the on-chip cache adopts a ring-cache structure: when the line operator is determined to correspond to the n-th line of output, the intermediate result is stored in the (n % L)-th line of the ring cache. For example, assume the size of the on-chip cache is L cache lines; for any filter operator, when it is scheduled to compute the n-th line of output, the result is actually written into the (n % L)-th line of the ring cache. This is equivalent to folding the intermediate output into L cache lines in total, with each cache line as wide as the original output, thereby determining the number of cache lines L for each stage.
Optionally, the processing module is further configured to apply to the target device for an on-chip cache of the size corresponding to the line operator, so that the on-chip cache stores the intermediate result of the line operator.
After the cache size for the line operator has been determined on the on-chip cache of the target device, the memory space required by an on-chip cache of that size needs to be applied for from the target device, so that the intermediate result of the line operator can conveniently be stored in the on-chip cache and the temporary DRAM read-write overhead of the intermediate result is reduced.
Optionally, the processing module is further configured to: for each stage of filter in the multi-stage filter, establish a mapping table from the p inputs of that stage of filter to the on-chip cache corresponding to that stage, wherein p is an integer; traverse the multi-stage filter in sequence according to a preset time sequence, run the line operator of the current filter in the case that the data corresponding to the p inputs of the current filter have been generated by the preceding-stage filter, and increment the line count of the current filter; traverse the multi-stage filter again in the case that the multi-stage filter has been completely traversed but the line count of the final-stage filter has not reached the total number of lines; and determine that the scheduled execution of the multi-stage filter is completed in the case that the multi-stage filter has been completely traversed and the line count of the final-stage filter has reached the total number of lines.
That is to say, when the line operators of the multi-stage filter are called according to the preset time sequence, a mapping table from the inputs of each stage of filter to the on-chip cache corresponding to that stage is first established for every stage, in order to improve calling efficiency. Then, while traversing the multi-stage filter, whenever the data corresponding to the inputs of the current filter are found to have been generated by the preceding-stage filter, the line operator of the current filter is run directly and its line count is incremented. After the multi-stage filter has been completely traversed, if the line count of the final-stage filter has not reached the total number of lines, the multi-stage filter is traversed again; scheduling of the multi-stage filter is determined to be complete only when the line count of the final-stage filter reaches the total number of lines.
For example, the current line of every filter may first be set to 0. Starting from the first-stage filter, each stage is scanned in turn, and it is checked whether the preceding-stage output lines on which that stage's operator depends for computing its current line have been generated. If so, the operator is run once and the current line number is incremented; if not, the current scan of the multi-stage filter is stopped and a new scan cycle is started again from the first-stage filter. This process repeats until all lines of the final-stage output have been generated, and the whole computation ends.
Optionally, the processing module is further configured to look up, in the mapping table, the array of on-chip cache addresses input to the line operator of the current filter, using the line numbers corresponding to the p inputs as the index; look up, in the mapping table, the array of on-chip cache addresses output by the line operator of the current filter, using the output line number of the current filter as the index; and store the intermediate result in the on-chip cache determined from the array of on-chip cache addresses output by the line operator.
Because a mapping table from the p inputs of each stage of filter to the on-chip cache corresponding to that stage has been established for every stage of the multi-stage filter, the array of on-chip cache addresses input to the line operator of the current filter can then be looked up in the mapping table using the corresponding input line numbers, and likewise the output addresses using the output line number.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1, acquiring a line operator set by a target object for each stage of filter in a multi-stage filter, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image;
S2, allocating, on a target device, an on-chip cache matching the size of the line operator for the line operator;
S3, calling the line operators of the multi-stage filter according to a preset time sequence to obtain an execution result, and storing an intermediate result corresponding to an intermediate filter of the multi-stage filter to the on-chip cache.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute, through the computer program, the following steps:
S1, acquiring a line operator set by a target object for each stage of filter in a multi-stage filter, wherein the line operator is used for processing one line of data of a target image, and the multi-stage filter is used for processing the target image;
S2, allocating, on a target device, an on-chip cache matching the size of the line operator for the line operator;
S3, calling the line operators of the multi-stage filter according to a preset time sequence to obtain an execution result, and storing an intermediate result corresponding to an intermediate filter of the multi-stage filter to the on-chip cache.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from the one described herein. Alternatively, they may be fabricated separately as individual integrated-circuit modules, or several of them may be fabricated as a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; those skilled in the art may make various modifications and changes to it. Any modification, equivalent replacement, or improvement made within the principle of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A data storage method based on on-chip cache, comprising:
acquiring a row operator set by a target object for each stage of filter in a multi-stage filter, wherein the row operator is used for processing one row of data of a target image, and the multi-stage filter is used for processing the target image;
allocating, on a target device, an on-chip cache matched with the size of the row operator; and
invoking the row operators of the multi-stage filter according to a predetermined timing to obtain an execution result, and storing, in the on-chip cache, an intermediate result in the execution result corresponding to an intermediate filter of the multi-stage filter.
2. The method of claim 1, wherein allocating, on the target device, the on-chip cache matched with the size of the row operator comprises:
determining a target filter corresponding to the row operator;
acquiring p inputs of the filter at the stage following the target filter and the address ranges of the on-chip caches corresponding to the p inputs, wherein p is a positive integer; and
allocating, on the target device according to the address ranges, the on-chip cache matched with the size of the row operator.
3. The method of claim 1, wherein the on-chip cache is a ring cache, and storing, in the on-chip cache, the intermediate result corresponding to the intermediate filter of the multi-stage filter comprises:
in the case that the row operator corresponds to the output result of the nth row, storing the intermediate result at slot n % L of the ring cache, wherein n and L are both positive integers, L is the size of the on-chip cache, and n % L denotes the remainder of n divided by L.
4. The method of claim 1, wherein after allocating, on the target device, the on-chip cache matched with the size of the row operator, the method further comprises:
applying to the target device for an on-chip cache of the size corresponding to the row operator, so that the on-chip cache stores the intermediate result of the row operator.
5. The method of claim 1, wherein invoking the row operators of the multi-stage filter according to the predetermined timing to obtain the execution result comprises:
establishing, for each stage of filter in the multi-stage filter, a mapping table between the p inputs of that stage and the on-chip cache corresponding to that stage, wherein p is an integer;
traversing the multi-stage filter sequentially according to the predetermined timing, running the row operator of the current filter in the case that the data corresponding to the p inputs of the current filter have been generated by the preceding filter, and increasing the row count of the current filter;
traversing the multi-stage filter again in the case that the multi-stage filter has been completely traversed but the row count of the final-stage filter has not reached the total number of rows of the multi-stage filter; and
determining, in the case that the multi-stage filter has been completely traversed and the total number of rows of the multi-stage filter has been reached, that the scheduled execution of the multi-stage filter is completed.
6. The method of claim 5, further comprising:
looking up, in the mapping table, the on-chip cache address array for the row operator input of the current filter, using the row numbers corresponding to the p inputs as indices;
looking up, in the mapping table, the on-chip cache address array for the row operator output of the current filter, using the output row number of the current filter as the index; and
storing the intermediate result in the on-chip cache determined according to the on-chip cache address array of the row operator output.
7. A data storage device based on on-chip cache, comprising:
an acquisition module, configured to acquire a row operator set by a target object for each stage of filter in a multi-stage filter, wherein the row operator is used for processing one row of data of a target image, and the multi-stage filter is used for processing the target image;
a setting module, configured to allocate, on a target device, an on-chip cache matched with the size of the row operator; and
a processing module, configured to invoke the row operators of the multi-stage filter according to a predetermined timing to obtain an execution result, and to store, in the on-chip cache, an intermediate result corresponding to an intermediate filter of the multi-stage filter.
8. The apparatus of claim 7, wherein the setting module is further configured to: determine a target filter corresponding to the row operator; acquire p inputs of the filter at the stage following the target filter and the address ranges of the on-chip caches corresponding to the p inputs, wherein p is a positive integer; and take the address ranges as the size of the on-chip cache set for the row operator on the target device.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
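As a reader's aid for claims 3 and 5, the C sketch below shows one plausible shape of the claimed scheduling loop: stages are traversed in order, a stage runs only once its p input rows exist, output row n is stored at ring slot n % L, and traversal repeats until the final stage reaches its total row count. Every name, the readiness rule, and the border handling are assumptions of this sketch, not the claimed implementation.

```c
#include <stdbool.h>

typedef struct {
    int p;            /* input rows consumed per output row (claim 5) */
    int L;            /* ring-cache size in rows (claim 3)            */
    int rows_done;    /* output rows produced so far (the row count)  */
    int total_rows;   /* rows this stage must eventually produce      */
} Stage;

/* Claim 3: output row n of a stage lives at slot n % L of its ring cache. */
static int ring_slot(const Stage *s, int n) { return n % s->L; }

/* Readiness test: the preceding stage must already have produced the p
 * input rows of the next output row; the first stage reads the source
 * image and is always ready. The "preceding stage finished" clause is
 * an assumed simplification for image borders. */
static bool inputs_ready(const Stage *stages, int i) {
    return i == 0
        || stages[i - 1].rows_done >= stages[i].rows_done + stages[i].p
        || stages[i - 1].rows_done == stages[i - 1].total_rows;
}

static void schedule(Stage *stages, int n_stages) {
    /* Traverse again until the final stage reaches its total row count. */
    while (stages[n_stages - 1].rows_done < stages[n_stages - 1].total_rows) {
        for (int i = 0; i < n_stages; i++) {
            if (stages[i].rows_done < stages[i].total_rows &&
                inputs_ready(stages, i)) {
                int slot = ring_slot(&stages[i], stages[i].rows_done);
                (void)slot;            /* run the row operator, write here */
                stages[i].rows_done++; /* increase the current row count   */
            }
        }
    }
}
```

Scheduled execution is deemed complete when this loop exits, which corresponds to the traverse-and-recount condition of claim 5.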
CN202010970132.2A 2020-09-15 2020-09-15 Data storage method and device based on-chip cache and storage medium Active CN112148668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010970132.2A CN112148668B (en) 2020-09-15 2020-09-15 Data storage method and device based on-chip cache and storage medium

Publications (2)

Publication Number Publication Date
CN112148668A true CN112148668A (en) 2020-12-29
CN112148668B CN112148668B (en) 2023-03-14

Family

ID=73892818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010970132.2A Active CN112148668B (en) 2020-09-15 2020-09-15 Data storage method and device based on-chip cache and storage medium

Country Status (1)

Country Link
CN (1) CN112148668B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5809182A (en) * 1993-09-17 1998-09-15 Eastman Kodak Company Digital resampling integrated circuit for fast image resizing applications
US5623459A (en) * 1993-09-29 1997-04-22 Sony Corporation Method and apparatus for error correcting reproduced data
US6404934B1 (en) * 2000-10-20 2002-06-11 Shih-Jong J. Lee High speed image processing apparatus using a cascade of elongated filters programmed in a computer
US20070229530A1 (en) * 2006-03-29 2007-10-04 Marshall Carl S Apparatus and method for rendering a video image as a texture using multiple levels of resolution of the video image
CN101883285A (en) * 2010-07-06 2010-11-10 西安交通大学 VLSI (Very Large Scale Integration) structural design method of parallel pipeline deblocking filter
US20180121795A1 (en) * 2016-10-28 2018-05-03 Canon Kabushiki Kaisha Data processing apparatus, method for controlling the same, and storage medium storing program
CN110492867A (en) * 2019-09-27 2019-11-22 珠海市一微半导体有限公司 A kind of interpolation filter system with digital circuit

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Guanglie et al.: "Video processing algorithm for spatial template convolution filtering and its VLSI implementation", Microelectronics & Computer *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112363677A (en) * 2020-12-01 2021-02-12 浙江大华存储科技有限公司 Method and device for processing read request, storage medium and electronic device
CN112363677B (en) * 2020-12-01 2023-03-31 浙江华忆芯科技有限公司 Method and device for processing read request, storage medium and electronic device
CN118070865A (en) * 2024-04-25 2024-05-24 北京壁仞科技开发有限公司 Optimization method and device of artificial intelligent model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112148668B (en) 2023-03-14

Similar Documents

Publication Publication Date Title
CN112148668B (en) Data storage method and device based on-chip cache and storage medium
KR100865811B1 (en) Low power programmable processor
US7418576B1 (en) Prioritized issuing of operation dedicated execution unit tagged instructions from multiple different type threads performing different set of operations
CN110308982B (en) Shared memory multiplexing method and device
CN111143143A (en) Performance test method and device
CN109885310A (en) A kind of method and device reducing mobile phone games Shader module EMS memory occupation
CN115880132A (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN112579595A (en) Data processing method and device, electronic equipment and readable storage medium
CN115129480B (en) Scalar processing unit and access control method thereof
CN111104178A (en) Dynamic library loading method, terminal device and storage medium
CN111752972A (en) Data association query method and system under key-value storage mode based on RocksDB
CN111488323A (en) Data processing method and device and electronic equipment
CN111338787A (en) Data processing method and device, storage medium and electronic device
CN109324984B (en) Method and apparatus for using circular addressing in convolution operations
CN111666150B (en) Storage space allocation method and device, terminal and computer readable storage medium
CN112559116B (en) Memory migration method and device and computing equipment
WO2023169369A1 (en) Pedestrian re-identification method, system, apparatus and device, and medium
CN110308998B (en) Mass data sampling method and device
CN109800867B (en) Data calling method based on FPGA off-chip memory
CN110830385A (en) Packet capturing processing method, network equipment, server and storage medium
CN113792079B (en) Data query method and device, computer equipment and storage medium
CN113296788B (en) Instruction scheduling method, device, equipment and storage medium
CN114897661A (en) Image pixel copying method, image pixel copying device, storage medium and electronic device
CN113923212A (en) Network data packet processing method and device
US20130031309A1 (en) Segmented cache memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant