WO2019214303A1 - Method and device for batch selection of data - Google Patents

Method and device for batch selection of data

Info

Publication number
WO2019214303A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
interval
intervals
candidate
range
Prior art date
Application number
PCT/CN2019/074777
Other languages
French (fr)
Chinese (zh)
Inventor
毛坤
张臻
李翀
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2019214303A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Definitions

  • The present application relates to the field of data processing and, more particularly, to a method and apparatus for batch selection of data.
  • Before a computer processes data, it generally needs to determine target data from a huge amount of candidate data and then further process the target data, for example finding a target person or vehicle in massive amounts of video in the wave of “Safe City” deployments.
  • The input picture is connected to a plurality of candidate windows via a series of convolutional layers and fully connected layers, and the target is detected in the plurality of candidate windows.
  • The candidate data is generally sorted to determine the target data.
  • Existing distributed parallel algorithms suffer from repeated calculation, high memory requirements, and poor scalability. As a result, the selection/sorting process becomes a bottleneck that limits system performance.
  • The present application provides a method and apparatus for batch selection of data that does not need to fully sort the candidate data, avoids repeatedly computing over the candidate data, saves memory and bandwidth, and improves system efficiency.
  • A method for batch selection of data is provided, comprising: a data analyzer counts the data interval to which each piece of data in the candidate data belongs to obtain a statistical result, where the statistical result includes the number of data included in each of a plurality of data intervals, and the sum of the ranges of the data intervals is equal to the range of the data distribution interval of the candidate data; an interval statistics unit accumulates the number of data included in each data interval according to the statistical result to obtain an accumulated result, where the accumulated result for each data interval is the sum of the number of data included in that data interval and the number of data included in all data intervals before it; and a batch picker determines, according to the accumulated result, the target data interval in which the target data is located and outputs the candidate data belonging to the target data interval.
  • When accumulating the number of data included in each data interval, the interval statistics unit may perform a prefix-sum operation on the number of data included in each data interval to obtain the accumulated result of each data interval.
  • That is, the interval statistics unit may use a prefix sum to calculate the cumulative sum of the number of data included in each data interval.
  • The data intervals are ordered, but the data within each data interval is unordered, so the candidate data does not need to be fully sorted.
  • Outputting the target data requires only 2 full parallel scans and 1 parallel accumulation calculation to complete the batch selection, which avoids repeatedly computing over the candidate data, saves memory and bandwidth, and improves system efficiency.
  • The data analyzer may be a multi-core processor, a plurality of parallel processors, or a multi-threaded processor, or a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
  • The interval configurator may be a multi-core processor, a plurality of parallel processors, or a multi-threaded processor, or a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
  • The batch picker may be a multi-core processor, a plurality of parallel processors, or a multi-threaded processor, or a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
  • Each data interval corresponds to a counter configured to record the number of data in that data interval; when the data analyzer determines that a piece of data belongs to a data interval, the counter corresponding to that data interval is incremented by 1.
  • Before the data analyzer counts the data interval to which the data in the candidate data belongs, the method further includes: the interval configurator determines, according to data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals; and the interval configurator sends the plurality of data intervals and the range of each of the plurality of data intervals to the data analyzer.
  • Because the interval configurator determines the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the data information of the candidate data, the result of the subsequent batch selection can be more accurate.
  • The interval configurator determining, according to the data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals includes: when the candidate data is uniformly distributed, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to a uniform quantization strategy, where the ranges of the data intervals are equal; or, when the candidate data is non-uniformly distributed, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to a non-uniform quantization strategy, where the ranges of at least two of the plurality of data intervals are not equal.
  • When the candidate data is uniformly distributed and the range of each data interval is Δ, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the uniform quantization strategy includes determining the number M of the plurality of data intervals from Δ and x, where
  • x is the range of the data distribution interval of the candidate data, and
  • M is the number of the plurality of data intervals.
  • the method further comprises:
  • x is the data interval range of the candidate data
  • M is the number of the plurality of data intervals.
  • The interval statistics unit accumulating the number of data in the plurality of data intervals according to the statistical result includes:
  • when the target data is the smallest part of the candidate data, accumulating the number of data included in the plurality of data intervals in ascending order of the plurality of data intervals; or
  • when the target data is the largest part of the candidate data, accumulating the number of data included in the plurality of data intervals in descending order of the plurality of data intervals.
  • The data analyzer, the interval statistics unit, and the batch picker are the same physical entity or partially the same physical entities.
  • An apparatus for batch selection of data is provided, comprising:
  • a data analyzer configured to count the data interval to which each piece of data in the candidate data belongs to obtain a statistical result, where the statistical result includes the number of data included in each of a plurality of data intervals, and the sum of the ranges of the data intervals is equal to the range of the data distribution interval of the candidate data;
  • an interval statistics unit configured to accumulate the number of data included in each data interval according to the statistical result to obtain an accumulated result, where the accumulated result for each data interval is the sum of the number of data included in that data interval and the number of data included in all data intervals before it; and
  • a batch picker configured to determine, according to the accumulated result, the target data interval in which the target data is located and to output the candidate data belonging to the target data interval.
  • When accumulating the number of data included in each data interval, the interval statistics unit may perform a prefix-sum operation on the number of data included in each data interval to obtain the accumulated result of each data interval.
  • The apparatus further comprises:
  • an interval configurator configured to determine, according to data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals, and to send the plurality of data intervals and the range of each of the plurality of data intervals to the first processor.
  • The interval configurator is specifically configured to: when the candidate data is uniformly distributed, determine the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the uniform quantization strategy, where the ranges of the data intervals are equal; or, when the candidate data is non-uniformly distributed, determine the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the non-uniform quantization strategy, where the ranges of at least two of the plurality of data intervals are not equal.
  • When the candidate data is uniformly distributed and the range of each data interval is Δ, the interval configurator is specifically configured to determine the number M of the plurality of data intervals from Δ and x, where
  • x is the range of the data distribution interval of the candidate data, and
  • M is the number of the plurality of data intervals.
  • the interval configurator is specifically configured to:
  • x is the data interval range of the candidate data
  • M is the number of the plurality of data intervals.
  • The interval statistics unit is specifically configured to: when the target data is the smallest part of the candidate data, perform a prefix-sum operation on the number of data in the plurality of data intervals in ascending order of the plurality of data intervals; or, when the target data is the largest part of the candidate data, perform a prefix-sum operation on the number of data in the plurality of data intervals in descending order of the plurality of data intervals.
  • The data analyzer, the interval statistics unit, and the batch picker are the same physical device or parts of the same physical device.
  • A computer storage medium is provided that stores program instructions which, when executed, cause the method of the first aspect or of any implementation of the first aspect to be performed.
  • A computer program product is provided, comprising instructions that, when executed, cause the device for batch selection of data to perform the method of the first aspect or of any implementation of the first aspect.
  • A chip system is provided, comprising at least one processor configured to execute stored instructions so that the device for batch selection of data can perform the method of the first aspect or of any implementation of the first aspect.
  • FIG. 1 is a schematic block diagram of a system architecture of a method and apparatus for data batch selection in accordance with the present application.
  • FIG. 2 is a schematic flow chart of a method for data batch selection in the present application.
  • FIG. 3 is a schematic block diagram of accumulating the numbers of data in the data intervals by prefix sum according to the present application.
  • FIG. 4 is a schematic block diagram of accumulating the numbers of data in the data intervals by prefix sum according to the present application.
  • FIG. 5 is a schematic flowchart of a method for data batch selection according to the present application.
  • FIG. 6 is a schematic block diagram of an apparatus for data batch selection in accordance with the present application.
  • FIG. 7 is a schematic architectural diagram of a system for data batch selection in accordance with the present application.
  • FIG. 8 shows a schematic block diagram of an apparatus for batch selection of data provided by the present application.
  • The architecture of the system 100 includes a front-end collection device 110, a storage management device 120, and an intelligent analysis device 130.
  • The front-end collection device 110, the storage management device 120, and the intelligent analysis device 130 are connected through a network.
  • The front-end collection device 110 is configured to capture objects, such as human bodies, human faces, and vehicle bodies.
  • The front-end collection device 110 transmits the captured information to the storage management device 120; the storage management device 120 performs feature extraction on the information captured by the front-end collection device 110 and transmits the feature-extracted data to the intelligent analysis device 130.
  • The intelligent analysis device 130 performs batch selection based on the extracted data and outputs a detection target.
  • FIG. 1 is only an exemplary architecture diagram.
  • The system architecture may include other devices in addition to the devices shown in FIG. 1.
  • A computer-readable medium may include, but is not limited to, magnetic storage devices (e.g., hard disks, floppy disks, or magnetic tapes), optical discs (e.g., compact discs (CD) or digital versatile discs (DVD)), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), cards, sticks, or key drives).
  • The various storage media described herein can represent one or more devices and/or other machine-readable media for storing information.
  • The term "machine-readable medium" may include, but is not limited to, a variety of media capable of storing, containing, and/or carrying instructions and/or data.
  • FIG. 2 is a schematic flowchart of a method 200 for data batch selection according to an embodiment of the present application.
  • the method 200 can be applied to FIG.
  • the embodiment of the present application is not limited herein.
  • The method 200 includes the following steps.
  • Step 210: The data analyzer counts the data interval to which each piece of data in the candidate data belongs to obtain a statistical result, where the statistical result includes the number of data included in each of the plurality of data intervals, and the sum of the ranges of the data intervals is equal to the range of the data distribution interval of the candidate data.
  • The data analyzer may be a multi-core processor, a plurality of parallel processors, or a multi-threaded processor, or a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
  • The number of data that each processor is responsible for counting is equal or approximately equal, that is, the load balancing principle is satisfied: the candidate data is evenly distributed to the plurality of parallel processors, each of which counts the data intervals to which its allocated candidate data belongs to obtain a statistical result. For example, there are 9 candidate data, the data distribution interval of the candidate data is [0, 9], the data are 1, 2, 3, 4, 5, 6, 7, 8, and 9, and the data intervals are [0, 3), [3, 6), and [6, 9].
  • The data analyzer is three parallel processors.
  • According to the load balancing principle, each processor is responsible for counting three data. That is, the first of the parallel processors counts the data intervals to which the first to third of the nine data belong, the second processor counts the data intervals to which the fourth to sixth data belong, and the third processor counts the data intervals to which the seventh to ninth data belong; or the first processor counts the data intervals to which the first, fourth, and seventh of the nine data belong, the second processor counts the data intervals to which the second, fifth, and eighth data belong, and the third processor counts the data intervals to which the third, sixth, and ninth data belong. According to the statistics of the data analyzer, the number of data included in the data interval [0, 3) is 2, the number of data included in the data interval [3, 6) is 3, and the number of data included in the data interval [6, 9] is 4.
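  • Each data interval corresponds to a counter for recording the number of data in that data interval. When the data analyzer determines that a piece of data belongs to a data interval, the counter corresponding to that data interval is incremented by one.
  • Each data interval may also correspond to a memory space that is used to record the number of data in that data interval. When the data analyzer determines that a piece of data belongs to a data interval, the value in the memory space corresponding to that data interval is incremented by 1.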
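The counting in step 210 can be sketched as follows. This is only an illustration under stated assumptions: the function names `count_intervals` and `merge_counts` are hypothetical, the "processors" are simulated by plain Python lists rather than real parallel workers, and all values are assumed to lie inside the data distribution interval. It reproduces the nine-data example for both allocation schemes described above (contiguous chunks and every third datum).

```python
from bisect import bisect_right


def count_intervals(chunk, edges):
    """Count how many values of `chunk` fall into each data interval.

    `edges` holds the M + 1 interval boundaries. Every interval is
    half-open, except the last one, which also includes its upper bound.
    Values are assumed to lie within [edges[0], edges[-1]].
    """
    counts = [0] * (len(edges) - 1)
    for value in chunk:
        idx = min(bisect_right(edges, value) - 1, len(counts) - 1)
        counts[idx] += 1  # "add 1 to the counter" of that interval
    return counts


def merge_counts(partials):
    """Sum the per-processor counters interval by interval."""
    return [sum(column) for column in zip(*partials)]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
edges = [0, 3, 6, 9]          # intervals [0, 3), [3, 6), [6, 9]
p = 3                         # number of simulated processors

# Scheme 1: processor i gets the i-th contiguous chunk of three data.
block_chunks = [data[i * 3:(i + 1) * 3] for i in range(p)]
# Scheme 2: processor i gets the data at positions i, i + p, i + 2p.
strided_chunks = [data[i::p] for i in range(p)]

block_counts = merge_counts([count_intervals(c, edges) for c in block_chunks])
strided_counts = merge_counts([count_intervals(c, edges) for c in strided_chunks])
print(block_counts, strided_counts)  # [2, 3, 4] [2, 3, 4]
```

Both allocation schemes give the same per-interval totals, since the counters only depend on which interval each datum falls into, not on which processor saw it.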
  • Step 220: The interval statistics unit accumulates the number of data included in the plurality of data intervals according to the statistical result to obtain an accumulated result, where the accumulated result for each data interval is the sum of the number of data included in that data interval and the number of data included in all data intervals before it.
  • For example, the above nine candidate data are allocated to three data intervals, [0, 3), [3, 6), and [6, 9], and the interval statistics unit determines that the number of data included in [0, 3) is 2, the number of data included in [0, 6) is 5, and the number of data included in [0, 9] is 9.
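As a quick check of this example, the accumulation can be written with the standard-library `itertools.accumulate`; this is a sequential sketch, not the parallel procedure of the application, and the variable names are illustrative.

```python
from itertools import accumulate

counts = [2, 3, 4]                     # data in [0, 3), [3, 6), [6, 9]
cumulative = list(accumulate(counts))  # data in [0, 3), [0, 6), [0, 9]
print(cumulative)                      # [2, 5, 9]
```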
  • The interval statistics unit may be a multi-core processor, a plurality of parallel processors, or a multi-threaded processor, or a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
  • The interval statistics unit and the data analyzer may be the same physical entity or partially the same physical entities, and the physical entity may be a physical device or apparatus.
  • For example, if the data analyzer is three parallel processors, the interval statistics unit may also be the three parallel processors, or the interval statistics unit may be one or two of the three parallel processors.
  • Step 230: The batch picker determines, according to the accumulated result, the target data interval in which the target data is located, and outputs the candidate data belonging to the target data interval.
  • The target data is the data that needs to be selected from the candidate data; the batch picker determines the target data interval in which the target data is located according to the accumulated result of the interval statistics unit, and outputs the candidate data belonging to the target data interval.
  • The batch picker may be a multi-core processor, a plurality of parallel processors, or a multi-threaded processor, or a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
  • Each parallel processor in the batch picker may determine, according to the accumulated result, the target data interval in which the target data is located and output the candidate data belonging to the target data interval; or one parallel processor in the batch picker determines the target data interval according to the accumulated result and sends the target data interval to the other parallel processors, and each parallel processor in the batch picker outputs the candidate data belonging to the target data interval.
  • The following example takes the data analyzer to be a plurality of parallel processors.
  • The target data is the smallest two of the above nine candidate data, and the batch picker determines that the target data interval is [0, 3).
  • According to the load balancing principle, each processor is responsible for three data. Assume that the data processed by the first processor is 1, 2, 3; the data processed by the second processor is 4, 5, 6; and the data processed by the third processor is 7, 8, 9.
  • According to the target data interval, the first processor outputs 1 and 2, and the second processor and the third processor have no output.
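The selection in step 230 for this example can be sketched as below. The helper names `pick_target_interval` and `batch_pick` are hypothetical, the three processors are again simulated by lists, and the sketch covers only the "smallest k" case with intervals accumulated in ascending order.

```python
def pick_target_interval(cumulative, k):
    """Return the index of the first interval whose cumulative count reaches k."""
    for idx, total in enumerate(cumulative):
        if total >= k:
            return idx
    raise ValueError("k exceeds the number of candidate data")


def batch_pick(chunks, edges, cumulative, k):
    """Each simulated processor outputs its own data inside the target range."""
    idx = pick_target_interval(cumulative, k)
    upper = edges[idx + 1]
    is_last = idx == len(cumulative) - 1   # last interval is closed on the right
    return [
        [v for v in chunk if v < upper or (is_last and v == upper)]
        for chunk in chunks
    ]


chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # data held by processors 1 to 3
edges = [0, 3, 6, 9]                         # intervals [0, 3), [3, 6), [6, 9]
cumulative = [2, 5, 9]                       # accumulated result from step 220

outputs = batch_pick(chunks, edges, cumulative, k=2)
print(outputs)  # [[1, 2], [], []]
```

Each processor needs only the target range and one pass over its own chunk, which is why the data inside the selected interval comes out unordered.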
  • The batch picker, the data analyzer, and the interval statistics unit may be the same physical entity or partially the same physical entities, and the physical entity may be a physical device or apparatus.
  • For example, if the data analyzer is three parallel processors, the batch picker may also be the three parallel processors.
  • The additional space required is M storage locations, or M counters, for the number of data included in the M data intervals.
  • Let the number of input data be n and the number of parallel processors be p.
  • The time complexity required by the data analyzer to count the data intervals of the candidate data is O(n/p): each parallel processor analyzes n/p data.
  • The time complexity required by the batch picker to determine, according to the accumulated result, the target data interval in which the target data is located and to output the data is O(n/p): each parallel processor decides, for n/p inputs, whether or not to output them.
  • The data intervals are ordered, but the data within each data interval is unordered, so the candidate data does not need to be fully sorted.
  • Outputting the target data requires only 2 full parallel scans and 1 parallel accumulation calculation to complete the batch selection, which avoids repeatedly computing over the candidate data, saves memory and bandwidth, and improves system efficiency.
  • The interval statistics unit accumulating the number of data in the plurality of data intervals according to the statistical result includes:
  • when the target data is the smallest part of the candidate data, accumulating the number of data in the plurality of data intervals in ascending order of the plurality of data intervals; or
  • when the target data is the largest part of the candidate data, accumulating the number of data in the plurality of data intervals in descending order of the plurality of data intervals.
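The two accumulation orders can be illustrated with the per-interval counts 2, 3, 4 from the nine-data example. The function name `cumulative_counts` and its `largest` flag are assumptions made for this sketch.

```python
from itertools import accumulate


def cumulative_counts(counts, largest=False):
    """Accumulate per-interval counts.

    Ascending interval order is used when the smallest data are wanted,
    descending interval order when the largest data are wanted.
    """
    ordered = counts[::-1] if largest else counts
    return list(accumulate(ordered))


counts = [2, 3, 4]                                  # [0, 3), [3, 6), [6, 9]
ascending = cumulative_counts(counts)               # [0, 3), [0, 6), [0, 9]
descending = cumulative_counts(counts, largest=True)  # [6, 9], [3, 9], [0, 9]
print(ascending, descending)  # [2, 5, 9] [4, 7, 9]
```

With the descending result, a request for the largest three data is already satisfied by the first entry, so the target data interval would be [6, 9], which holds four data.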
  • the ascending order of the plurality of data intervals accumulating the number of the plurality of data intervals
  • the smallest qth to pth data according to the plurality of data
  • the descending order of the interval is accumulated for the number of the plurality of data intervals.
  • The interval statistics unit may use a prefix sum to calculate the accumulated sum of the number of data included in each data interval.
  • A prefix sum is an algorithm for cumulative summation. It is defined as follows:
  • each output element is the sum of the inputs from the first element up to the current position.
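In symbols, and assuming the notation x_0, ..., x_{M-1} for the per-interval counts and y_0, ..., y_{M-1} for the outputs (chosen here to match the x and y used in the steps of this description), the verbal definition corresponds to the inclusive prefix sum:

```latex
y_i = \sum_{j=0}^{i} x_j, \qquad i = 0, 1, \dots, M-1
```

For the counts 2, 3, 4 this gives y_0 = 2, y_1 = 5, y_2 = 9.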
  • When the number of data intervals is less than or equal to twice the number of parallel processors, the accumulation calculation can be performed according to the following steps:
  • Each parallel processor calculates the sum of the numbers of two consecutive data intervals (assume that there are 8 data intervals whose numbers of data are, from left to right, x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, and that the number of parallel processors is 20).
  • In step d_0, processor 1 calculates x_0 + x_1, processor 2 calculates x_2 + x_3, processor 3 calculates x_4 + x_5, and processor 4 calculates x_6 + x_7.
  • Next, processor 5 calculates Σ(x_0, x_1) + Σ(x_2, x_3), processor 6 calculates Σ(x_4, x_5) + Σ(x_6, x_7), and processor 7 calculates Σ(x_0 ... x_3) + Σ(x_4 ... x_7). If the number of data intervals is not a power of 2, the final update result is postponed in the recursion.
  • Processor 8 moves the saved "0" left to the position corresponding to data interval x_3 (shown by the dashed line in step d_0 in FIG. 4), and the value Σ(x_0 ... x_3) that is replaced by the left move is added to the saved value "0" to form the new value (shown by the solid line in step d_0 in FIG. 4).
  • Processor 9 moves the saved "0" left to the position corresponding to data interval x_1 (shown by the dashed line in step d_1 in FIG. 4), and the value Σ(x_0, x_1) that is replaced by the left move is added to the saved value "0" to form the new value (shown by the solid line in step d_1 in FIG. 4). Processor 10 moves the saved "Σ(x_0 ... x_3)" left to the position corresponding to data interval x_5 (shown by the dashed line in step d_1 in FIG. 4), and the value Σ(x_4, x_5) that is replaced by the left move is added to the saved value "Σ(x_0 ... x_3)" to form the new value (shown by the solid line in step d_1 in FIG. 4). And so on, to obtain the values of y_0, y_1, ..., y_(n-1).
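The pairwise summing followed by the "move the saved value left and add the replaced value" pass reads like the well-known work-efficient (up-sweep/down-sweep) parallel scan. The sketch below is a sequential simulation under that assumption; it is not taken from the application, the function name is hypothetical, and it only handles lengths that are powers of two. Each inner loop iteration is independent of the others at the same stride, so in a parallel setting each could be given to its own processor.

```python
def up_down_sweep_scan(values):
    """Exclusive prefix sum via an up-sweep and a down-sweep.

    Sequential simulation: at every stride, the inner-loop iterations are
    independent and could run on separate processors.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch requires a power-of-two length")
    tree = list(values)

    # Up-sweep: sum pairs, then pairs of pairs, up to the total.
    stride = 1
    while stride < n:
        for right in range(2 * stride - 1, n, 2 * stride):
            tree[right] += tree[right - stride]
        stride *= 2

    # Down-sweep: start from a saved 0, move the saved value to the left
    # child and store (replaced left value + saved value) in the right child.
    tree[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for right in range(2 * stride - 1, n, 2 * stride):
            left = right - stride
            replaced = tree[left]
            tree[left] = tree[right]
            tree[right] += replaced
        stride //= 2
    return tree


x = [2, 3, 4, 1, 5, 0, 2, 3]              # illustrative counts x_0 ... x_7
exclusive = up_down_sweep_scan(x)
inclusive = [e + v for e, v in zip(exclusive, x)]
print(exclusive)  # [0, 2, 5, 9, 10, 15, 15, 17]
print(inclusive)  # [2, 5, 9, 10, 15, 15, 17, 20]
```

The down-sweep yields the exclusive scan (each output excludes its own input); adding the original counts back gives the inclusive sums y_0 ... y_7 that the accumulated result calls for.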
  • When the number of data intervals is greater than twice the number of parallel processors, the accumulation calculation can be performed according to the following steps:
  • The numbers of the data intervals are divided into multiple blocks, and the number of data intervals included in each block is less than or equal to twice the number of parallel processors.
  • Each block calculates its own prefix sum using the above method for the case where the number of data intervals is less than or equal to twice the number of parallel processors.
  • Block 0 is left unchanged; each element of block 1 (block elements y_0 ... y_n) has y_0 of the auxiliary group added to it, each element of block 2 has y_1 of the auxiliary group added to it, each element of block 3 has y_2 of the auxiliary group added to it, ..., and each element of block m has y_(m-1) of the auxiliary group added to it. This completes the prefix sum.
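A blocked prefix sum of this kind can be sketched as follows. One assumption is made loudly here: the auxiliary group is taken to be the prefix sum of the per-block totals, since the step that builds it is not spelled out in the text above. The per-block scan uses `itertools.accumulate` as a stand-in for the parallel method, and `blocked_prefix_sum` is a hypothetical name.

```python
from itertools import accumulate


def blocked_prefix_sum(counts, block_size):
    """Inclusive prefix sum computed block by block.

    Assumption: the auxiliary group holds the prefix sums of the block
    totals, so entry m - 1 is the offset to add to every element of block m.
    """
    blocks = [counts[i:i + block_size] for i in range(0, len(counts), block_size)]
    scanned = [list(accumulate(block)) for block in blocks]   # per-block scans
    block_totals = [block[-1] for block in scanned]           # last element of each block
    auxiliary = list(accumulate(block_totals))                # scan of the totals

    result = list(scanned[0]) if scanned else []              # block 0 is unchanged
    for m in range(1, len(scanned)):
        offset = auxiliary[m - 1]
        result.extend(value + offset for value in scanned[m])
    return result


counts = [2, 3, 4, 1, 5, 0, 2, 3, 7]
blocked = blocked_prefix_sum(counts, block_size=4)
print(blocked)  # [2, 5, 9, 10, 15, 15, 17, 20, 27]
```

The last block may be shorter than the others; the offsets are unaffected because they depend only on the totals of the blocks before it.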
  • When the data analyzer counts the data interval to which the data in the candidate data belongs, the plurality of data intervals and the range of each of the plurality of data intervals have already been allocated to the data analyzer.
  • The plurality of data intervals and the range of each of the plurality of data intervals are saved in a shared memory, and the data analyzer can obtain the plurality of data intervals and the range of each of the plurality of data intervals by reading the shared memory; or a memory local to the data analyzer stores the plurality of data intervals and the range of each of the plurality of data intervals.
  • The method 200 further includes 240 before 210, as shown in FIG.
  • The interval configurator determines, according to the data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals, and the interval configurator sends the plurality of data intervals and the range of each of the plurality of data intervals to the data analyzer.
  • Because the interval configurator determines the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the data information of the candidate data, the result of the subsequent batch selection can be more accurate.
  • The interval configurator can allocate the candidate data to the data analyzer according to the load balancing principle.
  • In the embodiments of the present application, the candidate data may also be received by other components and then allocated to the data analyzer, which is not limited in this application.
  • The interval configurator determining, according to the data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals includes:
  • when the candidate data is uniformly distributed, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the uniform quantization strategy, where the ranges of the data intervals are equal; or
  • when the candidate data is non-uniformly distributed, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the non-uniform quantization strategy, where the ranges of at least two of the plurality of data intervals are not equal.
  • When the candidate data is uniformly distributed, the number of the plurality of data intervals and the range of each of the plurality of data intervals may be determined according to the uniform quantization strategy; when the candidate data is non-uniformly or extremely unevenly distributed (that is, equal-width intervals would cause a serious imbalance of data between the intervals), the number of the plurality of data intervals and the range of each of the plurality of data intervals are determined according to the non-uniform quantization strategy.
  • When the candidate data is uniformly distributed and the range of each data interval is Δ, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the uniform quantization strategy includes determining M according to
  • $M = x / \Delta$    (1)
  • where x is the range of the data distribution interval of the candidate data, and
  • M is the number of the plurality of data intervals.
  • The number M of the plurality of data intervals can be determined according to the uniform quantization formula, that is, equation (1).
  • For example, for a set of uniformly distributed candidate data 7, 3, 9, 1, 5, the data distribution interval ranges from 0 to 10; when the range of each data interval is 2, it is determined according to formula (1) that 5 data intervals are allocated, with the ranges [0, 2), [2, 4), [4, 6), [6, 8), [8, 10).
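The arithmetic of this example can be checked with a small sketch. It assumes the uniform quantization relation is M = x / Δ, which is what the numbers here (10 / 2 = 5) suggest; `uniform_intervals` and `interval_index` are hypothetical helper names, and the sketch only accepts a Δ that divides the range exactly.

```python
def uniform_intervals(low, high, delta):
    """Return (M, edges) for equal-width intervals of width `delta`."""
    span = high - low                       # x: range of the data distribution
    m = round(span / delta)                 # assumed relation M = x / delta
    if m <= 0 or abs(m * delta - span) > 1e-9:
        raise ValueError("delta must evenly divide the data range in this sketch")
    edges = [low + i * delta for i in range(m + 1)]
    return m, edges


def interval_index(value, low, delta, m):
    """Index of the equal-width interval containing `value` (last one closed)."""
    return min(int((value - low) // delta), m - 1)


m, edges = uniform_intervals(0, 10, 2)
counts = [0] * m
for value in [7, 3, 9, 1, 5]:
    counts[interval_index(value, 0, 2, m)] += 1

print(m, edges)  # 5 [0, 2, 4, 6, 8, 10]
print(counts)    # [1, 1, 1, 1, 1]
```

With equal-width intervals the interval index is a single division, so no search over the boundaries is needed, and each of the five intervals receives exactly one datum in this example.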
  • The range Δ of each of the data intervals may also be determined according to formula (1).
  • The number M of the plurality of data intervals may be determined from the number of candidate data and the number of target data to be output, and then the range Δ of each data interval is determined according to equation (1).
  • For example, the total number of candidate data is 9 and the target data to be determined is the largest three of the candidate data; dividing the total number of candidate data, 9, by the number of data to be selected, 3, gives the number M of the plurality of data intervals as 3, and then the range Δ of each data interval is determined according to equation (1).
  • When the candidate data is non-uniformly distributed and the number of the plurality of data intervals and the range of each of the plurality of data intervals are determined according to the non-uniform quantization strategy, probability distribution information of the candidate data needs to be obtained. The number of the plurality of data intervals and the range of each of the plurality of data intervals are then determined according to the probability distribution information of the candidate data in combination with the non-uniform quantization strategy, so that more data intervals correspond to the dense portion of the candidate data and fewer data intervals correspond to the sparse portion of the candidate data.
  • the selected non-uniform quantization strategy is to use the Lloyd-Max method, which converts the problem into a distortion minimization problem, that is, the minimum distortion formula of equation (2).
  • given M, the optimal b_i and y_i in equation (2) minimize the mean squared quantization error (MSQE).
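For orientation, the textbook Lloyd-Max formulation is given below. It is offered as an assumption about what equation (2) expresses, since the equation itself does not appear in this text; the symbols f_X (probability density of the candidate data), b_i (interval boundaries) and y_i (representative value of interval i) follow the usual convention and are not confirmed by the source.

```latex
\sigma_q^2 = \sum_{i=1}^{M} \int_{b_{i-1}}^{b_i} (x - y_i)^2 \, f_X(x)\, dx ,
\qquad
y_i = \frac{\int_{b_{i-1}}^{b_i} x\, f_X(x)\, dx}{\int_{b_{i-1}}^{b_i} f_X(x)\, dx},
\qquad
b_i = \frac{y_i + y_{i+1}}{2}
```

The first expression is the MSQE to be minimized; the second and third are the standard necessary conditions (each y_i is the centroid of its interval, and each boundary lies midway between neighboring representative values) that the Lloyd-Max iteration alternates between.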
  • for example, the candidate data 9, 4, 5, 6, 1 is non-uniformly distributed.
  • the data is concentrated in the middle and sparse on both sides. If the uniform strategy is still used with a data interval range Δ of 2, step 110 yields: 1 value in the [0, 2) interval, 0 in [2, 4), 2 in [4, 6), 1 in [6, 8), and 1 in [8, 10). If the two smallest numbers are sought, step 120 yields: [0, 2) holds 1, [0, 4) still holds only 1, [0, 6) jumps to 3, [0, 8) holds 4, and finally [0, 10) holds 5.
  • step 130 must then select the [0, 6) range, so the final output is the three smallest numbers instead of two. A uniform strategy is therefore not suitable here.
  • with a non-uniform quantization strategy, the Lloyd-Max method can instead produce 5 data intervals of different sizes: [0, 3), [3, 4.5), [4.5, 5.5), [5.5, 7), [7, 10).
  • the selected range then becomes [0, 4.5), and the final output target data is 4 and 1.
  • the number of data intervals is not increased (it is still 5), yet the "precision" of batch selection is improved.
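The comparison between the two strategies can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the patent's implementation: `select_smallest` is a hypothetical name, intervals are treated as half-open [b_i, b_{i+1}), and every value is assumed to lie below the last boundary. Under these assumptions the uniform edges return more than the two requested values, while the non-uniform edges quoted above return exactly 4 and 1.

```python
from bisect import bisect_right
from itertools import accumulate


def select_smallest(data, boundaries, k):
    """Return every value in the shortest run of leading intervals holding >= k values.

    `boundaries` lists the interval edges b0 < b1 < ... < bM; interval i is [b_i, b_{i+1}).
    """
    # Scan 1: count how many values fall into each interval.
    counts = [0] * (len(boundaries) - 1)
    for v in data:
        counts[bisect_right(boundaries, v) - 1] += 1
    # Prefix sum over the counts, in ascending interval order.
    cumulative = list(accumulate(counts))
    # The first interval whose cumulative count reaches k marks the cut-off.
    last = next(i for i, c in enumerate(cumulative) if c >= k)
    upper = boundaries[last + 1]
    # Scan 2: output every value below the cut-off (unsorted within the result).
    return [v for v in data if v < upper]


data = [9, 4, 5, 6, 1]
uniform = [0, 2, 4, 6, 8, 10]
non_uniform = [0, 3, 4.5, 5.5, 7, 10]

coarse = select_smallest(data, uniform, 2)      # overshoots: more than 2 values
exact = select_smallest(data, non_uniform, 2)   # exactly the 2 smallest values
```

The only difference between the two calls is the boundary list, which is why the interval configuration step has such a direct effect on how closely the output matches the requested count.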
  • a method for batch selection of data according to an embodiment of the present application is described in detail with reference to FIG. 2 to FIG. 5 .
  • the method keeps the data intervals ordered while the data inside each data interval remains unordered, so the candidate data does not need to be fully sorted. Outputting the target data requires only 2 fully parallel scans and 1 parallel accumulation to complete batch selection, which avoids repeated computation over the candidate data, saves memory and bandwidth, and improves system efficiency.
  • determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the data information of the candidate data may make the result of the subsequent batch selection more accurate.
  • a method of batch selection of data of the present application will be described below in conjunction with a specific set of candidate data.
  • the candidate data are the following 9 values: 0.66, 0.44, 0.99, 0.33, 0.11, 0.55, 0.22, 0.77, 0.88.
  • the target data are the three largest values among the candidate data.
  • the data analyzer consists of three parallel processors, and the range of the data interval is not constrained in this case.
  • the number M of the data interval should be adjusted as small as possible to minimize the performance formula according to the performance formula O(n/p
  • the candidate value range is (0.0, 1.0).
  • the range of each data interval is 1/3 (0.33333...).
  • the ranges that the three parallel processors are responsible for are (0.0, 1/3], (1/3, 2/3), (2/3, 1.0).
  • initially, the count corresponding to each data interval is 0, as shown in Table 1.
  • each of the three parallel processors is responsible for three of the nine candidate data.
  • the first processor is responsible for the data 0.66, 0.44, 0.99.
  • the second processor is responsible for the data 0.33, 0.11, 0.55.
  • the third processor is responsible for the data 0.22, 0.77, 0.88.
  • the three processors count the data they process simultaneously; the counting can either keep local subtotals that are totaled afterwards, or update the totals directly under global synchronization.
  • an example of direct totaling under global synchronization is as follows.
  • the first processor determines that 0.66 belongs to the interval (1/3, 2/3).
  • the second processor determines that 0.33 belongs to the interval (0.0, 1/3].
  • the third processor determines that 0.22 belongs to the interval (0.0, 1/3].
  • after the first round of counting, the count of each data interval is as shown in Table 2.
  • the first processor determines that 0.44 belongs to the interval (1/3, 2/3), the second processor determines that 0.11 belongs to the interval (0.0, 1/3], and the third processor determines that 0.77 belongs to the interval (2/3, 1.0). After the second round of counting, the count of each data interval is as shown in Table 3.
  • the first processor determines that 0.99 belongs to the interval (2/3, 1.0).
  • the second processor determines that 0.55 belongs to the interval (1/3, 2/3).
  • the third processor determines that 0.88 belongs to the interval (2/3, 1.0).
  • after the third round of counting, the count of each data interval is as shown in Table 4.
  • the interval accumulator accumulates the counts of the three data intervals; the accumulated result for each data interval is the sum of the number of data it contains and the number of data contained in all data intervals before it. In this example the three largest values are selected, so the accumulation runs in descending order of the data intervals, and the accumulated result is shown in Table 5. That is, the one interval in the (2/3, 1.0) range contains the largest 3 values, the two intervals in the (1/3, 1.0) range contain the largest 6 values, and the three intervals in the (0.0, 1.0) range contain the largest 9 values (all of the values).
  • the batch picker determines that the data interval of the target data is (2/3, 1.0); it is assumed here that the batch picker is the above three parallel processors.
  • therefore, the three parallel processors each output their own data belonging to the data interval (2/3, 1.0): the first processor outputs 0.99, the second processor has no output, and the third processor outputs 0.77, 0.88.
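The worked example above can be traced end to end with the following Python sketch. It is a sequential simulation under stated assumptions, not the patent's implementation: the three processors are modeled as three slices of the candidate list, the helper `interval_index` is hypothetical, and intervals are treated as half-open [a, b) (none of the nine values lies on a boundary, so the per-interval counts are unaffected).

```python
from itertools import accumulate

candidates = [0.66, 0.44, 0.99, 0.33, 0.11, 0.55, 0.22, 0.77, 0.88]
k = 3                      # select the three largest values
num_intervals = 3          # M = 9 / 3
lo, hi = 0.0, 1.0
width = (hi - lo) / num_intervals


def interval_index(value):
    """Map a value to its interval index (0 = lowest interval)."""
    return min(int((value - lo) / width), num_intervals - 1)


# Each of the three processors handles three candidates (simulated sequentially here).
shares = [candidates[0:3], candidates[3:6], candidates[6:9]]

# Scan 1: count the candidates per interval.
counts = [0] * num_intervals
for share in shares:
    for value in share:
        counts[interval_index(value)] += 1

# Accumulate in descending interval order because the largest values are wanted.
cumulative = list(accumulate(reversed(counts)))

# The target range ends at the first (highest) interval whose cumulative count reaches k.
depth = next(i for i, c in enumerate(cumulative) if c >= k)
threshold_index = num_intervals - 1 - depth

# Scan 2: every processor outputs its own values that fall in the target range.
outputs = [[v for v in share if interval_index(v) >= threshold_index] for share in shares]
```

Here `counts` ends as three values per interval, `cumulative` as 3, 6, 9 in descending interval order, and `outputs` matches the per-processor output described above: 0.99 from the first share, nothing from the second, and 0.77, 0.88 from the third.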
  • FIG. 6 is a schematic block diagram of an apparatus 300 for data batch selection in accordance with the present application. As shown in Figure 6, the device 300 includes the following modules.
  • the data analyzer 310 is configured to count the data interval to which each datum in the candidate data belongs, to obtain a statistical result, where the statistical result includes the number of data included in each of the plurality of data intervals, and the sum of the ranges of the data intervals is equal to the data distribution range of the candidate data.
  • the interval statistic 320 is configured to accumulate the counts of the plurality of data intervals according to the statistical result, to obtain an accumulated result, where the accumulated result is the sum of the number of data included in each data interval and the number of data included in all data intervals before that data interval.
  • the batch picker 330 is configured to determine, according to the accumulated result, a target data interval in which the target data is located, and output candidate data belonging to the target data interval.
  • the apparatus 300 further includes an interval configurator 340, configured to determine, according to the data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals; the interval configurator transmits the plurality of data intervals and the range of each of the plurality of data intervals to the first processor.
  • the interval configurator 340 is specifically configured to: when the candidate data is uniformly distributed, determine the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the uniform quantization strategy, where the ranges of all data intervals are equal; or when the candidate data is non-uniformly distributed, determine the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the non-uniform quantization strategy, where the ranges of at least two of the plurality of data intervals are not equal.
  • the interval configurator 340 is specifically configured to: determine the number M of the plurality of data intervals according to the formula (1).
  • the interval configurator 340 is specifically configured to: determine, according to the number of the candidate data and the number of the target data to be output, the number M of the plurality of data intervals; and determine the range Δ of each data interval according to formula (1).
  • the second processor is specifically configured to: when the target data is the smallest part of the candidate data, perform a prefix-sum calculation on the counts of the plurality of data intervals in ascending order of the data intervals; or when the target data is the largest part of the candidate data, perform a prefix-sum calculation on the counts of the plurality of data intervals in descending order of the data intervals.
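The choice of accumulation direction can be sketched as follows. This is a minimal illustration, assuming the per-interval counts are already available as a list ordered from the lowest interval to the highest; the function name `accumulate_counts` and the sample counts are hypothetical.

```python
from itertools import accumulate


def accumulate_counts(counts, want_largest):
    """Prefix-sum the per-interval counts in the order that matches the request.

    Returns (interval_index, cumulative_count) pairs in accumulation order.
    """
    order = range(len(counts) - 1, -1, -1) if want_largest else range(len(counts))
    order = list(order)
    running = accumulate(counts[i] for i in order)
    return list(zip(order, running))


counts = [1, 0, 2, 1, 1]   # hypothetical counts for five ascending intervals
smallest_first = accumulate_counts(counts, want_largest=False)
largest_first = accumulate_counts(counts, want_largest=True)
```

Keeping the interval index next to each running total lets the later selection step read off the cut-off interval directly, whichever direction was used.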
  • the data analyzer, the interval statistic, and the batch picker are the same physical device or part of the same physical device.
  • the data analyzer 310, the interval statistic 320, the batch picker 330, and the interval configurator 340 are used to perform the operations of the method 200 for data batch selection of the present application. Details are not repeated here.
  • the data analyzer, the interval statistic, the batch picker and the interval configurator correspond fully to the data analyzer, the interval statistic, the batch picker and the interval configurator in the method embodiment, and each module executes the corresponding steps; for details, refer to the corresponding method embodiments.
  • the data analyzer 310, the interval statistic 320, the batch picker 330, and the interval configurator 340 may be separately configured or integrated together and implemented by one processing chip.
  • the device of the present application is applicable to the PRAM model, and various parallel processors, accelerators, GPUs, FPGAs, ASICs, clouds, and edges can be configured.
  • the cloud system is taken as an example to describe a system for batch selection of data in the present application.
  • FIG. 7 is a schematic architectural diagram of a system for data batch selection in accordance with the present application.
  • the system 400 includes a data analyzer 410, an interval statistic 420, a batch picker 430, and an interval configurator 440.
  • the data analyzer 410 is configured to count the data interval to which each datum in the candidate data belongs, to obtain a statistical result, where the statistical result includes the number of data included in each of the plurality of data intervals, and the sum of the ranges of the data intervals is equal to the data distribution range of the candidate data.
  • the interval statistic 420 is configured to accumulate the counts of the plurality of data intervals according to the statistical result, to obtain an accumulated result, where the accumulated result is the sum of the number of data included in each data interval and the number of data included in all data intervals before that data interval.
  • the batch picker 430 is configured to determine, according to the accumulated result, a target data interval in which the target data is located, and output candidate data belonging to the target data interval.
  • the interval configurator 440 is configured to determine, according to the data information of the candidate data, a number of the plurality of data intervals and a range of each of the plurality of data intervals;
  • the interval configurator 440 transmits a range of each of the plurality of data intervals and the plurality of data intervals to the data analyzer 410.
  • the interval configurator is further configured to allocate candidate data to the data analyzer 410 and the batch picker 430.
  • the interval configurator 440 transmits partial data in the candidate data to the data analyzer 410.
  • the data analyzer 410 counts the data interval to which each datum in the candidate data belongs to obtain a statistical result, and writes the statistical result into the first shared memory, where the statistical result includes the number of data included in each of the plurality of data intervals, and the sum of the ranges of the data intervals is equal to the data distribution range of the candidate data.
  • the data analyzer 410 sends a first message to the interval statistic 420, the first message being used to instruct the interval statistic 420 to accumulate the number of the plurality of data intervals according to the statistical result.
  • the interval statistic 420 accumulates the counts of the plurality of data intervals according to the statistical result to obtain an accumulated result, where the accumulated result is the sum of the number of data included in each data interval and the number of data included in all data intervals preceding that data interval, and writes the accumulated result into the second shared memory.
  • the interval statistic 420 sends a second message to the batch picker 430, where the second message is used to instruct the batch picker 430 to determine a target data interval in which the target data is located according to the accumulated result.
  • the batch picker 430 outputs the target data according to the target data interval.
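The message-driven hand-off between the three components can be sketched with Python threads and queues. This is only a single-process stand-in for the cloud deployment, under stated assumptions: the dictionary `shared` plays the role of the shared memories, the two queues carry the first and second messages, and all names (`shared`, `to_statistic`, `to_picker`, `index_of`, `BOUNDARIES`, `K`) are hypothetical.

```python
import queue
import threading
from itertools import accumulate

shared = {}                               # stands in for the shared memories
to_statistic = queue.Queue()              # carries the "first message"
to_picker = queue.Queue()                 # carries the "second message"
result = queue.Queue()

BOUNDARIES = [0.0, 1 / 3, 2 / 3, 1.0]     # interval edges from the configurator
K = 3                                     # number of largest values wanted


def index_of(value):
    """Index of the highest interval whose lower edge does not exceed the value."""
    return max(i for i, b in enumerate(BOUNDARIES[:-1]) if value >= b)


def data_analyzer(data):
    counts = [0] * (len(BOUNDARIES) - 1)
    for v in data:
        counts[index_of(v)] += 1
    shared["counts"] = counts             # write the statistical result
    to_statistic.put("counts ready")      # first message


def interval_statistic():
    to_statistic.get()                    # wait for the first message
    shared["cumulative"] = list(accumulate(reversed(shared["counts"])))
    to_picker.put("cumulative ready")     # second message


def batch_picker(data):
    to_picker.get()                       # wait for the second message
    depth = next(i for i, c in enumerate(shared["cumulative"]) if c >= K)
    first_index = len(BOUNDARIES) - 2 - depth
    result.put([v for v in data if index_of(v) >= first_index])


data = [0.66, 0.44, 0.99, 0.33, 0.11, 0.55, 0.22, 0.77, 0.88]
threads = [
    threading.Thread(target=batch_picker, args=(data,)),
    threading.Thread(target=interval_statistic),
    threading.Thread(target=data_analyzer, args=(data,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
selected = result.get()
```

The threads are started in reverse pipeline order on purpose: each stage blocks on its queue until the previous stage has written its result, so the ordering is enforced by the messages rather than by start order.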
  • the data analyzer 410 may include a processor with multiple cores, may include multiple parallel processors, may include a multi-threaded processor, or may be a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
  • the interval statistic 420 may include a processor with multiple cores, may include multiple parallel processors, may include a multi-threaded processor, or may be a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
  • the batch picker 430 may include a processor with multiple cores, may include multiple parallel processors, may include a multi-threaded processor, or may be a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
  • the first shared memory, the second shared memory, and the third shared memory may be the same shared memory.
  • each data interval is delivered to the distributed memory group corresponding to one processor, and the data analyzer, the batch picker, and the interval statistic are deployed in a distributed manner in software form.
  • the data analyzer 410, the interval statistic 420, the batch picker 430, and the interval configurator may perform communication interaction through respective sub-processors included.
  • the data analyzer 410 can include three distributed processors, and the interval statistic 420 can include three distributed processors: the first processor is responsible for counting the number in the (0, 3) interval, the second processor for the (3, 6) interval, and the third processor for the (6, 9) interval. The three distributed processors can be deployed in the same physical location. When any one of the processors in the data analyzer 410 determines the data interval to which a candidate datum belongs, it sends an indication message to the corresponding processor in the interval statistic 420, instructing that processor to update the count of the data interval it is responsible for.
  • for example, if any one of the processors in the data analyzer 410 determines that a candidate datum belongs to the data interval (0, 3), that processor sends an indication message to the first processor, instructing the first processor to add 1 to its count.
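The indication-message scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: threads stand in for the distributed processors, one queue per interval owner stands in for the message channel, intervals are treated as half-open [low, high), the sample values are hypothetical, and a `None` sentinel (an addition made for this sketch) tells each owner that counting has finished.

```python
import queue
import threading

INTERVALS = [(0, 3), (3, 6), (6, 9)]      # one interval per statistic processor
inboxes = [queue.Queue() for _ in INTERVALS]
totals = [0] * len(INTERVALS)


def statistic_processor(i):
    """Owner of interval i: add 1 for every indication message, stop on None."""
    while inboxes[i].get() is not None:
        totals[i] += 1


def analyzer_processor(chunk):
    """Classify each value and notify the processor that owns its interval."""
    for value in chunk:
        for i, (low, high) in enumerate(INTERVALS):
            if low <= value < high:
                inboxes[i].put("increment")


owners = [threading.Thread(target=statistic_processor, args=(i,)) for i in range(3)]
analyzers = [threading.Thread(target=analyzer_processor, args=(c,))
             for c in ([1, 4, 7], [2, 5, 8], [0, 3, 6])]
for t in owners + analyzers:
    t.start()
for t in analyzers:
    t.join()
for box in inboxes:
    box.put(None)                          # tell each owner that counting is done
for t in owners:
    t.join()
```

Because each counter is written only by the single processor that owns its interval, no lock is needed on the counts themselves; the queues serialize the increments.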
  • FIG. 8 is a schematic block diagram of a device 500 for data batch selection provided by the present application, the device 500 including:
  • a memory 510 configured to store a program, where the program includes a code
  • the transceiver 520 is configured to communicate with other devices;
  • the processor 530 is configured to execute program code in the memory 510.
  • the processor 530 can implement various operations of the method 200.
  • the transceiver 520 is configured to perform specific signal transceiving under the driving of the processor 530.
  • FIG. 8 only shows a schematic block diagram of a device for data batch selection.
  • the memory 510, the transceiver 520, and the processor 530 share the same system bus, but the memory 510, the transceiver 520, and the processor 530 may also be directly connected to one another.
  • the connection relationship between the components of the device for batch selection of data is not limited in this application.
  • the processor 530 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a logical function division.
  • in actual implementation there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes various media that can store program code, such as a USB flash drive (U disk), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Algebra (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided by the present application are a method and device for batch selection of data, which do not require fully sorting candidate data, thus avoiding repeated calculations over the candidate data, saving memory and bandwidth, and improving system efficiency. The method comprises: a data analyzer counts the data interval to which each datum among candidate data belongs so as to obtain a statistical result, the statistical result comprising the number of data comprised in each data interval among a plurality of data intervals, and the sum of the interval ranges of the data intervals being equal to the data distribution interval range of the candidate data; an interval counter adds up the amount of data comprised in each data interval according to the statistical result so as to obtain an accumulative result, the accumulative result being the sum of the amount of data comprised in each data interval and the amount of data comprised in all data intervals before that data interval; and a batch selector determines a target data interval in which target data is located according to the accumulative result, and outputs candidate data belonging to the target data interval.

Description

Method and device for batch selection of data
Technical field
The present application relates to the field of data processing and, more particularly, to a method and apparatus for batch selection of data.
Background
Before a computer processes data, it generally needs to determine target data from a massive amount of candidate data and then process that target data further. For example, in the “Safe City” wave, a target person or vehicle must be found in massive amounts of video. As another example, when the Faster R-CNN (faster region-based convolutional neural network) is used for image object detection, the input image passes through a series of convolutional layers and fully connected layers to generate multiple candidate windows, and the target is detected among those candidate windows. The prior art generally fully sorts the candidate data and then determines the target data. For ultra-large-scale data, it has become increasingly difficult to speed up traditional sorting or selection algorithms merely by raising the processor clock frequency; yet existing distributed parallel algorithms suffer from repeated computation, high memory requirements, and poor scalability, so the selection/sorting stage becomes an insurmountable bottleneck that limits improvement of system performance.
How to find target data accurately and quickly in massive data is a problem that urgently needs to be solved.
Summary of the invention
The present application provides a method and apparatus for batch selection of data that do not require fully sorting the candidate data, which avoids repeated computation over the candidate data, saves memory and bandwidth, and improves system efficiency.
In a first aspect, a method for batch selection of data is provided. The method includes: a data analyzer counts the data interval to which each datum in the candidate data belongs, to obtain a statistical result, where the statistical result includes the number of data contained in each of a plurality of data intervals, and the sum of the ranges of the data intervals is equal to the data distribution range of the candidate data; an interval statistic accumulates, according to the statistical result, the number of data contained in each data interval, to obtain an accumulated result, where the accumulated result is the sum of the number of data contained in each data interval and the number of data contained in all data intervals before that data interval; and a batch picker determines, according to the accumulated result, the target data interval in which the target data is located, and outputs the candidate data belonging to the target data interval.
The interval statistic may accumulate the number of data contained in each data interval by performing a prefix-sum operation over those numbers, to obtain the accumulated result for each data interval.
Optionally, the interval statistic may use a prefix sum to compute the cumulative sum of the number of data contained in each data interval.
Therefore, in the embodiments of the present application, the data intervals are ordered while the data within each data interval is unordered; the candidate data does not need to be fully sorted, and batch selection of the target data can be completed with only 2 fully parallel scans and 1 parallel accumulation, which avoids repeated computation over the candidate data, saves memory and bandwidth, and improves system efficiency.
With reference to the first aspect, in some implementations of the first aspect, the data analyzer may be a processor with multiple cores, may be multiple parallel processors, may be a multi-threaded processor, or may be a combination of the multi-core processor, the multiple parallel processors, and the multi-threaded processor.
With reference to the first aspect, in some implementations of the first aspect, the interval configurator may be a processor with multiple cores, may be multiple parallel processors, may be a multi-threaded processor, or may be a combination of the multi-core processor, the multiple parallel processors, and the multi-threaded processor.
With reference to the first aspect, in some implementations of the first aspect, the batch picker may be a processor with multiple cores, may be multiple parallel processors, may be a multi-threaded processor, or may be a combination of the multi-core processor, the multiple parallel processors, and the multi-threaded processor.
With reference to the first aspect, in some implementations of the first aspect, each data interval corresponds to a counter that records the number of data in that data interval; when the data analyzer determines that a datum belongs to the data interval, it adds 1 to the counter corresponding to that data interval.
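A per-interval counter can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: to avoid contention between workers, each worker here keeps its own local counters that are merged afterwards, which is an assumption made for this sketch; the name `local_counts`, the sample values, and the interval width of 2 are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_counts(chunk, width):
    """Count one worker's chunk: add 1 to the counter of the interval each value falls in."""
    counters = Counter()
    for value in chunk:
        counters[int(value // width)] += 1
    return counters


data = [7, 3, 9, 1, 5, 4, 6, 8, 2]        # hypothetical candidates in [0, 10)
chunks = [data[0:3], data[3:6], data[6:9]]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(lambda c: local_counts(c, 2), chunks))

# Merge the local subtotals into one counter per interval.
totals = sum(partials, Counter())
```

Merging local counters trades a small final reduction step for the absence of shared-state updates during the scan; updating one shared counter per interval under synchronization would be the alternative.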
With reference to the first aspect, in some implementations of the first aspect, before the data analyzer counts the data interval to which each datum in the candidate data belongs, the method further includes: an interval configurator determines, according to the data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals; and the interval configurator sends the plurality of data intervals and the range of each of the plurality of data intervals to the data analyzer.
In this case, because the interval configurator determines the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the data information of the candidate data, the result of the subsequent batch selection can be more accurate.
With reference to the first aspect, in some implementations of the first aspect, the interval configurator determining, according to the data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals includes: when the candidate data is uniformly distributed, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to a uniform quantization strategy, where the ranges of all data intervals are equal; or when the candidate data is non-uniformly distributed, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to a non-uniform quantization strategy, where the ranges of at least two of the plurality of data intervals are not equal.
With reference to the first aspect, in some implementations of the first aspect, when the candidate data is uniformly distributed and the range of each data interval is Δ, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the uniform quantization strategy includes:
determining the number M of the plurality of data intervals according to formula (1),
M = x/Δ                              (1)
where x is the data distribution range of the candidate data, and M is the number of the plurality of data intervals.
With reference to the first aspect, in some implementations of the first aspect, the method further includes:
determining the number M of the plurality of data intervals according to the number of the candidate data and the number of the target data to be output;
determining the range Δ of each data interval according to formula (1),
M = x/Δ                              (1)
where x is the data distribution range of the candidate data, and M is the number of the plurality of data intervals.
With reference to the first aspect, in some implementations of the first aspect, the interval statistic accumulating the counts of the plurality of data intervals according to the statistical result includes:
when the target data is the smallest part of the candidate data, accumulating the counts of the plurality of data intervals in ascending order of the data intervals; or
when the target data is the largest part of the candidate data, accumulating the counts of the plurality of data intervals in descending order of the data intervals.
With reference to the first aspect, in some implementations of the first aspect, the data analyzer, the interval statistic, and the batch picker are the same physical entity or partly the same physical entity.
According to a second aspect, an apparatus for batch selection of data is provided, the apparatus including:
a data analyzer, configured to count the data intervals to which the data in the candidate data belong, to obtain a statistical result, where the statistical result includes the number of data items contained in each of a plurality of data intervals, and the sum of the ranges of the data intervals equals the data distribution range of the candidate data;
an interval statistics unit, configured to accumulate, according to the statistical result, the number of data items contained in each data interval, to obtain an accumulated result, where the accumulated result for a data interval is the sum of the number of data items contained in that data interval and the numbers of data items contained in all data intervals before it; and
a batch picker, configured to determine, according to the accumulated result, the target data interval in which the target data is located, and to output the candidate data belonging to the target data interval.
The interval statistics unit may accumulate the number of data items contained in each data interval by performing a prefix sum operation over the per-interval counts, to obtain the accumulated result of each data interval.
With reference to the second aspect, in some implementations of the second aspect, the apparatus further includes:
an interval configurator, configured to determine, according to the data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals; the interval configurator sends the plurality of data intervals and the range of each of the plurality of data intervals to the first processor.
With reference to the second aspect, in some implementations of the second aspect, the interval configurator is specifically configured to: when the candidate data is uniformly distributed, determine the number of data intervals and the range of each data interval according to a uniform quantization strategy, where the ranges of all data intervals are equal; or when the candidate data is non-uniformly distributed, determine the number of data intervals and the range of each data interval according to a non-uniform quantization strategy, where the ranges of at least two of the data intervals are not equal.
With reference to the second aspect, in some implementations of the second aspect, when the candidate data is uniformly distributed and the range of each data interval is Δ, the interval configurator is specifically configured to:
determine the number M of the plurality of data intervals according to formula (1):
M = x/Δ    (1)
where x is the data interval range of the candidate data, and M is the number of the plurality of data intervals.
With reference to the second aspect, in some implementations of the second aspect, the interval configurator is specifically configured to:
determine the number M of the plurality of data intervals according to the number of candidate data items and the number of target data items to be output; and
determine the range Δ of each data interval according to formula (1):
M = x/Δ    (1)
where x is the data interval range of the candidate data, and M is the number of the plurality of data intervals.
With reference to the second aspect, in some implementations of the second aspect, the interval statistics unit is specifically configured to: when the target data is the smallest part of the candidate data, perform a prefix sum operation on the counts of the plurality of data intervals in ascending order of the data intervals; or when the target data is the largest part of the candidate data, perform a prefix sum operation on the counts of the plurality of data intervals in descending order of the data intervals.
With reference to the second aspect, in some implementations of the second aspect, the data analyzer, the interval statistics unit, and the batch picker are the same physical device or parts of the same physical device.
According to a third aspect, a computer storage medium is provided. The computer storage medium stores program instructions which, when executed, cause the method in the first aspect or any optional implementation of the first aspect to be performed.
According to a fourth aspect, a computer program product is provided. The computer program product includes instructions which, when executed, cause the apparatus for batch selection of data to perform the method in the first aspect or any optional implementation of the first aspect.
According to a seventh aspect, a chip system is provided, including at least one processor, where the at least one processor is configured to execute stored instructions so that the apparatus for batch selection of data can perform the method in the first aspect or any optional implementation of the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic block diagram of a system architecture for a method and device for batch selection of data according to this application.
FIG. 2 is a schematic flowchart of a method for batch selection of data according to this application.
FIG. 3 is a schematic block diagram of accumulating the counts of a plurality of data intervals using a prefix sum according to this application.
FIG. 4 is a schematic block diagram of accumulating the counts of a plurality of data intervals using a prefix sum according to this application.
FIG. 5 is a schematic flowchart of a method for batch selection of data according to this application.
FIG. 6 is a schematic block diagram of an apparatus for batch selection of data according to this application.
FIG. 7 is a schematic architectural diagram of a system for batch selection of data according to this application.
FIG. 8 is a schematic block diagram of a device for batch selection of data provided by this application.
Detailed Description
The technical solutions in this application are described below with reference to the accompanying drawings.
FIG. 1 is a schematic block diagram of the architecture of a system 100 for a method and device for batch selection of data according to this application. As shown in FIG. 1, the system 100 includes a front-end collection device 110, a storage management device 120, and an intelligent analysis device 130, which are connected through a network. The front-end collection device 110 is configured to capture objects, for example snapshots of human bodies, faces, or vehicle bodies, and passes the captured information to the storage management device 120. The storage management device 120 performs feature extraction on the information captured by the front-end collection device 110 and passes the feature-extracted data to the intelligent analysis device 130. The intelligent analysis device 130 performs batch selection on the feature-extracted data and outputs the detection target.
It should be noted that FIG. 1 is only an exemplary architecture diagram. In addition to the devices shown in FIG. 1, the system architecture may include other devices, which is not limited in the embodiments of this application.
The technical solutions of the embodiments of this application can be applied in various fields. In deep learning, any task that enumerates over candidate regions relies on a sorting algorithm, which the algorithm of this application can replace to improve speed. The solutions are equally applicable in other fields that sort data and then select results.
In addition, the aspects or features of this application may be implemented as a method, an apparatus, or an article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used in this application covers a computer program accessible from any computer-readable device, carrier, or medium. For example, computer-readable media may include, but are not limited to: magnetic storage devices (for example, hard disks, floppy disks, or magnetic tapes), optical discs (for example, compact discs (CD) and digital versatile discs (DVD)), smart cards, and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks, or key drives). In addition, the various storage media described herein may represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" may include, but is not limited to, various media capable of storing, containing, and/or carrying instructions and/or data.
A method for batch selection of data provided by this application is described in detail below with reference to FIG. 2. FIG. 2 is a schematic flowchart of a method 200 for batch selection of data according to an embodiment of this application. The method 200 can be applied in the scenario shown in FIG. 1 and can also be applied in other scenarios, which is not limited in the embodiments of this application.
As shown in FIG. 2, the method 200 includes the following steps.
Step 210: The data analyzer counts the data intervals to which the data in the candidate data belong, to obtain a statistical result. The statistical result includes the number of data items contained in each of a plurality of data intervals, and the sum of the ranges of the data intervals equals the data distribution range of the candidate data.
Optionally, the data analyzer may be a multi-core processor, a plurality of parallel processors, a multi-threaded processor, or a combination of a multi-core processor, a plurality of parallel processors, and a multi-threaded processor.
Specifically, take the case where the data analyzer is a plurality of parallel processors. To increase the computing speed of the system, each processor is generally made responsible for an equal or approximately equal number of data items, that is, the load balancing principle is satisfied. The candidate data are distributed evenly across the parallel processors, and each processor counts the data intervals to which its assigned candidate data belong, to obtain the statistical result. For example, suppose there are 9 candidate data items whose data distribution range is [0, 9], the data are 1, 2, 3, 4, 5, 6, 7, 8, 9, and the data intervals are [0, 3), [3, 6), and [6, 9]. If the data analyzer consists of 3 parallel processors, then by the load balancing principle each processor counts 3 data items. That is, the first processor counts the data intervals of the first to third of the nine data items, the second processor counts those of the fourth to sixth, and the third processor counts those of the seventh to ninth. Alternatively, the first processor counts the data intervals of the first, fourth, and seventh data items, the second processor counts those of the second, fifth, and eighth, and the third processor counts those of the third, sixth, and ninth. After counting by the data analyzer, data interval [0, 3) contains 2 data items, data interval [3, 6) contains 3 data items, and data interval [6, 9] contains 4 data items.
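The counting in step 210 can be sketched as follows, using the nine-item example above. The function names are hypothetical, the workers are simulated sequentially rather than run in parallel, and the round-robin split corresponds to the second assignment described in the example.

```python
from bisect import bisect_right

# Left edges of the intervals [0,3), [3,6), [6,9] from the example above.
EDGES = [0, 3, 6]

def interval_index(value, edges=EDGES):
    """Return the index of the interval that contains value."""
    # Values at or above the last left edge (including the closed right
    # end 9) map to the last interval.
    return bisect_right(edges, value) - 1

def count_chunk(chunk, edges=EDGES):
    """Histogram built by one (hypothetical) worker over its share of the data."""
    counts = [0] * len(edges)
    for value in chunk:
        counts[interval_index(value, edges)] += 1
    return counts

def histogram(data, workers=3, edges=EDGES):
    """Split data round-robin across workers and merge their per-interval counts."""
    partial = [count_chunk(data[w::workers], edges) for w in range(workers)]
    return [sum(column) for column in zip(*partial)]

counts = histogram([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(counts)  # [2, 3, 4]
```

Each worker keeps its own counters here and the results are merged afterwards; shared counters updated by all processors would be an alternative design that needs atomic increments.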
It should be understood that, provided the load balancing principle is satisfied, this application does not limit the specific manner in which candidate data are assigned to the data analyzer.
Optionally, each data interval corresponds to a counter that records the number of data items in that data interval. When the data analyzer determines that a data item belongs to the data interval, the counter corresponding to that data interval is incremented by 1.
It should be understood that each data interval may instead correspond to a memory location that records the number of data items in that data interval. When any processor determines that a data item belongs to the data interval, the value in the corresponding memory location is incremented by 1.
Step 220: The interval statistics unit accumulates, according to the statistical result, the numbers of data items included in the plurality of data intervals, to obtain an accumulated result. The accumulated result for a data interval is the sum of the number of data items contained in that data interval and the numbers of data items contained in all data intervals before it.
Specifically, for the 9 candidate data items above, which are assigned to the three data intervals [0, 3), [3, 6), and [6, 9], the interval statistics unit determines that [0, 3) contains 2 data items, [0, 6) contains 5 data items, and [0, 9] contains 9 data items.
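The accumulation in step 220 is a running sum over the per-interval counts. A minimal sketch with the counts from the example, using the standard-library `itertools.accumulate` in place of a parallel implementation:

```python
from itertools import accumulate

counts = [2, 3, 4]                      # counts for [0,3), [3,6), [6,9]
cumulative = list(accumulate(counts))   # running totals for [0,3), [0,6), [0,9]
print(cumulative)  # [2, 5, 9]
```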
Optionally, the interval statistics unit may be a multi-core processor, a plurality of parallel processors, a multi-threaded processor, or a combination of a multi-core processor, a plurality of parallel processors, and a multi-threaded processor.
Optionally, the interval statistics unit and the data analyzer may be the same physical entity or partially the same physical entities, where a physical entity may be a physical component, device, or apparatus. For example, if the data analyzer consists of 3 parallel processors, the interval statistics unit may be the same 3 parallel processors, or it may be one or two of those 3 parallel processors.
Step 230: The batch picker determines, according to the accumulated result, the target data interval in which the target data is located, and outputs the candidate data belonging to the target data interval.
Specifically, the target data is the data to be selected from the candidate data. The batch picker determines the target data interval in which the target data is located according to the accumulated result of the interval statistics unit, and outputs the candidate data belonging to the target data interval.
Optionally, the batch picker may be a multi-core processor, a plurality of parallel processors, a multi-threaded processor, or a combination of a multi-core processor, a plurality of parallel processors, and a multi-threaded processor.
Optionally, each parallel processor in the batch picker may determine the target data interval according to the accumulated result and output the candidate data belonging to the target data interval. Alternatively, one of the parallel processors in the batch picker determines the target data interval according to the accumulated result and sends the target data interval to the other parallel processors, and each parallel processor in the batch picker then outputs the candidate data belonging to the target data interval.
Specifically, take the case where the batch picker is a plurality of parallel processors. Suppose the target data is the 2 smallest of the 9 candidate data items above; the batch picker then determines that the target data interval is [0, 3). Assuming the batch picker consists of 3 parallel processors, by the load balancing principle each processor handles 3 data items. Suppose the first processor handles 1, 2, 3; the second processor handles 4, 5, 6; and the third processor handles 7, 8, 9. Based on the target data interval, the first processor outputs 1 and 2, and the second and third processors output nothing.
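Steps 210 to 230 can be combined into one short sketch for the "k smallest" case. The function name `select_smallest` is hypothetical, and the sketch assumes every value is at least `edges[0]`. It returns all candidates in the intervals up to and including the first interval whose cumulative count reaches k, so the result is unordered and may contain more than k items when the boundary interval holds extra data.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate

def select_smallest(data, edges, k):
    """Return the candidates lying in the intervals that cover the k smallest values."""
    counts = [0] * len(edges)
    for v in data:                           # first scan: per-interval counts
        counts[bisect_right(edges, v) - 1] += 1
    cumulative = list(accumulate(counts))    # prefix sum of the counts
    # Index of the first interval whose cumulative count reaches k.
    target = bisect_left(cumulative, k)
    # Second scan: keep every value in intervals 0..target.
    return [v for v in data if bisect_right(edges, v) - 1 <= target]

selected = select_smallest([1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 3, 6], 2)
print(selected)  # [1, 2]
```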
Optionally, the batch picker, the data analyzer, and the interval statistics unit may be the same physical entity or partially the same physical entities, where a physical entity may be a physical component, device, or apparatus. For example, if the data analyzer consists of 3 parallel processors, the batch picker may be the same 3 parallel processors.
In the embodiments of this application, apart from the input and output data space of size N, the only additional space required is storage for the counts of the M data intervals, that is, M storage locations or M counters. Let the number of input data items be n and the number of parallel processors be p. The data analyzer then counts the data intervals of the candidate data with time complexity O(n/p): each parallel processor determines, for n/p inputs, which interval counter to increment. When the interval statistics unit accumulates the counts of the data intervals according to the statistical result, the time complexity is O(log M) when p ≥ M. The batch picker determines the target data interval according to the accumulated result with time complexity O(n/p): each parallel processor decides, for n/p inputs, whether to output them. The method scales well: the number of parallel processors can grow up to p = n while maintaining performance. When p = n, the overall cost O(n/p) + O(log M) + O(n/p) becomes O(2) + O(log M), that is, O(log M).
Therefore, in the embodiments of this application, the data intervals are ordered while the data within each data interval remain unordered, so the candidate data do not need to be fully sorted. Outputting the target data requires only two fully parallel scans and one parallel accumulation, which avoids repeated computation over the candidate data, saves memory and bandwidth, and improves system efficiency.
Optionally, the interval statistics unit accumulating, according to the statistical result, the numbers of data items included in the plurality of data intervals includes:
when the target data is the smallest part of the candidate data, accumulating the numbers of data items included in the plurality of data intervals in ascending order of the data intervals; or
when the target data is the largest part of the candidate data, accumulating the numbers of data items included in the plurality of data intervals in descending order of the data intervals.
Specifically, when selecting the n-th to m-th largest data items in the candidate data (for example, the 100 largest items, where n = 1 and m = 100, or the 50th to 90th largest items, where n = 50 and m = 90), the counts of the data intervals are accumulated in descending order of the data intervals; when selecting the q-th to p-th smallest data items, the counts of the data intervals are accumulated in ascending order of the data intervals.
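The two accumulation orders listed above (ascending for the smallest data, descending for the largest data) differ only in the direction in which the counts are traversed. A minimal sketch, assuming the counts are stored in ascending interval order and using a hypothetical function name:

```python
from itertools import accumulate

def cumulative_counts(counts, largest=False):
    """Running totals over the interval counts.

    counts is ordered by ascending interval. With largest=True the counts are
    traversed from the highest interval downwards, so entry i is the number
    of data items in the i + 1 highest intervals.
    """
    ordered = counts[::-1] if largest else counts
    return list(accumulate(ordered))

print(cumulative_counts([2, 3, 4]))                # [2, 5, 9]
print(cumulative_counts([2, 3, 4], largest=True))  # [4, 7, 9]
```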
Specifically, the interval statistics unit may use a prefix sum to compute the running total of the number of data items contained in each data interval. The prefix sum is an algorithm for computing running totals and is defined as follows:
Input: x_0, x_1, x_2, x_3, …, x_n
Output: y_0, y_1, y_2, y_3, …, y_n
where
y_0 = x_0,
y_1 = x_0 + x_1,
y_2 = x_0 + x_1 + x_2,
y_3 = x_0 + x_1 + x_2 + x_3,
…
y_n = x_0 + x_1 + x_2 + x_3 + … + x_n.
That is, each output element is the sum of the inputs from the first position up to the current position.
The following describes in detail how the prefix sum algorithm is used to accumulate the counts of the plurality of data intervals.
When the number of data intervals is less than or equal to twice the number of parallel processors included in the accumulator, the accumulation can be performed according to the following steps:
(1) Each parallel processor computes the sum of the counts of two consecutive data intervals. (Assume there are 8 data intervals whose counts, from left to right, are x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, and there are 20 parallel processors. As in row d = 0 of FIG. 3, processor 1 computes x_0 + x_1, processor 2 computes x_2 + x_3, processor 3 computes x_4 + x_5, and processor 4 computes x_6 + x_7.)
(2) Recursively, using half as many processors as in the previous step, compute the sum of each pair of consecutive values that were updated in the previous step. (As in rows d = 1 and d = 2 of FIG. 3, processor 5 computes Σ(x_0, x_1) + Σ(x_2, x_3), processor 6 computes Σ(x_4, x_5) + Σ(x_6, x_7), and processor 7 computes Σ(x_0…x_3) + Σ(x_4…x_7).) If the number of data intervals is not a power of 2, the last updated result is carried over to the next level of the recursion.
(3) When the recursion ends, the last element is the value of y_n (the rightmost value in the top row of FIG. 3, Σ(x_0…x_3) + Σ(x_4…x_7)). This value is recorded, and the position is then set to 0 (as in the top row of FIG. 4).
(4) Recurse in the reverse order of the recursion above (steps d = 0, d = 1, d = 2 in FIG. 4, from top to bottom): first use one processor to process the values of step d_2 of the recursion above, then use two processors to process the values of step d_1, and so on until the recursion ends.
In the reverse recursion, processor 8 moves the saved "0" left to the position corresponding to data interval x_3 (dashed line in step d_0 of FIG. 4), and stores the sum of the value displaced by the move, Σ(x_0…x_3), and the saved value "0" as the new value (solid line in step d_0 of FIG. 4). Processor 9 moves the saved "0" left to the position corresponding to data interval x_1 (dashed line in step d_1 of FIG. 4), and stores the sum of the displaced value Σ(x_0, x_1) and the saved value "0" as the new value (solid line in step d_1 of FIG. 4). Processor 10 moves the saved "Σ(x_0…x_3)" left to the position corresponding to data interval x_5 (dashed line in step d_1 of FIG. 4), and stores the sum of the displaced value Σ(x_4, x_5) and the saved value "Σ(x_0…x_3)" as the new value (solid line in step d_1 of FIG. 4). Continuing in this way yields the values of y_0, y_1, …, y_(n-1).
(5) When the recursion ends, the values of y_0, y_1, …, y_(n-1) are available. Combined with the previously recorded value of y_n, this completes the prefix sum.
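The steps above follow the familiar up-sweep and down-sweep structure of a work-efficient parallel scan. The sketch below simulates that structure sequentially: each inner `for` loop stands for the work that separate processors would do in parallel at one level. The function name is hypothetical, and padding the input with zeros to a power of two is an assumption made for simplicity rather than the carry-over handling described in step (2).

```python
def two_phase_inclusive_scan(values):
    """Inclusive prefix sum via an up-sweep and a down-sweep (sequential simulation)."""
    n = len(values)
    if n == 0:
        return []
    size = 1
    while size < n:
        size *= 2
    tree = list(values) + [0] * (size - n)   # zero-pad to a power of two (assumption)

    # Up-sweep (steps 1 and 2): add pairs of partial sums level by level.
    step = 1
    while step < size:
        for i in range(2 * step - 1, size, 2 * step):
            tree[i] += tree[i - step]
        step *= 2

    # Step 3: record the grand total, then put 0 in its place.
    total = tree[size - 1]
    tree[size - 1] = 0

    # Down-sweep (step 4): move the saved value to the left position and
    # store (displaced left value + saved value) in the right position.
    step = size // 2
    while step >= 1:
        for i in range(2 * step - 1, size, 2 * step):
            left = tree[i - step]
            tree[i - step] = tree[i]
            tree[i] += left
        step //= 2

    # tree now holds the sums shifted right by one position (starting with 0);
    # step 5 drops the leading 0 and appends the recorded total.
    return tree[1:n] + [total]

print(two_phase_inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```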
When the number of data intervals is greater than twice the number of parallel processors included in the accumulator, the accumulation can be performed according to the following steps:
(1) Divide the counts of the data intervals into multiple blocks, each containing no more than twice as many data intervals as there are parallel processors.
(2) For each block, compute the prefix sum of that block using the method described above for the case where the number of data intervals is less than or equal to twice the number of parallel processors.
(3) The last value of each block (that is, the y_n recorded in step (3) of the method above for each block) forms a new auxiliary array. Compute the prefix sum of the auxiliary array using the same method for the case where the number of data intervals is less than or equal to twice the number of parallel processors.
(4) Block 0 is left unchanged. Add y_0 of the auxiliary array to every element (y_0 … y_n of the block) of block 1, add y_1 of the auxiliary array to every element of block 2, add y_2 of the auxiliary array to every element of block 3, …, and add y_(m-1) of the auxiliary array to every element of block m. This completes the prefix sum.
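The blocked procedure above can be sketched as follows. The function name and the `block_size` parameter are hypothetical, and `itertools.accumulate` stands in for the per-block parallel prefix sum.

```python
from itertools import accumulate

def blocked_scan(values, block_size):
    """Inclusive prefix sum computed block by block with an auxiliary array."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Step 1: cut the counts into blocks.
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    # Step 2: prefix sum inside each block.
    scanned = [list(accumulate(block)) for block in blocks]
    # Step 3: prefix sum over the last value of every block.
    auxiliary = list(accumulate(block[-1] for block in scanned))
    # Step 4: block 0 stays as is; block m gets auxiliary[m - 1] added.
    result = list(scanned[0]) if scanned else []
    for m in range(1, len(scanned)):
        offset = auxiliary[m - 1]
        result.extend(v + offset for v in scanned[m])
    return result

print(blocked_scan([1, 2, 3, 4, 5, 6, 7], 3))  # [1, 3, 6, 10, 15, 21, 28]
```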
It should be understood that when, in 210, the data analyzer counts the data intervals to which the data in the candidate data belong, the plurality of data intervals and the range of each data interval have already been configured for the data analyzer. Optionally, the plurality of data intervals and the range of each data interval are stored in a shared memory, and the data analyzer obtains them by reading the shared memory; alternatively, they are stored in the local memory of the data analyzer.
If the data analyzer has not obtained the plurality of data intervals and the range of each data interval before it counts the data intervals to which the data in the candidate data belong, the method 200 further includes step 240 before 210, as shown in FIG. 5.
In step 240, an interval configurator determines, according to data information of the candidate data, the number of data intervals and the range of each data interval, and sends the plurality of data intervals and the range of each data interval to the data analyzer.
Because the interval configurator determines the number of data intervals and the range of each data interval according to the data information of the candidate data, the result of the subsequent batch selection can be more accurate.
Optionally, the interval configurator may allocate the candidate data to the data analyzer according to a load balancing principle.
It should be understood that, in the embodiments of this application, another component may instead receive the candidate data and then allocate it to the data analyzer; this application does not limit this.
Optionally, that the interval configurator determines, according to the data information of the candidate data, the number of data intervals and the range of each data interval includes:
when the candidate data is uniformly distributed, determining the number of data intervals and the range of each data interval according to a uniform quantization strategy, where the ranges of all data intervals are equal; or
when the candidate data is non-uniformly distributed, determining the number of data intervals and the range of each data interval according to a non-uniform quantization strategy, where the ranges of at least two of the data intervals are unequal.
Specifically, when the candidate data is uniformly or approximately uniformly distributed, the number of data intervals and the range of each data interval may be determined according to the uniform quantization strategy. When the candidate data is non-uniformly or extremely unevenly distributed (that is, equal-width intervals would leave the amount of data severely unbalanced across intervals), they are determined according to the non-uniform quantization strategy.
When the candidate data is uniformly distributed and the range of each data interval is Δ, determining the number of data intervals and the range of each data interval according to the uniform quantization strategy includes:
determining the number M of data intervals according to formula (1):
M = x/Δ    (1)
where x is the range spanned by the candidate data and M is the number of data intervals.
Specifically, when the candidate data is uniformly distributed, the probability distribution information of the candidate data is not needed. The number M of data intervals can be determined directly from the uniform quantization formula, that is, formula (1).
For example, consider a set of candidate data 7, 3, 9, 1, 5 that is uniformly distributed over the range 0 to 10. When the range of each data interval is 2, formula (1) gives 5 data intervals, whose ranges are [0, 2), [2, 4), [4, 6), [6, 8), and [8, 10).
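The uniform case can be sketched in a few lines. This assumes the data range is an exact multiple of Δ and that intervals are half-open as in the example above; the function name `uniform_intervals` and its parameters are hypothetical.

```python
def uniform_intervals(low, high, delta):
    """Apply formula (1), M = x / delta, and list the half-open intervals [a, b)."""
    # x is the width of the data range; rounding guards against float noise and
    # assumes (high - low) is an exact multiple of delta.
    m = round((high - low) / delta)
    edges = [low + i * delta for i in range(m + 1)]
    return m, list(zip(edges[:-1], edges[1:]))
```

For the example data range, `uniform_intervals(0, 10, 2)` yields M = 5 and the five intervals listed above, each pair `(a, b)` standing for [a, b).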
Further, after the number M of data intervals has been determined from the number of candidate data and the number of target data to be output, the range Δ of each data interval can be determined according to formula (1).
Specifically, when the candidate data is uniformly distributed and the range Δ of each data interval has not been fixed, the number M of data intervals can first be determined from the number of candidate data and the number of target data to be output, and the range Δ of each data interval can then be determined according to formula (1).
For example, if there are 9 candidate data in total and the target data is the 3 largest of them, dividing the total of 9 by the 3 data to be selected gives M = 3 data intervals, and the range Δ of each data interval is then determined according to formula (1).
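The reverse use of formula (1) can be sketched as below. The function name `intervals_from_counts`, the integer (floor) division, and the lower bound of one interval are assumptions made for illustration; the text only gives the exact case 9/3.

```python
def intervals_from_counts(n_candidates, n_targets, value_range):
    """Derive M from the data counts, then solve formula (1) for delta = x / M."""
    # Assumption: round down when the counts do not divide evenly, and keep
    # at least one interval.
    m = max(1, n_candidates // n_targets)
    delta = value_range / m
    return m, delta
```

With 9 candidates, 3 targets, and a hypothetical value range of 1.0, this gives M = 3 and Δ = 1/3.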
When the candidate data is non-uniformly distributed, determining the number of data intervals and the range of each data interval according to the non-uniform quantization strategy requires the probability distribution information of the candidate data. The number and ranges of the data intervals are determined from this probability distribution information together with the non-uniform quantization strategy, so that dense parts of the candidate data correspond to more data intervals and sparse parts to fewer.
For example, suppose the probability density function of the candidate data is f(x), the data is divided into M classes, and the chosen non-uniform quantization strategy is the Lloyd-Max method, which turns the problem into a distortion minimization problem. The distortion to be minimized is
[Figure PCTCN2019074777-appb-000001: equation (2)]
In equation (2), for a given M, the optimal b_i and y_i minimize the mean squared quantization error (MSQE), that is,
[Figure PCTCN2019074777-appb-000002]
which gives:
[Figure PCTCN2019074777-appb-000003]
where b_i is a boundary point between data intervals.
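For orientation, the Lloyd-Max conditions are usually written as below in the quantization literature. This is the standard textbook form, given on the assumption that equation (2) and the two expressions that follow it take this shape. The symbols f(x), M, b_i, and y_i come from the text; reading y_i as the representative value of interval i and taking b_0 and b_M as the outer limits of the data range are assumptions.

```latex
% Standard Lloyd-Max formulation (assumed form; not copied from the application's figures)
\begin{align}
D &= \sum_{i=1}^{M} \int_{b_{i-1}}^{b_i} (x - y_i)^2 f(x)\,dx \\
\frac{\partial D}{\partial b_i} &= 0, \qquad \frac{\partial D}{\partial y_i} = 0 \\
b_i &= \frac{y_i + y_{i+1}}{2}, \qquad
y_i = \frac{\int_{b_{i-1}}^{b_i} x\, f(x)\,dx}{\int_{b_{i-1}}^{b_i} f(x)\,dx}
\end{align}
```

In this form each boundary lies midway between neighboring representative values and each representative value is the centroid of its interval, so intervals become narrower where f(x) is large.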
A concrete example illustrates the non-uniform quantization strategy. Suppose the candidate data 9, 4, 5, 6, 1 is non-uniformly distributed: concentrated in the middle and sparse at both ends. If the uniform strategy were still used with a data interval range Δ of 2, then in 110 the counts would be: 1 value in [0, 2), 0 in [2, 4), 2 in [4, 6) (the values 4 and 5), 1 in [6, 8) (the value 6), and 1 in [8, 10). To find the 2 smallest values, the cumulative counts after 120 are: 1 for [0, 2), still 1 for [0, 4), a jump to 3 for [0, 6), 4 for [0, 8), and 5 for [0, 10). Step 130 therefore has to select the range [0, 6), so the final output is the 3 smallest values rather than 2. The uniform strategy is thus unsuitable here. With a non-uniform quantization strategy, the Lloyd-Max method can set the 5 data intervals to different sizes: [0, 3), [3, 4.5), [4.5, 5.5), [5.5, 7), [7, 10). In 110, each data interval then contains exactly 1 value. In 130, the selected range becomes [0, 4.5), and the final output target data is 4 and 1. The "precision" of batch selection is improved without increasing the number of data intervals (still 5).
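The effect of the interval boundaries on the selection can be checked with a short sketch. It assumes half-open intervals [a, b), values strictly below the last edge, and ascending accumulation (selecting the smallest values); the helper names `cumulative_counts` and `smallest_k` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def cumulative_counts(data, edges):
    """Count values per half-open interval [edges[i], edges[i + 1]) and accumulate."""
    counts = [0] * (len(edges) - 1)
    for value in data:
        # bisect_right puts a value equal to an edge into the interval that starts there.
        counts[bisect_right(edges, value) - 1] += 1
    return counts, list(accumulate(counts))


def smallest_k(data, edges, k):
    """Return every value in the shortest run of leading intervals holding >= k values."""
    _, running = cumulative_counts(data, edges)
    cut = next(i for i, total in enumerate(running) if total >= k)
    return sorted(value for value in data if value < edges[cut + 1])


data = [9, 4, 5, 6, 1]
uniform_edges = [0, 2, 4, 6, 8, 10]
nonuniform_edges = [0, 3, 4.5, 5.5, 7, 10]
print(smallest_k(data, uniform_edges, 2))
print(smallest_k(data, nonuniform_edges, 2))
```

With the uniform edges the sketch returns more than the two requested values, whereas with the non-uniform edges it returns exactly 1 and 4.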
The foregoing describes in detail, with reference to FIG. 2 to FIG. 5, a method for batch selection of data according to the embodiments of this application. The method keeps the data intervals ordered while leaving the data within each interval unordered, so the candidate data need not be fully sorted. Outputting the target data takes only 2 fully parallel scans and 1 parallel accumulation, which avoids repeated computation over the candidate data, saves memory and bandwidth, and improves system efficiency. In addition, determining the number of data intervals and the range of each data interval according to the data information of the candidate data makes the result of the subsequent batch selection more accurate. For a clearer understanding, the method is described below using a specific set of candidate data.
The candidate data consists of 9 values: 0.66, 0.44, 0.99, 0.33, 0.11, 0.55, 0.22, 0.77, 0.88. The target data is the 3 largest values among the candidate data. The data analyzer consists of 3 parallel processors. The range of a data interval is not constrained in this example, and the number M of data intervals should be made as small as possible to minimize the value of the performance formula O(n/p) + O(log M) + O(n/p). Here M is the total number of candidate data, 9, divided by the number of data to be selected, 3, so M = 9/3 = 3. By the uniform quantization formula (1), with candidate values in the range (0.0, 1.0) and 3 data intervals, the range of each data interval is 0.33333…, and the ranges handled by the 3 parallel processors are (0.0, 1/3], (1/3, 2/3], and (2/3, 1.0). At this point the count for each data interval is 0, as shown in Table 1.
Table 1
Data interval    (0.0, 1/3]    (1/3, 2/3]    (2/3, 1.0)
Count            0             0             0
Following the load balancing principle, the nine candidate data are divided so that each of the 3 parallel processors handles three of them: the first processor handles 0.66, 0.44, 0.99; the second processor handles 0.33, 0.11, 0.55; and the third processor handles 0.22, 0.77, 0.88.
The three processors count their data simultaneously. Each processor can either keep a local subtotal that is later combined into a total, or update a globally synchronized total directly. An example of the globally synchronized total follows.
In the first round, the first processor determines that 0.66 belongs to (1/3, 2/3], the second processor determines that 0.33 belongs to (0.0, 1/3], and the third processor determines that 0.22 belongs to (0.0, 1/3]. After the first round of counting, the count for each data interval is as shown in Table 2.
Table 2
Data interval    (0.0, 1/3]    (1/3, 2/3]    (2/3, 1.0)
Count            2             1             0
In the second round, the first processor determines that 0.44 belongs to (1/3, 2/3], the second processor determines that 0.11 belongs to (0.0, 1/3], and the third processor determines that 0.77 belongs to (2/3, 1.0). After the second round of counting, the count for each data interval is as shown in Table 3.
Table 3
Data interval    (0.0, 1/3]    (1/3, 2/3]    (2/3, 1.0)
Count            3             2             1
In the third round, the first processor determines that 0.99 belongs to (2/3, 1.0), the second processor determines that 0.55 belongs to (1/3, 2/3], and the third processor determines that 0.88 belongs to (2/3, 1.0). After the third round of counting, the count for each data interval is as shown in Table 4.
Table 4
Data interval    (0.0, 1/3]    (1/3, 2/3]    (2/3, 1.0)
Count            3             3             3
The interval accumulator then accumulates the counts of the 3 data intervals. The accumulated result for each data interval is the sum of the number of data in that interval and in all data intervals before it. Because this example selects the 3 largest values, accumulation proceeds in descending order of the data intervals, and the result is shown in Table 5. That is, the class covering (2/3, 1.0) contains the 3 largest values, the 2 classes covering (1/3, 1.0) together contain the 6 largest values, and the 3 classes covering (0.0, 1.0) contain the 9 largest values (here, all values).
Table 5
Data interval    (2/3, 1.0)    (1/3, 1.0)    (0.0, 1.0)
Count            3             6             9
Finally, the batch picker determines that the data interval of the target data is (2/3, 1.0). Assuming here that the batch picker is the same three parallel processors, each processor outputs those of its data that belong to (2/3, 1.0): the first processor outputs 0.99, the second processor outputs nothing, and the third processor outputs 0.77 and 0.88.
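The whole worked example can be replayed with the sketch below. It is a sequential stand-in for the three parallel processors, assuming equal-width intervals that are open on the left and closed on the right as in the tables; the function name `select_largest` and its parameters are hypothetical.

```python
import math
from itertools import accumulate


def select_largest(chunks, m, k, low=0.0, high=1.0):
    """Batch-select at least the k largest values using m equal-width intervals."""
    width = (high - low) / m

    def bucket(value):
        # Interval i covers (low + i * width, low + (i + 1) * width].
        return min(m - 1, max(0, math.ceil((value - low) / width) - 1))

    # Scan 1: each processor counts its own chunk; the counts are totalled.
    counts = [0] * m
    for chunk in chunks:
        for value in chunk:
            counts[bucket(value)] += 1
    # Accumulate in descending interval order because the largest values are wanted.
    running = list(accumulate(reversed(counts)))
    # Smallest number of top intervals that together hold at least k values.
    top = next(i for i, total in enumerate(running) if total >= k) + 1
    threshold = m - top  # lowest interval index that is still selected
    # Scan 2: each processor outputs its values that fall in the selected intervals.
    picked = [[v for v in chunk if bucket(v) >= threshold] for chunk in chunks]
    return counts, running, picked


chunks = [[0.66, 0.44, 0.99], [0.33, 0.11, 0.55], [0.22, 0.77, 0.88]]
counts, running, picked = select_largest(chunks, m=3, k=3)
print(counts, running, picked)
```

The two loops over `chunks` correspond to the 2 fully parallel scans and the `accumulate` call to the 1 accumulation mentioned earlier; each inner list of `picked` is the output of one processor.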
FIG. 6 is a schematic block diagram of an apparatus 300 for batch selection of data according to this application. As shown in FIG. 6, the apparatus 300 includes the following modules.
A data analyzer 310, configured to count the data intervals to which the data in the candidate data belong, to obtain a statistical result. The statistical result includes the number of data contained in each of a plurality of data intervals, and the sum of the ranges of the data intervals equals the range over which the candidate data is distributed.
An interval statistics unit 320, configured to accumulate, according to the statistical result, the counts of the plurality of data intervals to obtain an accumulated result. The accumulated result for each data interval is the sum of the number of data contained in that data interval and the number of data contained in all data intervals before it.
A batch picker 330, configured to determine, according to the accumulated result, the target data interval in which the target data is located, and to output the candidate data belonging to the target data interval.
Optionally, the apparatus 300 further includes an interval configurator 340, configured to determine, according to data information of the candidate data, the number of data intervals and the range of each data interval. The interval configurator sends the plurality of data intervals and the range of each data interval to the first processor.
Optionally, the interval configurator 340 is specifically configured to: when the candidate data is uniformly distributed, determine the number of data intervals and the range of each data interval according to a uniform quantization strategy, where the ranges of all data intervals are equal; or when the candidate data is non-uniformly distributed, determine the number of data intervals and the range of each data interval according to a non-uniform quantization strategy, where the ranges of at least two of the data intervals are unequal.
Optionally, when the candidate data is uniformly distributed and the range of each data interval is Δ, the interval configurator 340 is specifically configured to determine the number M of data intervals according to formula (1).
Optionally, the interval configurator 340 is specifically configured to: determine the number M of data intervals according to the number of candidate data and the number of target data to be output; and determine the range Δ of each data interval according to formula (1).
Optionally, the second processor is specifically configured to: when the target data is the smallest part of the candidate data, perform a prefix sum over the counts of the plurality of data intervals in ascending order of the data intervals; or when the target data is the largest part of the candidate data, perform a prefix sum over the counts of the plurality of data intervals in descending order of the data intervals.
Optionally, the data analyzer, the interval statistics unit, and the batch picker are the same physical device or parts of the same physical device.
Optionally, the data analyzer 310, the interval statistics unit 320, the batch picker 330, and the interval configurator 340 are configured to perform the operations of the method 200 for batch selection of data of this application. For brevity, details are not repeated here.
The data analyzer, interval statistics unit, batch picker, and interval configurator described above correspond fully to those in the method embodiments, with each module performing the corresponding steps; for details, refer to the corresponding method embodiments.
It should be noted that the data analyzer 310, the interval statistics unit 320, the batch picker 330, and the interval configurator 340 may be provided separately or integrated together and implemented as a single processing chip.
The apparatus of this application also fits the PRAM (parallel random-access machine) model and can be deployed on various parallel processors, accelerators, GPUs, FPGAs, ASICs, and cloud or edge platforms.
A system for batch selection of data according to this application is briefly described below, taking a cloud system as an example. FIG. 7 is a schematic architecture diagram of a system for batch selection of data according to this application. The system 400 includes a data analyzer 410, an interval statistics unit 420, a batch picker 430, and an interval configurator 440.
The data analyzer 410 is configured to count the data intervals to which the data in the candidate data belong, to obtain a statistical result. The statistical result includes the number of data contained in each of a plurality of data intervals, and the sum of the ranges of the data intervals equals the range over which the candidate data is distributed.
The interval statistics unit 420 is configured to accumulate, according to the statistical result, the counts of the plurality of data intervals to obtain an accumulated result. The accumulated result for each data interval is the sum of the number of data contained in that data interval and the number of data contained in all data intervals before it.
The batch picker 430 is configured to determine, according to the accumulated result, the target data interval in which the target data is located, and to output the candidate data belonging to the target data interval.
Optionally, the interval configurator 440 is configured to determine, according to data information of the candidate data, the number of data intervals and the range of each data interval.
The interval configurator 440 sends the plurality of data intervals and the range of each data interval to the data analyzer 410.
Optionally, the interval configurator is further configured to allocate the candidate data to the data analyzer 410 and the batch picker 430.
Specifically, the interval configurator 440 sends part of the candidate data to the data analyzer 410.
The data analyzer 410 counts the data intervals to which the data in the candidate data belong, to obtain a statistical result, and writes the statistical result into a first shared memory. The statistical result includes the number of data contained in each of the plurality of data intervals, and the sum of the ranges of the data intervals equals the range over which the candidate data is distributed.
The data analyzer 410 sends a first message to the interval statistics unit 420. The first message instructs the interval statistics unit 420 to accumulate the counts of the plurality of data intervals according to the statistical result.
In response to the first message, the interval statistics unit 420 accumulates the counts of the plurality of data intervals according to the statistical result to obtain an accumulated result, and writes the accumulated result into a second shared memory. The accumulated result for each data interval is the sum of the number of data contained in that data interval and the number of data contained in all data intervals before it.
The interval statistics unit 420 sends a second message to the batch picker 430. The second message instructs the batch picker 430 to determine, according to the accumulated result, the target data interval in which the target data is located.
The batch picker 430 outputs the target data according to the target data interval.
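The hand-off between the three components can be sketched with threads and queues. This is only an illustration under stated assumptions: a Python dictionary stands in for the first and second shared memories, `queue.Queue` objects stand in for the first and second messages, intervals are half-open [a, b), the smallest k values are selected, and k does not exceed the number of candidate data. The function name `run_pipeline` is hypothetical.

```python
import queue
import threading
from bisect import bisect_right
from itertools import accumulate


def run_pipeline(data, edges, k):
    """Three stages that hand off through shared memory and notification messages."""
    shared = {}  # stands in for the first and second shared memories
    first_message = queue.Queue()
    second_message = queue.Queue()
    output = []

    def data_analyzer():
        counts = [0] * (len(edges) - 1)
        for value in data:
            counts[bisect_right(edges, value) - 1] += 1
        shared["counts"] = counts              # write the statistical result
        first_message.put("statistics ready")  # first message

    def interval_statistics_unit():
        first_message.get()                    # wait for the first message
        shared["running"] = list(accumulate(shared["counts"]))
        second_message.put("accumulation ready")  # second message

    def batch_picker():
        second_message.get()                   # wait for the second message
        cut = next(i for i, total in enumerate(shared["running"]) if total >= k)
        output.extend(value for value in data if value < edges[cut + 1])

    threads = [threading.Thread(target=stage)
               for stage in (batch_picker, interval_statistics_unit, data_analyzer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output
```

The threads are started in reverse order on purpose: each stage blocks until its message arrives, so correctness depends on the messages rather than on start order.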
Optionally, the data analyzer 410 may include one multi-core processor, a plurality of parallel processors, or one multi-threaded processor, or the data analyzer 410 may be a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
Optionally, the interval statistics unit 420 may include one multi-core processor, a plurality of parallel processors, or one multi-threaded processor, or the interval statistics unit 420 may be a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
Optionally, the batch picker 430 may include one multi-core processor, a plurality of parallel processors, or one multi-threaded processor, or the batch picker 430 may be a combination of the multi-core processor, the plurality of parallel processors, and the multi-threaded processor.
Optionally, the first shared memory, the second shared memory, and the third shared memory may be the same shared memory.
It should be understood that the cloud system may have no shared memory and instead use distributed storage, in which each data interval is assigned to the distributed memory group of one processor, and the data analyzer, batch picker, and interval statistics unit exist in distributed form as software.
Optionally, in the cloud system, the data analyzer 410, the interval statistics unit 420, the batch picker 430, and the interval configurator may communicate with one another through the sub-processors that each of them includes.
Specifically, take the communication between the sub-processors of the data analyzer 410 and the interval statistics unit 420 as an example. Suppose the data intervals are (0, 3], (3, 6], and (6, 9]. The data analyzer 410 may include 3 distributed processors, and the interval statistics unit includes 3 distributed processors: the first processor keeps the count for (0, 3], the second processor keeps the count for (3, 6], and the third processor keeps the count for (6, 9]. The 3 distributed processors may be deployed at the same physical location. When any processor of the data analyzer 410 determines the data interval to which a candidate datum belongs, it sends indication information to the corresponding processor of the interval statistics unit 420, instructing that processor to update the count of the data interval it is responsible for. For example, when any processor of the data analyzer 410 determines that a candidate datum belongs to (0, 3], it sends indication information to the first processor, instructing the first processor to add 1.
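The per-interval counting by indication messages can be sketched as follows. The class `IntervalCounter` and the function `dispatch` are hypothetical names; a direct method call stands in for sending indication information over a network, and intervals are closed on the right as in the text.

```python
from bisect import bisect_left


class IntervalCounter:
    """Stand-in for one distributed processor of the interval statistics unit."""

    def __init__(self):
        self.count = 0

    def on_indication(self):
        # Receiving indication information means "add 1" to this interval's count.
        self.count += 1


def dispatch(values, upper_bounds, counters):
    """Route each value to the counter that owns its interval (lower, upper]."""
    for value in values:
        # The owning interval is the first one whose upper bound is >= value.
        counters[bisect_left(upper_bounds, value)].on_indication()


counters = [IntervalCounter() for _ in range(3)]
dispatch([1, 3, 4, 9, 6, 2], [3, 6, 9], counters)
print([counter.count for counter in counters])
```

Because each counter owns exactly one interval, no counter is written by more than one owner, which is the property that lets the design work without shared memory.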
应理解,该系统中的具体流程,可以参考相应的方法200进行理解,为了避免重复,此处不再赘述。It should be understood that the specific process in the system can be understood by referring to the corresponding method 200. To avoid repetition, details are not described herein again.
图8示出了本申请提供的数据批量选择的设备500的示意性框图,所述设备500包括:FIG. 8 is a schematic block diagram of a device 500 for data batch selection provided by the present application, the device 500 including:
存储器510,用于存储程序,所述程序包括代码;a memory 510, configured to store a program, where the program includes a code;
收发器520,用于和其他设备进行通信;The transceiver 520 is configured to communicate with other devices;
处理器530,用于执行存储器510中的程序代码。The processor 530 is configured to execute program code in the memory 510.
可选地,当所述代码被执行时,所述处理器530可以实现方法200的各个操作,为了简洁,在此不再赘述。收发器520用于在处理器530的驱动下执行具体的信号收发。Optionally, when the code is executed, the processor 530 can implement various operations of the method 200. For brevity, no further details are provided herein. The transceiver 520 is configured to perform specific signal transceiving under the driving of the processor 530.
应理解,图8仅示出了一种数据批量选择的设备的示意性框图,在图8中,该存储器510、该收发器520、该处理器530共享了同一个系统总线,但是该存储器510、该收发器520和该处理器530三个部件之间也可以是分别直连的。对于该数据批量选择的设备的各部件之间的连接关系,本申请并不进行限定。It should be understood that FIG. 8 only shows a schematic block diagram of a device for data batch selection. In FIG. 8, the memory 510, the transceiver 520, and the processor 530 share the same system bus, but the memory 510 The transceiver 520 and the three components of the processor 530 may also be directly connected. The connection relationship between the components of the device selected in batches of the data is not limited in this application.
应理解,在本申请实施例中,该处理器530可以是中央处理单元(Central Processing Unit,简称为“CPU”),该处理器530还可以是其他通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现成可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。It should be understood that, in the embodiment of the present application, the processor 530 may be a central processing unit ("CPU"), and the processor 530 may also be other general-purpose processors, digital signal processors (DSPs). , an application specific integrated circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, and the like.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such an implementation should not be considered to go beyond the scope of the present application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above correspond to the processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a standalone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (15)

  1. A method for batch selection of data, characterized in that the method comprises:
    counting, by a data analyzer, the data interval to which each piece of data in candidate data belongs, to obtain a statistical result, where the statistical result is the number of pieces of data contained in each of a plurality of data intervals, and the sum of the interval ranges of the data intervals is equal to the data distribution interval range of the candidate data;
    accumulating, by an interval statistics unit according to the statistical result, the number of pieces of data contained in each data interval, to obtain an accumulated result, where the accumulated result for each data interval is the sum of the number of pieces of data contained in that data interval and the number of pieces of data contained in all data intervals preceding it; and
    determining, by a batch picker according to the accumulated result, the target data interval in which the target data is located, and outputting the candidate data that belongs to the target data interval.
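The three steps of claim 1 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `batch_select_smallest`, its parameters, the use of equal-width intervals, and the rule of returning every candidate up to and including the interval where the running total first reaches `k` are all assumptions made for the example.

```python
from itertools import accumulate


def batch_select_smallest(candidates, lo, hi, num_intervals, k):
    """Histogram-based batch selection of (at least) the k smallest values.

    Hypothetical sketch: candidates are assumed to lie in [lo, hi], and the
    intervals are assumed to be equal-width.
    """
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")
    width = (hi - lo) / num_intervals

    def interval_of(value):
        # Clip so that value == hi falls into the last interval.
        return min(int((value - lo) / width), num_intervals - 1)

    # Step 1 (data analyzer): count how many candidates fall in each interval.
    counts = [0] * num_intervals
    for value in candidates:
        counts[interval_of(value)] += 1

    # Step 2 (interval statistics unit): running totals over the intervals.
    cumulative = list(accumulate(counts))

    # Step 3 (batch picker): the target interval is the first one whose
    # running total reaches k; output everything up to and including it.
    target = next(i for i, total in enumerate(cumulative) if total >= k)
    selected = [v for v in candidates if interval_of(v) <= target]
    return counts, cumulative, target, selected
```

For the candidates `[7, 1, 9, 3, 5, 0, 8, 2]` on the range 0 to 10 with five intervals and `k = 3`, the counts are `[2, 2, 1, 1, 2]`, the running totals are `[2, 4, 5, 6, 8]`, the target interval has index 1, and the output is `[1, 3, 0, 2]`. The output can contain more than `k` items because a whole interval is emitted at once, and no full sort of the candidates is needed.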
  2. The method according to claim 1, characterized in that, before the data analyzer counts the data interval to which each piece of data in the candidate data belongs, the method further comprises:
    determining, by an interval configurator according to data information of the candidate data, the number of the plurality of data intervals and the range of each of the plurality of data intervals; and
    sending, by the interval configurator, the plurality of data intervals and the range of each of the plurality of data intervals to the data analyzer.
  3. The method according to claim 2, characterized in that the determining, by the interval configurator according to the data information of the candidate data, of the number of the plurality of data intervals and the range of each of the plurality of data intervals comprises:
    when the candidate data is uniformly distributed, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to a uniform quantization strategy, where the ranges of the data intervals are equal; or
    when the candidate data is non-uniformly distributed, determining the number of the plurality of data intervals and the range of each of the plurality of data intervals according to a non-uniform quantization strategy, where the ranges of at least two of the plurality of data intervals are not equal.
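Both strategies of claim 3 can be represented by a sorted list of interval boundaries, so the same lookup serves equal and unequal interval ranges. The sketch below is only an illustration: the function name `interval_index` and the example boundary lists are hypothetical, and the claim does not say how the non-uniform boundaries are chosen.

```python
from bisect import bisect_right


def interval_index(value, edges):
    """Return the index of the interval [edges[i], edges[i + 1]) holding value.

    `edges` is a sorted list of boundaries (an assumed representation); the
    last interval is treated as closed so that value == edges[-1] is accepted.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError("value lies outside the configured intervals")
    return min(bisect_right(edges, value) - 1, len(edges) - 2)


# Uniform quantization: every interval has the same range (here 2).
uniform_edges = [0, 2, 4, 6, 8, 10]
# Non-uniform quantization: at least two intervals differ in range.
non_uniform_edges = [0, 1, 2, 4, 8, 16]
```

With these boundaries, `interval_index(5, uniform_edges)` is 2 and `interval_index(3, non_uniform_edges)` is also 2, since 3 lies in the interval from 2 to 4.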
  4. The method according to claim 3, characterized in that, when the candidate data is uniformly distributed and the range of each data interval is Δ, the determining of the number of the plurality of data intervals and the range of each of the plurality of data intervals according to the uniform quantization strategy comprises:
    determining the number M of the plurality of data intervals according to formula (1),
    M = x/Δ    (1)
    where x is the data interval range of the candidate data, and M is the number of the plurality of data intervals.
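As a worked illustration of formula (1), the sketch below computes M from x and Δ and lists the resulting interval boundaries. The function names are hypothetical, and rounding up when x is not an exact multiple of Δ is an assumption; the claim states only M = x/Δ.

```python
import math


def interval_count(x, delta):
    """Number of intervals M = x / delta, rounded up (rounding is assumed)."""
    if x <= 0 or delta <= 0:
        raise ValueError("x and delta must be positive")
    return math.ceil(x / delta)


def uniform_edges_for(lo, hi, delta):
    """Boundaries of M equal-width intervals covering [lo, hi]."""
    m = interval_count(hi - lo, delta)
    # The final boundary is pinned to hi so the intervals cover the range
    # exactly even when the last interval is shorter than delta.
    return [lo + i * delta for i in range(m)] + [hi]
```

For a candidate range of x = 100 and Δ = 25, this gives M = 4 and the boundaries `[0, 25, 50, 75, 100]` when the range starts at 0.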
  5. The method according to claim 3, characterized in that the method further comprises:
    determining the number M of the plurality of data intervals according to the number of pieces of the candidate data and the number of pieces of the target data to be output; and
    determining the range Δ of each data interval according to formula (1),
    M = x/Δ    (1)
    where x is the data interval range of the candidate data, and M is the number of the plurality of data intervals.
  6. The method according to any one of claims 1 to 5, characterized in that the accumulating, by the interval statistics unit according to the statistical result, of the number of pieces of data contained in each data interval comprises:
    when the target data is the smallest part of the candidate data, accumulating the number of pieces of data contained in each data interval in ascending order of the data intervals; or
    when the target data is the largest part of the candidate data, accumulating the number of pieces of data contained in each data interval in descending order of the data intervals.
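The two accumulation directions of claim 6 can be sketched as a prefix sum taken from either end of the per-interval counts. The function name `accumulate_counts` and the `want` parameter are hypothetical names introduced for this example.

```python
from itertools import accumulate


def accumulate_counts(counts, want="smallest"):
    """Running totals of per-interval counts.

    want="smallest": accumulate in ascending interval order, so entry i is
    the number of candidates in intervals 0..i.
    want="largest": accumulate in descending interval order, so entry i is
    the number of candidates in intervals i..last.
    """
    if want == "smallest":
        return list(accumulate(counts))
    if want == "largest":
        return list(accumulate(reversed(counts)))[::-1]
    raise ValueError("want must be 'smallest' or 'largest'")
```

For the counts `[2, 2, 1, 1, 2]`, ascending accumulation gives `[2, 4, 5, 6, 8]` and descending accumulation gives `[8, 6, 4, 3, 2]`. In the descending case, the target interval for the k largest values would be the highest-indexed interval whose total is at least k.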
  7. The method according to any one of claims 1 to 6, characterized in that the data analyzer, the interval statistics unit, and the batch picker are the same physical entity or partially the same physical entity.
  8. An apparatus for batch selection of data, characterized in that the apparatus comprises:
    a data analyzer, configured to count the data interval to which each piece of data in candidate data belongs, to obtain a statistical result, where the statistical result includes the number of pieces of data contained in each of a plurality of data intervals, and the sum of the interval ranges of the data intervals is equal to the data distribution interval range of the candidate data;
    an interval statistics unit that, according to the statistical result, accumulates the number of pieces of data contained in each data interval to obtain an accumulated result, where the accumulated result for each data interval is the sum of the number of pieces of data contained in that data interval and the number of pieces of data contained in all data intervals preceding it; and
    a batch picker that, according to the accumulated result, determines the target data interval in which the target data is located and outputs the candidate data that belongs to the target data interval.
  9. The apparatus according to claim 8, characterized in that the apparatus further comprises:
    an interval configurator, configured to determine, according to data information of the candidate data, the number of data intervals and the range of each data interval;
    where the interval configurator sends each data interval and the range of each data interval to the data analyzer.
  10. The apparatus according to claim 9, characterized in that the interval configurator is specifically configured to:
    when the candidate data is uniformly distributed, determine the number of data intervals and the range of each data interval according to a uniform quantization strategy, where the ranges of the data intervals are equal; or
    when the candidate data is non-uniformly distributed, determine the number of the plurality of data intervals and the range of each of the plurality of data intervals according to a non-uniform quantization strategy, where the ranges of at least two of the plurality of data intervals are not equal.
  11. The apparatus according to claim 10, characterized in that, when the candidate data is uniformly distributed and the range of each data interval is Δ, the interval configurator is specifically configured to:
    determine the number M of the plurality of data intervals according to formula (1),
    M = x/Δ    (1)
    where x is the data interval range of the candidate data, and M is the number of the plurality of data intervals.
  12. The apparatus according to claim 10, characterized in that the interval configurator is specifically configured to:
    determine the number M of the plurality of data intervals according to the number of pieces of the candidate data and the number of pieces of the target data to be output; and
    determine the range Δ of each data interval according to formula (1),
    M = x/Δ    (1)
    where x is the data interval range of the candidate data, and M is the number of the plurality of data intervals.
  13. The apparatus according to any one of claims 8 to 12, characterized in that the interval statistics unit is specifically configured to:
    when the target data is the smallest part of the candidate data, perform a prefix sum operation on the count of each data interval in ascending order of the plurality of data intervals; or
    when the target data is the largest part of the candidate data, perform a prefix sum operation on the count of each data interval in descending order of the plurality of data intervals.
  14. The apparatus according to any one of claims 8 to 14, characterized in that the data analyzer, the interval statistics unit, and the batch picker are the same physical entity or partially the same physical entity.
  15. A computer storage medium, characterized in that the computer storage medium stores program instructions, and when the instructions are executed, the computer storage medium can perform the method according to any one of claims 1 to 7.
PCT/CN2019/074777 2018-05-07 2019-02-11 Method and device for batch selection of data WO2019214303A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810425693.7 2018-05-07
CN201810425693.7A CN110457649B (en) 2018-05-07 2018-05-07 Method and device for selecting data in batches and computer storage medium

Publications (1)

Publication Number Publication Date
WO2019214303A1 true WO2019214303A1 (en) 2019-11-14

Family

ID=68466820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/074777 WO2019214303A1 (en) 2018-05-07 2019-02-11 Method and device for batch selection of data

Country Status (2)

Country Link
CN (1) CN110457649B (en)
WO (1) WO2019214303A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530084A (en) * 2013-09-26 2014-01-22 北京奇虎科技有限公司 Data parallel sequencing method and system
CN103746851A (en) * 2014-01-17 2014-04-23 中国联合网络通信集团有限公司 Method and device for realizing counting of independent user number
US20140244658A1 (en) * 2013-02-22 2014-08-28 International Business Machines Corporation Optimizing user selection for performing tasks in social networks
CN105740332A (en) * 2016-01-22 2016-07-06 北京京东尚科信息技术有限公司 Data sorting method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512320B (en) * 2015-12-18 2019-03-01 北京金山安全软件有限公司 User ranking obtaining method and device and server
CN106202280B (en) * 2016-06-29 2020-06-23 联想(北京)有限公司 Information processing method and server
US9753964B1 (en) * 2017-01-19 2017-09-05 Acquire Media Ventures, Inc. Similarity clustering in linear time with error-free retrieval using signature overlap with signature size matching

Also Published As

Publication number Publication date
CN110457649A (en) 2019-11-15
CN110457649B (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN110991311B (en) Target detection method based on dense connection deep network
US11954879B2 (en) Methods, systems and apparatus to optimize pipeline execution
US20210182318A1 (en) Data Retrieval Method and Apparatus
EP3679473B1 (en) A system and method for stream processing
CN111694839B (en) Time sequence index construction method and device based on big data and computer equipment
CN109063194A (en) Data retrieval method and device based on space encoding
CN111160793A (en) Method, device and equipment for configuring number of self-service equipment of service network point
WO2017020735A1 (en) Data processing method, backup server and storage system
CN113656670A (en) Flight data-oriented space-time trajectory data management analysis method and device
CN111989897B (en) Measuring index of computer network
WO2019214303A1 (en) Method and device for batch selection of data
CN116841753B (en) Stream processing and batch processing switching method and switching device
CN104751459B (en) Multi-dimensional feature similarity measuring optimizing method and image matching method
CN113271234A (en) Adaptive event aggregation
Barnes et al. Distributed parallel d8 up-slope area calculation in digital elevation models
CN113743477A (en) Histogram data publishing method based on differential privacy
CN109918543B (en) Link prediction method for nodes in graph flow
CN107943918B (en) Operation system based on hierarchical large-scale graph data
CN114547384A (en) Resource object processing method and device and computer equipment
WO2024016731A1 (en) Data point query method and apparatus, device cluster, program product, and storage medium
CN108984101B (en) Method and device for determining relationship between events in distributed storage system
US10242055B2 (en) Dual filter histogram optimization
WO2018036336A1 (en) Method and device for processing data
CN112148765B (en) Service data processing method, device and storage medium
CN116681767B (en) Point cloud searching method and device and terminal equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19800474

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19800474

Country of ref document: EP

Kind code of ref document: A1