WO2021103835A1 - Training sample acquisition method and apparatus - Google Patents
- Publication number
- WO2021103835A1 (PCT/CN2020/121356)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cost
- filter
- mode
- cloud
- combination
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Definitions
- the present invention relates to the field of automatic driving, and in particular to a method and device for obtaining training samples.
- cost control occupies a very important position in each production link. Therefore, calculating the total cost of the combination of various data collection modes and filter modes and comparing them, and selecting the most suitable combination to obtain training samples has become an urgent problem to be solved.
- the present invention provides a method and device for obtaining training samples.
- a scientific evaluation system is established that takes into account both the performance and the efficiency of the filter algorithm while avoiding the subjectivity of human experience. The specific technical scheme is as follows.
- the present invention provides a universal filter selection method, including:
- the data collection mode includes special collection and crowdsourcing collection
- the cost unit price of the special collection mode includes the unit time collection cost
- the cost unit price of the crowdsourcing collection mode includes the traffic cost of transmitting each sample and the depreciation cost of the equipment installed on crowdsourced vehicles
- the filter modes include uniform sampling, manual screening, car-side filter, cloud filter, and combined filter
- the screening performance parameters corresponding to the uniform sampling mode include the time interval of uniform sampling
- the screening performance parameters corresponding to the manual screening mode include the manual screening speed
- the cost unit price corresponding to the manual screening mode includes the manual screening cost per unit time
- the screening performance parameters corresponding to the car-side filter mode include the precision rate of the car-side filter, the recall rate of the car-side filter, and the car-side filter speed
- the screening performance parameters corresponding to the cloud filter mode include the precision rate of the cloud filter, the recall rate of the cloud filter, and the cloud filter speed
- the cost unit price corresponding to the cloud filter mode includes the operating cost of cloud computing resources per unit time;
- the manual screening speed refers to the number of data items processed manually per unit time, the car-side filter speed refers to the number of data items processed by the car-side filter per unit time, and the cloud filter speed refers to the number of data items processed by the cloud filter per unit time;
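Because each speed above is defined as a number of data items processed per unit time, the time and cost of one screening pass follow by simple arithmetic. The sketch below illustrates only that arithmetic; the function name and the example numbers are hypothetical, and it is not one of the patent's total cost formulas.

```python
def screening_cost(num_items: int, speed: float, cost_per_unit_time: float) -> float:
    """Cost of passing num_items through one screener.

    speed is the number of items processed per unit time, so the pass
    takes num_items / speed time units; multiplying by the cost per
    unit time gives the cost of the pass.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return num_items / speed * cost_per_unit_time


# Hypothetical numbers: 10,000 items, 500 items per hour, 20 cost units per hour.
print(screening_cost(10_000, 500.0, 20.0))  # 400.0
```

The same helper applies to manual screening (f person, c person) and to the cloud filter (f cloud, c resource), since both pair a speed with a cost per unit time.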
- the combination mode includes a data collection mode and a filter mode
- the total cost corresponding to the training sample obtained in the data set by this combination method is calculated ;
- the data acquisition combination scheme is sent to the corresponding device to acquire the original data set used to screen the training samples.
- the total cost calculation formula corresponding to the combination is:
- C total is the total cost
- N is the number of training samples
- p is the proportion of the training samples in the total data set
- T is the time interval of the uniform sampling
- c collection is the collection cost per unit time
- c store is the storage cost of the unit data
- c label is the unit label cost.
- the total cost calculation formula corresponding to the combination is:
- C total is the total cost
- N is the number of training samples
- p is the proportion of the training samples in the overall data set
- R car is the recall rate of the car-end filter
- f car is the speed of the car-end filter
- P car is the precision rate of the car-end filter
- c collection is the collection cost per unit time
- c store is the storage cost of the unit data
- c label is the unit labeling cost .
- the total cost calculation formula corresponding to the combination is:
- C total is the total cost
- N is the number of training samples
- p is the proportion of the training samples in the total data set
- T is the time interval of the uniform sampling
- R cloud is the recall rate of the cloud filter
- f cloud is the speed of the cloud filter
- P cloud is the precision rate of the cloud filter
- c collection is the collection cost per unit time
- c store is the storage cost of the unit data
- c resource is the operating cost of the cloud computing resource per unit time
- c label is the unit labeling cost.
- the combination of the data collection mode and the filter mode is a special collection and a first combined filter
- the first combined filter is a combination of uniform sampling and manual screening
- the total cost calculation formula corresponding to the combination is:
- C total is the total cost
- N is the number of training samples
- p is the proportion of the training samples in the overall data set
- T is the time interval of the uniform sampling
- f person is the manual screening speed
- c collection is the collection cost per unit time
- c store is the storage cost of the unit data
- c person is the manual screening cost per unit time
- c label is the unit labeling cost.
- the second combined filter is a combination of a cloud filter and a manual filter
- the total cost calculation formula corresponding to the combination is:
- C total is the total cost
- N is the number of training samples
- p is the proportion of the training samples in the total data set
- T is the time interval of the uniform sampling
- R cloud is the recall rate of the cloud filter
- f cloud is the speed of the cloud filter
- P cloud is the precision rate of the cloud filter
- f person is the manual screening speed
- c collection is the collection cost per unit time
- c store is the storage cost of the unit data
- c resource is the operating cost of the cloud computing resource per unit time
- c person is the manual screening cost per unit time
- c label is the unit labeling cost.
- the third combined filter is a combination of a car-side filter, a cloud filter, and manual screening.
- the total cost calculation formula corresponding to the combination method is:
- C total is the total cost
- N is the number of training samples
- p is the proportion of the training samples in the overall data set
- R car is the recall rate of the car-end filter
- f car is the car-end filter speed
- P car is the precision rate of the car filter
- R cloud is the recall rate of the cloud filter
- f cloud is the cloud filter speed
- P cloud is the precision rate of the cloud filter
- f person is the manual screening speed
- c collection is the collection cost per unit time
- c store is the storage cost per unit data
- c resource is the operating cost of cloud computing resources per unit time
- c person is the manual screening cost per unit time
- c label is the unit labeling cost.
- the total cost calculation formula corresponding to the combination is:
- C total is the total cost
- N is the number of training samples
- p is the proportion of the training samples in the overall data set
- R cloud is the recall rate of the cloud filter
- f cloud is the cloud filter speed
- P cloud is the precision rate of the cloud filter
- c netword is the traffic cost of transmitting each sample
- c store is the storage cost of unit data
- c resource is the operating cost of cloud computing resources per unit time
- c person is the manual screening cost per unit time
- c label is the unit labeling cost.
- the total cost calculation formula corresponding to the combination is:
- C total is the total cost
- N is the number of training samples
- p is the proportion of the training samples in the overall data set
- R car is the recall rate of the car-end filter
- f car is the speed of the car-end filter
- P car is the precision rate of the car-side filter
- R cloud is the recall rate of the cloud filter
- f cloud is the cloud filter speed
- P cloud is the precision rate of the cloud filter
- f person is the manual screening speed
- c device is the depreciation cost of the equipment installed on the crowdsourced vehicle
- c netword is the traffic cost of transmitting each sample
- c store is the storage cost of the unit data
- c resource is the operating cost of the cloud computing resource per unit time
- c person is the manual screening cost per unit time
- c label is the unit labeling cost.
- the present invention provides a device for acquiring training samples, including:
- a receiving module configured to receive an acquisition task of training samples, the acquisition task including the number of training samples to be acquired and the proportion of the training samples in the overall data set;
- the acquisition cost acquisition module is configured to acquire the cost unit price corresponding to each data acquisition mode
- the screening performance parameter and cost acquisition module is configured to obtain the screening performance parameter and cost unit price corresponding to each filter mode
- the storage and labeling cost acquisition module is configured to acquire the storage cost of unit data and the unit labeling cost of manually labeling a training sample
- the determining module is configured to initially determine a combination mode of data acquisition, and the combination mode includes a data collection mode and a filter mode;
- the calculation module is configured to, for each combination mode, calculate the total cost of obtaining the training samples from the data set with that combination mode, based on the cost unit price corresponding to its data collection mode and the screening performance parameters and cost unit price corresponding to its filter mode;
- the comparison module is configured to select the lowest total cost from the total costs obtained, and determine the combination of data collection mode and filter mode corresponding to the lowest total cost as the raw data acquisition combination plan with which the acquisition task obtains training samples;
- the sending module is configured to send the data acquisition combination scheme to the corresponding device to acquire the original data set used to filter the training samples.
- when an acquisition task for training samples is received, the number of training samples to be acquired and the proportion of training samples in the overall data set are read from the acquisition task, and the cost unit prices corresponding to all data collection modes and the cost unit prices and screening performance parameters corresponding to the filter modes are obtained. The total cost of every combination is then computed according to the preset total cost calculation formula for each combination of data collection mode and filter mode, the combination with the lowest total cost is selected as the combination plan for this acquisition task, and the plan is sent to the corresponding device to obtain training samples.
- the costs of the various combinations are calculated and summed to obtain the total cost of each combination, the lowest total cost is selected from all total costs, and the combination corresponding to the lowest total cost is used as the combination plan for the acquisition task.
- a scientific evaluation system is established. While taking into account the performance and efficiency of the filter algorithm, it avoids the subjectivity of human experience, overcomes the difficulty in the prior art of selecting the most appropriate filter for the actual situation, and avoids disagreements among developers caused by subjective differences in experience. As a result, in the iterative training of data-driven algorithms represented by deep learning in the field of autonomous driving, the required training sample data can be selected from massive data at a lower acquisition cost.
- FIG. 1 is a schematic flowchart of a method for obtaining training samples according to an embodiment of the present invention
- FIG. 2 is a schematic structural diagram of a device for acquiring training samples provided by an embodiment of the present invention.
- the embodiment of the present invention discloses a method and device for obtaining training samples. By calculating the total cost of each combination of data collection mode and filter mode, a scientific evaluation system is established that takes into account both the performance and the efficiency of the filter algorithm while avoiding the subjectivity of human experience. Detailed descriptions are given below.
- FIG. 1 is a schematic flowchart of a method for obtaining training samples according to an embodiment of the present invention. The method specifically includes the following steps:
- S102 Receive a training sample acquisition task, where the acquisition task includes the number of training samples to be acquired and the proportion of the training samples in the overall data set.
- the intuitive idea is to give a certain metric (such as cost) and compare the size relationship of each combination under this metric.
- N training samples that meet the requirements need to be selected from the data set, and the total cost of various combinations is calculated.
- the number of training samples to be obtained and the proportion of training samples in the overall data set determine the total cost of data collection, data storage, data screening, and data labeling. The proportion of samples to be screened measures the difficulty of the screening task, but in practice an accurate value of this proportion is difficult to obtain, so a proportion estimated in advance based on experience is used.
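To make the role of N and p concrete: if a fraction p of the collected data qualifies as training samples, then roughly N / p items must be collected to contain N of them. The sketch below is an illustration under that simplifying assumption (qualifying samples spread evenly, none lost in screening); the function name is hypothetical and this is not a formula taken from the patent.

```python
import math


def expected_raw_items(num_samples: int, proportion: float) -> int:
    """Approximate number of raw items that must be collected so that
    num_samples (N) of them qualify, when a fraction `proportion` (p)
    of the overall data set qualifies."""
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    return math.ceil(num_samples / proportion)


# Hypothetical task: N = 1000 training samples, estimated p = 0.25.
print(expected_raw_items(1000, 0.25))  # 4000
```

A smaller p inflates the amount of raw data and therefore every downstream cost, which is why an inaccurate estimate of p shifts the comparison between combinations.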
- the data collection mode includes special collection mode and crowdsourcing collection mode.
- Special collection is a collection mode in which dedicated collection vehicles gather data for certain specific purposes.
- the collection process includes data collection, data storage (usually on hard disks), and data recovery (copying the data stored on the hard disks to a suitable place). Based on this collection process, when evaluating the cost unit price of the special collection mode, the data collection cost is set as the cost unit price of the special collection mode, that is, the collection cost per unit time.
- Crowdsourcing collection is a collection mode that collects data by installing data collection equipment on crowdsourced vehicles.
- the collection process includes data collection, data transmission (mainly through a traffic network) and data recovery (putting the received data to a suitable place).
- the data collection cost is set as the traffic cost of transmitting each sample and the depreciation cost of the equipment installed on the crowdsourcing vehicle.
- the unit cost of the data collection mode determines the total cost of this link of data collection.
- Filter modes include uniform sampling, manual filtering, car-side filters, cloud filters, and combined filters.
- the car-end filter is installed on the car, and the data can be directly filtered through the filter during the collection process.
- the screening performance parameters corresponding to the car-end filter mode include the precision rate of the car-end filter, the recall rate of the car-end filter, and the car-end filter speed; the cloud filter requires the data collected by the vehicle to be recovered first and then filters it offline, for which various large servers can be used.
- the screening performance parameters corresponding to the cloud filter mode include the precision rate of the cloud filter, the recall rate of the cloud filter, and the cloud filter speed
- the cost unit price corresponding to the cloud filter mode includes the operating cost of cloud computing resources per unit time; for comparison, two special screening modes, uniform sampling and manual screening, are also considered.
- Uniform sampling obtains sample data directly from the collected data by taking items at a fixed time interval.
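A minimal sketch of uniform sampling by time interval, assuming the collected data is a list of (timestamp, payload) pairs sorted by timestamp. The function name, the data layout, and the choice to start at the first frame are illustrative assumptions, not details given in the patent.

```python
from typing import Any, List, Tuple

Frame = Tuple[float, Any]  # (timestamp, payload); hypothetical layout


def uniform_sample(frames: List[Frame], interval: float) -> List[Frame]:
    """Keep the first frame, then the first frame at or after each
    subsequent `interval` (T) measured from the last kept frame."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    selected: List[Frame] = []
    next_time = None
    for timestamp, payload in frames:
        if next_time is None or timestamp >= next_time:
            selected.append((timestamp, payload))
            next_time = timestamp + interval
    return selected


# Ten frames one time unit apart, sampled with T = 3.
frames = [(t, f"frame-{t}") for t in range(10)]
print([t for t, _ in uniform_sample(frames, 3)])  # [0, 3, 6, 9]
```

A larger T keeps fewer items per unit of collection time, which lowers storage and labeling volume but lengthens the collection needed to reach N samples.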
- Manual screening is to select training samples from the collected data through manual observation.
- the screening performance parameters corresponding to the manual screening mode include the manual screening speed, and the cost unit price corresponding to the manual screening mode includes the manual screening cost per unit time; in addition, there are combination filters, which include the first combination filter, the second combination filter, and the third combination filter.
- the first combination is a combination of uniform sampling and manual screening
- the second combination filter is a combination of cloud filters and manual screening
- the third combination filter is a combination of car-side filters, cloud filters and manual screening.
- the screening performance parameters corresponding to the filter mode determine the total cost of data collection, data storage, data screening, and data labeling.
- the unit cost of the filter mode determines the total cost of data screening.
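The precision rate and recall rate listed among the screening performance parameters have standard definitions in terms of a filter's hits and misses. The sketch below states those standard definitions; the variable names and the example counts are hypothetical.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the items the filter kept that really are training samples."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of the real training samples that the filter kept."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical filter run: 80 correct keeps, 20 wrong keeps, 20 missed samples.
print(precision(80, 20))  # 0.8
print(recall(80, 20))     # 0.8
```

Low precision means more kept items must be reviewed or labeled per useful sample, and low recall means more raw data must be collected to reach the same number of samples, so both rates feed into the total cost.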
- S108 Obtain the storage cost of unit data and the unit labeling cost of manually labeling a training sample.
- the storage cost of unit data determines the total cost of data storage, and the unit labeling cost determines the total cost of data labeling.
- S110 Determine a combination mode of raw data acquisition, where the combination mode includes a data collection mode and a filter mode.
- the combinations of data collection mode and filter mode include the following cases: special collection and uniform sampling, special collection and car filter, special collection and cloud filter, special collection and first combination filter, special collection and second combination filter, special collection and third combination filter, crowdsourcing collection and second combination filter, and crowdsourcing collection and third combination filter. Each combination has its own total cost calculation formula.
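The eight combinations listed above can be written down as plain data, which makes it easy to attach one cost formula to each. The sketch below is one possible encoding; the enum and constant names are hypothetical.

```python
from enum import Enum


class CollectionMode(Enum):
    SPECIAL = "special collection"
    CROWDSOURCING = "crowdsourcing collection"


class FilterMode(Enum):
    UNIFORM_SAMPLING = "uniform sampling"
    CAR_FILTER = "car filter"
    CLOUD_FILTER = "cloud filter"
    FIRST_COMBINATION = "uniform sampling + manual screening"
    SECOND_COMBINATION = "cloud filter + manual screening"
    THIRD_COMBINATION = "car filter + cloud filter + manual screening"


# The eight combinations named in the text, in the order given there.
COMBINATIONS = [
    (CollectionMode.SPECIAL, FilterMode.UNIFORM_SAMPLING),
    (CollectionMode.SPECIAL, FilterMode.CAR_FILTER),
    (CollectionMode.SPECIAL, FilterMode.CLOUD_FILTER),
    (CollectionMode.SPECIAL, FilterMode.FIRST_COMBINATION),
    (CollectionMode.SPECIAL, FilterMode.SECOND_COMBINATION),
    (CollectionMode.SPECIAL, FilterMode.THIRD_COMBINATION),
    (CollectionMode.CROWDSOURCING, FilterMode.SECOND_COMBINATION),
    (CollectionMode.CROWDSOURCING, FilterMode.THIRD_COMBINATION),
]

for collection, filter_mode in COMBINATIONS:
    print(f"{collection.value} with {filter_mode.value}")
```

Note that crowdsourcing collection appears only with the second and third combination filters in the list given in the text.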
- the total cost calculation formula corresponding to the combination is:
- the total cost calculation formula corresponding to the combination is:
- the total cost calculation formula corresponding to the combination is:
- the total cost calculation formula corresponding to the combination method is:
- C total is the total cost
- N is the number of training samples
- p is the proportion of the training samples in the total data set
- T is the time interval of the uniform sampling
- R car is the recall rate of the car-side filter
- f car is the speed of the car-side filter
- P car is the precision rate of the car-side filter
- R cloud is the recall rate of the cloud filter
- f cloud is the cloud filter speed
- P cloud is the precision rate of the cloud filter
- f person is the manual screening speed
- c collection is the collection cost per unit time
- c device is the depreciation cost of the equipment installed on the crowdsourced vehicle
- c netword is the traffic cost of transmitting each sample
- c store is the storage cost of the unit data
- c resource is the operation cost of the cloud computing resource per unit time
- c person is the manual screening cost per unit time
- c label is the unit labeling cost.
- the total cost provides a quantitative decision-making basis for choosing among the combination methods.
- S114 Select the lowest total cost from each total cost obtained, and determine the combination mode of the data collection mode and the filter mode corresponding to the lowest total cost as the original data acquisition combination plan for the acquisition task to acquire training samples.
- the combination with the lowest total cost can thus be determined as the raw data acquisition combination plan with which the acquisition task obtains training samples; this takes into account both the performance and the efficiency of the filter algorithm while avoiding the subjectivity of human experience.
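Step S114 amounts to taking the minimum over the computed totals. A minimal sketch, assuming the totals have already been computed and stored in a dictionary keyed by combination name; the function name and the cost values below are made up for illustration only.

```python
from typing import Dict, Tuple


def select_cheapest(total_costs: Dict[str, float]) -> Tuple[str, float]:
    """Return the combination name with the lowest total cost and that cost.
    If several combinations tie, the first one in insertion order wins."""
    if not total_costs:
        raise ValueError("no total costs to compare")
    best = min(total_costs, key=total_costs.get)
    return best, total_costs[best]


# Hypothetical totals for three of the combinations.
costs = {
    "special collection + uniform sampling": 5200.0,
    "special collection + cloud filter": 3100.0,
    "crowdsourcing collection + second combination filter": 3400.0,
}
print(select_cheapest(costs))  # ('special collection + cloud filter', 3100.0)
```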
- S116 Send the data acquisition combination solution to a corresponding device to acquire the original data set used to screen the training samples.
- when an acquisition task for training samples is received, this embodiment reads the number of training samples to be acquired and the proportion of training samples in the overall data set from the acquisition task, and obtains the cost unit prices corresponding to all data collection modes and the cost unit prices and screening performance parameters corresponding to the filter modes. The total cost of every combination is then computed according to the preset total cost calculation formula for each combination of data collection mode and filter mode, the combination with the lowest total cost is selected as the combination plan for this acquisition task, and the plan is sent to the corresponding device to obtain training samples.
- the costs of the various combinations are calculated and summed to obtain the total cost of each combination, the lowest total cost is selected from all total costs, and the combination corresponding to the lowest total cost is used as the combination plan for the acquisition task.
- FIG. 2 is a schematic structural diagram of a device for acquiring training samples provided by an embodiment of the present invention.
- an apparatus for acquiring training samples provided by an embodiment of the present invention may include:
- the receiving module 202 is configured to receive an acquisition task of training samples, where the acquisition task includes the number of training samples to be acquired and the proportion of the training samples in the overall data set;
- the acquisition cost acquisition module 204 is configured to acquire the cost unit price corresponding to each data acquisition mode
- the screening performance parameter and cost obtaining module 206 is configured to obtain screening performance parameters and cost unit prices corresponding to each filter mode
- the storage and labeling cost obtaining module 208 is configured to obtain the storage cost of unit data and the unit labeling cost of manually labeling a training sample;
- the determining module 210 is configured to determine a combination mode for acquiring raw data, the combination mode including a data collection mode and a filter mode;
- the calculation module 212 is configured to, for each combination mode, calculate the total cost of obtaining the training samples from the data set with that combination mode, based on the cost unit price corresponding to its data collection mode and the screening performance parameters and cost unit price corresponding to its filter mode;
- the comparison module 214 is configured to select the lowest total cost from the total costs obtained, and determine the combination of data collection mode and filter mode corresponding to the lowest total cost as the raw data acquisition combination plan with which the acquisition task obtains training samples;
- the sending module 216 is configured to send the data acquisition combination scheme to a corresponding device to acquire the original data set used to screen the training samples.
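The module chain above (receive, calculate, compare, send) can be sketched as a small pipeline. Since the total cost formulas themselves are not reproduced in this text, the sketch takes them as caller-supplied functions; `AcquisitionTask`, `plan_acquisition`, and the placeholder formulas in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AcquisitionTask:
    num_samples: int   # N, number of training samples to acquire
    proportion: float  # p, share of training samples in the overall data set


CostFormula = Callable[[AcquisitionTask], float]


def plan_acquisition(
    task: AcquisitionTask,
    formulas: Dict[str, CostFormula],
    send: Callable[[str], None],
) -> str:
    """Compute each combination's total cost, pick the cheapest, send it."""
    if not formulas:
        raise ValueError("at least one combination is required")
    # Calculation module: one total cost per combination.
    costs = {name: formula(task) for name, formula in formulas.items()}
    # Comparison module: the combination with the lowest total cost.
    best = min(costs, key=costs.get)
    # Sending module: hand the chosen plan to the device that collects data.
    send(best)
    return best


# Placeholder formulas for illustration only; not the patent's formulas.
formulas = {
    "combination A": lambda t: 2.0 * t.num_samples,
    "combination B": lambda t: 500.0 + 1.0 * t.num_samples,
}
sent = []
print(plan_acquisition(AcquisitionTask(1000, 0.25), formulas, sent.append))  # combination B
```

Passing the formulas in as a dictionary keeps the comparison logic independent of how many combinations are evaluated.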
- when an acquisition task for training samples is received, this device reads the number of training samples to be acquired and the proportion of training samples in the overall data set from the acquisition task, and obtains the cost unit prices corresponding to all data collection modes and the cost unit prices and screening performance parameters corresponding to the filter modes. The total cost of every combination is then computed according to the preset total cost calculation formula for each combination of data collection mode and filter mode, the combination with the lowest total cost is selected as the combination plan for this acquisition task, and the plan is sent to the corresponding device to obtain training samples.
- the costs of the various combinations are calculated and summed to obtain the total cost of each combination, the lowest total cost is selected from all total costs, and the combination corresponding to the lowest total cost is used as the combination plan for the acquisition task.
- modules in the device in the embodiment may be distributed in the device in the embodiment according to the description of the embodiment, or may be located in one or more devices different from this embodiment with corresponding changes.
- the modules of the above-mentioned embodiments can be combined into one module, or further divided into multiple sub-modules.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Filters That Use Time-Delay Elements (AREA)
Claims (10)
- A method for obtaining training samples, characterized in that it comprises: receiving an acquisition task for training samples, the acquisition task including the number of training samples to be acquired and the proportion of the training samples in the overall data set; obtaining the cost unit price corresponding to each data collection mode, wherein the data collection modes include special collection and crowdsourcing collection, the cost unit price of the special collection mode includes the collection cost per unit time, and the cost unit price of the crowdsourcing collection mode includes the traffic cost of transmitting each sample and the depreciation cost of the equipment installed on crowdsourced vehicles; obtaining the screening performance parameters and cost unit price corresponding to each filter mode, wherein the filter modes include uniform sampling, manual screening, a car-side filter, a cloud filter, and combined filters, the screening performance parameters corresponding to the uniform sampling mode include the time interval of uniform sampling, the screening performance parameters corresponding to the manual screening mode include the manual screening speed, the cost unit price corresponding to the manual screening mode includes the manual screening cost per unit time, the screening performance parameters corresponding to the car-side filter mode include the precision rate of the car-side filter, the recall rate of the car-side filter, and the car-side filter speed, the screening performance parameters corresponding to the cloud filter mode include the precision rate of the cloud filter, the recall rate of the cloud filter, and the cloud filter speed, and the cost unit price corresponding to the cloud filter mode includes the operating cost of cloud computing resources per unit time; the manual screening speed is the number of data items processed manually per unit time, the car-side filter speed is the number of data items processed by the car-side filter per unit time, and the cloud filter speed is the number of data items processed by the cloud filter per unit time; obtaining the storage cost of unit data and the unit labeling cost of manually labeling one training sample; determining combination modes for raw data acquisition, each combination mode including a data collection mode and a filter mode; for each combination mode, calculating, based on the cost unit price corresponding to its data collection mode and the screening performance parameters and cost unit price corresponding to its filter mode, the total cost of obtaining the training samples from the data set with that combination mode; selecting the lowest total cost from the total costs obtained, and determining the combination of data collection mode and filter mode corresponding to the lowest total cost as the raw data acquisition combination plan with which the acquisition task obtains training samples; and sending the raw data acquisition combination plan to the corresponding device to acquire the raw data set used for screening the training samples.
- The method according to claim 1, characterized in that, when the combination of data collection mode and filter mode is special collection and uniform sampling, the total cost calculation formula corresponding to the combination is: where C total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, T is the time interval of the uniform sampling, c collection is the collection cost per unit time, c store is the storage cost of the unit data, and c label is the unit labeling cost.
- The method according to claim 1, wherein, when the combination of the data collection mode and the filter mode is special collection and the car-side filter, the total cost for the combination is calculated by a formula in which C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, R_car is the recall rate of the car-side filter, f_car is the speed of the car-side filter, P_car is the precision rate of the car-side filter, c_collection is the collection cost per unit time, c_store is the storage cost of unit data, and c_label is the unit labeling cost.
- The method according to claim 1, wherein, when the combination of the data collection mode and the filter mode is special collection and the cloud filter, the total cost for the combination is calculated by a formula in which C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, T is the time interval of the uniform sampling, R_cloud is the recall rate of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision rate of the cloud filter, c_collection is the collection cost per unit time, c_store is the storage cost of unit data, c_resource is the operating cost of cloud computing resources per unit time, and c_label is the unit labeling cost.
- The method according to claim 1, wherein, when the combination of the data collection mode and the filter mode is special collection and a first combined filter, the first combined filter being a combination of uniform sampling and manual screening, the total cost for the combination is calculated by a formula in which C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, T is the time interval of the uniform sampling, f_person is the manual screening speed, c_collection is the collection cost per unit time, c_store is the storage cost of unit data, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
- The method according to claim 1, wherein, when the combination of the data collection mode and the filter mode is special collection and a second combined filter, the second combined filter being a combination of the cloud filter and manual screening, the total cost for the combination is calculated by a formula in which C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, T is the time interval of the uniform sampling, R_cloud is the recall rate of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision rate of the cloud filter, f_person is the manual screening speed, c_collection is the collection cost per unit time, c_store is the storage cost of unit data, c_resource is the operating cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
- The method according to claim 1, wherein, when the combination of the data collection mode and the filter mode is special collection and a third combined filter, the third combined filter being a combination of the car-side filter, the cloud filter, and manual screening, the total cost for the combination is calculated by a formula in which C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, R_car is the recall rate of the car-side filter, f_car is the speed of the car-side filter, P_car is the precision rate of the car-side filter, R_cloud is the recall rate of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision rate of the cloud filter, f_person is the manual screening speed, c_collection is the collection cost per unit time, c_store is the storage cost of unit data, c_resource is the operating cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
- The method according to claim 1, wherein, when the combination of the data collection mode and the filter mode is crowdsourced collection and the second combined filter, the total cost for the combination is calculated by a formula in which C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, R_cloud is the recall rate of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision rate of the cloud filter, f_person is the manual screening speed, c_netword is the traffic cost of transmitting each sample, c_store is the storage cost of unit data, c_resource is the operating cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
- The method according to claim 1, wherein, when the combination of the data collection mode and the filter mode is crowdsourced collection and the third combined filter, the total cost for the combination is calculated by a formula in which C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, R_car is the recall rate of the car-side filter, f_car is the speed of the car-side filter, P_car is the precision rate of the car-side filter, R_cloud is the recall rate of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision rate of the cloud filter, f_person is the manual screening speed, c_device is the depreciation cost of the equipment installed on the crowdsourced vehicle, c_netword is the traffic cost of transmitting each sample, c_store is the storage cost of unit data, c_resource is the operating cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
- An apparatus for acquiring training samples, comprising: a receiving module, configured to receive an acquisition task for training samples, the acquisition task including the number of training samples to be acquired and the proportion of the training samples in the overall data set; a collection cost acquisition module, configured to obtain the cost unit price corresponding to each data collection mode; a screening performance parameter and cost acquisition module, configured to obtain the screening performance parameters and cost unit price corresponding to each filter mode; a storage and labeling cost acquisition module, configured to obtain the storage cost of unit data and the unit labeling cost of manually labeling one training sample; a determining module, configured to determine the combination modes for raw data acquisition, each combination mode comprising a data collection mode and a filter mode; a calculation module, configured to calculate, for each combination mode, based on the cost unit price corresponding to its data collection mode and on the screening performance parameters and cost unit price corresponding to its filter mode, the total cost of acquiring the training samples from the data set with that combination mode; a comparison module, configured to select the lowest total cost from the calculated total costs, and to determine the combination of data collection mode and filter mode corresponding to the lowest total cost as the raw data acquisition combination scheme with which the acquisition task acquires the training samples; and a sending module, configured to send the data acquisition combination scheme to the corresponding device to acquire the raw data set used for screening the training samples.
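The selection step recited in claim 1 and in the comparison module (compute a total cost for every combination of data collection mode and filter mode, then keep the cheapest) can be sketched in Python as below. This is only an illustration under stated assumptions: the names `AcquisitionTask`, `select_cheapest`, and `toy_cost_fns` are hypothetical, and the cost functions use made-up coefficients because the claimed total-cost formulas are not reproduced in this text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class AcquisitionTask:
    """Inputs named in the claims: N (sample count) and p (proportion)."""
    n_samples: int
    proportion: float


# A combination is (data collection mode, filter mode).
Combination = Tuple[str, str]
# Hypothetical signature: each combination supplies its own total-cost function.
CostFn = Callable[[AcquisitionTask], float]


def select_cheapest(
    task: AcquisitionTask, cost_fns: Dict[Combination, CostFn]
) -> Tuple[Combination, Dict[Combination, float]]:
    """Evaluate every combination and return the one with the lowest total cost."""
    if not cost_fns:
        raise ValueError("at least one combination is required")
    totals = {combo: fn(task) for combo, fn in cost_fns.items()}
    best = min(totals, key=totals.get)
    return best, totals


# Placeholder cost functions with made-up coefficients. They are NOT the
# claimed formulas, which are not reproduced in this text.
toy_cost_fns: Dict[Combination, CostFn] = {
    ("special collection", "uniform sampling"):
        lambda t: t.n_samples / t.proportion * 0.02,
    ("special collection", "cloud filter"):
        lambda t: t.n_samples * 0.625 + 300.0,
    ("crowdsourced collection", "second combined filter"):
        lambda t: t.n_samples * 0.6 + 100.0,
}

if __name__ == "__main__":
    task = AcquisitionTask(n_samples=1000, proportion=0.01)
    best, totals = select_cheapest(task, toy_cost_fns)
    for combo, total in sorted(totals.items(), key=lambda kv: kv[1]):
        print(combo, round(total, 2))
    print("selected:", best)
```

Passing the cost functions in as a dictionary keeps the selection logic independent of any particular formula, so a real per-combination cost model could be substituted without changing `select_cheapest`.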
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911188397.0 | 2019-11-28 | ||
CN201911188397.0A CN112861898B (en) | 2019-11-28 | 2019-11-28 | Training sample acquisition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021103835A1 (en) | 2021-06-03 |
Family
ID=75985256
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/121356 WO2021103835A1 (en) | 2019-11-28 | 2020-10-16 | Training sample acquisition method and apparatus |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112861898B (en) |
WO (1) | WO2021103835A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105447031A (en) * | 2014-08-28 | 2016-03-30 | 百度在线网络技术(北京)有限公司 | Training sample labeling method and device |
CN107909088A (en) * | 2017-09-27 | 2018-04-13 | 百度在线网络技术(北京)有限公司 | Obtain method, apparatus, equipment and the computer-readable storage medium of training sample |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109446783B (en) * | 2018-11-16 | 2023-07-25 | 山东浪潮科学研究院有限公司 | Image recognition efficient sample collection method and system based on machine crowdsourcing |
CN113505730A (en) * | 2021-07-26 | 2021-10-15 | 全景智联(武汉)科技有限公司 | Model evaluation method, device, equipment and storage medium based on mass data |
- 2019-11-28: CN application CN201911188397.0A, patent CN112861898B (en), status: Active
- 2020-10-16: WO application PCT/CN2020/121356, publication WO2021103835A1 (en), status: Application Filing
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105447031A (en) * | 2014-08-28 | 2016-03-30 | 百度在线网络技术(北京)有限公司 | Training sample labeling method and device |
CN107909088A (en) * | 2017-09-27 | 2018-04-13 | 百度在线网络技术(北京)有限公司 | Obtain method, apparatus, equipment and the computer-readable storage medium of training sample |
Also Published As
Publication number | Publication date |
---|---|
CN112861898B (en) | 2022-06-10 |
CN112861898A (en) | 2021-05-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107786994B (en) | User perception quality difference analysis method and system for end-to-end wireless service | |
CN109376660B (en) | Target monitoring method, device and system | |
CN110930705B (en) | Intersection traffic decision system, method and equipment | |
CN109982361B (en) | Signal interference analysis method, device, equipment and medium | |
WO2018184304A1 (en) | Method and device for detecting health state of network element | |
CN101695170A (en) | Wireless communication network testing data collection and analysis method based on intelligent mobile phone | |
US12019059B2 (en) | Detecting equipment defects using lubricant analysis | |
CN110472581A (en) | A kind of cell image analysis method based on deep learning | |
EP3326109A1 (en) | System and method for providing a recipe | |
CN115660288A (en) | Analysis management system based on internet big data | |
CN113704077A (en) | Test case generation method and device | |
WO2021103835A1 (en) | Training sample acquisition method and apparatus | |
CN116934507B (en) | Intelligent claim settlement method and system based on big data driving | |
CN109995549B (en) | Method and device for evaluating flow value | |
CN113505980A (en) | Reliability evaluation method, device and system for intelligent traffic management system | |
CN116452154B (en) | Project management system suitable for communication operators | |
CN117275644A (en) | Detection result mutual recognition method, system and storage medium based on deep learning | |
CN104392101B (en) | Data sharing method and device | |
CN108692709B (en) | Farmland disaster detection method and system, unmanned aerial vehicle and cloud server | |
CN116385045A (en) | Data processing method, device and equipment for receiving and hosting additional service | |
CN110413902A (en) | The online recommended method in vehicle salvage shop, device, equipment and storage medium | |
JP2017208717A (en) | Analysis system for radio communication network | |
CN109167673A (en) | A kind of Novel cloud service screening technique of abnormal fusion Qos Data Detection | |
CN115222276A (en) | Bidding analysis and evaluation method and device | |
CN114782115A (en) | Method, system and terminal device for recommending site selection of private stores |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20894385; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20894385; Country of ref document: EP; Kind code of ref document: A1 |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20894385; Country of ref document: EP; Kind code of ref document: A1 |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.01.2023) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20894385; Country of ref document: EP; Kind code of ref document: A1 |