WO2021103835A1 - Training sample acquisition method and apparatus - Google Patents

Info

Publication number
WO2021103835A1
Authority
WO
WIPO (PCT)
Prior art keywords
cost
filter
mode
cloud
combination
Application number
PCT/CN2020/121356
Other languages
French (fr)
Chinese (zh)
Inventor
马贤忠
胡皓瑜
江浩
董维山
任少卿
范一磊
Original Assignee
初速度(苏州)科技有限公司
Application filed by 初速度(苏州)科技有限公司
Publication of WO2021103835A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Definitions

  • the present invention relates to the field of automatic driving, and in particular to a method and device for obtaining training samples.
  • cost control occupies a very important position in each production link. Therefore, calculating the total cost of the combination of various data collection modes and filter modes and comparing them, and selecting the most suitable combination to obtain training samples has become an urgent problem to be solved.
  • the present invention provides a method and device for obtaining training samples.
  • a scientific evaluation system is established that takes the performance and efficiency of the filter algorithm into account while avoiding the subjectivity of human experience.
  • The specific technical scheme is as follows.
  • the present invention provides a universal filter selection method, including:
  • the data collection mode includes special collection and crowdsourcing collection
  • the cost unit price of the special collection mode includes the unit time collection cost
  • the cost unit price of the crowdsourcing collection mode includes the traffic cost of transmitting each sample and the depreciation cost of the equipment installed on crowdsourced vehicles
  • the filter modes include uniform sampling, manual screening, car-side filter, cloud filter, and combined filter
  • the screening performance parameters corresponding to the uniform sampling mode include the time interval of uniform sampling
  • the screening performance parameters corresponding to the manual screening mode include manual screening speed
  • the cost unit price corresponding to the manual screening mode includes the manual screening cost per unit time
  • the screening performance parameters corresponding to the car-side filter mode include the precision rate of the car-side filter, the recall rate of the car-side filter, and the speed of the car-side filter
  • the screening performance parameters corresponding to the cloud filter mode include the precision rate of the cloud filter, the recall rate of the cloud filter, and the speed of the cloud filter
  • the cost unit price corresponding to the cloud filter mode includes the operating cost of cloud computing resources per unit time
  • the manual screening speed refers to the number of data items processed manually per unit time
  • the car-side filter speed refers to the number of data items processed by the car-side filter per unit time, and the cloud filter speed refers to the number of data items processed by the cloud filter per unit time;
  • the combination mode includes a data collection mode and a filter mode
  • the total cost of obtaining the training samples from the data set with that combination mode is calculated;
  • the data acquisition combination scheme is sent to the corresponding device to acquire the original data set used to screen the training samples.
  • the total cost calculation formula corresponding to the combination is:
  • C total is the total cost
  • N is the number of training samples
  • p is the proportion of the training samples in the total data set
  • T is the time interval of the uniform sampling
  • c collection is the collection cost per unit time
  • c store is the storage cost of the unit data
  • c label is the unit label cost.
  • the total cost calculation formula corresponding to the combination is:
  • C total is the total cost
  • N is the number of training samples
  • p is the proportion of the training samples in the overall data set
  • R car is the recall rate of the car-end filter
  • f car is the speed of the car-end filter
  • P car is the precision rate of the car-end filter
  • c collection is the collection cost per unit time
  • c store is the storage cost of the unit data
  • c label is the unit labeling cost.
  • the total cost calculation formula corresponding to the combination is:
  • C total is the total cost
  • N is the number of training samples
  • p is the proportion of the training samples in the total data set
  • T is the time interval of the uniform sampling
  • R cloud is the recall rate of the cloud filter
  • f cloud is the speed of the cloud filter
  • P cloud is the precision rate of the cloud filter
  • c collection is the collection cost per unit time
  • c store is the storage cost of the unit data
  • c resource is the operating cost of the cloud computing resource per unit time
  • c label is the label cost per unit.
  • the combination of the data collection mode and the filter mode is a special collection and a first combined filter
  • the first combined filter is a combination of uniform sampling and manual screening
  • the total cost calculation formula corresponding to the combined mode is:
  • C total is the total cost
  • N is the number of training samples
  • p is the proportion of the training samples in the overall data set
  • T is the time interval of the uniform sampling
  • f person is the manual screening speed
  • c collection is the collection cost per unit time
  • c store is the storage cost of the unit data
  • c person is the manual screening cost per unit time
  • c label is the unit labeling cost.
  • the second combined filter is a combination of a cloud filter and a manual filter
  • the total cost calculation formula corresponding to the combined mode is:
  • C total is the total cost
  • N is the number of training samples
  • p is the proportion of the training samples in the total data set
  • T is the time interval of the uniform sampling
  • R cloud is the recall rate of the cloud filter
  • f cloud is the speed of the cloud filter
  • P cloud is the precision rate of the cloud filter
  • f person is the manual screening speed
  • c collection is the collection cost per unit time
  • c store is the storage cost of the unit data
  • c resource is the operating cost of the cloud computing resource per unit time
  • c person is the manual screening cost per unit time
  • c label is the unit labeling cost.
  • the third combined filter is a combination of a car-side filter, a cloud filter, and manual screening.
  • the total cost calculation formula corresponding to the combination method is:
  • C total is the total cost
  • N is the number of training samples
  • p is the proportion of the training samples in the overall data set
  • R car is the recall rate of the car-end filter
  • f car is the speed of the car-end filter
  • P car is the precision rate of the car filter
  • R cloud is the recall rate of the cloud filter
  • f cloud is the cloud filter speed
  • P cloud is the precision rate of the cloud filter
  • f person is the manual screening speed
  • c collection is the collection cost per unit time
  • c store is the storage cost per unit data
  • c resource is the operating cost of cloud computing resources per unit time
  • c person is the manual screening cost per unit time
  • c label is the unit labeling cost.
  • the total cost calculation formula corresponding to the combination is:
  • C total is the total cost
  • N is the number of training samples
  • p is the proportion of the training samples in the overall data set
  • R cloud is the recall rate of the cloud filter
  • f cloud is the cloud filter speed
  • P cloud is the precision rate of the cloud filter
  • c netword is the traffic cost of transmitting each sample
  • c store is the storage cost of unit data
  • c resource is the operating cost of cloud computing resources per unit time
  • c person is the manual screening cost per unit time
  • c label is the unit labeling cost.
  • the total cost calculation formula corresponding to the combination is:
  • C total is the total cost
  • N is the number of training samples
  • p is the proportion of the training samples in the overall data set
  • R car is the recall rate of the car-end filter
  • f car is the speed of the car-end filter
  • P car is the precision rate of the car-side filter
  • R cloud is the recall rate of the cloud filter
  • f cloud is the cloud filter speed
  • P cloud is the precision rate of the cloud filter
  • f person is the manual screening speed
  • c device is the depreciation cost of the equipment installed on the crowdsourced vehicle
  • c netword is the traffic cost of transmitting each sample
  • c store is the storage cost of the unit data
  • c resource is the operating cost of the cloud computing resource per unit time
  • c person is the manual screening cost per unit time
  • c label is the unit labeling cost.
  • the present invention provides a device for acquiring training samples, including:
  • a receiving module configured to receive an acquisition task of training samples, the acquisition task including the number of training samples to be acquired and the proportion of the training samples in the overall data set;
  • the acquisition cost acquisition module is configured to acquire the cost unit price corresponding to each data acquisition mode
  • the screening performance parameter and cost acquisition module is configured to obtain the screening performance parameter and cost unit price corresponding to each filter mode
  • the storage and labeling cost acquisition module is configured to acquire the storage cost of unit data and the unit labeling cost of manually labeling a training sample
  • the determining module is configured to initially determine a combination mode of data acquisition, and the combination mode includes a data collection mode and a filter mode;
  • the calculation module is configured to calculate, for each combination mode, the total cost of obtaining the training samples from the data set with that combination mode, based on the cost unit price corresponding to its data collection mode and the screening performance parameters and cost unit price corresponding to its filter mode;
  • the comparison module is configured to select the lowest total cost from the total costs obtained, and to determine the combination of data collection mode and filter mode corresponding to the lowest total cost as the raw data acquisition combination scheme with which the acquisition task obtains training samples;
  • the sending module is configured to send the data acquisition combination scheme to the corresponding device to acquire the original data set used to filter the training samples.
  • When an acquisition task for training samples is received, the number of training samples to be acquired and the proportion of training samples in the overall data set are read from the task; the cost unit prices corresponding to all data collection modes and the cost unit prices and screening performance parameters corresponding to the filter modes are obtained; and the total cost of every combination is computed with the preset total cost calculation formula for each combination of data collection mode and filter mode. The combination with the lowest total cost is then selected as the combination scheme for this acquisition task and sent to the corresponding device to obtain training samples.
  • The costs of each combination are calculated and summed to give its total cost, the lowest total cost is selected from all total costs, and the corresponding combination is used as the combination scheme for the acquisition task.
  • A scientific evaluation system is established. It takes the performance and efficiency of the filter algorithm into account while avoiding the subjectivity of human experience, which overcomes the prior-art difficulty of selecting the most appropriate filter for the actual situation and avoids disagreements among developers caused by subjective differences in experience. As a result, in the iterative training of data-driven algorithms represented by deep learning in the field of autonomous driving, the required training sample data can be selected from massive data at a reduced acquisition cost.
  • FIG. 1 is a schematic flowchart of a method for obtaining training samples according to an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a device for acquiring training samples provided by an embodiment of the present invention.
  • the embodiment of the present invention discloses a method and device for obtaining training samples. By calculating the total cost of the data collection mode and the filter mode, a scientific evaluation system is established that takes the performance and efficiency of the filter algorithm into account while avoiding the subjectivity of human experience. Detailed descriptions are given below.
  • FIG. 1 is a schematic flowchart of a method for obtaining training samples according to an embodiment of the present invention. The method specifically includes the following steps:
  • S102 Receive a training sample acquisition task, where the acquisition task includes the number of training samples to be acquired and the proportion of the training samples in the overall data set.
  • the intuitive idea is to fix a metric (such as cost) and compare the combinations under that metric.
  • N training samples that meet the requirements need to be selected from the data set, and the total cost of each combination is calculated.
  • the number of training samples to be obtained and the proportion of training samples in the overall data set determine the total cost of data collection, data storage, data screening, and data labeling. The proportion of samples to be screened measures the difficulty of the screening task, but in practice an accurate value is hard to obtain, so a proportion estimated in advance from experience is used.
  • the data collection mode includes special collection mode and crowdsourcing collection mode.
  • Special collection is a collection mode in which collection vehicles are dispatched specifically for certain purposes.
  • the collection process includes data collection, data storage (usually on hard disks), and data recovery (copying the data stored on the hard disks to a suitable place). Based on this process, when the cost unit price of the special collection mode is evaluated, the data collection cost is taken as the cost unit price of the mode, that is, the collection cost per unit time.
  • Crowdsourcing collection is a collection mode that collects data by installing data collection equipment on crowdsourced vehicles.
  • the collection process includes data collection, data transmission (mainly over a mobile data network), and data recovery (placing the received data in a suitable place).
  • the data collection cost is set as the traffic cost of transmitting each sample and the depreciation cost of the equipment installed on the crowdsourced vehicle.
  • the unit cost of the data collection mode determines the total cost of this link of data collection.
  • Filter modes include uniform sampling, manual filtering, car-side filters, cloud filters, and combined filters.
  • the car-end filter is installed on the vehicle, so data can be filtered directly during collection.
  • the screening performance parameters corresponding to the car-end filter mode include the precision rate, the recall rate, and the speed of the car-end filter. The cloud filter requires the data collected by the vehicle to be recovered first and then filters it offline, where large servers can be used.
  • the screening performance parameters corresponding to the cloud filter mode include the precision rate, the recall rate, and the speed of the cloud filter
  • the cost unit price corresponding to the cloud filter mode includes the operating cost of cloud computing resources per unit time. For comparison, two special screening modes, uniform sampling and manual screening, are also considered.
  • Uniform sampling obtains sample data directly from the collected data by sampling at uniform intervals.
  • Manual screening selects training samples from the collected data through manual inspection.
  • the screening performance parameters corresponding to the manual screening mode include the manual screening speed,
  • and the cost unit price corresponding to the manual screening mode includes the manual screening cost per unit time. In addition, there are combination filters, which include the first combination filter, the second combination filter, and the third combination filter.
  • the first combination is a combination of uniform sampling and manual screening
  • the second combination filter is a combination of cloud filters and manual screening
  • the third combination filter is a combination of car-side filters, cloud filters and manual screening.
  • the screening performance parameters corresponding to the filter mode determine the total cost of data collection, data storage, data screening, and data labeling.
  • the unit cost of the filter mode determines the total cost of data screening.
  • S108 Obtain the storage cost of unit data and the unit labeling cost of manually labeling a training sample.
  • the storage cost of unit data determines the total cost of data storage, and the unit labeling cost determines the total cost of data labeling.
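To see why the screening performance parameters affect every cost stage, the standard definitions of precision and recall can be applied to the acquisition task. The sketch below is an illustration under stated assumptions and is not the publication's total cost formula, which is not reproduced in this text. It assumes a single filter stage and that a fraction p of the raw data are genuine target samples; the names `Volumes` and `required_volumes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Volumes:
    raw_items: float     # items that must be collected and fed to the filter
    passed_items: float  # items the filter lets through (to be stored and labeled)


def required_volumes(n: int, p: float, precision: float, recall: float) -> Volumes:
    """Estimate the data volumes needed to end up with n true training samples.

    Assumptions (not from the source): one filter stage, and a fraction p of
    the raw data are genuine target samples.
    """
    if not (0 < p <= 1 and 0 < precision <= 1 and 0 < recall <= 1):
        raise ValueError("p, precision and recall must lie in (0, 1]")
    # recall = kept targets / all targets  ->  targets needed in the raw data
    targets_in_raw = n / recall
    # a fraction p of raw items are targets  ->  raw items to collect
    raw_items = targets_in_raw / p
    # precision = kept targets / all kept items  ->  items passed downstream
    passed_items = n / precision
    return Volumes(raw_items=raw_items, passed_items=passed_items)


v = required_volumes(n=1000, p=0.01, precision=0.5, recall=0.8)
print(v.raw_items, v.passed_items)
```

Under these assumptions, a lower recall increases the amount of raw data that has to be collected, while a lower precision increases the amount that has to be stored and labeled.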
  • S110 Determine a combination mode of raw data acquisition, where the combination mode includes a data collection mode and a filter mode.
  • the combinations of data collection mode and filter mode include the following: special collection and uniform sampling, special collection and car filter, special collection and cloud filter, special collection and first combination filter, special collection and second combination filter, special collection and third combination filter, crowdsourcing collection and second combination filter, and crowdsourcing collection and third combination filter. Each combination has its own total cost calculation formula.
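The eight combinations named above can be written down as a small lookup structure. This is a minimal sketch; the enum and constant names are hypothetical and only the pairings themselves come from the text.

```python
from enum import Enum


class CollectionMode(Enum):
    SPECIAL = "special collection"
    CROWDSOURCING = "crowdsourcing collection"


class FilterMode(Enum):
    UNIFORM_SAMPLING = "uniform sampling"
    CAR_FILTER = "car-side filter"
    CLOUD_FILTER = "cloud filter"
    FIRST_COMBINED = "uniform sampling + manual screening"
    SECOND_COMBINED = "cloud filter + manual screening"
    THIRD_COMBINED = "car-side filter + cloud filter + manual screening"


# The eight pairings listed in the text; other pairings are not named there.
COMBINATIONS = [
    (CollectionMode.SPECIAL, FilterMode.UNIFORM_SAMPLING),
    (CollectionMode.SPECIAL, FilterMode.CAR_FILTER),
    (CollectionMode.SPECIAL, FilterMode.CLOUD_FILTER),
    (CollectionMode.SPECIAL, FilterMode.FIRST_COMBINED),
    (CollectionMode.SPECIAL, FilterMode.SECOND_COMBINED),
    (CollectionMode.SPECIAL, FilterMode.THIRD_COMBINED),
    (CollectionMode.CROWDSOURCING, FilterMode.SECOND_COMBINED),
    (CollectionMode.CROWDSOURCING, FilterMode.THIRD_COMBINED),
]

for collection, filt in COMBINATIONS:
    print(f"{collection.value} / {filt.value}")
```

Keeping the pairings in one list makes it straightforward to attach one cost formula to each entry and iterate over all of them.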
  • the total cost calculation formula corresponding to each combination method (equations not reproduced in this text) uses the following symbols:
  • C total is the total cost
  • N is the number of training samples
  • p is the proportion of the training samples in the total data set
  • T is the time interval of the uniform sampling
  • R car is the recall rate of the car-side filter
  • f car is the speed of the car-side filter
  • P car is the precision rate of the car-side filter
  • R cloud is the recall rate of the cloud filter
  • f cloud is the cloud filter speed
  • P cloud is the precision rate of the cloud filter
  • f person is the manual screening speed
  • c collection is the collection cost per unit time
  • c device is the depreciation cost of the equipment installed on the crowdsourced vehicle
  • c netword is the traffic cost of transmitting each sample
  • c store is the storage cost of the unit data
  • c resource is the operation cost of the cloud computing resource per unit time
  • c person is the manual screening cost per unit time
  • c label is the unit labeling cost.
  • The total cost provides a quantitative basis for choosing among the combination methods.
  • S114 Select the lowest total cost from each total cost obtained, and determine the combination mode of the data collection mode and the filter mode corresponding to the lowest total cost as the original data acquisition combination plan for the acquisition task to acquire training samples.
  • This combination can be determined as the original data acquisition combination plan with which the acquisition task obtains training samples. The choice takes the performance and efficiency of the filter algorithm into account while avoiding the subjectivity of human experience.
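The selection in step S114 amounts to taking the minimum over the per-combination total costs. The following is a minimal sketch, assuming each combination is represented by a callable that maps the task parameters (N, p) to a total cost. The function name `select_cheapest` is hypothetical, and the two cost functions use made-up numbers purely for illustration; they are not the publication's formulas.

```python
from typing import Callable, Dict, Tuple

Combination = Tuple[str, str]           # (data collection mode, filter mode)
CostFn = Callable[[int, float], float]  # (N, p) -> total cost


def select_cheapest(
    n: int, p: float, cost_fns: Dict[Combination, CostFn]
) -> Tuple[Combination, float]:
    """Evaluate every combination and return the one with the lowest total cost."""
    if not cost_fns:
        raise ValueError("at least one combination is required")
    totals = {combo: fn(n, p) for combo, fn in cost_fns.items()}
    best = min(totals, key=lambda combo: totals[combo])
    return best, totals[best]


# Placeholder cost functions with invented coefficients (assumptions only).
cost_fns: Dict[Combination, CostFn] = {
    ("special collection", "uniform sampling"): lambda n, p: 5.0 * n / p,
    ("special collection", "cloud filter"): lambda n, p: 2.0 * n / p + 3000.0,
}

best, cost = select_cheapest(1000, 0.1, cost_fns)
print(best, cost)
```

If two combinations tie, `min` returns the first one encountered, so a deterministic ordering of the dictionary keeps the result reproducible.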
  • S116 Send the data acquisition combination solution to a corresponding device to acquire the original data set used to screen the training samples.
  • when an acquisition task for training samples is received, this embodiment reads the number of training samples to be acquired and the proportion of training samples in the overall data set from the task, obtains the cost unit prices corresponding to all data collection modes and the cost unit prices and screening performance parameters corresponding to the filter modes, and computes the total cost of every combination with the preset total cost calculation formula for each combination of data collection mode and filter mode. The combination with the lowest total cost is selected as the combination plan for this acquisition task and sent to the corresponding device to obtain training samples.
  • the costs of each combination are calculated and summed to give its total cost, the lowest total cost is selected from all total costs, and the corresponding combination is used as the combination plan for the acquisition task.
  • FIG. 2 is a schematic structural diagram of a device for acquiring training samples provided by an embodiment of the present invention.
  • an apparatus for acquiring training samples provided by an embodiment of the present invention may include:
  • the receiving module 202 is configured to receive an acquisition task of training samples, where the acquisition task includes the number of training samples to be acquired and the proportion of the training samples in the overall data set;
  • the acquisition cost acquisition module 204 is configured to acquire the cost unit price corresponding to each data acquisition mode
  • the screening performance parameter and cost obtaining module 206 is configured to obtain screening performance parameters and cost unit prices corresponding to each filter mode
  • the storage and labeling cost obtaining module 208 is configured to obtain the storage cost of unit data and the unit labeling cost of manually labeling a training sample;
  • the determining module 210 is configured to determine a combination mode for acquiring raw data, the combination mode including a data collection mode and a filter mode;
  • the calculation module 212 is configured to calculate, for each combination mode, the total cost of obtaining the training samples from the data set with that combination mode, based on the cost unit price corresponding to its data collection mode and the screening performance parameters and cost unit price corresponding to its filter mode;
  • the comparison module 214 is configured to select the lowest total cost from the total costs obtained, and to determine the combination of data collection mode and filter mode corresponding to the lowest total cost as the original data acquisition combination plan with which the acquisition task obtains training samples;
  • the sending module 216 is configured to send the data acquisition combination scheme to a corresponding device to acquire the original data set used to screen the training samples.
  • when an acquisition task for training samples is received, this device reads the number of training samples to be acquired and the proportion of training samples in the overall data set from the task, obtains the cost unit prices corresponding to all data collection modes and the cost unit prices and screening performance parameters corresponding to the filter modes, and computes the total cost of every combination with the preset total cost calculation formula for each combination of data collection mode and filter mode. The combination with the lowest total cost is selected as the combination plan for this acquisition task and sent to the corresponding device to obtain training samples.
  • the costs of each combination are calculated and summed to give its total cost, the lowest total cost is selected from all total costs, and the corresponding combination is used as the combination plan for the acquisition task.
  • the modules of the device in the embodiment may be distributed in the device as described in the embodiment, or may, with corresponding changes, be located in one or more devices different from that of this embodiment.
  • the modules of the above-mentioned embodiments can be combined into one module, or further divided into multiple sub-modules.

Abstract

Disclosed are a training sample acquisition method and apparatus. The method comprises: receiving a training sample acquisition task, wherein the acquisition task includes the number of training samples to be acquired, and the proportion of the training samples in a whole data set; acquiring a cost unit price corresponding to each data collection mode; acquiring a screening performance parameter and cost unit price corresponding to each filter mode; acquiring a storage cost of unit data and a unit label cost; determining combination modes for original data acquisition; with regard to each of the combination modes, on the basis of the cost unit price corresponding to the data collection mode and the screening performance parameter and cost unit price corresponding to the filter mode, performing calculation to obtain the total cost corresponding to each of the combination modes; selecting the minimum total cost from the obtained total costs, and determining the combination mode corresponding to the minimum total cost as the original data acquisition combination scheme with which the acquisition task acquires the training samples; and sending the combination scheme to a corresponding device, so as to acquire an original data set for screening the training samples.

Description

Method and device for acquiring training samples
Technical field
The present invention relates to the field of autonomous driving, and in particular to a method and device for obtaining training samples.
Background
In the field of autonomous driving, in the iterative training of data-driven algorithms represented by deep learning, the explosive growth of data volume has made screening the necessary data out of massive data sets an important step in data preprocessing. However, attention currently focuses mainly on designing filter algorithms, and there is no scientific evaluation system for comparing different filters. In practice, filters are usually judged by comparing algorithm evaluation metrics such as precision and recall, or decision makers and developers choose among filters subjectively based on experience.
However, algorithm performance metrics alone (such as precision and recall) only compare the quality of the screening algorithms. They overlook the fact that some filters achieve high precision and recall but are computationally complex and consume a large amount of resources, so they may not be the best choice in a real application. Subjective judgments made by developers from experience lack an objective scientific basis, and when several people disagree it is difficult to choose.
In practical applications, cost control is very important at every stage of production. Calculating and comparing the total cost of each combination of data collection mode and filter mode, and selecting the most suitable combination for obtaining training samples, has therefore become an urgent problem.
发明内容Summary of the invention
本发明提供了一种训练样本的获取方法及装置,通过对数据采集模式和筛选器模式的总成本进行计算,建立科学的评价体系,在兼顾筛选器算法性能和算法效率时,也避免人为经验的主观性。具体技术 方案如下。The present invention provides a method and device for obtaining training samples. By calculating the total cost of the data collection mode and the filter mode, a scientific evaluation system is established, and human experience is avoided when the performance of the filter algorithm and the efficiency of the algorithm are taken into account. Subjectivity. The specific technical scheme is as follows.
In a first aspect, the present invention provides a general filter selection method, including:
receiving an acquisition task for training samples, where the acquisition task includes the number of training samples to be acquired and the proportion of the training samples in the overall data set;
obtaining the cost unit price corresponding to each data collection mode, where the data collection modes include special collection and crowdsourced collection, the cost unit price of the special collection mode includes the collection cost per unit time, and the cost unit price of the crowdsourced collection mode includes the traffic cost of transmitting each sample and the depreciation cost of the equipment installed on the crowdsourced vehicles;
obtaining the screening performance parameters and cost unit price corresponding to each filter mode, where the filter modes include uniform sampling, manual screening, the vehicle-side filter, the cloud filter, and combined filters; the screening performance parameters of the uniform sampling mode include the time interval of uniform sampling; the screening performance parameters of the manual screening mode include the manual screening speed, and its cost unit price includes the manual screening cost per unit time; the screening performance parameters of the vehicle-side filter mode include the precision, recall, and speed of the vehicle-side filter; the screening performance parameters of the cloud filter mode include the precision, recall, and speed of the cloud filter, and its cost unit price includes the operating cost of cloud computing resources per unit time; the manual screening speed is the number of data items processed manually per unit time, the vehicle-side filter speed is the number of data items processed by the vehicle-side filter per unit time, and the cloud filter speed is the number of data items processed by the cloud filter per unit time;
obtaining the storage cost per unit of data and the unit labeling cost of manually labeling one training sample;
determining the combinations for raw data acquisition, where each combination includes a data collection mode and a filter mode;
for each combination, calculating the total cost of obtaining the training samples from the data set with that combination, based on the cost unit price of its data collection mode and the screening performance parameters and cost unit price of its filter mode;
selecting the lowest total cost among the total costs obtained, and determining the combination of data collection mode and filter mode corresponding to the lowest total cost as the raw data acquisition combination scheme with which the acquisition task obtains training samples;
sending the data acquisition combination scheme to the corresponding device to acquire the raw data set from which the training samples are screened.
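The last three steps (compute a total cost per combination, pick the minimum, hand the chosen scheme on) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_cheapest`, the string labels for the modes, and the idea of passing each combination's cost formula as a callable are all assumptions made here. The actual cost formulas are given in the formula figures of this document and are not reproduced.

```python
from typing import Callable, Dict, Tuple

# A combination is (data collection mode, filter mode); labels are illustrative.
Combination = Tuple[str, str]
# A cost function maps (N, p) to a total cost. Its body is supplied by the
# caller; the formulas from the patent figures are NOT encoded here.
CostFn = Callable[[int, float], float]


def select_cheapest(
    n_samples: int,
    proportion: float,
    cost_fns: Dict[Combination, CostFn],
) -> Tuple[Combination, float]:
    """Return the combination with the lowest total cost and that cost."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    if not cost_fns:
        raise ValueError("at least one combination is required")

    # Evaluate every combination once, then take the minimum.
    totals = {combo: fn(n_samples, proportion) for combo, fn in cost_fns.items()}
    best = min(totals, key=totals.get)
    return best, totals[best]
```

With placeholder cost functions, a call such as `select_cheapest(100, 0.1, {("special", "uniform"): f1, ("crowdsourced", "combined_2"): f2})` returns the cheaper of the two combinations together with its total cost.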
Optionally, when the combination of data collection mode and filter mode is special collection and uniform sampling, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000001
where C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, T is the time interval of uniform sampling, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, and c_label is the unit labeling cost.
Optionally, when the combination of data collection mode and filter mode is special collection and the vehicle-side filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000002
where C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, R_car is the recall of the vehicle-side filter, f_car is the speed of the vehicle-side filter, P_car is the precision of the vehicle-side filter, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, and c_label is the unit labeling cost.
Optionally, when the combination of data collection mode and filter mode is special collection and the cloud filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000003
where C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, T is the time interval of uniform sampling, R_cloud is the recall of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision of the cloud filter, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, c_resource is the operating cost of cloud computing resources per unit time, and c_label is the unit labeling cost.
Optionally, when the combination of data collection mode and filter mode is special collection and the first combined filter, where the first combined filter is a combination of uniform sampling and manual screening, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000004
where C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, T is the time interval of uniform sampling, f_person is the manual screening speed, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
Optionally, when the combination of data collection mode and filter mode is special collection and the second combined filter, where the second combined filter is a combination of the cloud filter and manual screening, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000005
where C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, T is the time interval of uniform sampling, R_cloud is the recall of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision of the cloud filter, f_person is the manual screening speed, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, c_resource is the operating cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
Optionally, when the combination of data collection mode and filter mode is special collection and the third combined filter, where the third combined filter is a combination of the vehicle-side filter, the cloud filter, and manual screening, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000006
where C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, R_car is the recall of the vehicle-side filter, f_car is the speed of the vehicle-side filter, P_car is the precision of the vehicle-side filter, R_cloud is the recall of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision of the cloud filter, f_person is the manual screening speed, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, c_resource is the operating cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
Optionally, when the combination of data collection mode and filter mode is crowdsourced collection and the second combined filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000007
where C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, R_cloud is the recall of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision of the cloud filter, f_person is the manual screening speed, c_netword is the traffic cost of transmitting each sample, c_store is the storage cost per unit of data, c_resource is the operating cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
Optionally, when the combination of data collection mode and filter mode is crowdsourced collection and the third combined filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000008
where C_total is the total cost, N is the number of training samples, p is the proportion of the training samples in the overall data set, R_car is the recall of the vehicle-side filter, f_car is the speed of the vehicle-side filter, P_car is the precision of the vehicle-side filter, R_cloud is the recall of the cloud filter, f_cloud is the speed of the cloud filter, P_cloud is the precision of the cloud filter, f_person is the manual screening speed, c_device is the depreciation cost of the equipment installed on the crowdsourced vehicles, c_netword is the traffic cost of transmitting each sample, c_store is the storage cost per unit of data, c_resource is the operating cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
In a second aspect, the present invention provides a device for obtaining training samples, including:
a receiving module, configured to receive an acquisition task for training samples, where the acquisition task includes the number of training samples to be acquired and the proportion of the training samples in the overall data set;
a collection cost obtaining module, configured to obtain the cost unit price corresponding to each data collection mode;
a screening performance parameter and cost obtaining module, configured to obtain the screening performance parameters and cost unit price corresponding to each filter mode;
a storage and labeling cost obtaining module, configured to obtain the storage cost per unit of data and the unit labeling cost of manually labeling one training sample;
a determining module, configured to determine the combinations for raw data acquisition, where each combination includes a data collection mode and a filter mode;
a calculation module, configured to calculate, for each combination, the total cost of obtaining the training samples from the data set with that combination, based on the cost unit price of its data collection mode and the screening performance parameters and cost unit price of its filter mode;
a comparison module, configured to select the lowest total cost among the total costs obtained, and to determine the combination of data collection mode and filter mode corresponding to the lowest total cost as the raw data acquisition combination scheme with which the acquisition task obtains training samples;
a sending module, configured to send the data acquisition combination scheme to the corresponding device to acquire the raw data set from which the training samples are screened.
The beneficial effects of the embodiments of the present invention are as follows:
When an acquisition task for training samples is received, the number of training samples to be acquired and the proportion of the training samples in the overall data set are read from the task, and the cost unit prices of all data collection modes and the cost unit prices and screening performance parameters of all filter modes are obtained. The total cost of every combination is then computed with the preset total cost formula for that combination of data collection mode and filter mode, the combination with the lowest total cost is selected as the combination scheme for the task, and the scheme is sent to the corresponding device to obtain training samples. Starting from the actual screening process, the costs of data collection, data storage, data screening, and data labeling are calculated separately for each combination and then summed to give its total cost; the lowest total cost is selected, and the corresponding combination becomes the combination scheme for the acquisition task. Calculating the total cost of the data collection mode and the filter mode establishes a scientific evaluation system that takes both filter algorithm performance and algorithm efficiency into account while avoiding the subjectivity of human experience. This overcomes the difficulty in the prior art of choosing the most suitable filter for the actual situation and avoids disagreements among developers caused by differing subjective experience. As a result, in the iterative training of data-driven algorithms in autonomous driving, typified by deep learning, the required training sample data can be screened out of massive data at a lower cost.
The innovations of the embodiments of the present invention include:
1. When an acquisition task for training samples is received, the number of training samples to be acquired and the proportion of the training samples in the overall data set are read from the task, and the cost unit prices of all data collection modes and the cost unit prices and screening performance parameters of all filter modes are obtained. The total cost of every combination is computed with the preset total cost formula for that combination of data collection mode and filter mode, the combination with the lowest total cost is selected as the combination scheme for the task, and the scheme is sent to the corresponding device to obtain training samples. Calculating the total cost of the data collection mode and the filter mode establishes a scientific evaluation system that takes both filter algorithm performance and algorithm efficiency into account while avoiding the subjectivity of human experience. This is one of the innovations of the embodiments of the present invention.
2. The cost unit price of each production stage is obtained from a cost-based evaluation of that stage, and the performance parameters of each filter are obtained from an evaluation of the filter mode. The cost parameters of each production stage are evaluated and then aggregated into the total cost of a combination, so that a scientific evaluation system is built on the per-stage cost evaluation. This is one of the innovations of the embodiments of the present invention.
3. Starting from the actual screening process, the costs of data collection, data storage, data screening, and data labeling are calculated separately for each screening scheme and then summed to give the total cost of the scheme. The cost formulas for the data collection modes and filter modes yield the total cost of every combination, which provides a quantitative basis for choosing among combinations, takes both filter algorithm performance and algorithm efficiency into account, and avoids the subjectivity of human experience. This is one of the innovations of the embodiments of the present invention.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative work.
FIG. 1 is a schematic flowchart of a method for obtaining training samples according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a device for obtaining training samples according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative work fall within the protection scope of the present invention.
It should be noted that the terms "including" and "having" and any variations of them in the embodiments and drawings of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
The embodiments of the present invention disclose a method and a device for obtaining training samples. By calculating the total cost of the data collection mode and the filter mode, a scientific evaluation system is established that takes both filter algorithm performance and algorithm efficiency into account while avoiding the subjectivity of human experience. Each is described in detail below.
FIG. 1 is a schematic flowchart of a method for obtaining training samples according to an embodiment of the present invention. The method includes the following steps:
S102: Receive an acquisition task for training samples, where the acquisition task includes the number of training samples to be acquired and the proportion of the training samples in the overall data set.
To compare several combinations, the natural approach is to choose a metric (such as cost) and compare the combinations under it. Concretely: suppose N training samples that meet the requirements must be screened out of the data set, and calculate the total cost of each combination. When the acquisition task is received, the number of training samples to be acquired and the proportion of the training samples in the overall data set are read from it. These two values determine the total cost of the data collection, data storage, data screening, and data labeling stages. The proportion of samples to be screened measures the difficulty of the screening task; because its exact value is hard to obtain in practice, a value estimated in advance from experience is generally used.
S104: Obtain the cost unit price corresponding to each data collection mode.
The data collection modes include the special collection mode and the crowdsourced collection mode. Special collection dispatches dedicated collection vehicles for a specific task. Its process consists of data collection, data storage (generally on hard disks), and data recovery (copying the hard disk data to a suitable location). Based on this process, the cost unit price of the special collection mode is taken to be the data collection cost, that is, the collection cost per unit time. Crowdsourced collection gathers data through data collection equipment installed on outsourced vehicles. Its process consists of data collection, data transmission (mainly over a mobile data network), and data recovery (placing the received data in a suitable location). Based on this process, the cost unit price of the crowdsourced collection mode is taken to be the traffic cost of transmitting each sample and the depreciation cost of the equipment installed on the crowdsourced vehicles. The cost unit price of the data collection mode determines the total cost of the data collection stage.
S106: Obtain the screening performance parameters and cost unit price corresponding to each filter mode.
The filter modes include uniform sampling, manual screening, the vehicle-side filter, the cloud filter, and combined filters.
The vehicle-side filter is installed on the vehicle, so data can be screened by it directly during collection. The screening performance parameters of the vehicle-side filter mode include the precision, recall, and speed of the vehicle-side filter.
The cloud filter requires the data collected by the vehicle to be recovered first and screens it offline, where large servers of various kinds can be used. The screening performance parameters of the cloud filter mode include the precision, recall, and speed of the cloud filter, and its cost unit price includes the operating cost of cloud computing resources per unit time.
For comparison, two special screening modes are also considered: uniform sampling and manual screening. Uniform sampling obtains sample data directly from the collected data by sampling at uniform intervals. Manual screening selects training samples from the recovered data by human inspection; its screening performance parameters include the manual screening speed, and its cost unit price includes the manual screening cost per unit time.
In addition, there are combined filters: the first combined filter is a combination of uniform sampling and manual screening, the second combined filter is a combination of the cloud filter and manual screening, and the third combined filter is a combination of the vehicle-side filter, the cloud filter, and manual screening.
The screening performance parameters of a filter mode affect the total cost of the data collection, data storage, data screening, and data labeling stages, and the cost unit price of a filter mode determines the total cost of the data screening stage.
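To illustrate why a filter's precision, recall, and speed feed into the cost of several stages, the sketch below derives expected data volumes from the standard definitions of precision and recall. It is an assumption-laden illustration and not one of the patent's cost formulas (those appear only in the formula figures): the names `FilterStage` and `expected_volumes` are hypothetical, and the model assumes a single filter applied to raw data in which a fraction p of items are target samples.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FilterStage:
    """Performance parameters of one filter (hypothetical container)."""
    precision: float  # fraction of kept items that are true targets
    recall: float     # fraction of true targets that the filter keeps
    speed: float      # items processed per unit time


def expected_volumes(
    n_targets: int, proportion: float, stage: FilterStage
) -> Tuple[float, float, float]:
    """Expected (raw items in, items kept, processing time) to obtain n_targets.

    Follows directly from the definitions:
      targets kept = raw_in * proportion * recall  ->  raw_in = N / (p * R)
      targets kept = kept * precision              ->  kept   = N / P
    """
    raw_in = n_targets / (proportion * stage.recall)
    kept = n_targets / stage.precision
    processing_time = raw_in / stage.speed
    return raw_in, kept, processing_time
```

Under this simplified model, lower recall raises the amount of raw data that has to be collected and processed, while lower precision raises the number of items that must be stored and handled downstream.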
S108: Obtain the storage cost per unit of data and the unit labeling cost of manually labeling one training sample.
The storage cost per unit of data determines the total cost of the data storage stage, and the unit labeling cost determines the total cost of the data labeling stage.
S110: Determine the combinations for raw data acquisition, where each combination includes a data collection mode and a filter mode.
The combinations of data collection mode and filter mode are: special collection and uniform sampling; special collection and the vehicle-side filter; special collection and the cloud filter; special collection and the first combined filter; special collection and the second combined filter; special collection and the third combined filter; crowdsourced collection and the second combined filter; and crowdsourced collection and the third combined filter. Each combination has its own total cost formula.
S112: For each combination, calculate the total cost of obtaining the training samples from the data set with that combination, based on the cost unit price of its data collection mode and the screening performance parameters and cost unit price of its filter mode.
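The eight combinations listed in step S110 can be written down as data, which makes it easy to iterate over them when computing costs. The enum and constant names below are illustrative choices made for this sketch and do not come from the source; only the pairing of modes and the composition of the three combined filters follow the text.

```python
from enum import Enum


class CollectionMode(Enum):
    SPECIAL = "special collection"
    CROWDSOURCED = "crowdsourced collection"


class FilterMode(Enum):
    UNIFORM_SAMPLING = "uniform sampling"
    VEHICLE_FILTER = "vehicle-side filter"
    CLOUD_FILTER = "cloud filter"
    COMBINED_1 = "first combined filter"
    COMBINED_2 = "second combined filter"
    COMBINED_3 = "third combined filter"


# Composition of the combined filters as described in the text.
COMBINED_FILTERS = {
    FilterMode.COMBINED_1: ("uniform sampling", "manual screening"),
    FilterMode.COMBINED_2: ("cloud filter", "manual screening"),
    FilterMode.COMBINED_3: ("vehicle-side filter", "cloud filter", "manual screening"),
}

# The eight (collection mode, filter mode) combinations listed in step S110.
COMBINATIONS = [
    (CollectionMode.SPECIAL, FilterMode.UNIFORM_SAMPLING),
    (CollectionMode.SPECIAL, FilterMode.VEHICLE_FILTER),
    (CollectionMode.SPECIAL, FilterMode.CLOUD_FILTER),
    (CollectionMode.SPECIAL, FilterMode.COMBINED_1),
    (CollectionMode.SPECIAL, FilterMode.COMBINED_2),
    (CollectionMode.SPECIAL, FilterMode.COMBINED_3),
    (CollectionMode.CROWDSOURCED, FilterMode.COMBINED_2),
    (CollectionMode.CROWDSOURCED, FilterMode.COMBINED_3),
]
```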
When the combination of data collection mode and filter mode is special collection and uniform sampling, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000009
When the combination of data collection mode and filter mode is special collection and the vehicle-side filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000010
When the combination of data collection mode and filter mode is special collection and the cloud filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000011
When the combination of data collection mode and filter mode is special collection and the first combined filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000012
When the combination of data collection mode and filter mode is special collection and the second combined filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000013
When the combination of data collection mode and filter mode is special collection and the third combined filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000014
When the combination of data collection mode and filter mode is crowdsourced collection and the second combined filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000015
When the combination of data collection mode and filter mode is crowdsourced collection and the third combined filter, the total cost formula for the combination is:
Figure PCTCN2020121356-appb-000016
Here, C_total is the total cost; N is the number of training samples; p is the proportion of training samples in the overall data set; T is the time interval of uniform sampling; R_car, f_car, and P_car are the recall, speed, and precision of the vehicle-side filter; R_cloud, f_cloud, and P_cloud are the recall, speed, and precision of the cloud filter; f_person is the manual screening speed; c_collection is the collection cost per unit time; c_device is the depreciation cost of the equipment installed on a crowdsourced vehicle; c_netword is the traffic cost of transmitting each sample; c_store is the storage cost per unit of data; c_resource is the running cost of cloud computing resources per unit time; c_person is the manual screening cost per unit time; and c_label is the unit labeling cost, that is, the cost of manually labeling one training sample.
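For readers implementing the calculation, the symbols defined above can be gathered in one record before they are substituted into a formula. The sketch below is an assumption-laden illustration: the class name `CostParameters` and the validation ranges (rates in (0, 1], speeds and the sampling interval positive, costs non-negative) are not stated in the application and are chosen here only as plausible sanity checks. The field names mirror the document's symbols, including its spelling `c_netword`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParameters:
    """Hypothetical container for the symbols used in the total cost formulas."""
    N: int               # number of training samples to obtain
    p: float             # proportion of training samples in the overall data set
    T: float             # time interval of uniform sampling
    R_car: float         # recall of the vehicle-side filter
    f_car: float         # vehicle-side filter speed (data items per unit time)
    P_car: float         # precision of the vehicle-side filter
    R_cloud: float       # recall of the cloud filter
    f_cloud: float       # cloud filter speed (data items per unit time)
    P_cloud: float       # precision of the cloud filter
    f_person: float      # manual screening speed (data items per unit time)
    c_collection: float  # collection cost per unit time
    c_device: float      # depreciation cost of equipment on a crowdsourced vehicle
    c_netword: float     # traffic cost of transmitting each sample
    c_store: float       # storage cost per unit of data
    c_resource: float    # cloud computing running cost per unit time
    c_person: float      # manual screening cost per unit time
    c_label: float       # cost of manually labeling one training sample

    def __post_init__(self) -> None:
        # Assumed sanity checks; the application does not specify value ranges.
        if self.N <= 0:
            raise ValueError("N must be positive")
        for name in ("p", "R_car", "P_car", "R_cloud", "P_cloud"):
            if not 0.0 < getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must lie in (0, 1]")
        for name in ("T", "f_car", "f_cloud", "f_person"):
            if getattr(self, name) <= 0.0:
                raise ValueError(f"{name} must be positive")
        for name in ("c_collection", "c_device", "c_netword", "c_store",
                     "c_resource", "c_person", "c_label"):
            if getattr(self, name) < 0.0:
                raise ValueError(f"{name} must be non-negative")
```

Rejecting out-of-range values early keeps a zero recall or speed from reaching a formula, where such quantities would typically appear in a denominator.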
Each combination has its own total cost formula. Once the parameters described in S102, S104, S106, and S108 have been obtained, those that appear in a formula are substituted into it to give the total cost of the corresponding combination, which provides a quantitative basis for choosing among the combinations.
S114: Select the lowest total cost from the total costs obtained, and determine the combination of data collection mode and filter mode corresponding to that lowest total cost as the raw data acquisition combination scheme with which the acquisition task obtains the training samples.
Comparing the total costs of all combinations identifies the combination with the lowest total cost, which is then adopted as the raw data acquisition combination scheme for the acquisition task. This takes both the performance and the efficiency of the filter algorithms into account while avoiding the subjectivity of human experience.
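The comparison in S114 amounts to evaluating one cost function per combination and taking the minimum. A minimal sketch follows, under clearly stated assumptions: `select_cheapest` is a hypothetical helper name, and the two cost functions are placeholders invented for the example. They are not the formulas of this application, which are available here only as figure references.

```python
from typing import Callable, Dict, Mapping, Tuple

# A cost function maps the obtained parameters to a total cost.
CostFn = Callable[[Mapping[str, float]], float]


def select_cheapest(cost_fns: Dict[str, CostFn],
                    params: Mapping[str, float]) -> Tuple[str, Dict[str, float]]:
    """Evaluate every combination and return the cheapest one plus all totals."""
    if not cost_fns:
        raise ValueError("at least one combination is required")
    totals = {name: fn(params) for name, fn in cost_fns.items()}
    best = min(totals, key=lambda name: totals[name])
    return best, totals


# Placeholder cost functions (NOT the application's formulas), for illustration only.
placeholder_fns: Dict[str, CostFn] = {
    "combination A (placeholder)": lambda q: 2.0 * q["N"],
    "combination B (placeholder)": lambda q: 1.5 * q["N"] + 10.0,
}

best, totals = select_cheapest(placeholder_fns, {"N": 100.0})
```

Returning the full table of totals alongside the winner keeps the decision auditable, which fits the stated aim of replacing subjective judgment with a quantitative basis.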
S116: Send the data acquisition combination scheme to the corresponding device to acquire the raw data set from which the training samples are screened.
As can be seen from the above, on receiving an acquisition task for training samples, this embodiment obtains from the task the number of training samples to be acquired and the proportion of training samples in the overall data set, obtains the cost unit price of every data collection mode and the cost unit price and screening performance parameters of every filter mode, calculates the total cost of every combination using the preset total cost formula for that combination of data collection mode and filter mode, selects the combination with the lowest total cost as the combination scheme for the task, and sends that scheme to the corresponding device to obtain the training samples. Following the actual screening workflow, the costs of the data collection, data storage, data screening, and data labeling steps are calculated separately for each combination and summed to give its total cost; the lowest total cost is then selected, and the corresponding combination becomes the combination scheme for the acquisition task. Calculating the total cost of each pairing of data collection mode and filter mode establishes a scientific evaluation system that takes the performance and efficiency of the filter algorithms into account while avoiding the subjectivity of human experience. It overcomes the difficulty in the prior art of selecting the most suitable filter for the actual situation, and it avoids disagreements among developers that stem from differences in subjective experience.
FIG. 2 is a schematic structural diagram of an apparatus for acquiring training samples provided by an embodiment of the present invention. Referring to FIG. 2, the apparatus may include:
a receiving module 202, configured to receive an acquisition task for training samples, the acquisition task including the number of training samples to be acquired and the proportion of training samples in the overall data set;
a collection cost obtaining module 204, configured to obtain the cost unit price corresponding to each data collection mode;
a screening performance parameter and cost obtaining module 206, configured to obtain the screening performance parameters and cost unit price corresponding to each filter mode;
a storage and labeling cost obtaining module 208, configured to obtain the storage cost per unit of data and the unit labeling cost of manually labeling one training sample;
a determining module 210, configured to determine the combinations for raw data acquisition, each combination including a data collection mode and a filter mode;
a calculation module 212, configured to calculate, for each combination, the total cost of obtaining the training samples from the data set with that combination, based on the cost unit price of its data collection mode and on the screening performance parameters and cost unit price of its filter mode;
a comparison module 214, configured to select the lowest total cost from the total costs obtained, and to determine the combination of data collection mode and filter mode corresponding to that lowest total cost as the raw data acquisition combination scheme with which the acquisition task obtains the training samples;
a sending module 216, configured to send the data acquisition combination scheme to the corresponding device to acquire the raw data set from which the training samples are screened.
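The data flow through modules 202 to 216 can be sketched as a single function. This is a hypothetical illustration, not the structure of the apparatus itself: `AcquisitionTask`, `run_apparatus`, and the `send` callback are invented names, and the cost functions are assumed to already have the unit prices and filter parameters gathered by modules 204 to 208 bound into them.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AcquisitionTask:
    n_samples: int     # number of training samples to acquire
    proportion: float  # proportion of training samples in the overall data set


def run_apparatus(task: AcquisitionTask,
                  cost_fns: Dict[str, Callable[[AcquisitionTask], float]],
                  send: Callable[[str], None]) -> str:
    # Receiving module 202: accept and sanity-check the task (checks are assumptions).
    if task.n_samples <= 0 or not 0.0 < task.proportion <= 1.0:
        raise ValueError("invalid acquisition task")
    # Modules 204-208: unit prices and performance parameters are assumed to be
    # captured inside each cost function already.
    # Determining module 210: the candidate combinations are the dictionary keys.
    if not cost_fns:
        raise ValueError("no combinations to evaluate")
    # Calculation module 212: total cost per combination.
    totals = {name: fn(task) for name, fn in cost_fns.items()}
    # Comparison module 214: pick the combination with the lowest total cost.
    best = min(totals, key=lambda name: totals[name])
    # Sending module 216: forward the chosen scheme to the corresponding device.
    send(best)
    return best
```

Passing `send` as a callback keeps the sketch independent of how the scheme actually reaches the vehicle or cloud device, which the description leaves open.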
As can be seen from the above, on receiving an acquisition task for training samples, this apparatus obtains from the task the number of training samples to be acquired and the proportion of training samples in the overall data set, obtains the cost unit price of every data collection mode and the cost unit price and screening performance parameters of every filter mode, calculates the total cost of every combination using the preset total cost formula for that combination of data collection mode and filter mode, selects the combination with the lowest total cost as the combination scheme for the task, and sends that scheme to the corresponding device to obtain the training samples. Following the actual screening workflow, the costs of the data collection, data storage, data screening, and data labeling steps are calculated separately for each combination and summed to give its total cost; the lowest total cost is then selected, and the corresponding combination becomes the combination scheme for the acquisition task. Calculating the total cost of each pairing of data collection mode and filter mode establishes a scientific evaluation system that takes the performance and efficiency of the filter algorithms into account while avoiding the subjectivity of human experience. It overcomes the difficulty in the prior art of selecting the most suitable filter for the actual situation, and it avoids disagreements among developers that stem from differences in subjective experience.
Those of ordinary skill in the art will understand that the drawings are only schematic diagrams of one embodiment, and that the modules or processes shown in them are not necessarily required to implement the present invention.
Those of ordinary skill in the art will also understand that the modules of the apparatus in an embodiment may be distributed within that apparatus as described, or may, with corresponding changes, be located in one or more apparatuses different from the one in this embodiment. The modules of the above embodiments may be combined into a single module or further divided into multiple sub-modules.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, and that some of their technical features may be replaced with equivalents, without such modifications or replacements causing the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A method for obtaining training samples, characterized by comprising:
    receiving an acquisition task for training samples, the acquisition task including the number of training samples to be acquired and the proportion of training samples in the overall data set;
    obtaining the cost unit price corresponding to each data collection mode, wherein the data collection modes include dedicated collection and crowdsourced collection, the cost unit price of the dedicated collection mode includes the collection cost per unit time, and the cost unit price of the crowdsourced collection mode includes the traffic cost of transmitting each sample and the depreciation cost of the equipment installed on a crowdsourced vehicle;
    obtaining the screening performance parameters and cost unit price corresponding to each filter mode, wherein the filter modes include uniform sampling, manual screening, the vehicle-side filter, the cloud filter, and combined filters; the screening performance parameters of the uniform sampling mode include the time interval of uniform sampling; the screening performance parameters of the manual screening mode include the manual screening speed, and its cost unit price includes the manual screening cost per unit time; the screening performance parameters of the vehicle-side filter mode include the precision, recall, and speed of the vehicle-side filter; the screening performance parameters of the cloud filter mode include the precision, recall, and speed of the cloud filter, and its cost unit price includes the running cost of cloud computing resources per unit time; the manual screening speed is the number of data items processed manually per unit time, the vehicle-side filter speed is the number of data items processed by the vehicle-side filter per unit time, and the cloud filter speed is the number of data items processed by the cloud filter per unit time;
    obtaining the storage cost per unit of data and the unit labeling cost of manually labeling one training sample;
    determining the combinations for raw data acquisition, each combination including a data collection mode and a filter mode;
    for each combination, calculating the total cost of obtaining the training samples from the data set with that combination, based on the cost unit price of its data collection mode and on the screening performance parameters and cost unit price of its filter mode;
    selecting the lowest total cost from the total costs obtained, and determining the combination of data collection mode and filter mode corresponding to that lowest total cost as the raw data acquisition combination scheme with which the acquisition task obtains the training samples;
    sending the raw data acquisition combination scheme to the corresponding device to acquire the raw data set from which the training samples are screened.
  2. The method according to claim 1, characterized in that, when the data collection mode and the filter mode are combined as dedicated collection and uniform sampling, the total cost of the combination is calculated as:
    Figure PCTCN2020121356-appb-100001
    where C_total is the total cost, N is the number of training samples, p is the proportion of training samples in the overall data set, T is the time interval of uniform sampling, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, and c_label is the unit labeling cost.
  3. The method according to claim 1, characterized in that, when the data collection mode and the filter mode are combined as dedicated collection and the vehicle-side filter, the total cost of the combination is calculated as:
    Figure PCTCN2020121356-appb-100002
    where C_total is the total cost, N is the number of training samples, p is the proportion of training samples in the overall data set, R_car is the recall of the vehicle-side filter, f_car is the vehicle-side filter speed, P_car is the precision of the vehicle-side filter, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, and c_label is the unit labeling cost.
  4. The method according to claim 1, characterized in that, when the data collection mode and the filter mode are combined as dedicated collection and the cloud filter, the total cost of the combination is calculated as:
    Figure PCTCN2020121356-appb-100003
    where C_total is the total cost, N is the number of training samples, p is the proportion of training samples in the overall data set, T is the time interval of uniform sampling, R_cloud is the recall of the cloud filter, f_cloud is the cloud filter speed, P_cloud is the precision of the cloud filter, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, c_resource is the running cost of cloud computing resources per unit time, and c_label is the unit labeling cost.
  5. The method according to claim 1, characterized in that, when the data collection mode and the filter mode are combined as dedicated collection and the first combined filter, the first combined filter being a combination of uniform sampling and manual screening, the total cost of the combination is calculated as:
    Figure PCTCN2020121356-appb-100004
    where C_total is the total cost, N is the number of training samples, p is the proportion of training samples in the overall data set, T is the time interval of uniform sampling, f_person is the manual screening speed, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
  6. The method according to claim 1, characterized in that, when the data collection mode and the filter mode are combined as dedicated collection and the second combined filter, the second combined filter being a combination of the cloud filter and manual screening, the total cost of the combination is calculated as:
    Figure PCTCN2020121356-appb-100005
    where C_total is the total cost, N is the number of training samples, p is the proportion of training samples in the overall data set, T is the time interval of uniform sampling, R_cloud is the recall of the cloud filter, f_cloud is the cloud filter speed, P_cloud is the precision of the cloud filter, f_person is the manual screening speed, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, c_resource is the running cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
  7. The method according to claim 1, characterized in that, when the data collection mode and the filter mode are combined as dedicated collection and the third combined filter, the third combined filter being a combination of the vehicle-side filter, the cloud filter, and manual screening, the total cost of the combination is calculated as:
    Figure PCTCN2020121356-appb-100006
    where C_total is the total cost, N is the number of training samples, p is the proportion of training samples in the overall data set, R_car is the recall of the vehicle-side filter, f_car is the vehicle-side filter speed, P_car is the precision of the vehicle-side filter, R_cloud is the recall of the cloud filter, f_cloud is the cloud filter speed, P_cloud is the precision of the cloud filter, f_person is the manual screening speed, c_collection is the collection cost per unit time, c_store is the storage cost per unit of data, c_resource is the running cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
  8. The method according to claim 1, characterized in that, when the data collection mode and the filter mode are combined as crowdsourced collection and the second combined filter, the total cost of the combination is calculated as:
    Figure PCTCN2020121356-appb-100007
    Figure PCTCN2020121356-appb-100008
    where C_total is the total cost, N is the number of training samples, p is the proportion of training samples in the overall data set, R_cloud is the recall of the cloud filter, f_cloud is the cloud filter speed, P_cloud is the precision of the cloud filter, f_person is the manual screening speed, c_netword is the traffic cost of transmitting each sample, c_store is the storage cost per unit of data, c_resource is the running cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
  9. The method according to claim 1, characterized in that, when the data collection mode and the filter mode are combined as crowdsourced collection and the third combined filter, the total cost of the combination is calculated as:
    Figure PCTCN2020121356-appb-100009
    where C_total is the total cost, N is the number of training samples, p is the proportion of training samples in the overall data set, R_car is the recall of the vehicle-side filter, f_car is the vehicle-side filter speed, P_car is the precision of the vehicle-side filter, R_cloud is the recall of the cloud filter, f_cloud is the cloud filter speed, P_cloud is the precision of the cloud filter, f_person is the manual screening speed, c_device is the depreciation cost of the equipment installed on a crowdsourced vehicle, c_netword is the traffic cost of transmitting each sample, c_store is the storage cost per unit of data, c_resource is the running cost of cloud computing resources per unit time, c_person is the manual screening cost per unit time, and c_label is the unit labeling cost.
  10. An apparatus for acquiring training samples, characterized by comprising:
    a receiving module, configured to receive an acquisition task for training samples, the acquisition task including the number of training samples to be acquired and the proportion of training samples in the overall data set;
    a collection cost obtaining module, configured to obtain the cost unit price corresponding to each data collection mode;
    a screening performance parameter and cost obtaining module, configured to obtain the screening performance parameters and cost unit price corresponding to each filter mode;
    a storage and labeling cost obtaining module, configured to obtain the storage cost per unit of data and the unit labeling cost of manually labeling one training sample;
    a determining module, configured to determine the combinations for raw data acquisition, each combination including a data collection mode and a filter mode;
    a calculation module, configured to calculate, for each combination, the total cost of obtaining the training samples from the data set with that combination, based on the cost unit price of its data collection mode and on the screening performance parameters and cost unit price of its filter mode;
    a comparison module, configured to select the lowest total cost from the total costs obtained, and to determine the combination of data collection mode and filter mode corresponding to that lowest total cost as the raw data acquisition combination scheme with which the acquisition task obtains the training samples;
    a sending module, configured to send the data acquisition combination scheme to the corresponding device to acquire the raw data set from which the training samples are screened.
PCT/CN2020/121356 2019-11-28 2020-10-16 Training sample acquisition method and apparatus WO2021103835A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911188397.0 2019-11-28
CN201911188397.0A CN112861898B (en) 2019-11-28 2019-11-28 Training sample acquisition method and device

Publications (1)

Publication Number Publication Date
WO2021103835A1 true WO2021103835A1 (en) 2021-06-03

Family

ID=75985256

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121356 WO2021103835A1 (en) 2019-11-28 2020-10-16 Training sample acquisition method and apparatus

Country Status (2)

Country Link
CN (1) CN112861898B (en)
WO (1) WO2021103835A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447031A (en) * 2014-08-28 2016-03-30 百度在线网络技术(北京)有限公司 Training sample labeling method and device
CN107909088A (en) * 2017-09-27 2018-04-13 百度在线网络技术(北京)有限公司 Obtain method, apparatus, equipment and the computer-readable storage medium of training sample

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446783B (en) * 2018-11-16 2023-07-25 山东浪潮科学研究院有限公司 Image recognition efficient sample collection method and system based on machine crowdsourcing
CN113505730A (en) * 2021-07-26 2021-10-15 全景智联(武汉)科技有限公司 Model evaluation method, device, equipment and storage medium based on mass data

Also Published As

Publication number Publication date
CN112861898B (en) 2022-06-10
CN112861898A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN107786994B (en) User perception quality difference analysis method and system for end-to-end wireless service
CN109376660B (en) Target monitoring method, device and system
CN110930705B (en) Intersection traffic decision system, method and equipment
CN109982361B (en) Signal interference analysis method, device, equipment and medium
WO2018184304A1 (en) Method and device for detecting health state of network element
CN101695170A (en) Wireless communication network testing data collection and analysis method based on intelligent mobile phone
US12019059B2 (en) Detecting equipment defects using lubricant analysis
CN110472581A (en) A kind of cell image analysis method based on deep learning
EP3326109A1 (en) System and method for providing a recipe
CN115660288A (en) Analysis management system based on internet big data
CN113704077A (en) Test case generation method and device
WO2021103835A1 (en) Training sample acquisition method and apparatus
CN116934507B (en) Intelligent claim settlement method and system based on big data driving
CN109995549B (en) Method and device for evaluating flow value
CN113505980A (en) Reliability evaluation method, device and system for intelligent traffic management system
CN116452154B (en) Project management system suitable for communication operators
CN117275644A (en) Detection result mutual recognition method, system and storage medium based on deep learning
CN104392101B (en) Data sharing method and device
CN108692709B (en) Farmland disaster detection method and system, unmanned aerial vehicle and cloud server
CN116385045A (en) Data processing method, device and equipment for receiving and hosting additional service
CN110413902A (en) The online recommended method in vehicle salvage shop, device, equipment and storage medium
JP2017208717A (en) Analysis system for radio communication network
CN109167673A (en) A kind of Novel cloud service screening technique of abnormal fusion Qos Data Detection
CN115222276A (en) Bidding analysis and evaluation method and device
CN114782115A (en) Method, system and terminal device for recommending site selection of private stores

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20894385

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20894385

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.01.2023)
