WO2017148266A1 - Training method and training system for a machine learning system - Google Patents

Training method and training system for a machine learning system

Info

Publication number
WO2017148266A1
WO2017148266A1 (PCT/CN2017/073719)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
machine learning
learning system
sample data
sample set
Prior art date
Application number
PCT/CN2017/073719
Other languages
English (en)
French (fr)
Inventor
周俊 (Zhou Jun)
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority to JP2018544075A (patent JP6991983B2)
Publication of WO2017148266A1
Priority to US16/114,078 (patent US11720787B2)
Priority to US18/342,204 (patent US20230342607A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present application relates to the field of big data processing, and in particular, to a training method and a training system for a machine learning system.
  • A machine learning system is designed to mimic the neural network of the human brain and is used to predict user behavior. Before a machine learning system goes online, it needs to be trained on large-scale data. However, during training, large-scale data inevitably requires large-scale machine resources to process effectively. For example, Tencent's advertising data is at the petabyte level and requires more than a thousand machines, which is a huge cost for most companies.
  • To reduce this cost, the usual approach is to reduce the amount of data processed by the machine learning system by means of random sampling.
  • Random sampling discards samples with a certain probability: for example, a floating-point number in the range 0-1 is randomly generated for each sample, and the sample is discarded outright when that number exceeds a threshold.
  • However, randomly discarding samples throws away a large amount of useful data, impairs the training effect of the machine learning system, and reduces prediction accuracy.
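The random-sampling baseline described above can be sketched as follows; this is an illustrative snippet, not code from the application, and the threshold value is arbitrary:

```python
import random

def random_downsample(samples, threshold=0.5):
    """Keep a sample only when a uniform draw in [0, 1) does not exceed
    the threshold; otherwise the sample is discarded outright."""
    kept = []
    for sample in samples:
        if random.random() <= threshold:
            kept.append(sample)
    return kept

random.seed(0)
subset = random_downsample(list(range(1000)))
print(len(subset))  # roughly half of the samples survive
```

The drawback the text points out is visible here: whether a sample survives is completely independent of how useful it is.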
  • In view of the above problems, embodiments of the present application are proposed to provide a training method and a training system for a machine learning system that overcome, or at least partially solve, the above problems.
  • To solve the above problems, an embodiment of the present application discloses a training method of a machine learning system, which uses a plurality of sample data to train the machine learning system. The training method includes:
  • obtaining a plurality of sample sets, each sample set including sample data in a corresponding sampling time period;
  • setting, according to the sampling time period corresponding to each sample set, a sampling rate for that sample set;
  • obtaining the plurality of sample sets sampled according to the sampling rates;
  • respectively determining importance values of the plurality of sampled sample sets;
  • correcting each sample data in the plurality of sampled sample sets by using the importance values, to obtain corrected sample data; and
  • inputting each of the corrected sample data into the machine learning system to train the machine learning system.
  • Another embodiment of the present application discloses a training system for a machine learning system that trains a machine learning system using a plurality of sample data, the training system including:
  • a first acquiring module configured to obtain a plurality of sample sets, each sample set including sample data in a corresponding sampling time period
  • a sampling rate setting module configured to set a sampling rate corresponding to the sample set according to a sampling period corresponding to each sample set
  • a second acquiring module configured to obtain a plurality of sample sets sampled according to a sampling rate
  • an importance value determining module configured to respectively determine importance values of the plurality of sampled sample sets;
  • a sample data correction module configured to use the importance level value to correct each sample data in the plurality of sampled sample sets to obtain corrected sample data
  • a training module configured to input each of the modified sample data into a machine learning system to train the machine learning system.
  • The embodiments of the present application have at least the following advantages. Sample data is processed before being input into the machine learning system: sample sets are obtained according to sampling time periods, a sampling rate is set for each sample set according to its sampling time period, each set is sampled at its rate, importance values of the sampled sets are determined, and the sample data is corrected using those importance values before being input into the machine learning system for training. This reduces the amount of data the machine learning system must process while ensuring the adoption and utilization of important data, and it reduces the memory resource requirements of the machine while minimizing the impact on the learning effect of the machine learning system.
  • FIG. 1 is a flow chart of a training method of a machine learning system of a first embodiment of the present application.
  • FIG. 2 is a flow chart of a training method of the machine learning system of the second embodiment of the present application.
  • FIG. 3 is a flow chart of a training method of the machine learning system of the third embodiment of the present application.
  • FIG. 4 is a block diagram of a training system of a machine learning system of a fourth embodiment of the present application.
  • FIG. 5 is a block diagram of a training system of a machine learning system of a fifth embodiment of the present application.
  • FIG. 6 is a block diagram of a training system of a machine learning system of a sixth embodiment of the present application.
  • One of the core ideas of the present application is a training method and training system that trains a machine learning system using a plurality of sample data: the sample data is divided into a plurality of sample sets according to its sampling time periods; a sampling rate is set for each sample set according to its time period; each set is sampled at its rate, and the importance value corresponding to each sampled set is adjusted; each sample data is then corrected using the importance value, and the corrected sample data is input into the machine learning system to train it.
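Putting the core idea into one sketch (the names, data layout, and callback-style learner below are assumptions made for illustration, not the application's implementation):

```python
import random

def train_with_weighted_sampling(sample_sets, rates, importances, learn_step):
    """sample_sets, rates and importances are parallel lists ordered
    from the oldest sampling period to the newest. learn_step stands
    in for one update of the machine learning system."""
    for vectors, rate, importance in zip(sample_sets, rates, importances):
        for vec in vectors:
            if random.random() > rate:      # sample the set at its rate
                continue
            corrected = [importance * x for x in vec]  # weight the sample
            learn_step(corrected)           # feed the corrected sample in
```

With a rate of 1.0 every sample in the newest set reaches the learner, while an old set with rate 0.1 contributes only about a tenth of its samples, each re-weighted by its set's importance value.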
  • FIG. 1 is a flowchart of a training method of a machine learning system according to the first embodiment of the present application. The training method includes the following steps:
  • S101: Obtain a plurality of sample sets, each sample set including sample data in a corresponding sampling time period. Each sample data is, for example, a vector, and one of the dimensions of the vector is, for example, the sampling time of that sample data.
  • In this step, the sampling times of all sample data may be divided into multiple sampling time periods, and the sample data divided into a plurality of sample sets accordingly, each sample set corresponding to one sampling time period.
  • For example, if the sampling time of all sample data runs from January 24 to January 29, this span can be divided into multiple sampling time periods, such as January 29, January 27 to January 28, and January 24 to January 26. The sample data is accordingly divided into a sample set sampled on January 29, a sample set sampled from January 27 to January 28, and a sample set sampled from January 24 to January 26. Each sample set thus corresponds to one sampling time period.
  • It should be noted that the sampling time periods may be divided according to rules set by the developer or the user, and may be distributed evenly or unevenly; the present application is not limited in this respect.
  • S102: According to the sampling time period corresponding to each sample set, set the sampling rate of that sample set.
  • The sampling rate can be set on the principle that the newer the sampling time period of a sample set, the higher its sampling rate; that is, the sampling rate of a sample set increases as its sampling time period goes from old to new.
  • In the example above, the sampling rate of the sample set sampled on January 29 may be set to 1.0, that of the set sampled from January 27 to January 28 to 0.5, and that of the set sampled from January 24 to January 26 to 0.1.
  • S103: Obtain the plurality of sample sets sampled according to these sampling rates.
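Steps S101 to S103 can be illustrated with the January example; the 2016 dates, dictionary layout, and function names below are assumptions made for the sketch:

```python
import random
from datetime import date

# Periods from the example, ordered newest first, each with the sampling
# rate the text assigns to it (newer periods get higher rates).
PERIODS = [
    (date(2016, 1, 29), date(2016, 1, 29), 1.0),
    (date(2016, 1, 27), date(2016, 1, 28), 0.5),
    (date(2016, 1, 24), date(2016, 1, 26), 0.1),
]

def split_into_sample_sets(samples):
    """S101: group samples (each a dict with a 'time' field) by period."""
    sets = {i: [] for i in range(len(PERIODS))}
    for s in samples:
        for i, (start, end, _rate) in enumerate(PERIODS):
            if start <= s["time"] <= end:
                sets[i].append(s)
                break
    return sets

def sample_at_period_rates(sets):
    """S102 + S103: keep each sample with its period's probability."""
    return {
        i: [s for s in members if random.random() <= PERIODS[i][2]]
        for i, members in sets.items()
    }
```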
  • S104: Respectively determine importance values of the plurality of sampled sample sets.
  • The importance value may be a coefficient set manually or by a machine algorithm; the importance value corresponding to each sampled sample set may be set by hand or by a machine according to some rule. In this step, a new importance value may be set based on the original importance value of the sample set.
  • S105: Correct each sample data in the plurality of sampled sample sets by using the importance values, to obtain corrected sample data. For example, when each sample data is a vector, each feature dimension of the vector is multiplied by the importance value, scaling the vector up proportionally to obtain the corrected sample data.
  • For example, if the original or default importance value of a sample set is 1 and it is corrected to 2 in this step, a sample that was originally a = (1, 1, 1, 2, ..., n) is corrected to a = (2, 2, 2, 4, ..., 2n), which is the corrected sample data.
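The correction in step S105 is a per-dimension scaling; a minimal sketch:

```python
def correct_sample(vector, importance):
    """S105: multiply every feature dimension by the set's importance
    value, scaling the whole sample vector proportionally."""
    return [importance * x for x in vector]

# The a(1, 1, 1, 2, ...) example from the text with importance 2:
print(correct_sample([1, 1, 1, 2], 2))  # [2, 2, 2, 4]
```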
  • It should be noted that the manner of setting the importance value is not limited to coefficients set manually or by a machine algorithm; the present application is not limited in this respect.
  • S106: Input each of the corrected sample data into the machine learning system to train the machine learning system.
  • In summary, the first embodiment of the present application discloses a training method for a machine learning system that processes sample data before inputting it into the machine learning system. It reduces the amount of data while ensuring the adoption and utilization of important data, thereby reducing the memory resource requirements of the machine while minimizing the impact on the learning effect of the machine learning system.
  • FIG. 2 is a flowchart of a training method for a machine learning system according to the second embodiment of the present application. The training method includes the following steps:
  • Step S204 may include, for example:
  • Sub-step S204a: correct an initial importance value of each sampled sample set based on its sampling rate, to obtain the importance value of the sampled sample set.
  • Here the importance value is proportional to the initial importance value and inversely proportional to the sampling rate of the sampled sample set. That is, the new importance value can be calculated as the ratio of the importance value originally corresponding to the sample set to its sampling rate. For example, the importance value of each sample set may be initially set by the formula Y1 = Y / a, where:
  • Y1 is the newly set importance value corresponding to the sample set;
  • Y is the original importance value corresponding to the sample set;
  • a is the sampling rate of the sample set.
  • For example, if the sampling rate for the January 24 to January 26 period is 0.1 and the original importance value of that set is 0.2, the new importance value of the set is set to 0.2 / 0.1 = 2.
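Sub-step S204a divides the original importance value by the sampling rate, so heavily thinned sets get their weight back; a sketch:

```python
def initial_importance(original_importance, sampling_rate):
    """Sub-step S204a: Y1 = Y / a. Dividing by the sampling rate
    compensates sets that were sampled more aggressively."""
    return original_importance / sampling_rate

# The January 24-26 set: original importance 0.2, sampling rate 0.1.
print(initial_importance(0.2, 0.1))  # 2.0
```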
  • Step S204 may further include, for example:
  • Sub-step S204b: according to a preset rule, raise the importance value of the sample set corresponding to the latest sampling time period.
  • This preset rule may, for example, specify that the raised importance value of the latest sample set is proportional to its importance value before the raise, and proportional to the total number of sample sets. The importance value of the sample set corresponding to the latest sampling time period can thus be set again according to the formula Z1 = Z × b, where:
  • Z1 is the re-corrected importance value corresponding to the sample set;
  • Z is the importance value corresponding to the sample set before this raise;
  • b is the total number of sample sets.
  • For example, if the importance values of three sample sets, ordered from the oldest sampling period to the newest, are 2, 2, and 5 after sub-step S204a, then in this step the importance value of the third set, sampled in the latest period, is raised again to 5 × 3 = 15.
  • It should be noted that sub-step S204b may be performed before or after sub-step S204a, or on its own; that is, sub-step S204b is independent of sub-step S204a and does not depend on it.
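Sub-step S204b, as described, multiplies the newest set's importance by the total number of sets; a sketch using the 2, 2, 5 example:

```python
def boost_latest(importances):
    """Sub-step S204b: Z1 = Z * b for the set with the latest sampling
    period, where b is the total number of sample sets. The list is
    assumed ordered from the oldest period to the newest."""
    b = len(importances)
    boosted = list(importances)
    boosted[-1] *= b
    return boosted

print(boost_latest([2, 2, 5]))  # [2, 2, 15]
```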
  • S205: Correct, by using the importance values, each sample data in the plurality of sampled sample sets to obtain corrected sample data.
  • S205a: Multiply each importance value by each sample data in the corresponding sampled sample set to obtain the corrected sample data.
  • This step may be the same as or similar to step S106 in the first embodiment, and details are not described herein again.
  • In summary, the second embodiment of the present application discloses a training method for a machine learning system that processes sample data before inputting it into the machine learning system. By setting the importance values of the different sample sets, it reduces the amount of data while ensuring the adoption and utilization of important data, and reduces the memory resource requirements of the machine while minimizing the impact on the learning effect of the machine learning system.
  • FIG. 3 is a flowchart of a training method of a machine learning system according to the third embodiment of the present application. The training method includes the following steps:
  • steps S301 to S305 may be the same as or similar to the steps S101 to S105 disclosed in the first embodiment, and may be the same as or similar to the steps S201 to S205 disclosed in the second embodiment, and details are not described herein again.
  • S306: Input the corrected sample data into the machine learning system to train the machine learning system.
  • This step may include the following sub-steps:
  • Sub-step S306a: calculate the gradient of each corrected sample data. The gradient is the derivative of the loss function; it can be obtained by differentiating the loss function.
  • Sub-step S306b: use the following formula to reduce the number of bytes used to store each gradient, thereby reducing its precision, where:
  • rand() is a random floating-point number between 0 and d;
  • X1 is the low-precision floating-point number after reduction, for example a float requiring 4 bytes of computer storage;
  • X is the high-precision floating-point number before reduction, for example a double requiring 8 bytes of computer storage, i.e., the gradient of each sample data before its storage bytes are reduced.
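The formula itself did not survive in this extract, so the snippet below is only one plausible reading: reduce each 8-byte double to a 4-byte float after adding a small random offset rand() in [0, d), which makes the precision loss unbiased on average. The value of d and the use of Python's struct module are assumptions:

```python
import random
import struct

def reduce_gradient_precision(x, d=1e-7):
    """Store a high-precision gradient value x (8-byte double) as a
    4-byte float X1, adding a random offset in [0, d) beforehand so
    that the rounding error averages out over many gradients.
    NOTE: illustrative only; the patent's exact formula is not shown
    in this extract."""
    offset = random.random() * d
    # Round-trip through a 4-byte float representation.
    return struct.unpack("f", struct.pack("f", x + offset))[0]

g = reduce_gradient_precision(0.123456789012345)
print(abs(g - 0.123456789012345) < 1e-6)  # True: tiny precision loss
```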
  • In summary, the third embodiment of the present application discloses a training method for a machine learning system that processes sample data before inputting it into the machine learning system. It reduces the amount of data while ensuring the adoption and utilization of important data, and reduces the memory resource requirements of the machine while minimizing the impact on the learning effect of the machine learning system.
  • FIG. 4 is a block diagram of a training system for a machine learning system according to a fourth embodiment of the present application.
  • The training system of the machine learning system according to this embodiment uses a plurality of sample data to train a machine learning system. The training system 400 includes:
  • the first obtaining module 401 is configured to obtain a plurality of sample sets, where each sample set includes sample data in a corresponding sampling time period;
  • the sampling rate setting module 402 is configured to set a sampling rate corresponding to the sample set according to a sampling time period corresponding to each sample set;
  • a second obtaining module 403, configured to obtain a plurality of sample sets sampled according to a sampling rate
  • the importance level value determining module 404 is configured to separately set the importance level values of the plurality of sampled sample sets
  • the sample data correction module 405 is configured to use the importance level value to correct each sample data in the plurality of sampled sample sets to obtain corrected sample data;
  • the training module 406 is configured to input each of the modified sample data into a machine learning system to train the machine learning system.
  • Preferably, the sampling rate of a sample set increases as the sampling time period corresponding to the sample set goes from old to new.
  • In summary, the fourth embodiment of the present application discloses a training system for a machine learning system that processes sample data before inputting it into the machine learning system. It reduces the amount of data while ensuring the adoption and utilization of important data, thereby reducing the memory resource requirements of the machine while minimizing the impact on the learning effect of the machine learning system.
  • FIG. 5 is a block diagram of a training system of a machine learning system according to a fifth embodiment of the present application.
  • The training system of the machine learning system according to this embodiment uses a plurality of sample data to train a machine learning system. The training system 500 includes:
  • the first obtaining module 501, configured to obtain a plurality of sample sets, where each sample set includes sample data in a corresponding sampling time period;
  • the sampling rate setting module 502 is configured to set a sampling rate corresponding to the sample set according to a sampling period corresponding to each sample set;
  • a second obtaining module 503, configured to obtain a plurality of sample sets sampled according to a sampling rate
  • the importance level value determining module 504 is configured to separately set the importance level values of the plurality of sampled sample sets
  • the sample data correction module 505 is configured to use the importance level value to correct each sample data in the plurality of sampled sample sets to obtain corrected sample data;
  • the training module 506 is configured to input each of the modified sample data into a machine learning system to train the machine learning system.
  • the sample data correction module 505 is configured to:
  • Each of the importance level values is multiplied with each sample data in the corresponding sampled sample set to obtain corrected sample data.
  • the importance level value determining module 504 includes:
  • the initial correction sub-module 504a, configured to correct an initial importance value of each sampled sample set based on its sampling rate, to obtain the importance value of the sampled sample set;
  • where the importance value is proportional to the initial importance value and inversely proportional to the sampling rate of the sampled sample set.
  • Preferably, the initial correction sub-module may initially set the importance value of each sample set according to the formula Y1 = Y / a, where:
  • Y1 is the newly set importance value corresponding to the sample set;
  • Y is the original importance value corresponding to the sample set;
  • a is the sampling rate of the sample set.
  • the importance level value determining module 504 may further include:
  • the secondary correction sub-module 504b, configured to raise, according to a preset rule, the importance value of the sample set corresponding to the latest sampling time period.
  • Preferably, the preset rule includes:
  • the raised importance value of the sample set corresponding to the latest sampling time period is proportional to its importance value before the raise, and proportional to the total number of sample sets, i.e., Z1 = Z × b, where:
  • Z is the importance value corresponding to the sample set before the raise;
  • b is the total number of sample sets.
  • the sampling rate of the sample set increases as the sampling time period corresponding to the sample set changes from old to new.
  • In summary, the fifth embodiment of the present application discloses a training system for a machine learning system that processes sample data before inputting it into the machine learning system. By setting the importance values of the different sample sets, it reduces the amount of data while ensuring the adoption and utilization of important data, and reduces the memory resource requirements of the machine while minimizing the impact on the learning effect of the machine learning system.
  • FIG. 6 is a block diagram of a training system of a machine learning system according to a sixth embodiment of the present application.
  • The training system of the machine learning system according to this embodiment uses a plurality of sample data to train a machine learning system. The training system 600 includes:
  • the first obtaining module 601 is configured to obtain a plurality of sample sets, where each sample set includes sample data in a corresponding sampling time period;
  • the sampling rate setting module 602 is configured to set a sampling rate corresponding to the sample set according to a sampling time period corresponding to each sample set;
  • a second obtaining module 603, configured to obtain a plurality of sample sets sampled according to a sampling rate
  • the importance level value determining module 604 is configured to separately set the importance level values of the plurality of sampled sample sets
  • the sample data correction module 605 is configured to use the importance level value to correct each sample data in the plurality of sampled sample sets to obtain corrected sample data;
  • the training module 606 is configured to input each of the modified sample data into a machine learning system to train the machine learning system.
  • the training module 606 includes:
  • a calculation submodule 606a configured to calculate a gradient of each of the modified sample data
  • a precision reduction sub-module 606b for reducing the accuracy of each of the gradients
  • the training sub-module 606c, configured to input the reduced-precision gradients into the machine learning system to train the machine learning system.
  • Preferably, the precision reduction sub-module 606b is configured to reduce precision according to the formula described in the third embodiment, where:
  • rand() is a random floating-point number between 0 and d;
  • X1 is the floating-point number stored in fewer bytes after reduction;
  • X is the floating-point number stored in more bytes before reduction.
  • In summary, the sixth embodiment of the present application discloses a training system for a machine learning system that processes sample data before inputting it into the machine learning system. By setting the importance values of the different sample sets and reducing the precision of the gradients, it reduces the amount of data while ensuring the adoption and utilization of important data, and reduces the memory resource requirements of the machine while minimizing the impact on the learning effect of the machine learning system.
  • Since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, reference may be made to the description of the method embodiments.
  • Those skilled in the art will appreciate that the embodiments of the present application can be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information can be computer readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage,
  • magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • As defined herein, computer readable media does not include transitory computer readable media, such as modulated data signals and carrier waves.
  • Embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.


Abstract

A training method and training system for a machine learning system, which trains the machine learning system using a plurality of sample data. The method includes: obtaining a plurality of sample sets, each sample set including sample data in a corresponding sampling time period (S101); setting, according to the sampling time period corresponding to each sample set, a sampling rate for that sample set (S102); obtaining the plurality of sample sets sampled according to the sampling rates (S103); respectively determining importance values of the plurality of sampled sample sets (S104); correcting each sample data in the plurality of sampled sample sets by using the importance values, to obtain corrected sample data (S105); and inputting each of the corrected sample data into the machine learning system to train it (S106). Processing the sample data before it is input into the machine learning system reduces the memory resource requirements of the machine while minimizing the impact on the learning effect of the machine learning system.

Description

Training method and training system for a machine learning system
This application claims priority to Chinese Patent Application No. 201610113716.1, filed on February 29, 2016 and entitled "Training Method and Training System for a Machine Learning System", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of big data processing, and in particular to a training method and a training system for a machine learning system.
Background
In today's big data era, it has become very easy for Internet companies to acquire ultra-large-scale data. According to incomplete statistics, Google handled 3 billion queries and 30 billion ads per day in 2012; Facebook users shared 4.3 billion pieces of content per day in 2013; and Alibaba processed more than 700 million transactions on Singles' Day (November 11) 2015 alone. These companies use machine learning systems to mine the gold in this data, including user interests, behaviors, habits, and so on.
A machine learning system is designed to mimic the neural network of the human brain and is used to predict user behavior. Before a machine learning system goes online, it needs to be trained on large-scale data. However, during training, large-scale data inevitably requires large-scale machine resources to process effectively. For example, Tencent's advertising data is at the petabyte level and requires more than a thousand machines, which is a huge cost for most companies.
To reduce costs and improve the efficiency of machine learning systems, the usual approach is to reduce the amount of data processed by the machine learning system by means of random sampling. Random sampling discards samples with a certain probability: for example, a floating-point number in the range 0-1 is randomly generated for each sample, and the sample is discarded when the number exceeds a threshold. However, randomly discarding samples throws away a large amount of useful data, impairs the training effect of the machine learning system, and reduces prediction accuracy.
发明内容
鉴于上述问题,提出了本申请实施例以便提供一种克服上述问题或者至少部分地解决上述问题的机器学习系统的训练方法和训练系统。
To solve the above problems, an embodiment of this application discloses a training method for a machine learning system, which trains the machine learning system with a plurality of sample data, the training method comprising:

obtaining a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period;

setting, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set;

obtaining a plurality of sample sets sampled according to the sampling rates;

separately determining importance values for the plurality of sampled sample sets;

correcting, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data; and

inputting each of the corrected sample data into the machine learning system to train the machine learning system.
Another embodiment of this application discloses a training system for a machine learning system, which trains the machine learning system with a plurality of sample data, the training system comprising:

a first acquisition module configured to obtain a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period;

a sampling-rate setting module configured to set, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set;

a second acquisition module configured to obtain a plurality of sample sets sampled according to the sampling rates;

an importance-value determination module configured to separately set importance values for the plurality of sampled sample sets;

a sample-data correction module configured to correct, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data; and

a training module configured to input each of the corrected sample data into the machine learning system to train the machine learning system.
The embodiments of this application have at least the following advantages. They disclose a training method and training system for a machine learning system that process the sample data before inputting them into the machine learning system, including obtaining sample sets divided by sampling-time period, setting each sample set's sampling rate according to its sampling-time period, sampling according to those rates, determining importance values for the sampled sample sets, correcting the sample data with those importance values, and inputting the corrected sample data into the machine learning system for training. This reduces the amount of data the machine learning system processes while preserving the adoption and utilisation of important data, easing the machines' memory requirements while minimising the impact on the learning effect of the machine learning system.
Brief description of the drawings

FIG. 1 is a flowchart of a training method for a machine learning system according to the first embodiment of this application.

FIG. 2 is a flowchart of a training method for a machine learning system according to the second embodiment of this application.

FIG. 3 is a flowchart of a training method for a machine learning system according to the third embodiment of this application.

FIG. 4 is a block diagram of a training system for a machine learning system according to the fourth embodiment of this application.

FIG. 5 is a block diagram of a training system for a machine learning system according to the fifth embodiment of this application.

FIG. 6 is a block diagram of a training system for a machine learning system according to the sixth embodiment of this application.
Detailed description

The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Plainly, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application fall within the scope of protection of this application.

One core idea of this application is to propose a training method and training system for a machine learning system, which train the machine learning system with a plurality of sample data, including: dividing the sample data into a plurality of sample sets according to their sampling-time periods; setting each sample set's sampling rate according to its sampling-time period; sampling each sample set according to its rate and modifying the importance value corresponding to each sampled sample set; and correcting each sample datum with the importance value and inputting the corrected sample data into the machine learning system to train it.
First embodiment

The first embodiment of this application proposes a training method for a machine learning system. FIG. 1 is a flowchart of the training method of this embodiment, which comprises the following steps.

S101: obtain a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period.

In this step, each sample datum is, for example, a vector, and one of the vector's dimensions is, for example, the sampling time of that sample datum. The sampling times of all sample data can be divided into a plurality of sampling-time periods, and the sample data divided accordingly into a plurality of sample sets, each sample set corresponding to one sampling-time period.

For example, if all sample data were sampled between January 24 and January 29, that span can be divided into several sampling-time periods, e.g. three: January 29, January 27 to January 28, and January 24 to January 26. Following these three periods, the sample data are divided into a set sampled on January 29, a set sampled from January 27 to January 28, and a set sampled from January 24 to January 26. Each sample set thus corresponds to one sampling-time period.

Note that the sampling-time periods may be divided by rules set by the developer or user, and may be evenly or unevenly distributed; this application is not limited in this respect.
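As a concrete illustration of step S101, the partitioning described above can be sketched as follows. This is a minimal sketch under assumptions of this editor: the record layout, the dates, and the function name `partition` are illustrative and not part of the application.

```python
from datetime import date

# Hypothetical sample records: each is a feature vector plus its sampling
# time (the application treats the sampling time as one vector dimension).
samples = [
    ([1.0, 2.0], date(2016, 1, 29)),
    ([0.5, 1.5], date(2016, 1, 27)),
    ([3.0, 0.2], date(2016, 1, 24)),
]

# Sampling-time periods chosen by the developer; they need not be equal-sized.
periods = [
    (date(2016, 1, 29), date(2016, 1, 29)),
    (date(2016, 1, 27), date(2016, 1, 28)),
    (date(2016, 1, 24), date(2016, 1, 26)),
]

def partition(samples, periods):
    """Divide the sample data into one sample set per sampling-time period."""
    sets = {p: [] for p in periods}
    for features, sampled_on in samples:
        for start, end in periods:
            if start <= sampled_on <= end:
                sets[(start, end)].append((features, sampled_on))
                break
    return sets

sample_sets = partition(samples, periods)
```

Each of the three example samples lands in exactly one of the three period sets, mirroring the January 24 to January 29 example above.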
S102: set, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set.

In this step, the sampling rate of each sample set can be set according to its sampling-time period, for example on the principle that sample sets with newer sampling-time periods get higher sampling rates. That is, the sampling rate of a sample set increases as its corresponding sampling-time period runs from old to new. In the example above, the sampling rate of the sample set sampled on January 29 may be set to 1.0, that of the set sampled from January 27 to January 28 to 0.5, and that of the set sampled from January 24 to January 26 to 0.1.
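The newer-period-higher-rate principle of S102 can be written as a simple monotone mapping; the dictionary keys below are just labels for the three example periods and are this editor's notation, not the application's.

```python
# Sampling rates from the example above: the newest period keeps everything,
# and older periods are sampled ever more sparsely.
sampling_rates = {
    "2016-01-24..26": 0.1,
    "2016-01-27..28": 0.5,
    "2016-01-29": 1.0,
}

def check_monotone(rates_old_to_new):
    """Verify the rates never decrease as the periods run from old to new."""
    return all(a <= b for a, b in zip(rates_old_to_new, rates_old_to_new[1:]))

ordered = [sampling_rates[k]
           for k in ("2016-01-24..26", "2016-01-27..28", "2016-01-29")]
```

The concrete numbers are a design choice; the application only requires the rate to increase from old to new.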
S103: obtain a plurality of sample sets sampled according to the sampling rates.

In this step, the samples in each sample set can be sampled at the rate set in the previous step. For example, if a sample set contains 1000 sample data and its sampling rate is 0.1, the set contains 1000 × 0.1 = 100 sample data after sampling; the set formed by these 100 sample data may be called a sampled sample set.
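A minimal sketch of the sampling in S103, keeping each sample independently with probability equal to the set's sampling rate. The per-sample Bernoulli scheme matches the random sampling described in the background; the seeded generator is this editor's addition, purely for reproducibility.

```python
import random

def downsample(sample_set, rate, rng):
    """Keep each sample with probability `rate`; discard the rest."""
    return [s for s in sample_set if rng.random() < rate]

full_set = list(range(1000))           # a sample set with 1000 sample data
sampled_set = downsample(full_set, 0.1, random.Random(42))
# On average 1000 * 0.1 = 100 samples remain, as in the example above.
```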
S104: separately determine importance values for the plurality of sampled sample sets.

In one embodiment, the importance value may be a coefficient set manually or by a machine algorithm; the importance value of each sampled sample set can be set by hand or by machine according to some rule. In this step, a new importance value can be set on the basis of the sample set's original importance value.
S105: correct, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data.

In this step, each sample datum in the plurality of sampled sample sets can be corrected with the importance value to obtain corrected sample data.

Correcting each sample datum with the importance value may mean multiplying every feature dimension of each vector by the importance value, scaling the vector up in equal proportion to obtain the corrected sample datum.

For example, if the sample set's original or default importance value is 1 and it is corrected to 2 in this step, a sample datum that was a(1, 1, 1, 2, …, n) becomes a(2, 2, 2, 4, …, 2n), which is the corrected sample datum.

As those skilled in the art will appreciate, however, the importance value is not limited to a coefficient set manually or by a machine algorithm. In other embodiments there are many other ways to correct a sample: for example, applying a mathematical operation to the sample datum a(1, 1, 2, …, n), such as a1 = f(a), where the function f may be a proportional multiplication, something like an exponential operation, or various other mathematical functions.
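The multiplicative correction in S105 can be sketched directly; the feature values come from the a(1, 1, 1, 2, …) example above, and the function name is this editor's choice.

```python
def correct_sample(sample, importance):
    """Scale every feature dimension of the sample vector by the sample
    set's importance value (equal-proportion enlargement)."""
    return [importance * x for x in sample]

a = [1, 1, 1, 2]
corrected = correct_sample(a, 2)   # importance value corrected from 1 to 2
```

With an importance value of 1 the sample is unchanged, matching the default case described above.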
S106: input each of the corrected sample data into the machine learning system to train the machine learning system.

In this step, the corrected sample data can be input into the machine learning system to train it. During training, the loss function is first differentiated to compute the gradient; then, combining the initial weights and the configured step size, weight values approaching the optimal solution are computed iteratively according to the formula "new weight = old weight + step size × gradient".
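The iterative update can be sketched as follows. Note that the description writes "new weight = old weight + step size × gradient"; for minimising a loss this update is conventionally applied along the negative gradient, which is what this sketch does. The squared loss and the single data point are this editor's illustration, not taken from the application.

```python
def sgd_step(weights, gradient, step):
    """One iteration: move the weights a step along the negative gradient."""
    return [w - step * g for w, g in zip(weights, gradient)]

# Gradient of the squared loss (w.x - y)^2 / 2 with respect to w is (w.x - y) * x.
x, y = [1.0, 2.0], 1.0
w = [0.0, 0.0]
prediction = sum(wi * xi for wi, xi in zip(w, x))
gradient = [(prediction - y) * xi for xi in x]
w = sgd_step(w, gradient, step=0.1)
```

One step moves the prediction closer to the target, which is the sense in which the iteration approaches the optimal solution.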
In summary, the first embodiment of this application discloses a training method for a machine learning system that processes the sample data before inputting them into the machine learning system, reducing the amount of data while preserving the adoption and utilisation of important data, easing the machines' memory requirements while minimising the impact on the learning effect of the machine learning system.
Second embodiment

The second embodiment of this application proposes a training method for a machine learning system. FIG. 2 is a flowchart of the training method of this embodiment, which comprises the following steps.

S201: obtain a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period.

S202: set, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set.

S203: obtain a plurality of sample sets sampled according to the sampling rates.

The above three steps are the same as or similar to steps S101, S102 and S103 of the first embodiment and are not repeated here.
S204: separately determine importance values for the plurality of sampled sample sets.

Step S204 may comprise, for example:

Sub-step S204a: correct the initial importance value of each sampled sample set based on its corresponding sampling rate to obtain the sampled sample set's importance value;

the importance value being proportional to the initial importance value and inversely proportional to the sampled sample set's sampling rate.

In sub-step S204a, the new importance value can be computed, for example, as the ratio of the sample set's original importance value to its sampling rate. For instance, each sample set's importance value can be set for the first time according to the formula:

Y1=Y/a;

where Y1 is the set importance value for the sample set;

Y is the original importance value for the sample set; and

a is the sampling rate of the sample set.

For example, in the example given in the first embodiment, suppose the sampling rate for the January 24 to January 26 period is 0.1 and that set's initial importance value is set to 0.2; the sampling rate for the January 27 to January 28 period is 0.5 and that set's initial importance value is set to 1; and the sampling rate for the January 29 period is 1 and that set's initial importance value is set to 5. Then by Y1=Y/a, the importance values of the three sets, ordered from the oldest to the newest sampling-time period, are 2, 2 and 5 respectively.
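The first-time setting of sub-step S204a and its example can be reproduced directly; the values below are the ones from the example, listed with the oldest period first.

```python
def first_importance(initial_value, sampling_rate):
    """Y1 = Y / a: proportional to the initial importance value and
    inversely proportional to the sampling rate."""
    return initial_value / sampling_rate

initial_values = [0.2, 1.0, 5.0]    # oldest to newest sample set
sampling_rates = [0.1, 0.5, 1.0]
importance_values = [first_importance(y, a)
                     for y, a in zip(initial_values, sampling_rates)]
```

Dividing by the sampling rate compensates the heavily down-sampled (older) sets, so the surviving samples carry the weight of those discarded.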
Step S204 may further comprise, for example:

Sub-step S204b: raise, according to a preset rule, the importance value of the sample set corresponding to the newest sampling-time period.

In sub-step S204b, this preset rule may include, for example: the raised importance value of the sample set corresponding to the newest sampling-time period is proportional to that sample set's importance value before the raise, and proportional to the total number of sample sets.

In this sub-step, the importance value of the sample set corresponding to the newest sampling-time period can be set again, for example, according to the formula:

Z1=Z*b;

where Z1 is the re-set importance value for the sample set;

Z is the first-set importance value for the sample set; and

b is the total number of sample sets.

For example, the importance values of the three sample sets obtained in sub-step S204a, ordered from the oldest to the newest sampling-time period, are 2, 2 and 5. In this sub-step, the importance value of the sampled sample set with the newest sampling-time period, i.e. the third set, can be raised again: since its first-set importance value obtained in sub-step S204a is 5, the re-set importance value obtained through Z1=Z*b is 5 × 3 = 15.
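Sub-step S204b and its example reduce to a one-line boost, shown here with b = 3 sample sets; the function name is this editor's choice.

```python
def boost_newest(first_set_value, num_sample_sets):
    """Z1 = Z * b: raise the newest set's importance value in proportion
    to the total number of sample sets."""
    return first_set_value * num_sample_sets

boosted = boost_newest(5, 3)   # the example's newest set: 5 * 3 = 15
```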
Note that sub-step S204b may be executed before or after sub-step S204a, or on its own; that is, sub-step S204b is independent of, and does not depend on, sub-step S204a.
S205: correct, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data.

This step may comprise, for example, the following sub-step:

S205a: multiply each importance value by each sample datum in the corresponding sampled sample set to obtain the corrected sample data.
S206: input each of the corrected sample data into the machine learning system to train the machine learning system.

This step may be the same as or similar to step S106 of the first embodiment and is not repeated here.

In summary, the second embodiment of this application discloses a training method for a machine learning system that processes the sample data before inputting them into the machine learning system. By setting importance values for the different sample sets, it reduces the amount of data while preserving the adoption and utilisation of important data, easing the machines' memory requirements while minimising the impact on the learning effect of the machine learning system.
Third embodiment

The third embodiment of this application proposes a training method for a machine learning system. FIG. 3 is a flowchart of the training method of this embodiment, which comprises the following steps.

S301: obtain a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period.

S302: set, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set.

S303: obtain a plurality of sample sets sampled according to the sampling rates.

S304: separately determine importance values for the plurality of sampled sample sets.

S305: correct, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data.

Steps S301 to S305 may be the same as or similar to steps S101 to S105 of the first embodiment, or to steps S201 to S205 of the second embodiment, and are not repeated here.
This embodiment may further comprise the following step:

S306: input each of the corrected sample data into the machine learning system to train the machine learning system.

In this step, the corrected sample data can be input into the machine learning system to train it. During training, the loss function is first differentiated to compute the gradient; then, combining the initial weights and the configured step size, weight values approaching the optimal solution are computed iteratively according to the formula "new weight = old weight + step size × gradient".

This step may comprise the following sub-steps:

S306a: compute the gradient of each corrected sample datum;

S306b: reduce the precision of each sample datum's gradient;

S306c: input the reduced-precision gradients into the machine learning system to train the machine model.

In step S306a, the gradient of each corrected sample datum can be computed first; the gradient is the derivative of the loss function and is obtained by differentiating it.
Regarding step S306b: training a machine learning system generally uses gradient descent, and every machine must compute gradients. If storing one gradient takes 8 bytes, 10 billion gradients take 10,000,000,000 × 8 / 1024 / 1024 / 1024 ≈ 74.5 GB of storage; compressing each gradient to 4 bytes brings 10 billion gradients down to about 37.25 GB of memory.
In step S306b, the following formula can be used to reduce the storage bytes of each sample datum's gradient and thereby reduce its precision:

X1=floor(c*X+(rand())/d)/c

where floor denotes rounding down; rand() generates a random number between 0 and d, so rand()/d is a floating-point number in the range 0-1; X1 is the low-precision floating-point number, e.g. a float taking 4 bytes of computer storage, representing each sample datum's gradient after its storage bytes are reduced; and X is the high-precision floating-point number, e.g. a double taking 8 bytes of computer storage, each sample datum's gradient before the reduction.

In addition, the rand function introduces a random factor to keep the accumulated floating-point error as small as possible. In the expression c*X+(rand())/d, X is multiplied by a fixed number and a floating-point number in the range 0-1 is added, precisely in order to introduce that randomness. The value of c is empirical and may be, for example, 536870912 (2^29); d may be, for example, 2^31 - 1 = 2147483647, the upper bound of what the rand function can produce.

Through this formula, a high-precision floating-point number can be turned into a low-precision one while keeping the accumulated error as small as possible.
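A sketch of the precision-reduction formula, with Python's `random.randint` standing in for C's rand() (an assumption of this editor; the application does not name an implementation language). Quantising to multiples of 1/c with a uniform random offset is a form of stochastic rounding, which keeps the rounding error small and unbiased on average.

```python
import math
import random

C = 536870912        # 2**29, the empirical constant c from the description
D = 2147483647       # 2**31 - 1, the upper bound d of rand()

def reduce_precision(x, rng):
    """X1 = floor(c*X + rand()/d) / c: quantise x to a multiple of 1/c,
    adding a random offset in [0, 1) so errors do not accumulate."""
    r = rng.randint(0, D)    # stands in for C's rand(): an integer in [0, d]
    return math.floor(C * x + r / D) / C

rng = random.Random(0)
x = 0.123456789123456789     # a high-precision (double) gradient value
x1 = reduce_precision(x, rng)
# The quantisation error is bounded by 1/c, roughly 1.86e-9.
```

The result is always an exact multiple of 1/c, which is what allows it to be stored in fewer bytes.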
In summary, the third embodiment of this application discloses a training method for a machine learning system that processes the sample data before inputting them into the machine learning system. By setting importance values for the different sample sets, and by the processing applied when reducing gradient precision, it reduces the amount of data while preserving the adoption and utilisation of important data, easing the machines' memory requirements while minimising the impact on the learning effect of the machine learning system.
Fourth embodiment

The fourth embodiment of this application proposes a training system for a machine learning system. FIG. 4 is a block diagram of the training system of this embodiment, which trains the machine learning system with a plurality of sample data. The training system 400 comprises:

a first acquisition module 401 configured to obtain a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period;

a sampling-rate setting module 402 configured to set, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set;

a second acquisition module 403 configured to obtain a plurality of sample sets sampled according to the sampling rates;

an importance-value determination module 404 configured to separately set importance values for the plurality of sampled sample sets;

a sample-data correction module 405 configured to correct, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data; and

a training module 406 configured to input each of the corrected sample data into the machine learning system to train the machine learning system.

Preferably, in this embodiment, the sampling rate of a sample set increases as its corresponding sampling-time period runs from old to new.

In summary, the fourth embodiment of this application discloses a training system for a machine learning system that processes the sample data before inputting them into the machine learning system, reducing the amount of data while preserving the adoption and utilisation of important data, easing the machines' memory requirements while minimising the impact on the learning effect of the machine learning system.
Fifth embodiment

The fifth embodiment of this application proposes a training system for a machine learning system. FIG. 5 is a block diagram of the training system of this embodiment, which trains the machine learning system with a plurality of sample data. The training system 500 comprises:

a first acquisition module 501 configured to obtain a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period;

a sampling-rate setting module 502 configured to set, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set;

a second acquisition module 503 configured to obtain a plurality of sample sets sampled according to the sampling rates;

an importance-value determination module 504 configured to separately set importance values for the plurality of sampled sample sets;

a sample-data correction module 505 configured to correct, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data; and

a training module 506 configured to input each of the corrected sample data into the machine learning system to train the machine learning system.

In this embodiment, the sample-data correction module 505 is configured to multiply each importance value by each sample datum in the corresponding sampled sample set to obtain the corrected sample data.

In this embodiment, the importance-value determination module 504 comprises:

a first correction sub-module 504a configured to correct the initial importance value of each sampled sample set based on its corresponding sampling rate to obtain the sampled sample set's importance value;

the importance value being proportional to the initial importance value and inversely proportional to the sampled sample set's sampling rate.

For example, the first correction sub-module can set each sample set's importance value for the first time according to the formula:

Y1=Y/a;

where Y1 is the set importance value for the sample set;

Y is the original importance value for the sample set; and

a is the sampling rate of the sample set.

In this embodiment, the importance-value determination module 504 may further comprise:

a second correction sub-module 504b configured to raise, according to a preset rule, the importance value of the sample set corresponding to the newest sampling-time period.

Preferably, the preset rule includes: the raised importance value of the sample set corresponding to the newest sampling-time period is proportional to that sample set's importance value before the raise, and proportional to the total number of sample sets.

For example, the importance value of the sample set corresponding to the newest sampling-time period can be set again according to the formula:

Z1=Z*b;

where Z1 is the re-set importance value for the sample set;

Z is the first-set importance value for the sample set; and

b is the total number of sample sets.

In this embodiment, the sampling rate of a sample set increases as its corresponding sampling-time period runs from old to new.

In summary, the fifth embodiment of this application discloses a training system for a machine learning system that processes the sample data before inputting them into the machine learning system. By setting importance values for the different sample sets, it reduces the amount of data while preserving the adoption and utilisation of important data, easing the machines' memory requirements while minimising the impact on the learning effect of the machine learning system.
Sixth embodiment

The sixth embodiment of this application proposes a training system for a machine learning system. FIG. 6 is a block diagram of the training system of this embodiment, which trains the machine learning system with a plurality of sample data. The training system 600 comprises:

a first acquisition module 601 configured to obtain a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period;

a sampling-rate setting module 602 configured to set, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set;

a second acquisition module 603 configured to obtain a plurality of sample sets sampled according to the sampling rates;

an importance-value determination module 604 configured to separately set importance values for the plurality of sampled sample sets;

a sample-data correction module 605 configured to correct, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data; and

a training module 606 configured to input each of the corrected sample data into the machine learning system to train the machine learning system.

In this embodiment, the training module 606 comprises:

a computation sub-module 606a configured to compute the gradient of each corrected sample datum;

a precision-reduction sub-module 606b configured to reduce the precision of each gradient; and

a training sub-module 606c configured to input the reduced-precision gradients into the machine learning system to train the machine model.

In this embodiment, the precision-reduction sub-module 606b is configured to reduce the storage bytes of each gradient, and thereby its precision, using the formula:

X1=floor(c*X+(rand())/d)/c

where floor denotes rounding down; rand() generates a random number between 0 and d; X1 is the gradient value after its storage bytes are reduced; and X is the gradient value before its storage bytes are reduced.

In summary, the sixth embodiment of this application discloses a training system for a machine learning system that processes the sample data before inputting them into the machine learning system. By setting importance values for the different sample sets, and by the processing applied when reducing gradient precision, it reduces the amount of data while preserving the adoption and utilisation of important data, easing the machines' memory requirements while minimising the impact on the learning effect of the machine learning system.
As the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, see the description of the method embodiments.

The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for the same or similar parts the embodiments can be referred to mutually.

Those skilled in the art should understand that the embodiments of this application may be provided as a method, an apparatus, or a computer program product. Accordingly, the embodiments of this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM and optical storage) containing computer-usable program code.

In a typical configuration, the computer device comprises one or more processors (CPUs), input/output interfaces, network interfaces and memory. The memory may include non-persistent memory, random-access memory (RAM) and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash RAM; memory is an example of a computer-readable medium. Computer-readable media include persistent and non-persistent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact-disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.

The embodiments of this application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system) and computer program product according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data-processing terminal device to produce a machine, so that the instructions executed by that processor produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data-processing terminal device to work in a particular manner, so that the instructions stored in that memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data-processing terminal device, so that a series of operational steps are executed on it to produce computer-implemented processing, the instructions executed on it thereby providing steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

Although preferred embodiments of this application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications falling within the scope of the embodiments of this application.

Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between them. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or terminal device. Absent further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or terminal device that comprises it.

The training method and training system for a machine learning system provided by this application have been introduced in detail above, and specific examples have been used herein to explain the principles and implementations of this application. The descriptions of the above embodiments are only meant to help understand the method of this application and its core idea; meanwhile, for a person of ordinary skill in the art, there will be changes in the specific implementation and scope of application in accordance with the idea of this application. In summary, the content of this specification should not be construed as limiting this application.

Claims (16)

  1. A training method for a machine learning system, which trains the machine learning system with a plurality of sample data, wherein the training method comprises:
    obtaining a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period;
    setting, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set;
    obtaining a plurality of sample sets sampled according to the sampling rates;
    separately determining importance values for the plurality of sampled sample sets;
    correcting, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data; and
    inputting each of the corrected sample data into the machine learning system to train the machine learning system.
  2. The training method for a machine learning system according to claim 1, wherein the step of correcting, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data comprises:
    multiplying each importance value by each sample datum in the corresponding sampled sample set to obtain the corrected sample data.
  3. The training method for a machine learning system according to claim 1, wherein the step of inputting each of the corrected sample data into the machine learning system to train the machine learning system comprises:
    computing the gradient of each corrected sample datum;
    reducing the precision of each gradient; and
    inputting the reduced-precision gradients into the machine learning system to train the machine model.
  4. The training method for a machine learning system according to claim 3, wherein the step of reducing the precision of each gradient comprises:
    reducing the storage bytes of each gradient, and thereby its precision, using the formula:
    X1=floor(c*X+(rand())/d)/c
    where floor denotes rounding down; rand() generates a number between 0 and d; X1 is the gradient value after its storage bytes are reduced; and X is the gradient value before its storage bytes are reduced.
  5. The training method for a machine learning system according to claim 1, wherein the step of separately determining importance values for the plurality of sampled sample sets comprises:
    correcting the initial importance value of each sampled sample set based on the corresponding sampling rate to obtain the sampled sample set's importance value;
    the importance value being proportional to the initial importance value and inversely proportional to the sampled sample set's sampling rate.
  6. The training method for a machine learning system according to claim 5, wherein the step of separately setting importance values for the plurality of sampled sample sets further comprises:
    raising, according to a preset rule, the importance value of the sample set corresponding to the newest sampling-time period.
  7. The training method for a machine learning system according to claim 6, wherein the preset rule comprises:
    the raised importance value of the sample set corresponding to the newest sampling-time period being proportional to that sample set's importance value before the raise and proportional to the total number of sample sets.
  8. The training method for a machine learning system according to claim 1, wherein, in the step of setting, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set, the sampling rate of a sample set increases as its corresponding sampling-time period runs from old to new.
  9. A training system for a machine learning system, which trains the machine learning system with a plurality of sample data, wherein the training system comprises:
    a first acquisition module configured to obtain a plurality of sample sets, each sample set comprising the sample data within a corresponding sampling-time period;
    a sampling-rate setting module configured to set, according to the sampling-time period corresponding to each sample set, the sampling rate of that sample set;
    a second acquisition module configured to obtain a plurality of sample sets sampled according to the sampling rates;
    an importance-value determination module configured to separately set importance values for the plurality of sampled sample sets;
    a sample-data correction module configured to correct, with the importance values, each sample datum in the plurality of sampled sample sets to obtain corrected sample data; and
    a training module configured to input each of the corrected sample data into the machine learning system to train the machine learning system.
  10. The training system for a machine learning system according to claim 9, wherein the sample-data correction module is configured to:
    multiply each importance value by each sample datum in the corresponding sampled sample set to obtain the corrected sample data.
  11. The training system for a machine learning system according to claim 9, wherein the training module comprises:
    a computation sub-module configured to compute the gradient of each corrected sample datum;
    a precision-reduction sub-module configured to reduce the precision of each gradient; and
    a training sub-module configured to input the reduced-precision gradients into the machine learning system to train the machine model.
  12. The training system for a machine learning system according to claim 11, wherein the precision-reduction sub-module is configured to:
    reduce the storage bytes of each gradient, and thereby its precision, using the formula:
    X1=floor(c*X+(rand())/d)/c
    where floor denotes rounding down; rand() generates a number between 0 and d; X1 is the gradient value after its storage bytes are reduced; and X is the gradient value before its storage bytes are reduced.
  13. The training system for a machine learning system according to claim 9, wherein the importance-value determination module comprises:
    a first correction sub-module configured to correct the initial importance value of each sampled sample set based on the corresponding sampling rate to obtain the sampled sample set's importance value;
    the importance value being proportional to the initial importance value and inversely proportional to the sampled sample set's sampling rate.
  14. The training system for a machine learning system according to claim 13, wherein the importance-value determination module further comprises:
    a second correction sub-module configured to raise, according to a preset rule, the importance value of the sample set corresponding to the newest sampling-time period.
  15. The training system for a machine learning system according to claim 14, wherein the preset rule comprises:
    the raised importance value of the sample set corresponding to the newest sampling-time period being proportional to that sample set's importance value before the raise and proportional to the total number of sample sets.
  16. The training system for a machine learning system according to claim 9, wherein the sampling-rate setting module is configured to set the sampling rate of a sample set to increase as the sample set's corresponding sampling-time period runs from old to new.
PCT/CN2017/073719 2016-02-29 2017-02-16 Method and system for training a machine learning system WO2017148266A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2018544075A JP6991983B2 (ja) 2016-02-29 2017-02-16 Method and system for training a machine learning system
US16/114,078 US11720787B2 (en) 2016-02-29 2018-08-27 Method and system for training machine learning system
US18/342,204 US20230342607A1 (en) 2016-02-29 2023-06-27 Method and system for training machine learning system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610113716.1A 2016-02-29 2016-02-29 Method and system for training a machine learning system
CN201610113716.1 2016-02-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/114,078 Continuation US11720787B2 (en) 2016-02-29 2018-08-27 Method and system for training machine learning system

Publications (1)

Publication Number Publication Date
WO2017148266A1 true WO2017148266A1 (zh) 2017-09-08

Family

ID=59720591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/073719 WO2017148266A1 (zh) 2016-02-29 2017-02-16 一种机器学习系统的训练方法和训练系统

Country Status (5)

Country Link
US (2) US11720787B2 (zh)
JP (1) JP6991983B2 (zh)
CN (1) CN107133190A (zh)
TW (1) TWI796286B (zh)
WO (1) WO2017148266A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210158078A1 (en) * 2018-09-03 2021-05-27 Ping An Technology (Shenzhen) Co., Ltd. Unbalanced sample data preprocessing method and device, and computer device
US11379760B2 (en) 2019-02-14 2022-07-05 Yang Chang Similarity based learning machine and methods of similarity based machine learning
US11720787B2 (en) 2016-02-29 2023-08-08 Alibaba Group Holding Limited Method and system for training machine learning system

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019086928A (ja) * 2017-11-06 2019-06-06 Fanuc Corporation Control device and machine learning device
TWI651664B (zh) * 2017-11-15 2019-02-21 Institute For Information Industry Model generation server and model generation method thereof
CN116032781A (zh) * 2019-02-27 2023-04-28 Huawei Technologies Co., Ltd. Artificial-intelligence-enhanced data sampling
CN111985651A (zh) * 2019-05-22 2020-11-24 China Mobile Group Fujian Co., Ltd. Service system operation and maintenance method and apparatus
CN113010500A (zh) * 2019-12-18 2021-06-22 China Telecom Corporation Limited Processing method and processing system for DPI data
CN114092632A 2020-08-06 2022-02-25 Industrial Technology Research Institute Labeling method, and apparatus, system, method and computer program product applying the same

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968851A (zh) * 2010-09-09 2011-02-09 Xidian University Medical image processing method based on dictionary-learning upsampling
CN102156907A (zh) * 2010-02-11 2011-08-17 Institute of Computing Technology, Chinese Academy of Sciences Quality-inspection method for QA systems
US20120209612A1 (en) * 2011-02-10 2012-08-16 Intonow Extraction and Matching of Characteristic Fingerprints from Audio Signals
CN103136361A (zh) * 2013-03-07 2013-06-05 Chen Yifei Semi-supervised extraction method for protein interrelations in biological text
CN104166668A (zh) * 2014-06-09 2014-11-26 Nanjing University of Posts and Telecommunications News recommendation system and method based on the FOLFM model

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6853920B2 (en) * 2000-03-10 2005-02-08 Smiths Detection-Pasadena, Inc. Control for an industrial process using one or more multidimensional variables
JP5187635B2 (ja) * 2006-12-11 2013-04-24 NEC Corporation Active learning system, active learning method, and active learning program
US8315817B2 (en) * 2007-01-26 2012-11-20 Illumina, Inc. Independently removable nucleic acid sequencing system and method
JP4985293B2 (ja) * 2007-10-04 2012-07-25 Sony Corporation Information processing apparatus and method, program, and recording medium
US8706742B1 (en) * 2009-04-22 2014-04-22 Equivio Ltd. System for enhancing expert-based computerized analysis of a set of digital documents and methods useful in conjunction therewith
US8873813B2 (en) * 2012-09-17 2014-10-28 Z Advanced Computing, Inc. Application of Z-webs and Z-factors to analytics, search engine, learning, recognition, natural language, and other utilities
JP2014016895A (ja) * 2012-07-10 2014-01-30 Canon Inc Information extraction apparatus, information extraction method, and program
JP5942651B2 (ja) * 2012-07-10 2016-06-29 Oki Electric Industry Co., Ltd. Input device
US9521158B2 (en) * 2014-01-06 2016-12-13 Cisco Technology, Inc. Feature aggregation in a computer network
US10311375B2 (en) * 2014-10-16 2019-06-04 Nanyang Technological University Systems and methods for classifying electrical signals
DE102016101665A1 (de) * 2015-01-29 2016-08-04 Affectomatics Ltd. Privacy-consideration-based filtering of affective-response measurements
CN107133190A (zh) 2016-02-29 2017-09-05 Alibaba Group Holding Limited Method and system for training a machine learning system


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11720787B2 (en) 2016-02-29 2023-08-08 Alibaba Group Holding Limited Method and system for training machine learning system
US20210158078A1 (en) * 2018-09-03 2021-05-27 Ping An Technology (Shenzhen) Co., Ltd. Unbalanced sample data preprocessing method and device, and computer device
US11941087B2 (en) * 2018-09-03 2024-03-26 Ping An Technology (Shenzhen) Co., Ltd. Unbalanced sample data preprocessing method and device, and computer device
US11379760B2 (en) 2019-02-14 2022-07-05 Yang Chang Similarity based learning machine and methods of similarity based machine learning

Also Published As

Publication number Publication date
JP2019512126A (ja) 2019-05-09
TW201737115A (zh) 2017-10-16
US20180365523A1 (en) 2018-12-20
JP6991983B2 (ja) 2022-01-14
TWI796286B (zh) 2023-03-21
US20230342607A1 (en) 2023-10-26
US11720787B2 (en) 2023-08-08
CN107133190A (zh) 2017-09-05

Similar Documents

Publication Publication Date Title
WO2017148266A1 (zh) Method and system for training a machine learning system
US11023804B2 (en) Generating an output for a neural network output layer
GB2561669A (en) Implementing neural networks in fixed point arithmetic computing systems
CN104298680A Data statistics method and data statistics apparatus
WO2018059302A1 Text recognition method, apparatus and storage medium
WO2021169386A1 Graph data processing method, apparatus, device and medium
CN109669995A Data storage and quality computation methods and apparatuses, storage medium and server
CN111415180B Resource value adjustment method, apparatus, server and storage medium
CN111798263A Transaction trend prediction method and apparatus
CN110019783B Attribute word clustering method and apparatus
CN107977923B Image processing method and apparatus, electronic device and computer-readable storage medium
CN107368281B Data processing method and apparatus
CN115935723A Equipment combination analysis method and system for gallium nitride preparation scenarios
CN108154377B Advertising fraud prediction method and apparatus
US20220050614A1 (en) System and method for approximating replication completion time
CN110019068B Log text processing method and apparatus
CN114579419A Data processing method and apparatus, and storage medium
CN111815510A Image processing method based on an improved convolutional neural network model, and related device
CN113722573A Method, system and storage medium for generating a cybersecurity threat data set
CN111026879A Multi-dimensional value-oriented, intent-targeted object-oriented numerical computation method
US10191941B1 (en) Iterative skewness calculation for streamed data using components
JP7446359B2 Traffic data prediction method, traffic data prediction apparatus, electronic device, storage medium, computer program product and computer program
CN111753200B Data determination method, apparatus, device and medium
US10313249B1 (en) Incremental autocorrelation calculation for big data using components
US10339136B1 (en) Incremental skewness calculation for big data or streamed data using components

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018544075

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17759117

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17759117

Country of ref document: EP

Kind code of ref document: A1