CN114372579A - Method of training machine learning model, prediction method, computing device, and medium - Google Patents

Method of training machine learning model, prediction method, computing device, and medium Download PDF

Info

Publication number
CN114372579A
CN114372579A (application CN202111652923.1A)
Authority
CN
China
Prior art keywords
data set
primary key
time
feature
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111652923.1A
Other languages
Chinese (zh)
Inventor
张卿
袁云滔
王姜
潘雄飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shengdoushi Shanghai Science and Technology Development Co Ltd
Original Assignee
Shengdoushi Shanghai Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengdoushi Shanghai Technology Development Co Ltd filed Critical Shengdoushi Shanghai Technology Development Co Ltd
Priority to CN202111652923.1A priority Critical patent/CN114372579A/en
Publication of CN114372579A publication Critical patent/CN114372579A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method for training a machine learning model for a target object, a method of predicting sales of a target object, a computing device, and a computer-readable storage medium. The method comprises the following steps: acquiring time series data sets of a plurality of target objects in a plurality of unit times, wherein the time series data sets comprise a plurality of feature values of each target object in each unit time; segmenting the time series data set along a primary key direction to produce a plurality of primary key segments, wherein each primary key segment comprises time series data of at least two target objects of the plurality of target objects; segmenting each primary key segment in a time direction to produce a plurality of feature data set files, wherein each feature data set file comprises time series data of the at least two target objects over a first time period, and the first time period comprises a plurality of unit times; and obtaining a training data set for training the machine learning model according to the feature data set files in a second time period, wherein the second time period comprises a plurality of first time periods.

Description

Method of training machine learning model, prediction method, computing device, and medium
Technical Field
The present disclosure relates generally to the field of machine learning, and more particularly, to a method for training a machine learning model for a target object, a method of predicting sales of a target object, a computing device, and a computer-readable storage medium.
Background
When training a machine learning model, it is necessary to reduce the data processing time as much as possible and to efficiently read data for training. It is common practice to process raw data into training data, which is stored in a set of standard format files that can be read sequentially.
Thus, when performing model training with a deep learning algorithm over time series data, such as the LSTM (Long Short-Term Memory) algorithm, it is necessary to first process and flatten the multiple pieces of time series raw data of the same primary key (target object), combine them as different features into one piece of training data, and merge and store the resulting pieces of training data in a format file such as TFRecord.
However, when the amount of data is very large, the flattening operation that turns time series raw data into training data is very time consuming. In particular, time series raw data keeps growing over time, so before each training run the time-consuming flattening operation must be performed from scratch on the whole data set, including the newly added data. This takes very long and seriously affects the efficiency of model training.
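By way of illustration of the conventional practice just described, the following sketch flattens per-unit-time rows into one wide training row per primary key. It is a minimal sketch only: the pandas library usage, the column names, and the toy data are assumptions for illustration, not part of this disclosure.

```python
import pandas as pd

# Toy raw time series table, one row per (primary key, unit time), mirroring
# Table 1 below. Column names are illustrative assumptions.
raw = pd.DataFrame({
    "primary_key": ["Id_1", "Id_2", "Id_1", "Id_2", "Id_1", "Id_2"],
    "unit_time":   ["D1",   "D1",   "D2",   "D2",   "D3",   "D3"],
    "f0":          [0.1,    0.4,    0.2,    0.5,    0.3,    0.6],
    "f1":          [1.0,    4.0,    2.0,    5.0,    3.0,    6.0],
})

def flatten_full(df: pd.DataFrame) -> pd.DataFrame:
    """Conventional flattening: merge every unit time of one primary key into
    a single wide training row. This step must re-run over the FULL data set
    whenever new unit times arrive, which is the costly operation criticized
    above."""
    wide = df.pivot(index="primary_key", columns="unit_time")
    # Collapse the (feature, unit_time) column MultiIndex into flat names.
    wide.columns = [f"{feature}_{day}" for feature, day in wide.columns]
    return wide.reset_index()

training_rows = flatten_full(raw)  # one row per primary key, D1..D3 side by side
```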
Disclosure of Invention
In view of at least one of the above problems, the present disclosure provides a scheme for performing segmentation processing on time-series data of a plurality of target objects in both a primary key and a time direction.
According to one aspect of the present disclosure, a method for training a machine learning model for a target object is provided. The method comprises the following steps: acquiring time sequence data sets of a plurality of target objects in a plurality of unit times, wherein the time sequence data sets comprise a plurality of characteristic values of each target object in each unit time; segmenting the time-series data set in a primary key direction to produce a plurality of primary key segments, wherein each primary key segment comprises time-series data of at least two target objects of the plurality of target objects; segmenting each primary key segment in a temporal direction to produce a plurality of feature dataset files, wherein each feature dataset file comprises time series data of the at least two target objects over a first time period, and the first time period comprises a plurality of unit times; and obtaining a training data set for training the machine learning model according to the characteristic data set file in a second time period, wherein the second time period comprises a plurality of first time periods.
In some embodiments, the method further comprises: for each primary key segment, acquiring incremental time series data of the at least two target objects included in the primary key segment in an incremental first time period; generating an incremental feature data set file for the incremental time series data; and replacing a feature data set file in the second time period with the incremental feature data set file to serve as a next training data set for iteratively training the machine learning model.
In some embodiments, the method further comprises: storing first metadata information of the plurality of feature data set files, the first metadata information indicating a correspondence between each primary key segment and at least two target objects contained by the primary key segment; storing second metadata information of the primary key segment under each primary key segment, wherein the second metadata information indicates a corresponding relationship between a first time period under the primary key segment and a feature data set file; and storing a feature data set file under the primary key segment under each primary key segment.
In some embodiments, in the first metadata information, the plurality of primary key segments are randomly arranged.
In some embodiments, in the first metadata information, the plurality of primary key segments are arranged based on an order of the primary key segments.
In some embodiments, the method further comprises: storing each feature data set file with the feature values as rows and the primary keys as columns, and obtaining a training data set for training the machine learning model according to the feature data set files in the second time period comprises: in a case in which the feature data set files of the plurality of first time periods included in the second time period are read into memory, directly merging the feature values of each feature data set file along the column direction, and performing a row-column conversion after the merging is completed to generate a training data set of the machine learning model.
In some embodiments, the method further comprises: storing each feature data set file with the primary keys as rows and the feature values as columns, and obtaining a training data set for training the machine learning model according to the feature data set files in the second time period comprises: in a case in which the feature data set files of the plurality of first time periods included in the second time period are read into memory, performing a row-column conversion on the plurality of feature values of each time series data of each feature data set file, and expanding the converted feature data set files along the row direction to generate a training data set of the machine learning model.
In some embodiments, the target object comprises a chain store, and the feature value comprises operation data of the chain store.
According to another aspect of the present disclosure, there is provided a method for predicting sales of a target object, comprising: acquiring characteristic data of the target object in a plurality of unit times as input data; and inputting the input data into the machine learning model trained by the method to generate a prediction result about the sales of the target object according to the input data.
According to another aspect of the present disclosure, a computing device is provided. The computing device includes: at least one processor; and at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, the instructions when executed by the at least one processor causing the computing device to perform steps according to the above-described method.
According to yet another aspect of the present disclosure, a computer-readable storage medium is provided, having stored thereon computer program code, which, when executed, performs the method as described above.
Drawings
The present disclosure will be better understood and other objects, details, features and advantages thereof will become more apparent from the following description of specific embodiments of the disclosure given with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of a system for implementing a method for training a machine learning model for a target object according to an embodiment of the present disclosure.
Fig. 2 illustrates a flow diagram of a method for training a machine learning model for a target object, in accordance with some embodiments of the present disclosure.
FIG. 3 illustrates a schematic diagram of a storage structure for a feature data set file according to some embodiments of the present disclosure.
Fig. 4 illustrates a flow diagram of a method for training a machine learning model for a target object, according to some embodiments of the present disclosure.
FIG. 5 illustrates a schematic diagram of a storage structure containing an incremental feature dataset file according to some embodiments of the present disclosure.
FIG. 6 illustrates a block diagram of a computing device suitable for implementing embodiments of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the following description, for the purposes of illustrating various inventive embodiments, certain specific details are set forth in order to provide a thorough understanding of the various inventive embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details. In other instances, well-known devices, structures and techniques associated with this application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
Throughout the specification and claims, the word "comprise" and variations thereof, such as "comprises" and "comprising," are to be understood as an open, inclusive meaning, i.e., as being interpreted to mean "including, but not limited to," unless the context requires otherwise.
Reference throughout this specification to "one embodiment" or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment. Thus, the appearances of the phrases "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms first, second, third, fourth, etc. used in the description and in the claims, are used for distinguishing between various objects for clarity of description only and do not limit the size, other order, etc. of the objects described therein.
Fig. 1 shows a schematic diagram of a system 1 for implementing a method for training a machine learning model for a target object according to an embodiment of the present disclosure. As shown in fig. 1, system 1 includes one or more data sources 10, a computing device 20, and a network 30. Data source 10 and computing device 20 may exchange data via network 30. Here, the data source 10 may be, for example, a device for providing time series data for the machine learning model, such as a device that generates the time series data, or a server that collects and transmits the time series data. The time series data of a target object may be data including a feature list of the target object, obtained by preprocessing the target object's time series raw data, as shown in Table 1 below; the preprocessing process is not described in detail herein. Computing device 20 may process the time series data to convert it into training data suitable for use by the machine learning model. The computing device 20 may include at least one processor 22 and at least one memory 24 coupled to the at least one processor 22, the memory 24 having stored therein instructions 26 executable by the at least one processor 22, which, when executed by the at least one processor 22, perform at least part of the methods described below. Note that computing device 20 may be part of the device executing the machine learning model or may be independent of the device executing the machine learning model (not shown in the figures). The specific structure of computing device 20 is described below in connection with FIG. 6.
Table 1 shows one example of a time series data set according to an embodiment of the present disclosure.
TABLE 1
Primary key identification    Unit time    Feature list
Id_1                          D1           F1_1
Id_2                          D1           F2_1
Id_3                          D1           F3_1
Id_1                          D2           F1_2
Id_2                          D2           F2_2
Id_3                          D2           F3_2
Id_1                          D3           F1_3
Id_2                          D3           F2_3
Id_3                          D3           F3_3
Table 1 exemplarily shows the time series data (i.e., the feature lists F1_1, F2_1, F3_1, F1_2, F2_2, F3_2, F1_3, F2_3, F3_3) of three target objects (Id_1, Id_2, and Id_3) at three unit times (D1, D2, and D3), which together form a time series data set. Here, Fi_j denotes the feature list of the target object (primary key) identified as Id_i at unit time Dj.
Here, the target object is an object that provides training samples, also referred to herein as a primary key. In machine learning models that use time series data, the goal of model training is generally to predict a future target value of a target object based on its historical features. For example, the machine learning model may be trained using historical shopping data (e.g., browsing data, shopping data, purchasing data, etc.) of a plurality of users to predict future shopping data of the users. As another example, historical feature data (e.g., store size, geographic location, sales promotion conditions, weather conditions, etc.) of a plurality of chain stores may be used to train the machine learning model, and the trained machine learning model may then be used to predict future operation data, such as sales, of each chain store. The time series data of each target object per unit time (i.e., the feature list Fi_j) may include a plurality of feature values. In some cases, each feature list may contain hundreds or even thousands of feature values.
In conventional training data processing, the time series data of all unit times of the same target object within one training period needs to be integrated to generate one piece of training data. For example, where the training period is three months and the unit time is one day, all time series data of a target object over the three months must be flattened, merging the target object's feature lists for every day of the three months into one piece of training data. Where the feature list of a target object contains hundreds of feature values, one piece of training data covering a three-month training period may contain tens of thousands or even hundreds of thousands of feature values. Where there are many, especially a very large number of, target objects, the operation of generating training data is even more time consuming.
In particular, over time, as a target object generates incremental raw data, the full amount of time series data over a new training period still needs to be integrated to generate new training data. For example, after a piece of training data has been generated from the time series data of a target object in months N, N+1, and N+2 (N being a positive integer greater than or equal to 1), when the time series data of the target object in month N+3 is obtained, a new piece of training data needs to be generated based on the time series data of the target object in months N+1, N+2, and N+3. In this case, generating the new training data requires flattening the time series data of months N+1, N+2, and N+3 from scratch, which is time consuming and seriously affects the efficiency of model training.
In view of this problem, the scheme according to the present disclosure processes and stores the time series data of the target objects using a row-column hybrid encoding, so that when new incremental time series data is generated, new training data can be generated conveniently without flattening the full amount of time series data over the whole training time period.
Fig. 2 illustrates a flow diagram of a method 200 for training a machine learning model for a target object, in accordance with some embodiments of the present disclosure. Method 200 may be performed, for example, by computing device 20 in system 1 shown in fig. 1. The method 200 is described below in conjunction with figs. 1-6, taking execution in computing device 20 as an example.
As shown in fig. 2, method 200 includes block 210, where computing device 20 may obtain a time-series data set of a plurality of target objects at a plurality of units of time as shown in table 1. The time-series data set includes a plurality of feature values per unit time for each target object. In particular, the time series data set may include time series data of a large number of target objects in a large number of units of time, and the time series data of each target object in each unit of time (i.e., the feature list) may include a large number of feature values. The acquired time series data set is shown in table 1 above, for example. Here, the target object may be, for example, a store chain, and the feature value may include operation data of the store chain, such as a size of the store chain, a sales promotion situation, a weather situation, a geographical location, and the like.
At block 220, computing device 20 may segment the time-series data set shown in Table 1 along the primary key direction to produce a plurality of primary key segments. Each primary key segment includes time series data of at least two target objects of the plurality of target objects.
Specifically, computing device 20 may arrange the time series data of each target object in the time series data set shown in Table 1 into a feature-list sequence ordered by unit time. For example, for the target object with primary key identification Id_1, the sequence of its feature lists over 90 unit times (e.g., 90 days) may be denoted F1_1, F1_2, F1_3, …, F1_90.
Computing device 20 may then segment the time series data of at least two target objects into one primary key segment along the primary key direction. For example, as shown in Table 2 below, each primary key segment KS1, KS2, KS3 can include time series data for two target objects.
TABLE 2
KS1:  Id_1 → F1_1, F1_2 … F1_90;  Id_2 → F2_1, F2_2 … F2_90
KS2:  Id_3 → F3_1, F3_2 … F3_90;  Id_4 → F4_1, F4_2 … F4_90
KS3:  Id_5 → F5_1, F5_2 … F5_90;  Id_6 → F6_1, F6_2 … F6_90
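As a rough sketch of block 220 under the Table 1/Table 2 layout, the following groups the per-key, time-ordered feature-list sequences into primary key segments. The helper name, the segment size of two primary keys per segment, and the in-memory representation are illustrative assumptions:

```python
from collections import OrderedDict

def build_primary_key_segments(records, keys_per_segment=2):
    """records: iterable of (primary_key, unit_time, feature_list) tuples.
    Returns {segment name: {primary_key: [feature lists in unit-time order]}}."""
    per_key = OrderedDict()
    # Note: real unit-time keys should sort numerically (D2 before D10);
    # the toy data below stays within single digits, so a plain sort suffices.
    for pk, _t, features in sorted(records, key=lambda r: (r[0], r[1])):
        per_key.setdefault(pk, []).append(features)

    segments = {}
    pks = list(per_key)
    for i in range(0, len(pks), keys_per_segment):
        segments[f"KS{i // keys_per_segment + 1}"] = {
            pk: per_key[pk] for pk in pks[i:i + keys_per_segment]
        }
    return segments

# Toy data mirroring Table 2: six primary keys split into segments KS1..KS3.
records = [(f"Id_{i}", f"D{j}", [0.0, 0.0]) for i in range(1, 7) for j in range(1, 4)]
segments = build_primary_key_segments(records)
print(list(segments), list(segments["KS1"]))  # ['KS1', 'KS2', 'KS3'] ['Id_1', 'Id_2']
```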
Next, at block 230, the computing device 20 may segment each primary key segment in a temporal direction to generate a plurality of feature dataset files, wherein each feature dataset file includes time series data for at least two target objects in the primary key segment over a first time period T, and the first time period T includes a plurality of unit times. That is, the primary key segment containing the feature list sequence of each target object is segmented by the first time period T to generate a plurality of feature data set files of the primary key segment.
Here, the first period T may be taken based on an acquisition interval of time series data or a training interval of a machine learning model. For example, in the case where time series data of each target object is acquired once every month to update the training of the machine learning model, the first period T may be selected to be one month (e.g., 30 days). In this way, in the primary key segment of the feature list sequence containing 90 days (i.e., three consecutive first time periods T1, T2, and T3) of each target object as shown in table 2, each primary key segment may be divided into 3 feature data set files in the time direction, i.e., one feature data set file per month.
TABLE 3
        T1 (unit times 1–30)                T2 (unit times 31–60)                 T3 (unit times 61–90)
KS1     F_001 {F1_1…F1_30, F2_1…F2_30}      F_004 {F1_31…F1_60, F2_31…F2_60}      F_007 {F1_61…F1_90, F2_61…F2_90}
KS2     F_002 {F3_1…F3_30, F4_1…F4_30}      F_005 {F3_31…F3_60, F4_31…F4_60}      F_008 {F3_61…F3_90, F4_61…F4_90}
KS3     F_003 {F5_1…F5_30, F6_1…F6_30}      F_006 {F5_31…F5_60, F6_31…F6_60}      F_009 {F5_61…F5_90, F6_61…F6_90}
In this way, the time series data set acquired at block 210 may be divided into a plurality of feature data set files segmented in both the primary key and time directions, each feature data set file including the time series data of at least two target objects in one primary key segment within one first time period T. For example, as shown in Table 3, the time series data set is divided into: a feature data set file F_001 {F1_1, F1_2 … F1_30, F2_1, F2_2 … F2_30} corresponding to primary key segment KS1 and first time period T1; a feature data set file F_002 {F3_1, F3_2 … F3_30, F4_1, F4_2 … F4_30} corresponding to primary key segment KS2 and first time period T1; a feature data set file F_003 {F5_1, F5_2 … F5_30, F6_1, F6_2 … F6_30} corresponding to primary key segment KS3 and first time period T1; a feature data set file F_004 {F1_31, F1_32 … F1_60, F2_31, F2_32 … F2_60} corresponding to primary key segment KS1 and first time period T2; a feature data set file F_005 {F3_31, F3_32 … F3_60, F4_31, F4_32 … F4_60} corresponding to primary key segment KS2 and first time period T2; a feature data set file F_006 {F5_31, F5_32 … F5_60, F6_31, F6_32 … F6_60} corresponding to primary key segment KS3 and first time period T2; a feature data set file F_007 {F1_61, F1_62 … F1_90, F2_61, F2_62 … F2_90} corresponding to primary key segment KS1 and first time period T3; a feature data set file F_008 {F3_61, F3_62 … F3_90, F4_61, F4_62 … F4_90} corresponding to primary key segment KS2 and first time period T3; and a feature data set file F_009 {F5_61, F5_62 … F5_90, F6_61, F6_62 … F6_90} corresponding to primary key segment KS3 and first time period T3; and so on.
That is, table 3 can be expressed as table 4 below:
TABLE 4
        T1       T2       T3
KS1     F_001    F_004    F_007
KS2     F_002    F_005    F_008
KS3     F_003    F_006    F_009
Note that when a primary key segment is segmented by the first time period T at block 230, the last time period may be shorter than the first time period T. In that case, the corresponding feature data set files may be generated with the actual size of the last time period, i.e., the time periods corresponding to the above-mentioned feature data set files F_007, F_008, and F_009 would be smaller than the first time period T. Such a time period shorter than the first time period T does not affect the subsequent model training process, because the correspondence between each feature data set file and time is indicated in the storage structure of the feature data set files (e.g., by the second metadata information 322 described below). For brevity, only the case where the last time period equals the first time period T is considered herein.
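A minimal sketch of the time-direction segmentation of block 230, continuing the previous sketch; the pickle serialization, the file naming F_001, F_002, …, and the directory layout are illustrative assumptions (the disclosure leaves the on-disk format open, mentioning TFRecord-style files elsewhere). The last slice naturally comes out shorter than T when the sequences do not divide evenly, matching the note above.

```python
import pickle
from pathlib import Path

def split_segments_by_time(segments, period_len=30, out_dir="feature_files"):
    """segments: {segment name: {primary key: [feature lists in time order]}}.
    Writes one feature data set file per (first time period T, segment) and
    returns a {(segment name, period name): file path} mapping."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    n_units = max(len(seq) for per_key in segments.values() for seq in per_key.values())
    mapping, file_no = {}, 0
    for p, start in enumerate(range(0, n_units, period_len), start=1):
        for seg_name, per_key in segments.items():
            file_no += 1  # yields F_001 -> (KS1, T1), F_002 -> (KS2, T1), ...
            chunk = {pk: seq[start:start + period_len] for pk, seq in per_key.items()}
            path = out / f"F_{file_no:03d}.pkl"
            path.write_bytes(pickle.dumps(chunk))
            mapping[(seg_name, f"T{p}")] = path
    return mapping
```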
Continuing with FIG. 2, at block 240, computing device 20 may derive a training data set for training the machine learning model from the feature data set file over a second time period C, wherein the second time period C includes a plurality of first time periods T.
Here, the second time period C may be the training period described above. In the case where the second time period C is 3 months and the first time period T is 1 month, the computing device 20 may generate one training data set of the machine learning model from the 3 feature data set files in the 3 first time periods T (e.g., the first time periods T1, T2, and T3 shown in Table 4) corresponding to each primary key segment KS. For example, computing device 20 may read and merge the 3 feature data set files F_001, F_004, and F_007 corresponding to primary key segment KS1 into one training data set.
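Continuing the sketch, the read-and-merge of block 240 selects, for one primary key segment, the files of the first time periods composing the second time period C and concatenates them per primary key; names are carried over from the earlier illustrative sketches:

```python
import pickle

def load_training_window(mapping, seg_name, periods=("T1", "T2", "T3")):
    """Read the feature data set files of one primary key segment for the
    first time periods composing the second time period C, and concatenate
    the per-key feature-list sequences in time order."""
    merged = {}
    for period in periods:
        chunk = pickle.loads(mapping[(seg_name, period)].read_bytes())
        for pk, feature_lists in chunk.items():
            merged.setdefault(pk, []).extend(feature_lists)
    return merged  # {primary key: feature lists spanning the whole window C}

# e.g. training_window = load_training_window(mapping, "KS1")  # F_001+F_004+F_007
```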
In block 240, the computing device 20 needs to read multiple feature data set files into memory and merge them into one training data set. In order to facilitate memory reading, the storage mode of the feature data set file in the database can be designed into a multi-directory and multi-file storage structure. FIG. 3 illustrates a schematic diagram of a storage structure 300 for a feature data set file according to some embodiments of the present disclosure.
As shown in fig. 3, the storage structure 300 may include first metadata information 310 indicating a correspondence between each primary key segment KS and the at least two target objects it contains. For example, for a time series data set as shown in Table 2, the first metadata information 310 may be represented as <KS1, Id_1, Id_2>, <KS2, Id_3, Id_4>, <KS3, Id_5, Id_6>.
The storage structure 300 further includes a plurality of primary key directories 320, each corresponding to one primary key segment KS. In the storage structure 300, the primary key directories 320 are stored in the order of the primary key segments KS in the first metadata information 310. Each primary key directory 320 may further include second metadata information 322 indicating a correspondence between the respective first time periods T and the feature data set files under the primary key segment KS to which the primary key directory 320 corresponds. For example, for primary key directory 320-1, assuming that it corresponds to primary key segment KS1, its second metadata information 322-1 indicates the correspondence between the first time periods T1, T2, and T3 and the feature data set files F_001, F_004, and F_007 under primary key segment KS1, which may be represented, for example, as <F_001, T1>, <F_004, T2>, <F_007, T3>. Similarly, for primary key directory 320-2, assuming that it corresponds to primary key segment KS2, its second metadata information 322-2 indicates the correspondence between the first time periods T1, T2, and T3 and the feature data set files F_002, F_005, and F_008 under primary key segment KS2, which may be represented as <F_002, T1>, <F_005, T2>, <F_008, T3>.
In addition, the respective feature data set files 324 under each primary key directory 320 are stored under that primary key directory 320. For example, feature data set files F_001, F_004, and F_007 are stored under primary key directory 320-1, and feature data set files F_002, F_005, and F_008 are stored under primary key directory 320-2.
In some embodiments, in the first metadata information 310, the primary key segments KS may be arranged in the order of the primary key segments, i.e., in the order of the respective primary keys. For example, the first metadata information 310 may be represented as <KS1, Id_1, Id_2>, <KS2, Id_3, Id_4>, <KS3, Id_5, Id_6> in the order of primary key segments KS1, KS2, KS3. In this case, when the feature data set files are read during the model training phase at block 240, they are also read sequentially in the order of the primary key segments. To prevent the primary key order from affecting the effectiveness of model training, computing device 20 may further shuffle the read feature data set files to generate the training data set.
In other embodiments, the primary key segments KS may be randomly arranged in the first metadata information 310, i.e., not in the order of the primary keys. For example, the first metadata information 310 may be represented as <KS2, Id_3, Id_4>, <KS3, Id_5, Id_6>, <KS1, Id_1, Id_2>. In this case, the feature data set files read during the model training phase at block 240 are already in random order, so no additional primary key shuffling operation is required.
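A sketch of how the storage structure 300 with its two levels of metadata might be materialized on disk; the JSON file names, and the use of random.shuffle to realize the randomly arranged variant of the first metadata information, are illustrative assumptions:

```python
import json
import random
from pathlib import Path

def write_storage_structure(root, segments, period_files, shuffle_segments=True):
    """Materializes storage structure 300:
         root/first_metadata.json        - segment -> its primary keys (310)
         root/<KS>/second_metadata.json  - first time period -> file name (322)
       The feature data set files themselves live under root/<KS>/ (324)."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)

    seg_names = list(segments)
    if shuffle_segments:
        # Randomly arranged first metadata: reads during training come out in
        # random segment order, so no separate shuffling pass is needed.
        random.shuffle(seg_names)

    first_meta = [{"segment": s, "primary_keys": list(segments[s])} for s in seg_names]
    (root / "first_metadata.json").write_text(json.dumps(first_meta, indent=2))

    for seg_name in seg_names:
        seg_dir = root / seg_name
        seg_dir.mkdir(exist_ok=True)
        second_meta = {period: path.name
                       for (s, period), path in period_files.items() if s == seg_name}
        (seg_dir / "second_metadata.json").write_text(json.dumps(second_meta, indent=2))
```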
Each feature data set file 324 (e.g., the feature data set files F_001 through F_009 described above) contains a plurality of time series data of at least two target objects (i.e., primary keys) within one first time period T. During model training, memory operations are always performed in a particular manner (e.g., in a primary key-feature value manner). The arrangement of primary keys and feature values within the feature data set file 324 may therefore be designed to facilitate memory reads and merges during the model training phase.
In some embodiments, each feature data set file 324 may be stored in rows of primary keys and columns of feature values. For example, for the feature data set file F001, it may be stored in the form of table 5 as follows:
TABLE 5
F1_1, F1_2 … F1_30
F2_1, F2_2 … F2_30
Here, as described above, each time series data Fi_j is a feature list that may include hundreds or even thousands of feature values.
In such an embodiment, at block 240, when computing device 20 is to read the feature data set files in the second time period C, it needs to read the feature data set files of all the first time periods T included in the second time period C into memory. Computing device 20 may then perform a row-column conversion on the plurality of feature values of each time series data of each feature data set file, and expand the converted feature data set files in the direction of the rows to generate a training data set of the machine learning model.
For example, for a second time period C containing the first time periods T1, T2, and T3, computing device 20 may merge the feature data set files F_001, F_004, and F_007 in the direction of the rows into Table 6 below:
TABLE 6
F1_1, F1_2 … F1_30, F1_31, F1_32 … F1_60, F1_61, F1_62 … F1_90
F2_1, F2_2 … F2_30, F2_31, F2_32 … F2_60, F2_61, F2_62 … F2_90
Note that each time series data Fi_j in the feature data set file is a feature list containing a plurality of feature values; therefore, the feature values in each time series data Fi_j need to be dynamically expanded in the direction of the rows.
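In code, this read path might look as follows: with primary keys as rows, each merged row is a sequence of per-unit-time feature lists, and the feature values of each list are unrolled along the row to form one flat training row per primary key. The numpy usage and the helper name are illustrative assumptions:

```python
import numpy as np

def rows_to_training_matrix(merged):
    """merged: {primary key: [feature list per unit time]} as produced by the
    load_training_window sketch (primary-key-per-row layout, Tables 5/6).
    Each per-unit-time feature list is expanded along the row, yielding one
    flat training row per primary key."""
    pks = sorted(merged)
    rows = [np.concatenate([np.asarray(fl, dtype=np.float32) for fl in merged[pk]])
            for pk in pks]
    return pks, np.stack(rows)  # shape: (num primary keys, units x features)
```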
In other embodiments, each feature data set file 324 may be stored with the feature values as rows and the primary keys as columns. For example, the feature data set file F_001 may be stored in the form of Table 7 below:
TABLE 7
Id_1     Id_2
F1_1     F2_1
F1_2     F2_2
…        …
F1_30    F2_30
In such an embodiment, when computing device 20 is to read the feature data set files in the second time period C at block 240, the feature data set files of all the first time periods T included in the second time period C need to be read into memory. Computing device 20 may then directly append and merge the feature values of each feature data set file along the column direction, and perform a row-column conversion after the merging is completed to generate one training data set of the machine learning model.
For example, for a second time period C containing the first time periods T1, T2, and T3, computing device 20 may merge the feature data set files F_001, F_004, and F_007 one below another into Table 8 below:
TABLE 8
Id_1     Id_2
F1_1     F2_1
…        …
F1_30    F2_30
F1_31    F2_31
…        …
F1_60    F2_60
F1_61    F2_61
…        …
F1_90    F2_90
Computing device 20 may then perform a row-column conversion on Table 8 to generate a training data set for the machine learning model. In such an embodiment, since the plurality of feature values of each time series data Fi_j are arranged column-wise, only a row-column conversion of the merged feature value list is needed, and the feature values in each time series data Fi_j need not be dynamically expanded in the direction of the rows. Furthermore, this manner of storage allows efficient data compression, since the data types within the same feature are the same.
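A sketch of this column-wise read path: each file holds one row per unit time (the feature lists) and one column per primary key, so merging successive first time periods is a vertical stack followed by a single transpose at the end. Shapes and names are illustrative assumptions:

```python
import numpy as np

def stack_and_transpose(period_blocks):
    """period_blocks: one array per first time period T, each of shape
    (feature values in the period, num primary keys) -- feature values as
    rows, primary keys as columns (Table 7). Vertically stacking the blocks
    reproduces Table 8; one final transpose yields one row per primary key."""
    merged = np.vstack(period_blocks)  # append T2, T3, ... below T1
    return merged.T                    # (num primary keys, total feature values)

# Toy shapes: 3 periods of 30 unit times x 4 feature values, 2 primary keys.
blocks = [np.zeros((30 * 4, 2), dtype=np.float32) for _ in range(3)]
assert stack_and_transpose(blocks).shape == (2, 360)
```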
This completes the process of storing the time series data of a plurality of target objects and generating a training data set from the stored data to train a machine learning model.
For machine learning models that utilize time-series data, it is also often necessary to iteratively update the trained machine learning model over time and with incremental time-series data generation. In this case, the method 200 may further include processing and storing the incremental time series data, and generating a new training data set based on the incremental time series data to train the machine learning model.
Fig. 4 illustrates a flow diagram of a method 200' for training a machine learning model for a target object, according to some embodiments of the present disclosure. Blocks 210 through 240 of method 200 'are substantially the same as blocks 210 through 240 of method 200 shown in fig. 2, except that method 200' further includes blocks 250 through 270 that process the incremental time series data and generate a new training data set based on the incremental time series data to train the machine learning model.
Specifically, at block 250, for each primary key segment KS, computing device 20 may obtain incremental time series data of the at least two target objects included in the primary key segment KS in an incremental first time period. Assume that, for primary key segments KS1, KS2, and KS3, the incremental time series data acquired in the incremental first time period T4 is as shown in Table 9 below:
TABLE 9
KS1:  Id_1 → F1_91 … F1_120;  Id_2 → F2_91 … F2_120
KS2:  Id_3 → F3_91 … F3_120;  Id_4 → F4_91 … F4_120
KS3:  Id_5 → F5_91 … F5_120;  Id_6 → F6_91 … F6_120
Next, at block 260, computing device 20 may generate an incremental feature data set file for the incremental time series data of each primary key segment KS shown in Table 9. Assume that the incremental feature data set files generated are as shown in Table 10 below:
TABLE 10
KS1, T4:  F_010 {F1_91 … F1_120, F2_91 … F2_120}
KS2, T4:  F_011 {F3_91 … F3_120, F4_91 … F4_120}
KS3, T4:  F_012 {F5_91 … F5_120, F6_91 … F6_120}
After the incremental feature data set files described above have been generated, at block 270 computing device 20 may replace one of the feature data set files read at block 240 in the second time period C with the incremental feature data set file, to serve as the next training data set for iteratively training the machine learning model. Preferably, the replaced feature data set file may be the first feature data set file read at block 240 in the second time period C, so that the next training data set is the natural continuation of the previous training data set in time.
For example, computing device 20 may read the incremental feature data set file F_010 corresponding to primary key segment KS1 and use it to replace the first (F_001) of the 3 feature data set files F_001, F_004, and F_007 read at block 240 to generate a new training data set.
In this way, the full feature data set files of the new second time period C' (i.e., the first time periods T2 through T4) need not be reprocessed to generate the new training data set.
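In the terms of the earlier sketches, blocks 250-270 reduce to a sliding window over the per-period files: the oldest first time period is dropped and the incremental one appended, reusing the stored T2 and T3 files unchanged. The helper below builds on the illustrative load_training_window sketch above:

```python
def next_training_window(mapping, seg_name, prev_periods, incremental_period):
    """Sliding the training window: e.g. prev_periods=("T1", "T2", "T3") and
    incremental_period="T4" yield the window ("T2", "T3", "T4"). Only the
    incremental feature data set file is newly generated; the files for T2
    and T3 are reused exactly as stored."""
    new_periods = tuple(prev_periods[1:]) + (incremental_period,)
    return load_training_window(mapping, seg_name, periods=new_periods)
```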
For convenience of memory reading, the incremental feature data set file F_010 may be stored in a multi-directory, multi-file manner similar to the storage structure 300 shown in fig. 3. FIG. 5 illustrates a schematic diagram of a storage structure 300' containing incremental feature data set files according to some embodiments of the present disclosure.
The storage structure 300' is similar to the storage structure 300 shown in fig. 3, except that the second metadata information 322' in each primary key directory 320 also indicates a correspondence between an incremental feature data set file and the incremental first time period, and the incremental feature data set file is also stored under each primary key directory 320. For example, for primary key directory 320-1, assuming that it corresponds to primary key segment KS1, its second metadata information 322'-1 indicates, in addition to the correspondences between the first time periods T1, T2, and T3 and the feature data set files under primary key segment KS1 (e.g., <F_001, T1>, <F_004, T2>, <F_007, T3>), the correspondence between the incremental feature data set file F_010 and the incremental first time period T4 (e.g., <F_010, T4>); and the incremental feature data set file F_010 is stored under primary key directory 320-1 in addition to the feature data set files F_001, F_004, and F_007. Similarly, for primary key directory 320-2, assuming that it corresponds to primary key segment KS2, its second metadata information 322'-2 indicates, in addition to the correspondences between the first time periods T1, T2, and T3 and the feature data set files under primary key segment KS2 (e.g., <F_002, T1>, <F_005, T2>, <F_008, T3>), the correspondence between the incremental feature data set file F_011 and the incremental first time period T4 (e.g., <F_011, T4>); and the incremental feature data set file F_011 is stored under primary key directory 320-2 in addition to the feature data set files F_002, F_005, and F_008.
The incremental feature data set files F_010 and F_011 may likewise be stored in the manner shown in Table 5 or Table 7.
Thus, similar to block 240, when computing device 20 generates the new training data set at block 270, it may perform a row-column conversion on the feature values of each time series data of the feature data set files F_004, F_007, and F_010 of the first time periods T included in the second time period C' (i.e., the first time periods T2 through T4) and expand them in the row direction to generate a new training data set of the machine learning model; or it may directly merge the feature values of the feature data set files F_004, F_007, and F_010 along the column direction and perform a row-column conversion after the merging is completed to generate a new training data set of the machine learning model.
With the scheme of the present disclosure, the time series data of a large number of target objects is segmented and stored along both the primary key and time directions, which bounds the size of the data that must be read when generating a training data set for the machine learning model, makes the data better suited to memory operations, and allows model training to proceed conveniently in a pipelined manner. In particular, where incremental time series data is continuously generated over time, generating a new training data set does not require operating on the full time series data, which reduces processing time and speeds up model training.
In the method 200 or 200', the machine learning model is trained or iteratively trained to produce a trained machine learning model. In some aspects of the disclosure, methods of predicting the sales of a target object (e.g., a chain store) using the trained machine learning model are also provided. Specifically, computing device 20 may acquire, as input data, feature data of the target object (for example, the target object whose primary key is Id_1) in a plurality of unit times (a time series data set as shown in Table 1). For example, the acquired feature data may include the store size and geographic location of the target object, and the promotion conditions and weather conditions of a future day. Computing device 20 may then input the input data into the trained machine learning model to generate a prediction regarding the sales of the target object based on the input data.
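A sketch of this prediction path, assuming a Keras-style trained sequence model purely for illustration; the model object, window length, and feature dimensionality are placeholders rather than part of this disclosure:

```python
import numpy as np

def predict_sales(model, feature_lists):
    """feature_lists: the target object's feature list per unit time over the
    input window (e.g. 90 days x F feature values). `model` is any trained
    sequence regressor exposing a Keras-style predict() method."""
    x = np.asarray(feature_lists, dtype=np.float32)[np.newaxis, ...]  # (1, T, F)
    return float(model.predict(x).ravel()[0])  # predicted sales figure
```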
FIG. 6 illustrates a block diagram of a computing device 600 suitable for implementing embodiments of the present disclosure. Computing device 600 may be, for example, computing device 20 as described above.
As shown in fig. 6, computing device 600 may include one or more Central Processing Units (CPUs) 610 (only one shown schematically) that may perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 620 or loaded from a storage unit 680 into a Random Access Memory (RAM) 630. In the RAM 630, various programs and data required for the operation of computing device 600 may also be stored. The CPU 610, ROM 620, and RAM 630 are connected to each other via a bus 640. An input/output (I/O) interface 650 is also connected to bus 640.
A number of components in computing device 600 are connected to I/O interface 650, including: an input unit 660 such as a keyboard, a mouse, etc.; an output unit 670 such as various types of displays, speakers, and the like; a storage unit 680, such as a magnetic disk, optical disk, or the like; and a communication unit 690 such as a network card, modem, wireless communication transceiver, etc. The communication unit 690 allows the computing device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The methods 200 and 200' described above may be performed, for example, by the CPU 610 of the computing device 600. For example, in some embodiments, methods 200 and 200' may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 680. In some embodiments, part or all of the computer program may be loaded and/or installed onto computing device 600 via ROM 620 and/or communications unit 690. When the computer program is loaded into RAM 630 and executed by CPU 610, one or more operations of methods 200 and 200' described above may be performed. Further, the communication unit 690 may support wired or wireless communication functions.
Those skilled in the art will appreciate that the computing device 600 illustrated in FIG. 6 is merely illustrative. In some embodiments, computing device 20 may contain more or fewer components than computing device 600.
Training data processing methods 200 and 200' for machine learning models and computing device 600, which may be used as computing device 20, according to the present disclosure are described above in connection with the accompanying drawings. However, it will be understood by those skilled in the art that the steps of methods 200 and 200' and their sub-steps may be performed in any other reasonable order without being limited to the order shown in the figures and described above. Further, the computing device 600 also need not include all of the components shown in FIG. 6, it may include only some of the components necessary to perform the functions described in this disclosure, and the manner in which these components are connected is not limited to the form shown in the figures.
The present disclosure may be methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.
In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The units of the apparatus disclosed herein may be implemented using discrete hardware components, or may be integrally implemented on a single hardware component, such as a processor. For example, the various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method for training a machine learning model for a target object, comprising:
acquiring time sequence data sets of a plurality of target objects in a plurality of unit times, wherein the time sequence data sets comprise a plurality of characteristic values of each target object in each unit time;
segmenting the time-series data set in a primary key direction to produce a plurality of primary key segments, wherein each primary key segment comprises time-series data of at least two target objects of the plurality of target objects;
segmenting each primary key segment in a temporal direction to produce a plurality of feature dataset files, wherein each feature dataset file comprises time series data of the at least two target objects over a first time period, and the first time period comprises a plurality of unit times; and
and obtaining a training data set for training the machine learning model according to the characteristic data set file in a second time period, wherein the second time period comprises a plurality of first time periods.
2. The method of claim 1, further comprising:
for each primary key segment, acquiring incremental time series data of the at least two target objects included in the primary key segment in an incremental first time period;
generating an incremental feature data set file for the incremental time series data; and
replacing a feature data set file in the second time period with the incremental feature data set file to serve as a next training data set for iteratively training the machine learning model.
3. The method of claim 1 or 2, further comprising:
storing first metadata information of the plurality of feature data set files, the first metadata information indicating a correspondence between each primary key segment and at least two target objects contained by the primary key segment;
storing second metadata information of the primary key segment under each primary key segment, wherein the second metadata information indicates a corresponding relationship between a first time period under the primary key segment and a feature data set file; and
and storing a feature data set file under each primary key segment.
4. The method of claim 3, wherein the plurality of primary key segments are randomly arranged in the first metadata information.
5. The method of claim 3, wherein in the first metadata information, the plurality of primary key segments are arranged based on an order of primary key segments.
6. The method of claim 1 or 2, further comprising:
storing each feature data set file in a manner of taking the feature values as rows and taking the primary keys as columns, and obtaining a training data set for training the machine learning model according to the feature data set files in the second time period comprises:
in a case in which the feature data set files of the plurality of first time periods included in the second time period are read into a memory, directly merging the feature values of each feature data set file along the column direction, and performing a row-column conversion after the merging is completed to generate the training data set of the machine learning model.
7. The method of claim 1 or 2, further comprising:
storing each feature data set file in a manner of taking the primary key as a row and taking the feature value as a column, and obtaining a training data set for training the machine learning model according to the feature data set file in the second time period comprises:
and under the condition that the feature data set files in the plurality of first time periods included in the second time period are to be read into the memory, performing row-column conversion on a plurality of feature values of each time sequence data of each feature data set file, and expanding the converted feature data set files along the row direction to generate a training data set of the machine learning model.
8. The method of claim 1, wherein the target object comprises a store chain and the feature value comprises operational data of the store chain.
9. A method for predicting sales of a target object, comprising:
acquiring characteristic data of the target object in a plurality of unit times as input data; and
inputting the input data into a machine learning model trained via the method of any one of claims 1 to 8 to generate a prediction regarding sales of the target object from the input data.
10. A computing device, comprising:
at least one processor; and
at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, the instructions when executed by the at least one processor causing the computing device to perform the steps of the method of any of claims 1-9.
11. A computer readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 1 to 9.
CN202111652923.1A 2021-12-30 2021-12-30 Method of training machine learning model, prediction method, computing device, and medium Pending CN114372579A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111652923.1A CN114372579A (en) 2021-12-30 2021-12-30 Method of training machine learning model, prediction method, computing device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111652923.1A CN114372579A (en) 2021-12-30 2021-12-30 Method of training machine learning model, prediction method, computing device, and medium

Publications (1)

Publication Number Publication Date
CN114372579A true CN114372579A (en) 2022-04-19

Family

ID=81141524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111652923.1A Pending CN114372579A (en) 2021-12-30 2021-12-30 Method of training machine learning model, prediction method, computing device, and medium

Country Status (1)

Country Link
CN (1) CN114372579A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118014098A (en) * 2024-02-04 2024-05-10 贝格迈思(深圳)技术有限公司 Machine learning training data scheduling method and equipment


Similar Documents

Publication Publication Date Title
CN110287961B (en) Chinese word segmentation method, electronic device and readable storage medium
US20210049507A1 (en) Method and system for distributed machine learning
CN106649890B (en) Data storage method and device
CN110377740B (en) Emotion polarity analysis method and device, electronic equipment and storage medium
CN102388382B (en) Scalable clustered approach and system
US9361343B2 (en) Method for parallel mining of temporal relations in large event file
US11544542B2 (en) Computing device and method
CN112508118A (en) Target object behavior prediction method aiming at data migration and related equipment thereof
CN110990563A (en) Artificial intelligence-based traditional culture material library construction method and system
US20210191927A1 (en) Secure aggregate function computation system, secure computation apparatus, secure aggregate function computation method, and program
US9147168B1 (en) Decision tree representation for big data
CN112801712B (en) Advertisement putting strategy optimization method and device
EP3278238A1 (en) Fast orthogonal projection
CN112116104B (en) Method, device, medium and electronic equipment for automatically integrating machine learning
CN114372579A (en) Method of training machine learning model, prediction method, computing device, and medium
CN113687825B (en) Method, device, equipment and storage medium for constructing software module
CN111914987A (en) Data processing method and device based on neural network, equipment and readable medium
JP6154491B2 (en) Computer and graph data generation method
CN116843970A (en) Fine granularity small sample classification method based on task specific channel reconstruction network
CN115238676B (en) Method and device for identifying bidding requirement hot spot, storage medium and electronic equipment
CN116503608A (en) Data distillation method based on artificial intelligence and related equipment
US9122997B1 (en) Generating attribute-class-statistics for decision trees
CN109299260B (en) Data classification method, device and computer readable storage medium
CN109918564A (en) It is a kind of towards the context autocoding recommended method being cold-started completely and system
CN109992687B (en) Face data searching method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220419

Assignee: Baisheng Consultation (Shanghai) Co.,Ltd.

Assignor: Shengdoushi (Shanghai) Technology Development Co.,Ltd.

Contract record no.: X2023310000138

Denomination of invention: Methods, prediction methods, computing equipment, and media for training machine learning models

License type: Common License

Record date: 20230714