WO2023155426A1

WO2023155426A1 - Data processing method and apparatus

Info

Publication number: WO2023155426A1
Application number: PCT/CN2022/118700
Authority: WO
Inventors: 陈家禹; 陈浪; 庄晓天; 吴盛楠
Original assignee: 北京京东振世信息技术有限公司
Priority date: 2022-02-17
Filing date: 2022-09-14
Publication date: 2023-08-24
Also published as: CN114219545B; CN114219545A

Abstract

Disclosed in the present disclosure are a data processing method and apparatus. A specific embodiment comprises: determining time series data to be processed; for a case where said time series data has a quantity abrupt change point at a set time point in a historical target time period, segmenting, from said time series data, first time series data before the set time point and second time series data after the set time point; determining a target feature time series cluster to which the second time series data belongs, and using a preset first data processing model to process the first time series data and a second data processing model matching the target feature time series cluster to process the second time series data; and according to the first processing result of the first data processing model and the processing result of the second data processing model, determining a first prediction result of article demands in a prediction time series.

Description

Method and device for data processing

Cross References to Related Applications

This application claims priority to Chinese patent application No. 202210144038.0, filed on February 17, 2022 and entitled "A Method and Device for Data Processing", the disclosure of which is hereby incorporated by reference in its entirety as part or all of this application.

Technical Field

The present disclosure relates to the technical field of smart supply chain, and in particular to a data processing method and device.

Background

In the retail industry, forecasting item demand is the basis of supply chain management. The forecast provides the input data for inventory replenishment and simulation, and its quality directly determines the validity of the results of the downstream models. The forecast data can also show supply chain managers the expected demand, thereby supporting decision-making.

In the existing retail industry, a single FFORMA (Feature-based Forecast Model Averaging) model is often used as the data processing model for all time-series data when predicting item demand. However, a single data processing model cannot differentiate between the many different types of time-series data, which results in low accuracy of the forecast results.

Summary of the Invention

In view of this, the embodiments of the present disclosure provide a data processing method and device. For the case where the time-series data to be processed has a quantity mutation point at a set time point within the historical target time period, the time-series data to be processed is divided into first time-series data before the set time point and second time-series data after the set time point. The quantity mutation point is determined through big data statistics, so that the data at the multiple time points included in the first time-series data is smaller than the value of the quantity mutation point, and the data at the multiple time points included in the second time-series data is equal to or greater than the value of the quantity mutation point. The first data processing model and the second data processing model then process the first time-series data and the second time-series data respectively; that is, different data processing models process the data of different stages, so that the processing results more truly reflect the item demand at each stage. Finally, the first forecast result of the item demand in the forecast time series is determined according to the first processing result of the first data processing model and the processing result of the second data processing model, so that the item inventory in the forecast time series can be set based on the first forecast result.

To achieve the above purpose, according to the first aspect of the embodiments of the present disclosure, a data processing method is provided.

The data processing method of the embodiment of the present disclosure includes:

Determining the time series data to be processed; wherein, the time series data to be processed indicates changes in demand for items within a historical target time period;

For the case where the time-series data to be processed has a quantity mutation point at a set time point within the historical target time period, dividing the time-series data to be processed into first time-series data located before the set time point and second time-series data located after the set time point; determining the target characteristic time-series cluster to which the second time-series data belongs; and processing the first time-series data using a preset first data processing model and processing the second time-series data using a second data processing model that matches the target characteristic time-series cluster; wherein the quantity mutation point is determined through big data statistics;

According to the first processing result of the first data processing model and the processing result of the second data processing model, determining the first forecast result of the item demand in the forecast time series, so as to set the item inventory in the forecast time series based on the first forecast result.

In one or more embodiments of the present application, after the determination of the time series data to be processed, it further includes:

For the case where the time-series data to be processed does not have a quantity mutation point at the set time point within the historical target time period,

Processing the time series data to be processed by using the preset first data processing model;

According to the second processing result of the first data processing model, determine a second forecast result of the item demand situation in the forecast time series, so as to set the item inventory in the forecast time series based on the second forecast result.

In one or more embodiments of the present application, the first processing result of the first data processing model includes the forecast result of the item demand before the set time point in the forecast time series; the processing result of the second data processing model includes the forecast result of the item demand after the set time point in the forecast time series;

Determining the first forecast result of the item demand in the forecast time series includes: splicing the forecast result of the item demand before the set time point in the forecast time series with the forecast result of the item demand after the set time point in the forecast time series to obtain the first forecast result.

In the case that the time-series data to be processed satisfies the normality assumption, the Buishand statistical method is used to determine whether or not the time-series data to be processed has a quantity mutation point at the set time point within the historical target time period;

In the case that the time-series data to be processed does not satisfy the normality assumption, the Pettitt test method is used to determine whether or not the time-series data to be processed has a quantity mutation point at the set time point within the historical target time period.

In one or more embodiments of the present application, the method further includes: screening out, from all the item time-series data of the test case, the target item time-series data that has a quantity mutation point; segmenting, from the target item time-series data, the second time-series data for training located after the set time point, wherein the data at the multiple time points included in the second time-series data for training is equal to or greater than the value of the quantity mutation point included in the target item time-series data;

clustering the second time-series data for training to obtain one or more characteristic time-series clusters;

Determining the target characteristic time-series cluster to which the second time-series data belongs includes: selecting a target characteristic time-series cluster for the second time-series data from the one or more characteristic time-series clusters.

In one or more embodiments of the present application, selecting the target characteristic time-series cluster for the second time-series data includes: for the case where there are multiple characteristic time-series clusters, matching the second time-series data with each of the characteristic time-series clusters; and, according to the matching result, screening out a target characteristic time-series cluster from the multiple characteristic time-series clusters.

In one or more embodiments of the present application, the method further includes: segmenting, from the target item time-series data, the first time-series data for training located before the set time point, wherein the data at the multiple time points included in the first time-series data for training is smaller than the value of the quantity mutation point included in the target item time-series data; clustering the time-series data of the items in the test case other than the target item time-series data together with the first time-series data for training; and using the clustering result to train a preset first type of model to be trained to obtain the first data processing model, wherein the first type of model to be trained is configured based on one or more base models, and the base models include one or more of ARIMA, ETS, Croston, simple moving average, FBProphet, Holt-Winters, and first-order exponential smoothing.

In one or more embodiments of the present application, the method further includes: for each characteristic time-series cluster, using the time-series data corresponding to the characteristic time-series cluster to train a preset second type of model to be trained, so as to obtain a second data processing model that matches the characteristic time-series cluster, wherein the second type of model to be trained is configured based on one or more base models, and the second type of model to be trained is different from the first type of model to be trained.

In one or more embodiments of the present application, screening out the target item time-series data that has a quantity mutation point from all the item time-series data of the test case includes: for the item time-series data of each warehouse item identifier in the test case, performing the following operations: determining a data segmentation point; and, in the case where the data segmentation point is a quantity mutation point of the item time-series data of the warehouse item identifier, determining that the item time-series data of the warehouse item identifier is the target item time-series data.

In one or more embodiments of the present application, clustering the second time-series data for training includes: determining a sequence translation function and a translation distance for a preset distance function; and clustering the second time-series data for training using the DBSCAN algorithm together with the distance function for which the sequence translation function and the translation distance have been determined.

In one or more embodiments of the present application, selecting the target characteristic time-series cluster for the second time-series data includes: splicing the second time-series data end to end; matching the spliced result with the one or more characteristic time-series clusters; and using the characteristic time-series cluster with the highest matching value as the target characteristic time-series cluster.

In one or more embodiments of the present application, training the preset second type of model to be trained using the time-series data corresponding to the characteristic time-series cluster includes: clipping the second time-series data for training included in the characteristic time-series cluster, and splicing the clipped results to generate new second time-series data for training; and training the preset second type of model to be trained using both the second time-series data for training included in the characteristic time-series cluster and the new second time-series data for training.

To achieve the above object, according to a second aspect of one or more embodiments of the present disclosure, a data processing apparatus is provided.

The data processing device of the embodiment of the present disclosure includes:

A determining module, configured to determine the time series data to be processed, wherein the time series data to be processed indicates the change in demand for items within the historical target time period;

A forecasting module, configured to: when the time-series data to be processed has a quantity mutation point at a set time point within the historical target time period, divide the time-series data to be processed into the second time-series data and the first time-series data according to the quantity mutation point; determine the target characteristic time-series cluster to which the second time-series data belongs; and process the first time-series data with the preset first data processing model and process the second time-series data with a second data processing model that matches the target characteristic time-series cluster; wherein the quantity mutation point is determined through big data statistics;

To achieve the above object, according to a third aspect of one or more embodiments of the present disclosure, a data processing device is provided.

The data processing device in one or more embodiments of the present disclosure includes: one or more processors; and a storage system for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data processing method of the embodiments of the present disclosure.

To achieve the above object, according to a fourth aspect of one or more embodiments of the present disclosure, a computer-readable medium is provided.

A computer program is stored on a computer-readable medium in one or more embodiments of the present disclosure, and when the program is executed by a processor, the data processing method of the embodiments of the present disclosure is implemented.

The further effects of the above-mentioned non-conventional alternatives will be described below in conjunction with specific embodiments.

Brief Description of the Drawings

The accompanying drawings are for better understanding of the present disclosure, and do not constitute an improper limitation of the present disclosure. In the drawings:

FIG. 1 is a schematic diagram of the main flow of a data processing method in one or more embodiments of the present disclosure;

FIG. 2 is a schematic diagram of data segmentation in one or more embodiments of the present disclosure;

Fig. 3 is a schematic diagram of the main process for the case where the time series data to be processed does not have a quantity mutation point at a set time point within the historical target time period in one or more embodiments of the present disclosure;

Fig. 4 is a schematic diagram of the main process of determining the number of mutation points in one or more embodiments of the present disclosure;

FIG. 5 is a schematic diagram of the main process of obtaining characteristic timing clusters in one or more embodiments of the present disclosure;

FIG. 6 is a schematic diagram of the main process of clustering the second time series data for training in one or more embodiments of the present disclosure;

Fig. 7 is a schematic diagram of a main process of selecting a target feature time series cluster for the second time series data in one or more embodiments of the present disclosure;

Fig. 8 is a schematic diagram of the main process of determining the first data processing model in one or more embodiments of the present disclosure;

Fig. 9 is a schematic diagram of another main process for selecting a target feature time series cluster for the second time series data in one or more embodiments of the present disclosure;

Fig. 10 is a schematic diagram of a data pruning step in one or more embodiments of the present disclosure;

Fig. 11 is a schematic diagram of a data clipping and intercepting process in one or more embodiments of the present disclosure;

Fig. 12 is a schematic diagram of an overall flow chart of time series data forecasting sales in one or more embodiments of the present disclosure;

FIG. 13 is a schematic diagram of main modules of a data processing device in one or more embodiments of the present disclosure;

FIG. 14 is an exemplary system architecture diagram that can be applied in one or more embodiments of the present disclosure;

Fig. 15 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server in one or more embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

According to a first aspect of one or more embodiments of the present disclosure, a data processing method applied to a server is provided.

In commodity retailing, the demand for items in certain time series often increases significantly within a certain period of time. If the demand for items cannot be forecast accurately, there is a risk of stock-outs, resulting in reduced service levels and increased costs. The time series of different items are usually identified by SKU (Stock Keeping Unit), the smallest stock keeping unit. The existing FFORMA architecture suffers from the following problems when making predictions for different warehouse item identifiers:

(1) A significant increase in item demand often exists for only some of the warehouse item identifiers. When the model is trained on all warehouse item identifiers, the data without a significant increase in demand pulls down the predicted values for the data that does show such an increase, making the predictions too low;

(2) A significant increase in item demand takes different forms in different time series. FFORMA cannot extract, for each time series, differentiated features that match the characteristics of that series during the period in which item demand increases significantly;

(3) The FFORMA model is relatively difficult to optimize, and it is difficult to train and fit, within a short period of time, a data model suited to the various items whose demand increases significantly.

Fig. 1 is a schematic diagram of a main flow of a data processing method according to one or more embodiments of the present disclosure. As shown in Figure 1, the method mainly includes:

Step S101: Determine the time-series data to be processed, wherein the time-series data to be processed indicates the change of demand for items within the historical target time period;

Step S102: In the case that the time-series data to be processed has a quantity mutation point at the set time point within the historical target time period, divide the time-series data to be processed, according to the quantity mutation point, into first time-series data located before the set time point and second time-series data located after the set time point, wherein the quantity mutation point is determined through big data statistics;

Step S103: Determine the target characteristic time-series cluster to which the second time-series data belongs, process the first time-series data with a preset first data processing model and process the second time-series data with a second data processing model matching the target characteristic time-series cluster;

Step S104: According to the first processing result of the first data processing model and the processing result of the second data processing model, determine the first forecast result of the item demand in the forecast time series, so as to set the item inventory in the forecast time series based on the first forecast result.

The set time point can be set in two ways. One way is to designate a time point as the set time point, such as the 20th of each month. The other is to take as the set time point the time point of a quantity mutation point determined by big data statistical analysis of the changes in item demand over the historical time period; for example, if the changes in item demand show that the quantity mutation point falls on the 10th of each month, then the 10th is determined to be the set time point. It is worth noting that the set time point can be set or adjusted according to actual conditions, so that it can change dynamically to meet different needs.

The quantity mutation point is determined through big data statistics. Specifically, statistical methods such as homogeneity tests (e.g., the Buishand statistical method) and change-point tests (e.g., the Pettitt test method) are applied to the changes in item demand over a large number of historical time periods, and the quantity mutation point is determined according to the statistical results. If the statistical results yield multiple quantity mutation points corresponding to multiple time points, the quantity mutation point with the largest value can be selected as the final quantity mutation point, and the time point corresponding to it is determined as the set time point. Alternatively, the time point that occurs most often can be selected as the set time point, and the value of the quantity mutation point, or its mutation range, is then determined from the values of the quantity mutation points at that time point.
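The two selection rules described above can be illustrated with a minimal Python sketch. The input format (a list of day-of-month and demand-value pairs), the function names, and the choice of the mean as the summary value are assumptions made for illustration; the disclosure does not prescribe them.

```python
from collections import Counter
from statistics import mean

# Hypothetical input: one (day_of_month, demand_value) pair per quantity
# mutation point found in the historical data.
candidates = [(20, 30), (20, 34), (10, 50), (20, 28)]


def pick_largest_value(points):
    """Rule 1: keep the mutation point with the largest value."""
    day, value = max(points, key=lambda p: p[1])
    return day, value


def pick_most_frequent_day(points):
    """Rule 2: keep the day that occurs most often, then summarize the
    mutation values seen on that day as a mean and a (min, max) range."""
    day, _count = Counter(d for d, _ in points).most_common(1)[0]
    values = [v for d, v in points if d == day]
    return day, mean(values), (min(values), max(values))


set_day_1, mutation_value_1 = pick_largest_value(candidates)
set_day_2, mutation_value_2, mutation_range_2 = pick_most_frequent_day(candidates)
```

With the sample data, the first rule selects the 10th (value 50), while the second selects the 20th, which occurs three times.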

In the embodiments of the present disclosure, the second time-series data is the part of the time-series data to be processed in which item demand suddenly increases, and the first time-series data is the part that remains after the second time-series data is removed from the time-series data to be processed. In other words, the values at the multiple time points included in the first time-series data are smaller than the value of the quantity mutation point, and the values at the multiple time points included in the second time-series data are equal to or greater than the value of the quantity mutation point. A schematic diagram of the data segmentation is shown in FIG. 2. This application mainly forecasts sales for the later part of the month, where sales grow rapidly toward the end of the month. Therefore, in the embodiments of this application, the time-series data from the quantity mutation point until the end of the current month is taken as the second time-series data, and the first time-series data refers to the item sales data from the beginning of the month up to, but not including, the date corresponding to the quantity mutation point. The data at the multiple time points included in the second time-series data is generally equal to or greater than the value of the quantity mutation point, and the data at the multiple time points included in the first time-series data is generally smaller than the value of the quantity mutation point. It is worth noting that the dividing line shown in FIG. 2 is placed at the sales point of the time point immediately before the quantity mutation point; that is, the quantity mutation point is the point immediately after the dividing line shown in FIG. 2. In other words, the quantity mutation point is included in the second time-series data.
For example, the set time point is the 20th of each month, whether designated directly or determined as the time point of the quantity mutation point from the changes in item demand over the historical time period, and the time-series data to be processed is the time-series data from January 1, 2021 to February 28, 2021. Assuming that the time-series data to be processed has a quantity mutation point at the set time point, and that the sales volume at the quantity mutation point is 30 pieces/day, then the second time-series data is the sales data from January 20, 2021 to January 31, 2021 and from February 20, 2021 to February 28, 2021, which is generally equal to or greater than 30 pieces per day, and the first time-series data is the sales data from January 1, 2021 to January 19, 2021 and from February 1, 2021 to February 19, 2021, which is generally less than 30 pieces per day.
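As a rough illustration of this segmentation, the following Python sketch splits a daily series by the day of the month. The function name `split_by_set_day`, the dictionary representation of the series, and the synthetic sales figures are assumptions for illustration only.

```python
from datetime import date, timedelta


def split_by_set_day(series, set_day):
    """Split a {date: demand} series into the part before the set day of
    each month and the part from the set day to the end of the month."""
    first = {d: v for d, v in series.items() if d.day < set_day}
    second = {d: v for d, v in series.items() if d.day >= set_day}
    return first, second


# Synthetic data for January 1, 2021 to February 28, 2021 (59 days):
# 10 pieces/day before the 20th, 30 pieces/day from the 20th onward.
days = [date(2021, 1, 1) + timedelta(days=i) for i in range(59)]
series = {d: (30 if d.day >= 20 else 10) for d in days}

first_series, second_series = split_by_set_day(series, set_day=20)
```

Here the first series covers January 1 to 19 and February 1 to 19, and the second series covers January 20 to 31 and February 20 to 28, matching the example in the text.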

It is worth noting that the second time-series data after the set time point may include values greater than the value of the quantity mutation point at the set time point.

The first data processing model in step S103 is obtained by training a preset first type of model to be trained with the first time-series data for training, which is segmented from all the item time-series data of the test case and corresponds to the first time-series data. The second data processing model is obtained by training a preset second type of model to be trained with the second time-series data for training, which is segmented from all the item time-series data of the test case and corresponds to the second time-series data.

In an optional embodiment, the time period covered by the time-series data to be processed may be determined according to actual needs. For example, assuming that the current date is June 1, 2021, the time-series data from January 1, 2021 to June 1, 2021 can be used as the data to be processed to forecast the data from June 2, 2021 to December 1, 2021; alternatively, the time-series data from January 1, 2021 to March 1, 2021 can be used as the data to be processed to forecast the data from June 2, 2021 to December 1, 2021. Because merchants run promotional activities or large-scale shopping festivals in certain months, such as Double Eleven, forecasting from the time-series data of the same period is more accurate; that is, the data from November 1, 2020 to November 30, 2020 is used to forecast the data from November 1, 2021 to November 30, 2021.

Step S103 may specifically include: inputting the first time-series data into the first data processing model, and inputting the second time-series data into the second data processing model, so as to obtain the forecast result of the item demand in the forecast time series. For example, in a sales forecasting scenario, the set time point is the 20th of each month, and the time-series data to be processed is the time-series data from January 1, 2021 to February 28, 2021. Assuming that the time-series data to be processed has a quantity mutation point at the set time point, the second time-series data input to the second data processing model covers January 20, 2021 to January 31, 2021 and February 20, 2021 to February 28, 2021, and the first time-series data input to the first data processing model covers January 1, 2021 to January 19, 2021 and February 1, 2021 to February 19, 2021.

In an optional embodiment, after step S101, for the case where the time-series data to be processed does not have a quantity mutation point at the set time point within the historical target time period, the method, as shown in FIG. 3, includes:

Step S301: process the time series data to be processed by using the preset first data processing model;

Step S302: According to the second processing result of the first data processing model, determine a second forecast result of the item demand situation in the forecast time series, so as to set the item inventory in the forecast time series based on the second forecast result.

Since there is no quantity mutation point in the time-series data to be processed, there is no need to split it into first and second time-series data and process them separately; the whole of the time-series data to be processed is handled by the first data processing model alone, that is, by the model otherwise used for the first time-series data.
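The branching between the two cases can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function `run_models`, the result keys, and the use of a plain mean as a stand-in for both data processing models are hypothetical and are not the trained models of the disclosure.

```python
from datetime import date, timedelta
from statistics import mean


def run_models(series, set_day, has_mutation_point, first_model, second_model):
    """Route the data to one model or two, depending on whether a quantity
    mutation point exists at the set time point."""
    if not has_mutation_point:
        # No mutation point: the whole series goes to the first model.
        return {"whole": first_model(list(series.values()))}
    first = [v for d, v in series.items() if d.day < set_day]
    second = [v for d, v in series.items() if d.day >= set_day]
    return {"first": first_model(first), "second": second_model(second)}


days = [date(2021, 1, 1) + timedelta(days=i) for i in range(59)]
flat_series = {d: 10 for d in days}
jump_series = {d: (30 if d.day >= 20 else 10) for d in days}

# Stand-in models: the mean of the input values (an assumption).
flat_result = run_models(flat_series, 20, False, mean, mean)
jump_result = run_models(jump_series, 20, True, mean, mean)
```

In the no-mutation case only one result is produced, so no splicing step is needed afterwards.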

In a further optional embodiment, the first processing result of the first data processing model includes the forecast result of the item demand before the set time point in the forecast time series; the processing result of the second data processing model includes the forecast result of the item demand after the set time point in the forecast time series;

In step S104, determining the first forecast result of the item demand in the forecast time series further includes: splicing the forecast result of the item demand before the set time point in the forecast time series with the forecast result of the item demand after the set time point in the forecast time series to obtain the first forecast result.

For example, in a sales forecasting scenario, the set time point is the 20th of each month, the time-series data to be processed is the time-series data from January 1, 2021 to February 28, 2021, and the forecast time series runs from January 1, 2022 to February 28, 2022. Assuming that the time-series data to be processed has a quantity mutation point at the set time point, the second time-series data input to the second data processing model covers January 20, 2021 to January 31, 2021 and February 20, 2021 to February 28, 2021, and the first time-series data input to the first data processing model covers January 1, 2021 to January 19, 2021 and February 1, 2021 to February 19, 2021. The first data processing model outputs the forecast results of the item demand before the set time point in the forecast time series, that is, for January 1, 2022 to January 19, 2022 and February 1, 2022 to February 19, 2022. The second data processing model outputs the forecast results of the item demand after the set time point in the forecast time series, that is, for January 20, 2022 to January 31, 2022 and February 20, 2022 to February 28, 2022. Splicing these forecast results yields the first forecast result for January 1, 2022 to February 28, 2022.
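A minimal Python sketch of the splicing step is given below, assuming each model returns its forecast as a mapping from date to predicted demand. The function name `splice_forecasts` and the constant forecast values are hypothetical.

```python
from datetime import date, timedelta


def splice_forecasts(before, after):
    """Merge the two partial forecasts into one date-ordered forecast.
    The two inputs are expected to cover disjoint dates."""
    overlap = before.keys() & after.keys()
    if overlap:
        raise ValueError(f"forecasts overlap on {len(overlap)} date(s)")
    return dict(sorted({**before, **after}.items()))


# Forecast horizon: January 1, 2022 to February 28, 2022 (59 days).
horizon = [date(2022, 1, 1) + timedelta(days=i) for i in range(59)]

# Placeholder outputs of the first and second data processing models.
before_forecast = {d: 12.0 for d in horizon if d.day < 20}
after_forecast = {d: 35.0 for d in horizon if d.day >= 20}

first_forecast_result = splice_forecasts(before_forecast, after_forecast)
```

Sorting by date restores the calendar order, so the spliced result alternates between the two models' outputs within each month.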

In this embodiment, the quantitative abrupt point is a point at which the data after this point is significantly larger than the data before this point. Since different data parameters are applicable to different inspection methods, selecting different inspection methods according to different parameters can improve the accuracy of the inspection. The data, in an optional embodiment, as shown in Figure 4, includes:

Step S401: When analysis shows that the time series data to be processed satisfies the normality assumption, use the Buishand statistical method to determine whether the time series data to be processed has a quantity mutation point at the set time point within the historical target time period;

Step S402: When analysis shows that the time series data to be processed does not satisfy the normality assumption, use the Pettitt test method to determine whether the time series data to be processed has a quantity mutation point at the set time point within the historical target time period.

For the Buishand statistical method, the specific steps are as follows:

The time series data to be processed is defined as a sequence X = {x_1, x_2, ..., x_k, ..., x_n}, where x_k is the item demand data at the set time point k. If X satisfies the normality assumption, the item demand data at each time point in sequence X can be expressed as:

$$x_i=\begin{cases}\mu+\varepsilon_i, & i=1,\dots,k\\ \mu+\Delta+\varepsilon_i, & i=k+1,\dots,n\end{cases}$$

Among them, ε_i represents the random error at time point i, μ represents the mean value, k represents the set time point, Δ represents the change value, x_i represents the item demand data at time point i, and 1, ..., n represent the n time points in the historical target time period. If x_k is the value of the quantity mutation point at the set time point k, only the change value Δ > 0 needs to be tested. To test Δ > 0, the summation value S_m before the mutation point of the time series data to be processed is constructed as follows:

$$S_m=\sum_{i=1}^{m}\left(x_i-\bar{x}\right),\qquad m=1,\dots,n$$

where

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$

is the mean value of the time series data X to be processed, and S_0 = 0. For homogeneous data (that is, the distribution of the time series data X to be processed does not change significantly over the time range), the statistic S_k at the set time point k is expected to fluctuate around 0, because the time series data X to be processed shows no systematic deviation from its mean.

If Δ > 0, most S_m < 0 are expected before the quantity mutation point and most S_m > 0 after it. Since the embodiment of the present disclosure only needs to test for a quantity mutation point at the set time point k, only the statistic S_k at the set time point k is tested. If the position of the mutation point cannot be determined in advance, k = arg min_m S_m is chosen. To complete the test, the statistic is rescaled, that is, the statistic Q_k is constructed as follows:

$$Q_k=\frac{\left|S_k\right|}{\sigma\sqrt{n}}$$

where

$$\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}$$

indicates the variance of the time series data to be processed, and n is the data volume of the time series data to be processed. The summation value S_m before the quantity mutation point is tested through Q_k to determine the significance value.

In practical application, if the set time point is the 10th day before the end of the month, historical samples of equal length (10 days) are taken, and the critical Q value for such samples at the 95% confidence level is 1.22. During the judgment, if the obtained statistic Q_k > 1.22, a quantity mutation point is considered to exist at the set time point.
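As a minimal sketch of the check described above, the following computes a rescaled Buishand-style statistic at a given set time point and compares it with the 1.22 threshold from the example. The function names are hypothetical, and two details are assumptions rather than statements of the disclosure: the absolute value of the partial sum is used, and the standard deviation is the population standard deviation of the whole series.

```python
import math


def buishand_q(x, k):
    """Rescaled partial-sum statistic at index k (the first k values precede the point)."""
    n = len(x)
    mean = sum(x) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if sigma == 0:
        return 0.0  # a constant series has no change to detect
    s_k = sum(v - mean for v in x[:k])  # adjusted partial sum up to the set point
    return abs(s_k) / (sigma * math.sqrt(n))


def has_mutation_point(x, k, critical_value=1.22):
    """True when the statistic exceeds the critical value (1.22 in the example above)."""
    return buishand_q(x, k) > critical_value


# Hypothetical demand: 10 units for ten days, then 50 units for ten days.
demand = [10.0] * 10 + [50.0] * 10
q = buishand_q(demand, 10)
flagged = has_mutation_point(demand, 10)
```

For this series the statistic exceeds the threshold, so the set time point is flagged; a constant series yields 0 and is not flagged.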

For the Pettitt test method, the specific steps are as follows:

The time series data to be processed is defined as a sequence X = {x_1, x_2, ..., x_k, ..., x_n}, and the statistic U_k is constructed as:

$$U_k=\sum_{i=1}^{k}\sum_{j=k+1}^{n}\operatorname{sgn}\left(x_j-x_i\right)$$

Among them, U_k represents the summation value of the sign function before the quantity mutation point, k represents the set time point, x_i and x_j represent the item demand data at time points i and j, and sgn is the sign function. Similar to the Buishand method, if the position of the mutation point cannot be determined in advance, k is selected from the extreme value of U_m over all m. Since only a significant increase in item demand at the set time point is considered, and a significant decrease is not, U_k > 0 is expected to hold significantly. According to the approximate formula proposed by Pettitt, the p-value of the one-sided test is calculated as follows:

$$p\approx\exp\left(\frac{-6\,U_k^{2}}{n^{3}+n^{2}}\right)$$

Among them, exp represents the exponential function with the natural constant e as its base, and the confidence level is 95%. If p < 0.05, a quantity mutation point exists at the set time point k.
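A minimal sketch of a one-sided Pettitt-style check at a set time point is given below. The function names are hypothetical. The sign convention (later value minus earlier value, so that an increase gives U_k > 0) is an assumption chosen to match the expectation stated above, and the p-value uses Pettitt's approximation exp(-6 U_k^2 / (n^3 + n^2)).

```python
import math


def sign(value):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (value > 0) - (value < 0)


def pettitt_one_sided(x, k):
    """Return (U_k, approximate one-sided p-value) for a split after the first k values."""
    n = len(x)
    u_k = sum(sign(x[j] - x[i]) for i in range(k) for j in range(k, n))
    p = math.exp(-6.0 * u_k ** 2 / (n ** 3 + n ** 2))
    return u_k, p


def has_upward_mutation(x, k, alpha=0.05):
    """Flag only significant increases: U_k must be positive and p below alpha."""
    u_k, p = pettitt_one_sided(x, k)
    return u_k > 0 and p < alpha


# Hypothetical demand: 10 units for ten days, then 50 units for ten days.
demand = [10.0] * 10 + [50.0] * 10
u, p_value = pettitt_one_sided(demand, 10)
flagged = has_upward_mutation(demand, 10)
```

Because the statistic only compares ranks through the sign function, it does not rely on the normality assumption, which is why it is paired with non-normal data in step S402.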

In an optional embodiment, as shown in FIG. 5, before performing the method in one or more embodiments of the present disclosure, a testing process is also included, specifically including:

Step S501: From all the item time series data of the test case, filter out the target item time series data with quantitative mutation points;

Step S502: Segment, from the time-series data of the target item, the second time-series data for training located after the set time point, wherein the data at the multiple time points included in the second time-series data for training is equal to or greater than the value of the quantity mutation point included in the time-series data of the target item;

Step S503: Clustering the second time series data for training to obtain one or more feature time series clusters.

Through the above steps S501 to S503, determining the target characteristic time-series cluster to which the second time-series data belongs in step S103 includes: selecting the target characteristic time-series cluster for the second time-series data from one or more characteristic time-series clusters.

Since a prior-art FFORMA model trained on all item time series data cannot achieve accurate prediction results, the embodiment of this application first screens out the target item time series data with quantity mutation points and then uses the second time series data for training, segmented from the target item time series data, as model input, which can significantly improve prediction accuracy. In addition, in the embodiment of the present application, the second time series data for training is clustered to obtain different feature time series clusters, and the feature time series clusters are used as the input of the model. This further divides the target item time series data with quantity mutation points: time series with different features are fed to the model cluster by cluster, making the prediction more accurate.

Exemplarily, the test case contains 100 item time series, 50 of which have quantity mutation points; these 50 time series are the target item time series data. Assuming that each time series contains only one month of commodity sales, the second time series data for training can be segmented from each target item time series at its quantity mutation point, that is, 50 second time series data for training are obtained. Clustering these 50 second time series data for training then yields the feature time series clusters.

Wherein, for step S501, an optional embodiment specifically includes performing the following for the item time series data of each warehouse item identifier in the test case:

Step 1: Determine the data segmentation point;

Step 2: Determine whether the data segmentation point is a quantity mutation point in the item time-series data of the warehouse item identifier; if so, determine that the item time-series data of the warehouse item identifier is the target item time-series data.

Here, the method for determining whether the point is a quantity mutation point is the same as in the aforementioned step S301 to step S303, and the data segmentation point can be obtained through human experience or through data pivoting.
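The screening of step S501 together with the segmentation of step S502 can be sketched as follows. The identifiers and function names are hypothetical, and `placeholder_detector` is only a stand-in (a comparison of means) so that the example runs; in the method described here, that role is played by the Buishand or Pettitt check.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def placeholder_detector(series, split_point, ratio=2.0):
    """Stand-in check (an assumption): mean after the split exceeds ratio times the mean before."""
    return mean(series[split_point:]) > ratio * mean(series[:split_point])


def select_training_segments(all_series, split_point, detector):
    """Keep series whose split point is a mutation point; return their data after that point."""
    segments = {}
    for item_id, series in all_series.items():
        if detector(series, split_point):
            segments[item_id] = series[split_point:]
    return segments


# Hypothetical test case keyed by warehouse item identifier.
test_case = {
    "warehouse-1/item-A": [5.0] * 19 + [30.0] * 11,  # jumps at the split point
    "warehouse-1/item-B": [5.0] * 30,                # no jump
}
second_training_data = select_training_segments(test_case, 19, placeholder_detector)
```

Only `warehouse-1/item-A` is kept, and its 11 values after the segmentation point form one second time-series data for training.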

For the clustering process in step S503, in an optional embodiment, as shown in Figure 6, including:

Step S601: determining a sequence translation function and a translation distance for a preset distance function;

Step S602: Clustering the second time series data for training by using the DBSCAN algorithm together with the distance function for which the sequence translation function and the translation distance have been determined.

In the embodiment of the present application, the purpose of clustering is to group the curves of item demand changes by similarity. In time series clustering, because sequences may differ in length, be shifted relative to one another, and differ in data scale, the Euclidean distance used in the prior art can hardly describe the similarity between time series. In 2015, John Paparrizos proposed K-Shape, a time series clustering algorithm based on a shape similarity measure. This algorithm describes the distance between time series better and clusters time series with an idea similar to K-means. However, experiments show that this method is computationally expensive. More importantly, because sufficient prior knowledge about the time series is rarely available, the number of categories k is difficult to choose and often has to be determined through repeated experiments. Therefore, in an optional embodiment, the present disclosure adopts the DBSCAN algorithm based on the shape-based distance (SBD) for clustering.

Specifically, the sequence translation function X_(s) is determined as follows:

$$X_{(s)}=\begin{cases}(\underbrace{0,\dots,0}_{|s|},\,x_1,x_2,\dots,x_{n-s}), & s\ge 0\\ (x_{1-s},\dots,x_{n-1},x_n,\,\underbrace{0,\dots,0}_{|s|}), & s<0\end{cases}$$

Among them, s represents the amount of translation, and x_1, ..., x_n represent the item demand at time points 1 to n in the sequence X to be translated. Based on a translation s, the cross-correlation CC_s(X, Y) between two time series X and Y can be obtained:

$$CC_s(X,Y)=\begin{cases}\sum_{i=1}^{n-s}x_i\,y_{s+i}, & s\ge 0\\ CC_{-s}(Y,X), & s<0\end{cases},\qquad s=\omega-n,\ \omega\in\{1,2,\dots,2n-1\}$$

Among them, y_i represents the item demand at time point i in the time series Y to be translated, y_{s+i} represents the item demand at time point s+i in the time series Y to be translated, and ω is an integer constant that indexes the candidate translations. When CC_s(X, Y) takes its maximum value, the translation amount s is considered to be at the optimal position. At this time, the cross-correlation CC_s(X, Y) is substituted into formula (8) to obtain the normalized quantity NCC(X, Y).

$$NCC(X,Y)=\max_{s}\frac{CC_s(X,Y)}{\|X\|\cdot\|Y\|}\qquad\text{Formula (8)}$$

Wherein, ‖X‖ indicates the norm of time series X, and ‖Y‖ indicates the norm of time series Y. Substituting the obtained normalized quantity NCC(X, Y) into the following formula (9), the translation distance SBD is obtained.

SBD(X,Y)＝1-NCC(X,Y) Formula (9)
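A minimal pure-Python sketch of the shape-based distance is given below. It assumes equal-length series and zero padding when one series is slid against the other, and it searches every translation from -(n-1) to n-1; the function names are hypothetical.

```python
import math


def cross_correlation(x, y, s):
    """Inner product of x and y after sliding one against the other by s positions."""
    n = len(x)
    if s >= 0:
        return sum(x[i] * y[i + s] for i in range(n - s))
    return sum(x[i - s] * y[i] for i in range(n + s))


def sbd(x, y):
    """Shape-based distance: 1 minus the maximum normalized cross-correlation."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes series of equal length")
    norm = math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in y))
    if norm == 0:
        return 1.0  # an all-zero series carries no shape information
    n = len(x)
    ncc = max(cross_correlation(x, y, s) for s in range(-(n - 1), n)) / norm
    return 1.0 - ncc


# Two hypothetical demand curves with the same shape, shifted by one time point.
peak_early = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
peak_late = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0]
flat = [1.0] * 6

d_shifted = sbd(peak_early, peak_late)
d_flat = sbd(peak_early, flat)
```

The two shifted peaks are at distance approximately 0 because the best translation aligns them exactly, whereas the flat curve stays clearly farther away. This is the property that makes the distance robust to shifts of the time points.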

Further, the two hyperparameters involved in the DBSCAN algorithm are: the neighborhood radius ε and the threshold MinPts of the number of samples in the neighborhood, which can be determined using the elbow point method and the silhouette score.

With the distance function defined above, and with its sequence translation function and translation distance determined, the second time series data for training can be clustered more accurately according to the similarity of the curve shapes of the time series data. This also resolves the errors that arise when curves have similar shapes but are shifted in time.
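The clustering step can be sketched with a small DBSCAN implementation that works on a precomputed distance matrix. The entries of the matrix are assumed to be shape-based distances computed beforehand; the matrix values, the neighborhood radius and MinPts below are hypothetical.

```python
def dbscan(distance, eps, min_pts):
    """Cluster points 0..n-1 from an n x n distance matrix; label -1 marks noise."""
    n = len(distance)
    labels = [None] * n
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if distance[p][q] <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise reached from a core point joins as border
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if distance[q][r] <= eps]
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)  # q is a core point, so keep expanding
    return labels


# Hypothetical precomputed distances: two tight groups and one isolated series.
groups = [0, 0, 0, 1, 1, 2]
matrix = [
    [0.0 if i == j else (0.05 if groups[i] == groups[j] else 0.9)
     for j in range(len(groups))]
    for i in range(len(groups))
]
labels = dbscan(matrix, eps=0.1, min_pts=2)
```

Unlike K-means style methods, the number of clusters is not supplied in advance: it follows from the neighborhood radius and MinPts, and series that match no dense group are left as noise.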

In an optional embodiment, when there are multiple characteristic time-series clusters, as shown in FIG. 7 , the step of selecting the target characteristic time-series cluster for the second time-series data may further include:

Step S701: matching the second time series data with each characteristic time series cluster;

Step S702: According to the matching result, a target feature time-series cluster is filtered out from multiple feature time-series clusters.

Wherein, how the matching result is used may be set according to the actual application scenario and requirements. In an optional embodiment, the characteristic timing cluster with the highest matching degree is selected as the target characteristic timing cluster. Further, in the embodiment of the present application, the matching degree may be expressed as the degree of similarity between the second time series data and each characteristic time series cluster: the higher the similarity, the higher the matching degree. For example, if clustering yields three characteristic time series clusters, namely time series cluster 1, time series cluster 2 and time series cluster 3, and the matching degrees of the second time series data with the three clusters are 0.9, 0.6 and 0.1 respectively, then time series cluster 1 is the target characteristic time series cluster. When several clusters have the same matching degree, one of them can be selected at random.
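The selection rule in this example, including the random choice among equal matching degrees, can be sketched as follows. The function name and the dictionary layout are assumptions; how the matching degree itself is computed is left open, as in the text above.

```python
import random


def select_target_cluster(matching_degrees, rng=random):
    """Return the cluster with the highest matching degree; ties are broken at random."""
    best = max(matching_degrees.values())
    candidates = [name for name, degree in matching_degrees.items() if degree == best]
    return rng.choice(candidates)


degrees = {"time series cluster 1": 0.9, "time series cluster 2": 0.6, "time series cluster 3": 0.1}
target = select_target_cluster(degrees)
```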

In the embodiment of the present application, in addition to the second time-series data for training, the first time-series data for training can also be segmented from the time-series data of the target item. In an optional embodiment, as shown in FIG. 8, the method further includes:

Step S801: Segment, from the time-series data of the target item, the first time-series data for training located before the set time point, wherein the data at the multiple time points included in the first time-series data for training is smaller than the value of the quantity mutation point included in the time-series data of the target item;

Step S802: Clustering the time-series data of items other than the time-series data of the target item in the test case and the first time-series data for training;

Step S803: Use the clustering result to train the preset first type of model to be trained to obtain the first data processing model, wherein the first type of model to be trained is configured based on one or more base models, and the base models include one or more of ARIMA, ETS, Croston, simple moving average, FBProphet, Holt-Winters, and first-order exponential smoothing.

Exemplarily, in a sales forecasting scenario, the test case contains 10 item time series, 2 of which are target item time series data, that is, time series data with quantity mutation points. The set time point corresponding to the quantity mutation point is the 20th of each month, the target item time series data covers February 1, 2021 to February 28, 2021 and March 1, 2020 to March 31, 2020, and the remaining 8 time series have no quantity mutation point. The first time-series data for training segmented from the two target item time series then covers February 1, 2021 to February 19, 2021 and March 1, 2020 to March 19, 2020. This data is clustered together with the remaining 8 complete time series, and the preset first type of model to be trained is trained on the clustering result to obtain the first data processing model.

In practical application scenarios, the first time-series data usually leads to the same or similar prediction results, so it does not need to be divided into multiple different time-series clusters like the second time-series data; all the first time-series data only needs to be clustered once. This simplifies the clustering process, reduces the amount of data to be processed, and improves data processing efficiency without affecting the prediction results. The specific clustering method is the same as in step S601 and step S602.

In an optional embodiment, for each characteristic time-series cluster, the time-series data corresponding to the characteristic time-series cluster is used to train the preset second type of model to be trained to obtain a second data processing model that matches the characteristic time-series cluster, wherein the second type of model to be trained is configured based on one or more base models, and the second type of model to be trained is different from the first type of model to be trained. Because the time series analyzed by the second type of model to be trained are more complex than those analyzed by the first type, relatively more base models are configured for it, and it is usually obtained by fusing multiple base models.

In the process of determining the target characteristic time-series cluster to which the second time-series data belongs, in an optional embodiment, as shown in FIG. 9, the step of selecting the target characteristic time-series cluster for the second time-series data also includes:

Step S901: performing end-to-end splicing on the second time series data;

Step S902: Match the spliced results with one or more feature time-series clusters, and use the feature time-series cluster with the highest matching value as the target feature time-series cluster.

As an example, assume that a certain time series contains the changes in item demand from March 1, 2021 to May 31, 2021, and that the set time point at which the quantity mutation point is located is 10 days before the end of the month. The second time series data then covers March 21, 2021 to March 31, 2021, April 20, 2021 to April 30, 2021, and May 21, 2021 to May 31, 2021, and the data of the remaining time periods is the first time series data. End-to-end splicing joins these three segments in chronological order into a new, longer time series. Since high-quantity time series data usually shares common characteristics, feeding each high-quantity segment into the model separately usually gives similar or identical results. End-to-end splicing therefore avoids processing a large number of similar data: the spliced time series only needs to be input to the model once to obtain an accurate prediction result.
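The end-to-end splicing in this example can be sketched as follows. The function name and the demand values are hypothetical; the set time point is taken as the day that lies 10 days before the last day of each month, which reproduces the three segments listed above.

```python
import calendar
from datetime import date, timedelta


def splice_high_periods(series, days_before_month_end=10):
    """Concatenate, in date order, the values on or after the set time point of each month."""
    spliced = []
    for day in sorted(series):
        last_day = calendar.monthrange(day.year, day.month)[1]
        if day.day >= last_day - days_before_month_end:
            spliced.append(series[day])
    return spliced


# Hypothetical series from March 1 to May 31, 2021; each value encodes its own date
# as month * 100 + day so that the order of the spliced result is easy to inspect.
start, end = date(2021, 3, 1), date(2021, 5, 31)
series = {}
for offset in range((end - start).days + 1):
    day = start + timedelta(days=offset)
    series[day] = day.month * 100 + day.day

spliced = splice_high_periods(series)
```

The result contains 33 values: March 21 to 31, followed directly by April 20 to 30 and May 21 to 31.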

In the embodiment of this application, clustering similar time-series data and training on each group allows the relevant features of each time-series cluster to be extracted better and reduces the model bias and the optimization difficulty. However, the amount of data in each group is reduced to a certain extent, so the model is prone to overfitting. Therefore, in an optional embodiment, overfitting can be reduced by introducing data augmentation. As shown in Figure 10, the step of using the time series data corresponding to the feature time series cluster to train the preset second type of model to be trained also includes:

Step S1001: clipping the second time-series data for training included in the feature time-series cluster, and splicing the clipped results to generate new second time-series data for training;

Step S1002: Using the second time-series data for training included in the feature time-series cluster and the new second time-series data for training, train a preset second type of model to be trained.

As shown in Figure 11, the clipping in step S1001 may specifically include:

Step S1001-1: set the length threshold L, the sliding length d, and the cutting length s;

Step S1001-2: Select the time-series data X^l whose length exceeds L, and leave the time-series data that does not exceed the threshold unprocessed;

Step S1001-3: Perform sliding-window interception with a window length of s on the time series in X^l, where each sliding window corresponds to a new sample;

Step S1001-4: Combine the intercepted data with the time series not exceeding the threshold as a new sample set.

Through the above clipping steps, the sample set can be expanded on the basis of the existing time series data, so that training the preset second type of model to be trained yields a more accurate second data processing model.
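Steps S1001-1 to S1001-4 can be sketched as a sliding-window augmentation. The function and parameter names are hypothetical; `length_threshold`, `stride` and `window` correspond to L, d and s, and the sketch assumes that the window is not longer than the threshold so that every long series yields at least one sample.

```python
def augment(series_list, length_threshold, stride, window):
    """Cut series longer than the threshold into windows; keep shorter series unchanged."""
    if window > length_threshold:
        raise ValueError("this sketch assumes window <= length_threshold")
    samples = []
    for series in series_list:
        if len(series) > length_threshold:
            # Slide a window of fixed length over the series, moving by the stride each time.
            for start in range(0, len(series) - window + 1, stride):
                samples.append(series[start:start + window])
        else:
            samples.append(series)
    return samples


long_series = list(range(10))   # longer than the threshold, so it is cut
short_series = [1, 2, 3, 4]     # not longer than the threshold, so it is kept as is
samples = augment([long_series, short_series], length_threshold=6, stride=2, window=6)
```

The long series produces three overlapping windows and the short series is kept unchanged, so the sample set grows from two series to four samples.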

Usually, the first type of model to be trained is trained on the clustering result of the first time-series data for training. It only needs to select the optimal base model from the many base models to achieve the training purpose, without a complicated and cumbersome model, which effectively improves efficiency. The second type of model to be trained is usually a fusion model, that is, a complex model containing multiple base models. Different types of base models can be selected for different feature time series clusters to obtain the optimal trained model.
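Selecting the optimal base model can be sketched as follows, using two of the listed base models (simple moving average and first-order exponential smoothing). The selection criterion is an assumption: the mean absolute error on the last few points of the series, which are held out from fitting. All function names and parameter values are hypothetical.

```python
def moving_average_forecast(history, horizon, window=3):
    """Simple moving average: repeat the mean of the last `window` values."""
    recent = history[-window:]
    return [sum(recent) / len(recent)] * horizon


def exponential_smoothing_forecast(history, horizon, alpha=0.5):
    """First-order exponential smoothing: repeat the final smoothed level."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * horizon


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def select_base_model(series, candidates, holdout):
    """Fit on all but the last `holdout` points and pick the model with the lowest error."""
    train, test = series[:-holdout], series[-holdout:]
    errors = {
        name: mean_absolute_error(test, forecast(train, holdout))
        for name, forecast in candidates.items()
    }
    return min(errors, key=errors.get), errors


candidates = {
    "moving_average": moving_average_forecast,
    "exponential_smoothing": exponential_smoothing_forecast,
}
best, errors = select_base_model([float(v) for v in range(1, 11)], candidates, holdout=2)
```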

Furthermore, for multiple base models, a weight can be set for each base model according to the characteristics of the different time series clusters, and the weights can be continuously optimized during training so that only the base models with higher weights are retained. Optimizing the base models in this way can also reduce overfitting to the time series data.
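A weighted fusion of base model forecasts can be sketched as follows. The training procedure that produces the weights is not reproduced here; the scores are assumed to come from it, and the softmax normalization, the pruning threshold and all names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def prune(weights, threshold):
    """Drop base models whose weight is below the threshold and renormalize the rest."""
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    if total == 0:
        return list(weights)  # nothing survives the threshold, so keep the original weights
    return [w / total for w in kept]


def fuse(forecasts, weights):
    """Weighted sum of the base model forecasts at each forecast step."""
    horizon = len(forecasts[0])
    return [sum(w * f[t] for w, f in zip(weights, forecasts)) for t in range(horizon)]


equal_weights = softmax([0.0, 0.0])
fused = fuse([[10.0, 10.0], [20.0, 30.0]], equal_weights)
pruned = prune([0.7, 0.25, 0.05], threshold=0.1)
```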

In an optional embodiment, the method of the embodiment of the present disclosure can be used for product sales forecasting. FIG. 12 shows a schematic diagram of the overall process for predicting sales based on time series data, which specifically includes:

Step 1: Determine whether there is a sudden change point in the time series data, and confirm whether there is a high sales phenomenon in the time series data according to the judgment result of the sudden change point;

If there is a phenomenon of high sales, go to step 2; if there is no phenomenon of high sales, go to step 7;

Step 2: Carry out time-series division of the time-series data to obtain high sales periods (second time-series data) and non-high-sales periods (first time-series data);

Further, step 3 to step 5 are performed for the time series data during the period of high sales, and the specific steps are as follows:

Step 3: Match the time-series data of the high-sales period with the characteristic time-series clusters in the clustering results to obtain the target characteristic time-series clusters;

Step 4: Perform time-series splicing of the high-sales time-series data in the target feature time-series cluster to obtain the splicing result of the high-sales time-series data;

Step 5: Take the splicing result of the high-sales time series data as the input of the second data processing model, and output the target sales volume;

After Step 2, Step 6 and Step 7 are performed for the time series data of the non-high-sales period, as follows:

Step 6: Splice the time series of the non-high-sales period to obtain the splicing result of the non-high-sales time series data;

Step 7: Take the splicing result of the non-high-sales time series data, or all the current time series data, as the input of the first data processing model, and output the target sales volume.

Through the above steps, time-series data with and without the high-sales phenomenon can be processed separately, and, for time-series data with the high-sales phenomenon, the high-sales period and the non-high-sales period can also be processed separately, which improves the accuracy of data processing.
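The overall flow can be sketched as a small dispatcher. It follows the assignment used in the method itself, in which the first data processing model handles data without the high-sales phenomenon and a cluster-matched second data processing model handles the high-sales period. Every callable below is a hypothetical stand-in, and the sketch handles a single set time point given as a list index.

```python
def forecast_sales(series, set_index, has_mutation, first_model, second_models, match_cluster):
    """Route one series through the flow and return the spliced forecast."""
    if not has_mutation(series, set_index):
        # No high-sales phenomenon: the whole series goes to the first model.
        return first_model(series)
    non_high, high = series[:set_index], series[set_index:]
    cluster = match_cluster(high)
    # Process the two periods separately, then splice the results in time order.
    return first_model(non_high) + second_models[cluster](high)


# Hypothetical stand-ins so that the sketch runs end to end.
def mean_jump(series, k):
    """Placeholder detector: mean after index k is more than twice the mean before it."""
    return sum(series[k:]) / len(series[k:]) > 2 * sum(series[:k]) / len(series[:k])


def repeat_mean(segment):
    """Placeholder model: forecast the segment mean for every time point of the segment."""
    return [sum(segment) / len(segment)] * len(segment)


second_models = {"cluster 1": repeat_mean}
spiky = [5.0] * 19 + [40.0] * 11
flat = [5.0] * 30

spiky_forecast = forecast_sales(spiky, 19, mean_jump, repeat_mean, second_models, lambda seg: "cluster 1")
flat_forecast = forecast_sales(flat, 19, mean_jump, repeat_mean, second_models, lambda seg: "cluster 1")
```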

In the data processing method of the embodiment of the present disclosure, for the case where the time-series data to be processed has a quantity mutation point at a set time point within the historical target time period, the time-series data to be processed is segmented into the first time-series data before the set time point and the second time-series data after the set time point, wherein the data at the multiple time points included in the first time-series data is smaller than the value of the quantity mutation point, and the data at the multiple time points included in the second time-series data is equal to or greater than the value of the quantity mutation point. The first time-series data and the second time-series data are then processed by the first data processing model and the second data processing model respectively, that is, different models process the data of different stages, so that the processing results reflect the item demand at the different stages more truthfully. Finally, according to the first processing result of the first data processing model and the processing result of the second data processing model, the first forecast result of the item demand situation in the forecast time series is determined, so as to set the item inventory in the forecast time series based on the first forecast result. In this way, different time series data are processed separately by different processing models, and the accuracy of the processing results is improved.

According to a second aspect of one or more embodiments of the present disclosure, a data processing apparatus applied to a server is provided.

FIG. 13 is a schematic diagram of main modules of a data processing apparatus 1300 according to the second aspect of one or more embodiments of the present disclosure. As shown in Figure 13, including:

A determining module 1301, configured to determine the time series data to be processed, wherein the time series data to be processed indicates the change of demand for items within the historical target time period;

The prediction module 1302 is configured to segment, from the time-series data to be processed, the first time-series data located before the set time point and the second time-series data located after the set time point; determine the target characteristic time-series cluster to which the second time-series data belongs, process the first time-series data by using the preset first data processing model, and process the second time-series data by using a second data processing model matching the target characteristic time-series cluster, wherein the quantity mutation point is determined through big data statistics; and determine, according to the first processing result of the first data processing model and the processing result of the second data processing model, the first forecast result of the item demand situation in the forecast time series, so as to set the item inventory in the forecast time series based on the first forecast result.

In an optional embodiment, the prediction module 1302 is further configured to, after the time series data to be processed is determined, for the case where the time series data to be processed has no quantity mutation point at the set time point within the historical target time period, process the time series data to be processed by using the preset first data processing model; and determine, according to the second processing result of the first data processing model, a second forecast result of the item demand situation in the forecast time series, so as to set the item inventory in the forecast time series based on the second forecast result.

In an optional embodiment, the first processing result of the first data processing model includes: a forecast result of the item demand situation before a set time point in the forecast time series;

The processing result of the second data processing model includes: the prediction result of the item demand situation after the set time point in the prediction time series;

The prediction module 1302 is further configured to splice the forecast result of the item demand situation before the set time point in the forecast time series with the forecast result of the item demand situation after the set time point in the forecast time series to obtain the first prediction result.

In an optional embodiment, the device further includes an analysis module, configured to use the Buishand statistical method to determine whether the time-series data to be processed has a quantity mutation point at a set time point within the historical target time period.

In an optional embodiment, the prediction module 1302 is also used to screen out the target item time series data with quantity mutation points from all the item time series data of the test case;

The device further includes a clustering module, configured to segment, from the time-series data of the target item, the second time-series data for training located after the set time point, wherein the data at the multiple time points included in the second time-series data for training is equal to or greater than the value of the quantity mutation point included in the target item time-series data; and cluster the second time-series data for training to obtain one or more characteristic time-series clusters;

The prediction module 1302 is further configured to select a target characteristic time-series cluster for the second time-series data from one or more of the characteristic time-series clusters.

In an optional embodiment, the prediction module 1302 is further configured to, when there are multiple characteristic time-series clusters, match the second time-series data with each of the characteristic time-series clusters; and, according to the matching result, screen out a target characteristic time-series cluster from the plurality of characteristic time-series clusters.

In an optional embodiment, the clustering module is further configured to segment, from the time-series data of the target item, the first time-series data for training located before the set time point, wherein the data at the multiple time points included in the first time-series data for training is smaller than the value of the quantity mutation point included in the target item time-series data; cluster the item time-series data in the test case other than the target item time-series data together with the first time-series data for training; and use the clustering result to train the preset first type of model to be trained to obtain a first data processing model, wherein the first type of model to be trained is configured based on one or more base models, and the base models include one or more of ARIMA, ETS, Croston, simple moving average, FBProphet, Holt-Winters, and first-order exponential smoothing.

In an optional embodiment, the clustering module is further configured to, for each characteristic time series cluster, use the time series data corresponding to the characteristic time series cluster to train the preset second type of model to be trained to obtain a second data processing model that matches the characteristic time series cluster, wherein the second type of model to be trained is configured based on one or more base models, and the second type of model to be trained is different from the first type of model to be trained.

In an optional embodiment, the screening out the target item time series data with quantitative mutation points from all the item time series data of the test case includes:

For the item time series data of each warehouse item identification in the test case, perform the following operations: determine the data segmentation point; in the case where the data division point is the quantity mutation point of the item time series data of the warehouse item identification, It is determined that the item time series data of the warehouse item identifier is the target item time series data.

In an optional embodiment, the clustering module is further configured to determine a sequence translation function and a translation distance for a preset distance function; and cluster the second time series data for training by using the DBSCAN algorithm together with the distance function for which the sequence translation function and the translation distance have been determined.

In an optional embodiment, the clustering module is further configured to perform end-to-end splicing on the second time-series data; match the spliced results with the one or more characteristic time-series clusters respectively, and The characteristic timing cluster with the highest matching value is used as the target characteristic timing cluster.

In an optional embodiment, the clustering module is further configured to trim the second time-series data for training included in the feature time-series cluster, and splice the trimmed results to generate new second time-series data for training; and use the second time-series data for training included in the feature time-series cluster and the new second time-series data for training to train a preset second type of model to be trained.

FIG. 14 shows an exemplary system architecture 1400 of a data processing method or a data processing apparatus to which one or more embodiments of the present disclosure can be applied.

As shown in FIG. 14, a system architecture 1400 may include terminal devices 1401, 1402, and 1403, a network 1404, and a server 1405. The network 1404 serves as a medium for providing communication links between the terminal devices 1401, 1402, and 1403 and the server 1405. The network 1404 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

Users can use the terminal devices 1401, 1402, and 1403 to interact with the server 1405 through the network 1404, for example to send task execution requests or to receive responses to requests. Various communication client applications may be installed on the terminal devices 1401, 1402, and 1403, such as online service applications, web browser applications, search applications, instant messaging tools, email clients, and social platform software.

The terminal devices 1401, 1402, and 1403 may be various electronic devices with display screens and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.

The server 1405 may be a server that provides various services, such as a background management server that supports online service requests sent by users through the terminal devices 1401, 1402, and 1403, or a server that processes data. The background management server can analyze and process the received time series data and other data, and feed back the processing result (such as the output forecast result of the item demand) to the terminal device.

It should be noted that the data processing method provided by the first aspect of the embodiments of the present disclosure is generally executed by the server 1405, and correspondingly, the data processing device provided by the second aspect of the embodiments of the present disclosure is generally provided in the server 1405.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 14 are only illustrative. There may be any number of terminal devices, networks, and servers, according to implementation needs.

Referring now to FIG. 15, it shows a schematic structural diagram of a computer system 1500 suitable for implementing a terminal device according to an embodiment of the present disclosure. The terminal device shown in FIG. 15 is only an example, and should not limit the functions and application scope of this embodiment of the present disclosure.

As shown in FIG. 15, a computer system 1500 includes a central processing unit (CPU) 1501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1502 or a program loaded from a storage section 1508 into a random access memory (RAM) 1503. The RAM 1503 also stores various programs and data required for the operation of the system 1500. The CPU 1501, ROM 1502, and RAM 1503 are connected to each other through a bus 1504. An input/output (I/O) interface 1505 is also connected to the bus 1504.

The following components are connected to the I/O interface 1505: an input section 1506 including a keyboard, a mouse, etc.; an output section 1507 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 1508 including a hard disk, etc.; and a communication section 1509 including a network interface card such as a LAN card, a modem, or the like. The communication section 1509 performs communication processing via a network such as the Internet. A drive 1510 is also connected to the I/O interface 1505 as needed. A removable medium 1511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1510 as necessary, so that a computer program read therefrom is installed into the storage section 1508 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, where the computer program includes program code for executing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 1509 and/or installed from the removable medium 1511. When this computer program is executed by the central processing unit (CPU) 1501, the above-mentioned functions defined in the system of the present disclosure are performed.

It should be noted that the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block in the block diagrams or flowchart illustrations, and combinations of blocks in the block diagrams or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.

The modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The described modules may also be set in a processor, for example, it may be described as: a processor includes a determination module and a prediction module. Wherein, the names of these modules do not constitute a limitation on the module itself under certain circumstances, for example, the determination module may also be described as "a module for determining the time series data to be processed".

As another aspect, the present disclosure also provides a computer-readable medium, which may be included in the device described in the above-mentioned embodiments, or may exist independently without being assembled into the device. The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the device, the device is caused to:

In the case where the time series data to be processed has a quantity mutation point at a set time point within the historical target time period, divide the time series data to be processed, according to the quantity mutation point, into first time series data and second time series data; determine the target characteristic time series cluster to which the second time series data belongs; and process the first time series data using a preset first data processing model and process the second time series data using a second data processing model matched with the target characteristic time series cluster.

The data processing method and device of the embodiments of the present disclosure address the case where the time series data to be processed has a quantity mutation point at a set time point within the historical target time period. The first time series data and the second time series data segmented from the time series data to be processed are processed by the first data processing model and the second data processing model, respectively. The first prediction result of the item demand in the forecast time series is then determined according to the first processing result of the first data processing model and the processing result of the second data processing model, so that the item inventory in the forecast time series can be set based on the first prediction result. In this way, different time series data are processed separately by different processing models, which can improve the accuracy of the processing results.
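The overall flow summarized above can be sketched as follows, purely as an illustration. The function and parameter names (`forecast_item_demand`, `choose_cluster`, `horizon_before`, `horizon_after`) are hypothetical, and `mean_model` is a stand-in for the trained first and second data processing models, which the disclosure builds from base models such as ARIMA or ETS.

```python
from typing import Callable, Dict, Hashable, List, Optional, Sequence

# A "model" here is any callable mapping (history, horizon) to a forecast.
Model = Callable[[Sequence[float], int], List[float]]


def forecast_item_demand(series: Sequence[float],
                         split_index: Optional[int],
                         first_model: Model,
                         second_models: Dict[Hashable, Model],
                         choose_cluster: Callable[[Sequence[float]], Hashable],
                         horizon_before: int,
                         horizon_after: int) -> List[float]:
    """Forecast demand, using two models when a quantity mutation point exists."""
    if split_index is None:
        # No quantity mutation point: the first model handles the whole series.
        return list(first_model(series, horizon_before + horizon_after))
    first, second = series[:split_index], series[split_index:]
    cluster_id = choose_cluster(second)  # target characteristic cluster
    before = first_model(first, horizon_before)
    after = second_models[cluster_id](second, horizon_after)
    # Concatenate the forecasts for before and after the set time point.
    return list(before) + list(after)


def mean_model(history: Sequence[float], horizon: int) -> List[float]:
    """Stand-in model: repeats the historical mean (for illustration only)."""
    level = sum(history) / len(history)
    return [level] * horizon
```

For a history of `[2, 2, 2, 10, 10]` split at index 3, the stand-in models forecast a level of 2 for the steps before the set time point and 10 for the steps after it, whereas a single model fitted to the whole history would forecast a blended level that fits neither segment.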

The specific implementation manners described above do not limit the protection scope of the present disclosure. It should be apparent to those skilled in the art that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims

A method of data processing, the method comprising:

Determining the time series data to be processed; wherein, the time series data to be processed indicates changes in demand for items within a historical target time period;

In the case where the time series data to be processed has a quantity mutation point at a set time point within the historical target time period, segmenting, from the time series data to be processed, first time series data located before the set time point and second time series data located after the set time point; determining the target characteristic time series cluster to which the second time series data belongs; and processing the first time series data using a preset first data processing model and processing the second time series data using a second data processing model matched with the target characteristic time series cluster; wherein the quantity mutation point is determined through big data statistics;

According to the first processing result of the first data processing model and the processing result of the second data processing model, determining the first prediction result of the item demand situation in the forecast time series, so as to set the item inventory in the forecast time series based on the first prediction result.
The method according to claim 1, wherein, after said determining the time series data to be processed, further comprising:

In the case where the time series data to be processed does not have a quantity mutation point at the set time point within the historical target time period, processing the time series data to be processed using the preset first data processing model;

According to the second processing result of the first data processing model, determine a second forecast result of the item demand situation in the forecast time series, so as to set the item inventory in the forecast time series based on the second forecast result.
The method according to claim 1, wherein,

The first processing result of the first data processing model includes: the forecast result of the item demand situation before the set time point in the forecast time series;

The processing result of the second data processing model includes: the forecast result of the item demand situation after the set time point in the forecast time series;

The determination of the first forecast result of the item demand situation in the forecast time series includes:

The first prediction result is obtained by concatenating the forecast result of the item demand situation before the set time point in the forecast time series with the forecast result of the item demand situation after the set time point in the forecast time series.
The method according to claim 2, wherein, after said determining the time series data to be processed, further comprising:

In the case where the time series data to be processed satisfies the normality assumption, using the Buishand statistical method to determine whether or not the time series data to be processed has a quantity mutation point at the set time point within the historical target time period;

In the case where the time series data to be processed does not satisfy the normality assumption, using the Pettitt test method to determine whether or not the time series data to be processed has a quantity mutation point at the set time point within the historical target time period.
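For illustration, the Pettitt test named above can be sketched with the standard library as follows. It uses the usual rank-based statistic U_t = Σ_{i≤t} Σ_{j>t} sgn(x_i − x_j), takes K = max |U_t|, and applies the commonly cited approximation p ≈ 2·exp(−6K² / (n³ + n²)). The function name and the choice to return a 0-based index are assumptions of this sketch. The normality check and the Buishand branch are not shown, and the significance threshold against which `p_value` is compared is left to the implementer.

```python
from math import exp
from typing import Sequence, Tuple


def pettitt_test(series: Sequence[float]) -> Tuple[int, float]:
    """Return (change_index, p_value) for the most likely change point.

    `change_index` is the 0-based index of the first observation of the
    second segment; `p_value` is the usual approximate significance.
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")

    def sign(value: float) -> int:
        return (value > 0) - (value < 0)

    best_k, best_t = -1, 1
    u = 0
    for t in range(1, n):  # t = number of observations in the first segment
        # Recurrence: U_t = U_{t-1} + sum_j sgn(x_t - x_j), with x_t 1-based.
        u += sum(sign(series[t - 1] - other) for other in series)
        if abs(u) > best_k:
            best_k, best_t = abs(u), t
    p_value = min(1.0, 2.0 * exp(-6.0 * best_k ** 2 / (n ** 3 + n ** 2)))
    return best_t, p_value
```

To decide whether a quantity mutation point exists at the set time point, one option is to require both a small `p_value` and a `change_index` that coincides with the set time point.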
The method of claim 1, further comprising:

From the time-series data of all items in the test case, screen out the time-series data of the target items with quantitative mutation points;

Segmenting, from the time series data of the target item, the second time series data for training located after the set time point, wherein the data of the multiple time points included in the second time series data for training is equal to or greater than the value of the quantity mutation point included in the time series data of the target item;

clustering the second time-series data for training to obtain one or more characteristic time-series clusters;

The determining the target feature time series cluster to which the second time series data belongs includes:

From one or more of the characteristic time-series clusters, select a target characteristic time-series cluster for the second time-series data.
The method according to claim 5, wherein the selecting the target characteristic time series cluster for the second time series data comprises:

For the case where there are multiple characteristic timing clusters,

matching the second time series data with each of the characteristic time series clusters;

According to the matching result, a target characteristic timing cluster is screened out from the plurality of characteristic timing clusters.
The method of claim 5, further comprising:

Segmenting, from the time series data of the target item, the first time series data for training located before the set time point, wherein the data of the multiple time points included in the first time series data for training is smaller than the value of the quantity mutation point included in the time series data of the target item;

Clustering the time series data of items other than the time series data of the target item in the test case and the first time series data for training;

Using the results of the clustering to train a preset first type of model to be trained to obtain the first data processing model, wherein the first type of model to be trained is configured based on one or more base models, and the base models include one or more of ARIMA, ETS, Croston, Simple Moving Average, FBProphet, Holt-Winters, and First-Order Exponential Smoothing.
The method of claim 7, further comprising:

For each characteristic time series cluster, using the time series data corresponding to the characteristic time series cluster to train a preset second type of model to be trained, so as to obtain a second data processing model matching the characteristic time series cluster, wherein the second type of model to be trained is configured based on one or more base models, and the second type of model to be trained is different from the first type of model to be trained.
The method according to claim 5, wherein screening out the time series data of the target items having quantity mutation points from the time series data of all items in the test case comprises:

For the item time series data of each warehouse item identifier in the test case, perform the following operations:

Determine the data split point;

In a case where the data splitting point is a quantity mutation point of the item time-series data of the warehouse item identifier, it is determined that the item time-series data of the warehouse item identifier is the target item time-series data.
The method according to claim 5, wherein said clustering the second time series data for training comprises:

Determining a sequence translation function and a translation distance for a preset distance function;

Clustering the second time series data for training using the DBSCAN algorithm together with the distance function for which the sequence translation function and the translation distance have been determined.
The method according to claim 5, wherein the selecting the target characteristic time series cluster for the second time series data comprises:

performing end-to-end splicing on the second time series data;

The spliced results are matched with the one or more characteristic time-series clusters respectively, and the characteristic time-series cluster with the highest matching value is used as the target characteristic time-series cluster.
The method according to claim 8, wherein said using the time series data corresponding to the characteristic time series cluster to train the preset second type of model to be trained comprises:

Clipping the second time-series data for training included in the feature time-series cluster, and splicing the clipped results to generate new second time-series data for training;

Using the second time-series data for training included in the feature time-series cluster and the new second time-series data for training, a preset second type of model to be trained is trained.
A data processing device, comprising:

A determining module, configured to determine time series data to be processed, wherein the time series data to be processed indicates changes in demand for items within a historical target time period;

A forecasting module, configured to: in the case where the time series data to be processed has a quantity mutation point at a set time point within the historical target time period, segment, from the time series data to be processed according to the quantity mutation point, first time series data located before the set time point and second time series data located after the set time point; determine the target characteristic time series cluster to which the second time series data belongs; process the first time series data using a preset first data processing model and process the second time series data using a second data processing model matched with the target characteristic time series cluster, wherein the quantity mutation point is determined through big data statistics;

and determine, according to the first processing result of the first data processing model and the processing result of the second data processing model, the first prediction result of the item demand situation in the forecast time series, so as to set the item inventory in the forecast time series based on the first prediction result.
A data processing device comprising: one or more processors;

a storage system for storing one or more programs,

When the one or more programs are executed by the one or more processors, the one or more processors are made to implement the method according to any one of claims 1-12.
A computer-readable medium, on which a computer program is stored, and when the program is executed by a processor, the method according to any one of claims 1-12 is realized.