CN114219545B

CN114219545B - Data processing method and device

Info

Publication number: CN114219545B
Application number: CN202210144038.0A
Authority: CN
Inventors: 陈家禹; 陈浪; 庄晓天; 吴盛楠
Original assignee: Beijing Jingdong Zhenshi Information Technology Co Ltd
Current assignee: Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date: 2022-02-17
Filing date: 2022-02-17
Publication date: 2022-07-05
Anticipated expiration: 2042-02-17
Also published as: WO2023155426A1; CN114219545A

Abstract

The invention discloses a data processing method and device, and relates to the technical field of intelligent logistics. A specific implementation of the method comprises: determining time series data to be processed; where the time series data to be processed has a quantity mutation point at a set time point within a historical target time period, dividing the time series data to be processed into first time series data located before the set time point and second time series data located after the set time point; determining a target feature time sequence cluster to which the second time series data belongs, processing the first time series data with a preset first data processing model, and processing the second time series data with a second data processing model matched with the target feature time sequence cluster; and determining a first prediction result of the item demand within the prediction time sequence according to the first processing result of the first data processing model and the processing result of the second data processing model. Because different time series data are processed by different processing models, processing accuracy is improved.

Description

Data processing method and device

Technical Field

The invention relates to the technical field of intelligent supply chains, in particular to a data processing method and device.

Background

In the retail industry, forecasting item demand is the foundation of supply chain management. The forecast provides input data for inventory adjustment and simulation, and its quality directly determines the effectiveness of the downstream models. The forecast data can also be presented to supply chain managers to support their decisions.

In the existing retail industry, a single FFORMA (Feature-based Forecast Model Averaging) model is often used as the data processing model for all time series data when predicting item demand. A single data processing model, however, cannot distinguish between different types of time series data, so the accuracy of the prediction result is low.

Disclosure of Invention

In view of this, embodiments of the present invention provide a data processing method and apparatus. Where the time series data to be processed has a quantity mutation point at a set time point within the historical target time period, the time series data to be processed is divided into first time series data located before the set time point and second time series data located after the set time point. The quantity mutation point is determined by big data statistics, so that the data at the time points included in the first time series data are smaller than the value of the quantity mutation point, and the data at the time points included in the second time series data are equal to or greater than the value of the quantity mutation point. The first time series data and the second time series data are then processed by a first data processing model and a second data processing model respectively; that is, data from different stages are processed by different data processing models. Finally, a first prediction result of the item demand within the prediction time sequence is determined according to the first processing result of the first data processing model and the processing result of the second data processing model, and the item inventory within the prediction time sequence is set based on the first prediction result. In this way, different time series data are processed by different processing models, and the accuracy of the processing result is improved.

To achieve the above object, according to a first aspect of embodiments of the present invention, a method of data processing is provided.

The data processing method of the embodiment of the invention comprises the following steps:

determining time sequence data to be processed; wherein the time series data to be processed indicates the change of the demand of the article in the historical target time period;

where the time series data to be processed has a quantity mutation point at a set time point within the historical target time period, dividing the time series data to be processed into first time series data before the set time point and second time series data after the set time point; determining a target feature time sequence cluster to which the second time series data belongs, processing the first time series data with a preset first data processing model, and processing the second time series data with a second data processing model matched with the target feature time sequence cluster; wherein the quantity mutation point is determined by big data statistics;

and determining a first prediction result of the demand condition of the item in the prediction time sequence according to the first processing result of the first data processing model and the processing result of the second data processing model, so as to set the inventory of the item in the prediction time sequence based on the first prediction result.

Optionally, after the determining the time series data to be processed, further comprising:

where the time series data to be processed has no quantity mutation point at the set time point within the historical target time period,

processing the time sequence data to be processed by utilizing the preset first data processing model;

and determining a second prediction result of the demand condition of the item in the prediction time sequence according to the second processing result of the first data processing model, so as to set the inventory of the item in the prediction time sequence based on the second prediction result.

Optionally, the first processing result of the first data processing model includes a prediction result of the item demand before the set time point within the prediction time sequence; the processing result of the second data processing model includes a prediction result of the item demand after the set time point within the prediction time sequence;

the determining a first prediction result of the item demand condition within the prediction time sequence comprises: and splicing the prediction result of the article demand condition before the set time point in the prediction time sequence with the prediction result of the article demand condition after the set time point in the prediction time sequence to obtain the first prediction result.

Optionally, after the determining the time series data to be processed, the method further comprises: where analysis shows that the time series data to be processed satisfies the normality assumption, determining by a Buishand statistical method whether the time series data to be processed has a quantity mutation point at a time point within the historical target time period;

and where analysis shows that the time series data to be processed does not satisfy the normality assumption, determining by a Pettitt test method whether the time series data to be processed has a quantity mutation point at a time point within the historical target time period.
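The Pettitt test named above is a rank-based, non-parametric change-point test, which is why it suits data that fail the normality assumption. The following is a minimal sketch in plain Python, not the patented implementation: the function name `pettitt_test` and the sample series are hypothetical, the p-value uses the common large-sample approximation, and the normality check and the Buishand branch are deliberately left out.

```python
import math


def pettitt_test(series):
    """Return (index, K, approx_p) for the Pettitt change-point test.

    index is the 0-based position of the last observation before the most
    likely change, K is max |U_t|, and approx_p is the usual approximation
    p ~ 2 * exp(-6 K^2 / (n^3 + n^2)).
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")

    def sign(v):
        return (v > 0) - (v < 0)

    best_k, best_t = 0, 0
    u = 0
    for t in range(n - 1):
        # Incremental form: U_t = U_{t-1} + sum_j sign(x_t - x_j).
        u += sum(sign(series[t] - x) for x in series)
        if abs(u) > best_k:
            best_k, best_t = abs(u), t
    p = min(1.0, 2.0 * math.exp(-6.0 * best_k ** 2 / (n ** 3 + n ** 2)))
    return best_t, best_k, p


# Hypothetical daily demand: low for ten days, then a jump.
index, k_stat, p_value = pettitt_test([5] * 10 + [30] * 10)
change_starts_at = index + 1  # first observation after the detected change
```

In this sketch the observation at `change_starts_at` would play the role of the quantity mutation point, provided `p_value` is below whatever significance level is chosen.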

Optionally, the method further comprises: screening target article time sequence data with a number of mutation points from all article time sequence data of the test cases; segmenting second time sequence data for training, which are located after the set time point, from the target article time sequence data, wherein the data of a plurality of time points included in the second time sequence data for training are equal to or larger than the numerical values of the number of mutation points included in the target article time sequence data;

clustering the second time sequence data for training to obtain one or more characteristic time sequence clusters;

the determining the target feature timing cluster to which the second timing data belongs includes: and selecting a target characteristic time sequence cluster for the second time sequence data from one or more characteristic time sequence clusters.

Optionally, the selecting the target feature timing cluster to which the second timing data belongs for the second timing data includes: matching the second time sequence data with each feature time sequence cluster under the condition that a plurality of feature time sequence clusters exist; and screening out a target characteristic time sequence cluster from the plurality of characteristic time sequence clusters according to a matching result.

Optionally, the method further comprises: segmenting first time sequence data for training, which are located before the set time point, from the time sequence data of the target object, wherein the data of a plurality of time points included in the first time sequence data for training are smaller than numerical values of number mutation points included in the time sequence data of the target object; clustering other article time sequence data except the target article time sequence data in the test case with the first time sequence data for training; and training a preset first type of model to be trained by utilizing the clustering result to obtain a first data processing model, wherein the first type of model to be trained is obtained based on one or more basic model configurations, and the basic model comprises one or more of ARIMA, ETS, Croston, simple moving average, FBProphet, Holt-Winters and first-order exponential smoothing.

Optionally, the method further comprises: and aiming at each kind of characteristic time sequence cluster, training a preset second type of model to be trained by utilizing the time sequence data corresponding to the characteristic time sequence cluster to obtain a second data processing model matched with the characteristic time sequence cluster, wherein the second type of model to be trained is obtained based on one or more base model configurations, and the second type of model to be trained is different from the first type of model to be trained.

Optionally, the screening out target item time series data with a number of mutation points from all item time series data of the test cases includes: for the item time sequence data of each warehouse item identifier in the test case, the following operations are executed: determining a data segmentation point; and determining the article time sequence data of the warehouse article identifier as the target article time sequence data under the condition that the data segmentation point is the quantity mutation point of the article time sequence data of the warehouse article identifier.

Optionally, the clustering the training second time series data includes: determining a sequence translation function and a translation distance for a preset distance function; and clustering the second time sequence data for training by using the distance function for determining the sequence translation function and the translation distance and the DBSCAN algorithm.
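To illustrate how a distance function with a sequence translation step can be combined with DBSCAN, the sketch below uses only the Python standard library. It rests on assumptions that the text does not fix: the translation function is taken to be a circular shift, the translation distance is a hypothetical `max_shift` parameter, the base distance is Euclidean, all sequences have equal length, and `eps` and `min_pts` are arbitrary.

```python
import math


def shifted_distance(a, b, max_shift=2):
    """Smallest Euclidean distance between a and circular shifts of b
    by at most max_shift positions (assumed translation function)."""
    n = len(a)
    best = math.inf
    for s in range(-max_shift, max_shift + 1):
        shifted = [b[(i - s) % n] for i in range(n)]
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, shifted)))
        best = min(best, d)
    return best


def dbscan(items, eps, min_pts, dist):
    """Plain DBSCAN; returns one label per item, with -1 meaning noise."""
    labels = [None] * len(items)

    def neighbours(i):
        return [j for j in range(len(items)) if dist(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:
                queue.extend(nearby)  # j is a core point, keep expanding
    return labels


# Hypothetical post-mutation segments: three shifted spikes, two plateaus, one outlier.
segments = [
    [0, 0, 0, 10, 0, 0],
    [0, 0, 0, 0, 10, 0],
    [0, 0, 10, 0, 0, 0],
    [10, 10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10, 11],
    [50, 0, 50, 0, 50, 0],
]
labels = dbscan(segments, eps=1.5, min_pts=2, dist=shifted_distance)
```

Because the distance tolerates small shifts, spikes that occur a day or two apart fall into the same feature time sequence cluster instead of being split by their timing.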

Optionally, the selecting the target feature timing cluster to which the second timing data belongs for the second timing data includes: performing head-to-tail splicing on the second time series data; and matching the spliced result with the one or more characteristic time sequence clusters respectively, and taking the characteristic time sequence cluster with the highest matching value as the target characteristic time sequence cluster.
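One way to picture the head-to-tail splicing and matching step is sketched below. The text does not define the matching value, so the score used here (the reciprocal of one plus the mean Euclidean distance to the cluster members) is purely an assumption, as are the function names, the cluster names and the requirement that the spliced series has the same length as the cluster members.

```python
def splice_head_to_tail(segments):
    """Concatenate the segments in order into one sequence."""
    spliced = []
    for segment in segments:
        spliced.extend(segment)
    return spliced


def match_score(series, members):
    """Assumed matching value: higher when series is closer to the members."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    mean_distance = sum(distance(series, m) for m in members) / len(members)
    return 1.0 / (1.0 + mean_distance)


def choose_cluster(segments, clusters):
    """Return the name of the cluster with the highest matching value."""
    spliced = splice_head_to_tail(segments)
    return max(clusters, key=lambda name: match_score(spliced, clusters[name]))


# Hypothetical clusters keyed by name, each holding its member sequences.
clusters = {
    "spike": [[10, 40, 10, 40]],
    "flat": [[20, 20, 20, 20]],
}
target = choose_cluster([[11, 39], [9, 41]], clusters)
```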

Optionally, the training of the preset second class of models to be trained by using the time sequence data corresponding to the feature time sequence cluster includes: cutting second time sequence data for training included in the characteristic time sequence cluster, and splicing the cutting results to generate new second time sequence data for training; and training a preset second type of model to be trained by using the second time sequence data for training and the new second time sequence data for training which are included in the characteristic time sequence cluster.
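The cut-and-splice step can be read as a simple form of data augmentation within one cluster. The sketch below is one possible reading, not the method fixed by the text: it assumes a single cut position shared by all series (the hypothetical `cut` parameter) and joins the head of one series to the tail of another.

```python
from itertools import permutations


def cut_and_splice(cluster_series, cut):
    """Create new series by joining the head of one series (before `cut`)
    to the tail of another series (from `cut` onward)."""
    new_series = []
    for a, b in permutations(cluster_series, 2):
        new_series.append(a[:cut] + b[cut:])
    return new_series


# Hypothetical training series belonging to one feature time sequence cluster.
cluster_series = [[1, 2, 3, 4], [10, 20, 30, 40]]
augmented = cut_and_splice(cluster_series, cut=2)
training_set = cluster_series + augmented  # original plus newly generated series
```

Training on `training_set` rather than `cluster_series` alone gives the second type of model more examples per cluster, which matters when only a few items show a given post-mutation pattern.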

To achieve the above object, according to a second aspect of embodiments of the present invention, there is provided an apparatus for data processing.

The data processing device of the embodiment of the invention comprises:

a determining module, configured to determine time series data to be processed, wherein the time series data to be processed indicates the change in item demand within a historical target time period;

a prediction module, configured to: where the time series data to be processed has a quantity mutation point at a set time point within the historical target time period, divide the time series data to be processed into first time series data and second time series data according to the quantity mutation point; determine a target feature time sequence cluster to which the second time series data belongs, process the first time series data with a preset first data processing model, and process the second time series data with a second data processing model matched with the target feature time sequence cluster; wherein the quantity mutation point is determined by big data statistics;

Optionally, after the time series data to be processed is determined, for a case that there is no number of abrupt change points in the time series data to be processed at a set time point within the historical target time period, processing the time series data to be processed by using a preset first data processing model; and determining a second prediction result of the demand condition of the item in the prediction time sequence according to the second processing result of the first data processing model, so as to set the inventory of the item in the prediction time sequence based on the second prediction result.

the forecasting module is further used for splicing the forecasting result of the article demand condition before the preset time point in the forecasting time sequence with the forecasting result of the article demand condition after the preset time point in the forecasting time sequence to obtain the first forecasting result.

Optionally, the apparatus further includes an analyzing module, configured to, after the time series data to be processed is determined and where analysis shows that the time series data to be processed satisfies a normality assumption, determine by a Buishand statistical method whether the time series data to be processed has a quantity mutation point at a time point within the historical target time period;

Optionally, the prediction module is further configured to screen out target item time series data with a number of mutation points from all item time series data of the test cases;

the apparatus further comprises a clustering module, configured to segment, from the target item time series data, second time series data for training located after the set time point, wherein the data at the time points included in the second time series data for training are equal to or greater than the value of the quantity mutation point included in the target item time series data; and to cluster the second time series data for training to obtain one or more feature time sequence clusters;

the prediction module is further configured to select a target feature timing cluster to which the second timing data belongs from one or more feature timing clusters.

Optionally, the prediction module is further configured to, for a case that there are a plurality of feature timing clusters, match the second timing data with each of the feature timing clusters; and screening out a target characteristic time sequence cluster from the plurality of characteristic time sequence clusters according to a matching result.

Optionally, the clustering module is further configured to segment, from the target item time series data, first time series data for training located before the set time point, where data of a plurality of time points included in the first time series data for training is smaller than a numerical value of a number of mutation points included in the target item time series data; clustering other article time sequence data except the target article time sequence data in the test case with the first time sequence data for training; and training a preset first type of model to be trained by utilizing the clustering result to obtain a first data processing model, wherein the first type of model to be trained is obtained based on one or more basic model configurations, and the basic model comprises one or more of ARIMA, ETS, Croston, simple moving average, FBProphet, Holt-Winters and first-order exponential smoothing.

Optionally, the clustering module is further configured to, for each type of feature timing cluster, train a preset second type of model to be trained by using timing data corresponding to the feature timing cluster, to obtain a second data processing model matched with the feature timing cluster, where the second type of model to be trained is obtained based on one or more base model configurations, and the second type of model to be trained is different from the first type of model to be trained.

Optionally, the clustering module is further configured to determine a sequence translation function and a translation distance for a preset distance function; and clustering the second time sequence data for training by using the distance function for determining the sequence translation function and the translation distance and the DBSCAN algorithm.

Optionally, the clustering module is further configured to perform head-to-tail splicing on the second time series data; and matching the spliced result with the one or more characteristic time sequence clusters respectively, and taking the characteristic time sequence cluster with the highest matching value as the target characteristic time sequence cluster.

Optionally, the clustering module is further configured to cut second timing sequence data for training included in the feature timing sequence cluster, and splice cut results to generate new second timing sequence data for training; and training a preset second type of model to be trained by using the second time sequence data for training and the new second time sequence data for training which are included in the characteristic time sequence cluster.

To achieve the above object, according to a third aspect of embodiments of the present invention, there is provided an apparatus for data processing.

The data processing device of the embodiment of the invention comprises: one or more processors; a storage system for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the method of data processing of the embodiment of the present invention.

To achieve the above object, according to a fourth aspect of embodiments of the present invention, there is provided a computer-readable medium.

The computer readable medium of the embodiment of the present invention stores thereon a computer program that realizes the data processing method of the embodiment of the present invention when executed by a processor.

One embodiment of the above invention has the following advantages or benefits. Where the time series data to be processed has a quantity mutation point at a set time point within the historical target time period, the time series data to be processed is divided into first time series data located before the set time point and second time series data located after the set time point. The value of the quantity mutation point is greater than the value at each time point before the set time point, and the difference between the value of the quantity mutation point and the value at each time point before the set time point is equal to or greater than a preset mutation threshold. As a result, the data at the time points included in the first time series data are smaller than the value of the quantity mutation point, and the data at the time points included in the second time series data are equal to or greater than the value of the quantity mutation point. The first time series data and the second time series data are then processed by a first data processing model and a second data processing model respectively; because data from different stages are processed by different data processing models, the processing results reflect the item demand of each stage more faithfully. Finally, a first prediction result of the item demand within the prediction time sequence is determined according to the first processing result of the first data processing model and the processing result of the second data processing model, so that the item inventory within the prediction time sequence is set based on the first prediction result. In this way, different time series data are processed by different processing models, and the accuracy of the processing result is improved.

Further effects of the above non-conventional optional implementations are described below in connection with the specific embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of a main flow of a data processing method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of data partitioning according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a main flow for the case in which the time series data to be processed has no quantity mutation point at a set time point within a historical target time period, according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a main process for determining data mutation points according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a main process for obtaining a feature timing cluster according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a main process of clustering the target high-volume time-series data according to an embodiment of the present invention;

FIG. 7 is a schematic diagram illustrating a main process flow of selecting a target feature timing cluster for the high-volume timing data according to an embodiment of the present invention;

FIG. 8 is a schematic illustration of a primary flow of determining a first data processing model in accordance with an embodiment of the present invention;

FIG. 9 is a schematic diagram illustrating another embodiment of the main process for selecting the target feature timing cluster for the high-volume timing data;

FIG. 10 is a schematic diagram of a data cropping step according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of a data pruning intercept process of an embodiment of the present invention;

FIG. 12 is a block diagram illustrating an overall flow of time series data prediction sales, in accordance with an embodiment of the present invention;

FIG. 13 is a schematic diagram of the main blocks of a data processing apparatus according to an embodiment of the present invention;

FIG. 14 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;

FIG. 15 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

According to a first aspect of the embodiments of the present invention, a data processing method applied to a server is provided.

In the retail field, some time series often show a marked increase in item demand during a specific time period. If the item demand cannot be determined accurately, there is a risk of stock shortages and stock-outs, which lowers the service level and raises costs. The time series of different items are usually expressed per SKU (Stock Keeping Unit), the smallest inventory unit. The existing FFORMA architecture can have the following problems when predicting for different warehouse item identifiers:

(1) a marked increase in item demand often appears for only some of the warehouse item identifiers. When the model is trained on all warehouse item identifiers, the data without such an increase pull down the predictions for the data that do show it, so the predicted values are too low;

(2) the marked increase in item demand takes different forms in different time series, and FFORMA cannot mine, separately for items with different time series, the features that characterize each time series during the period of increased demand;

(3) the FFORMA model is difficult to optimize, and it is difficult to train and fit, within a short time, a data model suited to the marked demand increases of various items.

FIG. 1 is a schematic diagram of a main flow of a data processing method according to an embodiment of the present invention. As shown in FIG. 1, the method mainly includes:

step S101: determining time sequence data to be processed, wherein the time sequence data to be processed indicates the change of the demand of the article in the historical target time period;

step S102: aiming at the condition that the time sequence data to be processed has quantity mutation points at set time points in a historical target time period, dividing the time sequence data to be processed into first time sequence data before the set time points and second time sequence data after the set time points according to the quantity mutation points, wherein the quantity mutation points are determined by big data statistics;

step S103: determining a target characteristic time sequence cluster to which the second time sequence data belongs, processing the first time sequence data by using a preset first data processing model, and processing the second time sequence data by using a second data processing model matched with the target characteristic time sequence cluster;

step S104: and determining a first prediction result of the demand condition of the item in the prediction time sequence according to the first processing result of the first data processing model and the processing result of the second data processing model, so as to set the inventory of the item in the prediction time sequence based on the first prediction result.
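Steps S103 and S104 can be sketched as follows for a monthly prediction time sequence. This is a minimal illustration under stated assumptions, not the patented models: `mean_forecaster` is a hypothetical stand-in for both trained data processing models, `choose_cluster` is a placeholder for the cluster matching step, and the inputs are assumed to have been split already as in step S102.

```python
from statistics import mean


def mean_forecaster(history, steps):
    """Stand-in model: repeat the historical mean for `steps` future points."""
    return [mean(history)] * steps


def forecast_month(first, second, set_day, month_length,
                   first_model, second_models, choose_cluster):
    """Forecast one month of demand from already split history.

    first  : history before the set day (first time series data)
    second : history from the set day onward (second time series data)
    """
    cluster = choose_cluster(second)                      # S103: find the target cluster
    before = first_model(first, set_day - 1)              # days 1 .. set_day - 1
    after = second_models[cluster](second, month_length - set_day + 1)
    return before + after                                 # S104: splice both results


# Hypothetical history: 10 pieces/day before the 20th, 40 pieces/day afterwards.
prediction = forecast_month(
    first=[10] * 19,
    second=[40] * 11,
    set_day=20,
    month_length=30,
    first_model=mean_forecaster,
    second_models={"month_end": mean_forecaster},
    choose_cluster=lambda series: "month_end",
)
```

Keeping the two models behind plain callables makes it easy to swap in a different second data processing model per feature time sequence cluster without touching the splicing logic.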

The set time point can be set in two ways. One way is to designate a time point as the set time point, for example the 20th day of each month. The other way is to perform big data statistical analysis on the change in item demand within a historical time period and use the time point of the resulting quantity mutation point as the set time point; for example, when the quantity mutation point determined from the change in item demand falls on the 10th day of each month, the 10th day is the set time point. It should be noted that the set time point can be set or adjusted according to the actual situation, so that it can change dynamically to meet different requirements.

The quantity mutation point is determined by big data statistics. Specifically, the change in item demand over a large number of historical time periods is analyzed with existing statistical methods, such as the Buishand homogeneity test (Buishand statistical method) and the Pettitt change-point test (Pettitt test method), and the quantity mutation point is determined from the statistical result. Where the statistical result shows several quantity mutation points at several time points, the quantity mutation point with the largest value is selected as the final quantity mutation point, and the time point corresponding to it is determined as the set time point. Alternatively, in the same situation, the time point that occurs most often may be selected as the set time point, and the quantity mutation point, or the range of the quantity mutation point, may be determined from the quantity mutation points observed at that time point.
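The two selection rules for the case of several candidate mutation points can be sketched as below. The representation of a candidate as a `(day_of_month, demand_value)` pair, the function names and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def pick_by_largest_value(candidates):
    """Keep the candidate (day_of_month, value) with the largest value."""
    return max(candidates, key=lambda c: c[1])


def pick_by_frequency(candidates):
    """Keep the day that occurs most often, with the range of values seen on it."""
    day, _count = Counter(day for day, _value in candidates).most_common(1)[0]
    values = [value for d, value in candidates if d == day]
    return day, (min(values), max(values))


# Hypothetical candidates detected in four historical periods.
candidates = [(20, 30), (20, 34), (18, 45), (20, 28)]
largest = pick_by_largest_value(candidates)
frequent_day, value_range = pick_by_frequency(candidates)
```

With these sample values the first rule favours the single largest jump, while the second favours the day on which jumps recur, which may be the more stable choice when one period contains an unusual peak.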

In an embodiment of the present invention, the second time series data is a time series part in which the demand of the article is suddenly increased in the time series data to be processed, and the first time series data is a time series part in which the second time series data is removed from the time series data to be processed. The schematic diagram of data division is shown in fig. 2. In the present application, the sales in the later period are predicted mainly for the case where the sales increase rapidly at the end of the month, and therefore, in the present embodiment, it is considered that the time series data after the number of mutation points until the end of the month is the second time series data. The first time series data refers to the sales data of the articles before the date corresponding to the number mutation point from the beginning of the month. The second time sequence data generally comprises data of a plurality of time points which are equal to or more than numerical values of the number of the catastrophes, and the data of the plurality of time points which are included in the first time sequence data are generally less than the numerical values of the number of the catastrophes. It is worth noting that the cut lines shown in fig. 2 are sales points at a point in time before the number break point. I.e. the number of points after the dividing line shown in fig. 2. That is, the number mutation point is included in the second time series data. 
For example, the set time point is the 20th of each month (either specified directly or determined from the change in item demand over historical time periods), and the time series data to be processed covers January 1, 2021 to February 28, 2021. A quantity mutation point exists at the set time point, with a sales value of 30 pieces/day. The second time series data is then the sales data for January 20 to January 31, 2021 and February 20 to February 28, 2021, which is generally equal to or greater than 30 pieces/day; the first time series data is the sales data for January 1 to January 19, 2021 and February 1 to February 19, 2021, which is generally less than 30 pieces/day.
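The division in this example can be sketched as follows. This is a minimal illustration, assuming daily demand is held in a dictionary keyed by date; the function name `split_by_set_day` and the demand values are hypothetical and not taken from the patent.

```python
from datetime import date, timedelta


def split_by_set_day(series, set_day):
    """Split {date: demand} into (first, second) lists of (date, demand).

    Days before `set_day` in each month go to the first part; the set day
    itself and all later days of that month go to the second part.
    """
    first, second = [], []
    for day in sorted(series):
        target = second if day.day >= set_day else first
        target.append((day, series[day]))
    return first, second


# Hypothetical daily demand for January 1 to February 28, 2021:
# 10 pieces/day before the 20th, 30 pieces/day from the 20th onward.
start = date(2021, 1, 1)
series = {}
for offset in range(59):
    day = start + timedelta(days=offset)
    series[day] = 30 if day.day >= 20 else 10

first, second = split_by_set_day(series, set_day=20)
print(len(first), len(second))
```

With the set time point on the 20th, the first part holds the 1st to the 19th of both months and the second part holds the 20th to the end of each month.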

It should be noted that the second time series data after the set time point may include data points whose values are greater than that of the quantity mutation point at the set time point.

The first data processing model in step S103 is obtained by training a preset first type of model to be trained with first time series data for training, which is segmented from all item time series data of the test case and corresponds to the first time series data. Likewise, the second data processing model is obtained by training a preset second type of model to be trained with second time series data for training, which is segmented from all item time series data of the test case and corresponds to the second time series data.

In an alternative embodiment, the time period covered by the time series data to be processed can be chosen according to actual needs. For example, assuming the current date is June 1, 2021, the data for June 2, 2021 to December 1, 2021 may be predicted using the time series data from January 1, 2021 to June 1, 2021 as the data to be processed, or using the time series data from January 1, 2021 to March 1, 2021. Because merchants run promotions or large shopping festivals in certain months, such as Double Eleven (November 11), prediction accuracy is higher when time series data from the same period of an earlier year is used; for example, the data from November 1, 2020 to November 30, 2020 is used to predict the data from November 1, 2021 to November 30, 2021.

Step S103 specifically includes: inputting the first time series data into the first data processing model and the second time series data into the second data processing model, to obtain a prediction result of item demand within the prediction time series. For example, in a sales forecasting scenario, the set time point is the 20th of each month and the time series data to be processed covers January 1, 2021 to February 28, 2021. Assuming a quantity mutation point exists at the set time point, the second time series data input to the second data processing model covers January 20 to January 31, 2021 and February 20 to February 28, 2021, and the first time series data input to the first data processing model covers January 1 to January 19, 2021 and February 1 to February 19, 2021.

In an alternative embodiment, after step S101 the time series data to be processed may have no quantity mutation point at the set time point within the historical target time period. In that case, as shown in fig. 3, the method includes:

step S301: processing time sequence data to be processed by utilizing the preset first data processing model;

step S302: and determining a second prediction result of the demand condition of the item in the prediction time sequence according to the second processing result of the first data processing model, so as to set the inventory of the item in the prediction time sequence based on the second prediction result.

Because the time series data to be processed has no quantity mutation point, it does not need to be split into first and second time series data and processed separately; it is processed only with the first data processing model, which is the model used for first time series data.

In a further alternative embodiment, the first processing result of the first data processing model includes a prediction result of item demand before the set time point in the prediction time series, and the processing result of the second data processing model includes a prediction result of item demand after the set time point in the prediction time series.

In step S104, determining the first prediction result of item demand within the prediction time series further includes: splicing the prediction result of item demand before the set time point in the prediction time series with the prediction result of item demand after the set time point, to obtain the first prediction result.

For example, in a sales forecasting scenario, the set time point is the 20th of each month, the time series data to be processed covers January 1, 2021 to February 28, 2021, and the prediction time series covers January 1, 2022 to February 28, 2022. Assuming a quantity mutation point exists at the set time point, the second time series data input to the second data processing model covers January 20 to January 31, 2021 and February 20 to February 28, 2021, and the first time series data input to the first data processing model covers January 1 to January 19, 2021 and February 1 to February 19, 2021. The first data processing model outputs the prediction results of item demand before the set time point in the prediction time series, i.e. January 1 to January 19, 2022 and February 1 to February 19, 2022; the second data processing model outputs the prediction results after the set time point, i.e. January 20 to January 31, 2022 and February 20 to February 28, 2022. Splicing these prediction results yields the first prediction result for January 1, 2022 to February 28, 2022.
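The splicing of the two models' outputs can be illustrated with a short sketch. It assumes each model returns a dictionary mapping prediction dates to forecast values; `splice_predictions`, `daily`, and the forecast values are hypothetical names and numbers used only for illustration.

```python
from datetime import date, timedelta


def splice_predictions(first_pred, second_pred):
    """Merge two {date: forecast} dicts into one date-ordered list."""
    overlap = set(first_pred) & set(second_pred)
    if overlap:
        raise ValueError(f"dates predicted by both models: {sorted(overlap)}")
    merged = {**first_pred, **second_pred}
    return sorted(merged.items())


def daily(start, end, value):
    """Build a constant daily forecast from start to end inclusive."""
    return {start + timedelta(days=k): value for k in range((end - start).days + 1)}


# Hypothetical outputs of the first model (before the 20th of each month).
first_pred = {
    **daily(date(2022, 1, 1), date(2022, 1, 19), 10.0),
    **daily(date(2022, 2, 1), date(2022, 2, 19), 10.0),
}
# Hypothetical outputs of the second model (from the 20th onward).
second_pred = {
    **daily(date(2022, 1, 20), date(2022, 1, 31), 30.0),
    **daily(date(2022, 2, 20), date(2022, 2, 28), 30.0),
}

result = splice_predictions(first_pred, second_pred)
print(result[0], result[-1])
```

Sorting by date, rather than simply concatenating, keeps the spliced result in calendar order when the two models cover alternating periods.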

In this embodiment, a quantity mutation point is a point after which the data is significantly larger than before it. Different test methods suit data with different properties, so choosing the test according to those properties improves its accuracy. To test whether the data after the quantity mutation point is significantly larger than the data before it, in an alternative embodiment, as shown in fig. 4, the method includes:

step S401: when analysis shows that the time series data to be processed satisfies the normality assumption, using the Buishand test to determine whether the time series data to be processed has a quantity mutation point at the set time point within the historical target time period;

step S402: when analysis shows that the time series data to be processed does not satisfy the normality assumption, using the Pettitt test to determine whether the time series data to be processed has a quantity mutation point at the set time point within the historical target time period.

The Buishand statistical method comprises the following specific steps:

Define the time series data to be processed as a sequence $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the item demand data at time point $i$, $1, \ldots, n$ denote the $n$ time points within the historical target time period, and $x_m$ is the item demand data at the set time point $m$. If $X$ satisfies the normality assumption, the item demand data at each time point in $X$ can be expressed as:

$$x_i = \begin{cases} \mu + \varepsilon_i, & i = 1, \ldots, m-1 \\ \mu + \Delta + \varepsilon_i, & i = m, \ldots, n \end{cases} \qquad \text{formula (1)}$$

where $\varepsilon_i$ is the error at time point $i$, $\mu$ is the mean value, $m$ is the set time point, and $\Delta$ is the change value. If $x_m$ is the quantity mutation point at the set time point $m$, it is only necessary to test whether the change value satisfies $\Delta > 0$. For this test, the sum value before the mutation point of the time series data to be processed, $S_m$, is constructed as follows:

$$S_m = \sum_{i=1}^{m-1} \left( x_i - \bar{x} \right) \qquad \text{formula (2)}$$

where $\bar{x}$ is the average value of the time series data to be processed $X$, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. For homogeneous data (i.e. the distribution of $X$ does not change significantly over time), the statistic $S_m$ at the set time point $m$ is expected to fluctuate around 0, because $X$ shows no systematic deviation from the mean $\bar{x}$. If $\Delta > 0$, most values of $x_i - \bar{x}$ before the quantity mutation point are expected to be less than 0, and most after it greater than 0. Because the embodiment of the invention only needs to check for a quantity mutation point at the set time point $m$, only the statistic $S_m$ at the set time point is considered. If the specific quantity mutation point cannot be determined, $\max_k |S_k|$ is selected. To complete the test, the statistic is rescaled, i.e. the statistic $Q$ is constructed as follows:

$$Q = \frac{|S_m|}{\sigma \sqrt{n}} \qquad \text{formula (3)}$$

where $\sigma^2$ is the variance of the time series data to be processed and $n$ is the data amount of the time series data to be processed. The statistic $Q$ is used to test the sum value before the mutation point, $S_m$, and to determine its significance.

In practical application, if the set time point falls in the last 10 days of the month, a historical sample of equal length (10 days) is taken before it, i.e. $n = 20$, for which the critical value of $Q$ at the 95% confidence level is 1.22. If the computed statistic satisfies $Q > 1.22$, a quantity mutation point is considered to exist at the set time point.
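A minimal sketch of this check is given below. It assumes the rescaled statistic is the absolute partial sum of deviations from the mean over the observations before the set time point, divided by the population standard deviation and by the square root of the sample size, as in Buishand's test; the function names, the 1-based index `m`, and the demand values are illustrative assumptions.

```python
import math


def buishand_q_at(series, m):
    """Rescaled partial-sum statistic at a known candidate point.

    `m` is the 1-based index of the set time point; the partial sum runs
    over the m-1 observations before it (an assumed indexing convention).
    """
    n = len(series)
    if not 2 <= m <= n:
        raise ValueError("m must satisfy 2 <= m <= len(series)")
    mean = sum(series) / n
    # Population standard deviation (divide by n), as in Buishand's rescaling.
    sigma = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if sigma == 0:
        return 0.0  # a constant series shows no change at all
    s_m = sum(x - mean for x in series[: m - 1])
    return abs(s_m) / (sigma * math.sqrt(n))


def has_mutation_point(series, m, critical=1.22):
    """Compare the statistic with the critical value quoted for n = 20."""
    return buishand_q_at(series, m) > critical


# Hypothetical 20-day sample: 10 ordinary days followed by 10 high-sales days.
demand = [10] * 10 + [30] * 10
q = buishand_q_at(demand, m=11)
print(round(q, 3), has_mutation_point(demand, m=11))
```

The default critical value 1.22 only applies to samples of 20 points; other sample sizes would need the corresponding tabulated value.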

The Pettitt test comprises the following specific steps:

Define the time series data to be processed as a sequence $X = \{x_1, x_2, \ldots, x_n\}$, and construct the statistic $U_m$ as follows:

$$U_m = \sum_{i=1}^{m-1} \sum_{j=m}^{n} \operatorname{sgn}\left( x_i - x_j \right) \qquad \text{formula (4)}$$

where $U_m$ is the sum of the sign functions before the quantity mutation point, $m$ is the set time point, $x_j$ is the item demand data at time point $j$, and $\operatorname{sgn}$ is the sign function. As in the Buishand test, if the quantity mutation point cannot be determined, $\max_k |U_k|$ is selected. Since only a significant increase in item demand at the set time point is considered, and a significant decrease is not, $U_m < 0$ is expected to hold significantly. The $p$ value of the one-sided test is calculated according to the approximation formula proposed by Pettitt:

$$p \approx \exp\left( \frac{-6\, U_m^2}{n^3 + n^2} \right) \qquad \text{formula (5)}$$

where $\exp$ is the exponential function with the natural constant e as its base. At a confidence level of 95%, if $p < 0.05$, a quantity mutation point exists at the set time point $m$.
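The Pettitt check at a fixed set time point can be sketched as follows. The sketch assumes the sign statistic compares every observation before the set time point with every observation from it onward, and uses Pettitt's one-sided approximation for the p value; the function names and the sample data are hypothetical.

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pettitt_u_at(series, m):
    """Sum of sgn(x_i - x_j) for i before the set time point and j from it on.

    `m` is the 1-based index of the set time point (assumed convention).
    A strongly negative value indicates an increase at the set time point.
    """
    before, after = series[: m - 1], series[m - 1:]
    return sum(sign(a - b) for a in before for b in after)


def pettitt_one_sided_p(series, m):
    """Approximate one-sided p value for an increase at the set time point."""
    n = len(series)
    u = pettitt_u_at(series, m)
    if u >= 0:
        return 1.0  # no evidence of an increase
    return min(1.0, math.exp(-6.0 * u * u / (n ** 3 + n ** 2)))


# Hypothetical 20-day sample: 10 ordinary days followed by 10 high-sales days.
demand = [10] * 10 + [30] * 10
u = pettitt_u_at(demand, m=11)
p = pettitt_one_sided_p(demand, m=11)
print(u, p < 0.05)
```

Returning 1.0 when the statistic is non-negative is a design choice of this sketch: it treats a decrease or no change as "no mutation point", matching the one-sided intent described above.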

In an optional embodiment, as shown in fig. 5, before the method of the embodiment of the present invention is executed, a test process is further included, specifically including:

step S501: screening target article time sequence data with a number of mutation points from all article time sequence data of the test cases;

step S502: segmenting second time sequence data for training, which are positioned after a set time point, from the time sequence data of the target object, wherein the data of a plurality of time points included in the second time sequence data for training are equal to or larger than numerical values of quantity mutation points included in the time sequence data of the target object;

step S503: and clustering the second time sequence data for training to obtain one or more characteristic time sequence clusters.

Based on steps S501 to S503, determining the target feature time series cluster to which the second time series data belongs in step S103 includes: selecting a target feature time series cluster for the second time series data from the one or more feature time series clusters.

In the prior art, the FFORMA model cannot predict accurately when applied to all item time series data. The embodiment of the application therefore first screens out the target item time series data that has a quantity mutation point, and then uses the second time series data for training segmented from it as model input, which can significantly improve prediction accuracy. In addition, the second time series data for training is clustered into different feature time series clusters, which serve as model input. Dividing the target item time series data further in this way allows time series with different characteristics to be fed to the model by class, making the prediction more accurate.

For example, the test case contains 100 item time series, 50 of which have a quantity mutation point; these 50 are the target item time series data. Assuming each time series contains only one month of item sales, one piece of second time series data for training can be segmented from each target item time series at its quantity mutation point, giving 50 pieces in total, which are then clustered to obtain one or more feature time series clusters.

For step S501, in an optional embodiment, the method specifically includes: and executing the following steps aiming at the article time sequence data of each warehouse article identifier in the test case:

step one: determining a data segmentation point;

step two: judging whether the data segmentation point is a quantity mutation point in the item time series data of the warehouse item identifier; if so, determining that the item time series data of the warehouse item identifier is target item time series data.

Whether a point is a quantity mutation point is determined in the same way as in steps S401 and S402 above; the data segmentation point can be obtained from human experience or through data pivot analysis.

For the clustering process in step S503, in an alternative embodiment, as shown in fig. 6, the clustering process includes:

step S601: determining a sequence translation function and a translation distance for a preset distance function;

step S602: and clustering the second time sequence data for training by using the distance function for determining the sequence translation function and the translation distance and the DBSCAN algorithm.

In the embodiment of the application, the purpose of clustering is to group the curves of item demand variation by their degree of similarity. In time series clustering, the Euclidean distance used in the prior art describes similarity between time series poorly, because time series may differ in length, may be similar only up to a shift, and may differ in scale. In 2015, John Paparrizos proposed K-Shape, a time series clustering algorithm based on a shape similarity measure, which describes the similarity distance between time series better and clusters them using an idea similar to K-means. However, experiments show that this method is computationally expensive and, more importantly, that the number of clusters k is hard to choose because sufficient prior knowledge about the time series is rarely available; it often has to be determined by repeated experiments. Therefore, in an alternative embodiment, the invention clusters with the DBSCAN algorithm based on the shape-based distance (SBD).

Specifically, the sequence translation function $X_{(s)}$ is determined as follows:

$$X_{(s)} = \begin{cases} (\underbrace{0, \ldots, 0}_{s},\, x_1, x_2, \ldots, x_{n-s}), & s \ge 0 \\ (x_{1-s}, \ldots, x_{n-1}, x_n,\, \underbrace{0, \ldots, 0}_{|s|}), & s < 0 \end{cases} \qquad \text{formula (6)}$$

where $s$ is the translation amount and $x_1, \ldots, x_n$ are the item demands at time points 1 to $n$ in the time series $X$ to be translated. For a translation amount $s$, the cross-correlation $CC_s(X, Y)$ between two time series $X$ and $Y$ is obtained as:

$$CC_s(X, Y) = \begin{cases} \sum_{i=1}^{n-s} x_i \, y_{s+i}, & s \ge 0 \\ \sum_{i=1}^{n+s} x_{i-s} \, y_i, & s < 0 \end{cases} \qquad \text{formula (7)}$$

where $y_i$ is the item demand at time point $i$ in the time series $Y$ and $y_{s+i}$ is the item demand at time point $s+i$ in $Y$. When $CC_s(X, Y)$ takes its maximum value, the translation amount $s$ is considered to be at the optimal position. The cross-correlation at that position is substituted into formula (8) to obtain the normalized quantity $NCC(X, Y)$:

$$NCC(X, Y) = \frac{\max_s CC_s(X, Y)}{\lVert X \rVert \cdot \lVert Y \rVert} \qquad \text{formula (8)}$$

where $\lVert X \rVert$ is the norm of the time series $X$ and $\lVert Y \rVert$ is the norm of the time series $Y$. The normalized quantity $NCC(X, Y)$ is substituted into formula (9) to obtain the translation distance SBD:

$$SBD(X, Y) = 1 - NCC(X, Y) \qquad \text{formula (9)}$$

Further, the two hyperparameters of the DBSCAN algorithm are the neighborhood radius $\varepsilon$ and the threshold on the number of samples in the neighborhood, $MinPts$; they can be determined using the elbow method and the silhouette coefficient (silhouette score).
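The shape-based distance can be sketched in plain Python as below. This is an illustrative implementation that assumes both sequences have the same length and computes the cross-correlation as the inner product of a zero-padded shifted copy of one sequence with the other; the function names are hypothetical, and returning 1.0 for an all-zero sequence is a choice of this sketch, not something stated in the patent.

```python
import math


def cross_correlation(x, y, s):
    """Inner product of x shifted by s positions (zero padded) with y."""
    n = len(x)
    if s >= 0:
        return sum(x[i] * y[i + s] for i in range(n - s))
    return sum(x[i - s] * y[i] for i in range(n + s))


def sbd(x, y):
    """Shape-based distance: 1 minus the best normalized cross-correlation."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes sequences of equal length")
    norm = math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in y))
    if norm == 0:
        return 1.0  # assumed convention for an all-zero sequence
    n = len(x)
    best = max(cross_correlation(x, y, s) for s in range(-(n - 1), n))
    return 1.0 - best / norm


# Two hypothetical demand curves with the same shape, shifted by two days.
a = [0, 1, 2, 1, 0, 0]
b = [0, 0, 0, 1, 2, 1]
c = [2, 0, 0, 0, 0, 2]
print(round(sbd(a, b), 6), round(sbd(a, c), 6))
```

Because the maximum is taken over all shifts, two curves that differ only by a time shift get a distance close to 0, which is the property the clustering relies on.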

By defining the distance function and determining the sequence translation function and the translation distance within it, the second time series data for training can be clustered more accurately according to the similarity of the curve shapes, and errors that would otherwise arise when similar curves are merely shifted in time are avoided.
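For illustration, a compact DBSCAN that accepts an arbitrary distance function is sketched below, so that a shape-based distance could be plugged in. It is a plain-Python sketch rather than the patented implementation; the names `dbscan`, `eps`, and `min_pts` are assumptions, and the usage example clusters plain numbers with an absolute-difference distance only to keep the demonstration short.

```python
def dbscan(items, distance, eps, min_pts):
    """Return one cluster label per item; -1 marks noise.

    A point is a core point when at least `min_pts` items (itself included)
    lie within distance `eps` of it.
    """
    n = len(items)
    neighbours = [
        [j for j in range(n) if distance(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # expand only through core points
    return labels


# Hypothetical one-dimensional data: two dense groups and one outlier.
points = [1.0, 1.3, 1.6, 5.0, 5.3, 5.6, 9.0]
labels = dbscan(points, lambda a, b: abs(a - b), eps=0.5, min_pts=2)
print(labels)
```

Unlike K-means style methods, no cluster count is supplied; the number of clusters follows from the neighborhood radius and the sample threshold.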

In an alternative embodiment, when there are a plurality of feature timing clusters, as shown in fig. 7, the step of selecting the target feature timing cluster to which the second timing data belongs may further include:

step S701: matching the second time sequence data with each characteristic time sequence cluster;

step S702: and screening out the target characteristic time sequence cluster from the plurality of characteristic time sequence clusters according to the matching result.

The matching criterion can be set according to the actual application scenario and requirements. In an alternative embodiment, the feature time series cluster with the highest matching degree is selected as the target feature time series cluster. In the embodiment of the present application, the matching degree may be expressed as the degree of similarity between the second time series data and each feature time series cluster: the higher the similarity, the higher the matching degree. For example, clustering yields 3 feature time series clusters, namely time series cluster 1, time series cluster 2 and time series cluster 3, and the matching degrees of the second time series data with them are 0.9, 0.6 and 0.1 respectively, so time series cluster 1 is the target feature time series cluster. When several clusters have the same highest matching degree, one of them may be selected at random.
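The selection rule in this example can be written as a small helper. The function name `select_target_cluster` and the cluster identifiers are hypothetical; how the matching degree itself is computed is left open here.

```python
import random


def select_target_cluster(match_degrees, rng=random):
    """Pick the cluster id with the highest matching degree.

    Ties between clusters with the same highest degree are broken at random.
    """
    best = max(match_degrees.values())
    candidates = [cid for cid, deg in match_degrees.items() if deg == best]
    return rng.choice(candidates)


# Matching degrees from the example above.
degrees = {"cluster_1": 0.9, "cluster_2": 0.6, "cluster_3": 0.1}
target = select_target_cluster(degrees)
print(target)
```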

In the embodiment of the present application, the second time series data for training may be segmented from the time series data of the target object, and in an alternative embodiment, as shown in fig. 8, the method further includes:

step S801: segmenting first time sequence data for training before a set time point from the time sequence data of the target object, wherein the data of a plurality of time points included in the first time sequence data for training is smaller than the numerical value of the number of mutation points included in the time sequence data of the target object;

step S802: clustering other article time sequence data except the target article time sequence data in the test case with the first time sequence data for training;

step S803: and training a preset first type of model to be trained by utilizing the clustering result to obtain a first data processing model, wherein the first type of model to be trained is obtained based on one or more base model configurations, and the base model comprises one or more of ARIMA, ETS, Croston, simple moving average, FBProphet, Holt-Winters and first-order exponential smoothing.

For example, in a sales forecasting scenario, the test case contains 10 item time series, 2 of which are target item time series data, i.e. have a quantity mutation point. The set time point corresponding to the quantity mutation point is the 20th of each month; the target item time series data covers February 1 to February 28, 2021 and March 1 to March 31, 2020, and the remaining 8 time series have no quantity mutation point. The first time series data for training segmented from the 2 target item time series is then the data for February 1 to February 19, 2021 and March 1 to March 19, 2020. These two segments are clustered together with the remaining 8 complete time series, and the preset first type of model to be trained is trained on the clustering result to obtain the first data processing model.

In practical application scenarios, predictions made from different first time series data are usually the same or similar, so the first time series data does not need the class-by-class processing applied to the second time series data. All first time series data is clustered once, without producing multiple distinct time series clusters, which simplifies clustering, reduces the amount of data processing without affecting the prediction result, and improves data processing efficiency. The clustering method is the same as in steps S601 and S602.

In an optional embodiment, for each feature time series cluster, a preset second type of model to be trained is trained with the time series data corresponding to that cluster, to obtain a second data processing model matched with the cluster. The second type of model to be trained is configured from one or more base models and is different from the first type of model to be trained. Because the time series it analyses are more complex, the second type of model to be trained is configured with more base models than the first type and is generally obtained by fusing multiple base models.

In an optional embodiment, as shown in fig. 9, the step of selecting the target feature time series cluster to which the second time series data belongs further includes:

step S901: splicing the second time sequence data from head to tail;

step S902: and matching the spliced result with one or more characteristic time sequence clusters respectively, and taking the characteristic time sequence cluster with the highest matching value as a target characteristic time sequence cluster.

For example, assume the time series data covers the change in item demand from March 1, 2021 to May 31, 2021, and the set time point of the quantity mutation point is 10 days before the end of the month. The second time series data is then the data for March 21 to March 31, 2021, April 20 to April 30, 2021, and May 21 to May 31, 2021, and the data for the remaining periods is the first time series data. Head-to-tail splicing joins these three segments end to end into a new, longer time series. Because high-sales period data usually shares common characteristics, feeding each high-sales segment into the model separately usually gives similar or identical results. Head-to-tail splicing therefore avoids processing large amounts of similar data: an accurate prediction result can be obtained by inputting the spliced time series into the model once.
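Head-to-tail splicing of the high-sales segments can be sketched as follows, assuming each segment is stored as a pair of its start date and its list of daily values; the function name and the demand values are hypothetical.

```python
from datetime import date


def splice_head_to_tail(segments):
    """Concatenate (start_date, values) segments in chronological order."""
    spliced = []
    for _, values in sorted(segments, key=lambda seg: seg[0]):
        spliced.extend(values)
    return spliced


# Hypothetical high-sales segments of 11 days each, as in the example above.
march = (date(2021, 3, 21), [30] * 11)
april = (date(2021, 4, 20), [32] * 11)
may = (date(2021, 5, 21), [35] * 11)

spliced = splice_head_to_tail([may, march, april])
print(len(spliced))
```

The spliced list drops the calendar gaps between segments, so the model sees one continuous high-sales series.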

In the embodiment of the application, training on clustered groups of similar time series extracts the characteristics of each cluster better and reduces the bias and optimization difficulty of the model. However, clustering also reduces the number of time series available to each model, which makes overfitting more likely. Therefore, in an alternative embodiment, overfitting is reduced by data augmentation. As shown in fig. 10, the step of training the preset second type of model to be trained with the time series data corresponding to the feature time series cluster further includes:

step S1001: cutting second time sequence data for training included in the characteristic time sequence cluster, and splicing the cutting results to generate new second time sequence data for training;

step S1002: and training a preset second type of model to be trained by using the second time sequence data for training and the new second time sequence data for training which are included in the characteristic time sequence cluster.

As for the clipping process in step S1001, as shown in fig. 11, the clipping step may specifically include:

step S1001-1: setting a length threshold value L, a sliding length d and a cutting length s;

step S1001-2: selecting the time series whose length exceeds L, and leaving the time series that do not exceed the threshold unprocessed;

step S1001-3: for each selected time series, intercepting segments with a sliding window of length s that advances by the sliding length d, each window position yielding a new sample;

step S1001-4: and combining the intercepted data with the time sequence which does not exceed the threshold value to serve as a new sample set.

Through the cropping steps, the sample set can be expanded on the basis of the existing time series data, so that a more accurate second data processing model is obtained after the preset second type of model to be trained is trained.
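The cropping steps S1001-1 to S1001-4 can be sketched as follows. The parameter names `L`, `d` and `s` follow the text; the function name, the assumption that the window starts at the first element and advances by `d`, and the sample data are illustrative.

```python
def augment_by_cropping(series_list, L, d, s):
    """Build a new sample set by sliding-window cropping of long series.

    Series longer than L are replaced by windows of length s taken every d
    steps; series of length L or less are kept unchanged.
    """
    if d <= 0 or s <= 0:
        raise ValueError("d and s must be positive")
    samples = []
    for series in series_list:
        if len(series) <= L:
            samples.append(list(series))  # not cropped
            continue
        for start in range(0, len(series) - s + 1, d):
            samples.append(list(series[start:start + s]))
    return samples


# Hypothetical training data: one long series and one short series.
long_series = list(range(12))
short_series = [5, 6, 7, 8, 9, 10]
samples = augment_by_cropping([long_series, short_series], L=10, d=2, s=8)
print(len(samples))
```

Overlapping windows are correlated with each other, so the enlarged sample set adds less independent information than its size suggests.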

In general, the first type of model to be trained is trained on the clustering result of the first time series data for training. Selecting the best base model from several candidates is sufficient for this purpose; no complex model is needed, which improves efficiency. The second type of model to be trained is usually a fusion model, i.e. a complex model containing multiple base models, and different base model types can be selected for different feature time series clusters to obtain the best trained model.

Furthermore, a weight can be set for each base model according to the characteristics of each time series cluster. The weights are optimized continuously during training, and only the base models with higher weights are retained. Optimizing the base models in this way reduces overfitting to the time series data.
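How retained base models might be fused is sketched below. The sketch assumes the weights have already been learned and simply drops base models below a threshold before taking a weighted average; the threshold, the function name, and the forecasts are hypothetical, and the training of the weights is not shown.

```python
def fuse_forecasts(base_forecasts, weights, keep_threshold=0.1):
    """Weighted average of base-model forecasts, dropping low-weight models.

    Weights of the retained models are renormalized so that they sum to 1.
    """
    kept = {name: w for name, w in weights.items() if w >= keep_threshold}
    if not kept:
        raise ValueError("no base model reaches the weight threshold")
    total = sum(kept.values())
    horizon = len(next(iter(base_forecasts.values())))
    return [
        sum(kept[name] / total * base_forecasts[name][t] for name in kept)
        for t in range(horizon)
    ]


# Hypothetical two-day forecasts from three base models and learned weights.
forecasts = {"ARIMA": [10.0, 12.0], "ETS": [14.0, 16.0], "Croston": [100.0, 100.0]}
weights = {"ARIMA": 0.48, "ETS": 0.48, "Croston": 0.04}
fused = fuse_forecasts(forecasts, weights)
print([round(v, 6) for v in fused])
```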

In an alternative embodiment, the method of the embodiment of the present invention may be used for predicting sales of a commodity, as shown in fig. 12, where fig. 12 shows an overall flow diagram for predicting sales according to time series data, and specifically includes:

step one: judging whether the time series data has a mutation point, and determining from the result whether a high-sales phenomenon exists;

if a high-sales phenomenon exists, executing step two; if not, executing step seven directly;

step two: dividing the time series data into a high-sales period (second time series data) and a non-high-sales period (first time series data);

steps three to five are then executed for the time series data of the high-sales period:

step three: matching the time series data of the high-sales period with the feature time series clusters in the clustering result to obtain a target feature time series cluster;

step four: splicing the high-sales time series data in the target feature time series cluster to obtain a splicing result of the high-sales time series data;

step five: taking the splicing result of the high-sales time series data as the input of the second data processing model, and outputting a target sales volume;

after step two, steps six and seven are executed for the time series data of the non-high-sales period:

step six: splicing the time series of the non-high-sales period to obtain a splicing result of the non-high-sales time series data;

step seven: taking the splicing result of the non-high-sales time series data, or all current time series data, as the input of the first data processing model, and outputting the target sales volume.

Through the above steps, time series data with and without a high-sales phenomenon are processed separately, and within data that has a high-sales phenomenon the high-sales and non-high-sales periods are also processed separately, which improves the accuracy of data processing.

The data processing method of the embodiment of the invention handles the case where the time series data to be processed has a quantity mutation point at the set time point within the historical target time period. The time series data to be processed is divided into first time series data before the set time point and second time series data after it, where the data points in the first time series data are less than the value of the quantity mutation point and those in the second time series data are equal to or greater than it. The first and second time series data are then processed by the first and second data processing models respectively; that is, data from different stages is processed with different models, so that the processing results reflect item demand in each stage more faithfully. Finally, a first prediction result of item demand within the prediction time series is determined from the first processing result of the first data processing model and the processing result of the second data processing model, so that item inventory within the prediction time series can be set based on the first prediction result. In this way, different time series data is processed with different models, and the accuracy of the processing result is improved.

According to a second aspect of the embodiments of the present invention, there is provided an apparatus for data processing applied to a server.

Fig. 13 is a schematic diagram of main blocks of a data processing apparatus 1300 according to a second aspect of the embodiment of the present invention. As shown in fig. 13, the apparatus includes:

a determining module 1301, configured to determine time series data to be processed, where the time series data to be processed indicates a change of a demand of a product in a historical target time period;

the predicting module 1302 is configured to, in the case that the to-be-processed time series data has a quantity mutation point at a set time point within the historical target time period, segment the to-be-processed time series data into first time series data before the set time point and second time series data after the set time point; determine a target characteristic time sequence cluster to which the second time sequence data belongs, process the first time sequence data by using a preset first data processing model, and process the second time sequence data by using a second data processing model matched with the target characteristic time sequence cluster, wherein the quantity mutation point is determined by big data statistics; and determine a first prediction result of the item demand within the prediction time sequence according to the first processing result of the first data processing model and the processing result of the second data processing model, so as to set the inventory of the item within the prediction time sequence based on the first prediction result.

In an optional embodiment, the prediction module 1302 is further configured to, after the to-be-processed time series data is determined, process the to-be-processed time series data by using the preset first data processing model in the case that the to-be-processed time series data has no quantity mutation point at the set time point within the historical target time period; and determine a second prediction result of the item demand within the prediction time sequence according to the second processing result of the first data processing model, so as to set the inventory of the item within the prediction time sequence based on the second prediction result.

In an alternative embodiment, the first processing result of the first data processing model comprises: a prediction result of the item demand before the set time point within the prediction time sequence;

the processing result of the second data processing model comprises: a prediction result of the item demand after the set time point within the prediction time sequence;

the predicting module 1302 is further configured to splice the prediction result of the item demand before the set time point within the prediction time sequence with the prediction result of the item demand after the set time point within the prediction time sequence to obtain the first prediction result.

In an optional embodiment, the apparatus further includes an analysis module, configured to, after the to-be-processed time series data is determined and when analysis shows that the to-be-processed time series data satisfies the normality assumption, determine, by using the Buishand statistical method, whether or not the to-be-processed time series data has a quantity mutation point at the set time point within the historical target time period.
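As an illustration of the detection step, the sketch below computes the Buishand range statistic from the adjusted partial sums S_k = Σ_{i≤k}(x_i − x̄) and locates the candidate quantity mutation point at the largest |S_k|. The significance check is an assumption made for this sketch: it uses a permutation test instead of the tabulated critical values, and the function names, `alpha`, `n_perm` and `seed` are hypothetical.

```python
import random
from statistics import mean, pstdev


def buishand_range(series):
    """Return (R / sqrt(n), index of the first point after the candidate shift)."""
    n = len(series)
    sd = pstdev(series)
    if n < 2 or sd == 0:
        return 0.0, None  # no variation, so no candidate point
    m = mean(series)
    partial, s = [0.0], 0.0  # partial[k] holds S_k, with S_0 = 0
    for x in series:
        s += x - m
        partial.append(s)
    r = (max(partial) - min(partial)) / sd
    k = max(range(1, n), key=lambda i: abs(partial[i]))
    return r / n ** 0.5, k


def has_quantity_mutation(series, alpha=0.05, n_perm=999, seed=0):
    """Permutation test: how often does a shuffled series look as extreme?"""
    observed, k = buishand_range(series)
    rng = random.Random(seed)
    shuffled = list(series)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if buishand_range(shuffled)[0] >= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return p_value < alpha, k, p_value


sales = [5, 6, 5, 7, 6, 5, 6, 5, 30, 31, 29, 32, 30, 31, 30, 29]
detected, point, p_value = has_quantity_mutation(sales)
```

To follow the embodiment, the returned index would additionally be compared with the set time point, because the embodiment asks whether the mutation occurs at that specific point.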

In an optional embodiment, the predicting module 1302 is further configured to screen out, from all item time series data of the test cases, target item time series data that has a quantity mutation point;

the prediction module 1302 is further configured to select a target feature timing cluster for the second timing data from one or more feature timing clusters.

In an optional embodiment, the prediction module 1302 is further configured to, for a case that there are a plurality of the feature timing clusters, match the second timing data with each of the feature timing clusters; and screening out a target characteristic time sequence cluster from the plurality of characteristic time sequence clusters according to a matching result.

In an optional embodiment, the clustering module is further configured to segment, from the target item time series data, first time series data for training that is located before the set time point, where the data at the time points included in the first time series data for training are smaller than the value of the quantity mutation point included in the target item time series data; cluster the item time series data in the test cases other than the target item time series data together with the first time series data for training; and train a preset first type of model to be trained by using the clustering result to obtain the first data processing model, wherein the first type of model to be trained is obtained based on one or more base model configurations, and the base models comprise one or more of ARIMA, ETS, Croston, simple moving average, FBProphet, Holt-Winters and first-order exponential smoothing.
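The embodiment does not state how a model is configured from the listed base models. One possible reading, given here only as a sketch under that assumption, is to evaluate candidate base models on a hold-out tail of the training data and keep the one with the smallest error. Only two of the listed base models (simple moving average and first-order exponential smoothing) are implemented; the function names, `window`, `alpha` and `holdout` are hypothetical.

```python
def moving_average(history, horizon, window=3):
    """Simple moving average: repeat the mean of the last `window` points."""
    tail = history[-window:]
    return [sum(tail) / len(tail)] * horizon


def exp_smoothing(history, horizon, alpha=0.5):
    """First-order exponential smoothing: repeat the final smoothed level."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return [level] * horizon


BASE_MODELS = {"sma": moving_average, "ses": exp_smoothing}


def pick_base_model(history, holdout=2):
    """Return the name of the base model with the lowest hold-out MAE."""
    train, test = history[:-holdout], history[-holdout:]

    def mae(name):
        forecast = BASE_MODELS[name](train, holdout)
        return sum(abs(f - t) for f, t in zip(forecast, test)) / holdout

    return min(BASE_MODELS, key=mae)


best = pick_base_model([10, 10, 10, 10, 20, 20, 20])
```

The same selection loop could be run once per cluster of the clustering result, so that each cluster is served by the base model that fits it best.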

In an optional embodiment, the clustering module is further configured to, for each type of feature timing cluster, train a preset second type of model to be trained by using timing data corresponding to the feature timing cluster, and obtain a second data processing model matched with the feature timing cluster, where the second type of model to be trained is obtained based on one or more base model configurations, and the second type of model to be trained is different from the first type of model to be trained.

In an optional embodiment, the screening out, from all item time series data of the test cases, of the target item time series data that has a quantity mutation point includes:

for the item time sequence data of each warehouse item identifier in the test case, executing the following operations: determining a data segmentation point; and determining the article time sequence data of the warehouse article identifier as the target article time sequence data under the condition that the data segmentation point is the quantity mutation point of the article time sequence data of the warehouse article identifier.

In an optional embodiment, the clustering module is further configured to determine a sequence translation function and a translation distance for a preset distance function; and cluster the second time sequence data for training by using the DBSCAN algorithm together with the distance function for which the sequence translation function and the translation distance have been determined.
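A minimal sketch of this clustering step follows. The embodiment does not define the distance function here, so the sketch assumes one interpretation: the distance between two sequences is the smallest mean absolute gap obtained when one sequence is translated by at most `max_shift` positions. The DBSCAN routine is a compact pure-Python version written for this illustration; `shifted_distance`, `max_shift`, `eps` and `min_samples` are hypothetical names and values.

```python
def shifted_distance(a, b, max_shift=1):
    """Smallest mean absolute gap between a and b over shifts within +/- max_shift."""
    best = float("inf")
    for s in range(-max_shift, max_shift + 1):
        pairs = [(a[i], b[i + s]) for i in range(len(a)) if 0 <= i + s < len(b)]
        if pairs:
            best = min(best, sum(abs(x - y) for x, y in pairs) / len(pairs))
    return best


def dbscan(items, eps, min_samples, dist):
    """Minimal DBSCAN; returns one label per item, with -1 meaning noise."""
    n = len(items)
    # Each neighbourhood includes the item itself (distance 0).
    neighbours = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return labels


training = [
    [0, 10, 0, 0], [0, 0, 10, 0], [0, 9, 0, 0],  # the same spike, shifted in time
    [1, 2, 3, 4], [1, 2, 3, 5],                  # ramps
    [50, 50, 50, 50],                            # isolated series
]
labels = dbscan(training, eps=1.0, min_samples=2, dist=shifted_distance)
```

Allowing a translation before measuring the gap lets sequences with the same shape but slightly different timing fall into the same characteristic time sequence cluster.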

In an optional embodiment, the clustering module is further configured to perform end-to-end splicing on the second time series data; and matching the spliced result with the one or more characteristic time sequence clusters respectively, and taking the characteristic time sequence cluster with the highest matching value as the target characteristic time sequence cluster.
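The matching step can be sketched as follows. The embodiment does not specify how the matching value is computed, so this sketch assumes that each characteristic time sequence cluster is summarised by a representative profile and that the matching value decreases with the mean absolute gap to that profile. The names `splice`, `matching_value`, `select_target_cluster` and the example profiles are hypothetical.

```python
from itertools import chain


def splice(segments):
    """Join the segments head to tail into one sequence."""
    return list(chain.from_iterable(segments))


def matching_value(series, profile):
    """Higher is better: 1 / (1 + mean absolute gap over the common length)."""
    n = min(len(series), len(profile))
    gap = sum(abs(series[i] - profile[i]) for i in range(n)) / n
    return 1.0 / (1.0 + gap)


def select_target_cluster(segments, cluster_profiles):
    """Return the name of the cluster whose profile best matches the spliced data."""
    spliced = splice(segments)
    return max(cluster_profiles,
               key=lambda name: matching_value(spliced, cluster_profiles[name]))


profiles = {"spike": [10, 90, 10, 10], "plateau": [45, 60, 60, 45]}
target = select_target_cluster([[40, 60], [58, 42]], profiles)
```

The second data processing model trained for the selected cluster would then be used to process the second time series data.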

In an optional embodiment, the clustering module is further configured to cut second timing data for training included in the feature timing cluster, and splice cut results to generate new second timing data for training; and training a preset second type of model to be trained by using the second time sequence data for training and the new second time sequence data for training which are included in the characteristic time sequence cluster.
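The cutting and splicing of training data can be illustrated with a short sketch. It assumes, as one possible reading, that every sequence in a cluster is cut at the same position and that the head of one sequence is joined to the tail of another; the function name `cut_and_splice` and the parameter `cut` are hypothetical.

```python
from itertools import permutations


def cut_and_splice(cluster_series, cut):
    """Build new sequences from the head of one series and the tail of another."""
    new_series = []
    for a, b in permutations(cluster_series, 2):
        new_series.append(list(a[:cut]) + list(b[cut:]))
    return new_series


cluster = [[1, 2, 3, 4], [5, 6, 7, 8]]
augmented = cut_and_splice(cluster, cut=2)
training_set = cluster + augmented  # original plus newly generated sequences
```

Because the sequences in one characteristic time sequence cluster are similar in shape, the spliced sequences remain plausible members of that cluster while enlarging the training set.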

Fig. 14 shows an exemplary system architecture 1400 of a data processing apparatus or method to which embodiments of the invention may be applied.

As shown in fig. 14, the system architecture 1400 may include terminal devices 1401, 1402, 1403, a network 1404, and a server 1405. The network 1404 serves to provide a medium for communication links between the terminal devices 1401, 1402, 1403 and the server 1405. The network 1404 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user can use the terminal devices 1401, 1402, 1403 to interact with the server 1405 through the network 1404 to send a task execution request or receive response information of the request, or the like. Various communication client applications, such as an online service application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 1401, 1402, and 1403.

The terminal devices 1401, 1402, 1403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop and desktop computers, and the like.

The server 1405 may be a server that provides various services, such as a background management server that provides support for online service requests sent by users using the terminal devices 1401, 1402, and 1403, or a server that processes data. The background management server may analyze and perform other processing on the received data such as the time series data, and feed back a processing result (for example, an output prediction result of the article demand condition) to the terminal device.

It should be noted that the method for data processing provided by the first aspect of the embodiment of the present invention is generally executed by the server 1405, and accordingly, the apparatus for data processing provided by the second aspect of the embodiment of the present invention is generally disposed in the server 1405.

It should be understood that the number of terminal devices, networks, and servers in fig. 14 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 15, shown is a block diagram of a computer system 1500 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 15 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 15, the computer system 1500 includes a Central Processing Unit (CPU) 1501 which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 1502 or a program loaded from a storage section 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data necessary for the operation of the system 1500 are also stored. The CPU 1501, the ROM 1502, and the RAM 1503 are connected to each other by a bus 1504. An input/output (I/O) interface 1505 is also connected to bus 1504.

The following components are connected to the I/O interface 1505: an input section 1506 including a keyboard, a mouse, and the like; an output section 1507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 1508 including a hard disk and the like; and a communication section 1509 including a network interface card such as a LAN card, a modem, or the like. The communication section 1509 performs communication processing via a network such as the internet. A drive 1510 is also connected to the I/O interface 1505 as needed. A removable medium 1511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1510 as necessary, so that a computer program read out therefrom is installed into the storage section 1508 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1509, and/or installed from the removable medium 1511. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 1501.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a determination module and a prediction module. The names of these modules do not in some cases constitute a limitation on the module itself, and for example, the determination module may also be described as a "module for determining time series data to be processed".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following:

for the case in which the time sequence data to be processed has a quantity mutation point at a set time point within the historical target time period, dividing the time sequence data to be processed into first time sequence data and second time sequence data according to the quantity mutation point; determining a target characteristic time sequence cluster to which the second time sequence data belongs, processing the first time sequence data by using a preset first data processing model, and processing the second time sequence data by using a second data processing model matched with the target characteristic time sequence cluster.

The method and device for data processing address the case in which the time sequence data to be processed has a quantity mutation point at the set time point within the historical target time period. The first time sequence data and the second time sequence data in the time sequence data to be processed are processed by the first data processing model and the second data processing model respectively, and a first prediction result of the item demand within the prediction time sequence is finally determined according to the first processing result of the first data processing model and the processing result of the second data processing model, so that the inventory of the item within the prediction time sequence can be set based on the first prediction result. In this way, different time sequence data are processed by different processing models, and the accuracy of the processing result is improved.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of data processing, the method comprising:

for the case in which the time sequence data to be processed has a quantity mutation point at a set time point within the historical target time period, dividing the time sequence data to be processed into first time sequence data before the set time point and second time sequence data after the set time point; determining a target characteristic time sequence cluster to which the second time sequence data belongs, processing the first time sequence data by using a preset first data processing model, and processing the second time sequence data by using a second data processing model matched with the target characteristic time sequence cluster; wherein the quantity mutation point is determined by big data statistics;

determining a first prediction result of the demand condition of the goods in a prediction time sequence according to the first processing result of the first data processing model and the processing result of the second data processing model, so as to set the stock of the goods in the prediction time sequence based on the first prediction result;

wherein the first processing result of the first data processing model comprises: a prediction result of the item demand before the set time point within the prediction time sequence;

the processing result of the second data processing model comprises: a prediction result of the item demand after the set time point within the prediction time sequence;

the determining a first prediction result of the item demand within the prediction time sequence comprises:

splicing the prediction result of the item demand before the set time point within the prediction time sequence with the prediction result of the item demand after the set time point within the prediction time sequence to obtain the first prediction result.

2. The method of claim 1, wherein after the determining the time series data to be processed, further comprising:

processing the time sequence data to be processed by utilizing a preset first data processing model;

3. The method of claim 2, wherein after the determining the time series data to be processed, further comprising:

in the case where analysis shows that the time series data to be processed does not satisfy the normality assumption,

determining, by using a Pettitt test method, whether or not the time sequence data to be processed has a quantity mutation point at the set time point within the historical target time period.

4. The method of claim 1, further comprising:

screening out, from all item time sequence data of the test cases, target item time sequence data that has a quantity mutation point;

segmenting, from the target item time sequence data, second time sequence data for training that is located after the set time point, wherein the data at the time points included in the second time sequence data for training are equal to or greater than the value of the quantity mutation point included in the target item time sequence data;

the determining the target feature timing cluster to which the second timing data belongs includes:

and selecting a target characteristic time sequence cluster for the second time sequence data from one or more characteristic time sequence clusters.

5. The method of claim 4, wherein the selecting the target feature timing cluster for the second timing data comprises:

for the case where there are a plurality of the characteristic time sequence clusters,

matching the second time sequence data with each characteristic time sequence cluster;

and screening out a target characteristic time sequence cluster from the plurality of characteristic time sequence clusters according to a matching result.

6. The method of claim 4, further comprising:

segmenting, from the target item time series data, first time series data for training that is located before the set time point, wherein the data at the time points included in the first time series data for training are smaller than the value of the quantity mutation point included in the target item time series data;

clustering other article time sequence data except the target article time sequence data in the test case with the first time sequence data for training;

and training a preset first type of model to be trained by utilizing the clustering result to obtain a first data processing model, wherein the first type of model to be trained is obtained based on one or more basic model configurations, and the basic model comprises one or more of ARIMA, ETS, Croston, simple moving average, FBProphet, Holt-Winters and first-order exponential smoothing.

7. The method of claim 6, further comprising:

and aiming at each kind of characteristic time sequence cluster, training a preset second type of model to be trained by utilizing the time sequence data corresponding to the characteristic time sequence cluster to obtain a second data processing model matched with the characteristic time sequence cluster, wherein the second type of model to be trained is obtained based on one or more base model configurations, and the second type of model to be trained is different from the first type of model to be trained.

8. The method of claim 4, wherein the screening out, from all item time series data of the test cases, of the target item time series data that has a quantity mutation point comprises:

for the item time sequence data of each warehouse item identifier in the test case, the following operations are executed:

determining a data segmentation point;

and determining the article time sequence data of the warehouse article identifier as the target article time sequence data under the condition that the data segmentation point is the quantity mutation point of the article time sequence data of the warehouse article identifier.

9. The method of claim 4, wherein clustering the training second timing data comprises:

determining a sequence translation function and a translation distance for a preset distance function;

and clustering the second time sequence data for training by using the DBSCAN algorithm together with the distance function for which the sequence translation function and the translation distance have been determined.

10. The method of claim 4, wherein the selecting the target feature timing cluster for the second timing data comprises:

performing head-to-tail splicing on the second time series data;

and matching the spliced result with the one or more characteristic time sequence clusters respectively, and taking the characteristic time sequence cluster with the highest matching value as the target characteristic time sequence cluster.

11. The method according to claim 7, wherein the training a preset second type of model to be trained by using the timing data corresponding to the feature timing cluster includes:

cutting second time sequence data for training included in the characteristic time sequence cluster, and splicing the cutting results to generate new second time sequence data for training;

and training a preset second type of model to be trained by using the second time sequence data for training and the new second time sequence data for training which are included in the characteristic time sequence cluster.

12. An apparatus for data processing, comprising:

the prediction module is used for, in the case that the time sequence data to be processed has a quantity mutation point at the set time point within the historical target time period, dividing the time sequence data to be processed, according to the quantity mutation point, into first time sequence data located before the set time point and second time sequence data located after the set time point; determining a target characteristic time sequence cluster to which the second time sequence data belongs, processing the first time sequence data by using a preset first data processing model, and processing the second time sequence data by using a second data processing model matched with the target characteristic time sequence cluster; wherein the quantity mutation point is determined by big data statistics;

and splicing the prediction result of the article demand condition before the set time point in the prediction time sequence with the prediction result of the article demand condition after the set time point in the prediction time sequence to obtain the first prediction result.

13. An apparatus for data processing, comprising: one or more processors;

a storage system for storing one or more programs,

the one or more programs, when executed by the one or more processors, causing the one or more processors to implement the method of any one of claims 1-11.

14. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-11.