
CN116308486A - Target cigarette sales prediction method and device, electronic equipment and storage medium

Info

Publication number: CN116308486A
Application number: CN202310257649.0A
Authority: CN
Inventors: 谭茜; 莫玉华
Original assignee: China Tobacco Guangxi Industrial Co Ltd
Current assignee: China Tobacco Guangxi Industrial Co Ltd
Priority date: 2023-03-16
Filing date: 2023-03-16
Publication date: 2023-06-23

Abstract

The application provides a target cigarette sales prediction method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring historical cigarette data; determining, according to the historical cigarette data and a preset algorithm, a control group corresponding to the target cigarette and fitting weights corresponding to a first preset number of reference cigarettes in the control group, wherein the correlation between the reference cigarettes and the target cigarette is larger than a first preset threshold; deleting data other than the data corresponding to the control group and the target cigarette from the historical cigarette data to obtain residual data; inputting the residual data into a prediction integration model to obtain the reference predicted sales of the first preset number of reference cigarettes in the control group; and obtaining the predicted sales of the target cigarette according to the reference predicted sales and the corresponding fitting weights. The method and the device solve the problem in the related art that the sales of new cigarettes cannot be accurately predicted.

Description

Target cigarette sales prediction method and device, electronic equipment and storage medium

Technical Field

The invention relates to the interdisciplinary technical field of computer science and econometrics, and in particular to a target cigarette sales prediction method and device, electronic equipment and a storage medium.

Background

With the continuous progress of social informatization, systematization and intelligentization, big data application technologies centered on machine learning and deep learning have gradually become an important tool in the development of many industries. Methods such as data mining, neural networks, data visualization and information fusion have proved effective and feasible when combined with the practical application scenarios of these industries. The tobacco industry has made many efforts and attempts in digital marketing, but existing methods generally focus on using time series to predict the sales of existing mature cigarettes and thereby obtain a specific delivery model. The technical route of such prediction models rests on two main preconditions: first, the time series of the cigarette is long, so that the model can capture changes in the data over a long time dimension; second, the time series of the cigarette is stable, so that future data can be reliably predicted from past data or from data of similar varieties. However, the data available for predicting the sales of new cigarettes cover a short interval and fluctuate strongly, which restricts the application of traditional time series models to the sales prediction of new cigarettes.

The existing scheme for predicting the sales of new cigarettes uses the deep neural network LSTM, but it has three problems: first, LSTM has inherent drawbacks, in that it is ill-suited to parallel processing, handles longer time series with difficulty, and consumes a large amount of training time; second, the historical data of new cigarettes are severely insufficient; third, owing to early-stage publicity and customers' eagerness to try something new, the historical data of new cigarettes fluctuate sharply, so LSTM has difficulty capturing the sales trend of new cigarettes, which leads to overestimation or underestimation and thus to large sales prediction errors.

Therefore, the problem that the sales of new cigarettes cannot be accurately predicted exists in the prior art.

Disclosure of Invention

The application provides a target cigarette sales predicting method, device, electronic equipment and storage medium, which at least solve the problem that the sales of new cigarettes cannot be accurately predicted in the related technology.

According to an aspect of the embodiments of the present application, there is provided a target cigarette sales prediction method, including:

acquiring historical cigarette data;

determining, according to the historical cigarette data and a preset algorithm, a control group corresponding to a target cigarette and fitting weights corresponding to a first preset number of reference cigarettes in the control group, wherein the correlation between the reference cigarettes and the target cigarette is larger than a first preset threshold;

deleting data other than the data corresponding to the control group and the target cigarette from the historical cigarette data to obtain residual data;

inputting the residual data into a prediction integration model to obtain the reference predicted sales of the first preset number of reference cigarettes in the control group;

and obtaining the predicted sales of the target cigarettes according to the reference predicted sales and the corresponding fitting weights.

According to another aspect of the embodiments of the present application, there is provided a target cigarette sales predicting apparatus, including:

the acquisition module is used for acquiring historical cigarette data;

the determining module is used for determining, according to the historical cigarette data and a preset algorithm, a control group corresponding to the target cigarette and fitting weights corresponding to a first preset number of reference cigarettes in the control group, wherein the correlation between the reference cigarettes and the target cigarette is larger than a first preset threshold;

the first obtaining module is used for deleting data except the data corresponding to the control group and the target cigarettes from the historical cigarette data to obtain residual data;

the second obtaining module is used for inputting the residual data into a prediction integration model to obtain the reference predicted sales of the first preset number of reference cigarettes in the control group;

and a third obtaining module, configured to obtain the predicted sales of the target cigarette according to the reference predicted sales and the corresponding fitting weights.

Optionally, the apparatus further comprises:

a fourth obtaining module, configured to calculate the variances of all numerical features in the historical cigarette data and delete the numerical features whose variances are smaller than a second preset threshold, so as to obtain first intermediate data;

the judging module is used for judging whether error data exist in the first intermediate data;

the restoration module is used for restoring the error data in the first intermediate data by using a preset method if the error data exist in the first intermediate data, so as to obtain second intermediate data;

the dividing module is used for dividing all the features in the second intermediate data into numerical features and non-numerical features;

and the processing module is used for normalizing the numerical characteristics and digitizing the non-numerical characteristics.

Optionally, the determining module includes:

the first obtaining unit is used for analyzing the relevance between the reference cigarettes and the target cigarettes according to a preset algorithm, and selecting a first preset number of reference cigarettes according to the relevance to obtain the control group;

the acquisition unit is used for acquiring a first matching variable vector of the target cigarette and a second matching variable vector of the reference cigarette from the historical cigarette data, wherein the first matching variable vector is used for determining the historical cigarette data corresponding to the target cigarette, and the second matching variable vector is used for determining the historical cigarette data corresponding to the reference cigarette;

the second obtaining unit is used for obtaining a reference variable according to the first matching variable vector, the second matching variable vector, a preset diagonal matrix, an intermediate weight and a preset formula, wherein the intermediate weight is a dynamic change value in the process of determining the fitting weight;

and a unit configured to take the intermediate weight corresponding to the reference variable with the smallest value as the fitting weight.

Optionally, the repair module includes:

a selecting unit, configured to select a data column in which the error data exists from the first intermediate data;

the splitting unit is used for splitting the first intermediate data according to the data column to obtain a data set containing the error data;

the third obtaining unit is used for obtaining a data repair model according to a preset algorithm and the data set;

a fourth obtaining unit, configured to obtain a predicted data set corresponding to the data set according to the data repair model, where the predicted data set is used to process the error data;

and the restoration unit is used for restoring the first intermediate data based on the prediction data set to obtain the second intermediate data.

Optionally, the apparatus further comprises:

the input module is used for inputting the residual data into a plurality of initial prediction models to respectively obtain initial prediction values corresponding to the initial prediction models;

a fifth obtaining module, configured to obtain a loss value corresponding to each initial predicted value according to the residual data, the initial predicted value, and a preset function;

the adjusting module is used for adjusting the model parameters of the initial prediction model corresponding to the loss value according to a second preset algorithm and the loss value until the loss value is smaller than a third preset threshold value to obtain a plurality of prediction models;

the construction module is used for constructing an integrated model based on a preset logistic regression algorithm;

and a sixth obtaining module, configured to obtain the prediction integrated model according to the prediction model and the integrated model.

Optionally, the second obtaining module includes:

a fifth obtaining unit, configured to input the residual data into each prediction model, to obtain intermediate reference predicted sales of the first preset number of reference cigarettes corresponding to each prediction model;

and a sixth obtaining unit, configured to obtain, according to the integrated model and the intermediate reference predicted sales, the reference predicted sales of a first preset number of reference cigarettes.

Optionally, the sixth obtaining unit includes:

and the obtaining submodule is used for fusing the intermediate reference predicted sales by using a logistic regression algorithm to respectively obtain the predicted sales of the first preset number of reference cigarettes.

According to yet another aspect of the embodiments of the present application, there is also provided an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete communication with each other through the communication bus; wherein the memory is used for storing a computer program; a processor for performing the method steps of any of the embodiments described above by running the computer program stored on the memory.

According to a further aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the method steps of any of the embodiments described above when run.

In the embodiments of the application, historical cigarette data are acquired; a control group corresponding to the target cigarette and fitting weights corresponding to a first preset number of reference cigarettes in the control group are determined according to the historical cigarette data and a preset algorithm, wherein the correlation between the reference cigarettes and the target cigarette is larger than a first preset threshold; data other than the data corresponding to the control group and the target cigarette are deleted from the historical cigarette data to obtain residual data; the residual data are input into a prediction integration model to obtain the reference predicted sales of the first preset number of reference cigarettes in the control group; and the predicted sales of the target cigarette are obtained according to the reference predicted sales and the corresponding fitting weights. In this method, reference cigarettes that are highly correlated with the target cigarette are selected by the preset algorithm to form the control group, and their fitting weights are determined; the prediction integration model predicts the sales of the reference cigarettes in the control group, and the predicted sales of the target cigarette are obtained by combining these predictions with the fitting weights, so that the stable sales trend of mature cigarettes is used to simulate and calibrate the sales trend of new cigarettes. This overcomes the limitations encountered in predicting the sales of new cigarettes and fills the technical gap in delivery models for new cigarettes. The method and the device can therefore effectively solve the problem in the related art that the sales of new cigarettes cannot be accurately predicted.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a flow chart of an alternative target cigarette sales prediction method according to an embodiment of the present application;

FIG. 2 is a flow chart of an alternative data process according to an embodiment of the present application;

FIG. 3 is a flow chart of an alternative method of filling missing values according to an embodiment of the present application;

FIG. 4 is an overall flowchart of an alternative new cigarette sales prediction method according to an embodiment of the present application;

FIG. 5 is a block diagram of an alternative target cigarette sales prediction device according to an embodiment of the present application;

FIG. 6 is a block diagram of an alternative electronic device according to an embodiment of the present application.

Detailed Description

In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

New cigarettes often lack long time series data, and the short time span of the sample data leads to estimation errors in traditional time series models. In addition, because of early-stage publicity and customers' eagerness to try something new, the data of new cigarettes tend to fluctuate sharply in the early stage of sales and to stabilize later, whereas the overall sales trend of mature cigarette products is stable. Predicting with time series data easily amplifies the sales fluctuations in the new cigarette data, which increases the estimation error and in turn affects decisions on the delivery strategy for the new cigarettes. The prior art is aimed at mature cigarettes with long time series, while the prediction of new cigarettes is constrained by the short time span of their historical sales data, so the prediction quality required by the industry is difficult to achieve. Therefore, how to predict the sales of new cigarettes under the limitation of a short data interval is an important problem to be solved in the current development of the industry.

It should be noted that the target cigarettes in the application include new cigarettes and other cigarettes that lack long time series data and have a short sample data time span. The new cigarettes may be cigarettes in a new product introduction period or cigarettes that have been on the market in a certain area for a short time (e.g. less than 3 months). For convenience, new cigarettes and other cigarettes that lack long time series data and have a short sample data time span are hereinafter collectively referred to as new cigarettes. During the new product introduction period, on the basis of thorough market research, the consumer group, price, packaging, taste, delivery strategy and so on are clearly identified, the cigarette is positioned, and a systematic brand cultivation model is built.

Based on the foregoing, according to an aspect of the embodiments of the present application, there is provided a target cigarette sales prediction method, as shown in fig. 1, a flow of the method may include the following steps:

step S101, historical cigarette data are obtained.

Optionally, the application takes a new cigarette product C in province A in 2020 as an example, and uses 18 months of retailer-level cigarette sales data from city B of province A as the research sample. The historical cigarette data include the historical sales data of mature cigarettes and new cigarettes, retail sales data, market consumer profile data and the short-term historical sales volume of the new cigarettes. They mainly comprise variables such as cigarette type (mainly regular, slim, medium-size and capsule cigarettes), brand (mainly ZH and ZL cigarettes), grade (mainly high-end, medium-to-low-end and low-end cigarettes) and sales price. The short-term historical sales volume covers 2-3 months of sales data of the new cigarettes. The mature cigarettes are the reference cigarettes and the new cigarettes are the target cigarettes; in the following embodiments they are referred to as mature cigarettes and new cigarettes, respectively.

Step S102, determining fitting weights corresponding to a first preset number of reference cigarettes in a control group corresponding to the target cigarettes according to historical cigarette data and a preset algorithm, wherein the correlation between the reference cigarettes and the target cigarettes is larger than a first preset threshold.

Optionally, according to the standard label data of the mature cigarettes and the new cigarettes, a preset algorithm such as the synthetic control method is used to analyze and determine the correlation of each mature cigarette with the new cigarette, and a first preset number of mature cigarettes with high correlation (i.e. correlation larger than a first preset threshold) are selected as reference cigarettes to form a control group. The first preset number denotes a plurality; no specific number is imposed here.

Then, the fitting weight of each mature cigarette relative to the new cigarette is calculated by the synthetic control method from the historical sales data of the reference cigarettes in the control group and the short-term historical sales volume of the new cigarette.

And step S103, deleting data except the data corresponding to the control group and the target cigarettes from the historical cigarette data to obtain residual data.

Optionally, after the control group is generated, data unrelated to the reference cigarettes and the target cigarette can be deleted from the historical cigarette data in order to simplify the data and improve calculation efficiency, yielding simplified, cleaned data, namely the residual data.
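Step S103 can be illustrated with a minimal sketch. The record layout (a list of dictionaries with a `cigarette` key), the function name `keep_control_and_target` and the sample values are assumptions made only for illustration; the application does not specify a data format.

```python
def keep_control_and_target(records, control_group, target):
    """Return only the records of the reference cigarettes and the target cigarette."""
    wanted = set(control_group) | {target}
    return [record for record in records if record["cigarette"] in wanted]


# Hypothetical historical cigarette data: C is the new (target) cigarette,
# M1 and M2 form the control group, M3 was not selected.
history = [
    {"cigarette": "C", "month": 1, "sales": 120},
    {"cigarette": "M1", "month": 1, "sales": 900},
    {"cigarette": "M2", "month": 1, "sales": 450},
    {"cigarette": "M3", "month": 1, "sales": 300},
]

residual_data = keep_control_and_target(history, control_group=["M1", "M2"], target="C")
```

Only the rows for C, M1 and M2 remain in `residual_data`; the row for M3 is dropped.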

Step S104, inputting the residual data into a prediction integration model to obtain the reference predicted sales of the first preset number of reference cigarettes in the control group.

Optionally, the prediction integration model is obtained by integrating a plurality of prediction models. The prediction models may be built separately with algorithms such as CatBoost, NeuralNetTorch, LightGBM and RandomForest, and fused into the prediction integration model by a fusion algorithm such as the LR (logistic regression) fusion algorithm.

The residual data are input into each prediction model, each prediction model predicts the sales of each reference cigarette, and the prediction results of the prediction models are fused by the LR fusion algorithm to obtain the reference predicted sales of each reference cigarette.

Step S105, obtaining the predicted sales of the target cigarettes according to the reference predicted sales and the corresponding fitting weights.

Optionally, the fitting weights calculated by the synthetic control method are used to compute a weighted sum of the reference predicted sales of the reference cigarettes, which gives the predicted sales of the target cigarette (new cigarette).
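The weighted summation of step S105 can be sketched as follows. The function name `predict_target_sales`, the dictionary layout and the numbers are hypothetical and serve only to make the arithmetic concrete.

```python
def predict_target_sales(reference_sales, fitting_weights):
    """Weighted sum of the reference predicted sales, one weight per reference cigarette."""
    if set(reference_sales) != set(fitting_weights):
        raise ValueError("every reference cigarette needs exactly one fitting weight")
    return sum(fitting_weights[name] * reference_sales[name] for name in reference_sales)


# Hypothetical reference predicted sales and fitting weights (weights sum to 1).
reference_sales = {"M1": 1000.0, "M2": 400.0}
fitting_weights = {"M1": 0.25, "M2": 0.75}

target_sales = predict_target_sales(reference_sales, fitting_weights)
```

With these illustrative values the prediction is 0.25 × 1000 + 0.75 × 400 = 550.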

As an alternative embodiment, after acquiring the historical cigarette data, the method further comprises:

calculating variances corresponding to all numerical features in the historical cigarette data, and deleting the numerical features with the corresponding variances smaller than a second preset threshold value to obtain first intermediate data;

judging whether error data exists in the first intermediate data;

if the first intermediate data has error data, repairing the error data in the first intermediate data by using a preset method to obtain second intermediate data;

dividing all the features in the second intermediate data into numerical features and non-numerical features;

normalizing the numerical features and digitizing the non-numerical features.

Optionally, the historical cigarette data undergo data preprocessing, which comprises filling missing values, detecting abnormal values and digitizing non-numerical values. Missing values and abnormal values are handled with the same filling scheme, namely a random forest algorithm; missing values and abnormal values are collectively called error data.

The process is as follows. Features are first screened by variance: the variances of all numerical features in the historical cigarette data are calculated, and the numerical features whose variance is smaller than a second preset threshold, for example 5, are deleted to obtain the first intermediate data. The error data in the first intermediate data, comprising missing values and abnormal values, are then repaired with the random forest algorithm by filling in the missing values and the abnormal values, which gives the second intermediate data. All features in the second intermediate data are divided into numerical features and non-numerical features; the numerical features are normalized, and the non-numerical features are numerically encoded, i.e. digitized.
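The variance screening, normalization and digitization steps can be sketched in a few lines. Several choices below are assumptions, since the application does not fix them: population variance is used, normalization is taken to be min-max scaling, digitization is taken to be integer label encoding, and the function names and sample columns are hypothetical. Repair of error data is not shown here.

```python
from statistics import pvariance


def drop_low_variance(columns, threshold=5.0):
    """Keep numerical columns whose (population) variance is at least `threshold`."""
    return {name: values for name, values in columns.items() if pvariance(values) >= threshold}


def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def label_encode(values):
    """Map each distinct non-numerical value to an integer code in order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    return [codes[value] for value in values]


# Hypothetical columns of historical cigarette data.
numeric = {"price": [10.0, 20.0, 30.0], "flag": [1.0, 1.0, 2.0]}
kept = drop_low_variance(numeric)                      # "flag" is removed (variance < 5)
normalized_price = min_max_normalize(kept["price"])    # values scaled to [0, 1]
type_codes = label_encode(["slim", "regular", "slim"]) # integer codes per category
```

Min-max scaling and label encoding are only one possible pairing; z-score scaling or one-hot encoding would fit the same description.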

The application further discloses another data processing logic. As shown in FIG. 2, the flow includes: reading the data; judging whether the data have been preprocessed and, if not, deleting the features whose variance is less than 5; deleting irrelevant cigarette varieties according to the synthetic control method; judging whether non-numerical features exist and, if so, converting them into numerical features; performing original copy replacement; substituting the data into the model for training; and obtaining the result. In the flowchart, the data are the historical cigarette data and the cigarette varieties are the cigarettes; substituting the data into model training corresponds to subsequently inputting them into the prediction integration model to obtain the sales of the new cigarettes. The historical cigarette data can equally be processed with this logic.

In the embodiment of the application, the historical cigarette data is more reliable by carrying out data preprocessing on the historical cigarette data, so that the predicted sales of new cigarettes is more accurate.

As an optional embodiment, determining, according to historical cigarette data and a preset algorithm, fitting weights corresponding to a first preset number of reference cigarettes in a control group corresponding to a target cigarette, including:

analyzing the relevance of the reference cigarettes and the target cigarettes according to a preset algorithm, and selecting a first preset number of reference cigarettes according to the relevance to obtain a control group;

acquiring a first matching variable vector of a target cigarette and a second matching variable vector of a reference cigarette from historical cigarette data, wherein the first matching variable vector is used for determining historical cigarette data corresponding to the target cigarette, and the second matching variable vector is used for determining historical cigarette data corresponding to the reference cigarette;

obtaining a reference variable according to the first matching variable vector, the second matching variable vector, a preset diagonal matrix, an intermediate weight and a preset formula, wherein the intermediate weight is a dynamic change value in the process of determining the fitting weight, and the reference variable is used for determining the fitting weight;

and taking the intermediate weight corresponding to the reference variable with the smallest value as the fitting weight.

Optionally, a preset algorithm such as the synthetic control method analyzes the similarity (i.e. correlation) between each mature cigarette and the new cigarette and assigns each mature cigarette a different fitting weight with respect to the new cigarette. If a mature cigarette is strongly similar to the new cigarette, its fitting weight is large; if the similarity is weak, its fitting weight is small.

The specific steps are as follows. The correlation between the reference cigarettes and the target cigarette is analyzed by the synthetic control method, and the first preset number of mature cigarettes most correlated with the target new cigarette are screened out to obtain the control group, where the first preset number denotes a plurality and no specific number is imposed here.

And acquiring a first matching variable vector of the target cigarette (new cigarette) and a second matching variable vector of the reference cigarette from the historical cigarette data.

The idea of determining the fitting weight corresponding to each reference cigarette by adopting the synthesis control method is as follows:

assuming a total of j+1 cigarettes, a mature cigarettes from among which a number a represents a first preset number are selected as reference cigarettes by a synthetic control method.

Assuming that the first cigarette (j=1) is a target cigarette for which sales are expected to be predicted in the application, the rest (j=2, 3, …, j+1) is a mature cigarette, the time of data is t, and for each cigarette J and time t, the information contained in the matching variable vector comprises: cigarette sales, cigarette class, cigarette standard, price interval, retail price of the package, etc. For cigarette j and time t, define X _jt Representing its matching variable vector. For reference cigarettes (i=1, 2, …, a), X _it Representing its matching variable vector at t (i.e., the second matching variable vector), j=1, x for the target cigarette _1t The matching variable vector (namely a first matching variable vector) at t is represented, and the information contained in the first matching variable vector comprises variables such as cigarette sales volume, cigarette class, cigarette standard, price interval, package retail price and the like of target cigarettes.

In order to obtain the fitting weight of the reference cigarette to the target cigarette, the weight obtained by the synthesis control method can be represented by an Ax1 weight vector, namely omega _i ＝(w ₁ ,...,w _A ) Wherein each weight is non-negative, the sum of the weights is 1, the weight is selected as the most important ring in the synthesis control method, the result of the research method only has significance when the reference cigarette can be better fit with the 'real region', and the equation only needs to be applied to x _1t -X _it ω _i And (5) minimum value is obtained. The weight ω needs to be selected _i So that the X of variables including cigarette sales, cigarette class, cigarette standard, price interval, retail price of the strip package and the like _it As close as possible to x _1t Each matching characteristic of the weighted reference cigarettes is close to the target cigarettes as much as possible. To measure this distance, a quadratic (distance between two points in Euclidean-like space) matching equation setting can be used. In determining a prediction error that is to be made based on a synthetic control method In the process of fitting weights with the smallest difference, the fitting weights are called intermediate weights, and the intermediate weights are dynamic change values in the process of determining the fitting weights. The first matching variable vector x _1t Second matching variable vector X _it (i=1, 2, …, a), a preset diagonal matrix V, and an intermediate weight ω of a first preset number of reference cigarettes _i Substituting the reference variable (x) into the formula (1), namely a preset formula, to obtain the reference variable (x _1t -X _it ω _i ) ^T V(x _1t -X _it ω _i ) Wherein V is a diagonal matrix of dimension (k×k), k being set as desired.

At time t, a set of intermediate weights is selected such that the reference variable (x_1t - X_it ω_i)^T V (x_1t - X_it ω_i) is minimized, and the intermediate weights in this set are used as the fitting weights of the corresponding reference cigarettes.

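$$\min_{\omega}\ (x_{1t} - X_{it}\,\omega_i)^{T}\, V\, (x_{1t} - X_{it}\,\omega_i) \tag{1}$$

$$\text{s.t.}\quad w_i \ge 0\ \ (i = 1, \dots, A),\qquad \sum_{i=1}^{A} w_i = 1 \tag{2}$$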

In formula (2), s.t. is an abbreviation of "subject to" and indicates that formula (2) is the constraint condition of formula (1); (x_1t - X_it ω_i)^T denotes the transpose of (x_1t - X_it ω_i).

In the embodiment of the application, the control group constructed from the mature cigarettes and the fitting weight of each mature cigarette relative to the new cigarette are determined based on the synthetic control method, which helps make the prediction of the subsequent machine learning more definite and reliable.
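As a hedged illustration of formula (1) under the non-negativity and sum-to-one constraints, the following minimal Python sketch finds the weights by projected gradient descent. The solver choice, the function names `project_to_simplex` and `fit_weights`, and the toy matrices are assumptions made for illustration; the application does not specify how the minimization is carried out.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def fit_weights(x1, X, V, steps=2000):
    """Minimise (x1 - X w)^T V (x1 - X w) subject to w >= 0, sum(w) = 1."""
    n_ref = X.shape[1]
    w = np.full(n_ref, 1.0 / n_ref)           # start from equal intermediate weights
    lipschitz = 2.0 * np.linalg.norm(X.T @ V @ X, 2)
    step = 1.0 / (lipschitz + 1e-12)
    for _ in range(steps):
        grad = -2.0 * X.T @ V @ (x1 - X @ w)  # gradient of the quadratic form
        w = project_to_simplex(w - step * grad)
    return w

# Toy data (hypothetical): k = 4 matching variables, A = 3 reference cigarettes.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
x1 = X @ np.array([0.7, 0.3, 0.0])            # target built from known weights
V = np.eye(4)                                 # preset diagonal matrix
weights = fit_weights(x1, X, V)
```

Because the toy target is built from known weights, the recovered `weights` can be compared against them as a sanity check; with real data the minimum of the reference variable is generally not zero.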

As an optional embodiment, repairing the error data in the first intermediate data by using a preset method to obtain second intermediate data, including:

selecting a data column with error data from the first intermediate data;

splitting the first intermediate data according to the data column to obtain a data group with error data;

obtaining a data repair model according to a preset algorithm and the data group;

obtaining a predicted data group corresponding to the data group according to the data repair model, wherein the predicted data group is used for processing the error data;

and repairing the first intermediate data based on the predicted data group to obtain second intermediate data.

Optionally, the error data includes a missing value and an abnormal value, wherein the repair method and repair steps for the missing value and the abnormal value are the same, and the repair is performed by using a random forest algorithm (i.e. a preset algorithm).

Random forests are used here to fill in missing values for the following reasons. 1. The characteristics of a cigarette are correlated, so a given characteristic can be predicted from some of the other characteristics; a missing value can therefore be predicted from the relationships among the feature vectors of the same product. 2. When a feature of a cigarette has missing values, that feature is treated as the target and the remaining features form a new feature vector (the original features with the target feature removed; these remaining features should be free of missing values), and a random forest is used for modeling. The samples without missing values in the target feature are used as the training set, the samples with missing values are used as the test set, and the predicted results are filled into the missing values. 3. When several features of the original cigarette data have missing values, processing starts from the feature with the fewest missing values, with the missing values of the other features temporarily filled with 0; after the missing values of that feature have been predicted with the random forest, the predictions are written back into the original data, and the next feature with missing values is processed in the same way. Outliers are repaired with the random forest algorithm for the same reasons, which are not repeated here.

The repair process for missing values in this embodiment is described below with reference to the above and to fig. 3. As shown in fig. 3, the first intermediate data is stored in a CSV file, and reading the target CSV file corresponds to starting to process the first intermediate data. Whether a missing value exists is then judged. If not, the flow ends directly. If so, the following steps are performed: a column with missing values is arbitrarily selected, i.e., a data column with error data is selected from the first intermediate data; according to the selected column, the data are divided into a group with missing values and a group without missing values, i.e., the first intermediate data is split according to the data column to obtain a data group with error data; a random forest is trained on the group without missing values, i.e., a data repair model is obtained according to a preset algorithm and the data group; the model trained in the previous step is used to predict the data of the group with missing values, i.e., a predicted data group corresponding to the data group is obtained according to the data repair model; and the missing values are filled with the predictions from the previous step, i.e., the first intermediate data is repaired based on the predicted data group to obtain the second intermediate data.

The procedure for repairing the outliers is the same as that described above, and will not be repeated here.

In the embodiment of the application, error data are repaired with a random forest: the regularities among the data are learned by machine learning, which is more reliable than the traditional approach of repairing error data with the mean, median or mode.
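The repair flow described above can be sketched as follows. This is a minimal example under stated assumptions: it uses scikit-learn's `RandomForestRegressor`, represents the first intermediate data as a numeric NumPy array with `NaN` marking missing values, and the names `repair_column` and `repair_all` are hypothetical. Reading the CSV file and detecting outliers are omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def repair_column(data, col, random_state=0):
    """Fill NaNs in one column with a random forest trained on the complete rows."""
    data = data.copy()
    missing = np.isnan(data[:, col])
    if not missing.any():
        return data
    others = np.delete(data, col, axis=1)
    # Missing values in the other columns are temporarily filled with 0.
    others = np.nan_to_num(others, nan=0.0)
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model.fit(others[~missing], data[~missing, col])     # group without missing values
    data[missing, col] = model.predict(others[missing])  # group with missing values
    return data

def repair_all(data):
    """Repair columns one by one, starting with the column that has the fewest NaNs."""
    order = np.argsort(np.isnan(data).sum(axis=0))
    for col in order:
        data = repair_column(data, col)
    return data

# Hypothetical first intermediate data: the third column depends on the first two.
rng = np.random.default_rng(0)
first = rng.normal(size=(200, 3))
first[:, 2] = first[:, 0] + 2.0 * first[:, 1]
first[rng.choice(200, 20, replace=False), 2] = np.nan
second = repair_all(first)   # plays the role of the second intermediate data
```

Values that were present in the input are left untouched; only the positions that held `NaN` receive predictions.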

As an alternative embodiment, before inputting the remaining data into the prediction integration model to obtain the reference predicted sales of the first preset number of reference cigarettes in the control group, the method further includes:

inputting the residual data into a plurality of initial prediction models to respectively obtain initial prediction values corresponding to each initial prediction model;

respectively obtaining a loss value corresponding to each initial prediction value according to the residual data, the initial prediction value and a preset function;

according to a second preset algorithm and the loss value, adjusting model parameters of an initial prediction model corresponding to the loss value until the loss value is smaller than a third preset threshold value to obtain a plurality of prediction models;

constructing an integrated model based on a preset logistic regression algorithm;

and obtaining a prediction integrated model according to the prediction model and the integrated model.

Optionally, in this embodiment, four machine learning algorithms are used to construct the prediction models; prediction models may be added or removed as required.

The residual data are input into four machine learning algorithms, namely the Catboost algorithm, the NeuralNetTorch algorithm, the LightGBM algorithm and the RandomForest algorithm; the models are iterated and updated, and four independent, well-trained models are finally obtained.

The initial prediction model based on the Catboost algorithm, the initial prediction model based on the NeuralNetTorch algorithm, the initial prediction model based on the LightGBM algorithm and the initial prediction model based on the RandomForest algorithm are each used to predict the sales of each reference cigarette from the residual data, giving the initial prediction values. From the initial prediction values, the actual sales in the residual data and an objective function (i.e., the preset function), the loss value corresponding to each initial prediction value is calculated. The loss value measures the degree of inconsistency between the predicted value of the model and the actual value; the smaller the loss, the better the robustness of the model. Training can be stopped when the loss falls below a certain threshold, at which point the model is considered trained. The objective function may be, for example, a mean square error function or an absolute value error function.
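For illustration, the two objective functions named above and the threshold check can be written as below. The function names and the sample figures are hypothetical and are not taken from the application.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean square error between actual and predicted sales."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted sales."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def is_trained(loss, threshold):
    """Training may stop once the loss falls below the preset threshold."""
    return loss < threshold

actual = [100.0, 120.0, 90.0]       # actual sales (hypothetical)
predicted = [110.0, 115.0, 95.0]    # initial prediction values (hypothetical)
loss_mse = mse(actual, predicted)
loss_mae = mae(actual, predicted)
```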

According to a second preset algorithm (i.e., a gradient descent algorithm) and the loss value, the model parameters of the initial prediction model corresponding to the loss value are adjusted (by means of their gradients) until the loss value is smaller than a third preset threshold, which yields a plurality of prediction models: a prediction model based on the Catboost algorithm, a prediction model based on the NeuralNetTorch algorithm, a prediction model based on the LightGBM algorithm, and a prediction model based on the RandomForest algorithm. The specific steps are as follows. First, the gradient of the neural network in each prediction model is computed from the loss value by the gradient descent algorithm. The neural network can be regarded as an abstract function, and training it can be regarded as finding the minimum point of an unknown function: the derivative at an initial point of the function is computed and its opposite direction is taken, which gives the gradient update, and the update information is stored in an optimizer. Second, a specific step length is set and the model is updated with the gradient before the next iteration; that is, the update held by the optimizer is applied to the neural network, moving its parameters one step length in the direction opposite to the gradient. Third, the gradient held by the optimizer is cleared so that the next round of gradient descent is not affected by the previous round.
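The three steps (compute the gradient, step against it, clear it) can be sketched on a deliberately simple model. The linear model, the mean square error loss, the step length and the threshold below are assumptions chosen to keep the example small; they stand in for the prediction models and the third preset threshold of the application.

```python
import numpy as np

def train(X, y, step=0.1, threshold=1e-4, max_iter=10000):
    """Gradient descent on a linear model until the MSE loss is below the threshold."""
    w = np.zeros(X.shape[1])
    grad = np.zeros_like(w)              # buffer playing the role of the optimizer state
    loss = float("inf")
    for _ in range(max_iter):
        residual = X @ w - y
        loss = float(np.mean(residual ** 2))
        if loss < threshold:             # stop once the loss is small enough
            break
        grad += 2.0 * X.T @ residual / len(y)   # step 1: gradient of the loss
        w -= step * grad                        # step 2: move against the gradient
        grad[:] = 0.0                           # step 3: clear the gradient
    return w, loss

# Toy data (hypothetical) generated from the parameters [2, -1].
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = X_train @ np.array([2.0, -1.0])
w_fit, final_loss = train(X_train, y_train)
```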

The integration model is constructed based on a preset logistic regression algorithm, such as an LR algorithm. And obtaining a prediction integrated model according to the prediction model and the integrated model.

In the embodiment of the application, training the prediction models makes the prediction effect more definite and reliable. The method therefore greatly reduces the amount of manual calculation, avoids the deviations caused by manual calculation, and improves itself through continuous learning. Applied to the market launch strategies of various new cigarettes, it serves the whole tobacco industry and improves the economic benefit of the tobacco industry.

As an alternative embodiment, inputting the remaining data into the prediction integration model to obtain a reference prediction sales of a first preset number of reference cigarettes in the control group, including:

inputting the residual data into each prediction model to obtain intermediate reference prediction sales of a first preset number of reference cigarettes corresponding to each prediction model;

and obtaining the reference predicted sales of the first preset number of reference cigarettes according to the integrated model and the intermediate reference predicted sales.

Optionally, the remaining data are input into each prediction model, namely the prediction model based on the Catboost algorithm, the prediction model based on the NeuralNetTorch algorithm, the prediction model based on the LightGBM algorithm and the prediction model based on the RandomForest algorithm, and each prediction model outputs the intermediate reference predicted sales of each reference cigarette.

And respectively summarizing the plurality of intermediate reference predicted sales of each reference cigarette by using the integrated model to obtain the reference predicted sales of the first preset number of reference cigarettes.

It should be noted that random forest is a common machine learning algorithm. Its core idea is to construct a plurality of weak classifiers, combine the results they produce, and compute the final result from the combination. The specific process of obtaining the intermediate reference predicted sales by using the prediction model based on the RandomForest algorithm is as follows:

obtaining the number of features M of the residual data; randomly drawing m features, where m should be much smaller than M; selecting the data of the chosen m features from the residual data; constructing a decision tree from the selected data; repeating these steps until N decision trees have been constructed, where N is a number specified by the user; and aggregating the results of all decision trees by voting to finally obtain the intermediate reference predicted sales.

A decision tree is a tree structure and is the basic weak classifier of the random forest algorithm. It works by branching on classification information entropy; when the tree reaches a specified depth it stops growing, and its leaf nodes represent the final classification. Random forests have several advantages. First, they perform well on data sets and do not overfit easily. Second, they have good noise immunity. Third, they can handle very high-dimensional data without feature selection, adapt well to different data sets, can process both discrete and continuous data, and do not require standardization. Fourth, a matrix measuring the similarity between samples can be generated. Fifth, an unbiased estimate of the generalization error is obtained while the random forest is built. Sixth, training is fast and a ranking of variable importance can be obtained. Seventh, interactions between features can be detected during training. Eighth, the method is easy to parallelize. Ninth, it is relatively simple to implement.
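The random forest procedure described above (draw m of M features, build a tree, repeat N times, aggregate) can be sketched as follows. This is an illustration rather than the implementation used in the application: it relies on scikit-learn's `DecisionTreeRegressor`, adds bootstrap sampling of rows (an assumption not stated in the text), and aggregates by averaging, which is the regression counterpart of voting. The names `fit_forest` and `predict_forest` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=20, m=2, seed=0):
    """Build n_trees decision trees, each on m randomly drawn features."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]                               # M in the text
    forest = []
    for _ in range(n_trees):
        cols = rng.choice(n_features, size=m, replace=False)  # draw m of M features
        rows = rng.integers(0, len(y), size=len(y))           # bootstrap rows (assumption)
        tree = DecisionTreeRegressor(max_depth=5, random_state=0)
        tree.fit(X[rows][:, cols], y[rows])
        forest.append((cols, tree))
    return forest

def predict_forest(forest, X):
    """Aggregate the trees; for regression, voting amounts to averaging."""
    preds = [tree.predict(X[:, cols]) for cols, tree in forest]
    return np.mean(preds, axis=0)

# Hypothetical residual data with M = 6 features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 6))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1]
forest = fit_forest(X_demo, y_demo, n_trees=10, m=2)
pred = predict_forest(forest, X_demo)
```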

In the embodiment of the application, a prediction model based on the Catboost algorithm, a prediction model based on the NeuralNetTorch algorithm, a prediction model based on the LightGBM algorithm and a prediction model based on the RandomForest algorithm are each used to predict the intermediate reference predicted sales of the reference cigarettes, and the integrated model is used to fuse the intermediate reference predicted sales into the reference predicted sales, so that the prediction result has good accuracy and appropriate generalization.

As an alternative embodiment, obtaining the reference predicted sales of the first preset number of reference cigarettes according to the integrated model and the intermediate reference predicted sales includes:

and fusing the intermediate reference predicted sales by using a logistic regression algorithm to respectively obtain the predicted sales of the first preset number of reference cigarettes.

Optionally, the intermediate reference predicted sales of the plurality of machine learning models are aggregated. A regression is established by using a logistic regression algorithm such as the LR algorithm, with the intermediate reference predicted sales of the four models as features and the predicted sales of the reference cigarettes as the predicted variable, to obtain the predicted sales of each reference cigarette.

In the embodiment of the application, the sales of the reference cigarettes are predicted simultaneously by a plurality of prediction models, and the results of these models are finally fused by a logistic regression algorithm to obtain the final predicted sales, which greatly improves the accuracy of the sales prediction.
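The fusion step can be illustrated with a small stacking sketch. Note the substitution: the application names a logistic regression (LR) algorithm, but because sales are continuous this sketch swaps in an ordinary least-squares linear meta-model, which is an assumption made only for illustration. The function names and the synthetic base-model outputs are likewise hypothetical.

```python
import numpy as np

def fit_fusion(base_preds, actual):
    """Fit fusion coefficients (one per base model plus an intercept) by least squares."""
    Z = np.column_stack([base_preds, np.ones(len(actual))])
    coef, *_ = np.linalg.lstsq(Z, actual, rcond=None)
    return coef

def fuse(base_preds, coef):
    """Combine the intermediate predictions of the base models into one prediction."""
    Z = np.column_stack([base_preds, np.ones(base_preds.shape[0])])
    return Z @ coef

# Hypothetical intermediate reference predicted sales: 8 periods x 4 base models.
rng = np.random.default_rng(0)
base_preds = rng.uniform(50.0, 150.0, size=(8, 4))
actual = base_preds @ np.array([0.4, 0.3, 0.2, 0.1]) + 5.0
coef = fit_fusion(base_preds, actual)
fused = fuse(base_preds, coef)
```

In practice the fusion coefficients would be fitted on data held out from the base models, so that the meta-model does not simply reward the base model that overfits most.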

According to another aspect of the embodiments of the present application, there is further provided a method for predicting sales of new cigarettes, as shown in fig. 4, a flow of the method includes:

acquiring raw data; processing the data; inputting the processed data into the synthetic control method to obtain the fitting weights of the mature cigarettes with respect to the new cigarette; inputting the processed data into machine learning, i.e., into the Catboost algorithm, the NeuralNetTorch algorithm, the LightGBM algorithm and the RandomForest algorithm respectively, to obtain prediction results, and fusing the prediction results by using the LR algorithm to obtain the predicted sales of the mature cigarettes; and performing a weighted summation of the predicted sales of the mature cigarettes using their fitting weights with respect to the new cigarette to obtain the predicted sales of the new cigarette.

Optionally, the core idea of the embodiment is to fit the sales trend of new cigarettes by means of machine learning and the synthetic control method, exploiting the fact that mature cigarettes have long data histories and stable sales. The synthetic control method is used to analyze, on the basis of cigarette standard data, the correlation between the sales of mature cigarettes and of the new cigarette, and to calculate the fitting weight of each mature cigarette with respect to the new cigarette. Four machine learning algorithms use the historical sales volume of the mature cigarettes, retail sales data, market consumer profile data and cigarette historical sales data to obtain sales prediction data for the various mature cigarette products, and the results of the four machine learning algorithms are fused by using a logistic regression (LR) algorithm (hereinafter also referred to as the integrated model) to obtain the fused sales prediction data of the mature products. The predicted sales of the new cigarette are then calculated from the mature cigarette prediction data obtained by machine learning and the weights obtained by the synthetic control method.
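The final weighted summation can be written in a few lines. The function name, the validation of the weights and the figures below are illustrative assumptions.

```python
import numpy as np

def predict_new_sales(fitting_weights, mature_predicted_sales):
    """Weighted sum of the mature cigarettes' predicted sales."""
    w = np.asarray(fitting_weights, dtype=float)
    s = np.asarray(mature_predicted_sales, dtype=float)
    if w.shape != s.shape:
        raise ValueError("one weight is needed per mature cigarette")
    if (w < 0).any() or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return float(w @ s)

# Hypothetical fitting weights and fused predicted sales of three mature cigarettes.
new_sales = predict_new_sales([0.5, 0.3, 0.2], [1000.0, 2000.0, 1500.0])
```

With these figures the result is 0.5 × 1000 + 0.3 × 2000 + 0.2 × 1500 = 1400.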

The embodiment of the application overcomes the limitations encountered in predicting the sales of new cigarettes and fills the research gap in new cigarette launch models. It helps the tobacco industry formulate launch strategies for new cigarettes, solves the resource misallocation that arises when new cigarettes cannot be launched precisely for lack of historical data, supports scientific decision-making and precise launches, and improves the direct and indirect economic benefits of the tobacco industry.

According to another aspect of the embodiments of the present application, a target cigarette sales prediction apparatus for implementing the target cigarette sales prediction method is provided. Fig. 5 is a block diagram of an alternative target cigarette sales prediction apparatus according to an embodiment of the present application, as shown in fig. 5, the apparatus may include:

an acquisition module 501, configured to acquire historical cigarette data;

the determining module 502 is configured to determine, according to the historical cigarette data and a preset algorithm, a control group corresponding to the target cigarette and fitting weights corresponding to a first preset number of reference cigarettes in the control group, where a correlation between the reference cigarettes and the target cigarette is greater than a first preset threshold;

a first obtaining module 503, configured to remove data other than data corresponding to the control group and the target cigarette from the historical cigarette data, so as to obtain remaining data;

a second obtaining module 504, configured to input the remaining data into the prediction integrated model, to obtain a reference prediction sales of a first preset number of reference cigarettes in the control group;

and a third obtaining module 505, configured to obtain a predicted sales of the target cigarette according to the reference predicted sales and the corresponding fitting weight.

It should be noted that, the acquiring module 501 in this embodiment may be used to perform the step S101, the determining module 502 in this embodiment may be used to perform the step S102, the first obtaining module 503 in this embodiment may be used to perform the step S103, the second obtaining module 504 in this embodiment may be used to perform the step S104, and the third obtaining module 505 in this embodiment may be used to perform the step S105.

Through the above modules, reference cigarettes having a high correlation with the target cigarette are selected by a preset algorithm to form the control group, and their fitting weights are determined; the sales of the reference cigarettes in the control group are predicted by the prediction integrated model, and the predicted sales of the target cigarette are obtained by combining them with the fitting weights, so that the stable sales trend of mature cigarettes is used to simulate and calibrate the sales trend of the new cigarette. This overcomes the limitations encountered in predicting the sales of new cigarettes and fills the technical gap in new cigarette launch models. The method and the device can effectively solve the problem in the related art that the sales of new cigarettes cannot be accurately predicted.

As an alternative embodiment, the apparatus further comprises:

a fourth obtaining module, configured to calculate variances corresponding to all the numerical features in the historical cigarette data, and delete the numerical features with the variances smaller than a second preset threshold to obtain first intermediate data;

As an alternative embodiment, the determining module includes:

the first obtaining unit is used for analyzing the relevance between the reference cigarettes and the target cigarettes according to a preset algorithm, and selecting a first preset number of reference cigarettes according to the relevance to obtain a control group;

As an alternative embodiment, the repair module includes:

a selecting unit for selecting a data column with error data from the first intermediate data;

the splitting unit is used for splitting the first intermediate data according to the data columns to obtain a data group with error data;

the third obtaining unit is used for obtaining a data repair model according to a preset algorithm and the data group;

a fourth obtaining unit, configured to obtain a predicted data group corresponding to the data group according to the data repair model, where the predicted data group is used to process error data;

and the repair unit is used for repairing the first intermediate data based on the predicted data group to obtain second intermediate data.

As an alternative embodiment, the apparatus further comprises:

the input module is used for inputting the residual data into a plurality of initial prediction models to respectively obtain initial prediction values corresponding to each initial prediction model;

a fifth obtaining module, configured to obtain a loss value corresponding to each initial prediction value according to the remaining data, the initial prediction value and a preset function;

the adjusting module is used for adjusting model parameters of the initial prediction model corresponding to the loss value according to the second preset algorithm and the loss value until the loss value is smaller than a third preset threshold value to obtain a plurality of prediction models;

and a sixth obtaining module, configured to obtain a prediction integrated model according to the prediction model and the integrated model.

As an alternative embodiment, the second obtaining module includes:

a fifth obtaining unit, configured to input remaining data into each prediction model, to obtain intermediate reference prediction sales of a first preset number of reference cigarettes corresponding to each prediction model;

and a sixth obtaining unit, configured to obtain a reference predicted sales of a first preset number of reference cigarettes according to the integrated model and the intermediate reference predicted sales.

As an alternative embodiment, the sixth obtaining unit includes:

It should be noted that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments.

According to still another aspect of the embodiments of the present application, there is further provided an electronic device for implementing the above-mentioned target cigarette sales prediction method, where the electronic device may be a server, a terminal, or a combination of both.

Fig. 6 is a block diagram of an alternative electronic device, according to an embodiment of the present application, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, as shown in fig. 6, wherein the processor 601, the communication interface 602, and the memory 603 perform communication with each other via the communication bus 604, wherein,

a memory 603 for storing a computer program;

the processor 601 is configured to execute the computer program stored in the memory 603, and implement the following steps:

acquiring historical cigarette data;

determining a control group corresponding to the target cigarettes and fitting weights corresponding to a first preset number of reference cigarettes in the control group according to the historical cigarette data and a preset algorithm, wherein the correlation between the reference cigarettes and the target cigarettes is larger than a first preset threshold;

deleting data except data corresponding to the control group and the target cigarette from the historical cigarette data to obtain residual data;

inputting the residual data into a prediction integration model to obtain reference prediction sales of a first preset number of reference cigarettes in a control group;

Optionally, in this embodiment, the communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the electronic device and other devices.

The memory may include RAM, and may also include non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.

As an example, as shown in fig. 6, the memory 603 may include, but is not limited to, the acquisition module 501, the determination module 502, the first obtaining module 503, the second obtaining module 504, and the third obtaining module 505 in the target cigarette sales predicting device. In addition, other module units in the target cigarette sales predicting device may be included, but are not limited to, and are not described in detail in this example.

The processor may be a general-purpose processor, including but not limited to a CPU (Central Processing Unit), an NP (Network Processor), etc.; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.

It will be appreciated by those skilled in the art that the structure shown in fig. 6 is only illustrative. The device implementing the target cigarette sales prediction method may be a terminal device, such as a smart phone (e.g., an Android mobile phone or an iOS mobile phone), a tablet computer, a palm computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 6 does not limit the structure of the electronic device described above. For example, the terminal device may include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 6, or have a different configuration from that shown in fig. 6.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing a terminal device to execute in association with hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, etc.

According to yet another aspect of embodiments of the present application, there is also provided a storage medium. Alternatively, in the present embodiment, the above-described storage medium may be used to store program code for executing the target cigarette sales prediction method.

Alternatively, in this embodiment, the storage medium may be located on at least one network device of the plurality of network devices in the network shown in the above embodiment.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of:

acquiring historical cigarette data;

Alternatively, specific examples in the present embodiment may refer to examples described in the above embodiments, which are not described in detail in the present embodiment.

Alternatively, in the present embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a U disk, ROM, RAM, a mobile hard disk, a magnetic disk or an optical disk.

In the description of the present specification, a description referring to the terms "present embodiment," "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided that they do not contradict each other. In the description of the present disclosure, "a plurality" means at least two, such as two, three, etc., unless explicitly specified otherwise.

Obviously, the above examples are given by way of illustration only and do not limit the embodiments. Other variations or modifications in different forms can be made by those of ordinary skill in the art on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of protection of the invention.

Claims

1. A method for predicting sales of a target cigarette, the method comprising:

acquiring historical cigarette data;

2. The method of claim 1, wherein after the acquiring historical cigarette data, the method further comprises:

calculating variances corresponding to all numerical features in the historical cigarette data, and deleting the numerical features of which the variances are smaller than a second preset threshold value to obtain first intermediate data;

judging whether error data exists in the first intermediate data;

if the error data exist in the first intermediate data, repairing the error data in the first intermediate data by using a preset method to obtain second intermediate data;

normalizing the numerical features and digitizing the non-numerical features.

3. The method of claim 2, wherein determining, according to the historical cigarette data and a preset algorithm, a control group corresponding to a target cigarette and fitting weights corresponding to a first preset number of reference cigarettes in the control group, comprises:

analyzing the relevance of the reference cigarettes and the target cigarettes according to a preset algorithm, and selecting a first preset number of reference cigarettes according to the relevance to obtain the control group;

acquiring a first matching variable vector of the target cigarette and a second matching variable vector of the reference cigarette from the historical cigarette data, wherein the first matching variable vector is used for determining the historical cigarette data corresponding to the target cigarette, and the second matching variable vector is used for determining the historical cigarette data corresponding to the reference cigarette;

obtaining a reference variable according to the first matching variable vector, the second matching variable vector, a preset diagonal matrix, an intermediate weight and a preset formula, wherein the intermediate weight is a dynamic change value in the process of determining the fitting weight;

and taking the intermediate weight corresponding to the reference variable with the minimum numerical value as the fitting weight.
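The weight search of claim 3 can be sketched as follows. This is an illustration under stated assumptions, not the claimed formula: it assumes the reference variable is the quadratic form (X1 - X0·W)ᵀ V (X1 - X0·W) with a diagonal matrix V, and that candidate weights are non-negative and sum to one, as is conventional in synthetic-control style fitting. The grid search, the function names, and the sample vectors are hypothetical.

```python
from itertools import product


def weighted_distance(x1, x0_cols, v_diag, w):
    """Assumed reference variable: (X1 - X0 W)' V (X1 - X0 W), V diagonal."""
    total = 0.0
    for k, vk in enumerate(v_diag):
        synthetic = sum(wj * col[k] for wj, col in zip(w, x0_cols))
        total += vk * (x1[k] - synthetic) ** 2
    return total


def fit_weights(x1, x0_cols, v_diag, steps=20):
    """Try every weight vector on a grid of the simplex (non-negative,
    summing to one) and keep the one with the smallest reference variable."""
    n = len(x0_cols)
    best_w, best_d = None, float("inf")
    for parts in product(range(steps + 1), repeat=n - 1):
        if sum(parts) > steps:
            continue
        # The last weight is whatever remains so that the weights sum to one.
        w = [p / steps for p in parts] + [(steps - sum(parts)) / steps]
        d = weighted_distance(x1, x0_cols, v_diag, w)
        if d < best_d:
            best_w, best_d = w, d
    return best_w, best_d


# Hypothetical matching variables for one target and two reference cigarettes.
target = [3.0, 5.0]
references = [[2.0, 4.0], [4.0, 6.0]]
fitting_weights, reference_variable = fit_weights(target, references, [1.0, 1.0])
```

The intermediate weight changes on every iteration of the loop, and the candidate that yields the smallest reference variable is retained as the fitting weight. A grid search is used only for readability; a constrained optimizer would be the usual choice for more than a few reference cigarettes.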

4. The method of claim 2, wherein the repairing the error data in the first intermediate data by using a preset method to obtain second intermediate data comprises:

selecting a data column with the error data from the first intermediate data;

splitting the first intermediate data according to the data column to obtain a data set with the error data;

obtaining a data repair model according to a preset algorithm and the data set;

obtaining a predicted data set corresponding to the data set according to the data repair model, wherein the predicted data set is used for processing the error data;

and repairing the first intermediate data based on the predicted data set to obtain the second intermediate data.
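The repair procedure of claim 4 can be sketched as below. The sketch assumes that error data appear as `None` entries in a single column, that the rows are split into clean rows (used for fitting) and erroneous rows (to be predicted), and that the data repair model is a one-feature least-squares line. The column names and the choice of model are hypothetical; the claim leaves the preset algorithm open.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def repair_column(rows, feature, target):
    """Fit a repair model on rows whose target value is valid, then replace
    each erroneous (None) target value with the model's prediction."""
    clean = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line([r[feature] for r in clean],
                                [r[target] for r in clean])
    repaired = []
    for r in rows:
        fixed = dict(r)  # copy so the first intermediate data stay untouched
        if fixed[target] is None:
            fixed[target] = slope * fixed[feature] + intercept
        repaired.append(fixed)
    return repaired


# Hypothetical first intermediate data with one erroneous sales value.
first_intermediate = [
    {"price": 1.0, "sales": 10.0},
    {"price": 2.0, "sales": 20.0},
    {"price": 3.0, "sales": None},
    {"price": 4.0, "sales": 40.0},
]
second_intermediate = repair_column(first_intermediate, "price", "sales")
```

In this toy data the valid rows lie on the line sales = 10 × price, so the erroneous entry is replaced by a value close to 30.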

5. The method of claim 3, wherein prior to said inputting said remaining data into a prediction integration model to obtain reference predicted sales of a first preset number of said reference cigarettes in said control group, said method further comprises:

inputting the residual data into a plurality of initial prediction models to respectively obtain initial prediction values corresponding to the initial prediction models;

respectively obtaining a loss value corresponding to each initial prediction value according to the residual data, the initial prediction value and a prediction function;

according to a second preset algorithm and the loss value, adjusting model parameters of the initial prediction model corresponding to the loss value until the loss value is smaller than a third preset threshold value to obtain a plurality of prediction models;

and obtaining the prediction integration model according to the prediction models and the integrated model.
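The training loop of claim 5 can be sketched as follows. Everything concrete here is an assumption for illustration: the initial prediction models are one-feature linear models, the loss is mean squared error, the second preset algorithm is plain gradient descent, and the integrated model is a simple average. The claims do not specify any of these choices.

```python
def predict(model, x):
    """Linear prediction model, stored as a (weight, bias) pair."""
    w, b = model
    return w * x + b


def mse(model, xs, ys):
    """Assumed loss: mean squared error between predictions and targets."""
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_until(model, xs, ys, lr, loss_threshold, max_steps=10000):
    """Adjust the model parameters by gradient descent until the loss falls
    below loss_threshold (or max_steps is exhausted)."""
    w, b = model
    n = len(xs)
    for _ in range(max_steps):
        if mse((w, b), xs, ys) < loss_threshold:
            break
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        w -= lr * 2 * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * 2 * sum(errors) / n
    return (w, b)


def ensemble_predict(models, x):
    """Placeholder integrated model: average of the trained models' outputs."""
    return sum(predict(m, x) for m in models) / len(models)


# Hypothetical residual data following sales = 2 * period + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
threshold = 1e-6  # stands in for the third preset threshold
initial = [((0.0, 0.0), 0.05), ((5.0, -5.0), 0.1)]  # (initial model, learning rate)
trained = [train_until(m, xs, ys, lr, threshold) for m, lr in initial]
forecast = ensemble_predict(trained, 4.0)
```

Each initial model is trained independently until its own loss is below the threshold, and only then are the trained models combined, which mirrors the order of the steps in the claim.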

6. The method of claim 5, wherein said inputting said remaining data into a prediction integration model to obtain reference predicted sales of a first preset number of said reference cigarettes in said control group comprises:

and obtaining the reference predicted sales of a first preset number of reference cigarettes according to the integrated model and the intermediate reference predicted sales.

7. The method of claim 6, wherein said obtaining said reference predicted sales of a first preset number of said reference cigarettes according to said integrated model and said intermediate reference predicted sales comprises:

and fusing the intermediate reference predicted sales volumes by using a logistic regression algorithm to respectively obtain the predicted sales volumes of the first preset number of reference cigarettes.

8. A target cigarette sales prediction apparatus, comprising:

the acquisition module is used for acquiring historical cigarette data;

9. An electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus, characterized in that,

the memory is used for storing a computer program;

the processor is configured to perform the method steps of any one of claims 1 to 7 by running the computer program stored on the memory.

10. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program, wherein the computer program, when executed by a processor, implements the method steps of any of claims 1 to 7.