CN117745340A

CN117745340A - Cigarette market grid capacity rationality prediction method and system based on big data

Info

Publication number: CN117745340A
Application number: CN202410188091.XA
Authority: CN
Inventors: 王再东; 胡佑安; 姜兵仁; 涂鑫
Original assignee: Hunan Xiaoxiang Big Data Technology Co ltd; Hunan Xiaoxiang Big Data Research Institute
Current assignee: Hunan Xiaoxiang Big Data Technology Co ltd; Hunan Xiaoxiang Big Data Research Institute
Priority date: 2024-02-20
Filing date: 2024-02-20
Publication date: 2024-03-22
Anticipated expiration: 2044-02-20
Also published as: CN117745340B

Abstract

The invention discloses a big data-based method and system for predicting the rationality of cigarette market grid capacity, and belongs to the technical field of big data analysis.

Description

Cigarette market grid capacity rationality prediction method and system based on big data

Technical Field

The invention relates to the technical field of big data analysis, in particular to a cigarette market grid capacity rationality prediction method and system based on big data.

Background

Accurate delivery of cigarette products is extremely important to commercial companies, because the cigarette sales volume that results from product delivery directly affects their economic benefits. Traditional prediction methods do not consider the complexity of market dynamics and consumer behavior, and the data they can collect is limited, which leads to inaccurate cigarette delivery. Traditional machine learning methods consider the factors that influence cigarette market capacity incompletely, which leads to insufficient accuracy and poor stability of model prediction.

Disclosure of Invention

In view of the above, and to overcome the defects of the prior art, the invention provides a big data-based method and system for predicting the rationality of cigarette market grid capacity. To address the problem that traditional prediction methods do not consider the complexity of market dynamics and consumer behavior and can collect only limited data, which leads to inaccurate cigarette delivery, the scheme introduces business circle data from outside the tobacco database, drives intelligent strategies with the data of different market segments, and generates customized marketing strategies, achieving intelligent and accurate cigarette delivery. To address the insufficient accuracy and poor stability of model prediction caused by traditional machine learning methods considering the factors that affect cigarette market capacity incompletely, the scheme uses an integrated (ensemble) learning algorithm that combines the advantages of ARIMA, Holt-Winters and RF to improve the accuracy and stability of model prediction.

The technical scheme adopted by the invention is as follows: the invention provides a big data-based cigarette market grid capacity rationality prediction method, which comprises the following steps:

step S1: defining a business circle, and defining the business circle as a spatial range of cigarette sales capacity of retailers and a geographic area of distribution of cigarette consumers according to an area interaction theory;

step S2: business circle expansion: the business circle is expanded outward with the initial position of the retailer as the initial value, giving an expanded business circle;

step S3: data preprocessing, namely acquiring data of an expanded business district, generating a business district data set, and dividing the business district data set into a training set and a testing set;

step S4: predicting the grid capacity of the cigarette market, and predicting the grid capacity of the cigarette market by using a training set through an integrated learning algorithm to obtain an integrated model A;

step S5: evaluating the integrated model A by using a test set to obtain an integrated model B;

step S6: cigarette market grid capacity rationality prediction: new grid data is input into the integrated model B, and the grid capacity of the cigarette market is predicted to obtain a prediction result.

Further, in step S1, the business circle is defined as the spatial range of a retailer's cigarette sales capability and the geographical area over which its cigarette consumers are distributed. According to the area interaction theory, the probability that a consumer purchases cigarettes at a retailer is determined by the area of the retailer and the distance between the consumer and the retailer, using the following formula:

$$P_{ij} = \frac{S_j / D_{ij}^{\lambda}}{\sum_{k=1}^{n} S_k / D_{ik}^{\lambda}}$$

where $P_{ij}$ is the probability that a consumer located at $i$ goes to retailer $j$ to purchase cigarettes, $n$ is the total number of retailers in the business circle, $S_j$ is the scale of retailer $j$, $D_{ij}$ is the distance between $i$ and $j$, and $\lambda$ indicates how much importance the customer places on time and distance when purchasing cigarettes.
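The purchase probability of step S1 can be illustrated with a short sketch. It assumes the standard Huff-type form, in which each retailer's attraction is its scale divided by distance raised to a decay exponent; the function name `huff_probabilities` and the default exponent of 2.0 are illustrative assumptions, not values given in the source.

```python
from typing import List, Sequence


def huff_probabilities(scales: Sequence[float],
                       distances: Sequence[float],
                       lam: float = 2.0) -> List[float]:
    """Probability that one consumer buys at each retailer j.

    scales[j]    -- scale (for example floor area) of retailer j
    distances[j] -- distance from the consumer to retailer j, must be > 0
    lam          -- distance-decay exponent (2.0 is an assumed default)
    """
    if len(scales) != len(distances):
        raise ValueError("scales and distances must have the same length")
    # Attraction of each retailer: scale discounted by distance.
    attraction = [s / (d ** lam) for s, d in zip(scales, distances)]
    total = sum(attraction)
    # Normalise so that the probabilities over all retailers sum to 1.
    return [a / total for a in attraction]


# One consumer, three retailers: a small nearby shop, a small distant shop,
# and a large distant shop.
probs = huff_probabilities(scales=[100.0, 100.0, 400.0],
                           distances=[1.0, 2.0, 2.0])
print(probs)
```

In this example the large distant retailer ends up as attractive as the small nearby one, which is the trade-off between scale and distance that the formula is meant to capture.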

Further, in step S2, the business district expansion method specifically includes the following steps:

step S21: defining an initial value, calculating the probability of each retailer and surrounding customers purchasing cigarettes, and defining the positions of the retailers as the initial value;

step S22: calculating the geographical range of the business circle: a distance […] is defined, representing the distance between the location of the retailer and the center of the business circle, where the initial n is the number of retailers contained in the grid, and […] is initialized to […];

step S23: expanding the business circle range: centered on the initial grid, the business circle range is expanded as a square, and the area […] of the expanded shopping area is calculated by the same method as for the initial grid; if […], the shopping area continues to be expanded and […] is recalculated, until […], giving the expanded business circle.

Further, in step S3, the data preprocessing specifically includes the following steps:

step S31: acquiring data: the basic attributes, crowd characteristics and consumption capacity of the expanded business circle are acquired and integrated into business circle data 1; the market status, consumption index and consumption preference of the expanded business circle are acquired and integrated into business circle data 2; POI data related to cigarette sales is acquired and interpreted using location and attribute characteristics as constraints, extracting the number of enterprises, shopping area, traffic type, walking distance, business type, and longitude and latitude to obtain a POI data set;

step S32: data conversion: business circle data 1 and business circle data 2 contain numerical data and categorical data; the numerical data is transformed with the log1p function to obtain approximately Gaussian-distributed data, and the categorical data is label-encoded to obtain numerical features, giving business circle data A and business circle data B;

step S33: constructing the data set: PiFlow is used to fuse business circle data A, business circle data B and the POI data set into a cigarette market data set;

step S34: dividing the data set into a training set and a testing set;

step S35: storing the data set: the business circle data set is stored in a distributed manner in the Hive database.
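The conversion and splitting of steps S32 and S34 can be sketched in plain Python. This is a minimal illustration only: the helper names, the 80/20 split ratio and the random shuffle are assumptions, since the source does not state how the split is made, and PiFlow and Hive are not involved here.

```python
import math
import random


def log1p_transform(values):
    """Apply log(1 + x) to non-negative numeric values to reduce right skew."""
    return [math.log1p(v) for v in values]


def label_encode(categories):
    """Map each distinct category to an integer code (codes follow sorted order)."""
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle rows reproducibly and return (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_ratio))
    return rows[n_test:], rows[:n_test]


# Hypothetical business circle records: monthly sales and business type.
sales = [0.0, 9.0, 99.0]
business_type = ["mall", "street", "mall"]

print(log1p_transform(sales))
codes, mapping = label_encode(business_type)
print(codes, mapping)
train, test = train_test_split(range(10))
print(len(train), len(test))
```

For sales histories that are ordered in time, a chronological split may be more appropriate than a shuffled one; the source does not say which is used.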

Further, in step S4, the method for predicting the grid capacity of the cigarette market specifically includes the following steps:

step S41: ARIMA model training: the parameters of the ARIMA model are $(p, d, q)$, and $\mathrm{ARIMA}(p, d, q)$ is fitted to the training set, where $p$ is the number of autoregressive terms, $d$ is the differencing order and $q$ is the number of moving-average terms; ARIMA model training comprises the following steps:

step S411: determining the number of autoregressive terms and the number of moving average terms, and determining the number of autoregressive terms and the number of moving average terms in an ARIMA model by observing an autocorrelation graph ACF and a partial autocorrelation graph PACF;

step S412: determining the differencing order: the first-order difference $\nabla y_t$ of $y_t$ is calculated using the following formula:

$$\nabla y_t = y_t - y_{t-1}$$

where $y_t$ represents the value of the time series at time $t$ and $y_{t-1}$ represents the value of the time series at time $t-1$;

the second-order difference $\nabla^2 y_t$ of $y_t$ is calculated using the following formula:

$$\nabla^2 y_t = \nabla y_t - \nabla y_{t-1} = y_t - 2y_{t-1} + y_{t-2}$$

where $y_{t-2}$ represents the value of the time series at time $t-2$;

the differencing order is calculated using the following formula:

[…]

where […] is a parameter of the autoregressive part, […] is a parameter of the moving-average part, and […] is the estimation error;

step S413: model checking, namely selecting a proper autoregressive term number, a sliding average term number and a differential order number combination, and then performing significance checking on an ARIMA model;

step S414: the Akaike information criterion (AIC) is used to evaluate the accuracy of the prediction, using the following formula:

[…]

where […] is the estimated error variance, […] is the sample size and […] is the number of parameters;

the optimal ARIMA model for predicting the cigarette market capacity is selected according to the AIC, and the goodness of fit of the ARIMA model is verified using the white-noise hypothesis;
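The differencing of step S412 and the AIC comparison of step S414 can be sketched as follows. The AIC form used here, n ln(sigma^2) + 2k with sigma^2 taken as the mean squared residual, is one common least-squares form and is an assumption, because the formula in the source did not survive extraction; the function names are likewise illustrative.

```python
import math


def difference(series, order=1):
    """Return the order-th difference of a series (each pass shortens it by 1)."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def aic(residuals, n_params):
    """Assumed form: AIC = n * ln(sigma^2) + 2k, sigma^2 = mean squared residual.

    Raises ValueError if every residual is exactly zero (log of zero).
    """
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return n * math.log(sigma2) + 2 * n_params


# A quadratic trend becomes constant after two rounds of differencing,
# which suggests a differencing order of d = 2 for such a series.
sales = [1, 4, 9, 16, 25]
print(difference(sales, order=1))
print(difference(sales, order=2))

# Lower AIC is better: compare candidate models fitted on the same data.
print(aic([1.0, -1.0, 1.0, -1.0], n_params=2))
```

When candidate (p, d, q) combinations are compared, the one with the lowest AIC on the same training series would be retained.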

step S42: holt-windows model training, calculating model equations, the following formulas are used:

；

in the method, in the process of the invention,representing the time sequence at the time point +.>Is>Represents the intercept (I)>The slope is indicated as such,representing the time sequence at the time point +.>Seasonal component of->Is an irregular component;

three smoothing equations are calculated using the following formulas:

；

in the method, in the process of the invention,is a smooth constant +.>Is a time sequence at the time point +.>Level of->Is a time sequence at the time point +.>Trend of->Is a time sequence at the time point +.>Season component of->Representing the time sequence at the time point +.>Inputting a training set into a Holt-windows model, solving parameters of three smooth equations by using a maximum likelihood estimation method, and evaluating the prediction accuracy of the Holt-windows model by using a mean square error MSE and a mean absolute error MAE;
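The level, trend and seasonal updates of step S42 can be sketched with the textbook additive Holt-Winters recursions. Whether the source uses the additive or the multiplicative form cannot be recovered from the text, so the additive form, the initialisation from the first two seasons, and the fixed smoothing constants passed in by the caller (instead of maximum likelihood estimation) are all assumptions of this sketch.

```python
def holt_winters_additive(y, period, alpha, beta, gamma, horizon):
    """Additive Holt-Winters smoothing followed by a horizon-step forecast."""
    if len(y) < 2 * period:
        raise ValueError("need at least two full seasons of data")
    # Assumed initialisation: level and seasonal offsets from the first season,
    # trend from the change in seasonal mean between the first two seasons.
    first = sum(y[:period]) / period
    second = sum(y[period:2 * period]) / period
    level = first
    trend = (second - first) / period
    season = [y[i] - first for i in range(period)]
    for t in range(period, len(y)):
        s_old = season[t % period]          # seasonal term one period ago
        prev_level = level
        level = alpha * (y[t] - s_old) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % period] = gamma * (y[t] - level) + (1 - gamma) * s_old
    n = len(y)
    return [level + (h + 1) * trend + season[(n + h) % period]
            for h in range(horizon)]


def mse(actual, predicted):
    """Mean square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# A purely seasonal series with period 3 and no trend.
history = [10.0, 20.0, 30.0] * 3
forecast = holt_winters_additive(history, period=3, alpha=0.5,
                                 beta=0.3, gamma=0.2, horizon=3)
print(forecast)
print(mse([10.0, 20.0, 30.0], forecast), mae([10.0, 20.0, 30.0], forecast))
```

In practice the smoothing constants would be chosen by an optimiser, for example by minimising the MSE on the training set, which corresponds to the parameter-solving step described above.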

step S43: RF model training, comprising the steps of:

step S431: generating the RF: first, N samples are drawn randomly with replacement from the training set to train a decision tree, and serve as the samples at the root node of the tree; second, given that each sample has M attributes, whenever a node of the decision tree needs to be split, m attributes are randomly selected from the M attributes, where in general m << M; one of the m attributes is selected by information gain as the splitting attribute of the node, and nodes are split until they cannot be split further, with no pruning during the whole tree-building process; these steps are repeated to build multiple decision trees, which form the RF model;

step S432: the MGF (mean generating function) prediction model specifically comprises the following steps:

step S4321: assuming that the time series […] has the average value […], the time series […] is expressed as an MGF using the following formula:

[…]

where […], […] or […]; using this formula, m mean generating functions of the time series can be obtained and periodically extended to […];

step S4322: calculating the first-order difference sequence […], using the following formula:

[…]

step S4323: calculating the second-order difference sequence […], using the following formula:

[…]

where […] represents the first-order difference sequence of the time series at time […];

the mean generating function of the original sequence […] is denoted […], and the mean generating functions of the first-order difference sequence […] and the second-order difference sequence […] are denoted […] and […] respectively; their extension sequences […] are obtained by the formula […];

step S4324: based on the extension sequences of the MGFs of the original sequence and the first-order difference sequence, a cumulative extension sequence is established using the following formula:

[…]

step S433: RF-MGF model prediction: prediction data […] is obtained using the RF model, and prediction data […] is obtained using the MGF prediction model;

step S434: the weights for mixing the two methods are calculated using the following formula:

[…]

where […] is the weight of […], and […] represents the total business circle sales in the sales history data of each retailer in the training set;

step S435: the actual value […], the first predicted value […] and the error value […] are weighted using the following formula:

[…]

where […] represents the baseline predicted value and […] represents the historical average;

step S436: with the time variable, […] and […], and with the error value […] as the output variable, a fitting analysis is performed using the response surface method to obtain the final predicted value;
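The tree-building rule of step S431 (sampling with replacement, a random subset of m attributes at each split, and selection by information gain) can be sketched as below. Information gain is defined for discrete labels, so this sketch assumes the capacity target has been binned into classes and that attributes are categorical; the function names are illustrative, and the MGF and response-surface steps are not covered.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on the categorical attribute at index attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def choose_split(rows, labels, m, rng):
    """Pick m of the M attributes at random and return the one with highest gain."""
    candidates = rng.sample(range(len(rows[0])), m)
    return max(candidates, key=lambda a: information_gain(rows, labels, a))


def bootstrap(rows, labels, rng):
    """Draw N samples with replacement from the N training samples."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical samples: (business type, traffic type) -> binned capacity class.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
rng = random.Random(0)
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))
print(choose_split(rows, labels, m=2, rng=rng))
print(bootstrap(rows, labels, rng))
```

Here the first attribute separates the classes perfectly and the second not at all, so the first is chosen whenever it is among the sampled candidates. A full forest would repeat the bootstrap and the recursive splitting for each tree.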

step S44: model fusion: weights are assigned to the ARIMA model, the Holt-Winters model and the RF model using a weighted average method to obtain the integrated model A.
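The weighted-average fusion of step S44 can be sketched as follows. The source does not say how the weights are chosen, so the inverse-error rule in `inverse_error_weights` is purely an assumption for illustration, as are the function names and the example forecasts.

```python
def inverse_error_weights(errors):
    """Weights proportional to 1 / error, normalised to sum to 1 (assumed rule)."""
    inv = [1.0 / e for e in errors]
    total = sum(inv)
    return [v / total for v in inv]


def fuse(predictions, weights):
    """Weighted average of per-model forecasts, element by element."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    horizon = len(predictions[0])
    return [sum(w * p[t] for w, p in zip(weights, predictions))
            for t in range(horizon)]


# Hypothetical two-step forecasts from the ARIMA, Holt-Winters and RF models.
arima_pred = [10.0, 20.0]
hw_pred = [20.0, 40.0]
rf_pred = [30.0, 60.0]

print(fuse([arima_pred, hw_pred, rf_pred], [0.5, 0.25, 0.25]))

# Alternatively, derive the weights from each model's validation error.
weights = inverse_error_weights([1.0, 2.0, 4.0])
print(weights)
print(fuse([arima_pred, hw_pred, rf_pred], weights))
```

Weighting by inverse error gives more influence to the model that fitted the validation data best; equal weights or weights fitted by regression are other common choices.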

Further, in step S5, the test set is input into the integrated model A, and accuracy is adopted as the evaluation index, using the following formula:

[…]

where […] measures the predicted delivery against the actual sales, […] represents the actual sales and […] represents the predicted sales;

the hyperparameters of the integrated model A, including the learning rate and the batch size, are set; training stops when the test-set error of the integrated model A has not decreased for several consecutive iterations; and the hyperparameters of the integrated model A are adjusted according to its performance on the test set to obtain the integrated model B.
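The stopping rule of step S5 (stop once the error has failed to decrease for several consecutive iterations) can be sketched as below. The function name and the `patience` parameter are assumptions, since the source does not give the number of iterations to wait, and the error values are made up for the example.

```python
def early_stopping_iteration(errors, patience):
    """Return the index of the iteration at which training stops.

    errors   -- evaluation error recorded after each iteration
    patience -- number of consecutive non-improving iterations tolerated
    """
    best = float("inf")
    stale = 0
    for i, err in enumerate(errors):
        if err < best:
            best, stale = err, 0      # new best error: reset the counter
        else:
            stale += 1
            if stale >= patience:
                return i              # no improvement for `patience` iterations
    return len(errors) - 1            # ran to the end without triggering


# Hypothetical error curve: improves for three iterations, then stalls.
errors = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0]
print(early_stopping_iteration(errors, patience=3))
print(early_stopping_iteration(errors, patience=4))
```

With a patience of 3 the run stops before the late improvement at the last iteration, while a patience of 4 lets it continue, which shows how sensitive the result is to this setting.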

Further, in step S6, for the rationality prediction of the cigarette market grid capacity, the data of a new grid area is input into the integrated model B and the capacity of that grid is predicted to obtain a prediction result.

The invention provides a big data-based cigarette market grid capacity rationality prediction system, which comprises a business circle defining module, a business circle expanding mode module, a data preprocessing module, a cigarette market grid capacity prediction method module, an evaluation module and a cigarette market grid capacity rationality prediction module;

the business circle defining module defines the concept of the business circle, calculates the probability of customer purchasing behavior at each store, and sends this probability to the business circle expanding mode module;

the business circle expanding mode module receives the purchase probability data sent by the business circle defining module, expands the range with the initial position of the retailer as the initial value to obtain an expanded business circle, and sends the expanded business circle to the data preprocessing module;

the data preprocessing module receives the expanded business circle sent by the business circle expanding mode module, collects business circle data, constructs a business circle data set, divides the business circle data set into a training set and a testing set, sends the training set to the cigarette market grid capacity prediction method module, and sends the testing set to the evaluation module;

the cigarette market grid capacity prediction method module receives the training set sent by the data preprocessing module, trains a model using an integrated learning algorithm that combines ARIMA, Holt-Winters and RF to obtain the integrated model A, and sends the integrated model A to the evaluation module;

the evaluation module receives the integrated model A sent by the cigarette market grid capacity prediction method module and the testing set sent by the data preprocessing module, evaluates the integrated model A using the testing set to obtain the integrated model B, and sends the integrated model B to the cigarette market grid capacity rationality prediction module;

the cigarette market grid capacity rationality prediction module receives the integrated model B sent by the evaluation module, takes the data of a new grid area as input, and predicts the capacity of the new grid area to obtain a predicted value.

By adopting the scheme, the beneficial effects obtained by the invention are as follows:

(1) To address the problem that traditional prediction methods do not consider the complexity of market dynamics and consumer behavior and can collect only limited data, which leads to inaccurate cigarette delivery, business circle data from outside the tobacco database is introduced, intelligent strategies are driven by the data of different market segments, and customized marketing strategies are generated, achieving intelligent and accurate cigarette delivery.

(2) To address the insufficient accuracy and poor stability of model prediction caused by traditional machine learning methods considering the factors that affect cigarette market capacity incompletely, the scheme uses an integrated learning algorithm that combines the advantages of ARIMA, Holt-Winters and RF to improve the accuracy and stability of model prediction.

Drawings

FIG. 1 is a flow diagram of a big data based cigarette market grid capacity rationality prediction method provided by the invention;

FIG. 2 is a schematic diagram of a big data based cigarette market grid capacity rationality prediction system provided by the invention;

FIG. 3 is a flow chart of step S2;

FIG. 4 is a flow chart of step S3;

FIG. 5 is a flow chart of step S4.

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention; all other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In the description of the present invention, it should be understood that the terms "upper," "lower," "front," "rear," "left," "right," "top," "bottom," "inner," "outer," and the like indicate orientation or positional relationships based on those shown in the drawings, merely to facilitate description of the invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the invention.

Embodiment one: referring to FIG. 1, the big data-based cigarette market grid capacity rationality prediction method provided by the invention comprises the following steps:

Embodiment two: referring to FIG. 1 and FIG. 3, this embodiment is based on the above embodiment; in step S2, the business circle expansion method specifically comprises the following steps:

Embodiment three: referring to FIG. 1 and FIG. 4, this embodiment is based on the above embodiment; in step S3, the data preprocessing specifically comprises the following steps:

step S34: dividing a data set into a training set and a testing set;

Performing the above operations solves the problem that traditional prediction methods do not consider the complexity of market dynamics and consumer behavior and can collect only limited data, which leads to inaccurate cigarette delivery.

Embodiment four: referring to FIG. 1 and FIG. 5, this embodiment is based on the above embodiment; in step S4, the cigarette market grid capacity prediction specifically comprises the following steps:


Performing the above operations solves the insufficient accuracy and poor stability of model prediction caused by traditional machine learning methods considering the factors that affect cigarette market capacity incompletely; the scheme uses an integrated learning algorithm that combines the advantages of ARIMA, Holt-Winters and RF to improve the accuracy and stability of model prediction.

Embodiment five: referring to FIG. 2, this embodiment is based on the above embodiment; the big data-based cigarette market grid capacity rationality prediction system provided by the invention comprises a business circle defining module, a business circle expanding mode module, a data preprocessing module, a cigarette market grid capacity prediction method module, an evaluation module, and a cigarette market grid capacity rationality prediction module;

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

The invention and its embodiments have been described above without limitation, and what is shown in the drawings is only one embodiment of the invention; the actual construction is not limited to it. In summary, if a person of ordinary skill in the art, informed by this disclosure and without departing from the gist of the invention, devises without creative effort a structure or embodiment similar to this technical solution, it shall fall within the scope of protection of the invention.

Claims

1. A big data-based cigarette market grid capacity rationality prediction method, characterized by comprising the following steps:

2. The big data based cigarette market grid capacity rationality prediction method of claim 1, wherein: in step S1, the business circle is defined as the spatial range of a retailer's cigarette sales capability and the geographical area over which its cigarette consumers are distributed, and, according to the area interaction theory, the probability that a consumer purchases cigarettes at a retailer is determined by the area of the retailer and the distance between the consumer and the retailer, using the following formula:

$$P_{ij} = \frac{S_j / D_{ij}^{\lambda}}{\sum_{k=1}^{n} S_k / D_{ik}^{\lambda}}$$

where $P_{ij}$ is the probability that a consumer located at $i$ goes to retailer $j$ to purchase cigarettes, $n$ is the total number of retailers in the business circle, $S_j$ is the scale of retailer $j$, $D_{ij}$ is the distance between $i$ and $j$, and $\lambda$ indicates how much importance the customer places on time and distance when purchasing cigarettes.

3. The big data based cigarette market grid capacity rationality prediction method of claim 2, wherein: in step S2, the business circle expansion method includes the following steps:

step S23: expanding the business circle range: centered on the initial grid, the business circle range is expanded as a square, and the area […] of the expanded shopping area is calculated by the same method as for the initial grid; if […], the shopping area continues to be expanded and […] is recalculated, until […], giving the expanded business circle.

4. The big data based cigarette market grid capacity rationality prediction method of claim 3, wherein: in step S4, the cigarette market grid capacity prediction includes the following steps:

；

the second-order difference $\nabla^2 y_t$ of $y_t$ is calculated using the following formula:

$$\nabla^2 y_t = \nabla y_t - \nabla y_{t-1} = y_t - 2y_{t-1} + y_{t-2}$$

the differencing order is calculated using the following formula:

[…]

where […] represents the value of the time series at time point […], […] represents the intercept, […] represents the slope, […] represents the seasonal component of the time series at time point […], and […] is the irregular component;

the three smoothing equations are calculated using the following formulas:

[…]

where […] are the smoothing constants, […] is the level of the time series at time point […], […] is the trend of the time series at time point […], […] is the seasonal component of the time series at time point […], and […] represents the value of the time series at time point […]; the training set is input into the Holt-Winters model, the parameters of the three smoothing equations are solved by maximum likelihood estimation, and the prediction accuracy of the Holt-Winters model is evaluated using the mean square error (MSE) and the mean absolute error (MAE);

step S43: RF model training, comprising the following steps:

step S432: the MGF (mean generating function) prediction model specifically comprises the following steps:

[…]

the mean generating function of the original sequence […] is denoted […], and the mean generating functions of the first-order difference sequence […] and the second-order difference sequence […] are denoted […] and […] respectively; their extension sequences […] are obtained by the formula […];

[…]

5. A big data based cigarette market grid capacity rationality prediction system for implementing the big data based cigarette market grid capacity rationality prediction method according to any one of claims 1-4, characterized in that: the system comprises a business circle defining module, a business circle expanding mode module, a data preprocessing module, a cigarette market grid capacity prediction method module, an evaluation module and a cigarette market grid capacity rationality prediction module.