CN109325808A - Commodity demand prediction and logistics warehouse-allocation planning method based on the Spark big data platform - Google Patents
Commodity demand prediction and logistics warehouse-allocation planning method based on the Spark big data platform - Download PDF / Info
- Publication number
- CN109325808A (application CN201811133491.1A / CN201811133491A)
- Authority
- CN
- China
- Prior art keywords
- commodity
- feature
- window
- data
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0202—Market predictions or forecasting for commercial activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06313—Resource planning in a project environment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/08—Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
Abstract
The present invention provides a commodity demand prediction and logistics warehouse-allocation planning method based on the Spark big data platform, comprising the following steps: Q1, data preprocessing; Q2, feature construction; Q3, feature selection; Q4, model selection; Q5, fusion of the model prediction result with the rule-based prediction result, with fusion coefficients 0.75·model + 0.25·rule. By predicting smart-home commodity demand and planning warehouse allocation on the Spark big data platform, the present invention can effectively help smart-home merchants greatly reduce operating costs, shorten delivery times, and improve the user experience, and is better suited to practical commercial scenarios in which data volume grows rapidly.
Description
Technical field
The present invention relates to the technical field of big data analysis applications, in particular to e-commerce; more particularly, in smart-home product e-commerce, it relates to a commodity demand prediction and logistics warehouse-allocation planning method based on the Spark big data platform, used to meet the demand-prediction needs of e-commerce commodities.
Background technique
With the rapid development of science and technology, the Internet has brought a variety of convenient services, and e-commerce has become increasingly complex. The smart home is an embodiment of instrumentation under the influence of the Internet: through Internet-of-Things technology, it connects the various devices in a home (such as audio and video equipment, lighting systems, curtain controls, air-conditioning controls, security systems, digital theater systems, audio-visual servers, and networked home appliances) and provides functions and means such as home appliance control, lighting control, remote control by telephone, indoor and outdoor remote control, burglar alarms, environmental monitoring, HVAC control, infrared forwarding, and programmable timer control. Compared with an ordinary household, a smart home not only retains traditional residential functions but also integrates building automation, network communication, information appliances, and equipment automation, providing comprehensive information exchange and even savings on various energy costs.
In the smart-home e-commerce market, delivery time and price are the two key factors users consider. In general, improvements in delivery time and reductions in price always run into hard outer limits, so a solution must be sought at a deeper level. A warehouse-allocation stocking service means that, based on sales forecasts, a smart-home merchant stocks goods in warehouses in advance, so that shipment and regional distribution are handled from the nearest warehouse; without building its own warehouses, the merchant can easily use the costly logistics system of a first-tier e-commerce platform and achieve extremely fast delivery. With the stimulus of various holidays and e-commerce platforms, online promotions in forms such as hot items and flash sales will become the norm. If a smart-home merchant ships from a traditional single warehouse, problems such as large cross-province shipment volumes, high logistics costs, long delivery times, and customer complaints are difficult to avoid. Therefore, brands with higher standardization and deeper inventory should consider stocking allocated warehouses in advance.
During big promotions, what consumers care about most is how quickly express deliveries arrive. The most effective approach is to use big data and algorithms to place goods directly in the warehouse nearest the consumer; a big-data-driven supply chain can help merchants substantially reduce operating costs and improve the user experience, playing an important role in the efficiency of the entire smart-home e-commerce industry. High-quality smart-home commodity demand prediction is the foundation and core function of supply chain management, and realizing high-quality demand prediction is a further step toward an intelligent supply-chain platform. Therefore, how to achieve more accurate demand forecasting, so that goods are placed directly in the warehouse nearest the consumer while management costs are greatly optimized, is an essential problem in urgent need of a solution.
Spark is a fast, efficient, memory-based distributed big data processing framework that supports DAG-based distributed parallel computation. It can use the data sources and file systems supported by Hadoop to store data, including HDFS, HBase, Hive, Cassandra, and others. It can be deployed either on an individual server or on a distributed resource management framework such as Mesos or YARN, and it provides APIs in three programming languages: Scala, Java, and Python. Using the APIs that Spark provides, developers can create Spark-based applications through standard API interfaces.
An RDD (Resilient Distributed Dataset) is an abstract data type and the representation of data in Spark; it is the most central module and class in Spark and the essence of its design. It can be regarded as a large collection with a fault-tolerance mechanism; Spark provides a persist mechanism to cache it in memory, which facilitates iterative computation and repeated reuse. An RDD is a partitioned record: its partitions can be distributed across different physical machines, which better supports parallel computation. An RDD also has the property of elasticity: during job execution, when a machine's memory overflows, the RDD can spill to disk; although efficiency is reduced, the normal operation of the job is guaranteed. Two kinds of operations can be performed on an RDD: transformations and actions.
Transformation: an existing RDD is converted into a new RDD through a series of function operations, i.e., the return value is still an RDD, and RDDs can be transformed repeatedly. Since an RDD is stored in a distributed manner, the entire transformation process is also carried out in parallel. Common transformation higher-order functions include map, flatMap, reduceByKey, and so on.
Action: the return value is not an RDD. It may be an ordinary Scala collection, a single value, or empty, and it is ultimately either returned to the Driver program or written to a file system. Examples include functions such as reduce, saveAsTextFile, and collect.
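Since a full Spark installation may not be at hand, the lazy-transformation versus eager-action distinction described above can be illustrated with a plain-Python analogue; the `MiniRDD` class below is a toy stand-in of our own invention, not Spark's actual API.

```python
# Illustrative pure-Python analogue (not actual Spark) of lazy
# transformations vs. eager actions. All names are hypothetical.
class MiniRDD:
    def __init__(self, data_fn):
        self._data_fn = data_fn  # deferred: nothing is computed yet

    @classmethod
    def parallelize(cls, seq):
        return cls(lambda: iter(seq))

    # Transformations: return a new MiniRDD, evaluated lazily.
    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._data_fn()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._data_fn() if pred(x)))

    # Actions: force evaluation and return a plain value.
    def collect(self):
        return list(self._data_fn())

    def reduce(self, f):
        from functools import reduce as _reduce
        return _reduce(f, self._data_fn())

rdd = MiniRDD.parallelize(range(1, 6))
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)  # still lazy
print(doubled.collect())                   # [6, 8, 10]
print(doubled.reduce(lambda a, b: a + b))  # 24
```

Nothing runs until `collect` or `reduce` is called, mirroring how Spark builds a DAG of transformations and only executes it when an action is encountered.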
When a Spark application encounters an action operation, the SparkContext generates a Job and divides each Job into different stages (each Job is split into many groups of Tasks; each group of tasks is called a Stage, also known as a TaskSet; a Task is the unit of work sent to some Executor). Each Spark application obtains its own exclusive Executors; an Executor corresponds to one JVM process and is responsible for running Tasks, so Tasks from different applications run in different JVM processes. An Executor process is resident throughout the lifetime of the application and runs Tasks with multiple threads. Each node can run one or more Executors; each Executor consists of several cores and memory, and each core of each Executor can execute only one Task at a time. The result of each Task is one partition of the target RDD. The concurrency with which Tasks execute equals the number of Executors multiplied by the number of cores per Executor.
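The concurrency formula at the end of the paragraph above is a simple product; a trivial worked example with assumed cluster figures:

```python
# Task concurrency = (number of executors) x (cores per executor).
# The figures below are made-up example values, not from the patent.
executors = 4
cores_per_executor = 3
concurrency = executors * cores_per_executor
print(concurrency)  # 12 tasks can execute simultaneously
```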
Summary of the invention
In view of the problems described in the background above, it is an object of the present invention to provide a commodity demand prediction and logistics warehouse-allocation planning method based on the Spark big data platform, which solves the low efficiency and high cost of existing smart-home commodity demand prediction and warehouse-allocation planning, effectively helps smart-home merchants greatly reduce operating costs, shortens delivery times, and improves the user experience.
In order to achieve the above object, the invention provides the following technical scheme:
A commodity demand prediction and logistics warehouse-allocation planning method based on the Spark big data platform, characterized by comprising the following steps:
Q1, data preprocessing: obtain the relevant data files from the database, including commodity-granularity features, user behavior features of related commodities, commodity and warehouse-region granularity features, and related information such as the understock and overstock costs of each warehouse region; then, for commodities in the database that have no sales records because of recent listing or delisting, fill the missing values with 0 to guarantee data continuity;
That is: create a SparkContext object, then use its textFile(URL) function to create a distributed dataset RDD. The data in the RDD include smart-home commodity-granularity features (ID, category, brand, date, price); user behavior features of related commodities (number of views, number of add-to-cart events, number of purchases, traffic); and commodity and warehouse-region granularity features such as the understock and overstock costs of each warehouse region. The distributed dataset thus created can be operated on in parallel. Next, call the mapPartitions operator on the samples of the form <feature 1, feature 2, ..., feature m>, using 0 to fill the relevant fields of commodities that have no sales records due to recent listing or delisting, so as to guarantee data continuity. Call the zipWithIndex operator to attach an index label to each sample, converting the created RDD to the form <label, commodity ID, warehouse code, feature 1, feature 2, ..., feature m>. Finally, call the filter operator to split the entire dataset by commodity transaction date into a test set TestRDD and a training set TrainRDD, and call the persist operator to persist the resulting TrainRDD in memory;
Q2, feature construction: use a sliding-window technique to construct features. Call the mapPartitions operator on TrainRDD and write the corresponding statistical functions, building the corresponding features from the relevant information of the samples on each partition (window) over different time periods. The N days after each chosen time point form a window, and the total sales of each commodity and warehouse inventory in that window serve as the label; the N days before the time point form a window used for feature construction, sliding over M windows in total. For the N days before each window, compute the sum and the average avg of the various categorical features; compute statistics of each commodity's transaction counts over the most recent N days, including maximum, minimum, and standard deviation; compute the same statistics for its category id over the most recent N days, including maximum, minimum, standard deviation, rank, and share. The total sales of the following N days serve as the label, sliding over M windows. After a series of data transformations, the created TrainRDD is converted to the form <label, commodity ID, warehouse code, feature 1, feature 2, ..., feature m, feature m+1, ..., feature n, label>;
Q3, feature selection: use xgboost to select the top-k features by importance ranking, calculate similarity, and remove redundant features. Take the feature values constructed in Q2, train an xgboost model on them to obtain the importance ranking of the features, choose the top-k important features, compute their pairwise similarity, and eliminate the unimportant or redundant features;
Q4, model selection: train multiple regression models. First, train multiple regression models on TrainRDD using algorithms from the Spark MLlib machine-learning library (LR, SVR, RF, GBRT) and the third-party distributed learning algorithm XGBoost, and call the union operator to merge the prediction results of all models, defining the result as model_RDD. Second, call the groupBy operator to aggregate by commodity ID. Finally, call the map operator: if, for some (commodity ID, warehouse code), the understock cost is greater than the overstock cost, it is preferable to over-predict, so take the maximum among the single-model predictions multiplied by 1.1; otherwise take the minimum among the single-model predictions multiplied by 0.9. After a series of data transformations, this yields a model learning result of the form <commodity ID, warehouse code, target inventory of the warehouse region for the future time period>;
Q5, fusion of the model prediction result with the rule-based prediction result, with fusion coefficients 0.75·model + 0.25·rule, where the rule learning is defined as: denote the sales of the N days before the prediction window as day1, day2, ..., dayN; for each commodity, if the understock cost is greater than the overstock cost, predict N·max(day1, day2, ..., dayN), otherwise predict N·min(day1, day2, ..., dayN). Finally, TestRDD is converted to the form <commodity ID, warehouse code, base inventory of the warehouse region for the future time period>, defined as rule_RDD, ultimately yielding the base inventory of each commodity in each warehouse region for a certain future period.
By predicting smart-home commodity demand and planning warehouse allocation on the Spark big data platform, the present invention can effectively help smart-home merchants greatly reduce operating costs, shorten delivery times, and improve the user experience, and is better suited to practical commercial scenarios in which data volume grows rapidly.
Detailed description of the invention
Fig. 1 is a flow diagram of the present invention.
Fig. 2 is a diagram of the RDD changes in the feature engineering stage of the present invention.
Specific embodiment
The technical solution of the present invention is described clearly and completely below in conjunction with the drawings and embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
As shown in Figs. 1 and 2, the present invention takes the e-commerce and logistics warehousing of smart-home products as an embodiment to illustrate the commodity demand prediction and logistics warehouse-allocation planning method based on the Spark big data platform, which comprises the following steps:
Q1, data preprocessing stage:
Obtain the relevant data from the smart-home product databases and integrate the multi-table data for convenient subsequent use; then carry out preprocessing: for commodities whose records show no sales because of recent listing or delisting, fill the missing values with 0 to guarantee data continuity. Next, choose whether to normalize the data as needed. Finally, split the data as a whole, according to the length of time to be predicted, into a training set, a validation set, and a test set.
Create a SparkContext object, then use its textFile(URL) function to create a distributed dataset RDD. The data in the RDD include smart-home commodity-granularity features (ID, category, brand, date, price); user behavior features of related commodities (number of views, number of add-to-cart events, number of purchases, traffic); and commodity and warehouse-region granularity features such as the understock and overstock costs of each warehouse region. The distributed dataset thus created can be operated on in parallel. Next, call the mapPartitions operator on the samples of the form <feature 1, feature 2, ..., feature m>, using 0 to fill the relevant fields of commodities that have no sales records due to recent listing or delisting, so as to guarantee data continuity. Call the zipWithIndex operator to attach an index label to each sample, converting the created RDD to the form <label, commodity ID, warehouse code, feature 1, feature 2, ..., feature m>. Finally, call the filter operator to split the entire dataset by commodity transaction date into a test set TestRDD and a training set TrainRDD, and call the persist operator to persist the resulting TrainRDD in memory.
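As a rough illustration of the preprocessing steps above (zero-filling, sample indexing, date-based splitting), the following pure-Python sketch uses hypothetical field names and a hypothetical split date; the actual method performs these steps with Spark operators (mapPartitions, zipWithIndex, filter) on an RDD:

```python
from datetime import date

# Hypothetical raw records; a None sales value stands for a commodity
# with no sales record due to recent listing or delisting.
records = [
    {"item_id": 1, "store_code": "0001", "dt": date(2015, 7, 1), "sales": 5},
    {"item_id": 2, "store_code": "0001", "dt": date(2015, 7, 1), "sales": None},
    {"item_id": 1, "store_code": "0001", "dt": date(2015, 12, 1), "sales": 3},
]

# Fill missing sales with 0 (mapPartitions analogue).
filled = [{**r, "sales": r["sales"] if r["sales"] is not None else 0}
          for r in records]

# Attach an index label to each sample (zipWithIndex analogue).
labeled = [(i, r) for i, r in enumerate(filled)]

# Split by transaction date into train and test sets (filter analogue).
split_date = date(2015, 11, 1)  # assumed cut-off, for illustration only
train = [(i, r) for i, r in labeled if r["dt"] < split_date]
test = [(i, r) for i, r in labeled if r["dt"] >= split_date]

print(len(train), len(test))  # 2 1
print(train[1][1]["sales"])   # 0  (zero-filled)
```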
The data structures are shown in Table 1 and Table 2 below:
Table 1: commodity-granularity features
Table 2: understock and overstock costs of commodities by warehouse region
Field | Type | Meaning | Example |
item_id | bigint | Commodity ID | 333442 |
store_code | String | Warehouse code | 1 |
money_a | String | Commodity understock cost | 10.44 |
money_b | String | Commodity overstock cost | 20.88 |
Q2, feature construction:
Use a sliding-window technique to construct features. The N days after each chosen time point form a window, and the total sales of each commodity and warehouse inventory in that window serve as the label; the N days before the time point form a window used for feature construction, sliding over M windows in total:
For the 1/2/3/5/7/9/.../N days before each window, compute the sum and the average avg of the various categorical features; compute statistics of each commodity's transaction counts over the most recent N days, including maximum, minimum, and standard deviation; compute the same statistics for its category id over the most recent N days, including maximum, minimum, standard deviation, rank, and share, as well as polynomial cross-feature values.
The period from July 13, 2015 to December 10, 2015 is selected for training, sliding over 11 windows with a window length of two weeks (14 days) for feature extraction. The features include the sums and averages avg of the various categorical features over the preceding 1/2/3/5/7/9/.../14 days; statistics of each commodity's transaction counts over the most recent 14 days, including maximum, minimum, and standard deviation; the same statistics for its category id over the most recent 14 days, including maximum, minimum, standard deviation, rank, and share; and polynomial cross-feature values. On the last day of each window, the total number of units sold in the following 14 days is summed and used as the label. The sliding windows are explained in Table 3.
Table 3: sliding-window date explanation
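The sliding-window construction described above can be sketched in a few lines of pure Python; the window length N=3 and the daily sales series below are made up for illustration (the embodiment uses 14-day windows on real sales data):

```python
# Sliding-window feature construction sketch: for a daily sales series,
# the N days before each cut-off yield the features and the N days
# after it yield the label (total sales of the following window).
def window_features(sales, cutoff, n):
    before = sales[cutoff - n:cutoff]  # feature window
    after = sales[cutoff:cutoff + n]   # label window
    feats = {
        "sum": sum(before),
        "avg": sum(before) / n,
        "max": max(before),
        "min": min(before),
    }
    label = sum(after)  # total sales of the following N days
    return feats, label

daily_sales = [4, 2, 6, 1, 3, 5, 2, 8]
feats, label = window_features(daily_sales, cutoff=3, n=3)
print(feats)  # {'sum': 12, 'avg': 4.0, 'max': 6, 'min': 2}
print(label)  # 1 + 3 + 5 = 9
```

Sliding the cut-off forward day by day (or window by window) produces the M training windows described above.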
Q3, feature selection:
Based on Spark, convert the data to numeric form, use xgboost to select the top-k features by importance, calculate similarity, and remove redundant features. The concrete operations are as follows: call the distributed version of xgboost to compute the importance of the n input features of TrainRDD. Then call the sortBy and filter operators to choose the top-k features by importance; at this point TrainRDD is converted to the form <label, commodity ID, warehouse code, feature x1, feature x2, ..., feature xk, label>. Finally, call the mapPartitions operator to calculate the Pearson correlation coefficients between features, reject redundant features according to the similarity between them, and call the persist operator to persist the resulting TrainRDD in memory. For example: suppose 400 candidate features are constructed and an xgboost model is trained on them; the model outputs an importance coefficient for each feature, and we select the top 40, i.e., the 40 features whose importance ranks highest. However, these 40 features may contain redundancy, so the similarity between features is computed; common similarity measures include the Pearson correlation coefficient and cosine similarity. If, say, feature 1 and feature 10 among these 40 features have a similarity as high as 0.999, then either feature 1 or feature 10 can be removed, keeping only one of them; which one to remove also depends on its relationship with the other features.
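The redundancy-removal step above (Pearson correlation between top-k features) can be sketched as follows; the feature values are invented for illustration, and the 0.95 threshold is an assumed choice, not one given in the patent:

```python
import math

# Compute the Pearson correlation between two candidate features and
# drop one feature of a pair whose correlation exceeds a threshold.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

feature_1 = [1.0, 2.0, 3.0, 4.0]
feature_10 = [2.1, 4.0, 6.1, 8.0]   # nearly 2x feature_1: redundant
feature_7 = [5.0, 1.0, 4.0, 2.0]

r = pearson(feature_1, feature_10)
print(round(r, 3))  # very close to 1.0, so the pair is redundant
kept = (["feature_1", "feature_7"] if r > 0.95
        else ["feature_1", "feature_10", "feature_7"])
print(kept)         # ['feature_1', 'feature_7']
```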
Q4, model selection:
Train multiple regression models on TrainRDD by successively calling algorithms from the Spark MLlib machine-learning library such as LR, SVR, RF, and GBRT, together with the third-party distributed learning algorithm XGBoost, and call the union operator to merge the prediction results of all models, defining the result as model_RDD. Then call the groupBy operator to aggregate by commodity ID. Finally, call the map operator: if, for some (commodity ID, warehouse code), the understock cost is greater than the overstock cost, it is preferable to over-predict, so take the maximum among the single-model predictions multiplied by 1.1; otherwise take the minimum among the single-model predictions multiplied by 0.9. After a series of data transformations, this yields a model learning result of the form <commodity ID, warehouse code, base inventory of the warehouse region for the future time period>.
Table 4: example prediction values of each model
Commodity | Warehouse | LR | SVR | RF | GBRT | XGBOOST |
b | 0002 | 30 | 45 | 54 | 100 | 10 |
c | 0003 | 40 | 60 | 70 | 20 | 10 |
If the overstock cost of smart-home commodity b is 10 yuan and the understock cost is 100 yuan, then the predicted value of this commodity in warehouse 0002 is 100 × 1.1 = 110;
If the overstock cost of smart-home commodity c is 80 yuan and the understock cost is 40 yuan, then the predicted value of this commodity in warehouse 0003 is 10 × 0.9 = 9;
Q5, fusion of the model prediction result with the rule-based prediction result:
Denote the sales of the N days before the prediction window as day1, day2, ..., dayN. For each commodity, if the understock cost is greater than the overstock cost, predict N·max(day1, day2, ..., dayN); otherwise predict N·min(day1, day2, ..., dayN).
For example: denote the sales of the two weeks before the prediction window as sale1 and sale2; for each (commodity, warehouse), if the understock cost is greater than the overstock cost, predict 2·max(sale1, sale2), otherwise predict 2·min(sale1, sale2).
The fused prediction combines the model prediction result with the rule prediction result using fusion coefficients 0.75·model + 0.25·rule. As shown in Fig. 2, the models are first fused with one another, giving a fusion result M1, which is then fused with the rule, giving a result M2. Using model M2 and historical data such as that in Tables 1 and 2, the Spark big data platform can predict, for different smart-home products, the future stock quantities they require in each warehouse, as in Table 4. Compared with Table 4, which shows single-model outputs, M2 is the fusion of multiple models and the rule, and its effect can be better.
The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by those familiar with the art within the technical scope disclosed by the present invention shall be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (1)
1. the demand for commodity prediction based on Spark big data platform divides storehouse planing method with logistics, it is characterised in that: including as follows
Step:
Q1. Data preprocessing: obtain the associated data files from the database, including commodity-granularity features, user-behavior features of the relevant commodities, commodity and warehouse-region granularity features, and related information such as the under-replenishment and over-replenishment costs of each warehouse region. Zero-fill the records of commodities that have no sales record in the database because they were recently listed or delisted, so as to guarantee data continuity. That is: create a SparkContext object, then use its textFile(URL) function to create a distributed data set (RDD). The data in the RDD comprise smart-home commodity-granularity features, including ID, category, brand, date and price; user-behavior features of the relevant commodities, including browse count, add-to-cart count, purchase count and traffic; and commodity and warehouse-region granularity features such as cost and the under- and over-replenishment penalties of the warehouse region. The created distributed data set can be operated on in parallel. Next, call the mapPartitions operator to zero-fill, for commodities with no sales record due to recent listing or delisting, the relevant fields of samples of the form <feature 1, feature 2, …, feature m>, guaranteeing data continuity. Call the zipWithIndex operator to attach a label to each sample, converting the created RDD to the form <label, commodity ID, warehouse code, feature 1, feature 2, …, feature m>. Finally, call the filter operator to split the entire data set by commodity transaction date into a test set TestRDD and a training set TrainRDD, and call the persist operator to cache the resulting TrainRDD in memory.
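A minimal single-machine sketch of this pipeline, with hypothetical field names; on the actual platform each step would be the corresponding RDD operator (mapPartitions for the zero-fill, zipWithIndex for the labelling, filter for the date split, persist for the in-memory cache):

```python
from datetime import date

def zero_fill(sample, n_features):
    # mapPartitions step: pad missing feature fields with 0 so every
    # sample has the full <feature 1, ..., feature m> shape.
    return sample + [0] * (n_features - len(sample))

def preprocess(rows, split_date, n_features):
    labelled = []
    for idx, (commodity_id, wh_code, tx_date, features) in enumerate(rows):
        # zipWithIndex step: attach a running label to each sample.
        labelled.append((idx, commodity_id, wh_code, tx_date,
                         zero_fill(features, n_features)))
    # filter step: split into training and test sets by transaction date.
    train = [r for r in labelled if r[3] < split_date]
    test = [r for r in labelled if r[3] >= split_date]
    return train, test

rows = [("sku1", "wh01", date(2018, 6, 1), [1.0, 2.0]),
        ("sku2", "wh01", date(2018, 9, 1), [3.0])]
train, test = preprocess(rows, date(2018, 8, 1), n_features=3)
print(len(train), len(test))  # -> 1 1
```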
Q2. Feature construction: call the mapPartitions operator on TrainRDD and construct features with the sliding-window technique, writing the corresponding statistical functions that build features from the sample information of each partition (window) over different time periods. Take the N days after each specific time point as a window; within that window, the total sales volume of each commodity at each warehouse serves as the label value, and M windows are slid. Taking the N days before the specific time point as a window, carry out feature construction: count the sums and averages (avg) of the various category feature values in the N days before the window; count the feature values of each commodity's transaction counts over the most recent N days, including maximum, minimum and standard deviation; count the feature values of its category id over the most recent N days of transaction counts, including maximum, minimum, standard deviation, rank and proportion; and take the total sales volume of the following N days as the label. Slide M windows. Through a series of data transformations the created TrainRDD is converted to the form <label, commodity ID, warehouse code, feature 1, feature 2, …, feature m, feature m+1, …, feature n, label>.
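The window construction can be illustrated on a plain list of daily sales (a simplified sketch: only a few of the statistics named above are computed, and the category-level and rank features are omitted):

```python
from statistics import mean, pstdev

def window_features(daily_sales, n):
    """Build one training sample from a 2N-day slice:
    features from the first N days, label from the next N days."""
    past, future = daily_sales[:n], daily_sales[n:2 * n]
    features = {
        "sum": sum(past), "avg": mean(past),
        "max": max(past), "min": min(past), "std": pstdev(past),
    }
    label = sum(future)  # total sales of the following N days
    return features, label

def slide(daily_sales, n, m):
    # Slide M windows, one day at a time, over the sales series.
    return [window_features(daily_sales[i:], n) for i in range(m)]

samples = slide([5, 7, 6, 8, 9, 4, 3, 6], n=3, m=2)
print(samples[0][1])  # label of the first window: 8 + 9 + 4 = 21
```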
Q3. Feature selection: use xgboost to select the top-k features and compute similarities to remove redundant features. Take the feature values constructed in Q2, train an xgboost model on them to obtain the importance ranking of the features, choose the top-k important features, compute the similarities among them, and weed out the unimportant (redundant) features.
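A sketch of the selection step in plain Python. The importance scores would come from the trained xgboost model; here they are hard-coded hypothetical values, and similarity is measured with the Pearson correlation of the feature columns:

```python
def pearson(x, y):
    # Pearson correlation of two equally long numeric sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def select_features(importance, columns, k, sim_threshold=0.95):
    # Rank by importance, keep the top-k, then drop any feature that is
    # near-duplicated (|correlation| above threshold) by a kept feature.
    topk = sorted(importance, key=importance.get, reverse=True)[:k]
    kept = []
    for feat in topk:
        if not any(abs(pearson(columns[feat], columns[p])) > sim_threshold
                   for p in kept):
            kept.append(feat)
    return kept

# Hypothetical importances and columns: "f2" is just 2x "f1".
imp = {"f1": 0.5, "f2": 0.3, "f3": 0.2}
cols = {"f1": [1, 2, 3, 4], "f2": [2, 4, 6, 8], "f3": [4, 1, 3, 2]}
print(select_features(imp, cols, k=3))  # -> ['f1', 'f3']
```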
Q4. Model selection: train multiple regression models. First, train multiple regression models on TrainRDD using the LR, SVR, RF and GBRT algorithms of the Spark MLlib machine-learning library and the third-party distributed learning algorithm XGBoost, and merge the prediction results of the models with the union operator into model_RDD. Next, call the groupBy operator to aggregate by commodity ID. Finally, call the map operator: for each (commodity ID, warehouse code), if the under-replenishment cost is greater than the over-replenishment cost, we prefer to predict more, so take the maximum of the single-model predictions multiplied by 1.1; otherwise take the minimum of the single-model predictions multiplied by 0.9. Through a series of data transformations this yields the model learning result in the form <commodity ID, warehouse code, target stock of the warehouse region in the future time period>.
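The per-(commodity, warehouse) combination step can be sketched in plain Python with hypothetical forecast values (the 1.1/0.9 factors are those stated above; on the platform this would run inside the map operator after the groupBy):

```python
def merge_predictions(preds, under_cost, over_cost):
    # Combine the single-model forecasts for one (commodity ID,
    # warehouse code): lean high when under-replenishing is the
    # costlier mistake, lean low when over-replenishing is.
    if under_cost > over_cost:
        return max(preds) * 1.1
    return min(preds) * 0.9

# Hypothetical forecasts from the LR, SVR, RF, GBRT and XGBoost models.
preds = [100.0, 110.0, 95.0, 105.0, 120.0]
print(merge_predictions(preds, under_cost=2.0, over_cost=1.0))  # max * 1.1
print(merge_predictions(preds, under_cost=1.0, over_cost=2.0))  # min * 0.9
```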
Q5. Fusion: the model prediction result is fused with the rule prediction result with fusion coefficients 0.75model + 0.25rule, where the rule learning is defined as follows: the sales volumes of the N days before the prediction window are denoted day1, day2, …, dayN respectively; for each commodity, if the under-replenishment cost is greater than the over-replenishment cost, the prediction is N*max(day1, day2, …, dayN), otherwise N*min(day1, day2, …, dayN). Finally, TestRDD is converted to the form <commodity ID, warehouse code, target stock of the warehouse region in the future time period>, defined as rule_RDD, which yields the base stock of each commodity in each warehouse region for the coming period.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811133491.1A CN109325808A (en) | 2018-09-27 | 2018-09-27 | Commodity-demand prediction and logistics warehouse-partition planning method based on the Spark big data platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109325808A true CN109325808A (en) | 2019-02-12 |
Family
ID=65266412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811133491.1A Withdrawn CN109325808A (en) | 2018-09-27 | 2018-09-27 | Commodity-demand prediction and logistics warehouse-partition planning method based on the Spark big data platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109325808A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107122928A (en) * | 2016-02-24 | 2017-09-01 | 阿里巴巴集团控股有限公司 | A kind of supply chain Resource Requirement Planning collocation method and device |
CN106599935A (en) * | 2016-12-29 | 2017-04-26 | 重庆邮电大学 | Three-decision unbalanced data oversampling method based on Spark big data platform |
CN108399457A (en) * | 2018-02-02 | 2018-08-14 | 西安电子科技大学 | There are the Boosting improved methods converted based on multistep label under inclined data in integrated study |
CN109582706A (en) * | 2018-11-14 | 2019-04-05 | 重庆邮电大学 | The neighborhood density imbalance data mixing method of sampling based on Spark big data platform |
Non-Patent Citations (2)
Title |
---|
LIMEIYANG: "Cainiao: Demand Forecasting and Warehouse Partition Planning Solution", 《HTTPS://GITHUB.COM/LIMEIYANG/CAINIAO》 * |
CAINIAO NETWORK: "Cainiao: Demand Forecasting and Warehouse Partition Planning — Problem Statement and Data", 《HTTPS://TIANCHI.ALIYUN.COM/COMPETITION/ENTRANCE/231530/INFORMATION?FROM=OLDURL》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111768139A (en) * | 2019-06-27 | 2020-10-13 | 北京沃东天骏信息技术有限公司 | Stock processing method, apparatus, device and storage medium |
CN110688623A (en) * | 2019-09-29 | 2020-01-14 | 深圳乐信软件技术有限公司 | Training optimization method, device, equipment and storage medium of high-order LR model |
CN110688623B (en) * | 2019-09-29 | 2023-12-26 | 深圳乐信软件技术有限公司 | Training optimization method, device, equipment and storage medium for high-order LR model |
CN110956272A (en) * | 2019-11-01 | 2020-04-03 | 第四范式(北京)技术有限公司 | Method and system for realizing data processing |
CN110956272B (en) * | 2019-11-01 | 2023-08-08 | 第四范式(北京)技术有限公司 | Method and system for realizing data processing |
CN111190110A (en) * | 2020-01-13 | 2020-05-22 | 南京邮电大学 | Lithium ion battery SOC online estimation method comprehensively considering internal and external influence factors |
CN112100182A (en) * | 2020-09-27 | 2020-12-18 | 中国建设银行股份有限公司 | Data warehousing processing method and device and server |
CN112308665A (en) * | 2020-10-26 | 2021-02-02 | 福建菩泰网络科技有限公司 | Goods distribution method and system for online shopping mall |
CN112597213A (en) * | 2020-12-24 | 2021-04-02 | 第四范式(北京)技术有限公司 | Batch request processing method and device for feature calculation, electronic equipment and storage medium |
CN112597213B (en) * | 2020-12-24 | 2023-11-10 | 第四范式(北京)技术有限公司 | Batch request processing method and device for feature calculation, electronic equipment and storage medium |
CN113642958A (en) * | 2021-08-05 | 2021-11-12 | 大唐互联科技(武汉)有限公司 | Warehouse replenishment method, device, equipment and storage medium based on big data |
CN113642958B (en) * | 2021-08-05 | 2024-06-04 | 大唐互联科技(武汉)有限公司 | Warehouse replenishment method, device, equipment and storage medium based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109325808A (en) | Commodity-demand prediction and logistics warehouse-partition planning method based on the Spark big data platform | |
Hofmann et al. | Big data analytics and demand forecasting in supply chains: a conceptual analysis | |
Kim et al. | Optimal inventory control in a multi-period newsvendor problem with non-stationary demand | |
Vahdani et al. | A hybrid multi-stage predictive model for supply chain network collapse recovery analysis: a practical framework for effective supply chain network continuity management | |
CA3235875A1 (en) | Method and system for generation of at least one output analytic for a promotion | |
CN101783004A (en) | Fast intelligent commodity recommendation system | |
JP2018533807A (en) | System and method for providing a multi-channel inventory allocation approach to retailers | |
US10528903B2 (en) | Computerized promotion and markdown price scheduling | |
CN109961198B (en) | Associated information generation method and device | |
CN110109901B (en) | Method and device for screening target object | |
US20160034952A1 (en) | Control apparatus and accelerating method | |
CN109214587A (en) | A commodity-demand prediction and logistics warehouse-partition planning method based on three-way decisions | |
Harsoor et al. | Forecast of sales of Walmart store using big data applications | |
CN108777701A (en) | A kind of method and device of determining receiver | |
CN109558992A (en) | Sales-peak prediction method, device, equipment and storage medium based on vending machines | |
CN112036631B (en) | Purchasing quantity determining method, purchasing quantity determining device, purchasing quantity determining equipment and storage medium | |
Behera et al. | Grid search optimization (GSO) based future sales prediction for big mart | |
CN112365283A (en) | Coupon issuing method, device, terminal equipment and storage medium | |
CA3131040A1 (en) | Method and system for optimizing an objective having discrete constraints | |
CN113763035A (en) | Advertisement delivery effect prediction method and device, computer equipment and storage medium | |
CN109190027A (en) | Multi-source recommended method, terminal, server, computer equipment, readable medium | |
CN108629467B (en) | Sample information processing method and system | |
US20210312259A1 (en) | Systems and methods for automatic product usage model training and prediction | |
CN111353794A (en) | Data processing method, supply chain scheduling method and device | |
Polder et al. | Complementarities between Information Technologies and Innovation Modes in the Adoption and Outcome Stage: A MicroEconometric Analysis for the Netherlands. CAED conference |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 2019-04-15

Address after: 7-1, Building 10, No. 26 Photoelectric Road, Nanan District, Chongqing 400000

Applicant after: Shu Haidong

Address before: 13-8, Building 9, No. 168 Caiyuan Road, Yuzhong District, Chongqing 400010

Applicant before: Chongqing Zhiwanjia Technology Co., Ltd. |
|
TA01 | Transfer of patent application right | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20190212 |
|
WW01 | Invention patent application withdrawn after publication |