CN112150209B

CN112150209B - Construction method of CNN-LSTM time sequence prediction model based on clustering center

Info

Publication number: CN112150209B
Application number: CN202011080703.1A
Authority: CN
Inventors: 彭龙; 黄炎焱; 杜振华; 华宇浩; 张钊浩; 宋雅杰; 何新
Original assignee: Jiangsu Huiyucheng Intelligent Equipment Research Institute Co ltd; Xiaotu Intelligent Technology Nanjing Co ltd; Nanjing University of Science and Technology
Current assignee: Jiangsu Huiyucheng Intelligent Equipment Research Institute Co ltd; Xiaotu Intelligent Technology Nanjing Co ltd; Nanjing University of Science and Technology
Priority date: 2020-06-19
Filing date: 2020-10-10
Publication date: 2022-10-18
Anticipated expiration: 2040-10-10
Also published as: CN112150209A; CN112150210B; CN112150210A

Abstract

The invention discloses a method for constructing a CNN-LSTM time series prediction model based on clustering centers. The method relates to the field of mathematical modeling and comprises the following steps: S1, collecting sample data; S2, preprocessing the data; S3, clustering the samples; S4, establishing a CNN-LSTM model for prediction; and S5, analyzing the prediction result. The invention clusters drug sales data and establishes a CNN-LSTM model. The purpose of clustering is to find the associations among drugs and to group drug time series with high similarity into one cluster. Time series prediction models are then established based on the clustering centers, and drugs in the same cluster are predicted with the same model, which improves the applicability of the model and widens its range of application.

Description

Construction method of CNN-LSTM time sequence prediction model based on clustering center

Technical Field

The invention relates to the technical field of mathematical modeling, in particular to a method for constructing a CNN-LSTM time sequence prediction model based on a clustering center.

Background

Cluster analysis is a statistical method for studying the classification of samples or indicators, and is also an important data mining technique. A cluster is composed of several patterns, each of which is typically a vector of measurements, or equivalently a point in a multidimensional space. Cluster analysis is based on similarity: patterns in the same cluster are more similar to one another than to patterns in different clusters.

Different types of traditional Chinese medicinal materials have different shelf lives, and their prices fluctuate frequently, so the purchase quantity is difficult to control. To keep medicinal materials from losing their properties after long storage, and to save funds, a drug sales prediction model is needed to forecast sales and thereby provide a reference for purchase quantities. Scholars at home and abroad have studied time series prediction extensively and proposed many models, but each model has its own characteristics, and no scheme or model has an absolute advantage for predicting product sales.

Disclosure of Invention

In order to overcome the above defects in the prior art, embodiments of the present invention provide a method for constructing a CNN-LSTM time series prediction model based on clustering centers. The technical problem to be solved by the present invention is to accelerate prediction while ensuring prediction accuracy and stability.

In order to achieve the purpose, the invention provides the following technical scheme: a construction method of a CNN-LSTM time sequence prediction model based on a clustering center comprises the following specific construction steps:

S1, collecting sample data: 4 to 5 million prescription records are taken from the database of the automated dispensing system of a traditional Chinese medicine hospital, and the drug sales data compiled from them are used as sample data for constructing the model;

S2, data preprocessing: an ARIMA model is constructed through the steps of stationarizing the sequence data, determining the model order and testing model significance, and, based on the ARIMA model, the Adam algorithm is used to remove errors from the sample data of step S1;

S3, sample clustering: the sales data of the various drugs preprocessed in step S2 are clustered with the conventional k-medoids clustering algorithm, so as to find the associations among drugs and group drug time series with high similarity into one cluster;

S4, establishing a CNN-LSTM model for prediction: a plurality of time series prediction models are established based on the clustering centers of the clusters obtained in step S3, and drugs in the same cluster are predicted with the same model;

S5, analyzing a prediction result: the clustering algorithm first randomly selects several objects as the centroids of the initial clusters, then, for each remaining object, calculates its distance to every centroid and assigns it to the nearest cluster, and then recalculates the centroid of each cluster; this process is repeated until the criterion function converges.

In a preferred embodiment, the clustering of the drugs in step S3 divides the data set into multiple categories, and each cluster center sequence is named after its corresponding drug.

In a preferred embodiment, a CNN-LSTM hybrid model is used in step S4 to predict the clustering centers. The CNN layer extracts the hidden features of the time series data and forms a shorter sequence of high-level features as the input of the LSTM layer. This greatly shortens the original sequence and improves the computational efficiency of the model, and the extracted high-level features also help to improve the prediction accuracy of the model.

The invention has the technical effects and advantages that:

1. The invention clusters the drug sales data and establishes a CNN-LSTM model. The purpose of clustering is to find the associations among drugs and to group drug time series with high similarity into one cluster. Time series prediction models are then established based on the clustering centers, and drugs in the same cluster are predicted with the same model, which improves the applicability of the models and widens their range of application.

2. The invention uses the CNN-LSTM hybrid model to predict the clustering centers because of the structure of that model: convolution layers are stacked in front of the LSTM layer and extract hidden features, forming a shorter sequence of high-level features as the input of the LSTM layer. This greatly shortens the original sequence and improves the computational efficiency of the model, the extracted high-level features help to improve the prediction accuracy, and the method is flexible in practical application.

Drawings

FIG. 1 is a flow chart of the clustering process of the present invention.

FIG. 2 is a flow chart of the machine learning model construction of the present invention.

FIG. 3 is a time series plot of the sales of sample one of the present invention.

FIG. 4 is a diagnostic plot of the ARIMA model of the present invention.

FIG. 5 is a histogram of the feature coefficients of the present invention.

FIG. 6 is a heat map of the feature coefficients of the present invention.

FIG. 7 is a schematic diagram of the time-series nested cross-validation method of the present invention.

FIG. 8 is a diagram of the Adaline computational model of the present invention.

FIG. 9 is the sales prediction curve of the ARIMA model of the present invention for sample one.

FIG. 10 is the sales prediction curve of the Lasso regression model of the present invention for sample one.

FIG. 11 is the sales prediction curve of the ridge regression model of the present invention for sample one.

FIG. 12 is the sales prediction curve of the random forest model of the present invention for sample one.

FIG. 13 is the sales prediction curve of the XGBoost model of the present invention for sample one.

FIG. 14 shows the prediction error curves of the four machine learning models of the present invention for sample one.

FIG. 15 is the sales prediction curve of the RNN model of the present invention for sample one.

FIG. 16 is the sales prediction curve of the GRU model of the present invention for sample one.

FIG. 17 is the sales prediction curve of the LSTM model of the present invention for sample one.

FIG. 18 shows the prediction error curves of the three neural network models of the present invention for sample one.

FIG. 19 is the sales prediction curve of the combined prediction model of the present invention for sample one.

FIG. 20 is a comparison of the prediction error curves of the four models of the present invention.

FIG. 21 is a time series plot of drug sales containing abnormal values according to the present invention.

FIG. 22 is a time series plot of the cluster centers of the present invention.

FIG. 23 is a schematic diagram of the 1D CNN operation of the present invention.

FIG. 24 is a diagram of the CNN-LSTM network architecture of the present invention.

FIG. 25 is the sales prediction curve of the CNN-LSTM model of the present invention for sample two.

FIG. 26 is the sales prediction curve of the LSTM model of the present invention for sample two.

FIG. 27 is a comparison of the prediction error curves of the CNN-LSTM model and the LSTM model of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

The invention provides a method for constructing a CNN-LSTM time sequence prediction model based on a clustering center, which comprises the following specific construction steps:

S1, collecting sample data: 4 to 5 million prescription records are taken from the database of the automated dispensing system of a traditional Chinese medicine hospital, and the drug sales data compiled from them are used as sample data for constructing the model;

Prescription records containing unadulterated drug sales data are taken from the database of the hospital's automated dispensing system and used as the original data set for the sample data of the model. Each prescription records the drug id, drug quantity, prescription time, patient name and doctor name. The historical sales data of the various traditional Chinese medicines are then obtained by splitting and merging database tables and by multi-table join queries. The sales data use the day as the time unit, and samples are randomly drawn from the historical drug sales data covering the 720 days from January 10, 2017 to December 29, 2018. For example, the drug with id 41400, named Atractylodis rhizoma and referred to as sample one, has the sales time series shown in FIG. 3;

S2, data preprocessing: an ARIMA model is constructed through the steps of stationarizing the sequence data, determining the model order and testing model significance, with the flow chart shown in FIG. 4; based on the ARIMA model, the Adam algorithm is used to remove errors from the sample data of step S1:

The construction of the ARIMA model is divided into three steps, namely stationarizing the sequence data, determining the model order, and testing model significance, as shown in FIG. 4:

(1) Stationarization of the sequence data

An ARIMA model requires stationary data, so the stationarity of the sample data is checked with a unit root test; the ADF (augmented Dickey-Fuller) test results for the drug sales data are shown in Table 1;

TABLE 1 ADF test results of the sample time series data

Test item	Test result
Test Statistic Value	0.807369
P-value	0.991754
Lags Used	0
Number of Observations Used	720
Critical Value (1%)	-3.533560
Critical Value (5%)	-2.906444
Critical Value (10%)	-2.590724

The test statistic (0.807369) is larger than the critical values at the 1%, 5% and 10% levels, and the p-value is 0.991754, so the unit-root null hypothesis cannot be rejected and the sample data is non-stationary. The data is therefore made stationary by first-order differencing, as shown in formula (1-1), where $y_t$ is the drug sales time series; the differenced data is stationary;

$$ Z_t = y_t - y_{t-1} \tag{1-1} $$
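Formula (1-1) can be illustrated with a minimal sketch in plain Python. The function name `first_difference` is hypothetical, and the three input values are the first three daily sales of sample one as listed in Table 2.

```python
def first_difference(series):
    """Return Z_t = y_t - y_{t-1} for every t that has a predecessor."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


# First three daily sales values of sample one (from Table 2).
sales = [2242, 1165, 3199]
diffed = first_difference(sales)
print(diffed)  # [-1077, 2034]
```

The differenced series is one element shorter than the original. In practice the stationarity check would then be repeated on the differenced series, for example with `adfuller` from `statsmodels.tsa.stattools`.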

(2) Order determination of model

Determining the order of the ARIMA model means determining the three parameters p, d and q. The original sequence is stationary after one first-order difference, so the difference parameter d is 1. The orders p and q are determined by the Bayesian information criterion (BIC); the smaller the BIC value, the better the parameters. The BIC is calculated as:

$$ \mathrm{BIC} = -2\ln(L) + N\ln(n) \tag{1-2} $$

In formula (1-2), L is the maximized likelihood of the model, N is the number of model parameters, and n is the number of observations. The BIC values of the ARIMA model are calculated by grid search over all combinations of p and q with lags of order 0 to 5. The minimum BIC value is 86.097, which corresponds to AR(2) and MA(2); that is, (p, d, q) is set to (2, 1, 2);
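The order selection can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the BIC uses the standard form -2 ln(L) + N ln(n), the parameter count p + q + 1 (AR terms, MA terms, noise variance) is an assumption, and the log-likelihood values are invented purely to show the selection step.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: -2 ln(L) + N ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_order(log_likelihoods, n_obs, d=1):
    """Return the (p, d, q) with the smallest BIC and all BIC scores.

    log_likelihoods maps (p, q) to the maximized log-likelihood of the
    fitted ARIMA(p, d, q) model.
    """
    scores = {
        (p, q): bic(ll, p + q + 1, n_obs)  # assumed parameter count
        for (p, q), ll in log_likelihoods.items()
    }
    best_p, best_q = min(scores, key=scores.get)
    return (best_p, d, best_q), scores


# Hypothetical log-likelihoods for illustration only (not from the patent).
example = {(1, 1): -30.0, (2, 2): -20.0, (5, 5): -19.5}
order, scores = select_order(example, n_obs=719)
print(order)  # (2, 1, 2)
```

The penalty term N ln(n) is what keeps the largest model, ARIMA(5, 1, 5), from winning even though its likelihood is slightly higher.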

(3) Model diagnostics

Before the model is used for prediction, the reasonableness of the model assumptions and the reliability of the data must be examined using the information in the residuals. If the model is correctly specified, the residuals show no linear correlation and behave as a white noise sequence. FIG. 4 is the model diagnostic plot drawn by the plot_diagnostics() function of the Python statsmodels library;

In the "Histogram plus estimated density" panel, the red KDE line essentially coincides with the N(0, 1) (standard normal) line, and in the "Normal Q-Q" panel the residuals (blue points) follow a straight line; both indicate that the residuals follow the N(0, 1) distribution. The "Standardized residual" panel shows that the residuals are essentially stationary with a mean of 0. The "Correlogram" panel shows that the residual sequence has low correlation with its own lagged values. The residual sequence can therefore be judged to be white noise, and the ARIMA(2, 1, 2) model can be used for the subsequent prediction;

S3, sample clustering: the sales data of the various drugs preprocessed in step S2 are clustered with the conventional k-medoids clustering algorithm, so as to find the associations among drugs and group drug time series with high similarity into one cluster; the clustering result divides the data set into several categories, and each cluster center sequence is named after its corresponding drug;

The machine learning models all follow the same construction process, shown in FIG. 2, which comprises four steps: 1. selecting and preprocessing the raw data; 2. extracting features; 3. constructing the model and selecting hyperparameters; 4. model learning, that is, training the model parameters;

Feature extraction

Most machine learning prediction algorithms are supervised and need a large number of features for model training. For drug sales prediction, however, the raw data is one-dimensional: along the time axis there is only the sales volume Y, and no feature data X for training a machine learning model. The original time series is therefore processed to produce lag features (lagged sales values, such as $y_{t-lag}$ to $y_{t-1}$), time features (such as year and month), and target encoding features based on the mean (such as the average sales per year), which serve as feature data for model training;

Table 2 is the lag feature data set of sample one, Table 3 is the time feature data set of sample one, and Table 4 gives the mean-based target encoding features;

TABLE 2 Partial lag features

Time	Sales volume	lag_1	lag_2	lag_3	...	lag_13	lag_14
2017-01-10	2242	NaN	NaN	NaN	...	NaN	NaN
2017-01-11	1165	2242	NaN	NaN	...	NaN	NaN
2017-01-12	3199	1165	2242	NaN	...	NaN	NaN
...	...	...	...	...	...	...	...
2018-12-28	8207	8201	10055	5300	...	7135	8435
2018-12-29	9625	8207	8201	10055	...	6717	7135
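The lag features of Table 2 can be reproduced with a short sketch. The function name `add_lag_features` is hypothetical, and `None` stands in for the NaN entries of the table; with pandas the same result is usually obtained with `Series.shift(k)`.

```python
def add_lag_features(sales, max_lag):
    """Build rows [y_t, lag_1, ..., lag_max_lag]; None marks a missing lag."""
    rows = []
    for t, y in enumerate(sales):
        lags = [sales[t - k] if t - k >= 0 else None
                for k in range(1, max_lag + 1)]
        rows.append([y] + lags)
    return rows


# First three daily sales values of sample one (from Table 2).
sales = [2242, 1165, 3199]
rows = add_lag_features(sales, max_lag=3)
for row in rows:
    print(row)
# [2242, None, None, None]
# [1165, 2242, None, None]
# [3199, 1165, 2242, None]
```

Rows that still contain missing lags are normally dropped before training, which is why a larger maximum lag leaves fewer usable training samples.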

TABLE 3 Partial time features

Time	Sales volume	day	month	year	weekday	is_weekend
2017-01-10	2242	10	1	2017	1	0
2017-01-11	1165	11	1	2017	2	0
2017-01-12	3199	12	1	2017	3	0
2017-01-13	1425	13	1	2017	4	0
2017-01-14	2660	14	1	2017	5	0
2017-01-15	2799	15	1	2017	6	1
...	...	...	...	...	...	...
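The time features of Table 3 can be derived from the date alone, as the following sketch shows. The function name `time_features` is hypothetical. The weekday column matches Python's Monday = 0 convention. The rule for `is_weekend` is an assumption chosen to agree with the rows listed in Table 3, where only weekday 6 (Sunday) is flagged; the source does not define the flag.

```python
from datetime import date


def time_features(day):
    """Calendar features for one date (weekday: Monday = 0 ... Sunday = 6)."""
    weekday = day.weekday()
    return {
        "day": day.day,
        "month": day.month,
        "year": day.year,
        "weekday": weekday,
        # Assumption: flag only Sunday, as in the rows shown in Table 3.
        "is_weekend": int(weekday == 6),
    }


feats = time_features(date(2017, 1, 15))
print(feats)
# {'day': 15, 'month': 1, 'year': 2017, 'weekday': 6, 'is_weekend': 1}
```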

TABLE 4 target coding features based on mean

Before being input to the model for training, all feature variables must be standardized in a uniform way; after standardization, every feature has a mean of 0 and a variance of 1. This accelerates the convergence of the gradient descent algorithm and lets each feature influence the model proportionally;
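A minimal sketch of this standardization (z-score scaling) is given below. The function name `standardize` is hypothetical, the population variance is used, and the input values are the six daily sales listed in Table 3. scikit-learn's `StandardScaler` performs the same transformation.

```python
import math


def standardize(values):
    """Scale values to mean 0 and variance 1 (population variance)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Daily sales of sample one listed in Table 3.
sales = [2242, 1165, 3199, 1425, 2660, 2799]
z = standardize(sales)
print([round(v, 3) for v in z])
```

To avoid leaking information from the test period, the mean and standard deviation are normally computed on the training data only and then reused for the test data.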

After the feature data set is input into the model for training, the coefficient of each feature is shown in FIG. 5 and the feature coefficient heat map is shown in FIG. 6; combining the two, suitable features can be selected as model input for training;

Model construction based on grid search

Grid search is an exhaustive search over a specified hyperparameter range, in which the model parameters are optimized by cross-validation to obtain the best learning algorithm. The candidate values of each parameter are combined in every possible way to form a "grid", and the model is then trained with each combination;

As the number of parameters increases, the time complexity of grid search grows exponentially. Grid search is therefore not applied directly as a comprehensive search with a large range and a small step; instead, a suitable set of initial hyperparameters is first obtained from experience, and a search over a smaller range is then performed around them;
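The exhaustive enumeration can be sketched in a few lines. Everything named here is hypothetical: `grid_search`, the parameter names `max_depth` and `learning_rate`, and the toy error function, which merely stands in for training a model and measuring its cross-validated error. In practice scikit-learn's `GridSearchCV` provides this functionality.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination; return the lowest-error one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_error(params):
    """Hypothetical stand-in for a cross-validated validation error."""
    return (params["max_depth"] - 5) ** 2 + (params["learning_rate"] - 0.1) ** 2


# 3 x 3 = 9 combinations; each extra parameter multiplies the grid size.
coarse = {"max_depth": [2, 5, 8], "learning_rate": [0.01, 0.1, 1.0]}
best, err = grid_search(coarse, toy_error)
print(best)  # {'max_depth': 5, 'learning_rate': 0.1}
```

The coarse-then-fine strategy described above amounts to calling the search a second time with a narrower grid centered on `best`.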

Model training based on time-series cross-validation

Machine learning models generally learn their parameters iteratively by gradient descent. To obtain a better model, the idea of cross-validation is embedded into the iterative learning process, and time-series nested cross-validation is adopted. The method comprises an outer loop for error estimation and an inner loop for parameter tuning, as shown in FIG. 7. In the inner loop, the training set is divided into a training subset and a validation set; the model is trained on the training subset, and the parameters that minimize the error on the validation set are selected. The outer loop divides the data set into several different training sets and test sets in time order. To obtain a robust estimate of the model error, the errors of all splits are averaged. Nested cross-validation based on time series can provide an almost unbiased estimate of the true error;

With this method, the model is first trained on a short initial segment of the time series, for example from the start of the series up to some time t; the n steps after time t are then predicted and the error is calculated. The training sample is then expanded to t + n values, and the data from t + n to t + 2n is predicted. The expansion continues until all observations are used. The number of folds of cross-validation equals the number of segments of length n that fit between the initial training sample and the last observation;
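The expanding-window splitting described above can be sketched as a generator of index lists. The function name `expanding_window_splits` and the sizes in the example are hypothetical; scikit-learn's `TimeSeriesSplit` offers a comparable splitter.

```python
def expanding_window_splits(n_obs, initial_train, horizon):
    """Yield (train_indices, test_indices) pairs in time order.

    The training window starts with `initial_train` observations and grows
    by `horizon` after each fold; the test window is the next `horizon` steps.
    """
    end = initial_train
    while end + horizon <= n_obs:
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon


splits = list(expanding_window_splits(n_obs=10, initial_train=4, horizon=2))
for train, test in splits:
    print(len(train), test)
# 4 [4, 5]
# 6 [6, 7]
# 8 [8, 9]
```

Here three segments of length 2 fit between the initial 4 training points and the last observation, so there are three folds. Every test index is later than every training index, so no future information reaches the model.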

Construction of the neural network model

(1) Data pre-processing

The original drug sales time series can be used as model input after two steps. First, the data is normalized, that is, scaled to the range 0 to 1 by min-max normalization. Second, the time series data set is converted into a supervised learning data set by extracting lag features, in the same way as for the machine learning models;
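The two preprocessing steps can be sketched as follows. The function names `min_max_scale` and `to_supervised` and the window length of 3 are hypothetical; the input values are the six daily sales listed in Table 3.

```python
def min_max_scale(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def to_supervised(values, n_lags):
    """Turn a series into (X, y) pairs: n_lags past values -> next value."""
    X, y = [], []
    for t in range(n_lags, len(values)):
        X.append(values[t - n_lags:t])
        y.append(values[t])
    return X, y


# Daily sales of sample one listed in Table 3.
sales = [2242, 1165, 3199, 1425, 2660, 2799]
scaled = min_max_scale(sales)
X, y = to_supervised(scaled, n_lags=3)
print(len(X), len(y))  # 3 3
```

Predictions made on the scaled data have to be mapped back with the inverse transformation, `v * (hi - lo) + lo`, before error metrics are computed in units of sales.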

(2) Determination of network structure

The structure of a recurrent neural network model (RNN, LSTM, GRU) is determined by how its layers are stacked. Compared with a single LSTM layer, a stacked LSTM has stronger expressive power and can uncover deeper information. The stacked LSTM, shown in FIG. 8, is an LSTM model consisting of two LSTM layers: the upper LSTM layer passes a sequence output, rather than a single value, to the lower LSTM layer. Finally, a fully connected (Dense) layer integrates the output of the last LSTM layer into the final predicted value;

The two-layer LSTM network model established for sample one is shown in Table 5. The first LSTM layer alone has 110080 parameters, so the computational cost is high when predicting from very long time series;

TABLE 5 LSTM model network architecture (sample one)

Layer (type)	Output Shape	Param #
lstm_1 (LSTM)	(None, 1, 64)	110080
lstm_2 (LSTM)	(None, 32)	12416
Dense_3 (Dense)	(None, 1)	33
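The parameter counts in Table 5 can be checked with the standard counting rule for an LSTM layer with bias terms: four gates, each with input weights, recurrent weights and a bias. The function names below are hypothetical. The input dimension of 365 for the first layer is not stated in the source; it is inferred here because it is the value that reproduces the 110080 parameters of lstm_1.

```python
def lstm_params(input_dim, units):
    """Parameters of an LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Parameters of a fully connected layer: weights plus biases."""
    return input_dim * units + units


print(lstm_params(365, 64))  # 110080  (input_dim = 365 is an inference)
print(lstm_params(64, 32))   # 12416
print(dense_params(32, 1))   # 33
```

The count grows linearly with the input dimension of the first layer, which is why shortening or compressing the input sequence before the LSTM layer reduces the computational cost.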

Because a neural network model has many parameters and is highly expressive, dropout is applied in each layer of the network to prevent overfitting: the activation of each neuron is switched off with a certain probability, which reduces the model's over-dependence on specific features and improves its generalization ability;

The activation function is also part of the network structure. The recurrent neural networks generally adopt ReLU as the activation function, which effectively mitigates the vanishing gradient problem; the expression of ReLU is given in formula (1-3):

$$ f(x) = \max(0, x) \tag{1-3} $$

(3) Selection of optimization algorithms

An LSTM model with excellent performance needs not only a correct data format and a good network structure but also a suitable optimization algorithm. The Adam optimization algorithm combines the advantages of several optimization algorithms, namely momentum and adaptive per-parameter learning rates, and is suitable for most situations, so Adam is adopted;

S4, establishing a CNN-LSTM model for prediction: a plurality of time series prediction models are established based on the clustering centers of the clusters obtained in step S3, and drugs in the same cluster are predicted with the same model. The CNN-LSTM hybrid model is used to predict the clustering centers: the CNN layer extracts the hidden features of the time series data and forms a shorter sequence of high-level features as the input of the LSTM layer, which greatly shortens the original sequence and improves the computational efficiency of the model, while the extracted high-level features also help to improve the prediction accuracy;
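How a convolution layer followed by pooling shortens the sequence passed to the LSTM layer can be illustrated without any deep learning framework. The sketch below is not the patent's network: the function names, the kernel values, the pool size and the input length of 30 are all hypothetical, and a single hand-written filter replaces the learned filters of a real 1D CNN (which would typically be built with Keras `Conv1D`, `MaxPooling1D` and `LSTM` layers).

```python
def conv1d_valid(seq, kernel):
    """Slide the kernel over seq without padding (cross-correlation)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]


def max_pool1d(seq, size):
    """Keep the maximum of each non-overlapping window of length `size`."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


seq = [float(v) for v in range(30)]           # 30 time steps
conv_out = conv1d_valid(seq, [0.25, 0.5, 0.25])  # 30 - 3 + 1 = 28 steps
features = max_pool1d(conv_out, size=2)       # 28 / 2 = 14 steps
print(len(seq), "->", len(features))          # 30 -> 14
```

The LSTM layer then iterates over 14 steps instead of 30, which is the source of the efficiency gain described above.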

(1) Combined prediction model and its characteristics

Real drug sales data is influenced by many factors, such as season, weather and region, and is a time series with complex variations. For such data, any single model is limited by its own characteristics and may not be able to predict accurately;

For example, the ARIMA model and linear models such as Lasso regression and ridge regression generally perform well when fitting time series with a single component, but tend to fall short when the data contains several components. Ensemble learning models such as random forest and XGBoost are powerful and work well on general machine learning problems, but the upper limit of their prediction accuracy is usually bounded by the performance of their weak learners;

The combined prediction model combines the prediction results of several single models using weight coefficients and makes comprehensive use of the information provided by each single model, thereby achieving a better prediction effect than any single model. The combined prediction is described mathematically as follows:

$$ \hat{y} = \sum_{i=1}^{n} w_i f_i \tag{1-4} $$

where n is the number of single prediction models, and $w_1, w_2, \ldots, w_n$ are the weight coefficients of the single prediction models $f_1, f_2, \ldots, f_n$. Under normal circumstances, $\sum_{i=1}^{n} w_i = 1$;

(2) Combining the selection of predictive models

Because the richer the diversity of the single models, the better the combined prediction model performs, the combined prediction model combines eight single models: ARIMA, Lasso regression, ridge regression, random forest, XGBoost, RNN, LSTM and GRU;

(3) Determination of weight coefficients for combined prediction models

The classical methods for determining weight coefficients are not flexible enough: the weights must be calculated manually from formulas, and because these methods are all based on linear combination, they cannot fully exploit the advantages of each single model. Adaline is short for adaptive linear neuron, an adaptive linear neural network. The method continuously updates the combination weight coefficients of the models by gradient descent so as to minimize the mean square error; the computational model of Adaline is shown in FIG. 8;

In addition to the input X, the output Y and the weight vector W, the Adaline network comprises three main parts: a net input function, an activation function, and a quantizer. The quantizer is a unit step function that maps the result into [-1, 1] in classification problems; it is removed here because the present task is a prediction problem, so a continuous predicted value is obtained. Adaline adaptively adjusts the weight coefficients of the combined model as follows:

(1) input the predicted value $x_i$ of each single model and initialize the weight coefficients $w_i$;

(2) the net input function takes the dot product of the input predictions X and the weights W, that is, multiplies them element-wise and sums;

(3) the sum passes through the linear activation function f(x) = x to give the actual output $\hat{y}$;

(4) calculate the mean square error between $\hat{y}$ and the true value Y, and adjust the weight vector W using this error; repeat the above four steps until the stop condition (number of iterations or prediction accuracy) is met;

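The weight vector W in step (4) is updated by gradient descent. The update formulas for W are:

$$ w := w + \Delta w \tag{1-5} $$

$$ \Delta w = -\eta \nabla J(w) \tag{1-6} $$

where $\eta$ is the learning rate of gradient descent and $\nabla J(w)$ is the gradient (the vector of partial derivatives) of the loss function with respect to the weight vector w. The loss function is the mean square error:

$$ J(w) = \frac{1}{m} \sum_{j=1}^{m} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 \tag{1-7} $$

where m is the number of samples, $y^{(j)}$ is the true value and $\hat{y}^{(j)}$ is the combined prediction for sample j;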
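The four Adaline steps and the gradient descent update can be sketched in plain Python. This is an illustration under assumptions: the function name `train_adaline`, the learning rate, the number of epochs, the equal initial weights and the scaling of the loss as (1/m) times the sum of squared errors are all choices made here, and the data is synthetic, with two hypothetical single-model predictions whose true combination weights are 0.7 and 0.3.

```python
def train_adaline(X, y, lr=0.1, epochs=2000):
    """Learn combination weights by batch gradient descent on the MSE.

    X: one row per time step, holding the predictions of each single model.
    y: the true values.
    """
    n_models, m = len(X[0]), len(X)
    w = [1.0 / n_models] * n_models  # assumed start: equal weights
    for _ in range(epochs):
        # Net input with linear activation f(x) = x.
        y_hat = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
        errors = [yh - yt for yh, yt in zip(y_hat, y)]
        # Gradient of J(w) = (1/m) * sum(errors^2) for each weight.
        grad = [2.0 / m * sum(e * row[i] for e, row in zip(errors, X))
                for i in range(n_models)]
        # w := w + delta_w, with delta_w = -lr * grad.
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Synthetic example: the truth is 0.7 * model_1 + 0.3 * model_2.
X = [[0.2, 0.9], [0.5, 0.1], [0.8, 0.4], [0.3, 0.6]]
y = [0.7 * a + 0.3 * b for a, b in X]
w = train_adaline(X, y)
print([round(wi, 4) for wi in w])  # [0.7, 0.3]
```

Because the inputs here are scaled to the range 0 to 1, a fixed learning rate converges; with unscaled sales values the learning rate would have to be much smaller, which is one reason for normalizing the data first.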

S5, analyzing a prediction result: the clustering algorithm first randomly selects several objects as the centroids of the initial clusters, then, for each remaining object, calculates its distance to every centroid and assigns it to the nearest cluster, and then recalculates the centroid of each cluster; this process is repeated until the criterion function converges;
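The assign-and-update loop described above can be sketched for the k-medoids variant named in step S3, in which each cluster center is an actual sample (a medoid), so that it can be named after a drug. The sketch rests on several assumptions: the function names are hypothetical, Euclidean distance is used because the source does not name a distance measure, the initial medoids are passed in explicitly instead of being drawn at random so that the example is reproducible, and the six short series are invented.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def assign(series, medoids):
    """Map each medoid index to the indices of the series nearest to it."""
    clusters = {m: [] for m in medoids}
    for i, s in enumerate(series):
        nearest = min(medoids, key=lambda m: euclidean(s, series[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(series, init_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(init_medoids)
    for _ in range(max_iter):
        clusters = assign(series, medoids)
        new_medoids = []
        for m, members in clusters.items():
            candidates = members or [m]  # keep the medoid if its cluster is empty
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(min(
                candidates,
                key=lambda c: sum(euclidean(series[c], series[j]) for j in candidates),
            ))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(series, medoids)


# Invented sales curves: three low-volume and three high-volume series.
series = [[1, 2, 1], [1, 2, 2], [2, 2, 1], [9, 8, 9], [9, 9, 9], [8, 9, 9]]
medoids, clusters = k_medoids(series, init_medoids=[0, 1])
print(sorted(medoids))  # [0, 4]
print(clusters)         # {0: [0, 1, 2], 4: [3, 4, 5]}
```

Even with both initial medoids placed in the low-volume group, the loop moves one of them to the high-volume group within a few iterations. One prediction model would then be trained on each medoid series.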

(1) Experimental Environment

The prediction models are analyzed and coded in the Python language. Python is simple, extensible and portable, and offers feature-rich libraries: for example, scikit-learn provides a wide range of machine learning algorithms; statsmodels provides regression analysis, time series analysis, hypothesis testing and related functions; and the TensorFlow framework with Keras allows neural network models to be built quickly. The various time series models can therefore be built conveniently and quickly in Python. The development tool used in the invention is PyCharm, a professional and popular Python IDE. The detailed experimental environment is shown in Table 6;

TABLE 6 Experimental environment configuration

Item	Configuration
Design language	Python
Experiment modules	scikit-learn, Statsmodels, Keras
Development tool	PyCharm
Operating system	Windows (64-bit)
System hardware	4 GB RAM, 512 GB solid state disk, i5 CPU

(2) Design of experiments

The experiment applies comparative analysis: several models are used to predict the sales of sample one, in order to find the best model for predicting drug sales. The models constructed are: the traditional ARIMA model; machine learning models (Lasso regression, ridge regression, random forest, XGBoost); neural network models (RNN, LSTM, GRU); and a combined prediction model based on the Adaline neural network that combines the above eight single models. The parameters of each model are shown in Tables 7, 8 and 9; parameters not listed in the tables take the default values of packages such as scikit-learn and Keras. RNN, LSTM and GRU each use a network of two stacked recurrent layers followed by one Dense layer, with 64 neurons in the first layer and 32 neurons in the second;

As the prediction horizon lengthens, accumulated errors make the predictions worse and worse. Short-term prediction is therefore adopted: the drug sales of the sample over the next 10 days are predicted, that is, the last 10 days of sales data are held out as the test set;

TABLE 7 summary of model parameters (non-neural network model)

TABLE 8 summary of model parameters (neural network model)

TABLE 9 weight coefficients for combined prediction models

(3) Evaluation index of experiment

Evaluation indexes are required to evaluate the results. Prediction is a regression problem, for which the commonly used error indexes include the mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and R². The prediction effect of each model is evaluated here with three indexes: RMSE, MAE and MAPE. RMSE and MAE reflect the actual error between the predicted and true values, while MAPE directly shows the error as a percentage of the true value. The three indexes are calculated as follows:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{1-8} $$

$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{1-9} $$

$$ \mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{1-10} $$

In formulas (1-8) to (1-10), n is the number of predicted values, $y_i$ is the true value of the drug sales and $\hat{y}_i$ is the predicted value of the drug sales; the smaller the three indexes, the better the prediction effect;
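The three indexes can be computed with a few lines of Python using their standard definitions. The function names and the three true and predicted values are hypothetical and serve only to show the arithmetic.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; true values must be non-zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values for illustration only.
y_true = [100, 200, 400]
y_pred = [110, 190, 380]
print(round(rmse(y_true, y_pred), 3))  # 14.142
print(round(mae(y_true, y_pred), 3))   # 13.333
print(round(mape(y_true, y_pred), 3))  # 6.667
```

RMSE exceeds MAE here because squaring gives the largest error (20) more weight, while MAPE is independent of the scale of the sales values, which makes it convenient for comparing drugs with very different sales volumes.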

(4) Comparison and analysis of the results

In order to observe the prediction effect conveniently, the curve of the real sales volume and the prediction curve of the model are put together;

(1) the ARIMA model prediction graph is shown in FIG. 9; it can be seen that the ARIMA model basically tracks the trend of the true sales curve, but with some error; this may be because the drug sales data contain complex nonlinear components, which the linear ARIMA model has difficulty fitting, making accurate prediction difficult;

(2) the prediction graphs of the Lasso regression model, the ridge regression model, the random forest model and the XGBoost model are shown in FIGS. 10, 11, 12 and 13 respectively, and the prediction error curves of the four machine learning models on sample one are shown in FIG. 14;

as can be seen from fig. 14, the Lasso regression model and the ridge regression model not only have an excessively large overall error but also predict unstably, with jagged error curves, which indicates that the predictions of these two models have lost their reference value; the error curves of the random forest and XGBoost are stable, and their error values are smaller than those of the linear regression algorithms, showing that ensemble learning models predict far more accurately than linear regression models on complex time series data; comparing the prediction error curves of the random forest and XGBoost models shows that XGBoost is superior in predicting the sales of sample one;

(3) FIGS. 15 to 17 show the prediction results of the recurrent neural network models for sample one, namely the sales prediction curves of the RNN model, the LSTM model and the GRU model, in that order;

from fig. 15 to fig. 17 it can be seen that the RNN model predicts well in the first five days but increasingly worse over the last five days; this holds not only for the RNN model but also for the GRU and LSTM models, which indicates that long-term prediction of drug sales is not advisable in the present experiment;

overall, among the three models, the prediction curve of the LSTM model is closest to the real curve, the GRU model comes second, and the RNN model is the least satisfactory; this is consistent with the structural differences of the three networks: the LSTM model is an improvement on the RNN model, and the GRU model is a simplification of the LSTM model that trades some performance for efficiency;

FIG. 18 shows the prediction error curves of the RNN, LSTM and GRU models; among the three neural network models, the RNN prediction error is lower and more stable at the early stage, but it suddenly begins to increase in the later period, so that the overall prediction result is unsatisfactory; shorter-term prediction may therefore suit the RNN model better; the problem with the GRU model is that its error fluctuates greatly over time during prediction; if the model were applied to actual prediction, this fluctuation would cause considerable trouble for the user; among the three neural network models, the prediction error curve of the LSTM model is the most stable, and its prediction error always stays at a low level;

(4) FIG. 19 shows the sales prediction for sample one by the combined prediction model whose weight coefficients are determined by the adaptive neural network; as can be seen from the figure, the combined model performs very well, and its prediction curve lies very close to the real drug sales curve; judged visually alone, however, its overall performance is similar to that of the LSTM model, with no obvious advantage;

in order to compare the prediction effects of the combined prediction model (with weight coefficients determined by the adaptive neural network) and the LSTM model, and to find the optimal drug sales prediction model among all the models, the prediction error curves of four models are displayed in one graph, as shown in FIG. 20: the traditional ARIMA model, the XGBoost model (the best performer among the machine learning models), the LSTM model (the best performer among the neural network models) and the combined prediction model;

as can be seen from fig. 20, the combined prediction model has the most stable prediction error curve of the four models, and its relative error is also the smallest; this verifies the theory that a reasonable combined prediction model has higher prediction accuracy and more stable prediction performance than a single prediction model;
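The text does not spell out how the adaptive neural network learns the weight coefficients. One common reading, assumed here, is an Adaline-style linear unit trained with the Widrow-Hoff (least mean squares) rule, so that the combined forecast is a weighted sum of the base-model forecasts. The sketch below follows that assumption; the function names, learning rate, epoch count and toy data are all hypothetical.

```python
def fit_combination_weights(predictions, targets, lr=0.01, epochs=1000):
    """Learn one weight per base model with the Widrow-Hoff (LMS) update.

    predictions: list of rows, each row holding one forecast per base model.
    targets: the true values aligned with those rows.
    """
    weights = [0.0] * len(predictions[0])
    for _ in range(epochs):
        for row, target in zip(predictions, targets):
            error = target - sum(w * p for w, p in zip(weights, row))
            # Move each weight along its input, scaled by the current error.
            weights = [w + lr * error * p for w, p in zip(weights, row)]
    return weights


def combine(weights, row):
    """Weighted sum of the base-model forecasts for one time step."""
    return sum(w * p for w, p in zip(weights, row))


# Toy data: two base models whose ideal mix is 0.3 and 0.7 (illustrative only).
base_forecasts = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
actual = [0.3 * a + 0.7 * b for a, b in base_forecasts]
weights = fit_combination_weights(base_forecasts, actual)
```

With real sales volumes in the thousands, the inputs would need to be scaled first, otherwise this fixed learning rate would make the updates diverge.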

table 10 shows the scores of the prediction models in the three evaluation indexes of RMSE, MAE, and MAPE;

TABLE 10 evaluation index score for each model

Model	RMSE	MAE	MAPE (%)
ARIMA	1280.23	1190.0	16.08
Lasso regression	1894.72	1560.0	19.71
Ridge regression	2106.43	1912.5	30.56
Random forest	1382.71	1224.5	16.91
XGBoost	1372.36	980.5	13.29
RNN	1350.95	1194.4	17.43
LSTM	679.53	629.8	8.98
GRU	1226.56	1024.0	15.53
Combined prediction model	601.77	588.9	7.81

The experimental results in table 10 again verify that each index of the combined prediction model based on the adaptive neural network is the lowest and the performance is the best; in summary, it can be concluded that:

the model with the best overall performance is the combined prediction model whose weight coefficients are determined by the adaptive neural network; for predicting the sales of a single drug, the combined model has the highest prediction accuracy, the most stable prediction error curve and the most stable prediction performance; the single LSTM model should not be overlooked either, as it also performs very well; from the comparison of the prediction error curves in fig. 18 and the evaluation index scores in table 10, the LSTM model is second only to the combined model designed by the present invention, which demonstrates the strength of the LSTM model in time series prediction; if only prediction accuracy and stability are considered, both the combined prediction model and the LSTM model can meet the needs of single-drug sales prediction; the combined prediction model has the advantages of higher prediction accuracy and more stable performance, and the disadvantages of a complicated modeling process, numerous parameters and high computational cost; the LSTM model has the advantage of being relatively simple to construct, and the disadvantage of slightly lower prediction accuracy;

Clustering of drug sales data

(1) Clustering scheme

Time series data have two characteristics: temporal dynamics and high dimensionality; these give rise to two difficulties in clustering; the first difficulty is that most existing clustering methods target static data; although many time series clustering algorithms work well, their procedures are complicated and inefficient; the second difficulty is the high dimensionality of time series data, which often contain missing values and differ in length, sampling rate and rate of change, making computation complex and clustering results poor; two solutions to these problems currently exist; the first scheme transforms the time series to reduce its dimensionality and make it static, that is, features are extracted from the series or a model is fitted to it, so that clustering of time series data is converted into clustering of feature data or of model coefficients; however, this scheme is limited to specific features and specific models, extends poorly, and is complex; the second scheme clusters directly with a conventional clustering method, but either applies only to time series data that do not grow dynamically (placing higher requirements on the data set), or requires the similarity measure to be modified to accommodate time series of inconsistent length; this scheme generally clusters the original data directly;

the clustering in the invention serves as a preprocessing step that groups similar data for short-term sales forecasting; the data set grows relatively slowly (with the day as the time stamp), and the dynamic addition of a small amount of data over a short period does not affect the clustering result, so the data set to be predicted can be regarded as static data; as shown in table 11, after the collected drug sales data are processed, their length, sampling rate and rate of change are all consistent, and the few missing values in the data set are filled by interpolation;

TABLE 11 drug sales data information

Therefore, given these characteristics of the data set, the invention adopts the conventional clustering algorithm k-medoids; the algorithm is an improved version of k-means in which the clustering center is not the mean of the samples in a cluster but an actual sample in the cluster (the medoid), which improves the robustness of the algorithm to noisy data;

the algorithm first randomly selects k objects as the centroids of the initial k clusters, then, for each remaining object, calculates its distance to each centroid and assigns it to the nearest cluster, and then recalculates the centroid of each cluster; this process is repeated until the criterion function converges; the k-medoids algorithm is described as follows:

the criterion function usually adopted is the sum of squared errors (SSE), as defined in formula (2-1), where x is a data point in cluster c_i and m_i is the centroid of cluster c_i; the smaller the SSE, the more compact and independent the clustering result;

because all the drug time series have the same length and sampling rate, and the data volume is small, the Euclidean distance is suitable for calculating the distance D in the algorithm steps; the Euclidean distance is a widely used shape-based distance metric that is intuitive, easy to understand, fast to compute and effective; its calculation formula is as follows:
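The k-medoids procedure, the SSE criterion and the Euclidean distance described above can be sketched in plain Python as follows. This is an illustrative reading of the text, not the patented implementation: the function names, the optional `init` argument, the stopping rule (medoids unchanged) and the toy series are assumptions.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_medoids(series, k, init=None, max_iter=100, seed=0):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(init) if init is not None else random.Random(seed).sample(range(len(series)), k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each series joins the cluster of its nearest medoid.
        labels = [min(range(k), key=lambda c: euclidean(s, series[medoids[c]])) for s in series]
        # Update step: the new medoid is the member with the smallest total
        # distance to the other members of its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster became empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda i: sum(euclidean(series[i], series[j]) for j in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


def sse(series, medoids, labels):
    """Sum of squared distances from every series to its cluster center."""
    return sum(euclidean(s, series[medoids[lab]]) ** 2 for s, lab in zip(series, labels))


# Toy data: two obvious groups of three short series each (illustrative only).
toy = [[1, 1, 1], [1, 2, 1], [2, 1, 1], [10, 10, 10], [10, 11, 10], [11, 10, 10]]
medoids, labels = k_medoids(toy, k=2, init=[0, 1])
total_sse = sse(toy, medoids, labels)
```

Like k-means, this alternating scheme only reaches a local optimum, so in practice it is usually restarted from several random initializations and the run with the lowest SSE is kept.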

(2) Clustering process

The clustering process is shown in figure 1, and has 5 steps;

(1) reading the drug sales time series data set; the data set is read in with Python and converted into a DataFrame, a data analysis format commonly used in Python;

(2) preprocessing the data; the Euclidean distance is sensitive to noise and outliers, so some outliers need to be removed; FIG. 21 is a time series plot of drug sales containing outliers; the marked points in the plot are outliers, where the drug sales are suddenly far higher than normal, possibly because of errors in data acquisition; outliers are identified by the 3σ criterion, the most common and simplest gross error discrimination method in machine learning; for a normally distributed random error, the probability of falling outside μ ± 3σ (μ is the mean, σ is the standard deviation) is only 0.27%, so such a value should be removed or processed; here each outlier is replaced with the mean of the values of the day before and the day after;

(3) calculating the contour coefficient (silhouette coefficient) to determine the number k of clusters; k-medoids requires the number k of clusters to be set, and the invention determines k by calculating the contour coefficient for different numbers of clusters; the contour coefficient lies between -1 and 1, and its value expresses the clustering effect: the larger the value, the more compact each cluster and the larger the distance between clusters; the contour coefficient is calculated as follows:

in formula (2-3), a is the intra-cluster cohesion, calculated as the average distance between sample x and the other points in its cluster; b is the inter-cluster separation, calculated as the average distance between sample x and all points in the nearest cluster; the correspondence between the k value and the total contour coefficient of the clustering result for the sample data is shown in table 12; as can be seen from the table, the contour coefficient is maximum when k is 4, so the optimal k value is 4;

TABLE 12 correspondence of k-value to contour coefficient

k value	Total contour coefficient of clustering result
1	0.5863
2	0.5063
3	0.6238
4	0.7712
5	0.7325
6	0.6526
7	0.6448
8	0.6612
9	0.6387
10	0.5821

(4) building the k-medoids model; Python has a machine learning package dedicated to time series data: tslearn; in the tslearn.clustering module, TimeSeriesKMeans is a class implementing the k-means algorithm for time series clustering; the class is taken as the basis and its _update_centroids() function is modified so that the centroid is no longer updated to the mean of the data in the cluster, but is found by looping over the samples of the cluster and selecting the sample with the minimum total distance to the other points; the class is then instantiated with the number of clusters set, which completes the model;

(5) clustering to obtain the result; the fit_predict() function of the TimeSeriesKMeans class is called with the DataFrame-format data set processed in steps (1) and (2) as input to obtain the clustering result; the cluster to which each time series belongs can be checked through the labels_ attribute of the class, and all clustering centers through the cluster_centers_ attribute;
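Steps (2) and (3) above can be sketched in plain Python as follows. The sketch assumes the population standard deviation for the 3σ test and the usual form (b - a) / max(a, b) for the contour coefficient of one sample, which formula (2-3) is assumed to follow. The function names and the toy data are hypothetical.

```python
import math
import statistics


def replace_outliers_3sigma(values):
    """Replace points outside mean +/- 3*std with the mean of their neighbours."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    cleaned = list(values)
    for i, v in enumerate(values):
        if abs(v - mu) > 3 * sigma:
            # Use the day before and the day after (only one at the edges).
            neighbours = [values[j] for j in (i - 1, i + 1) if 0 <= j < len(values)]
            cleaned[i] = sum(neighbours) / len(neighbours)
    return cleaned


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(series, labels):
    """Mean over all samples of (b - a) / max(a, b)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the coefficient needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention for a single-member cluster
            continue
        a = sum(euclidean(series[i], series[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(series[i], series[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical sales series with one spike, and a toy set of one-point series.
spiky = [100] * 30
spiky[15] = 1000
cleaned = replace_outliers_3sigma(spiky)

toy_series = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_score(toy_series, [0, 0, 1, 1])
bad = silhouette_score(toy_series, [0, 1, 0, 1])
```

To choose k, the score would be computed for the labels produced at each candidate k and the k with the largest value kept. This simple version does not handle two adjacent outliers, where a neighbour used for the replacement is itself abnormal.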

(3) Clustering result analysis and discussion

Clustering is an unsupervised learning task, and the indexes for evaluating the clustering effect fall into two categories; one is external indexes, such as the Rand index, the Fowlkes-Mallows index, etc.; these indexes require a manual reference partition against which the assignment of each sample can be checked; for the traditional Chinese medicines addressed by the invention, no clear classification standard exists, so external indexes cannot be used; the other category is internal indexes, such as the commonly used contour coefficient, which evaluates the result in terms of inter-cluster distance and intra-cluster compactness; the closer the result is to 1, the better the clustering effect; as can be seen from Table 12, when the number of clusters is 4, the contour coefficient is 0.7712, which is close to 1, so the clustering effect is good.

Table 13 shows the basic information of the clustering result obtained by the k-medoids algorithm on the time series data of 609 drugs; clustering divides the data set into 4 classes (cluster 1, cluster 2, cluster 3 and cluster 4), and the drugs corresponding to the cluster center sequences are radix Paeoniae alba (white peony root), herba asari, notoginseng radix powder and fructus Cnidii respectively; the time series plot of the cluster centers is shown in FIG. 22;

TABLE 13 clustering result basic information

Cluster	Cluster 1	Cluster 2	Cluster 3	Cluster 4
Clustering center sequence	White peony root	Herba asari	Notoginseng powder	Fructus cnidii
Number of sequences	302	118	144	45

As can be seen from fig. 22, the values of the drug sales time series in cluster 1 are far greater than those of the other three clusters, the series in cluster 2 show strong annual periodicity, the series in cluster 3 remain relatively stable, and the series in cluster 4 show an obvious upward trend; the time series plots of the 4 clustering centers all have different characteristics, which shows that the clusters are only loosely coupled;

through the labels_ attribute of the TimeSeriesKMeans class, the cluster to which each drug sales time series belongs can be checked; for example, the drugs belonging to cluster 1 are common traditional Chinese medicines (such as angelica, atractylodes, codonopsis pilosula, dried orange peel, Chinese yam, chrysanthemum and the like) that mostly act on the internal organs, with effects such as clearing heat and detoxifying, strengthening the spleen and stomach, and enriching and invigorating blood; the drugs belonging to cluster 2 (such as asarum herb, ginger, notopterygium root, peucedanum root, bupleurum root, etc.) mainly expel wind and cold and relieve exterior syndromes with pungent, warm-natured herbs; most of the drugs belonging to cluster 3 (such as notoginseng powder, cortex moutan, motherwort herb, nux vomica, ricepaperplant pith and the like) appear in prescriptions for traumatic injury, promoting blood circulation, removing blood stasis, dredging collaterals and relieving pain; the drugs belonging to cluster 4 (such as herba Euphorbiae Lathyridis, radix Cynanchi Paniculati, fructus Cnidii, cortex Dictamni Radicis, etc.) are mostly related to the treatment of skin diseases; the results show that the drugs within a cluster are related to some degree, generally having similar efficacy or acting on the same pathological field, which indirectly shows that the resulting clusters are highly cohesive;

therefore, through the contour coefficient score and the analysis of the clustering result, the clustering mode can be considered to basically finish the purpose of classifying the medicine time sequence data, and the early preparation is made for the construction of a CNN-LSTM model based on a clustering center in the later period;

Construction of CNN-LSTM model

After clustering the 609 drug sales time series, 4 clustering centers are obtained; a CNN-LSTM model is built using the sales time series of asarum (clustering center of cluster 1) as experimental data; for convenience of description, the sales data of the drug asarum are referred to as sample two; the model is then used to predict the sales of the other drugs in cluster 1, verifying the reliability of the CNN-LSTM model scheme based on the clustering center;

(1) CNN-LSTM model principle

CNNs perform surprisingly well on some sequence data; for text classification problems, a small 1D CNN can even match the performance of an RNN while being faster; a 1D CNN can learn local patterns (i.e. features) in a sequence and has the advantages of few parameters, fast training, high accuracy and easy transfer; each output time step of a 1D CNN is computed from a small segment of the input sequence along the time dimension; the working principle is shown in FIG. 23;

the invention addresses time series prediction, but a 1D CNN has the limitation of being insensitive to the order of time steps; for the features it learns, a 1D CNN does not capture their relationship to chronological order; therefore, for time series problems, a 1D CNN alone generally does not give meaningful predictions;

therefore, the invention provides an LSTM time series prediction model based on CNN feature extraction, called the CNN-LSTM model for short; the model uses a 1D CNN to convert the long input sequence into high-level features, which form a shorter sequence; this shorter sequence is then used as the input of the LSTM; the structural advantages of both CNN and LSTM are thus fully used: in theory the model can learn more hidden information than a pure LSTM model, trains faster and costs less to compute; the CNN-LSTM model design flow is shown in the abstract drawing;

(2) CNN-LSTM model design

The design of the CNN-LSTM model is similar to that of the LSTM model, except that the network structure differs slightly; because the original input time series has only 720 time points, the network structure should not be too complicated; after one-dimensional CNN feature extraction, the number of LSTM layers also needs to be adjusted; after repeated experiments, the network structure of the CNN-LSTM model is finally determined as shown in FIG. 24; the network structure in the figure consists of two one-dimensional convolutional layers, an LSTM layer and a Dense layer; the convolutional layers are placed before the LSTM layer and provide it with a shorter, higher-level feature sequence as input;

the convolution window of a two-dimensional convolutional layer is generally 3 × 3 and contains 9 feature vectors; for a one-dimensional convolutional layer, however, a convolution window of size 3 contains only 3 feature vectors, so the convolution window can be enlarged appropriately; the weekly periodicity strongly affects the model, so the convolution window size in the CNN-LSTM model is set to 7; the pooling layers use max pooling, which makes the features obtained within the window position-independent while reducing network parameters, and suppresses overfitting well; the details of the constructed CNN-LSTM network model are shown in Table 14; the total number of parameters of the model is 16641, far fewer than that of the LSTM model in Table 5; therefore, the CNN-LSTM model is simpler and smaller, and in theory much more computationally efficient;
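To illustrate how a window of size 7 followed by max pooling shortens a 720-point series, the following plain-Python sketch applies a single fixed kernel. It is a simplification under stated assumptions: in the actual model the kernel weights are learned and there are several filters per layer, and the averaging kernel, the pool size of 2 and the synthetic weekly series are hypothetical choices made only for this example.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence without padding (stride 1)."""
    width = len(kernel)
    return [
        sum(k * x for k, x in zip(kernel, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]


def max_pool1d(sequence, pool_size):
    """Keep the maximum of each non-overlapping block of pool_size values."""
    return [
        max(sequence[i:i + pool_size])
        for i in range(0, len(sequence) - pool_size + 1, pool_size)
    ]


# Synthetic series with a pure weekly cycle, 720 time points long.
daily = [float(i % 7) for i in range(720)]
weekly_mean_kernel = [1 / 7] * 7  # fixed averaging kernel, not a learned one

convolved = conv1d_valid(daily, weekly_mean_kernel)  # 720 - 7 + 1 = 714 points
features = max_pool1d(convolved, pool_size=2)        # 714 // 2 = 357 points
```

A window covering exactly one week sees every weekday once, which is why the weekly cycle in this synthetic series collapses to a constant feature value of 3; the LSTM then receives a sequence roughly half as long as the input.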

TABLE 14 CNN-LSTM network model summarization

Prediction result of CNN-LSTM model based on clustering center

(1) CNN-LSTM model prediction performance analysis

The LSTM model has strong fitting capability and good prediction accuracy; among the models compared it ranks second only to the combined prediction model, while its modeling process is relatively simple; therefore, only the LSTM model built on sample two is compared with the CNN-LSTM model;

the network structure of the LSTM model of sample two is the same as that of the LSTM model of sample one, and the parameters are slightly different: dropout of 0.25, batch size of 4, and number of iterations of 150; FIGS. 25 and 26 are sales prediction curves for the CNN-LSTM model and the LSTM model, respectively, for sample two;

as can be seen from the figures, the prediction curves of both models are close to the real sales curve, and that of the CNN-LSTM model appears to be closer, which indicates that the prediction accuracy of the CNN-LSTM model is no worse than that of the LSTM model; to analyze the prediction effect of the CNN-LSTM model more fully, FIG. 27 shows the prediction error curves of the CNN-LSTM model and the LSTM model for sample two;

as can be seen from the attached figure 27, the LSTM model has good performance, the prediction error curve is relatively stable on the whole, but the prediction error curve has one time of larger fluctuation in the previous period, and the prediction errors in the last two days have a tendency of larger deviation; the prediction error curve of the CNN-LSTM model is basically under the prediction error curve of the LSTM model, is closer to a parallel straight line, and has smaller fluctuation range, which proves that the prediction accuracy of the CNN-LSTM model is higher than that of the LSTM model, and the prediction stability is stronger;

table 15 shows the evaluation indexes of the CNN-LSTM model and the LSTM model for the prediction of sample two; the RMSE, MAE and MAPE of the CNN-LSTM model are all lower than those of the LSTM model, which shows that the CNN-LSTM model predicts better; meanwhile, the prediction time of the CNN-LSTM model is 17.38 seconds, far less than the 153.16 seconds of the LSTM model, so its prediction speed is about one order of magnitude faster (roughly 8.8 times); prediction speed may not matter during the experiment, but in practical applications it is usually one of the first factors to be considered;

in conclusion, by comparison with the LSTM model, it can be concluded that: the CNN-LSTM model not only slightly improves the prediction precision and stability, but also greatly accelerates the prediction speed, and has more excellent overall performance;

TABLE 15 evaluation index of prediction effect (sample two)

(2) Suitability analysis of CNN-LSTM model based on clustering center

In order to analyze the applicability of the CNN-LSTM model based on the clustering center, the CNN-LSTM model and the LSTM model are used to predict the sales of all drugs and the MAPE scores are recorded; the mean and standard deviation of the MAPE are then calculated separately for drugs of the same cluster and for drugs of other clusters; the calculation results are shown in table 16; sample two is the clustering center of cluster 1, and sample one is a non-center sample (belonging to cluster 1 after clustering);
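The grouping described above can be sketched as follows. The drug names, cluster labels and MAPE values are invented for illustration, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics


def summarize_mape(mape_by_drug, cluster_of, reference_cluster):
    """Mean and standard deviation of MAPE for same-cluster and other drugs."""
    same = [m for drug, m in mape_by_drug.items() if cluster_of[drug] == reference_cluster]
    other = [m for drug, m in mape_by_drug.items() if cluster_of[drug] != reference_cluster]

    def mean_std(values):
        return statistics.mean(values), statistics.pstdev(values)

    return {"same_cluster": mean_std(same), "other_clusters": mean_std(other)}


# Hypothetical MAPE scores (in percent) of one model on four drugs.
mape_by_drug = {"drug_a": 8.0, "drug_b": 10.0, "drug_c": 20.0, "drug_d": 30.0}
cluster_of = {"drug_a": 1, "drug_b": 1, "drug_c": 2, "drug_d": 3}
summary = summarize_mape(mape_by_drug, cluster_of, reference_cluster=1)
```

A lower mean indicates better accuracy on that group, and a lower standard deviation indicates that the model behaves more consistently across the drugs in it.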

TABLE 16 model prediction results (Total data)

As can be seen from table 16, all three models predict the sales of cluster 1 drugs much better than those of drugs in other clusters, because sample two and sample one both belong to cluster 1; clustering the drug sales data ensures that the sales curves within the same cluster share the same patterns and information, while the sales curves of different clusters differ more; therefore, as expected, a model built on drug sales data from cluster 1 predicts the sales of drugs in the same cluster better than those of drugs in other clusters;

for clarity, it is stated again that sample two is the clustering center and sample one is a non-center sample; comparing the LSTM model (sample two) with the LSTM model (sample one) on the prediction of cluster 1 drugs, the mean and the variance of the MAPE of the LSTM model (sample two) are both smaller than those of the LSTM model (sample one), which shows that, on drug sales data of the same cluster, a model built on the clustering center generalizes better and is more stable than a model built on a non-center sample;

the comparison result of the CNN-LSTM model (sample II) and the LSTM model (sample II) reflects that the prediction effect of the CNN-LSTM model is better than that of a single LSTM model again;

in summary, it can be concluded through comparative experiments that: the CNN-LSTM time sequence prediction model is established based on the clustering center, the prediction precision is high, the model is stable, the operability is strong, and the sales prediction requirements of various medicines in a pharmacy can be met.

Finally, it should be noted that: the above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included in the scope of the present invention.

Claims

1. A construction method of a CNN-LSTM time sequence prediction model based on a clustering center is characterized by comprising the following steps: the specific construction steps are as follows:

S1, collecting sample data: 4 to 5 million prescription records are retrieved from the database of an automatic medicine dispensing system of a traditional Chinese medicine hospital, and the drug sales data compiled from the prescription records are used as sample data for constructing the model;

2. The method for constructing a CNN-LSTM timing prediction model based on a cluster center as claimed in claim 1, wherein: the clustering result of the drugs in step S3 divides the data set into multiple categories, and the clustering center sequences are named by the names of the corresponding drugs.

3. The method for constructing the CNN-LSTM timing prediction model based on the cluster center as claimed in claim 1, wherein: in the step S4, a CNN-LSTM mixed model is adopted to predict the clustering center, wherein a CNN layer is used for extracting time sequence data hidden features to form a shorter high-level feature sequence as the input of an LSTM layer.