CN113362116A

CN113362116A - Medicine market scale prediction system based on machine learning

Info

Publication number: CN113362116A
Application number: CN202110739439.6A
Authority: CN
Inventors: 朱仁; 卓绮雯; 李晓彤; 劳丽玫
Original assignee: Shenzhen Quanyaowang Technology Co ltd
Current assignee: Shenzhen Quanyaowang Technology Co ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-09-07
Anticipated expiration: 2041-06-30
Also published as: CN113362116B

Abstract

The invention relates to the technical field of medical big data, and in particular to a medicine market scale prediction system based on machine learning that can predict trends in medicine market purchase quantities, assist in drawing up purchasing plans for the medicine market, help users understand their market positioning, and assist in monitoring and evaluating rational medication. The method comprises the following steps: S1, data requirements; S2, purchase-quantity-related data; S3, data cleaning; S4, whether data are included; S5, multi-dimensional statistics; S6, index library; S7, random grouping; S8, variable importance evaluation; S9, building a prediction model; S10, whether all variables are traversed; S11, model evaluation; S12, expert evaluation; S13, practicability judgment; S14, real-environment testing; S15, whether to re-evaluate.

Description

Medicine market scale prediction system based on machine learning

Technical Field

The invention relates to the technical field of medical big data, in particular to a medicine market scale prediction system based on machine learning.

Background

The size of the medicine market is unknown and uncertain. Trends in purchase quantities are influenced by many factors, including medical institutions, medicine characteristics, medical insurance, market competition and medicine policy, and individual experience is of limited use in predicting purchase quantities. This is reflected in the following aspects.

Medical institutions lack sound medicine market data to support a new round of purchasing plans and volume reporting for national centralized procurement. They simply derive a predicted purchase quantity by applying a coefficient to past purchase data and draw up the purchasing plan from that value. As a result, some medicines are overstocked or run short, and institutions that miss their national centralized procurement targets are summoned for regulatory talks, or fail their performance evaluation and lose their centralized procurement incentives under a one-vote veto, which hinders the rational use and allocation of medical resources.

In policy-making for the medicine market, such as volume-based price negotiation, centralized procurement schemes, centralized procurement incentive schemes and medicine payment budgets, the relevant government departments rely mainly on data reported by regulated units such as medical institutions. They lack medicine market data as a decision-support tool for correcting bias in the existing data, which weakens their leverage in volume-based price negotiation and increases the likelihood of obstructed centralized procurement, incentive measures that do not match actual purchasing conditions, and wasted medical insurance funds.

In enterprise planning, such as medicine production plans, sales plans, commercial medical insurance risk assessment and market development strategies, enterprises rely mainly on cross-sectional medicine market data from third-party market assessment agencies. They ignore the longitudinal characteristics of the data and lack market data that combine a global view with fine detail, which leads to inaccurate market positioning, production that lags demand, excess capacity and greater uncertainty in commercial medical insurance risk.

Disclosure of Invention

To solve the above technical problems, the invention provides a medicine market scale prediction system based on machine learning that can predict trends in medicine market purchase quantities, assist in drawing up purchasing plans for the medicine market, help users understand their market positioning, and assist in monitoring and evaluating rational medication.

The invention relates to a medicine market scale prediction system based on machine learning, which comprises the following steps:

S1, data requirements: a comprehensive data requirement is formed from the user's needs for medicine market scale prediction, incorporating historical modeling experience and problem solutions where such experience exists;

S2, purchase-quantity-related data: according to the data requirements for medicine market scale prediction, data related to purchase quantity are retrieved from a medicine transaction database and stored in a structured standard data table;

S3, data cleaning: the data are labelled and the valid data are included in the model;

S4, whether data are included: data not included in the model are inspected to find out why they fail the inclusion criteria and to uncover possible data problems; data included in the model proceed to multi-dimensional statistics;

S5, multi-dimensional statistics: multi-dimensional statistics are computed on the data included in the model in terms of drug attributes, hospital attributes, market competition, sales volume and price, and the like;

S6, index library: the multi-dimensional statistical results for each region are structured and stored in an index standard database;

S7, random grouping: according to a given allocation ratio and by unique medical institution code, the data of some medical institutions are assigned to a test set and the data of the remaining medical institutions are assigned to a training set used for model fitting;

S8, variable importance evaluation: the importance of all independent variables is evaluated with a random forest model; for the regression problem of predicting medicine purchase quantity, the mean square error increase rate (%IncMSE) is used as the importance index; for the classification problem of predicting the purchase quantity multiplier rating, Mean Decrease Accuracy (MDA) is used as the importance index;

S9, building a prediction model;

S10, whether all variables are traversed: it is checked whether the loop has passed through all independent variables in the training set; if not, the loop continues; if so, the loop ends and the optimal model is selected from the set of prediction models according to goodness of fit or accuracy;

S11, model evaluation;

S12, expert evaluation: experts in the relevant fields evaluate and analyze the model's predictions in light of their experience and reference data, propose modifications and assess the model's practicability;

S13, practicability judgment: when the prediction model has not reached the practical stage, the process returns to the data requirement stage, and the modification suggestions and evaluation results guide the next model building scheme; when the prediction model has reached the practical stage, it is stored in a prediction model database;

S14, real-environment testing;

S15, whether to re-evaluate: whether the model needs re-evaluation is judged from the real-environment test results; if re-evaluation is needed, the process returns to the model evaluation stage, the modeling scheme is revised according to the expert evaluation results, and the next modeling stage begins; if no re-evaluation is needed, the prediction model is incorporated into the medicine transaction monitoring system.

Further, the step S3 includes the following steps:

1) invalid data in the data to be cleaned, such as invalid orders, data of unknown origin and erroneous data, are marked as invalid;

2) the established drug information standard library is joined on the drug code, and medicine attributes are labelled, such as catalog generic name, catalog dosage form, standard specification, standard manufacturer name, essential drug status, medical insurance status and defined daily dose;

3) the established medical institution information standard library is joined on the hospital code, and medical institution attributes are labelled, such as grade, administrative region and primary-level classification;

4) missing-value supplement rules are set for all required fields, updated and refined as problems found during modeling are fed back, and applied to fill missing values in the data to be cleaned;

5) in line with the inclusion criteria, invalid data, data whose required fields cannot be filled, data outside the statistical time range and other data that modeling experience indicates should be excluded are marked as not included in the model, and all other data are marked as included in the model.

Further, the step S9 includes the following steps:

1) based on the importance evaluation results, the independent variables are sorted in descending order of importance index to obtain the ordered independent variable set F = {x1, x2, x3, …}; in the loop, the first i elements of F are taken in order to form the subset fi = {x1, x2, …, xi}, and in each iteration all elements of fi are used as the independent variable combination for building the model;

2) for the regression problem of predicting medicine purchase quantity, a random forest regressor and a ridge regression model are used for regression analysis, and where the time series data are complete a time series analysis method is used to optimize the model; for the classification problem of predicting the purchase quantity multiplier rating, a random forest classifier is used for classification analysis;

3) the parameters of some of the models are tuned, the goodness of fit or accuracy of each model is evaluated, and the prediction model is output.
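The nested variable subsets used in step S9 can be illustrated with a minimal sketch. The variable names below are hypothetical placeholders, and the helper `prefix_subsets` is introduced only for illustration; it is not part of the described system.

```python
from typing import List


def prefix_subsets(ranked_variables: List[str]) -> List[List[str]]:
    """Return f_1, f_2, ..., f_I, where f_i holds the i most important variables.

    `ranked_variables` is assumed to be sorted by descending importance already.
    """
    return [ranked_variables[:i] for i in range(1, len(ranked_variables) + 1)]


# Hypothetical variable names, standing in for the ordered set F.
F = ["pre_period_quantity", "trend_growth_rate", "institution_grade"]
subsets = prefix_subsets(F)
for i, f_i in enumerate(subsets, start=1):
    print(i, f_i)
```

Each subset would then be passed to the model-building routine in turn, so that one candidate model is produced per value of i.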

Further, the step S11 includes the following steps:

1) obtaining a prediction result of the test set by using the screened model and combining the test set data;

2) for regression analysis, consistency is evaluated with the Bland-Altman method and the difference is compared with an acceptable error threshold supplied by the user's requirements; for classification analysis, consistency is evaluated with ten-fold cross-validation and a confusion matrix, and the accuracy is compared with an accuracy threshold supplied by the user's requirements;

3) the extrapolation performance of the two models is compared, their strengths and weaknesses are analyzed, and a model evaluation result is formed.
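As a rough illustration of the Bland-Altman check in step S11, the sketch below computes the mean difference (bias) between predicted and actual values and compares it with a user-supplied threshold. The 1.96 standard deviation limits of agreement follow the usual Bland-Altman convention and are an assumption here, as are the function names and the sample figures.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def bland_altman(predicted: Sequence[float], actual: Sequence[float]) -> Dict[str, float]:
    """Return the mean difference and the conventional 95% limits of agreement."""
    diffs = [p - a for p, a in zip(predicted, actual)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return {"bias": bias, "lower": bias - 1.96 * sd, "upper": bias + 1.96 * sd}


def within_threshold(result: Dict[str, float], acceptable_error: float) -> bool:
    """Compare the absolute mean difference with the user's acceptable error."""
    return abs(result["bias"]) <= acceptable_error


# Made-up quantities for illustration only.
predicted = [105.0, 98.0, 120.0, 87.0]
actual = [100.0, 100.0, 110.0, 90.0]
result = bland_altman(predicted, actual)
print(result["bias"], within_threshold(result, acceptable_error=5.0))
```

A positive bias means the model tends to over-predict; whether the bias or the limits of agreement should be compared with the threshold is a design choice the text leaves open.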

Further, the step S14 includes the following steps:

1) forecasting the dosage of the medicines in the new round of purchasing period by utilizing a forecasting model and combining the medicine catalog information of the new round of medicine purchasing period and a medicine transaction database;

2) for the prediction result of the medicine purchasing quantity, certain personalized adjustment can be properly carried out according to the purchasing behavior data of the user and the requirement of the user;

3) displaying the prediction result to a user, and the user puts forward a modification demand according to the experience of the user and updates the user demand;

4) when the user has no further modification requirement, storing the adjusted adaptive model in a model database;

5) when a new medicine purchasing period has been completed, the predicted values are compared with the true values and a deviation analysis is carried out to produce a model re-evaluation conclusion and modification suggestions.

Further, in step S5, the drug attributes include essential drug classification, medical insurance classification, ATC group purchase amount, route of administration and the like; the hospital attributes include medical institution grade, medicine purchasing scale, primary-level classification, administrative region and the like; market competition includes the number of competing enterprises, the market share of imported-product enterprises, the number of enterprises that have passed consistency evaluation, the number of enterprises ranked in the Ministry of Industry and Information Technology top 100, the market share of those top-100 enterprises and the like; and sales volume and price include the purchase quantity and purchase amount of the medicine in the previous purchasing period, the trend growth rate of the purchase quantity where the time series data are complete, the weighted average, standard deviation, range, median, maximum and minimum of the medicine price and the like, and the purchase quantity and purchase quantity multiplier rating in the target purchasing period.
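The price statistics listed for step S5 can be sketched as follows. The sketch assumes that the weighted average is weighted by purchase quantity and that the standard deviation is the population form; the text specifies neither, and the function name and sample orders are hypothetical.

```python
from statistics import median, pstdev
from typing import Dict, List, Tuple


def price_statistics(orders: List[Tuple[float, int]]) -> Dict[str, float]:
    """Summarise unit prices for one medicine.

    `orders` is a list of (unit_price, quantity) pairs.
    """
    prices = [price for price, _ in orders]
    total_quantity = sum(quantity for _, quantity in orders)
    # Assumption: the weighted average price uses purchase quantity as weight.
    weighted_average = sum(price * quantity for price, quantity in orders) / total_quantity
    return {
        "weighted_average": weighted_average,
        "standard_deviation": pstdev(prices),
        "range": max(prices) - min(prices),
        "median": median(prices),
        "maximum": max(prices),
        "minimum": min(prices),
    }


# Made-up orders for illustration only.
orders = [(10.0, 100), (12.0, 300), (11.0, 100)]
stats = price_statistics(orders)
print(stats)
```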

Compared with the prior art, the invention has the beneficial effects that:

Predicting trends in medicine market purchase quantities: the invention can reveal the trend in medicine market purchase quantities over a given period and, through empirical analysis of medicine purchase data, provide theoretical support for medicine market supervision, production, sales, purchasing, pharmacoeconomic research and the like.

Assisting in drawing up purchasing plans for the medicine market: the system can help medical institutions draw up reasonable medicine purchasing plans and guide their volume reporting for alliance centralized procurement. This avoids large quantities of medicines going unused and being scrapped when the reported volume is too high, and reduced support under the accompanying incentive policy when the reported volume is too low, thereby promoting standardized and rational medicine purchasing by medical institutions. The invention can also help the relevant administrative departments draw up alliance centralized procurement schemes, make volume-based purchasing more reasonable, and prepare more refined medical insurance budgets, which supports the implementation of DRG-based medical insurance payment.

Help users to know market positioning: the invention can help the drug production enterprises to know the drug market demand, help the production enterprises to make production plans from the aspects of drug attributes, time dimension and the like, reasonably distribute production data, avoid the problems of drug yield lag, excess capacity and the like, and improve the sensitivity of drug production to the change of drug demand; the invention can also help drug dealer enterprises to know the drug market positioning, formulate scientific, reasonable and refined drug market deployment strategy, improve the dynamic management level of drug warehouses, and avoid the problems of drug sales and demand disjunction, drug market opportunity loss and the like; the invention can also help the business medical insurance enterprises to evaluate the drug market risk, promote the fine management of the drug business insurance scheme and reduce the risk brought by the unknown and uncertain changes of the drug market.

Assisting in monitoring and evaluating rational medication: the invention can help users such as medical institutions and the relevant regulators evaluate interventions on rational medication. It reconstructs the trend that the quantity of a monitored medicine would have followed without the intervention, which removes the interference of other influencing factors from the evaluation, improves the accuracy and rigor of the evaluation scheme, and prevents the actual effect of the intervention from being overstated or understated.

Drawings

FIG. 1 is a general flow chart of a drug market size forecasting system;

FIG. 2 is a flow diagram of a drug market size forecasting system;

FIG. 3 is a sub-flow diagram of data cleansing;

FIG. 4 is a flow chart of predictive model building;

FIG. 5 is a flow chart of model evaluation;

FIG. 6 is a flow chart of a real environment test;

FIG. 7 is a schematic diagram of regression model independent variable importance evaluation;

FIG. 8 is a schematic illustration of a classification model independent variable importance evaluation;

FIG. 9 is a graph of regressor decision tree number versus error;

FIG. 10 is a graph of regressor node values versus out-of-bag error;

FIG. 11 is a graph of classifier decision tree number versus error;

FIG. 12 is a graph of classifier node values versus out-of-bag errors;

FIG. 13 is a graph of ridge regression nPC values versus coefficients;

FIG. 14 is a graph of the number of regressor arguments versus RMSE;

FIG. 15 is a graph of the number of classifier arguments versus accuracy;

FIG. 16 is a graph of the effects of a regressor fit;

FIG. 17 is a plot of regressor Bland-Altman consistency assessment;

FIG. 18 is a multi-dimensional scale analysis diagram of a classifier;

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

Example:

Taking city A as an example, the research sample is the data in that region relating to the expansion of the first batch of national centralized procurement and to the second batch of national centralized procurement. The details are as follows:

1. Data requirements: according to the user's needs for volume reporting in national centralized procurement, detailed medicine purchase order data for medical institutions in city A, containing medical institution code, drug code, order time, order quantity, order amount and the like, need to be retrieved; this forms the data requirement.

2. Volume-reporting-related data: according to the data requirements for volume reporting in national centralized procurement, the detailed medicine purchase order data for medical institutions in city A are retrieved from the medicine transaction database.

3. Data cleaning:

1) invalid data in the city A data to be cleaned are marked as invalid, such as orders with zero purchase quantity, orders verified by the medical institution as placed in error, offline purchases with inaccurate medicine information, and records lacking a drug code, hospital code or purchase time;

2) the established information standard libraries are joined on the drug code and the hospital code, and the drug attributes and hospital attributes are labelled;

3) according to the supplement rules, for fields such as medical institution grade, essential drug and medical insurance category, and number of enterprises that have passed consistency evaluation, missing qualitative values are filled with "Other" and missing quantitative values are filled with 0;

4) invalid data, data outside the one-year period before the start of each centralized procurement batch, and purchase records with excessively small purchase quantities in the city A data are marked as not included in the model, and all other data are marked as included in the model.
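The cleaning rules in this step can be sketched roughly as below. The record schema and field names are hypothetical, since the text does not give one, and the rule for excessively small purchase quantities is left out because no threshold is stated.

```python
from datetime import date

# Hypothetical field names; the text does not specify a schema.
QUALITATIVE_FIELDS = ("institution_grade", "insurance_category")
QUANTITATIVE_FIELDS = ("consistency_passed_enterprises",)


def clean_record(record, window_start, window_end):
    """Return a copy of one order record with 'invalid' and 'included' marks."""
    out = dict(record)
    # Invalid: missing codes or purchase time, or a zero order quantity.
    out["invalid"] = (
        not out.get("drug_code")
        or not out.get("hospital_code")
        or out.get("order_date") is None
        or (out.get("quantity") or 0) <= 0
    )
    # Supplement rule: qualitative gaps become "Other", quantitative gaps become 0.
    for field in QUALITATIVE_FIELDS:
        if out.get(field) is None:
            out[field] = "Other"
    for field in QUANTITATIVE_FIELDS:
        if out.get(field) is None:
            out[field] = 0
    # Inclusion: valid and inside the statistical time window [start, end).
    in_window = (
        out.get("order_date") is not None
        and window_start <= out["order_date"] < window_end
    )
    out["included"] = (not out["invalid"]) and in_window
    return out


# Made-up orders for illustration only.
orders = [
    {"drug_code": "D1", "hospital_code": "H1", "order_date": date(2020, 5, 1),
     "quantity": 100, "institution_grade": None,
     "insurance_category": "A", "consistency_passed_enterprises": None},
    {"drug_code": "D2", "hospital_code": "H1", "order_date": date(2020, 6, 1),
     "quantity": 0, "institution_grade": "Tertiary",
     "insurance_category": "B", "consistency_passed_enterprises": 3},
    {"drug_code": "D3", "hospital_code": "H2", "order_date": date(2018, 1, 1),
     "quantity": 50, "institution_grade": "Secondary",
     "insurance_category": None, "consistency_passed_enterprises": 1},
]
cleaned = [clean_record(o, date(2020, 1, 1), date(2021, 1, 1)) for o in orders]
for row in cleaned:
    print(row["drug_code"], row["invalid"], row["included"])
```

Keeping the invalid and included marks as separate fields mirrors the text, where invalidity is one of several reasons for exclusion rather than the only one.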

4. Whether data are included: the city A data not included in the model are inspected; all of the screened-out data fall outside the statistical time range, so no data quality problem exists. The data included in the model proceed to multi-dimensional statistics.

5. Multi-dimensional statistics: statistics are computed on the city A data included in the model in terms of drug attributes, hospital attributes, market competition, and sales volume and price. The drug attributes include essential drug classification, medical insurance classification, ATC group purchase amount, route of administration and the like; the hospital attributes include medical institution grade, medicine purchasing scale, primary-level classification, administrative region and the like; market competition includes the number of competing enterprises, the market share of imported-product enterprises, the number of enterprises that have passed consistency evaluation, the number of enterprises ranked in the Ministry of Industry and Information Technology top 100, the market share of those top-100 enterprises and the like; sales volume and price include the purchase quantity and purchase amount of the medicine before centralized procurement, the trend growth rate of the purchase quantity, the weighted average, standard deviation, range, median, maximum and minimum of the medicine price before centralized procurement and the like, and the purchase quantity and purchase quantity multiplier rating during the execution period of national centralized procurement.
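The purchase quantity multiplier rating can be sketched as follows. The sketch assumes that the multiplier is the ratio of the execution-period purchase quantity to the quantity before centralized procurement, and it uses the three rating classes named in the model evaluation step; the exact boundary handling (whether exactly 1 or exactly 3 belongs to the middle class) is an assumption.

```python
def purchase_multiplier(pre_quantity: float, execution_quantity: float) -> float:
    """Ratio of execution-period quantity to the quantity before procurement."""
    if pre_quantity <= 0:
        raise ValueError("pre_quantity must be positive")
    return execution_quantity / pre_quantity


def multiplier_rating(multiplier: float) -> str:
    """Map a multiplier to one of the three rating classes (boundaries assumed)."""
    if multiplier < 1:
        return "less than 1 time"
    if multiplier <= 3:
        return "1-3 times"
    return "more than 3 times"


# Made-up quantities for illustration only.
m = purchase_multiplier(pre_quantity=200, execution_quantity=700)
print(m, multiplier_rating(m))
```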

6. Index library: the multi-dimensional statistical results for city A are structured and stored in the index standard database; there are 1453 records for city A.

7. Random grouping: a random number is generated from the system time, and the index data for city A are allocated to a training set and a test set in a ratio of 0.8:0.2. By unique medical institution code, 276 medicine purchase records from 27 medical institutions are randomly drawn as the test set, and the 1177 medicine purchase records from the remaining 128 medical institutions form the training set used for model fitting.
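A minimal sketch of grouping by medical institution code follows. It assumes the test fraction is applied to the number of institutions rather than to the number of records, which the text leaves open; the field name `institution_code` and the helper name are hypothetical. Passing `seed=None` lets Python seed the generator itself (from OS entropy or the system time), while the demo fixes a seed so the output is repeatable.

```python
import random
from collections import defaultdict


def split_by_institution(records, test_fraction=0.2, seed=None):
    """Split records so that each institution falls wholly in train or test."""
    by_institution = defaultdict(list)
    for record in records:
        by_institution[record["institution_code"]].append(record)
    codes = sorted(by_institution)
    random.Random(seed).shuffle(codes)
    # Assumption: the fraction applies to institutions, not to records.
    n_test = max(1, round(len(codes) * test_fraction))
    test = [r for code in codes[:n_test] for r in by_institution[code]]
    train = [r for code in codes[n_test:] for r in by_institution[code]]
    return train, test


# Made-up data: 10 institutions with 3 records each.
records = [
    {"institution_code": f"H{i:02d}", "quantity": 10 * i + j}
    for i in range(10)
    for j in range(3)
]
train, test = split_by_institution(records, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Splitting on the institution code rather than on individual records keeps all of an institution's purchases on one side of the split, so the test set measures how the model extrapolates to institutions it has not seen.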

8. Evaluation of variable importance:

1) when the dependent variable is the purchase quantity of centrally procured medicines in city A, all independent variables are included to build a preliminary random forest regression model. The mean square error (MSE) is used as the evaluation index of the random forest regressor, the contribution of each independent variable to reducing the MSE is expressed as the mean square error increase rate (%IncMSE), and %IncMSE is used as the importance index of the independent variable. The formulas are as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$$\%\mathrm{IncMSE}_j = \frac{\Delta\mathrm{MSE}_j}{\mathrm{MSE}} \times 100\%$$

where MSE is the mean square error of the model, n is the number of samples, i is the sample index, $y_i$ is the actual purchase quantity of the centrally procured medicine, $\hat{y}_i$ is the predicted purchase quantity of the centrally procured medicine, $\%\mathrm{IncMSE}_j$ is the mean square error increase rate of the j-th independent variable, and $\Delta\mathrm{MSE}_j$ is the increase in mean square error when the original values of the j-th independent variable are replaced by random values;

obtaining an independent variable importance evaluation result according to the preliminarily constructed A urban area random forest regression model (see figure 7);

2) when the dependent variable is the purchase quantity multiplier rating in city A, all independent variables are included to build a preliminary random forest classification model, accuracy is used as the evaluation index of the random forest classifier, and the importance of each independent variable is evaluated by Mean Decrease Accuracy (MDA);

obtaining an independent variable importance evaluation result according to the preliminarily constructed A urban area random forest classification model (see figure 8);
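The idea behind %IncMSE can be illustrated with a small permutation sketch. It assumes that %IncMSE is the increase in MSE after one variable's values are randomly shuffled, divided by the baseline MSE. A fixed hypothetical predictor stands in for the random forest, so this shows only the importance calculation, not the model used in the example.

```python
import random


def mse(model, rows, targets):
    """Mean square error of `model` over the given rows."""
    return sum((y - model(r)) ** 2 for r, y in zip(rows, targets)) / len(rows)


def inc_mse_percent(model, rows, targets, variable, seed=0):
    """Percentage increase in MSE after shuffling one variable's values."""
    baseline = mse(model, rows, targets)
    shuffled = [r[variable] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [dict(r, **{variable: v}) for r, v in zip(rows, shuffled)]
    return (mse(model, permuted, targets) - baseline) / baseline * 100.0


def model(row):
    # Hypothetical stand-in predictor: uses x1 only and ignores x2.
    return 2.0 * row["x1"]


# Made-up data for illustration only.
rows = [{"x1": float(i), "x2": float(10 - i)} for i in range(1, 7)]
targets = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

importance_x1 = inc_mse_percent(model, rows, targets, "x1")
importance_x2 = inc_mse_percent(model, rows, targets, "x2")
print(importance_x1, importance_x2)
```

Because the stand-in predictor never reads `x2`, shuffling it leaves the error unchanged and its importance is zero, whereas shuffling `x1` can only make the predictions worse.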

9. building a prediction model:

1) based on the importance evaluation results for the independent variables in the city A training set, the independent variables are sorted in descending order of importance index to obtain the ordered independent variable set F = {purchase quantity before centralized procurement, trend growth rate, …}; in the loop, the iteration number i is initialized to 1, the first i elements of F are taken in order to form the subset fi, and in each iteration all elements of fi are used as the independent variable combination for building the model;

2) for the regression problem of predicting medicine purchase quantity, a random forest regressor and a ridge regression model are used for regression analysis; for the classification problem of predicting the purchase quantity multiplier rating, a random forest classifier is used for classification analysis;

3) the random forest model requires tuning of the number of decision trees, ntree, and the number of features selected at each node (node value), mtry; the ridge regression model requires tuning of the parameter k (nPC). The parameter tuning for city A during the loop is as follows:

in the random forest regression model, the error decreases as the number of decision trees increases; FIG. 9 shows that the model error is essentially stable at 800 decision trees, and FIG. 10 shows that the out-of-bag error is smallest when 6 features are selected; in the random forest classification model, FIG. 11 shows that the model error is essentially stable at 1000 decision trees, and FIG. 12 shows that the out-of-bag error is smallest when 3 features are selected;

in the ridge regression model, FIG. 13 shows that the coefficients of the variables are essentially stable when the k value (nPC) is 9, and k is taken as the minimum value satisfying the following formula condition.

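4) For the random forest regression model and the ridge regression model, the goodness-of-fit statistic R² and the root mean square error (RMSE) are calculated as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the actual values and n is the number of samples.

For the random forest classification model, the accuracy and kappa value are calculated as follows:

$$\mathrm{Accuracy} = \frac{n_{\mathrm{correct}}}{n}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e}$$

where $n_{\mathrm{correct}}$ is the number of correctly classified samples, $p_o$ is the observed agreement (equal to the accuracy) and $p_e$ is the agreement expected by chance.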
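The four evaluation statistics named in this step can be computed with a short sketch. It uses the standard textbook definitions of R², RMSE, accuracy and Cohen's kappa, which are assumed to match the ones intended here; the sample values are made up.

```python
from collections import Counter
from math import sqrt


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


def accuracy(actual, predicted):
    """Share of samples whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def cohen_kappa(actual, predicted):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(actual)
    p_o = accuracy(actual, predicted)
    actual_counts, predicted_counts = Counter(actual), Counter(predicted)
    p_e = sum(actual_counts[k] * predicted_counts[k] for k in actual_counts) / n ** 2
    return (p_o - p_e) / (1.0 - p_e)


# Made-up values for illustration only.
y_true, y_pred = [3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0]
labels_true, labels_pred = ["a", "a", "b", "b"], ["a", "b", "b", "b"]
print(r_squared(y_true, y_pred), rmse(y_true, y_pred))
print(accuracy(labels_true, labels_pred), cohen_kappa(labels_true, labels_pred))
```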

10. Whether all variables are traversed: let I be the total number of independent variables in the training set; when i < I, set i = i + 1 and return to the model building step; otherwise, select the prediction models from the regressors and the classifiers respectively, taking the regressor with the smallest RMSE and the classifier with the highest accuracy as the optimal prediction models.

In the regression model, FIG. 14 shows that the RMSE of the constructed random forest regressor is smallest when i is 18.

In the classification model, FIG. 15 shows that the accuracy of the constructed random forest classifier remains at a high level when i < 19, and a forward selection method is used to remove independent variables whose introduction into the model reduces the accuracy.
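The forward selection mentioned for the classifier can be sketched as below. Variables are tried in importance order and one is kept only if the score does not fall. The scoring function is a hypothetical toy standing in for fitting and evaluating a random forest classifier, and keeping a variable on a tie is an assumption.

```python
from typing import Callable, List, Tuple


def forward_select(
    ranked_variables: List[str],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Keep each variable, in importance order, unless it lowers the score."""
    selected: List[str] = []
    best = float("-inf")
    for variable in ranked_variables:
        candidate = selected + [variable]
        candidate_score = score(candidate)
        if candidate_score >= best:  # assumption: ties keep the variable
            selected, best = candidate, candidate_score
    return selected, best


# Toy scoring function: accuracy in percent as a sum of made-up contributions.
CONTRIBUTION = {"a": 10, "noise": -20, "b": 5}


def toy_accuracy(variables: List[str]) -> float:
    return 50 + sum(CONTRIBUTION[v] for v in variables)


selected, best = forward_select(["a", "noise", "b"], toy_accuracy)
print(selected, best)
```

In practice `score` would train a model on the candidate variables and return its accuracy on held-out data, which makes each call far more expensive than this toy suggests.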

11. Model evaluation:

1) obtaining a prediction result of the test set by utilizing the screened random forest regressor and random forest classifier prediction models and combining the test set data of the urban area A;

2) for the random forest regressor, the test results show a goodness-of-fit statistic R² of 0.744 (the fitting effect is shown in FIG. 16) and an RMSE of 57103.93; the Bland-Altman consistency evaluation shows a mean difference of 2790.21, and FIG. 17 shows that the differences between predicted and actual values are fairly concentrated, that the predicted values are on the whole higher than the actual values, and that there are extreme values with large prediction errors. For the random forest classifier, the test results show an accuracy of 54.4% (95% confidence interval 0.495 to 0.594, P = 0.002) and a kappa coefficient of 0.221, and the paired chi-square test gives P = 0.058. Among the three purchase quantity multiplier ratings ("less than 1 time", "1-3 times" and "more than 3 times"), the prediction error is largest for "more than 3 times", with a classification error rate of 72.5%, compared with 36.0% for "1-3 times" and 40.8% for "less than 1 time"; the multidimensional scaling result (FIG. 18) shows that the three groups are relatively similar to one another;

3) of the two models built on the city A data, the classifier extrapolates poorly and the regressor extrapolates well, so integrating the two models to combine their strengths is not considered.
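Per-class classification error rates of the kind reported for the classifier can be derived from a confusion matrix, as in the sketch below. The labels and predictions are made up and do not reproduce the city A results; the short class labels are placeholders for the three multiplier ratings.

```python
from collections import Counter, defaultdict


def confusion_matrix(actual, predicted):
    """Nested dict: matrix[actual_label][predicted_label] -> count."""
    matrix = defaultdict(Counter)
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return {a: dict(row) for a, row in matrix.items()}


def class_error_rates(actual, predicted):
    """Share of each actual class that was assigned to some other class."""
    totals = Counter(actual)
    wrong = Counter(a for a, p in zip(actual, predicted) if a != p)
    return {label: wrong[label] / totals[label] for label in totals}


# Made-up labels for illustration only.
actual = ["<1", "<1", "1-3", "1-3", "1-3", ">3", ">3"]
predicted = ["<1", "1-3", "1-3", "1-3", "<1", "1-3", ">3"]
matrix = confusion_matrix(actual, predicted)
errors = class_error_rates(actual, predicted)
print(matrix)
print(errors)
```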

12. Expert evaluation: clinical pharmacy experts and pharmacoeconomics experts evaluate and analyze the model's predictions in light of their experience and reference data, propose modifications concerning the rationality of the independent variables included in the model, the influence of latent variables, shortcomings of the modeling method and the like, and assess the practicability of the model.

13. Practicability judgment: when the prediction model has not reached the practical stage, the process returns to the data requirement stage according to the modification suggestions and evaluation results, considers adding or removing independent variables, switching to other prediction models or using combined models, and enters the next cycle of building the volume-reporting prediction model; when the prediction model has reached the practical stage, it is stored in the prediction model database.

14. Real-environment testing: volume reporting for the fifth batch of national centralized procurement in city A is taken as the test environment for the existing model. In cooperation with the user, the prediction model is given a degree of refined, personalized management based on data on the user's purchasing behavior, such as the medicine purchasing budget, and the deviation between predicted and actual values is evaluated when the fifth batch is executed.

The volume-reporting predictions for the fifth batch of national centralized procurement in city A are as follows: for the promethazine oral sustained-release dosage form, the predicted purchase quantity is 33282 with a predicted growth rate of -16.79% in hospital A, and the predicted purchase quantity is 5090 with a predicted growth rate of 3.89% in hospital B; for the propranolol oral sustained-release dosage form, the purchase quantity before centralized procurement is 9062, the predicted purchase quantity is 14611 and the predicted growth rate is 61.23% in hospital C, and the purchase quantity before centralized procurement is 937, the predicted purchase quantity is 14611 and the predicted growth rate is 245.88% in hospital D.
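The predicted growth rate can be checked with a one-line calculation. The sketch assumes the rate is the relative change from the purchase quantity before centralized procurement to the predicted purchase quantity, which is consistent with the figures given for hospital C; the function name is introduced here for illustration.

```python
def predicted_growth_rate(pre_quantity: float, predicted_quantity: float) -> float:
    """Relative change from the earlier quantity to the prediction, in percent."""
    return (predicted_quantity - pre_quantity) / pre_quantity * 100.0


# Hospital C figures from the example: 9062 before, 14611 predicted.
rate_c = predicted_growth_rate(9062, 14611)
print(f"{rate_c:.2f}%")  # 61.23%
```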

15. Whether to re-evaluate: whether the model needs re-evaluation is judged from the deviation of the volume-reporting predictions for the fifth batch of national centralized procurement. If re-evaluation is needed, the process returns to the model evaluation stage, experts identify the source of the prediction deviation from the real-environment test results and propose improvements, and the next modeling stage begins; if no re-evaluation is needed, the prediction model is incorporated into the centralized procurement monitoring system.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A drug market size forecasting system based on machine learning, comprising the steps of:

S2, purchase-quantity-related data: according to the data requirements for medicine market scale prediction, data related to the reported purchase volume are retrieved from a medicine transaction database and stored in a structured standard data table;

S9, building a prediction model;

S10, whether all variables are traversed: it is checked whether the loop has passed through all independent variables in the hospital training set; if not, the loop continues; if so, the loop ends and the optimal model is selected from the set of prediction models according to goodness of fit or accuracy;

S11, model evaluation;

S14, real-environment testing;

2. The machine learning-based pharmaceutical market size prediction system of claim 1, wherein the step S3 comprises the steps of:

3. The machine learning-based pharmaceutical market size prediction system of claim 2, wherein the step S9 comprises the steps of:

4. The machine learning-based pharmaceutical market size prediction system of claim 3, wherein the step S11 comprises the steps of:

5. The machine learning-based pharmaceutical market size prediction system of claim 4, wherein the step S14 comprises the steps of:

6. The machine learning-based drug market size forecasting system of claim 5, wherein in step S5, the drug attributes include essential drug classification, medical insurance classification, ATC group purchase amount, route of administration and the like; the hospital attributes include medical institution grade, medicine purchasing scale, primary-level classification, administrative region and the like; market competition includes the number of competing enterprises, the market share of imported-product enterprises, the number of enterprises that have passed consistency evaluation, the number of enterprises ranked in the Ministry of Industry and Information Technology top 100, the market share of those top-100 enterprises and the like; and sales volume and price include the purchase quantity and purchase amount of the medicine in the previous purchasing period, the trend growth rate of the purchase quantity where the time series data are complete, the weighted average, standard deviation, range, median, maximum and minimum of the medicine price and the like, and the purchase quantity and purchase quantity multiplier rating in the target purchasing period.