CN116245399A

CN116245399A - Model training method and device, nonvolatile storage medium and electronic equipment

Info

Publication number: CN116245399A
Application number: CN202211711847.1A
Authority: CN
Inventors: 周栋; 陈雄挺; 夏群峰; 王伟; 赵明席; 祝思佳
Original assignee: Zhejiang Supcon Technology Co Ltd
Current assignee: Zhejiang Supcon Technology Co Ltd
Priority date: 2022-12-29
Filing date: 2022-12-29
Publication date: 2023-06-09

Abstract

The invention discloses a model training method and device, a nonvolatile storage medium and electronic equipment. The method comprises: obtaining a first training sample set, wherein each group of training data in the first training sample set comprises a preset evaluation result obtained by carrying out development evaluation on a preset chemical industry park, and at least one corresponding first evaluation index; training a preset machine learning model on the first training sample set to generate a first index model; determining a second training sample set according to the number of times each first evaluation index is used in the first index model, wherein each group of training data in the second training sample set comprises the preset evaluation result and at least one corresponding second evaluation index, the second evaluation index being a first evaluation index whose number of uses is higher than a preset threshold; and training the preset machine learning model on the second training sample set to generate a second index model. The invention solves the technical problem of low efficiency of development evaluation of chemical industry parks.

Description

Model training method and device, nonvolatile storage medium and electronic equipment

Technical Field

The present invention relates to the field of machine learning, and in particular, to a model training method and apparatus, a nonvolatile storage medium, and an electronic device.

Background

Identification, re-inspection, supervision and assessment of chemical industry parks are important tasks. Development assessment of a chemical industry park requires formulating a series of assessment standards, collecting a large amount of supporting material, and organizing experts to analyze and judge the material and data before a park assessment grade is finally given. This traditional process is cumbersome, time-consuming and subject to long delays, and it cannot support timely analysis of industrial development on a daily or monthly basis. Moreover, the actual state of park development between two consecutive assessments is a black box that cannot be monitored, which leaves a significant supervision risk.

No effective solution has yet been proposed for the problem of low efficiency of development evaluation of chemical industry parks.

Disclosure of Invention

The embodiment of the invention provides a model training method and device, a nonvolatile storage medium and electronic equipment, which are used for at least solving the technical problem of low efficiency of development evaluation of a chemical industry park.

According to an aspect of an embodiment of the present invention, there is provided a model training method including: obtaining a first training sample set, wherein the first training sample set comprises a plurality of groups of training data, each group of training data comprising: a preset evaluation result obtained by carrying out development evaluation on a preset chemical industry park, and at least one corresponding first evaluation index; training a preset machine learning model on the first training sample set to generate a first index model, wherein the first index model is used for determining the preset evaluation result according to the at least one first evaluation index; determining a second training sample set according to the number of times each first evaluation index is used in the first index model, wherein the second training sample set comprises a plurality of groups of training data, each group of training data comprising: the preset evaluation result and at least one corresponding second evaluation index, wherein the second evaluation index is a first evaluation index whose number of uses is higher than a preset threshold; and training the preset machine learning model on the second training sample set to generate a second index model, wherein the second index model is used for determining the preset evaluation result according to the at least one second evaluation index.

Optionally, determining the second training sample set according to the number of times each first evaluation index is used in the first index model includes: analyzing the first index model to determine the number of times each first evaluation index is used by the first index model in the process of obtaining the preset evaluation result from the at least one first evaluation index; sorting the at least one first evaluation index in descending order of the number of uses to generate an evaluation index queue; determining, in the evaluation index queue, the first evaluation indexes ranked before a preset rank as the second evaluation indexes; and determining the second training sample set according to the second evaluation indexes and the preset evaluation results corresponding to the second evaluation indexes.

Optionally, acquiring the first training sample set includes: acquiring historical evaluation data, wherein the historical evaluation data at least comprises: a plurality of preset evaluation results and at least one piece of evaluation data corresponding to each preset evaluation result; performing data cleaning on the evaluation data in the historical evaluation data to generate a preset sample set, wherein the preset sample set comprises: the plurality of preset evaluation results and at least one first evaluation index corresponding to each preset evaluation result; and selecting the first training sample set from the preset sample set according to a preset proportion.

Optionally, performing data cleaning on the evaluation data in the historical evaluation data to generate the preset sample set includes: determining a missing amount of the evaluation data in the case that data is missing from the evaluation data in the historical evaluation data; in the case that the missing amount is not higher than a preset missing threshold, deleting the preset evaluation result corresponding to the missing evaluation data from the historical evaluation data to generate the preset sample set; and in the case that the missing amount is higher than the preset missing threshold, filling the missing evaluation data with the average value of the evaluation data in the historical evaluation data to generate the preset sample set.

Optionally, performing data cleaning on the evaluation data in the historical evaluation data, and generating a preset sample set includes: identifying abnormal data in the evaluation data; and replacing the abnormal data by using an average value of the evaluation data in the historical evaluation data to generate the preset sample set.

Optionally, performing data cleaning on the evaluation data in the historical evaluation data to generate the preset sample set includes: identifying category data in the evaluation data; and performing coding conversion on the category data to generate the preset sample set.

Optionally, after training the second training sample set using the preset machine learning model to generate a second index model, the method further includes: acquiring at least one target evaluation index; and analyzing at least one target evaluation index by using the second index model to determine a target evaluation result.

According to another aspect of the embodiment of the present invention, there is also provided a model training apparatus, including: an acquisition module, configured to acquire a first training sample set, wherein the first training sample set comprises a plurality of groups of training data, and each group of training data comprises: a preset evaluation result obtained by carrying out development evaluation on a preset chemical industry park, and at least one corresponding first evaluation index; a first training module, configured to train a preset machine learning model on the first training sample set to generate a first index model, wherein the first index model is used for determining the preset evaluation result according to the at least one first evaluation index; a determining module, configured to determine a second training sample set according to the number of times each first evaluation index is used in the first index model, wherein the second training sample set comprises a plurality of groups of training data, and each group of training data comprises: the preset evaluation result and at least one corresponding second evaluation index, wherein the second evaluation index is a first evaluation index whose number of uses is higher than a preset threshold; and a second training module, configured to train the preset machine learning model on the second training sample set to generate a second index model, wherein the second index model is used for determining the preset evaluation result according to the at least one second evaluation index.

According to another aspect of the embodiment of the present invention, there is further provided a nonvolatile storage medium storing a program, wherein when the program runs, the device in which the nonvolatile storage medium is located is controlled to execute the model training method described above.

According to another aspect of the embodiment of the present invention, there is also provided an electronic device, including: a memory and a processor, wherein the processor is used for running a program stored in the memory, and the program, when run, executes the model training method described above.

In an embodiment of the present invention, a first training sample set is obtained, where the first training sample set includes a plurality of groups of training data, and each group of training data includes: a preset evaluation result obtained by carrying out development evaluation on a preset chemical industry park, and at least one corresponding first evaluation index. A preset machine learning model is trained on the first training sample set to generate a first index model, which is used for determining the preset evaluation result according to the at least one first evaluation index. A second training sample set is determined according to the number of times each first evaluation index is used in the first index model, where the second training sample set includes a plurality of groups of training data, and each group of training data includes: the preset evaluation result and at least one corresponding second evaluation index, the second evaluation index being a first evaluation index whose number of uses is higher than a preset threshold. The preset machine learning model is then trained on the second training sample set to generate a second index model, which is used for determining the preset evaluation result according to the at least one second evaluation index. Intelligent, digital evaluation of the chemical industry park is thereby realized. Because the first index model is used to select a subset of the first evaluation indexes of the first training sample set as the second evaluation indexes for training the second index model, the second index model can determine the evaluation result of a chemical industry park from fewer evaluation indexes. This achieves the technical effect of improving the efficiency of development evaluation of chemical industry parks and solves the technical problem of low efficiency of such evaluation.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:

FIG. 1 is a flow chart of a model training method according to an embodiment of the present invention;

FIG. 2a is a schematic diagram of a dynamic analysis method for chemical industry park index based on the LightGBM algorithm according to an embodiment of the invention;

FIG. 2b is a schematic diagram of the 8 evaluation dimensions of the chemical industry park index and the corresponding table of core data indexes and numerical units, according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a model training apparatus according to an embodiment of the present invention;

FIG. 4 is a block diagram of a computer terminal according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For better understanding of the embodiments of the present application, technical terms related in the embodiments of the present application are explained below:

Chemical industry park index: a comprehensive evaluation model of a chemical industry park constructed from 8 dimensions, namely park scale, park industrial chain, safe production, environmental protection, green development, digital development, mu average benefit (benefit per mu of land, the mu being a Chinese unit of area) and park activity. It is used for dynamically evaluating the development of the park, and the evaluation result is divided into four grades, A, B, C and D, to assist in the daily management and cultivation of the park.

LightGBM: GBDT (Gradient Boosting Decision Tree) is a machine learning model that iteratively trains weak classifiers (decision trees) to obtain an optimal model, and has advantages such as good training results and resistance to overfitting. GBDT is commonly used in industry for tasks such as multi-class classification, click-through rate prediction and search ranking. LightGBM (Light Gradient Boosting Machine) is a framework implementing the GBDT algorithm; it supports efficient parallel training, offers faster training speed, lower memory consumption and higher accuracy, and supports fast processing of massive data.

In accordance with an embodiment of the present invention, a model training method embodiment is provided, it being noted that the steps shown in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order other than that shown or described herein.

FIG. 1 is a flow chart of a model training method according to an embodiment of the invention, as shown in FIG. 1, comprising the steps of:

Step S102, obtaining a first training sample set, wherein the first training sample set comprises a plurality of groups of training data, and each group of training data comprises: a preset evaluation result obtained by carrying out development evaluation on a preset chemical industry park, and at least one corresponding first evaluation index;

Step S104, training a preset machine learning model on the first training sample set to generate a first index model, wherein the first index model is used for determining the preset evaluation result according to the at least one first evaluation index;

Step S106, determining a second training sample set according to the number of times each first evaluation index is used in the first index model, wherein the second training sample set comprises a plurality of groups of training data, and each group of training data comprises: the preset evaluation result and at least one corresponding second evaluation index, wherein the second evaluation index is a first evaluation index whose number of uses is higher than a preset threshold;

Step S108, training the preset machine learning model on the second training sample set to generate a second index model, wherein the second index model is used for determining the preset evaluation result according to the at least one second evaluation index.
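As a minimal, non-limiting sketch, steps S104 to S108 can be expressed as a two-stage routine. The names `two_stage_training`, `Sample`, `Model` and `Trainer` are hypothetical and do not appear in the original text; the trainer is passed in as a function that returns both a fitted model and, for each index, the number of times the model used it. The sketch assumes every sample carries every index that the trainer reports.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Dict[str, float]        # index name -> index value for one park record
Model = Callable[[Sample], str]  # maps one record to an evaluation result
# Hypothetical trainer interface: returns the fitted model and, for each
# index name, the number of times the model used that index.
Trainer = Callable[[Sequence[Sample], Sequence[str]], Tuple[Model, Dict[str, int]]]


def two_stage_training(
    samples: Sequence[Sample],
    results: Sequence[str],
    train: Trainer,
    threshold: int,
) -> Tuple[Model, List[str]]:
    """Train a first model, keep the frequently used indexes, train a second model."""
    if len(samples) != len(results):
        raise ValueError("each sample needs exactly one preset evaluation result")

    # S104: train the first index model on all first evaluation indexes.
    _first_model, usage = train(samples, results)

    # S106: keep only indexes whose usage count is higher than the threshold.
    second_indexes = [name for name, count in usage.items() if count > threshold]
    second_samples = [{name: s[name] for name in second_indexes} for s in samples]

    # S108: train the second index model on the reduced sample set.
    second_model, _ = train(second_samples, results)
    return second_model, second_indexes
```

If LightGBM is used as the preset machine learning model, one possible `train` wrapper could fit a `lightgbm.LGBMClassifier` and read its `feature_importances_` attribute, which with the default `importance_type="split"` reports how many times each feature is used in the model. That wrapper is an assumption about one way to obtain the usage counts, not something the text specifies.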

In step S102, the first training sample set may be selected from a plurality of pieces of historical data of the preset chemical industry park, where the historical data at least includes: a preset evaluation result and the one or more first evaluation indexes used to obtain that preset evaluation result.

In the step S102, the preset evaluation result may be an expert evaluation result obtained by applying the chemical industry park expert audit evaluation method.

Optionally, the expert evaluation result (i.e. the preset evaluation result) is the score given by the experts to the preset chemical industry park, and the score is between 55 and 100.

Optionally, the expert scores are converted into grades according to the following rules: if the score is greater than or equal to 85, the chemical industry park index (i.e. the preset evaluation result) is A; if the score is greater than or equal to 75 and less than 85, the chemical industry park index is B; if the score is greater than or equal to 65 and less than 75, the chemical industry park index is C; if the score is less than 65, the chemical industry park index is D.
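The grading rules above can be sketched as a small function. This reads the thresholds as 85, 75 and 65 with each lower bound inclusive; the function name `grade_from_score` is hypothetical.

```python
def grade_from_score(score: float) -> str:
    """Map an expert score to a chemical industry park index grade."""
    if score >= 85:
        return "A"
    if score >= 75:
        return "B"
    if score >= 65:
        return "C"
    return "D"
```

Checking the thresholds from highest to lowest means each branch only needs its lower bound, so the ranges cannot overlap or leave a gap.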

In step S102, the first evaluation indexes cover 8 dimensions: park scale, mu average benefit, digital development, industrial chain, safe production, environmental protection, green development and park activity.

Optionally, the core data indexes (e.g., the first evaluation indexes) included in park scale include: the number of enterprises above designated size in the park in the previous year, and the number of listed enterprises in the park in the previous year.

Optionally, the core data indexes (e.g., the first evaluation indexes) included in mu average benefit include: the tax revenue per mu of the park in the previous year, the added value per mu of the park in the previous year, the overall labor productivity of the park in the previous year, the added value per unit of energy consumption of the park in the previous year, the added value per unit of emissions of the park in the previous year, the ratio of research and development expenditure to main business income of the park in the previous year, the enterprise tax revenue of the park in the previous year, and the enterprise tax revenue of the park in the last month.

Optionally, the core data indexes (e.g., the first evaluation indexes) included in digital development include: whether an intelligent management system has been built in the park, and the annual investment of enterprises in the park in intelligent manufacturing systems.

Optionally, the core data index (e.g., the first evaluation index) included in the industrial chain includes: the in-park product sales rate, where in-park product sales rate = (quantity of in-park products sold to in-park enterprises / total quantity of in-park products) × 100%.
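The sales-rate formula above can be illustrated as follows. The function and parameter names are hypothetical, and rejecting a non-positive total is an added assumption to avoid division by zero.

```python
def in_park_sales_rate(sold_to_park_enterprises: float, total_park_products: float) -> float:
    """Percentage of in-park products that are sold to enterprises inside the park."""
    if total_park_products <= 0:
        raise ValueError("total quantity of in-park products must be positive")
    return sold_to_park_enterprises / total_park_products * 100.0
```

For example, 30 units sold inside the park out of 120 units produced gives a rate of 25%.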

Optionally, the core data indexes (e.g., the first evaluation indexes) included in safe production include: the current number of hidden dangers in the park, the hidden danger rectification rate of the park in the last month, the current number of major hidden dangers in the park, the enterprise risk level of the park in the previous year, the number of major safety accidents in the park in the last month, the number of emergency resources in the park, and the number of emergency teams in the park.

Optionally, the core data indexes (e.g., the first evaluation indexes) included in environmental protection include: the number of environmental penalties in the park in the last month, the atmospheric exceedance status of the park in the last month, the number of days of excessive sewage discharge in the park in the last month, the number of days of excessive waste gas emission in the park in the last month, the environmental credit evaluation of enterprises in the previous year, the number of days of surface water pollution in the park in the last month, and the number of serious pollution events in the previous year.

Optionally, the core data indexes (e.g., the first evaluation indexes) included in green development include: the energy consumption per ten thousand yuan of output value of the park in the last month, and the energy over-quota rate of the park in the last month, where energy over-quota rate = ((energy used by the park in the last month - energy quota of the park for the last month) / energy quota of the park for the last month) × 100%.
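The energy-use rate above (the excess of energy used over the quota, relative to the quota) can be sketched as below. The function and parameter names are hypothetical, and the check for a positive quota is an added assumption.

```python
def energy_over_quota_rate(energy_used: float, energy_quota: float) -> float:
    """Percentage by which the park's energy use exceeds its quota.

    A negative value means the park stayed under its quota.
    """
    if energy_quota <= 0:
        raise ValueError("energy quota must be positive")
    return (energy_used - energy_quota) / energy_quota * 100.0
```

A park that used 120 units against a quota of 100 units would have a rate of 20%, while one that used 80 units would have a rate of -20%.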

Optionally, the core data indexes (e.g., the first evaluation indexes) included in park activity include: the daily number of trucks entering and leaving the park, the number of projects under construction in the park, the total investment in projects under construction in the park, and the number of patents obtained by the park in the year.

In step S104, the preset machine learning model is LightGBM, which iteratively trains weak classifiers (decision trees) to obtain an optimal model and has advantages such as good training results and resistance to overfitting.

In step S106, the second evaluation indexes are selected from the first evaluation indexes in the first training sample set, and the second training sample set can be determined from them together with the preset evaluation results in the first training sample set.

As an optional embodiment, determining the second training sample set according to the number of times each first evaluation index is used in the first index model comprises: analyzing the first index model to determine the number of times each first evaluation index is used by the first index model in the process of obtaining the preset evaluation result from the at least one first evaluation index; sorting the at least one first evaluation index in descending order of the number of uses to generate an evaluation index queue; determining, in the evaluation index queue, the first evaluation indexes ranked before a preset rank as the second evaluation indexes; and determining the second training sample set according to the second evaluation indexes and the preset evaluation results corresponding to the second evaluation indexes.

In this embodiment of the invention, the importance of each first evaluation index is determined according to the number of times it is used by the first index model; the one or more first evaluation indexes are ranked by importance, and the first evaluation indexes ranked before the preset rank are selected as the second evaluation indexes, so that the second training sample set is selected from the first training sample set.

Optionally, when there are 34 first evaluation indexes, the first 16 first evaluation indexes in the evaluation index queue may be selected as the second evaluation indexes.
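The queue-based selection described above can be sketched as follows, assuming the usage counts have already been extracted from the first index model into a dictionary. The function names are hypothetical, and ties are kept in their original order, which is an assumption since the text does not say how ties are broken.

```python
from typing import Dict, List


def build_index_queue(usage_counts: Dict[str, int]) -> List[str]:
    """Sort index names by usage count, highest first (ties keep input order)."""
    return sorted(usage_counts, key=lambda name: usage_counts[name], reverse=True)


def select_second_indexes(usage_counts: Dict[str, int], preset_rank: int) -> List[str]:
    """Take the indexes ranked before the preset rank as the second evaluation indexes."""
    return build_index_queue(usage_counts)[:preset_rank]
```

With 34 first evaluation indexes and `preset_rank=16`, the call returns the 16 most frequently used index names.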

As an optional embodiment, acquiring the first training sample set comprises: acquiring historical evaluation data, wherein the historical evaluation data at least comprises: a plurality of preset evaluation results and at least one piece of evaluation data corresponding to each preset evaluation result; performing data cleaning on the evaluation data in the historical evaluation data to generate a preset sample set, wherein the preset sample set comprises: the plurality of preset evaluation results and at least one first evaluation index corresponding to each preset evaluation result; and selecting the first training sample set from the preset sample set according to a preset proportion.

According to the embodiment of the invention, the preset sample set can be selected from the historical data of a plurality of preset chemical parks, and then the first training sample set for training the first index model and the first test sample set for testing the first index model are selected from the preset sample set according to the preset proportion.

Optionally, the preset ratio is used to represent a data volume ratio of the first training sample set to the first test sample set, and the preset ratio is 7:3.
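A 7:3 division of the preset sample set into training and test samples might look like the sketch below. Shuffling before the split and the fixed seed are assumptions added for illustration; the text only specifies the ratio. The function name `split_by_ratio` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_ratio(
    samples: Sequence[T],
    train_parts: int = 7,
    test_parts: int = 3,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Split samples into (training set, test set) by a train:test ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]
```

Integer arithmetic is used for the cut point so that, for example, 10 samples always yield exactly 7 training and 3 test samples.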

Optionally, the second training sample set may be selected from the first training sample set after the first training sample set is determined, wherein selecting the second training sample set includes: selecting the second evaluation indexes from the at least one first evaluation index of the first training sample set according to the number of times each first evaluation index is used in the first index model; combining them with the preset evaluation results in the first training sample set to determine a preset sample set for the second index model; and then selecting from that preset sample set, according to the preset proportion, the second training sample set for training the second index model and a second test sample set for testing the second index model.

As an alternative embodiment, performing data cleansing on the evaluation data in the historical evaluation data, and generating the preset sample set includes: determining the missing amount of the evaluation data under the condition that the evaluation data in the historical evaluation data has data missing; deleting a preset evaluation result corresponding to the missing evaluation data from the historical evaluation data under the condition that the missing amount is not higher than a preset missing threshold value, and generating a preset sample set; and under the condition that the missing amount is higher than a preset missing threshold value, filling the missing evaluation data by using the average value of the evaluation data in the historical evaluation data, and generating a preset sample set.

In this embodiment of the invention, when data is missing from the evaluation data in the historical evaluation data, all data in the historical evaluation data of the same type as the missing evaluation data is first determined, and whether to delete or to fill the missing evaluation data is then decided according to the missing amount relative to all data of that type. When the missing amount is not higher than the preset missing threshold, the preset evaluation result corresponding to the missing evaluation data is deleted from the historical evaluation data. When the missing amount is higher than the preset missing threshold, the average value is computed over all data in the historical evaluation data of the same type as the missing evaluation data, and the missing evaluation data is filled with that average value. Cleaning the missing evaluation data in this way yields first evaluation indexes that meet the requirements of the model, and thus the preset sample set needed for training the first index model.
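The delete-or-fill rule above can be sketched for a single index as follows. Several points are assumptions made for illustration: missing values are represented as `None`, the "missing amount" is measured as the fraction of records lacking the index, and "deleting the preset evaluation result" is implemented as dropping the whole record. The names `clean_missing` and `Record` are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]  # index name -> value, None when missing


def clean_missing(records: List[Record], index_name: str, missing_threshold: float) -> List[Record]:
    """Drop or mean-fill records whose value for index_name is missing."""
    present = [r[index_name] for r in records if r[index_name] is not None]
    missing_count = len(records) - len(present)
    if missing_count == 0:
        return list(records)

    missing_ratio = missing_count / len(records)
    if missing_ratio <= missing_threshold:
        # Not higher than the threshold: delete the affected records.
        return [r for r in records if r[index_name] is not None]

    # Higher than the threshold: fill with the mean of the values that exist.
    if not present:
        raise ValueError("cannot compute a mean when every value is missing")
    fill_value = mean(present)
    return [
        {**r, index_name: fill_value} if r[index_name] is None else r
        for r in records
    ]
```

The function returns a new list and copies any record it fills, so the historical evaluation data passed in is left unchanged.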

As an alternative embodiment, performing data cleansing on the evaluation data in the historical evaluation data, and generating the preset sample set includes: identifying abnormal data in the evaluation data; and replacing the abnormal data by using the average value of the evaluation data in the historical evaluation data to generate a preset sample set.

In this embodiment of the invention, abnormal data in the historical evaluation data is identified, all data in the historical evaluation data of the same type as the abnormal data is determined, and the abnormal data is replaced with the average value of all data of that type. The abnormal data is thereby cleaned, yielding first evaluation indexes that meet the requirements of the model and thus the preset sample set needed for training the first index model.
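A possible sketch of the replacement step is shown below. Because the text does not say how abnormal data is identified, the detection rule is supplied by the caller as a predicate; the function name `replace_abnormal` and the predicate parameter are hypothetical. The average is taken over all values of the index, abnormal ones included, which follows the wording above literally.

```python
from statistics import mean
from typing import Callable, List, Sequence


def replace_abnormal(values: Sequence[float], is_abnormal: Callable[[float], bool]) -> List[float]:
    """Replace every value flagged by is_abnormal with the mean of all values."""
    if not values:
        return []
    # Mean over all data of this type, as described; excluding the flagged
    # values from the mean would be a reasonable variant.
    average = mean(values)
    return [average if is_abnormal(v) else v for v in values]
```

For instance, `replace_abnormal(values, lambda v: v < 0)` would replace negative readings of an index that can only be non-negative.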

As an optional embodiment, performing data cleaning on the evaluation data in the historical evaluation data to generate the preset sample set includes: identifying category data in the evaluation data; and performing coding conversion on the category data to generate the preset sample set.

According to the embodiment of the invention, the data cleaning of the category data is realized by coding and converting the evaluation data of the category data in the history evaluation data, so that the first evaluation index meeting the use requirement of the model can be obtained, and a preset sample set required for training the first index model is further obtained.
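The text does not name a specific coding scheme. As one illustrative assumption, category data such as a yes/no answer or a risk level could be converted with simple integer (label) encoding, assigning codes in order of first appearance. The function name `encode_categories` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def encode_categories(values: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Convert category values to integer codes and return the mapping used."""
    mapping: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # next unused integer code
        codes.append(mapping[value])
    return codes, mapping
```

Returning the mapping alongside the codes lets the same conversion be applied later to the target evaluation indexes of a park being evaluated.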

As an alternative embodiment, after training the second training sample set using the preset machine learning model to generate the second index model, the method further includes: acquiring at least one target evaluation index; and analyzing at least one target evaluation index by using the second index model to determine a target evaluation result.

In this embodiment of the invention, the second index model can accurately determine the target evaluation result of the target chemical industry park indicated by the target evaluation indexes while using fewer target evaluation indexes, thereby achieving the technical effect of improving evaluation efficiency.
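Applying the second index model to a target park can be sketched as below, assuming the model is available as a callable that takes the index values in a fixed order. The names `evaluate_park`, `second_index_names` and `second_model` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def evaluate_park(
    target_indexes: Dict[str, float],
    second_index_names: Sequence[str],
    second_model: Callable[[List[float]], str],
) -> str:
    """Determine the target evaluation result from the second evaluation indexes only."""
    missing = [name for name in second_index_names if name not in target_indexes]
    if missing:
        raise KeyError(f"missing target evaluation indexes: {missing}")
    # Any extra indexes supplied by the caller are ignored; only the reduced
    # set that the second index model was trained on is passed in.
    features = [target_indexes[name] for name in second_index_names]
    return second_model(features)
```

Only the indexes in `second_index_names` need to be collected for the target park, which is where the efficiency gain described above comes from.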

The invention also provides a preferred embodiment, and the preferred embodiment provides a dynamic analysis method and a system for the chemical industry park index based on the LightGBM algorithm, so that intelligent and accurate assessment of the chemical industry park index is realized, and the level of government real-time supervision on the chemical industry park is improved.

FIG. 2a is a schematic diagram of a dynamic analysis method for a chemical industry park index based on the LightGBM algorithm according to an embodiment of the invention. As shown in FIG. 2a, the method includes the following steps:

step S201, obtaining the chemical industry park expert auditing evaluation method and the results (such as preset evaluation results) of the chemical industry park expert evaluation.

In step S202, a chemical industry park index model (e.g., a first index model) for intelligent evaluation is constructed, where the chemical industry park index model (e.g., the first index model) covers 8 evaluation dimensions, and a core data index 34 item is constructed as an initial feature variable set (e.g., a first training sample set) based on 8 evaluation temperatures.

As an optional example, according to the expert auditing and evaluation method for the chemical industry park, the evaluation is broken down into 7 items: basic requirements, planning layout, safe production, environmental protection, green development, digital development and mu average benefit. A chemical industry park index model (e.g., a first index model) for intelligent evaluation is constructed based on this expert scheme, and the model covers 8 evaluation dimensions: park scale, mu average benefit, digital development, industrial chain, safe production, environmental protection, green development and park activity. According to the definition of each evaluation dimension, core data indexes are constructed, 34 items in total, as the initial feature variables (e.g., the first evaluation indexes).

Optionally, the expert auditing evaluation method should include: expert assessment items, evaluation standards, evaluation basis requirements and the like.

Optionally, the result of the expert evaluation (such as the preset evaluation result) is a score of the expert on the chemical industry park, and the score is between 55 and 100 points.

Optionally, in the training sample set (e.g., a preset sample set), the expert scores are graded according to the following rules: if the score is greater than or equal to 85, the chemical industry park index is A; if the score is greater than or equal to 75 and less than 85, the chemical industry park index is B; if the score is greater than or equal to 65 and less than 75, the chemical industry park index is C; if the score is less than 65, the chemical industry park index is D.
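The grading rule above can be illustrated with a short sketch. The function name `grade_from_score` is hypothetical, and the C band is read here as 65 ≤ score < 75, which is the only reading consistent with the neighbouring B and D bands.

```python
def grade_from_score(score: float) -> str:
    """Map an expert score to a park index grade (A/B/C/D).

    Illustrative only: the band edges follow the rule stated in the text,
    with the C band taken as 65 <= score < 75.
    """
    if score >= 85:
        return "A"
    if score >= 75:
        return "B"
    if score >= 65:
        return "C"
    return "D"


# Grade a few example scores (the scores themselves are made up).
grades = [grade_from_score(s) for s in (92, 85, 78, 65, 60)]
print(grades)
```

Turning the continuous score into four grades in this way makes the later model training a four-class classification task.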

Fig. 2b is a schematic diagram of the 8 evaluation dimensions of the chemical industry park index and the corresponding core data index table and numerical unit table according to an embodiment of the invention. As shown in fig. 2b, the 8 dimensions of the chemical industry park index model (e.g., the first index model) are: park scale, mu average benefit, digital development, industrial chain, safe production, environmental protection, green development and park activity.

In step S203, historical evaluation data of the chemical industry park is obtained, and a modeled data set (e.g., a preset sample set) is obtained through data cleaning and preprocessing.

Optionally, according to the core data indexes covered by the constructed chemical industry park index, real-time data (e.g., target evaluation indexes) are accessed and historical data (e.g., historical evaluation data) are imported through the chemical industry park.

Optionally, the accessed data are gathered to a data warehouse, unified data cleaning is carried out, a core data index is obtained through data processing, and historical data and expert evaluation results of corresponding years are used as training sample sets.

Optionally, the data cleaning process includes: unifying data units, missing value processing and abnormal value processing.

Optionally, unifying data units means performing unified unit conversion on each feature variable for integer, floating-point and category type data.

Optionally, the missing value processing combines a deletion method and a filling method: when the missing proportion of a variable is not higher than 5%, the deletion method is used to directly delete the missing data; when the missing proportion exceeds 5%, the filling method is used to fill the missing data with the average value of the variable.
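A minimal sketch of this delete-or-fill rule follows, under several assumptions that are not stated in the text: samples are held as a list of dictionaries, a missing value is represented by `None`, "deleting the missing data" is taken to mean dropping the affected samples, and columns are processed one after another (so a deletion in one column changes the sample count seen by the next). The name `handle_missing` is hypothetical.

```python
from statistics import mean


def handle_missing(rows, columns, threshold=0.05):
    """Apply the delete-or-fill rule to each column in turn.

    rows      -- list of dicts, one per sample; None marks a missing value
    columns   -- names of the feature variables to check
    threshold -- missing proportion at or below which samples are deleted
    """
    rows = [dict(r) for r in rows]  # copy so the caller's data is untouched
    for col in columns:
        if not rows:
            break
        n_missing = sum(1 for r in rows if r[col] is None)
        if n_missing == 0:
            continue
        if n_missing / len(rows) <= threshold:
            # Low missing proportion: drop the samples that lack this value.
            rows = [r for r in rows if r[col] is not None]
        else:
            present = [r[col] for r in rows if r[col] is not None]
            if not present:
                continue  # nothing to average; leave the column as it is
            fill = mean(present)
            for r in rows:
                if r[col] is None:
                    r[col] = fill
    return rows


# 40 made-up samples: "a" is missing once (2.5%), "b" is missing ten times.
samples = [
    {"a": None if i == 0 else float(i), "b": None if 1 <= i <= 10 else 2.0}
    for i in range(40)
]
cleaned = handle_missing(samples, ["a", "b"])
print(len(cleaned), cleaned[0])
```

Here the single sample lacking `a` is dropped, while the gaps in `b` are filled with the column mean.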

Optionally, abnormal values are judged in two ways: a negative number appearing in a variable that should be non-negative, and a zero value appearing in a variable that should be non-zero. Abnormal data are replaced with the average value of the data by an average value replacement method.
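The two abnormal value checks and the average value replacement might look like the sketch below. The text does not say which values enter the average; the sketch assumes that only the non-abnormal values of the same variable are averaged. The name `replace_abnormal` and its flags are hypothetical.

```python
from statistics import mean


def replace_abnormal(values, non_negative=True, non_zero=False):
    """Replace abnormal entries of one variable with the mean of valid ones.

    non_negative -- treat negative numbers as abnormal
    non_zero     -- treat zero values as abnormal
    """
    def is_abnormal(v):
        return (non_negative and v < 0) or (non_zero and v == 0)

    valid = [v for v in values if not is_abnormal(v)]
    if not valid:
        return list(values)  # no valid value to average; return unchanged
    avg = mean(valid)
    return [avg if is_abnormal(v) else v for v in values]


# Made-up readings: a negative value in a non-negative variable,
# and a zero in a variable that should never be zero.
print(replace_abnormal([10.0, -5.0, 20.0]))
print(replace_abnormal([4.0, 0.0, 8.0], non_zero=True))
```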

Optionally, the category type data is transcoded: for a category number N, the categories are converted to the numerical values [0, 1, 2, ..., N].
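As an illustration of this coding conversion, the sketch below assigns consecutive integer codes starting at 0 in order of first appearance. The assignment order, the function name `encode_categories` and the example category labels are all assumptions made for the example.

```python
def encode_categories(values):
    """Convert category labels to integer codes 0, 1, 2, ...

    Codes are assigned in order of first appearance (an assumption; the
    text only says that categories are converted to consecutive integers).
    Returns the encoded list together with the label-to-code mapping.
    """
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
        codes.append(mapping[v])
    return codes, mapping


# Hypothetical category labels for one category type variable.
codes, mapping = encode_categories(
    ["national", "provincial", "national", "municipal"]
)
print(codes, mapping)
```

Keeping the returned mapping allows the same codes to be applied to data that arrives later.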

In step S204, the processed data is divided into a training set (e.g., a first training sample set) and a test set according to a ratio of 7:3, core parameters of the LightGBM (e.g., a preset machine learning model) are set, the LightGBM (e.g., the preset machine learning model) is applied to train the initial feature set (e.g., the first training sample set), the first 16 are selected as the formal feature set (e.g., the second training sample set) according to the importance ranking, and the LightGBM (e.g., the preset machine learning model) is applied again to train to obtain a final chemical park index analysis model (e.g., the second index model).
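The 7:3 division in step S204 can be sketched as follows. Shuffling before the cut and the fixed random seed are assumptions added for reproducibility, and `split_train_test` is a hypothetical name.

```python
import random


def split_train_test(samples, train_ratio=0.7, seed=0):
    """Shuffle the samples and cut them into a training and a test part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder samples split 7:3.
train, test = split_train_test(range(10))
print(len(train), len(test))
```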

Optionally, the core parameters of the LightGBM include: learning rate learning_rate=0.01, tree depth max_depth=7, and number of leaf nodes num_leaves=28; the selected training set (e.g., the first training sample set) is trained with these parameters.
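For reference, these three settings can be written as a parameter dictionary using LightGBM's own parameter names (the leaf-count parameter is spelled `num_leaves` in LightGBM). The variable name `lgbm_params` is hypothetical, and every other setting is assumed to stay at its library default.

```python
# Core LightGBM settings taken from the text; all other parameters are
# assumed to keep their library defaults.
lgbm_params = {
    "learning_rate": 0.01,  # learning rate
    "max_depth": 7,         # maximum tree depth
    "num_leaves": 28,       # maximum number of leaves per tree
}

# A tree of depth d has at most 2**d leaves, so 28 leaves fit within depth 7.
print(lgbm_params["num_leaves"] <= 2 ** lgbm_params["max_depth"])
```

With the lightgbm package installed, such a dictionary could be passed to the scikit-learn style wrapper as `lightgbm.LGBMClassifier(**lgbm_params)`.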

Optionally, the number of times that the feature variable (such as the first evaluation index) is used in the LightGBM model (such as the preset machine learning model or the first index model) is used as the basis of the importance of the feature variable, the feature variable (such as the first evaluation index) is ranked, 16 features with the importance ranked first are selected, and a formal feature variable set (such as a second training sample set) is constructed.
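The ranking and selection step can be sketched without any dependency on LightGBM by assuming the usage counts are already available as a dictionary. In LightGBM such counts can be read from a trained booster with `Booster.feature_importance(importance_type="split")`. The function name `select_top_features` and the example counts are hypothetical.

```python
def select_top_features(split_counts, top_k=16):
    """Rank features by how often the model used them and keep the top_k.

    split_counts -- dict mapping feature name to its usage (split) count
    Ties keep their original insertion order because the sort is stable.
    """
    ranked = sorted(split_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Made-up usage counts for four features; keep the two most used.
counts = {"f0": 5, "f1": 12, "f2": 0, "f3": 9}
print(select_top_features(counts, top_k=2))
```

Selecting a fixed number of top-ranked features is equivalent to applying a count threshold set just below the count of the last feature kept.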

Optionally, based on the formal feature set, splitting the training set (such as the second training sample set) and the test set again, and training with a LightGBM model (such as a preset machine learning model) to obtain a final chemical park index analysis model (such as a second index model).

In step S205, the data (e.g., the target evaluation index) updated in real time in the data warehouse is used to perform real-time predictive evaluation on the chemical industry park index (e.g., the target evaluation index) by using the park index evaluation model (e.g., the second index model), so as to give an evaluation result (e.g., the target evaluation result).

According to the technical scheme provided by the invention, the chemical industry park index (such as the second evaluation index) is constructed to evaluate the park development, and the intelligent evaluation of the chemical industry park index (such as the second evaluation index) is realized by establishing the LightGBM analysis model (such as the second index model), so that the real-time performance and the digital level of the chemical industry park management are effectively improved.

According to the technical scheme provided by the invention, the chemical industry park index analysis model (e.g., the second index model) is constructed; the initial feature items fully cover the 8 dimensions of park development, the 16 top-ranked features (e.g., the second evaluation indexes) are selected according to the importance of the feature variables to construct the formal feature set (e.g., the second training sample set), and the model accuracy is effectively improved.

According to the embodiment of the invention, a model training device embodiment is also provided, and it should be noted that the model training device may be used to execute the model training method in the embodiment of the invention, and the model training method in the embodiment of the invention may be executed in the model training device.

FIG. 3 is a schematic diagram of a model training apparatus according to an embodiment of the invention, as shown in FIG. 3, the apparatus may include: an obtaining module 32, configured to obtain a first training sample set, where the first training sample set includes a plurality of sets of training data, and each set of training data includes: carrying out development evaluation on a preset chemical industry park to obtain a preset evaluation result and at least one corresponding first evaluation index; a first training module 34, configured to train the first training sample set using a preset machine learning model, and generate a first index model, where the first index model is configured to determine the preset evaluation result according to at least one of the first evaluation indexes; a determining module 36, configured to determine a second training sample set according to the number of times each of the first evaluation indexes is used in the first index model, where the second training sample set includes a plurality of sets of training data, and each set of training data includes: the preset evaluation result and at least one corresponding second evaluation index, wherein the second evaluation index is the first evaluation index with the frequency higher than a preset threshold value; a second training module 38, configured to train the second training sample set using the preset machine learning model, and generate a second index model, where the second index model is used to determine the preset evaluation result according to at least one second evaluation index.

It should be noted that the obtaining module 32 in this embodiment may be used to perform step S102 in the embodiment of the present application, the first training module 34 in this embodiment may be used to perform step S104 in the embodiment of the present application, the determining module 36 in this embodiment may be used to perform step S106 in the embodiment of the present application, and the second training module 38 in this embodiment may be used to perform step S108 in the embodiment of the present application. The examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments.

As an alternative embodiment, the determining module includes: an analysis unit, used for analyzing the first index model and determining the number of times each first evaluation index is used by the first index model in the process of obtaining a preset evaluation result by using at least one first evaluation index; a sorting unit, used for sorting at least one first evaluation index in order of the number of times from high to low to generate an evaluation index queue; a first determining unit, used for determining, in the evaluation index queue, a first evaluation index ranked before a preset ranking as a second evaluation index; and a second determining unit, used for determining a second training sample set according to a plurality of second evaluation indexes and the preset evaluation results corresponding to the second evaluation indexes.

As an alternative embodiment, the obtaining module includes: the first acquisition unit is used for acquiring historical evaluation data, wherein the historical evaluation data at least comprises: the system comprises a plurality of preset evaluation results and at least one evaluation data corresponding to each preset evaluation result; the data cleaning unit is used for cleaning the evaluation data in the historical evaluation data to generate a preset sample set, wherein the preset sample set comprises: the system comprises a plurality of preset evaluation results and at least one first evaluation index corresponding to each preset evaluation result; the selecting unit is used for selecting the first training sample set from the preset sample set according to the preset proportion.

As an alternative embodiment, the data cleansing unit comprises: a determination subunit configured to determine, in a case where a data loss occurs in the evaluation data in the historical evaluation data, a missing amount of the evaluation data; a deletion subunit, configured to delete, in the historical evaluation data, a preset evaluation result corresponding to the missing evaluation data, and generate a preset sample set, when the missing amount is not higher than a preset missing threshold; and the filling subunit is used for filling the missing evaluation data by using the average value of the evaluation data in the historical evaluation data under the condition that the missing amount is higher than a preset missing threshold value, and generating a preset sample set.

As an alternative embodiment, the data cleansing unit comprises: a first recognition subunit configured to recognize abnormal data in the evaluation data; and the replacing subunit is used for replacing the abnormal data by using the average value of the evaluation data in the historical evaluation data to generate a preset sample set.

As an alternative embodiment, the data cleansing unit comprises: a second recognition subunit for recognizing category type data in the evaluation data; and a conversion subunit for performing coding conversion on the category type data to generate a preset sample set.

As an alternative embodiment, the apparatus further comprises: the acquisition subunit is used for acquiring at least one target evaluation index after training the second training sample set by using a preset machine learning model and generating a second index model; and the determining subunit is used for analyzing at least one target evaluation index by using the second index model and determining a target evaluation result.

Embodiments of the present invention may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-described computer terminal may be replaced with a terminal device such as a mobile terminal.

Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.

In this embodiment, the computer terminal may execute the program code for the following steps in the model training method: obtaining a first training sample set, wherein the first training sample set comprises a plurality of sets of training data, each set of training data comprising: carrying out development evaluation on a preset chemical industry park to obtain a preset evaluation result and at least one corresponding first evaluation index; training the first training sample set by using a preset machine learning model to generate a first index model, wherein the first index model is used for determining a preset evaluation result according to at least one first evaluation index; determining a second training sample set according to the number of times each first evaluation index is used in the first index model, wherein the second training sample set comprises a plurality of groups of training data, and each group of training data comprises: presetting an evaluation result and at least one corresponding second evaluation index, wherein the second evaluation index is a first evaluation index with times higher than a preset threshold value; training the second training sample set by using a preset machine learning model to generate a second index model, wherein the second index model is used for determining a preset evaluation result according to at least one second evaluation index.

Alternatively, fig. 4 is a block diagram of a computer terminal according to an embodiment of the present invention. As shown in fig. 4, the computer terminal 40 may include: one or more (only one is shown) processors 42, and memory 44.

The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the model training method and apparatus in the embodiments of the present invention, and the processor executes the software programs and modules stored in the memory, thereby executing various functional applications and data processing, that is, implementing the model training method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located with respect to the processor, which may be connected to the terminal 40 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: obtaining a first training sample set, wherein the first training sample set comprises a plurality of sets of training data, each set of training data comprising: carrying out development evaluation on a preset chemical industry park to obtain a preset evaluation result and at least one corresponding first evaluation index; training the first training sample set by using a preset machine learning model to generate a first index model, wherein the first index model is used for determining a preset evaluation result according to at least one first evaluation index; determining a second training sample set according to the number of times each first evaluation index is used in the first index model, wherein the second training sample set comprises a plurality of groups of training data, and each group of training data comprises: presetting an evaluation result and at least one corresponding second evaluation index, wherein the second evaluation index is a first evaluation index with times higher than a preset threshold value; training the second training sample set by using a preset machine learning model to generate a second index model, wherein the second index model is used for determining a preset evaluation result according to at least one second evaluation index.

Optionally, the above processor may further execute program code for: analyzing the first index model, and determining the number of times each first evaluation index is used by the first index model in the process of obtaining a preset evaluation result by using at least one first evaluation index; sequencing at least one first evaluation index according to the sequence from high to low in times to generate an evaluation index queue; in the evaluation index queue, determining a first evaluation index before a preset ranking as a second evaluation index; and determining a second training sample set according to the plurality of second evaluation indexes and preset evaluation results corresponding to the second evaluation indexes.

Optionally, the above processor may further execute program code for: acquiring historical evaluation data, wherein the historical evaluation data at least comprises: the system comprises a plurality of preset evaluation results and at least one evaluation data corresponding to each preset evaluation result; performing data cleaning on evaluation data in the historical evaluation data to generate a preset sample set, wherein the preset sample set comprises: the system comprises a plurality of preset evaluation results and at least one first evaluation index corresponding to each preset evaluation result; and selecting a first training sample set from the preset sample set according to a preset proportion.

Optionally, the above processor may further execute program code for: determining the missing amount of the evaluation data under the condition that the evaluation data in the historical evaluation data has data missing; deleting a preset evaluation result corresponding to the missing evaluation data from the historical evaluation data under the condition that the missing amount is not higher than a preset missing threshold value, and generating a preset sample set; and under the condition that the missing amount is higher than a preset missing threshold value, filling the missing evaluation data by using the average value of the evaluation data in the historical evaluation data, and generating a preset sample set.

Optionally, the above processor may further execute program code for: identifying abnormal data in the evaluation data; and replacing the abnormal data by using the average value of the evaluation data in the historical evaluation data to generate a preset sample set.

Optionally, the above processor may further execute program code for: identifying category data in the assessment data; and performing coding conversion on the category type data to generate a preset sample set.

Optionally, the above processor may further execute program code for: training a second training sample set by using a preset machine learning model, and acquiring at least one target evaluation index after generating a second index model; and analyzing at least one target evaluation index by using the second index model to determine a target evaluation result.

By adopting the embodiment of the invention, a model training scheme is provided. A first training sample set is obtained, wherein the first training sample set comprises a plurality of sets of training data, and each set of training data comprises: a preset evaluation result obtained by carrying out development evaluation on a preset chemical industry park, and at least one corresponding first evaluation index. The first training sample set is trained by using a preset machine learning model to generate a first index model, wherein the first index model is used for determining the preset evaluation result according to at least one first evaluation index. A second training sample set is determined according to the number of times each first evaluation index is used in the first index model, wherein the second training sample set comprises a plurality of sets of training data, and each set of training data comprises: the preset evaluation result and at least one corresponding second evaluation index, the second evaluation index being a first evaluation index whose number of times of use is higher than a preset threshold value. The second training sample set is trained by using the preset machine learning model to generate a second index model, wherein the second index model is used for determining the preset evaluation result according to at least one second evaluation index. Intelligent and digital evaluation of the chemical industry park is thereby realized. Because the first index model is used to select a part of the first evaluation indexes of the first training sample set as the second evaluation indexes for training the second index model, the second index model can determine the evaluation result of the chemical industry park by using fewer evaluation indexes. This achieves the technical effect of improving the efficiency of development evaluation of the chemical industry park and solves the technical problem of low efficiency of development evaluation of the chemical industry park.

It will be appreciated by those skilled in the art that the configuration shown in fig. 4 is only illustrative, and the computer terminal may be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 4 does not limit the structure of the electronic device. For example, the computer terminal 40 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 4, or have a different configuration than shown in fig. 4.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device; the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, and the like.

Embodiments of the present invention also provide a nonvolatile storage medium. Alternatively, in this embodiment, the storage medium may be used to store the program code executed by the model training method provided in the above embodiment.

Alternatively, in this embodiment, the above-mentioned nonvolatile storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.

Optionally, in the present embodiment, the non-volatile storage medium is arranged to store program code for performing the steps of: obtaining a first training sample set, wherein the first training sample set comprises a plurality of sets of training data, each set of training data comprising: carrying out development evaluation on a preset chemical industry park to obtain a preset evaluation result and at least one corresponding first evaluation index; training the first training sample set by using a preset machine learning model to generate a first index model, wherein the first index model is used for determining a preset evaluation result according to at least one first evaluation index; determining a second training sample set according to the number of times each first evaluation index is used in the first index model, wherein the second training sample set comprises a plurality of groups of training data, and each group of training data comprises: presetting an evaluation result and at least one corresponding second evaluation index, wherein the second evaluation index is a first evaluation index with times higher than a preset threshold value; training the second training sample set by using a preset machine learning model to generate a second index model, wherein the second index model is used for determining a preset evaluation result according to at least one second evaluation index.

Optionally, in the present embodiment, the non-volatile storage medium is arranged to store program code for performing the steps of: analyzing the first index model, and determining the number of times each first evaluation index is used by the first index model in the process of obtaining a preset evaluation result by using at least one first evaluation index; sequencing at least one first evaluation index according to the sequence from high to low in times to generate an evaluation index queue; in the evaluation index queue, determining a first evaluation index before a preset ranking as a second evaluation index; and determining a second training sample set according to the plurality of second evaluation indexes and preset evaluation results corresponding to the second evaluation indexes.

Optionally, in the present embodiment, the non-volatile storage medium is arranged to store program code for performing the steps of: acquiring historical evaluation data, wherein the historical evaluation data at least comprises: the system comprises a plurality of preset evaluation results and at least one evaluation data corresponding to each preset evaluation result; performing data cleaning on evaluation data in the historical evaluation data to generate a preset sample set, wherein the preset sample set comprises: the system comprises a plurality of preset evaluation results and at least one first evaluation index corresponding to each preset evaluation result; and selecting a first training sample set from the preset sample set according to a preset proportion.

Optionally, in the present embodiment, the non-volatile storage medium is arranged to store program code for performing the steps of: determining the missing amount of the evaluation data under the condition that the evaluation data in the historical evaluation data has data missing; deleting a preset evaluation result corresponding to the missing evaluation data from the historical evaluation data under the condition that the missing amount is not higher than a preset missing threshold value, and generating a preset sample set; and under the condition that the missing amount is higher than a preset missing threshold value, filling the missing evaluation data by using the average value of the evaluation data in the historical evaluation data, and generating a preset sample set.

Optionally, in the present embodiment, the non-volatile storage medium is arranged to store program code for performing the steps of: identifying abnormal data in the evaluation data; and replacing the abnormal data by using the average value of the evaluation data in the historical evaluation data to generate a preset sample set.

Optionally, in the present embodiment, the non-volatile storage medium is arranged to store program code for performing the steps of: identifying category data in the assessment data; and performing coding conversion on the category type data to generate a preset sample set.

Optionally, in the present embodiment, the non-volatile storage medium is arranged to store program code for performing the steps of: training a second training sample set by using a preset machine learning model, and acquiring at least one target evaluation index after generating a second index model; and analyzing at least one target evaluation index by using the second index model to determine a target evaluation result.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technology content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, for example, may be a logic function division, and may be implemented in another manner, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.

The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims

1. A method of model training, comprising:

obtaining a first training sample set, wherein the first training sample set comprises a plurality of sets of training data, each set of training data comprising: a preset evaluation result obtained by carrying out development evaluation on a preset chemical industry park, and at least one first evaluation index corresponding to the preset evaluation result;

training the first training sample set by using a preset machine learning model to generate a first index model, wherein the first index model is used for determining the preset evaluation result according to at least one first evaluation index;

determining a second training sample set according to the number of times each first evaluation index is used in the first index model, wherein the second training sample set comprises a plurality of sets of training data, each set of training data comprising: the preset evaluation result and at least one corresponding second evaluation index, wherein the second evaluation index is a first evaluation index whose number of times of use is higher than a preset threshold value;

and training the second training sample set by using the preset machine learning model to generate a second index model, wherein the second index model is used for determining the preset evaluation result according to at least one second evaluation index.
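By way of non-limiting illustration, the selection step of claim 1 may be sketched as follows in Python. The sketch assumes that the usage counts of the first index model are already available as a dictionary; the function names `select_second_indexes` and `build_second_sample_set`, the index names, and the numeric values are hypothetical and are not taken from the claims.

```python
from typing import Dict, List, Tuple

# One set of training data: (evaluation indexes, preset evaluation result).
Sample = Tuple[Dict[str, float], str]


def select_second_indexes(usage_counts: Dict[str, int], threshold: int) -> List[str]:
    """Keep the first evaluation indexes used more often than the threshold."""
    return [name for name, count in usage_counts.items() if count > threshold]


def build_second_sample_set(first_set: List[Sample], keep: List[str]) -> List[Sample]:
    """Project every set of training data onto the retained indexes.

    The preset evaluation results are carried over unchanged.
    """
    return [({k: indexes[k] for k in keep}, result) for indexes, result in first_set]


# Hypothetical usage counts read from a trained first index model.
usage_counts = {"output_value": 12, "accident_count": 7, "green_area": 1}
first_set = [
    ({"output_value": 3.2, "accident_count": 0.0, "green_area": 0.4}, "A"),
    ({"output_value": 1.1, "accident_count": 2.0, "green_area": 0.5}, "C"),
]

second_indexes = select_second_indexes(usage_counts, threshold=5)
second_set = build_second_sample_set(first_set, second_indexes)
```

The second training sample set produced this way would then be passed to the same preset machine learning model to generate the second index model.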

2. The method of claim 1, wherein determining the second training sample set according to the number of times each first evaluation index is used in the first index model comprises:

analyzing the first index model, and determining the number of times each first evaluation index is used by the first index model in the process of obtaining the preset evaluation result by using at least one first evaluation index;

sorting the at least one first evaluation index in descending order of the number of times to generate an evaluation index queue;

determining, in the evaluation index queue, each first evaluation index ranked before a preset ranking as a second evaluation index;

and determining the second training sample set according to the plurality of second evaluation indexes and the preset evaluation results corresponding to the second evaluation indexes.
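By way of non-limiting illustration, the counting and ranking steps of claim 2 may be sketched as follows in Python. The sketch assumes that the first index model is a set of decision trees and that each tree is available as a nested dictionary; this representation, the key names (`feature`, `left`, `right`, `leaf`), and the helper names are hypothetical, since the claim does not fix a model format.

```python
from collections import Counter
from typing import Any, Dict, List

Tree = Dict[str, Any]


def count_index_usage(trees: List[Tree]) -> Counter:
    """Count how many split nodes use each evaluation index."""
    counts: Counter = Counter()
    stack = list(trees)
    while stack:
        node = stack.pop()
        if "feature" in node:  # internal split node; leaves carry no index
            counts[node["feature"]] += 1
            stack.extend([node["left"], node["right"]])
    return counts


def top_ranked_indexes(counts: Counter, preset_rank: int) -> List[str]:
    """Sort indexes by usage count, high to low, and keep those before the preset rank.

    Ties are broken alphabetically so that the queue is deterministic.
    """
    queue = sorted(counts, key=lambda name: (-counts[name], name))
    return queue[:preset_rank]


# Two hypothetical trees of a first index model.
trees = [
    {"feature": "output_value",
     "left": {"feature": "accident_count", "left": {"leaf": "C"}, "right": {"leaf": "B"}},
     "right": {"leaf": "A"}},
    {"feature": "output_value",
     "left": {"leaf": "B"},
     "right": {"feature": "green_area", "left": {"leaf": "B"}, "right": {"leaf": "A"}}},
]

counts = count_index_usage(trees)
second_indexes = top_ranked_indexes(counts, preset_rank=2)
```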

3. The method of claim 1, wherein obtaining a first training sample set comprises:

acquiring historical evaluation data, wherein the historical evaluation data comprises at least: a plurality of the preset evaluation results, and at least one piece of evaluation data corresponding to each preset evaluation result;

performing data cleaning on the evaluation data in the historical evaluation data to generate a preset sample set, wherein the preset sample set comprises: a plurality of the preset evaluation results, and at least one first evaluation index corresponding to each preset evaluation result;

and selecting the first training sample set from the preset sample set according to a preset proportion.
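By way of non-limiting illustration, selecting the first training sample set according to a preset proportion may be sketched as follows in Python. The shuffling step, the fixed seed, and the name `split_by_proportion` are assumptions made for the example; the claim only requires that the selection follow a preset proportion.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_proportion(
    samples: Sequence[T], proportion: float, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Return (training part, remaining part) of the preset sample set."""
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * proportion)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical samples, 80 % of which form the first training sample set.
preset_sample_set = list(range(10))
first_training_set, remaining = split_by_proportion(preset_sample_set, 0.8)
```

The remaining samples may, for example, be held out to check the trained index models.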

4. The method of claim 3, wherein performing data cleaning on the evaluation data in the historical evaluation data to generate the preset sample set comprises:

determining a missing amount of the evaluation data in the case that the evaluation data in the historical evaluation data has missing data;

in the case that the missing amount is not higher than a preset missing threshold value, deleting from the historical evaluation data the preset evaluation result corresponding to the missing evaluation data, to generate the preset sample set;

and in the case that the missing amount is higher than the preset missing threshold value, filling the missing evaluation data with the average value of the evaluation data in the historical evaluation data, to generate the preset sample set.
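By way of non-limiting illustration, the missing-data handling of claim 4 may be sketched as follows in Python. The sketch assumes that the missing amount is the number of records containing at least one missing value and that missing values are represented by `None`; both choices, as well as the name `clean_missing` and the sample values, are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def clean_missing(records: List[Record], missing_threshold: int) -> List[Record]:
    """Delete incomplete records when gaps are few, otherwise fill them with the mean."""
    incomplete = [r for r in records if any(v is None for v in r.values())]
    if len(incomplete) <= missing_threshold:
        # Missing amount not higher than the threshold: delete the affected records.
        return [r for r in records if all(v is not None for v in r.values())]

    # Missing amount higher than the threshold: fill with the per-index average
    # of the values that are present.
    observed: Dict[str, List[float]] = {}
    for record in records:
        for key, value in record.items():
            if value is not None:
                observed.setdefault(key, []).append(value)
    means = {key: mean(vals) for key, vals in observed.items()}
    return [
        {key: (means.get(key) if value is None else value) for key, value in record.items()}
        for record in records
    ]


records = [
    {"output_value": 2.0, "accident_count": 1.0},
    {"output_value": None, "accident_count": 3.0},
    {"output_value": 4.0, "accident_count": 5.0},
]
dropped = clean_missing(records, missing_threshold=1)
filled = clean_missing(records, missing_threshold=0)
```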

5. The method of claim 3, wherein performing data cleaning on the evaluation data in the historical evaluation data to generate the preset sample set comprises:

identifying abnormal data in the evaluation data;

and replacing the abnormal data by using an average value of the evaluation data in the historical evaluation data to generate the preset sample set.
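By way of non-limiting illustration, the abnormal-data replacement of claim 5 may be sketched as follows in Python. The claim does not state how abnormal data is identified; the sketch assumes a preset valid range `[low, high]` and computes the average from the in-range values only. The range check, the name `replace_abnormal`, and the sample values are hypothetical.

```python
from statistics import mean
from typing import List


def replace_abnormal(values: List[float], low: float, high: float) -> List[float]:
    """Replace values outside [low, high] with the mean of the in-range values."""
    normal = [v for v in values if low <= v <= high]
    if not normal:
        # Nothing reliable to average; leave the data unchanged.
        return list(values)
    fill = mean(normal)
    return [v if low <= v <= high else fill for v in values]


# 900.0 lies outside the assumed valid range and is replaced by the mean of the rest.
cleaned = replace_abnormal([10.0, 12.0, 900.0, 14.0], low=0.0, high=100.0)
```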

6. The method of claim 3, wherein performing data cleaning on the evaluation data in the historical evaluation data to generate the preset sample set comprises:

identifying category data in the evaluation data;

and performing encoding conversion on the category data to generate the preset sample set.
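By way of non-limiting illustration, the encoding conversion of claim 6 may be sketched as follows in Python. The sketch assumes a simple integer encoding in which each distinct category receives a code in alphabetical order; the encoding scheme, the name `encode_categories`, and the category labels are hypothetical, and other encodings (for example one-hot encoding) would serve equally well.

```python
from typing import Dict, List, Tuple


def encode_categories(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct category to an integer code and encode the values."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical category data, such as the administrative level of a park.
codes, mapping = encode_categories(["provincial", "national", "provincial", "municipal"])
```

Returning the mapping alongside the codes allows the same conversion to be applied later to target evaluation indexes.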

7. The method of claim 1, wherein after training the second training sample set using the preset machine learning model to generate the second index model, the method further comprises:

acquiring at least one target evaluation index;

and analyzing the at least one target evaluation index by using the second index model to determine a target evaluation result.

8. A model training device, comprising:

an acquisition module, configured to acquire a first training sample set, wherein the first training sample set comprises a plurality of sets of training data, each set of training data comprising: a preset evaluation result obtained by carrying out development evaluation on a preset chemical industry park, and at least one first evaluation index corresponding to the preset evaluation result;

a first training module, configured to train the first training sample set by using a preset machine learning model to generate a first index model, wherein the first index model is used for determining the preset evaluation result according to at least one first evaluation index;

a determining module, configured to determine a second training sample set according to the number of times each first evaluation index is used in the first index model, wherein the second training sample set comprises a plurality of sets of training data, each set of training data comprising: the preset evaluation result and at least one corresponding second evaluation index, wherein the second evaluation index is a first evaluation index whose number of times of use is higher than a preset threshold value;

and a second training module, configured to train the second training sample set by using the preset machine learning model to generate a second index model, wherein the second index model is used for determining the preset evaluation result according to at least one second evaluation index.

9. A non-volatile storage medium, wherein a program is stored in the non-volatile storage medium, and wherein the program, when executed, controls a device in which the non-volatile storage medium is located to perform the model training method according to any one of claims 1 to 7.

10. An electronic device, comprising: a memory and a processor for executing a program stored in the memory, wherein the program, when run, performs the model training method of any one of claims 1 to 7.