CN114580086B - Vehicle component modeling method based on supervised machine learning - Google Patents

Vehicle component modeling method based on supervised machine learning

Info

Publication number
CN114580086B
CN114580086B (application CN202210478749.1A)
Authority
CN
China
Prior art keywords
machine learning
data
modeling
random forest
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210478749.1A
Other languages
Chinese (zh)
Other versions
CN114580086A (en)
Inventor
李文博
王伟
曲辅凡
王长青
颜燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Automotive Technology and Research Center Co Ltd
CATARC Automotive Test Center Tianjin Co Ltd
Original Assignee
China Automotive Technology and Research Center Co Ltd
CATARC Automotive Test Center Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Automotive Technology and Research Center Co Ltd, CATARC Automotive Test Center Tianjin Co Ltd filed Critical China Automotive Technology and Research Center Co Ltd
Priority to CN202210478749.1A priority Critical patent/CN114580086B/en
Publication of CN114580086A publication Critical patent/CN114580086A/en
Application granted granted Critical
Publication of CN114580086B publication Critical patent/CN114580086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/15Vehicle, aircraft or watercraft design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The invention provides a vehicle component modeling method based on supervised machine learning, which comprises the following steps: collecting test data of the modeled component under different working conditions; preprocessing the test data; selecting a classification or regression learning algorithm according to the characteristics of the modeled component; extracting relevant features using feature selection and feature transformation; constructing and training the model; and exporting the trained model and applying it to the complete vehicle model. By training the selected supervised machine learning algorithm on test data, the method establishes a high-precision vehicle component model and improves the overall simulation accuracy of the complete vehicle.

Description

Vehicle component modeling method based on supervised machine learning
Technical Field
The invention belongs to the technical field of automobile simulation, and particularly relates to a vehicle component modeling method based on supervised machine learning.
Background
With the rapid development of new energy vehicles, virtual simulation has become an important development tool and is widely applied; simulation software is generally adopted in the development stage of new energy vehicles, which greatly shortens the research and development cycle and reduces cost. However, the current simulation practice for new energy vehicles has obvious shortcomings. The vehicle component test items that serve as the basis for modeling are being subdivided ever further and their number is growing, yet the working conditions remain limited and cannot cover all application scenarios, and the test conditions and parameters are relatively isolated from one another; meanwhile, although the evaluation data are real and reliable, errors and fluctuations are unavoidable, so the data are scattered and fuzzy to some degree. Vehicle component modeling involves many variables with complex logical relations among them; traditional modeling relies on engineering experience and its precision is hard to improve, and when the control strategy part of a vehicle component model is built, the influence of each parameter on the result is hard to determine and the relevant key thresholds are hard to obtain directly.
Therefore, applying the original evaluation data to virtual simulation through machine learning, obtaining information such as high-precision key performance curves, surfaces, key thresholds and classification boundaries that ordinary tests can hardly provide, and combining this with the strengths of traditional vehicle component modeling to establish a vehicle component modeling method based on supervised machine learning can greatly improve model precision, solve the problem of modeling vehicle components with many variables at high precision and efficiency, and promote the development of simulation technology.
Disclosure of Invention
In view of this, the present invention aims to provide a vehicle component modeling method based on supervised machine learning that solves the problem of modeling vehicle components with many variables at high precision and high efficiency.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
a vehicle component modeling method based on supervised machine learning comprises the following steps:
S1, determining the target signal to be output by the part of the vehicle component model that is to be built with the machine learning modeling technique;
s2, designing a test scheme according to the target signal required to be output in the step S1;
s3, testing the real vehicle parts according to the test scheme, and collecting test data of the modeling parts under different working conditions;
s4, preprocessing the test data collected in the step S3 to obtain a preprocessed target signal;
s5, dividing the target signal preprocessed in the step S4 to obtain a training set T and a test set D, and extracting n training subsets from the training set T;
S6, extracting the relevant features of the n training subsets using feature selection and feature transformation, randomly selecting attributes from the relevant features of the n training subsets as node splitting attributes to form a decision tree; repeating the steps of extracting relevant features, randomly selecting attributes and forming a decision tree n times to generate n decision trees, and combining the n decision trees to obtain a random forest machine learning model;
s7, obtaining a classification boundary, a key threshold, a key performance curve and a key performance curved surface by using the random forest machine learning model in the step S6;
S8, exporting the random forest machine learning model to the vehicle component model and applying it there;
and S9, completing the modeling with the traditional modeling technique according to the classification boundary, the key threshold, the key performance curve and the key performance surface obtained in step S7 and the random forest machine learning model exported in step S8.
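Read end to end, steps S1 to S9 amount to the pipeline sketched below on synthetic data. Python with scikit-learn, the variable names and the synthetic signals are assumptions of this sketch only, since the claimed method is platform-agnostic; each stage is detailed further below.

```python
# Illustrative end-to-end sketch of steps S1-S9 on synthetic data; library
# choice, names and the synthetic signals are assumptions, not part of the method.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# S3: stand-in for logged component signals (e.g. temperature and current)
X = rng.uniform(-30, 30, size=(5000, 2))
y = 0.5 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 5000)  # target signal

# S4: preprocessing, here only the 3-sigma outlier rule
mask = np.abs(y - y.mean()) <= 3 * y.std()
X, y = X[mask], y[mask]

# S5: split test/training 2:8 (sample size is below 200,000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# S6: random forest of n decision trees (bootstrap subsets drawn internally)
model = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
model.fit(X_tr, y_tr)

# S7: query the trained forest on a finely spaced input grid to read off a
# key performance curve (second input held at 0)
grid = np.column_stack([np.linspace(-30, 30, 600), np.zeros(600)])
curve = model.predict(grid)

# S8/S9: the exported model and the extracted curve feed the component model
print("test MSE:", mean_squared_error(y_te, model.predict(X_te)))
```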
Further, the vehicle component model in step S1 includes a part built using a machine learning modeling technique and a part built using a conventional modeling technique; the target signal that needs to be output refers to a signal that is difficult to process by conventional modeling techniques used in modeling vehicle components and needs to be processed by machine learning modeling techniques.
Further, the test scheme in step S2 is a scheme for ensuring that the maximum amount of information is obtained in each test, and includes setting up different initial states and operating states.
Further, the test data preprocessing in step S4 includes the following steps:
A1, loading and reading the test data;
A2, treating data points in the test data that deviate from the mean by more than plus or minus three standard deviations as abnormal data points and eliminating them with the normal distribution diagram method, which leaves test data with missing values;
A3, filling the missing values left in step A2 to obtain complete test data;
a4, filtering signal noise of the complete test data in the step A3.
Further, in step S5, dividing the target signal preprocessed in step S4 into a training set T and a test set D and extracting n training subsets from the training set T includes the following steps:
B1, judging whether the data sample size of the preprocessed target signal is less than 200,000;
B2, if yes, dividing the preprocessed target signal into the test set D and the training set T at a ratio of 2:8, and proceeding to step B4;
B3, if not, dividing the preprocessed target signal into the test set D, a verification set and the training set T at a ratio of 2:2:6, and proceeding to step B4;
B4, drawing N samples with replacement (bootstrap sampling) from the training set T of size N to form a training subset;
B5, repeating step B1 to step B4 n times to obtain n training subsets;
the feature selection in step S6 includes stepwise regression, sequential feature selection, regularization and neighbor analysis, and the feature transformation includes principal component analysis, non-negative matrix factorization and factor analysis.
Further, the obtaining of the random forest machine learning model in step S6 includes the following steps:
C1, importing the relevant features obtained in step S6, calculating the information gain of each relevant feature with the information gain formula, selecting the relevant feature with the largest information gain as the response variable, and using the response variable as the output of the random forest machine learning model obtained in step C2;
C2, using the relevant features from step C1 as node splitting attributes to form a decision tree; repeating the steps of extracting relevant features, selecting attributes and forming a decision tree n times to generate n decision trees, and combining the n decision trees to obtain the random forest machine learning model;
C3, configuring the following parameters of the random forest machine learning model under the training set T: the number of decision trees, the maximum depth of the trees, the minimum number of samples for splitting an internal node, and the maximum number of features used by each tree;
c4, training the model obtained by the configuration in the step C3;
c5, judging whether the model in the step C4 meets the requirement, if yes, executing the step S7, if no, returning to the step C1.
Further, obtaining the classification boundary, the key threshold, the key performance curve and the key performance surface in step S7 with the random forest machine learning model from step S6 includes the following steps:
d1, determining the minimum step length to be 5% of the minimum time interval according to the amplitude and the span of the real value by using the test set D;
d2, dividing the real value of the input signal at equal intervals according to the minimum step length through linear interpolation to obtain an expanded input signal;
d3, transmitting the expanded input signals in the step D2 to a trained random forest machine learning model to obtain an output result calculated by the random forest machine learning model;
d4, analyzing the output result in the step D3 to obtain a classification boundary, a key threshold, a key performance curve and a key performance curved surface.
Further, exporting the random forest machine learning model and applying it to the vehicle component model in step S8 includes:
e1, configuring an input interface and an output interface of the random forest machine learning model;
e2, outputting the random forest machine learning model as a C code file;
e3, compiling the C code file into a vehicle component modeling platform interface file;
e4, linking the compiled interface file in the step E3 to the vehicle component model in the vehicle component modeling platform.
Compared with the prior art, the vehicle component modeling method based on supervised machine learning has the following advantages:
(1) The vehicle component modeling method based on supervised machine learning is reasonably designed: original evaluation data are applied to virtual simulation through machine learning, yielding information such as high-precision key performance curves, surfaces, key thresholds and classification boundaries that ordinary tests can hardly provide, and, combined with the strengths of the traditional vehicle component modeling technique, this solves the problem of building vehicle component models with many variables at high precision and efficiency.
(2) The method uses the generalization ability of machine learning to improve modeling precision while greatly reducing the number of tests and saving cost, and it also supplies the basic modeling data (key performance curves, surfaces, key thresholds, classification boundaries) that are difficult to obtain with the traditional modeling technique.
(3) The method exploits the generalization ability of machine learning and its capacity for resolving complex multivariable relations, and combines them with the robustness and interpretability of the traditional modeling technique, so that the two reinforce each other and the precision of the overall vehicle component model is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of the overall structure according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating test data preprocessing according to an embodiment of the present invention;
fig. 3 is a schematic diagram of cross-platform application of a machine learning model to a complete vehicle model according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art through specific situations.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Explanation of terms:
Normal distribution diagram method:
A normal distribution is the probability distribution of a continuous random variable with two parameters, μ and σ². The first parameter μ is the mean of the normally distributed random variable and the second parameter σ² is its variance, so a normal distribution is recorded as N(μ, σ²). For a random variable following a normal distribution, values near μ occur with high probability while values farther from μ occur with lower probability; the smaller σ is, the more concentrated the distribution is around μ, and the larger σ is, the more dispersed the distribution. The normal distribution diagram method removes abnormal data by exploiting this property of the normal distribution.
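As a quick numerical check of the plus or minus three standard deviation band used for outlier removal in the steps above, the share of a normally distributed population inside that band can be computed directly; the use of Python and SciPy here is an illustrative assumption.

```python
# Fraction of a normally distributed signal inside mu +/- 3*sigma; values
# outside this band are the "abnormal data points" removed in step A2.
from scipy.stats import norm

inside = norm.cdf(3) - norm.cdf(-3)
print(f"fraction within 3 sigma: {inside:.4%}")   # about 99.73 %
```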
Information gain formula:
the information gain formula is a formula for calculating a difference value of entropies before and after dividing a data set by a certain characteristic, wherein the entropy can represent the uncertainty of a sample set, and the larger the entropy is, the larger the uncertainty of the sample is.
As shown in fig. 1 to 3, a vehicle component modeling method based on supervised machine learning includes the following steps:
S1, selecting, as the target signals to be output by the part of the vehicle component model built with the machine learning modeling technique, the multivariable complex logical relations that are difficult to realize with the traditional modeling technique, or information such as high-precision key performance curves, surfaces, key thresholds and classification boundaries that the traditional modeling technique can hardly obtain;
s2, designing a test scheme according to the target signal to be output;
S3, testing the real vehicle components under different working conditions according to the test scheme and collecting the test data; in this embodiment the real vehicle components are tested under combined WLTC and NEDC working conditions and the test data are collected and summarized; the modeled component refers to the component corresponding to the vehicle component model to be created;
in this embodiment, step S3 performs 5 working-condition tests on the real vehicle component at each of -30, -10, 0, 10 and 30 degrees Celsius, giving 25 sets of test data;
s4, preprocessing the test data collected in the step S3 to obtain a preprocessed target signal;
s5, dividing the target signal processed in the step S4 to obtain a training set and a test set, and extracting n training subsets from the training set;
S6, extracting relevant features using feature selection and feature transformation, randomly selecting attributes from the features as node splitting attributes to form a complete decision tree; repeating the steps of extracting features, randomly selecting attributes and forming a decision tree n times to generate n decision trees, and combining them to obtain a random forest;
s7, obtaining a classification boundary, a key threshold, a key performance curve and a key performance curved surface by using the random forest machine learning model in the step S6;
S8, exporting the random forest machine learning model to the vehicle component model and applying it there;
and S9, completing the modeling with the traditional modeling technique according to the classification boundary, the key threshold, the key performance curve and the key performance surface obtained in step S7 and the random forest machine learning model exported in step S8. The method is reasonably designed: original evaluation data are applied to virtual simulation through machine learning, yielding information such as high-precision key performance curves, surfaces, key thresholds and classification boundaries that ordinary tests can hardly provide, and, combined with the strengths of traditional vehicle component modeling, this solves the problem of building vehicle component models with many variables at high precision and efficiency; applying machine learning greatly reduces the number of tests and saves cost, while also providing the basic modeling data, such as key performance curves, surfaces, key thresholds and classification boundaries, that traditional modeling can hardly obtain; and the generalization ability of machine learning and its capacity for resolving complex multivariable relations are combined with the robustness and interpretability of traditional modeling, so that the two reinforce each other and the precision of the overall vehicle component model is improved.
The vehicle component model in step S1 includes a part built with the machine learning modeling technique and a part built with the traditional modeling technique, covering both the actual mechanical components of the vehicle and the control strategy components; the target signals to be output are signals that are difficult to handle with the traditional modeling technique used in vehicle component modeling and therefore need the machine learning modeling technique; such signals typically require a large amount of engineering experience or follow unknown principles, yet strongly influence the result. The machine learning modeling technique refers to modeling realized by means of machine learning algorithms; the traditional modeling technique refers to modeling realized by means of general physical laws, empirical formulas and the like, without machine learning algorithms.
The test scheme in step S2 is designed so that each test yields the maximum amount of information; in this embodiment this includes setting up different initial states and operating states and avoiding the data duplication caused by too many similar tests.
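One way to lay out such a test scheme is to enumerate the full grid of initial and operating states and drop duplicates up front, as sketched below; the particular factors (temperature, drive cycle, initial SOC) are assumptions loosely based on the embodiments, not values prescribed by the method.

```python
# Sketch of a test matrix covering different initial and operating states
# without duplicated runs; the factor values are illustrative assumptions.
from itertools import product

temperatures_degc = [-30, -10, 0, 10, 30]
drive_cycles = ["NEDC", "WLTC"]
initial_socs = [0.3, 0.6, 0.9]

test_plan = sorted(set(product(temperatures_degc, drive_cycles, initial_socs)))
for run_id, (temp, cycle, soc0) in enumerate(test_plan, start=1):
    print(f"run {run_id:02d}: {temp:+d} degC, {cycle}, initial SOC {soc0}")
```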
The test data preprocessing in step S4 includes the following steps:
A1, selecting a suitable storage format and loading and reading the test data. Test data used for machine learning are generally large in volume and may come from several collected and sorted sources with different storage formats, so a suitable loading and reading approach must be chosen, weighing memory footprint, access speed, processing means and the like, in order to standardize and unify the data.
A2, treating data points that deviate from the mean by more than plus or minus three standard deviations as abnormal data points and eliminating them with the normal distribution diagram method. Specifically, for the 5 NEDC working-condition tests carried out in step S3 at each of -30, -10, 0, 10 and 30 degrees Celsius, the data mean and standard deviation of each test are calculated, the points of that test deviating from its mean by more than three standard deviations are marked as abnormal data points, and they are eliminated. The reason is that, the approach being data-driven, the training result of machine learning depends directly on the quality of the test data; abnormal values bias the training result, and the effect is especially severe when an abnormal value far exceeds the normal range, so abnormal data points must be eliminated to avoid the influence of unreasonable outliers.
In step A2, abnormal data points can also be identified from known conditions, such as data points exceeding theoretical limit values (e.g., SOC values greater than 1, sample counts well below the sampling time, voltages above the maximum voltage) or data points violating physical reality (e.g., negative distance traveled, negative mass).
A3, filling the data gaps that appear after the abnormal data points are eliminated in step A2; the general principle is to fill them using the information in the existing variables. If the data sample is large, a missing value can be filled with the average of the values at the same time in the two preceding and following tests; if the data sample is small, it can be filled with the average of the values at that time over all tests of the NEDC working condition. The reason is that missing data both lose information from the test data set and cause errors or blanks that keep machine learning from running normally or efficiently, so missing values must be processed and filled; when the data volume is large enough, the missing entries can also simply be deleted.
A4, filtering signal noise from the data processed in A3, in the following three steps:
A41, binning the data from the working-condition tests performed five times at each of -30, -10, 0, 10 and 30 degrees Celsius, putting the five groups of working-condition data obtained at each temperature into one box, giving five boxes A, B, C, D and E in total;
A42, mean-filtering the data in box A: the mean of the data at each time instant is calculated and replaces the data at that time, so the five data sets of the box are filtered into one data set whose value at each time is the mean; mean filtering is applied to each of the five boxes in the same way, yielding five data sets in total; mean filtering removes irrelevant detail;
A43, median-filtering the five data sets obtained from the five boxes: the median of the data at each time instant across the five data sets is calculated, and these medians form a new data set; median filtering at this stage removes isolated noise, improves data smoothness and protects the data edges;
The reason for this step is that the test data come from real-time logging by specific test equipment under the test working conditions; errors and noise from the test system are unavoidable, and the resulting random irregular fluctuations may mask the characteristics of the data and reduce training speed and accuracy, so the test data must be noise-filtered. A minimal sketch of this preprocessing chain is given below.
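The sketch assumes pandas, NumPy and SciPy; the column names, the per-run grouping and the equal-length, time-aligned runs are assumptions of the sketch rather than requirements of the method.

```python
# Sketch of steps A2-A43: 3-sigma outlier removal per test run (A2), filling
# the resulting gaps from neighbouring samples (A3), then, per temperature box,
# a per-time mean over the repeated runs (A42) and a median filter over time (A43).
import numpy as np
import pandas as pd
from scipy.signal import medfilt

def remove_3sigma_outliers(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """A2: mark values beyond mean +/- 3*std of their own test run as missing."""
    out = df.copy()
    mu = out.groupby("run_id")[col].transform("mean")
    sigma = out.groupby("run_id")[col].transform("std")
    out.loc[(out[col] - mu).abs() > 3 * sigma, col] = np.nan
    return out

def fill_missing(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """A3: fill the gaps left by A2 from neighbouring samples."""
    out = df.copy()
    out[col] = out[col].interpolate(method="linear", limit_direction="both")
    return out

def filter_box(runs: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """A42/A43: per-time mean over the runs of one box, then a median filter."""
    mean_over_runs = runs.mean(axis=0)
    return medfilt(mean_over_runs, kernel_size)

# Toy usage: one SOC run with a spike, then one five-run temperature box
df = pd.DataFrame({"run_id": 1, "soc": [0.58] * 20 + [22.3]})
clean = fill_missing(remove_3sigma_outliers(df, "soc"), "soc")

rng = np.random.default_rng(1)
box_minus30 = np.sin(np.linspace(0, 6, 1800)) + rng.normal(0, 0.05, (5, 1800))
smooth = filter_box(box_minus30)
```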
Further, in step S5, dividing the target signal processed in step S4 into a training set T and a test set D and extracting n training subsets from the training set includes the following steps:
B1, when the data sample size is small (less than 200,000), dividing the data into a test set and a training set at a ratio of 2:8; when the data sample size is large (more than 200,000), dividing the data into a test set, a verification set and a training set at a ratio of 2:2:6; then drawing, from the training set T of size N, N samples with replacement (bootstrap sampling) to form a training subset T1;
B2, repeating step B1 n times to obtain n training subsets T1, T2, ..., Tn.
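A minimal sketch of B1 and B2, assuming Python with scikit-learn for the split and NumPy for the bootstrap draws; the 200,000 threshold and the 2:8 / 2:2:6 ratios follow the text, while the names and the random data are illustrative.

```python
# Sketch of steps B1-B2: ratio chosen by sample size, then n bootstrap subsets
# of size N drawn with replacement from the training set.
import numpy as np
from sklearn.model_selection import train_test_split

def split_and_bootstrap(X, y, n_subsets=10, seed=0):
    rng = np.random.default_rng(seed)
    if len(y) < 200_000:                      # small sample: test/train 2:8
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        X_val = y_val = None
    else:                                     # large sample: test/val/train 2:2:6
        X_tr, X_rest, y_tr, y_rest = train_test_split(
            X, y, test_size=0.4, random_state=seed)
        X_val, X_te, y_val, y_te = train_test_split(
            X_rest, y_rest, test_size=0.5, random_state=seed)
    n_train = len(y_tr)                       # bootstrap subsets of size N
    subsets = [rng.integers(0, n_train, size=n_train) for _ in range(n_subsets)]
    return X_tr, y_tr, X_te, y_te, (X_val, y_val), subsets

X, y = np.random.rand(1000, 3), np.random.rand(1000)
X_tr, y_tr, X_te, y_te, _, subsets = split_and_bootstrap(X, y)   # 10 index sets
```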
The feature selection in step S6 includes stepwise regression, sequential feature selection, regularization and neighbor analysis, and the feature transformation includes principal component analysis, non-negative matrix factorization and factor analysis.
The obtaining of the random forest machine learning model comprises the following steps:
C1, importing the relevant features obtained in step S6, calculating the information gain of each relevant feature with the information gain formula, selecting the relevant feature with the largest information gain as the response variable, and using the response variable as the output of the random forest machine learning model obtained in step C2;
the importance of a characteristic variable can be obtained by the information gain method: the larger the information gain, the more important the feature. The information gain is calculated by the following formulas:
suppose there are k types of features:
C1, C2, C3, ..., Ck
the probability of each feature occurring is:
P(C1), P(C2), P(C3), ..., P(Ck)
the information entropy for each feature is calculated as follows:
H(C) = -Σ P(Ci)·log2 P(Ci), summed over i = 1, ..., k
the probability of each datum occurring in all features is:
P(data)
the probability of each datum not occurring is:
P(¬data) = 1 - P(data)
the formula for calculating conditional entropy is as follows:
H(C|T) = P(data)·H(C|data) + P(¬data)·H(C|¬data)
wherein H(C|data) is the information entropy when the datum appears, and H(C|¬data) is the information entropy when the datum does not appear;
the overall information gain formula is:
IG(T) = H(C) - H(C|T)
the larger the information gain, the larger the change in entropy and the more the split favors classification; the relevant feature with the largest information gain is selected as the response variable;
C2, using the relevant features from step C1 as node splitting attributes to form a complete decision tree; repeating the steps of extracting features, selecting attributes and forming a decision tree n times to generate n decision trees, and combining them to obtain the random forest machine learning model;
C3, configuring the following parameters of the random forest machine learning model under the training set: the number of decision trees, the maximum depth of the trees, the minimum number of samples for splitting an internal node, and the maximum number of features used by each tree;
number of decision trees: too many trees make the computation excessive, while too few lead to underfitting; 100 is generally chosen;
maximum depth of a tree: the maximum depth parameter max_depth is set to None, so that nodes are split until the information gain is 0, which gives higher accuracy;
minimum number of samples for splitting an internal node: the minimum sample number min_sample_leaf is set to 1, meaning that when the number of samples at a leaf node is smaller than min_sample_leaf, the leaf node is pruned and only its parent node is kept;
maximum number of features per tree: set to None, i.e. the number of features is not limited;
the parameter configuration of the random forest machine learning model directly affects the training result, and there are usually many parameters, so common optimization techniques such as Bayesian optimization, grid search and gradient-based optimization can be used to find the best parameter combination, and a suitable experiment-design method such as the orthogonal test method can quickly narrow the parameter range (a tuning-and-evaluation sketch follows step C5);
c4, training the model obtained by the step C3;
C5, judging whether the model meets the requirements; if yes, executing step S7, and if not, returning to step C1. The judgement uses an evaluation function of the key model output indexes, so that the accuracy of the model can be assessed by jointly considering error metrics such as ROC (receiver operating characteristic) curves, confusion matrices and MSE (mean square error);
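A compact sketch of the C3 to C5 loop (configuring the four named parameters, searching over them, training, and checking an error metric) is given below; the scikit-learn parameter spellings (n_estimators, max_depth, min_samples_leaf, max_features), the candidate grids and the acceptance threshold are assumptions of this sketch, not values fixed by the method.

```python
# Sketch of C3-C5: grid search over the four parameters named in C3, training
# the forest, then accepting the model only if the held-out MSE is small enough.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],     # number of decision trees
    "max_depth": [None, 10, 20],        # maximum depth of each tree
    "min_samples_leaf": [1, 2, 5],      # minimum samples kept at a leaf
    "max_features": [None, "sqrt"],     # maximum features tried per split
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring="neg_mean_squared_error", cv=3, n_jobs=-1)
search.fit(X_tr, y_tr)                  # C3 + C4: configure and train

mse = mean_squared_error(y_te, search.best_estimator_.predict(X_te))
accepted = mse <= 0.05 * np.var(y_te)   # C5: accept, or go back to C1
print(search.best_params_, mse, accepted)
```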
in the present embodiment, the feature selection in step S7 includes, but is not limited to, stepwise regression, sequential feature selection, regularization, neighbor analysis (NCA), and the like; characteristic variations include, but are not limited to, principal component analysis, non-negative matrix factorization, and the like.
Obtaining the classification boundary, the key threshold, the key performance curve and the key performance surface with the random forest machine learning model in step S7 includes the following steps:
D1, determining the minimum step size, using the test set D, from information such as the amplitude and span of the measured values; whether for cost control or to protect the vehicle components, the measured values of the input signal are necessarily obtained at certain intervals, so the minimum step size is determined from the amplitude and span and can be taken as 5% of the minimum interval;
D2, subdividing the measured values of the input signal at equal intervals of the minimum step size by linear interpolation; high-precision modeling needs ever more data, so each interval between measured values of the input signal is subdivided at equal intervals of the minimum step size obtained in step D1 using linear interpolation;
d3, transmitting the input signal obtained after the expansion in the step D2 to a random forest machine learning model to obtain an output result obtained by model calculation;
d4, analyzing the output result in the step D3 to obtain a classification boundary, a key threshold, a key performance curve and a key performance curved surface.
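Assuming the forest has already been trained (here on a toy temperature-to-voltage relation), steps D1 to D4 can be sketched as follows; the 5% step follows the text, while the data, the model and the threshold query are illustrative.

```python
# Sketch of steps D1-D4: refine the measured inputs on a grid whose step is
# 5 % of the smallest measured interval, then read the key performance curve
# and a key threshold off the trained forest. Data and model are toy stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

temp = np.array([-30.0, -10.0, 0.0, 10.0, 30.0])      # measured input values
volt = np.array([352.0, 366.0, 371.0, 375.0, 378.0])  # measured output values
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(temp.reshape(-1, 1), volt)                  # "trained" forest

step = 0.05 * np.diff(np.sort(temp)).min()            # D1: 5 % of min interval
grid = np.arange(temp.min(), temp.max() + step, step) # D2: equally spaced inputs
curve = model.predict(grid.reshape(-1, 1))            # D3: model output

# D4: (grid, curve) is the key performance curve; a key threshold can be read
# off it, e.g. the lowest temperature at which the voltage reaches 370 V
threshold_temp = grid[np.argmax(curve >= 370.0)]
```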
Exporting the random forest machine learning model and applying it to the vehicle component model in step S8 includes the following steps:
e1, configuring an input interface and an output interface of the random forest machine learning model;
e2, outputting the random forest machine learning model as a C code file;
e3, compiling the C code file into a vehicle component modeling platform interface file; such as MEX files in MATLAB/SIMULINK, dll files in CRUISE;
e4, linking the compiled interface file in the step E3 to the vehicle component model in the vehicle component modeling platform, and completing modeling.
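When the forest is trained in Python rather than MATLAB, one possible route for steps E2 and E3 is the m2cgen package, which transpiles a fitted scikit-learn model into a standalone C source file; the use of m2cgen and the output file name are assumptions of this sketch, and the embodiment below instead exports MATLAB m-code and compiles it into a MEX file.

```python
# Possible E2/E3 route from Python: transpile the fitted forest to plain C with
# m2cgen (an assumption of this sketch); the generated score() function can then
# be compiled into the simulation platform's interface file (e.g. MEX or DLL).
import m2cgen as m2c
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

c_source = m2c.export_to_c(model)          # E2: model as C source code
with open("rf_component_model.c", "w") as f:
    f.write(c_source)                      # E3: compile this file into the platform
```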
The modeling completed with the traditional modeling technique in step S9 concerns the parts that, under traditional modeling, would require a large amount of engineering experience and parameters that are difficult to obtain. For these parts, the machine learning modeling technique of step S7 supplies the classification boundaries, key thresholds, key performance curves and key performance surfaces that traditional modeling can hardly obtain, and with them the parts can be modeled conventionally through physical laws, logical algorithms, standards and specifications. For the parts whose principle is unknown but which strongly influence the result, the random forest machine learning model exported in step S8 is used directly. All remaining parts of the vehicle component, apart from those requiring extensive engineering experience and hard-to-obtain parameters and those with unknown principles but important influence on the result, are modeled with the traditional modeling technique, which completes the modeling of the whole vehicle component.
Example 1
The invention is explained with an embodiment that uses a supervised machine learning algorithm to build a battery model in the MATLAB environment; note that the method of the invention is not limited to the MATLAB platform. The steps are as follows:
F1, determining the target signal. The final output signal of the machine-learning-modeled part of the battery model is determined to be the SOC value; the final output signal of the traditionally modeled part is the battery output current.
F2, determining the test scheme. The test conditions of this example are -30, -10, 0, 10 and 30 degrees Celsius, at each of which 5 NEDC working-condition tests are performed.
F3, testing and collecting the battery test data. Test data of the output voltage, output current, battery temperature, open-circuit voltage and SOC value of the battery to be modeled are collected under the key-signal NEDC working condition. Collection can use a bench test or real-vehicle sensors, but the data must be of sufficient volume (in this embodiment each signal has more than 150,000 and fewer than 200,000 test data points), and the covered working conditions should be as comprehensive as possible, ideally including the battery's operating states over its full life cycle, to ensure that the trained model generalizes.
F4, preprocessing the test data. The battery test data are preprocessed to obtain the preprocessed SOC value.
F5, dividing the target signal preprocessed in step F4 at a ratio of 2:8 to obtain a training set T and a test set D, and extracting 10 training subsets from the training set T.
F6, extracting relevant features and constructing and training the battery random forest machine learning model. All relevant features are imported with the SOC value as the response; the model is trained without cross-validation, with the confusion matrix and the MSE value as indexes, and the number of decision trees, the maximum tree depth, the minimum number of samples for splitting an internal node and the maximum number of features per tree are adjusted iteratively according to these indexes until the model meets the accuracy requirement.
F7, obtaining a key performance curve with the battery random forest machine learning model. The performance curve of battery output voltage versus battery temperature is acquired with the model trained in F6.
F8, exporting the battery random forest machine learning model. The trained battery model is exported as MATLAB m-script code, the m-script is compiled with a C compiler to generate a MATLAB MEX file, and the generated MEX file is packaged into a sub-module with an S-function and imported into the battery SIMULINK model.
F9, completing the battery model with the key performance curve and the traditional modeling technique. Using the performance curve of battery output voltage versus battery temperature obtained in F7 and the physical law of the battery that the product of output current and output voltage equals the battery output power, the battery model is completed.
Preprocessing the data in step F4 includes the following detailed steps:
F41, because the battery test data volume is large and the data formats and limit values of the key parameters are not uniform (ordinary numeric, array or string storage formats occupy much memory and are cumbersome to handle; as shown in the figure, the test time stamps are in string format, the battery control signal is a Boolean quantity, and the significant digits of the remaining data differ), a table data storage format is chosen to import the data into MATLAB.
F42, marking outliers with the normal distribution diagram method and removing the abnormal outlier data. When eliminating outliers, however, the removed data should be checked to determine whether they are in fact rare but meaningful working-condition test data.
F43, treating the obvious abnormal values of each signal's test data individually. As shown in Table 1, output voltage and VOC data greater than 400 are deleted; output current data greater than 1.5 are deleted; temperature data greater than 50 are deleted; and SOC data greater than 1 or less than 0 are deleted, for example the SOC values from 9:03:08 to 9:03:10 fall outside 0 to 1 and are deleted (a sketch of this limit-based filtering follows Table 1).
TABLE 1
Time Output voltage Output current Temperature VOC SOC
9:03:05 378.03 1.11 20.96 378.11 0.58159832
9:03:06 378.03 1.11 20.96 378.11 0.581597853
9:03:07 378.03 1.11 20.96 378.11 0.581597386
9:03:08 378.03 1.11 20.96 378.11 1.581596918
9:03:09 378.03 1.11 20.96 378.11 1.581596451
9:03:10 378.03 1.11 20.96 378.11 22.27625333
9:03:11 378.03 1.11 20.96 378.11 0.581595516
9:03:12 378.03 1.11 20.96 378.11 0.581595049
9:03:13 378.03 1.11 20.96 378.11 0.581594582
9:03:14 378.03 1.11 20.96 378.11 0.581594114
9:03:15 378.03 1.11 20.96 378.11 0.581593647
9:03:16 378.03 1.11 20.96 378.11 0.581593179
9:03:17 378.03 1.11 20.96 378.11 0.581592712
9:03:18 378.03 1.11 20.96 378.11 0.581592245
9:03:19 378.03 1.11 20.96 378.11 0.581591777
9:03:20 378.03 1.11 20.96 378.11 0.58159131
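A sketch of the limit checks of step F43 using the thresholds listed for Table 1 is shown below; the pandas column names are assumptions about how the logged signals might be labelled.

```python
# Sketch of the F43 limit checks (cf. Table 1): drop rows whose values exceed
# the theoretical limits. Column names are illustrative assumptions.
import pandas as pd

def drop_out_of_range(df: pd.DataFrame) -> pd.DataFrame:
    ok = (
        (df["output_voltage"] <= 400) & (df["voc"] <= 400)
        & (df["output_current"] <= 1.5)
        & (df["temperature"] <= 50)
        & df["soc"].between(0, 1)
    )
    return df[ok]

df = pd.DataFrame({
    "output_voltage": [378.03, 378.03, 378.03],
    "output_current": [1.11, 1.11, 1.11],
    "temperature":    [20.96, 20.96, 20.96],
    "voc":            [378.11, 378.11, 378.11],
    "soc":            [0.5816, 1.5816, 22.2763],   # last two rows violate 0..1
})
clean = drop_out_of_range(df)    # keeps only the first row
```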
F44, processing the missing data. The number of missing entries in the test data is counted; if the proportion is small (less than 5% of the total number of data entries), the missing entries are deleted directly, otherwise they are filled with the average of neighbouring points. As shown in Table 2, the valid values before and after the gaps at time points 4:03:37 to 4:03:39 are retrieved and the missing data in that range are filled with the average of the neighbouring points (a sketch of this filling step follows Table 2).
TABLE 2
Time Output voltage Output current Temperature VOC SOC
4:03:20 382.17 0.00 20.00 382.18 0.7
4:03:21 382.17 0.00 20.00 382.18 0.7
4:03:22 382.17 0.00 20.00 382.18 0.7
4:03:23 382.17 1.10 20.00 382.18 0.699999584
4:03:24 382.09 1.10 20.00 382.18 0.699999122
4:03:25 382.09 1.10 20.00 382.18 0.699998659
4:03:26 382.09 1.10 20.00 382.18 0.699998197
4:03:29 382.09 1.10 20.00 382.18 0.69999681
4:03:30 382.09 1.10 20.00 382.18 0.699996347
4:03:31 382.09 1.10 20.00 382.18 0.699995885
4:03:32 382.09 1.10 20.00 382.18 0.699995423
4:03:33 382.09 1.10 20.00 382.18 0.69999496
4:03:34 382.09 1.10 20.00 382.18 0.699994498
4:03:35 382.09 1.10 20.00 382.18 0.699994035
4:03:36 382.09 1.10 20.00 382.18 0.699993573
4:03:37 1.10 20.00 382.18 0.699993111
4:03:38 382.09 1.10 20.00 382.18
4:03:39 382.09 1.10 20.00 382.18
4:03:40 382.09 1.10 20.00 382.18 0.699991724
4:03:41 382.09 1.10 20.00 382.18 0.699991261
4:03:42 382.09 1.10 20.00 382.18 0.699990799
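A sketch of the missing-data handling of step F44, illustrated by Table 2, is shown below; the 5% threshold follows the text, while the column name and the use of a forward/backward fill to realize the neighbour average are assumptions of this sketch.

```python
# Sketch of step F44 (cf. Table 2): a column with only a small share of gaps
# has its missing rows dropped, otherwise each gap is filled with the average
# of its neighbouring valid points.
import numpy as np
import pandas as pd

def handle_missing(df: pd.DataFrame, col: str, max_drop_ratio: float = 0.05):
    missing_ratio = df[col].isna().mean()
    if missing_ratio < max_drop_ratio:
        return df.dropna(subset=[col])                 # few gaps: delete rows
    filled = df.copy()                                 # many gaps: neighbour average
    filled[col] = (filled[col].ffill() + filled[col].bfill()) / 2
    return filled

df = pd.DataFrame({"soc": [0.70, np.nan, np.nan, 0.699992]})
print(handle_missing(df, "soc"))   # 50 % missing here, so the gaps are filled
```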
F45, removing isolated noise by using mean value filtering.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A vehicle component modeling method based on supervised machine learning is characterized in that: the method comprises the following steps:
S1, determining the target signal to be output by the part of the vehicle component model that is to be built with the machine learning modeling technique;
the vehicle component model in step S1 includes a part built using a machine learning modeling technique and a part built using a conventional modeling technique; the target signal to be output refers to a signal which is difficult to process by a traditional modeling technology used in vehicle component modeling and needs to be processed by a machine learning modeling technology;
s2, designing a test scheme according to the target signal required to be output in the step S1;
s3, testing the real vehicle parts according to the test scheme, and collecting test data of the modeling parts under different working conditions;
s4, preprocessing the test data collected in the step S3 to obtain a preprocessed target signal;
the test data preprocessing in step S4 includes the steps of:
a1, loading and reading test data;
A2, treating data points in the test data that deviate from the mean by more than plus or minus three standard deviations as abnormal data points and eliminating them with the normal distribution diagram method, which leaves test data with missing values;
A3, filling the missing values left in step A2 to obtain complete test data;
a4, filtering signal noise of the complete test data in the step A3;
s5, dividing the target signal preprocessed in the step S4 to obtain a training set T and a test set D, and extracting n training subsets from the training set T;
in step S5, the target signal preprocessed in step S4 is divided into a training set T and a test set D, and the extraction of n training subsets from the training set T includes the following steps:
B1, judging whether the data sample size of the preprocessed target signal is less than 200,000;
B2, if yes, dividing the preprocessed target signal into the test set D and the training set T at a ratio of 2:8, and proceeding to step B4;
B3, if not, dividing the preprocessed target signal into the test set D, the verification set and the training set T at a ratio of 2:2:6, and proceeding to step B4;
B4, drawing N samples with replacement (bootstrap sampling) from the training set T of size N to form a training subset;
b5, repeating the step B1-the step B4 for n times to obtain n training subsets;
S6, extracting the relevant features of the n training subsets using feature selection and feature transformation, randomly selecting attributes from the relevant features of the n training subsets as node splitting attributes to form a decision tree; repeating the steps of extracting relevant features, randomly selecting attributes and forming a decision tree n times to generate n decision trees, and combining the n decision trees to obtain the random forest machine learning model;
the obtaining of the random forest machine learning model in step S6 includes the steps of:
C1, importing the relevant features obtained in step S6, calculating the information gain of each relevant feature with the information gain formula, selecting the relevant feature with the largest information gain as the response variable, and using the response variable as the output of the random forest machine learning model obtained in step C2;
the information gain is calculated by the following formula:
suppose there are k types of features:
C1, C2, C3, ..., Ck
the probability of each feature occurring is:
P(C1), P(C2), P(C3), ..., P(Ck)
the information entropy for each feature is calculated as follows:
H(C) = -Σ P(Ci)·log2 P(Ci), summed over i = 1, ..., k
the probability of each datum occurring in all features is:
P(data)
the probability of each datum not occurring is:
P(¬data) = 1 - P(data)
the formula for calculating conditional entropy is as follows:
H(C|T) = P(data)·H(C|data) + P(¬data)·H(C|¬data)
wherein H(C|data) is the information entropy when the datum appears, and H(C|¬data) is the information entropy when the datum does not appear;
the overall information gain formula is:
IG(T)=H(C)-H(C|T);
C2, using the relevant features from step C1 as node splitting attributes to form a decision tree; repeating the steps of extracting relevant features, selecting attributes and forming a decision tree n times to generate n decision trees, and combining the n decision trees to obtain the random forest machine learning model;
C3, configuring the following parameters of the random forest machine learning model under the training set T: the number of decision trees, the maximum depth of the trees, the minimum number of samples for splitting an internal node, and the maximum number of features used by each tree;
the number of decision trees: set to 100;
maximum depth of a tree: the maximum depth parameter max_depth is set to None, and nodes are split until the information gain is 0;
minimum number of samples for splitting an internal node: the minimum sample number min_sample_leaf is set to 1, meaning that when the number of samples at a leaf node is smaller than min_sample_leaf, the leaf node is pruned and only its parent node is kept;
maximum number of features per tree: set to None;
c4, training the model obtained by the configuration in the step C3;
c5, judging whether the model in the step C4 meets the requirement, if yes, executing the step S7, and if not, returning to the step C1;
s7, obtaining a classification boundary, a key threshold, a key performance curve and a key performance curved surface by using the random forest machine learning model in the step S6;
obtaining classification boundaries, key thresholds, key performance curves and key performance surfaces using the random forest machine learning model in step S6 in step S7 includes the steps of:
d1, determining the minimum step length to be 5% of the minimum time interval according to the amplitude and the span of the real value by using the test set D;
d2, dividing the real value of the input signal at equal intervals according to the minimum step length through linear interpolation to obtain an expanded input signal;
d3, transmitting the expanded input signals in the step D2 to a trained random forest machine learning model to obtain an output result calculated by the random forest machine learning model;
d4, analyzing the output result in the step D3 to obtain a classification boundary, a key threshold, a key performance curve and a key performance curved surface;
S8, exporting the random forest machine learning model to the vehicle component model and applying it there;
exporting the random forest machine learning model and applying it to the vehicle component model in step S8 includes the following steps:
e1, configuring an input interface and an output interface of the random forest machine learning model;
e2, outputting the random forest machine learning model as a C code file;
e3, compiling the C code file into a vehicle component modeling platform interface file;
e4, linking the compiled interface file in the step E3 to the vehicle component model in the vehicle component modeling platform;
and S9, completing the modeling with the traditional modeling technique according to the classification boundary, the key threshold, the key performance curve and the key performance surface obtained in step S7 and the random forest machine learning model exported in step S8.
2. The supervised machine learning-based vehicle component modeling method of claim 1, wherein: the test scenario in step S2 refers to a scenario for ensuring the maximum amount of information obtained in each test, and includes setting up different initial states and operating states.
3. The supervised machine learning-based vehicle component modeling method of claim 1, wherein: the feature selection in step S6 includes stepwise regression, sequential feature selection, regularization and neighbor analysis, and the feature transformation includes principal component analysis, non-negative matrix factorization and factor analysis.
CN202210478749.1A 2022-05-05 2022-05-05 Vehicle component modeling method based on supervised machine learning Active CN114580086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210478749.1A CN114580086B (en) 2022-05-05 2022-05-05 Vehicle component modeling method based on supervised machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210478749.1A CN114580086B (en) 2022-05-05 2022-05-05 Vehicle component modeling method based on supervised machine learning

Publications (2)

Publication Number Publication Date
CN114580086A CN114580086A (en) 2022-06-03
CN114580086B true CN114580086B (en) 2022-08-09

Family

ID=81778782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210478749.1A Active CN114580086B (en) 2022-05-05 2022-05-05 Vehicle component modeling method based on supervised machine learning

Country Status (1)

Country Link
CN (1) CN114580086B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116308183A (en) * 2023-03-23 2023-06-23 黄河勘测规划设计研究院有限公司 Intelligent design method for key indexes of hydraulic and hydroelectric engineering artificial sand stone processing system


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210174257A1 (en) * 2019-12-04 2021-06-10 Cerebri AI Inc. Federated machine-Learning platform leveraging engineered features based on statistical tests
CN113125960A (en) * 2019-12-31 2021-07-16 河北工业大学 Vehicle-mounted lithium ion battery charge state prediction method based on random forest model
CN113516173B (en) * 2021-05-27 2022-09-09 江西五十铃汽车有限公司 Evaluation method for static and dynamic interference of whole vehicle based on random forest and decision tree

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110266528A (en) * 2019-06-12 2019-09-20 南京理工大学 The method for predicting of car networking communication based on machine learning

Also Published As

Publication number Publication date
CN114580086A (en) 2022-06-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant