CN114971675A

CN114971675A - Second-hand car price evaluation method based on DeepFM model

Info

Publication number: CN114971675A
Application number: CN202210357696.8A
Authority: CN
Inventors: 肖文栋; 尹旭阳; 黄越
Original assignee: University of Science and Technology Beijing USTB; Shunde Graduate School of USTB
Current assignee: University of Science and Technology Beijing USTB; Shunde Graduate School of USTB
Priority date: 2022-04-06
Filing date: 2022-04-06
Publication date: 2022-08-30

Abstract

The invention discloses a second-hand vehicle price evaluation method based on a DeepFM model, comprising the following steps: taking second-hand vehicle transaction data as input data; segmenting the attribute features in the second-hand vehicle data into three feature types; preprocessing each of the three feature types separately; arranging the preprocessed attribute features of each second-hand vehicle into a row vector; concatenating the row vectors of all second-hand vehicles by row to form a second-hand vehicle data matrix; reducing the dimensionality of the numerical features in the data matrix to obtain a reduced data matrix; appending the second-hand vehicle price as a label to the end of the corresponding row; constructing a DeepFM network; inputting the resulting data matrix into the DeepFM model for training to obtain the model parameters; and inputting the data matrix of the vehicles to be evaluated into the DeepFM model for price estimation. The advantages of the invention are that the accuracy of second-hand vehicle price evaluation is improved, the feature-engineering workload is reduced, the feature dimension is reduced, and memory and computation time are saved.

Description

Second-hand car price evaluation method based on DeepFM model

Technical Field

The invention relates to the technical field of second-hand vehicle price evaluation, and in particular to a second-hand vehicle price evaluation method based on a DeepFM model.

Background

As automobile ownership rises, the trading volume of second-hand vehicles keeps growing, and the market has broad development prospects. In this growing market, evaluating the value of a second-hand vehicle is particularly important. Traditional price evaluation methods depend on the market and on the experience of the evaluator, so the results are influenced by subjective factors, and evaluation is costly and inefficient. The existing second-hand vehicle market still relies mainly on these traditional methods, and the evaluation result depends heavily on personal subjective judgment. Providing an accurate and scientific second-hand vehicle price prediction method that improves the accuracy of value prediction is therefore of great significance to the development of the second-hand vehicle industry. In recent years, machine learning methods have been tried for evaluating second-hand vehicle prices, with the evaluated price serving as a reference price for second-hand vehicle transactions.

Deep learning is a machine learning method that combines low-level features into more abstract high-level representations (attribute classes or features) in order to discover distributed feature representations of data. Deep learning structures such as deep neural networks (DNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs) have been successfully applied in computer vision, speech recognition, natural language processing, and other fields. Compared with a shallow neural network, a deep neural network has more layers, which give the model higher levels of abstraction and improve its prediction capability. For complex vehicle models and regional conditions, obtaining the evaluated price of a second-hand vehicle with a deep learning method can overcome the dependence on market experience and subjective judgment and the low evaluation efficiency of traditional price evaluation.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention provides a second-hand vehicle price evaluation method based on a DeepFM model.

To achieve this purpose, the invention adopts the following technical scheme:

A second-hand vehicle price evaluation method based on a DeepFM model comprises the following steps:

1) taking historical second-hand vehicle transaction data as input data; the historical transaction data comprises second-hand vehicle attribute feature data $x_{\text{origin}}$ and a transaction price $y$, wherein the attribute features comprise vehicle registration date, vehicle transaction date, vehicle type, vehicle brand, vehicle type, fuel type, transmission type, engine power, vehicle mileage, and vehicle region;

2) performing feature segmentation on the attribute features in the second-hand vehicle data, dividing them into three types: numerical features, high-cardinality category features, and low-cardinality category features;

3) preprocessing each of the three feature types separately; the preprocessing comprises data cleaning, missing value filling, feature encoding, and data standardization;

4) arranging the preprocessed attribute features of the same second-hand vehicle into a row to form a row vector $x$;

5) arranging and concatenating the row vectors $x_i$ of all second-hand vehicles by row to form a second-hand vehicle data matrix $X$;

6) performing data dimension reduction on the numerical features in the second-hand vehicle data matrix $X$ to obtain a data matrix $X'$;

7) appending the second-hand vehicle transaction price as a label to the end of the corresponding second-hand vehicle row vector;

8) constructing a DeepFM model for evaluating the price of the second-hand vehicle;

9) applying steps 1) to 7) to the second-hand vehicle data used for model training, and inputting the resulting data matrix into the DeepFM model for training to obtain the model parameters and the network model used for second-hand vehicle price estimation;

10) applying steps 1) to 6) to the second-hand vehicle data to be evaluated, and inputting the resulting data matrix into the DeepFM model for price estimation.

Further, in step 2, feature segmentation divides the original second-hand vehicle attribute features into numerical features and category features, and the category features are further divided by cardinality: features with cardinality greater than 10 are high-cardinality category features, and features with cardinality less than or equal to 10 are low-cardinality category features.
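As an illustration of the cardinality rule in step 2, the following minimal Python sketch groups columns by feature type. The function name `split_features`, the column names, and the toy data are hypothetical and serve only as an example; only the threshold of 10 comes from the text.

```python
from typing import Dict, List, Sequence

CARDINALITY_THRESHOLD = 10  # from the text: cardinality > 10 is "high"


def split_features(columns: Dict[str, Sequence], numeric: Sequence[str]) -> Dict[str, List[str]]:
    """Group column names into numerical, high- and low-cardinality category features."""
    groups: Dict[str, List[str]] = {
        "numerical": [],
        "high_cardinality": [],
        "low_cardinality": [],
    }
    for name, values in columns.items():
        if name in numeric:
            groups["numerical"].append(name)
            continue
        # Cardinality = number of distinct non-missing values of the feature.
        cardinality = len({v for v in values if v is not None})
        if cardinality > CARDINALITY_THRESHOLD:
            groups["high_cardinality"].append(name)
        else:
            groups["low_cardinality"].append(name)
    return groups


# Toy data (hypothetical): 12 distinct brands, 2 transmission types.
columns = {
    "mileage": [12000, 56000, 89000] * 4,
    "brand": [f"brand_{i}" for i in range(12)],
    "transmission": ["manual", "automatic"] * 6,
}
groups = split_features(columns, numeric=["mileage"])
```

Here `brand` lands in the high-cardinality group and `transmission` in the low-cardinality group, which decides whether mean encoding or one-hot encoding is applied later.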

Further, in step 3:

Data cleaning uses a box plot to remove outliers, so that the extreme maximum and minimum values in the data are removed.

Missing value filling means that a missing category feature value is filled with the mode of that feature over all data, and a missing numerical feature value is filled with the mean of that feature.
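The cleaning and filling rules above can be sketched as follows. This is a minimal example under stated assumptions: the text does not give the box-plot whisker length, so the conventional 1.5 × IQR rule is assumed, and the helper names and toy values are hypothetical.

```python
import statistics
from typing import Hashable, List, Optional, Tuple


def iqr_bounds(values: List[float], whisker: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) box-plot limits; 1.5 * IQR is an assumed convention."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


def fill_numeric(values: List[Optional[float]]) -> List[float]:
    """Fill missing numerical values with the mean of the observed values."""
    mean = statistics.fmean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]


def fill_category(values: List[Optional[Hashable]]) -> List[Hashable]:
    """Fill missing category values with the mode of the observed values."""
    mode = statistics.mode(v for v in values if v is not None)
    return [mode if v is None else v for v in values]


# Toy prices (hypothetical): 500.0 lies far outside the box-plot limits.
prices = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 500.0]
low, high = iqr_bounds(prices)
cleaned = [p for p in prices if low <= p <= high]
```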

Feature encoding refers to mean encoding and one-hot encoding. Mean encoding is applied to the high-cardinality category features, with the following formula:

$$g(y, x_i) = \lambda(n_i)\,\bar{y}_{x=x_i} + \bigl(1-\lambda(n_i)\bigr)\,\bar{y}$$

where $g(y, x_i)$ is the encoded feature value, $y$ is the second-hand vehicle price, $\lambda(n_i)\in[0,1]$ is the reliability weight between the two means (default value 0.5), $n_i$ is the number of samples whose feature value is $x_i$, $N$ is the total number of samples,

$\bar{y}_{x=x_i} = \frac{1}{n_i}\sum_{j:\,x_j=x_i} y_j$ is the mean of $y$ over the samples with $x = x_i$, and

$\bar{y} = \frac{1}{N}\sum_{j=1}^{N} y_j$ is the mean of $y$ over the entire training set;
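A minimal sketch of mean encoding follows. It assumes the encoded value is a weighted average of the per-category price mean and the global price mean, with a constant weight `lam` standing in for $\lambda(n_i)$ at its stated default of 0.5. The function name and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def mean_encode(categories: Sequence[Hashable], prices: Sequence[float],
                lam: float = 0.5) -> Dict[Hashable, float]:
    """Map each category value to lam * category mean + (1 - lam) * global mean."""
    global_mean = sum(prices) / len(prices)
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for category, price in zip(categories, prices):
        sums[category] += price
        counts[category] += 1
    return {
        category: lam * (sums[category] / counts[category]) + (1.0 - lam) * global_mean
        for category in sums
    }


# Toy example (hypothetical): two brands, four transactions.
brands = ["A", "A", "B", "B"]
prices = [10.0, 20.0, 30.0, 40.0]
encoding = mean_encode(brands, prices)
```

Each high-cardinality category column is thereby replaced by a single numerical column, which is why mean encoding keeps the feature dimension small compared with one-hot encoding.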

One-hot encoding is applied to the low-cardinality category features as follows:

Suppose the second-hand vehicle data has a category feature $x$ with cardinality $m$. Dummy coding constructs an $n \times m$ sparse matrix $A$, where $n$ is the number of samples; each column of the matrix corresponds to one value of the feature $x$, and each entry in a column indicates whether the sample takes that value. The original feature $x$ is replaced by the encoded sparse numerical matrix $A$.
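The construction of the $n \times m$ indicator matrix can be sketched in plain Python as follows. The function name `one_hot` and the fuel-type values are hypothetical, and a dense list of lists is used instead of a sparse matrix for brevity.

```python
from typing import Hashable, List, Sequence, Tuple


def one_hot(values: Sequence[Hashable]) -> Tuple[List[Hashable], List[List[int]]]:
    """Return the sorted distinct levels and an n x m 0/1 indicator matrix."""
    levels = sorted(set(values))
    column_of = {level: j for j, level in enumerate(levels)}
    matrix = []
    for value in values:
        row = [0] * len(levels)
        row[column_of[value]] = 1  # exactly one 1 per row
        matrix.append(row)
    return levels, matrix


levels, A = one_hot(["petrol", "diesel", "petrol", "electric"])
```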

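Data standardization uses the following formula:

$$x' = \frac{x_i - \mu}{s},\qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i,\qquad s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}$$

where $x_i$ is the value before standardization, $x'$ is the value after standardization, and $n$ is the number of samples.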
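A minimal sketch of the standardization step, assuming the z-score form (subtract the sample mean, divide by the population standard deviation over the $n$ samples). The function name and the sample values are hypothetical.

```python
import math
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Z-score each value using the mean and population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


scaled = standardize([2.0, 4.0, 6.0])
```

After this step every numerical feature has zero mean and unit variance, which puts features such as mileage and engine power on a comparable scale before dimension reduction.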

Further, in step 6, principal component analysis is used for data dimension reduction, and the principal components that account for 99% of the variance are retained. The specific steps are as follows:

First, compute the covariance matrix of the original data $X$:

$$C_{m\times m} = \frac{1}{n} X X^{T}$$

where $X$ is the original data matrix, $n$ is the number of columns of $X$, and $m$ is the number of rows of $X$.

Compute the eigenvalues $(\lambda_i)_{i=1,\dots,m}$ and eigenvectors $(p_i)_{i=1,\dots,m}$ of the covariance matrix $C_{m\times m}$.

Sort the eigenvalues in descending order as $\{\lambda_1, \lambda_2, \dots, \lambda_m\}$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$. Take the first $k$ eigenvalues, whose sum accounts for 99% of the sum of all eigenvalues, and combine the corresponding eigenvectors $\{p_1, p_2, \dots, p_k\}$ into a transformation matrix:

$$P_{k\times m} = [p_1, p_2, \dots, p_k]^{T}$$

Multiply the transformation matrix $P_{k\times m}$ by the original second-hand vehicle data $X_{m\times n}$ to obtain the dimension-reduced data:

$$Y_{k\times n} = P_{k\times m} X_{m\times n}$$

where $Y_{k\times n}$ is the dimension-reduced matrix, $P_{k\times m}$ is the transformation matrix, and $X_{m\times n}$ is the original data matrix.
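The PCA steps above can be sketched with NumPy as follows. This is an illustrative sketch, not the patented implementation: the function name `pca_reduce` and the synthetic data are hypothetical, and the explicit centering step is an assumption (it is redundant if the features were already standardized in step 3).

```python
import numpy as np


def pca_reduce(X: np.ndarray, variance_ratio: float = 0.99):
    """Reduce X of shape (m, n): m features in rows, n samples in columns."""
    Xc = X - X.mean(axis=1, keepdims=True)      # centering (assumed)
    C = Xc @ Xc.T / X.shape[1]                  # covariance matrix, shape (m, m)
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Smallest k whose leading eigenvalues reach the requested share of the total.
    k = int(np.searchsorted(cumulative, variance_ratio) + 1)
    P = eigvecs[:, :k].T                        # transformation matrix, shape (k, m)
    return P @ Xc, P                            # reduced data, shape (k, n)


# Synthetic data (hypothetical): two perfectly correlated rows plus tiny noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.vstack([t, 2.0 * t, rng.normal(scale=0.01, size=200)])
Y, P = pca_reduce(X)
```

In this toy case almost all variance lies along one direction, so a single component is kept and the three original features collapse to one.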

Further, the DeepFM model constructed in step 8 comprises, in order from input to output, an input layer, an embedding layer, an FM layer, a DNN layer, and an output layer. The DeepFM model input consists of a plurality of input fields, divided into category feature fields and numerical feature fields. The category feature fields correspond to the low-cardinality category features preprocessed in step 3, and the numerical feature fields correspond to the high-cardinality category features and the numerical features.

Each input field is connected to one embedding unit of the embedding layer and is converted by the embedding layer into an embedding vector of dimension $k$, where the default value of $k$ is 8.

The FM layer of the model is a factorization machine whose first-order unit is connected to each input field and whose second-order unit is connected to the embedding layer. Its output is

$$y_{FM} = w_0 + \langle w, x\rangle + \sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle E_i, E_j\rangle\, x_i x_j$$

where $x$ is the input value, $y_{FM}$ is the FM layer output, $w_0$ is the global bias, $w$ is the weight parameter, the $\langle w, x\rangle$ term represents the contribution of the first-order features in the model,

the $\sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle E_i, E_j\rangle\, x_i x_j$ term represents the contribution of the second-order feature crosses in the model, $n$ is the number of input fields, and $E_i$ is the $k$-dimensional embedding vector of input field $x_i$.
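The second-order term of a factorization machine is commonly evaluated with a standard algebraic identity rather than the explicit double sum. Writing $e_{f,i}$ for the $f$-th component of the embedding vector $E_i$ (notation assumed here for illustration), the identity reads:

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle E_i, E_j\rangle\, x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} e_{f,i}\, x_i\right)^{2} - \sum_{i=1}^{n} e_{f,i}^{2}\, x_i^{2}\right]$$

It follows from expanding the square of the sum and removing the diagonal terms $i = j$. The right-hand side costs $O(nk)$ operations instead of $O(n^2 k)$, which matters when the number of input fields $n$ is large.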

The DNN layer of the model is a feedforward neural network whose input is the dense embedding vector output by the embedding layer, denoted $a^{(0)} = [e_1, e_2, \dots, e_k]$. The DNN computation is

$$a^{(l+1)} = \sigma\bigl(W^{(l)} a^{(l)} + b^{(l)}\bigr)$$

where $l$ is the index of the current layer of the DNN, $a^{(l+1)}$ is the output of the current layer, $W^{(l)}$ is the weight, $b^{(l)}$ is the bias, and $\sigma$ is the activation function. The output of the DNN part is

$$y_{DNN} = \sigma\bigl(W^{(|H|+1)} a^{(|H|)} + b^{(|H|+1)}\bigr)$$

where $y_{DNN}$ is the output of the DNN part and $|H|$ is the number of hidden layers. By default, the DNN has 4 hidden layers with 100, 64, 32, and 8 neurons respectively, and the activation function is the ReLU function.

The output layer of the model is connected to the FM layer and the DNN layer and uses a single neuron to output the result:

$$\hat{y} = \sigma\bigl(y_{FM} + y_{DNN}\bigr)$$

where $\hat{y}$ is the model evaluation result, $y_{FM}$ is the output of the FM layer, $y_{DNN}$ is the output of the DNN layer, and $\sigma$ is the ReLU activation function. The model loss function is the mean absolute error.
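To make the data flow through the FM layer, DNN layer, and output layer concrete, the following NumPy sketch runs one forward pass with random, untrained weights. It is a sketch under stated assumptions, not the patented network: the function names, the number of input fields (5), the use of field-scaled embeddings as DNN input, and the way the two branches are summed before the final ReLU are assumptions. Only $k = 8$, the hidden sizes 100, 64, 32, 8, and the ReLU activation come from the text.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def fm_second_order(x: np.ndarray, E: np.ndarray) -> float:
    """Sum over i < j of <E_i, E_j> x_i x_j, via the sum-of-squares identity."""
    V = E * x[:, None]                                   # shape (n_fields, k)
    return 0.5 * float(np.sum(V.sum(axis=0) ** 2 - (V ** 2).sum(axis=0)))


def deepfm_forward(x, w0, w, E, hidden, out) -> float:
    """One forward pass: FM branch plus DNN branch, combined by a ReLU output unit."""
    y_fm = w0 + float(w @ x) + fm_second_order(x, E)     # first + second order
    a = (E * x[:, None]).reshape(-1)                     # a^(0): concatenated embeddings
    for W, b in hidden:                                  # hidden layers with ReLU
        a = relu(W @ a + b)
    W_out, b_out = out
    y_dnn = float(relu(W_out @ a + b_out))               # scalar DNN output
    return float(relu(y_fm + y_dnn))                     # non-negative price estimate


rng = np.random.default_rng(0)
n_fields, k = 5, 8                                       # n_fields is hypothetical
sizes = [n_fields * k, 100, 64, 32, 8]
hidden = [(rng.normal(scale=0.1, size=(o, i)), np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]
out = (rng.normal(scale=0.1, size=sizes[-1]), 0.0)
x = rng.normal(size=n_fields)
w = rng.normal(size=n_fields)
E = rng.normal(scale=0.1, size=(n_fields, k))
price = deepfm_forward(x, 0.0, w, E, hidden, out)
```

Both branches read the same embeddings, which is the design choice that lets DeepFM learn low-order and high-order feature interactions jointly without separate feature engineering for each branch.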

Compared with the prior art, the invention has the advantages that:

1) the invention improves the accuracy of price evaluation of the second-hand vehicle;

2) the invention uses a DeepFM model to evaluate second-hand vehicle prices, which reduces the feature-engineering workload, better captures both the low-order and high-order feature interactions of the input data, and learns the mapping between the price and the input features;

3) the invention uses mean encoding and principal component analysis for data preprocessing, which reduces the feature dimension and saves memory and computation time.

Drawings

Fig. 1 is a flowchart of a second-hand vehicle price evaluation method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a second-hand vehicle price evaluation method according to an embodiment of the present invention;

Fig. 3 is a schematic network structure diagram of a second-hand vehicle price evaluation method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings by way of examples.

As shown in Figs. 1 to 3, a second-hand vehicle price evaluation method based on a DeepFM model comprises the following specific steps:

1) Taking historical second-hand vehicle transaction data as input data;

The historical second-hand vehicle transaction data comprises second-hand vehicle attribute feature data $x_{\text{origin}}$ and a transaction price $y$, wherein the attribute features comprise vehicle registration date, vehicle transaction date, vehicle type, vehicle brand, vehicle type, fuel type, transmission type, engine power, vehicle mileage, and vehicle region.

2) Performing feature segmentation on the attribute features in the second-hand vehicle data, dividing them into numerical features, high-cardinality category features, and low-cardinality category features;

The attribute features in the historical second-hand vehicle data are divided into numerical features $x_{\text{num}}$ and category features $x_{\text{cate}}$.

The category features are divided by cardinality: features with cardinality greater than 10 are high-cardinality category features $x_{\text{high-cate}}$, and features with cardinality less than or equal to 10 are low-cardinality category features $x_{\text{low-cate}}$.

3) Respectively preprocessing the three characteristics of the second-hand vehicle;

Preprocessing comprises data cleaning, missing value filling, feature encoding, data standardization, and data dimension reduction.

Mean encoding is applied to the high-cardinality category features $x_{\text{high-cate}}$, with the following formula:

$$g(y, x_i) = \lambda(n_i)\,\bar{y}_{x=x_i} + \bigl(1-\lambda(n_i)\bigr)\,\bar{y}$$

where $g(y, x_i)$ is the encoded feature value, $y$ is the second-hand vehicle price, $\lambda(n_i)\in[0,1]$ is the reliability weight between the two means (default value 0.5), $n_i$ is the number of samples whose feature value is $x_i$, $N$ is the total number of samples,

$\bar{y}_{x=x_i} = \frac{1}{n_i}\sum_{j:\,x_j=x_i} y_j$ is the mean of $y$ over the samples with $x = x_i$, and

$\bar{y} = \frac{1}{N}\sum_{j=1}^{N} y_j$ is the mean of $y$ over the entire training set.

One-hot encoding is applied to the low-cardinality category features $x_{\text{low-cate}}$ as follows:

Suppose the second-hand vehicle data has a category feature $x_i$ with cardinality $m$. Dummy coding constructs an $m$-dimensional vector $x' = [x'_1, x'_2, \dots, x'_m]$, in which each element corresponds to one value of the feature $x_i$ and indicates whether the current feature takes that value; exactly one element of the vector is 1 and the others are 0. The original feature $x_i$ is replaced by the encoded vector $x'$.

Data standardization uses the following formula:

$$x' = \frac{x_i - \mu}{s},\qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i,\qquad s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}$$

where $x_i$ is the value before standardization, $x'$ is the value after standardization, and $n$ is the number of samples. The original value $x_i$ is replaced by the standardized value $x'$.

4) The preprocessed attribute features $x_{\text{num}}$, $x_{\text{low-cate}}$, and $x_{\text{high-cate}}$ of the same second-hand vehicle are arranged in a row to form a row vector $x = [x_{\text{num}}, x_{\text{low-cate}}, x_{\text{high-cate}}]$;

5) The row vectors $x$ of all second-hand vehicles are arranged and concatenated by row to form a second-hand vehicle data matrix $X$;

6) Data dimension reduction is performed on the numerical features in the second-hand vehicle data matrix to obtain a data matrix $X'$;

Let the numerical feature part of the second-hand vehicle data matrix $X$ be $A_{n\times m}$. Principal component analysis is used for dimension reduction, and the principal components that account for 99% of the variance are retained. The specific steps are as follows. Compute the covariance matrix of the numerical feature matrix $A_{n\times m}$:

$$C_{m\times m} = \frac{1}{n} A^{T} A$$

where $A$ is the original numerical feature matrix, $n$ is the number of rows of $A$, and $m$ is the number of columns of $A$.

Compute the eigenvalues $(\lambda_i)_{i=1,\dots,m}$ and eigenvectors $(p_i)_{i=1,\dots,m}$ of the covariance matrix $C_{m\times m}$. Sort the eigenvalues in descending order as $\{\lambda_1, \lambda_2, \dots, \lambda_m\}$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$. Select the first $k$ eigenvalues, whose sum accounts for 99% of the sum of all eigenvalues, and combine the corresponding eigenvectors $\{p_1, p_2, \dots, p_k\}$ into a transformation matrix $P_{k\times m} = [p_1, p_2, \dots, p_k]^{T}$. Multiply the transformation matrix $P_{k\times m}$ by the original numerical feature data $A_{n\times m}$ to obtain the dimension-reduced data $A'_{n\times k} = [P_{k\times m} A_{n\times m}^{T}]^{T}$.

The dimension-reduced matrix $A'_{n\times k}$ replaces $A_{n\times m}$ in the data, yielding the dimension-reduced second-hand vehicle data matrix $X'$.

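7) The second-hand vehicle transaction price is appended as a label to the end of the corresponding second-hand vehicle row vector;

8) A DeepFM model for evaluating the price of the second-hand vehicle is constructed;

The constructed DeepFM model comprises, in order from input to output, an input layer, an embedding layer, an FM layer, a DNN layer, and an output layer. Fig. 3 is a structural diagram of the constructed model.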

The model input consists of multiple input fields, denoted $X = [F_1, F_2, \dots, F_m]$. The category feature fields correspond to the low-cardinality category features preprocessed in step 3, $F_{\text{cate}} = [f_1, f_2, \dots, f_i]$, and each numerical feature field corresponds to a high-cardinality category feature or a numerical feature, $F_{\text{num}} = [f]$. Each input field is connected to one embedding unit of the embedding layer and is converted by the embedding layer into an embedding vector $e = [e_1, e_2, \dots, e_k]$ of dimension $k$, with $k = 8$. The embedding layer output is $E = [E_1, E_2, \dots, E_k]$, where $E_i$ is the $k$-dimensional embedding vector $E_i = [e_{1,i}, e_{2,i}, \dots, e_{k,i}]$.

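The FM layer of the model is a factorization machine whose output is

$$y_{FM} = w_0 + \langle w, x\rangle + \sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle E_i, E_j\rangle\, x_i x_j$$

where $x$ is the input value, $y_{FM}$ is the FM layer output, $w_0$ is the global bias, $w$ is the weight parameter, the $\langle w, x\rangle$ term represents the contribution of the first-order features in the model,

the $\sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle E_i, E_j\rangle\, x_i x_j$ term represents the contribution of the second-order feature crosses in the model, $n$ is the number of input fields, and $E_i$ is the $k$-dimensional embedding vector of input field $F_i$.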

The DNN layer of the model is a feedforward neural network whose input is the dense embedding vector output by the embedding layer. The DNN has 4 hidden layers with 100, 64, 32, and 8 neurons respectively, and the activation function is the ReLU function. Denoting the output of the embedding layer as $a^{(0)} = [e_1, e_2, \dots, e_k]$, the DNN computation is

$$a^{(l+1)} = \sigma\bigl(W^{(l)} a^{(l)} + b^{(l)}\bigr)$$

where $l$ is the index of the current layer of the DNN, $a^{(l+1)}$ is the output of the current layer, $W^{(l)}$ is the weight, $b^{(l)}$ is the bias, and $\sigma$ is the activation function. The output of the DNN part is

$$y_{DNN} = \sigma\bigl(W^{(|H|+1)} a^{(|H|)} + b^{(|H|+1)}\bigr)$$

where $y_{DNN}$ is the output of the DNN part and $|H|$ is the number of hidden layers.

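The output layer of the model combines the outputs of the FM layer and the DNN layer:

$$\hat{y} = \sigma\bigl(y_{FM} + y_{DNN}\bigr)$$

where $\sigma$ is the ReLU activation function. The model loss function is the mean absolute error (MAE), given by:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $y$ is the true value, $\hat{y}$ is the model prediction, and $n$ is the number of predicted samples.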
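The MAE loss is simple enough to state directly in code. The function name and the sample prices below are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted prices."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Absolute errors are 1.0, 1.0 and 0.0, so the MAE is 2/3.
mae = mean_absolute_error([10.0, 12.5, 8.0], [9.0, 13.5, 8.0])
```

Because MAE grows linearly with the error, a few badly mispriced vehicles influence training less than they would under a squared-error loss.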

9) Steps 1) to 7) are applied to the second-hand vehicle data used for model training, and the resulting data matrix is input into the DeepFM model for training to obtain the model parameters and the network model used for second-hand vehicle price estimation;

Model training parameters: the network parameters are adjusted by back propagation (BP) with the Adam optimizer; the learning rate is set to 0.015 and decreases gradually with the number of iterations; the L2 regularization coefficient is set to 0.08; batch_size is set to 2000 and the number of epochs to 500.
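The training settings can be collected in a small configuration object, sketched below. The text does not specify how the learning rate decreases, so the exponential schedule and its `decay_rate` are assumptions, and the class and field names are hypothetical. Only the numeric values 0.015, 0.08, 2000, and 500 and the choice of Adam come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    optimizer: str = "adam"
    learning_rate: float = 0.015    # initial learning rate from the text
    l2_coefficient: float = 0.08    # L2 regularization coefficient from the text
    batch_size: int = 2000
    epochs: int = 500
    decay_rate: float = 0.99        # hypothetical: decay form is not given in the text

    def learning_rate_at(self, epoch: int) -> float:
        """Assumed exponential decay: lr * decay_rate ** epoch."""
        return self.learning_rate * self.decay_rate ** epoch


config = TrainConfig()
```

Any monotonically decreasing schedule would satisfy the description equally well; the exponential form is used here only because it needs a single extra parameter.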

10) Steps 1) to 6) are applied to the second-hand vehicle data to be evaluated, and the resulting data matrix is input into the DeepFM model for price estimation; the output of the output layer of the DeepFM model is the estimated second-hand vehicle price.

Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to help the reader understand how the invention is practiced, and that the scope of the invention is not limited to these specific statements and embodiments. Those skilled in the art, having the benefit of this disclosure, may make numerous modifications and changes without departing from the scope of the invention.

Claims

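1. A second-hand vehicle price evaluation method based on a DeepFM model, characterized by comprising the following steps:

1) taking historical second-hand vehicle transaction data as input data; the historical transaction data comprises second-hand vehicle attribute feature data $x_{\text{origin}}$ and a transaction price $y$, wherein the attribute features comprise vehicle registration date, vehicle transaction date, vehicle type, vehicle brand, vehicle type, fuel type, transmission type, engine power, vehicle mileage, and vehicle region;

2) performing feature segmentation on the attribute features in the second-hand vehicle data, dividing them into three types: numerical features, high-cardinality category features, and low-cardinality category features;

3) preprocessing each of the three feature types separately; the preprocessing comprises data cleaning, missing value filling, feature encoding, and data standardization;

4) arranging the preprocessed attribute features of the same second-hand vehicle into a row to form a row vector $x$;

5) arranging and concatenating the row vectors $x_i$ of all second-hand vehicles by row to form a second-hand vehicle data matrix $X$;

6) performing data dimension reduction on the numerical features in the second-hand vehicle data matrix $X$ to obtain a data matrix $X'$;

7) appending the second-hand vehicle transaction price as a label to the end of the corresponding second-hand vehicle row vector;

8) constructing a DeepFM model for evaluating the price of the second-hand vehicle;

9) applying steps 1) to 7) to the second-hand vehicle data used for model training, and inputting the resulting data matrix into the DeepFM model for training to obtain the model parameters and the network model used for second-hand vehicle price estimation;

10) applying steps 1) to 6) to the second-hand vehicle data to be evaluated, and inputting the resulting data matrix into the DeepFM model for price estimation.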

2. The second-hand vehicle price evaluation method based on a DeepFM model according to claim 1, characterized in that: in step 2, feature segmentation divides the original second-hand vehicle attribute features into numerical features and category features, and the category features are further divided by cardinality, wherein features with cardinality greater than 10 are high-cardinality category features and features with cardinality less than or equal to 10 are low-cardinality category features.

3. The second-hand vehicle price evaluation method based on a DeepFM model according to claim 1, characterized in that: in step 3, data cleaning uses a box plot to remove outliers, so that the extreme maximum and minimum values in the data are removed;

missing value filling means that a missing category feature value is filled with the mode of that feature over all data, and a missing numerical feature value is filled with the mean of that feature;

feature encoding refers to mean encoding and one-hot encoding; mean encoding is applied to the high-cardinality category features, with the following formula:

$$g(y, x_i) = \lambda(n_i)\,\bar{y}_{x=x_i} + \bigl(1-\lambda(n_i)\bigr)\,\bar{y}$$

where $g(y, x_i)$ is the encoded feature value, $y$ is the second-hand vehicle price, $\lambda(n_i)\in[0,1]$ is the reliability weight between the two means (default value 0.5), $n_i$ is the number of samples whose feature value is $x_i$, $N$ is the total number of samples, $\bar{y}_{x=x_i}$ is the mean of $y$ over the samples with $x = x_i$, and $\bar{y}$ is the mean of $y$ over the entire training set;

one-hot encoding is applied to the low-cardinality category features as follows: suppose the second-hand vehicle data has a category feature $x$ with cardinality $m$; dummy coding constructs an $n \times m$ sparse matrix $A$, each column of the matrix corresponds to one value of the feature $x$, and each entry in a column indicates whether the sample takes that value; the original feature $x$ is replaced by the encoded sparse numerical matrix $A$;

data standardization uses the following formula:

$$x' = \frac{x_i - \mu}{s},\qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i,\qquad s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}$$

where $x_i$ is the value before standardization, $x'$ is the value after standardization, and $n$ is the number of samples.

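4. The second-hand vehicle price evaluation method based on a DeepFM model according to claim 1, characterized in that: in step 6, principal component analysis is used for data dimension reduction, and the principal components that account for 99% of the variance are retained, with the following specific steps:

first, compute the covariance matrix of the original data $X$:

$$C_{m\times m} = \frac{1}{n} X X^{T}$$

where $X$ is the original data matrix, $n$ is the number of columns of $X$, and $m$ is the number of rows of $X$;

compute the eigenvalues $(\lambda_i)_{i=1,\dots,m}$ and eigenvectors $(p_i)_{i=1,\dots,m}$ of the covariance matrix $C_{m\times m}$;

sort the eigenvalues in descending order as $\{\lambda_1, \lambda_2, \dots, \lambda_m\}$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$; take the first $k$ eigenvalues, whose sum accounts for 99% of the sum of all eigenvalues, and combine the corresponding eigenvectors $\{p_1, p_2, \dots, p_k\}$ into a transformation matrix

$$P_{k\times m} = [p_1, p_2, \dots, p_k]^{T}$$

multiply the transformation matrix $P_{k\times m}$ by the original second-hand vehicle data $X_{m\times n}$ to obtain the dimension-reduced data

$$Y_{k\times n} = P_{k\times m} X_{m\times n}$$

where $Y_{k\times n}$ is the dimension-reduced matrix, $P_{k\times m}$ is the transformation matrix, and $X_{m\times n}$ is the original data matrix.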

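5. The second-hand vehicle price evaluation method based on a DeepFM model according to claim 1, characterized in that: the DeepFM model constructed in step 8 comprises, in order from input to output, an input layer, an embedding layer, an FM layer, a DNN layer, and an output layer; the DeepFM model input consists of a plurality of input fields, divided into category feature fields and numerical feature fields; the category feature fields correspond to the low-cardinality category features preprocessed in step 3, and the numerical feature fields correspond to the high-cardinality category features and the numerical features;

each input field is connected to one embedding unit of the embedding layer and is converted by the embedding layer into an embedding vector of dimension $k$, where the default value of $k$ is 8;

the FM layer of the model is a factorization machine whose first-order unit is connected to each input field and whose second-order unit is connected to the embedding layer, with output

$$y_{FM} = w_0 + \langle w, x\rangle + \sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle E_i, E_j\rangle\, x_i x_j$$

where $x$ is the input value, $y_{FM}$ is the FM layer output, $w_0$ is the global bias, $w$ is the weight parameter, the $\langle w, x\rangle$ term represents the contribution of the first-order features, the double-sum term represents the contribution of the second-order feature crosses, $n$ is the number of input fields, and $E_i$ is the $k$-dimensional embedding vector of input field $x_i$;

the DNN layer of the model is a feedforward neural network whose input is the dense embedding vector output by the embedding layer, denoted $a^{(0)} = [e_1, e_2, \dots, e_k]$, with the computation

$$a^{(l+1)} = \sigma\bigl(W^{(l)} a^{(l)} + b^{(l)}\bigr)$$

where $l$ is the index of the current layer of the DNN, $a^{(l+1)}$ is the output of the current layer, $W^{(l)}$ is the weight, $b^{(l)}$ is the bias, and $\sigma$ is the activation function; the output of the DNN part is

$$y_{DNN} = \sigma\bigl(W^{(|H|+1)} a^{(|H|)} + b^{(|H|+1)}\bigr)$$

where $y_{DNN}$ is the output of the DNN part and $|H|$ is the number of hidden layers; by default, the DNN has 4 hidden layers with 100, 64, 32, and 8 neurons respectively, and the activation function is the ReLU function;

the output layer of the model is connected to the FM layer and the DNN layer and uses a single neuron to output the result:

$$\hat{y} = \sigma\bigl(y_{FM} + y_{DNN}\bigr)$$

where $\hat{y}$ is the model evaluation result, $y_{FM}$ is the output of the FM layer, $y_{DNN}$ is the output of the DNN layer, and $\sigma$ is the ReLU activation function; the model loss function is the mean absolute error.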