CN112183615B

CN112183615B - Automobile risk user screening method with Markov chain data processing function

Info

Publication number: CN112183615B
Application number: CN202011021233.1A
Authority: CN
Inventors: 刘洋; 郑泉
Original assignee: Ruichida New Energy Automotive Technology Beijing Co ltd
Current assignee: Ruichida New Energy Automotive Technology Beijing Co ltd
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2023-08-18
Anticipated expiration: 2040-09-25
Also published as: CN112183615A

Abstract

The invention discloses an automobile risk user screening method with Markov chain data processing, belonging to the technical field of automobile risk user classification. The method comprises: acquiring the data attributes of a user's journey together with the longitude, latitude and time of the data, and cleaning and integrating the data; dividing the data into small regions by time and space and extracting the data features of each region; mapping the processed data of each region into a Markov chain data format to obtain a state transition matrix; and using the extracted features to train a convolutional neural network classifier, with network parameters updated by cross-entropy loss. The invention addresses the problem of extracting the features most effective for classification from a large number of features, realizes a transformation of the feature-space dimension, and obtains a group of classification features that are invariant across samples of the same class and discriminative between samples of different classes.

Description

Automobile risk user screening method with Markov chain data processing function

Technical Field

The invention relates to the technical field of automobile risk user classification, in particular to an automobile risk user screening method with Markov chain data processing.

Background

With the continuous development of artificial intelligence, deep learning classification networks have been migrated to many research fields, and delivering the practical value of artificial intelligence has become the goal of many researchers. For example, applying classification networks to the screening of high-risk and low-risk users in the automotive industry provides an important aid and reference for serving users better. Because the data used for this screening come from the vehicle driving system, they are huge in volume and contain a large amount of noise; at the same time, the data are poorly discriminative and the available data attributes differ greatly from one another, which makes network training difficult. As the network deepens, overfitting easily occurs, and the network model may fail to converge.

At present, the Markov chain, as a concept for describing a process over time, is widely applied in artificial intelligence fields such as speech recognition, text recognition and path recognition; in the financial field it is used to predict the market share of enterprise products, and it also serves as a signal model for entropy coding techniques. It is thus a main approach to solving a variety of problems. However, no related application of Markov chains to automobile risk user screening is currently available, whether to solve the data preprocessing problem or to improve the screening accuracy for automobile risk users.

Disclosure of Invention

In view of the above-described deficiencies of the prior art, the present invention provides an automobile risk user screening method with Markov chain data processing.

In order to solve the technical problems, the invention adopts the following technical scheme: an automobile risk user screening method with Markov chain data processing, comprising the following steps:

Step 1: driving behavior related data are read from the database and preprocessed according to the longitude and latitude acquired by GPS and the data acquisition time, so as to improve the confidence and reliability of the data; the process is as follows:

step 1.1: checking whether repeated data exist in the data, and if so, only reserving one piece of data;

step 1.2: handling missing data by deleting tuples, filling with 0 values or the average value, or filling by a K-nearest-neighbor distance method;

step 1.3: according to the longitude and latitude of each city, data that are not within the range of the city are regarded as abnormal data, and, according to actual conditions, one of the statistical methods of stepwise backward deletion, mean-based elimination or logical-error deletion is adopted for data cleaning;

step 1.4: according to the influence of the satellite positioning technology on positioning precision, data amounting to less than a threshold data volume are regarded as invalid data, and data cleaning is then carried out again.
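The cleaning sub-steps above can be illustrated with a minimal Python sketch. It covers only steps 1.1, 1.3 and 1.4; the missing-value handling of step 1.2 is omitted. The `Record` fields, the rectangular city bounding box, and the per-user reading of the "threshold data amount" are assumptions made for illustration and are not specified in the text.

```python
from typing import NamedTuple


class Record(NamedTuple):
    # Hypothetical record layout; the text does not define the schema.
    user_id: str
    timestamp: int   # acquisition time, e.g. Unix seconds (assumed)
    lon: float
    lat: float
    speed: float


def clean(records, lon_range, lat_range, min_points):
    """Deduplicate, drop out-of-city points, drop users with too few points."""
    # step 1.1: keep only one copy of each repeated record (order preserved)
    unique = list(dict.fromkeys(records))
    # step 1.3: treat points outside the city's bounding box as abnormal
    in_city = [
        r for r in unique
        if lon_range[0] <= r.lon <= lon_range[1]
        and lat_range[0] <= r.lat <= lat_range[1]
    ]
    # step 1.4 (assumed per-user reading): discard users whose remaining
    # data volume is below the threshold
    counts = {}
    for r in in_city:
        counts[r.user_id] = counts.get(r.user_id, 0) + 1
    return [r for r in in_city if counts[r.user_id] >= min_points]
```

Using hashable named tuples lets `dict.fromkeys` remove exact duplicates while keeping the original order of the records.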

Step 2: according to the specific location information of each city, grid division is carried out by longitude, latitude and time; the vehicle operation data in the driving behavior of each small region in each time period are counted and processed; the process is as follows:

step 2.1: selecting a certain city, merging all data of different acquisition times, drawing a scatter diagram according to longitude and latitude, observing driving distribution conditions of an automobile, setting a city grid division standard according to the density degree of the scatter diagram, and obtaining the area grid size under different division standard conditions;

step 2.1.1: assuming that the maximum and minimum longitudes of the city are max(X) and min(X) respectively, that the maximum and minimum latitudes are max(Y) and min(Y) respectively, and that the side length of the city grid is set as $r_i$ ($i = 1, 2, 3, \dots, m$), where m denotes the number of possible standards for dividing the grid, the numbers of grids into which the city is divided by longitude and latitude are:

wherein $n_{length,i}$ represents the number of grids divided by longitude under the ith possible division standard, and $n_{width,i}$ represents the number of grids divided by latitude under the ith possible division standard;
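The formula for the grid counts is not reproduced in this text. A natural reading of the definitions is that each count is the longitude or latitude extent divided by the side length $r_i$. The sketch below follows that reading; rounding up with `math.ceil`, so that the grid covers the whole extent, is an assumption.

```python
import math


def grid_counts(lons, lats, side):
    """Return (n_length, n_width) for one candidate grid side length.

    Assumed reading: extent divided by side length, rounded up so that
    the grid covers the full extent of the city.
    """
    n_length = math.ceil((max(lons) - min(lons)) / side)
    n_width = math.ceil((max(lats) - min(lats)) / side)
    return n_length, n_width
```

Calling `grid_counts` once per candidate side length gives the pair of counts for each of the m division standards.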

step 2.1.2: summing the variances of the vehicle operation data of every region under each possible division standard, and selecting, by the minimum variance method, the division standard with the smallest summed variance as the optimal grid division standard; alternatively, adjusting the choice by a voting method based on the variance within the small regions, so as to avoid a large number of regions without data;

step 2.2: within each divided spatial cell, dividing the data into M time segments according to whether the time is a peak period for the road section;
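As an illustration of the minimum variance selection, the following sketch bins points into square cells for each candidate side length, sums the per-cell variances, and keeps the side length with the smallest sum. The `(lon, lat, value)` tuple layout and the use of population variance are assumptions; the voting adjustment is not sketched.

```python
from collections import defaultdict
from statistics import pvariance


def summed_cell_variance(points, side):
    """points: list of (lon, lat, value). Sum of per-cell population variances."""
    lon0 = min(p[0] for p in points)
    lat0 = min(p[1] for p in points)
    cells = defaultdict(list)
    for lon, lat, value in points:
        # Index of the square cell that contains this point
        key = (int((lon - lon0) // side), int((lat - lat0) // side))
        cells[key].append(value)
    return sum(pvariance(values) for values in cells.values())


def best_side(points, candidate_sides):
    """Candidate side length whose summed cell variance is smallest."""
    return min(candidate_sides, key=lambda r: summed_cell_variance(points, r))
```

One caveat of this literal reading: a cell holding a single point contributes zero variance, so very fine grids can win by default, which leaves many cells empty.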

step 2.3: in each time period of each divided small region, mean and variance statistics are computed on the data, and the data are then processed on that basis:

wherein $\sigma_{ijk}$ and the accompanying mean term are, respectively, the variance and the mean of the data in the kth time period of the grid in the ith row and jth column of the city, $x_k$ is the unprocessed data of the grid in the kth time period, and $x'_k$ is the processed data of the grid in the kth time period.
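The processing formula is not reproduced in this text. Since it is defined in terms of a per-cell mean and variance, one plausible reading is a z-score standardization within each (row, column, time period) group. The sketch below assumes that reading and divides by the standard deviation; both choices are assumptions, as is the `(row, col, period, value)` sample layout.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_cell(samples):
    """samples: list of (row, col, period, value) -> standardized values.

    Assumed reading: x' = (x - mean) / std within each (row, col, period).
    """
    groups = defaultdict(list)
    for i, j, k, x in samples:
        groups[(i, j, k)].append(x)
    stats = {key: (mean(v), pstdev(v)) for key, v in groups.items()}
    out = []
    for i, j, k, x in samples:
        mu, sigma = stats[(i, j, k)]
        # A constant group has zero spread; map it to 0.0 instead of dividing
        out.append((x - mu) / sigma if sigma > 0 else 0.0)
    return out
```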

Step 3: dividing the time series data into a plurality of states according to the processed feature data, determining an interval division form that does not bias the measured distribution, and counting the state transitions and the state transition matrix of the data within the divided intervals; the process is as follows:

step 3.1: dividing the time sequence data into N states according to the distribution condition of the processed characteristic data x';

the state is divided into equal intervals or unequal intervals according to actual distribution conditions;

step 3.2: converting the gridded data into states according to the upper and lower boundaries of the states, namely $x(i) \to s(i)$, $i = 1, 2, \dots$, thereby generating a Markov chain;

wherein $x(i)$ is the gridded data at time i, and $s(i)$ is the state at time i;

step 3.2.1: assuming that the upper and lower boundaries of the states are B and A, respectively, the interval between the states, for N equal-width states, is:

$\Delta = (B - A) / N$

step 3.2.2: when $x(i) \in [A + (k-1)\Delta,\ A + k\Delta]$, $s(i) = k$, $k = 1, 2, \dots, N$, so that the feature data corresponding to each time point are converted into state data in $\{1, 2, \dots, N\}$; the state data have the Markov property, so the data set formed by all the state information $s(i)$ is a Markov chain;
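A minimal sketch of the equal-interval state mapping in step 3.2.2 follows. The closed intervals in the text overlap at their shared boundaries, so the sketch assumes half-open intervals and clamps values at or beyond the bounds into states 1 and N; the function and parameter names are illustrative.

```python
def to_states(values, lower, upper, n_states):
    """Map each value to a state in 1..n_states using equal-width intervals."""
    delta = (upper - lower) / n_states
    states = []
    for x in values:
        # Half-open interval [lower + (k-1)*delta, lower + k*delta) -> state k
        k = int((x - lower) // delta) + 1
        # Clamp so that x == upper (or any out-of-range value) stays in 1..N
        states.append(min(max(k, 1), n_states))
    return states
```

For example, with a lower bound of 0, an upper bound of 70 and 7 states, each state spans an interval of width 10.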

step 3.3: counting the transition condition of each state s (i), and extracting Markov characteristics;

step 3.3.1: defining the Markov feature as the transition condition of each state: the numbers of upward and downward transitions out of state i are counted, together with the number of times $k_i$ that state i is held; the calculation formulas are as follows:

where s (j) represents the state at the moment j, s (j+1) represents the state at the moment j+1, and L represents the number of data points;
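The counting formulas are not reproduced in this text. The sketch below assumes the straightforward reading: for each consecutive pair s(j), s(j+1), an upward, downward or holding event is credited to the state s(j). The list names `up`, `down` and `hold` are illustrative, with `hold[i]` playing the role of $k_i$.

```python
def transition_counts(states, n_states):
    """Count upward, downward and holding transitions out of each state.

    Index 0 of each returned list is unused so that up[i], down[i] and
    hold[i] refer directly to state i (1-based, as in the text).
    """
    up = [0] * (n_states + 1)
    down = [0] * (n_states + 1)
    hold = [0] * (n_states + 1)
    for cur, nxt in zip(states, states[1:]):
        if nxt > cur:
            up[cur] += 1
        elif nxt < cur:
            down[cur] += 1
        else:
            hold[cur] += 1
    return up, down, hold
```

A chain of L data points yields L - 1 consecutive pairs, so the three lists together sum to L - 1.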

step 3.4: according to the extracted transition counts and $k_i$, the state transition probabilities and the state transition matrix are calculated as follows:

step 3.4.1: when the state is i=1, the corresponding state transition probability and state retention probability are:

wherein $p_{1,1}$ is the probability of transition from state 1 to state 1, and $p_{1,2}$ is the probability of transition from state 1 to state 2;

step 3.4.2: when the state is 1 < i < N, the corresponding state transition probability and state retention probability are as follows:

wherein $p_{i,i-1}$ is the probability of transition from state i to state i-1, $p_{i,i}$ is the probability of transition from state i to state i, and $p_{i,i+1}$ is the probability of transition from state i to state i+1;

step 3.4.3: when the state is i = N, the corresponding state transition probability and state retention probability are:

wherein $p_{N,N-1}$ is the probability of transition from state N to state N-1, and $p_{N,N}$ is the probability of transition from state N to state N;

step 3.4.4: the state transition matrix can be expressed as:
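The matrix itself is not reproduced in this text. Since steps 3.4.1 to 3.4.3 define only the entries $p_{i,i-1}$, $p_{i,i}$ and $p_{i,i+1}$, a tridiagonal layout would be consistent with them. The sketch below builds such a matrix from per-state counts, assuming that each probability is the corresponding count divided by the total number of transitions out of that state; the count lists are 1-based with index 0 unused, which is an illustrative convention.

```python
def transition_matrix(up, down, hold):
    """Build an N x N tridiagonal transition matrix from per-state counts.

    Assumed reading: p = count / (up + down + hold) for each state.
    Jumps of more than one state are folded into the adjacent entry.
    """
    n = len(hold) - 1
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(1, n + 1):
        total = up[i] + down[i] + hold[i]
        if total == 0:
            continue  # state never left or held: keep an all-zero row
        matrix[i - 1][i - 1] = hold[i] / total          # p_{i,i}
        if i > 1:
            matrix[i - 1][i - 2] = down[i] / total      # p_{i,i-1}
        if i < n:
            matrix[i - 1][i] = up[i] / total            # p_{i,i+1}
    return matrix
```

Under this reading every row with at least one observed transition sums to 1.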

Step 4: preprocessing the state transition condition and the state transition matrix, and combining them with the partial features that are not processed by the Markov chain, so as to jointly form the data features used as input to the classification neural network; the process is as follows:

step 4.1: after the part of the data that is not processed by the Markov chain, such as torque and power, has been standardized, it is combined with the state transition matrix to jointly form the feature vector of the high/low-risk user screening neural network;

step 4.2: adopting an S-fold cross-validation model, 75% of the data are randomly selected to form the training set and the remaining 25% form the test set.
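A 75%/25% split corresponds to holding out one fold of four. The sketch below assumes S = 4, which the text does not state, and uses a fixed seed only to make the illustration repeatable; the function name and parameters are hypothetical.

```python
import random


def s_fold_split(n_samples, n_folds=4, test_fold=0, seed=0):
    """Shuffle sample indices, cut them into folds, hold one fold out."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Fold f takes every n_folds-th shuffled index starting at offset f
    folds = [idx[f::n_folds] for f in range(n_folds)]
    test = folds[test_fold]
    train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    return train, test
```

Rotating `test_fold` from 0 to 3 gives the four train/test pairs of a full cross-validation run.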

Step 5: in the training stage, the feature size is compressed by using a deep convolutional neural network, feature dimensions are enriched, main features are extracted, feature vectors output by the last layer of the feature network are input to a full-connection layer, and a classification result of high and low risks of a user is obtained after softmax normalization, and the process is as follows:

step 5.1: the three convolution layers of the neural network perform local feature extraction and combination on the feature vectors by using convolution kernels with shared parameters over multiple feature dimensions; a standard convolution output matrix $Y = (y_{ij})$ can be calculated from the input feature matrix $X = (x_{ij})$ and the convolution kernel matrix $W = (w_{ij})$ as:

wherein m, n are the position coordinates in the weight matrix, i, j are the position coordinates in the input feature matrix, $w_{mn}$ is the filter weight at position (m, n), $x_{i+m,j+n}$ is the input feature element that the filter processes at position (i, j), and K is the convolution kernel size;
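The convolution formula is not reproduced in this text. The symbol definitions suggest the usual sum of $w_{mn} \cdot x_{i+m,j+n}$ over the K x K kernel window. The sketch below implements that reading in plain Python; zero-based indices, stride 1 and no padding are assumptions.

```python
def conv2d_valid(x, w):
    """y[i][j] = sum over m, n of w[m][n] * x[i+m][j+n] for a K x K kernel.

    Assumptions: zero-based m, n in 0..K-1, stride 1, no padding.
    """
    k = len(w)
    rows = len(x) - k + 1
    cols = len(x[0]) - k + 1
    return [
        [
            sum(w[m][n] * x[i + m][j + n] for m in range(k) for n in range(k))
            for j in range(cols)
        ]
        for i in range(rows)
    ]
```

Because the same kernel `w` is applied at every position (i, j), the parameters are shared across the whole input.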

step 5.2: the two max pooling layers of the neural network extract key information by taking the point with the largest value in each local receptive field, thereby compressing the features;
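Max pooling as described in step 5.2 can be sketched as follows. The 2 x 2 non-overlapping window is an assumption, since the text does not give the pooling size or stride.

```python
def max_pool2d(x, size=2):
    """Keep the largest value of each non-overlapping size x size window."""
    rows = len(x) // size
    cols = len(x[0]) // size
    return [
        [
            max(x[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(cols)
        ]
        for i in range(rows)
    ]
```

With a window of 2, each pooling layer halves both dimensions of the feature map.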

step 5.3: the neural network feeds the feature vector output by the last convolution layer into the fully connected layers, which connect the extracted local features through a weight matrix and map them back to a global representation; two fully connected layers are used to improve the nonlinear expressive capacity of the model, Dropout is used to prevent overfitting, and the high/low-risk classification result for the user is obtained through softmax normalization.

Step 6: calculating the cross-entropy loss and minimizing the loss function through stochastic gradient descent, thereby updating the network model parameters and achieving a better high/low-risk user screening effect.
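To make the loss and update concrete, the sketch below applies softmax cross-entropy and one stochastic gradient descent step to a single linear layer. This is a simplification for illustration and not the full convolutional network; the learning rate and the function names are assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def sgd_step(weights, bias, x, label, lr=0.1):
    """One SGD update on a single sample; returns the loss before the update."""
    logits = [
        sum(wi * xi for wi, xi in zip(row, x)) + bk
        for row, bk in zip(weights, bias)
    ]
    probs = softmax(logits)
    loss = cross_entropy(probs, label)
    for k in range(len(weights)):
        # Gradient of the loss with respect to logit k: p_k - 1[k == label]
        grad = probs[k] - (1.0 if k == label else 0.0)
        for d in range(len(x)):
            weights[k][d] -= lr * grad * x[d]
        bias[k] -= lr * grad
    return loss
```

Repeating `sgd_step` on the same sample lowers its loss, which is the behavior that step 6 relies on across the whole training set.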

The beneficial effects of adopting the above technical scheme are as follows:

1. The invention designs a lightweight network that extracts and combines local features by convolution and compresses the features by max pooling, so as to suit the task of screening high-risk and low-risk users.

2. To address the noise problem of the data, the invention provides a data processing method using a Markov chain: taking the characteristics of the data into account, the time series data are divided into a plurality of states, a Markov chain is established, and the state transition condition and the state transition matrix are generated to form part of the input data of the neural network.

3. The invention reasonably combines deep learning from the field of artificial intelligence with high/low-risk user screening and Markov chain data processing. It can extract the most effective classification features from a large number of features even when the data are noisy and poorly discriminative, and the Markov chain data processing realizes a transformation of the feature-space dimension, thereby obtaining a group of classification features that are invariant across samples of the same class and discriminative between samples of different classes.

Drawings

FIG. 1 is a flow chart of a method for screening risk users of an automobile with Markov chain data processing in an embodiment of the invention.

Detailed Description

The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

As shown in FIG. 1, the method of this embodiment is as follows.

In this embodiment, data such as the condition of the vehicle, the behavior of the driver, and the environment outside the vehicle during the driving process of the driver are obtained from the vehicle driving system and analyzed and processed.

wherein $\sigma_{ijk}$ and the accompanying mean term are, respectively, the variance and the mean of the data in the kth time period of the grid in the ith row and jth column of the city, $x_k$ is the unprocessed data of the grid in the kth time period, and $x'_k$ is the processed data of the grid in the kth time period.

In this embodiment, four possible grid division standards are determined. Under the four standards, the numbers of grids into which the city is divided by longitude and latitude are respectively:

standard 1: 38 and 27;

standard 2: 18 and 15;

standard 3: 12 and 8;

standard 4: 9 and 7.

step 3.4.4: the state transition matrix can be expressed as:

In this embodiment, taking the driver's speed data from the vehicle driving system as an example, the speed data are divided into 7 states, and the state transition matrix is as follows:

In this embodiment, the jointly formed feature vector of the high/low-risk user screening neural network has a dimension of 175.

wherein m, n are the position coordinates in the weight matrix, i, j are the position coordinates in the input feature matrix, $w_{mn}$ is the filter weight at position (m, n), $x_{i+m,j+n}$ is the input feature element that the filter processes at position (i, j), and K is the convolution kernel size;

In this embodiment, three convolutional layers are defined, each followed by an activation function layer and a max pooling layer.

In this embodiment, the loss is calculated using softmax cross entropy. In the test stage, the embodiment achieves a classification accuracy of 86.46%, wherein the screening accuracy of high-risk users is 89.74%, and the screening accuracy of low-risk users is 79.95%.

Claims

1. A method for screening automobile risk users with Markov chain data processing, comprising the steps of:

step 1: reading driving behavior related data from a database, preprocessing the data according to longitude and latitude and data acquisition time acquired by a GPS, and improving the confidence coefficient and reliability of the data;

step 2: according to specific position information of each city, performing gridding division according to longitude and latitude and time data, counting vehicle operation data in driving behaviors of each small area in each time period, and performing data processing;

step 3: dividing the time series data into a plurality of states according to the processed feature data, determining an interval division form that does not bias the measured distribution, and counting the state transitions and the state transition matrix of the data within the divided intervals;

step 4: preprocessing a state transition condition and a state transition matrix, combining partial characteristics which are not processed by a Markov chain, and forming data characteristics together for classifying neural network input;

step 5: in the training stage, the feature size is compressed by using a deep convolutional neural network, feature dimensions are enriched, main features are extracted, feature vectors output by the last layer of the feature network are input to a full-connection layer, and a classification result of high risk and low risk of a user is obtained after softmax normalization;

said step 5 comprises the steps of:

step 5.3: the neural network inputs the feature vector output by the last convolution layer into a full-connection layer, connects the extracted local features through a weight matrix, maps the extracted local features back to the global, adopts two layers of full-connection layers to improve the nonlinear expression capacity of the model, uses Dropout to prevent the model from being over-fitted, and obtains a classification result of high and low risks of the user through softmax normalization;

2. The method for screening automobile risk users with Markov chain data processing of claim 1, wherein: the step 1 comprises the following steps:

3. The method for screening automobile risk users with Markov chain data processing of claim 1, wherein: the step 2 comprises the following steps:

step 2.1.1: assuming that the maximum and minimum longitudes of the city are max(X) and min(X) respectively, that the maximum and minimum latitudes are max(Y) and min(Y) respectively, and that the side length of the city grid is set as $r_i$, $i = 1, 2, 3, \dots, m$, where m denotes the number of possible standards for dividing the grid, the numbers of grids into which the city is divided by longitude and latitude are respectively:

4. The method for screening automobile risk users with Markov chain data processing of claim 1, wherein: the step 3 comprises the following steps:

step 3.4: according to the extracted transition counts and $k_i$, the state transition probabilities and the state transition matrix are calculated.

5. The method for screening automobile risk users with Markov chain data processing of claim 4, wherein: the step 3.4 comprises the following steps:

step 3.4.4: the state transition matrix is expressed as:

6. The method for screening automobile risk users with Markov chain data processing of claim 1, wherein: the step 4 comprises the following steps: