CN106095963B

CN106095963B - Big data public service platform for vehicle driving behavior analysis in the Internet+ era

Info

Publication number: CN106095963B
Application number: CN201610430864.6A
Authority: CN
Inventors: 文巨峰; 罗赞文; 穆祥强; 许伟
Original assignee: Shanghai Jingda Information Technology Co Ltd
Current assignee: Shanghai Jingda Information Technology Co Ltd
Priority date: 2016-06-17
Filing date: 2016-06-17
Publication date: 2019-12-06
Anticipated expiration: 2036-06-17
Also published as: CN106095963A

Abstract

A big data public service platform for vehicle driving behavior analysis in the Internet+ era analyzes vehicle driving behavior and vehicle driving risk probability in three steps. Step one stores the driving data of a vehicle over a period of time on a big data platform and processes and analyzes the data with the MapReduce framework. Step two, based on the data processed on the big data platform, counts the number of abnormal driving events of the vehicle in a given area and time period, where abnormal driving comprises overspeed, overload, rapid acceleration and rapid deceleration, fatigue driving, and night driving. Step three, after the abnormal driving behaviors have been counted, calculates the vehicle risk probability with a self-organizing map neural network algorithm and predicts the subsequent driving risk of the vehicle. The invention collects the Beidou/GPS navigation data of the vehicle, preprocesses the collected data on the big data platform, and analyzes and judges the driving behavior of the driver through further data mining.

Description

Big data public service platform for vehicle driving behavior analysis in the Internet+ era

Technical field:

The invention relates to the field of physics, in particular to data storage and processing technology, and specifically to a big data public service platform for vehicle driving behavior analysis in the Internet+ era.

Background art:

As informatization advances, daily life is filled with massive amounts of information, some of it closely tied to people's lives. Analyzing and mining this information can improve daily life, serve the public, raise the level of social informatization, and open new opportunities for smart cities and dynamic, information-based management.

As the number of vehicles on the road increases, the risk of personal injury and property loss caused by illegal driving also increases. The risk of traffic violations differs from region to region, so it cannot simply be averaged. Because the environment in which a vehicle is driven must be taken into account, the driving area is first divided into regions and the risk probability is then analyzed for each region. In the prior art, vehicle driving behavior cannot be analyzed from the massive information available in daily life.

Summary of the invention:

The invention aims to provide a big data public service platform for vehicle driving behavior analysis in the Internet+ era, in order to solve the technical problem that, in the prior art, vehicle driving behavior cannot be analyzed from the massive information available in daily life.

The big data public service platform for vehicle driving behavior analysis in the Internet+ era analyzes vehicle driving behavior and vehicle driving risk probability through the following steps:

Step one: store the driving data of the vehicle over a period of time on a big data platform, and process and analyze the data with the MapReduce framework.

Step two: based on the data processed on the big data platform, count the number of abnormal driving events of the vehicle in a given area and time period, where abnormal driving comprises overspeed, overload, rapid acceleration and rapid deceleration, fatigue driving, and night driving.

Step three: after the abnormal driving behaviors have been counted, calculate the vehicle risk probability with a self-organizing map neural network algorithm and predict the subsequent driving risk of the vehicle.

Further, processing and analyzing the vehicle driving data comprises: first, establishing a data storage layer, whose main schemes are a storage scheme based on the Hadoop Distributed File System (HDFS) and a storage scheme based on HDFS plus the distributed, column-oriented open-source database HBase; second, establishing a data processing layer, in which the raw data is processed with the large-scale parallel computing framework MapReduce (MR for short); third, establishing a data analysis layer, which further analyzes, models, and mines the data; fourth, establishing a data visualization layer; and fifth, establishing a security layer, which provides a security protection system for the big data platform. The platform exposes a unified service interface to the outside, and users access the big data service platform through a unified data access interface or a unified portal page.

Furthermore, a hidden Markov algorithm is used to find the road network segment on which the vehicle is traveling, the information for that road segment is looked up in the map data, and finally the driving speed from the positioning data is input into a space-time speed limit model to determine whether the vehicle is overspeeding.
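
The final comparison in the overspeed check can be illustrated with a minimal Python sketch. It assumes the hidden Markov map matching has already produced a road type for each positioning record. The `SPEED_LIMITS` table, the day/night split, and the names `Fix`, `time_band`, and `is_overspeed` are hypothetical; the source names a space-time speed limit model but gives no values for it.

```python
from dataclasses import dataclass

# Hypothetical speed limits in km/h, keyed by (road type, time band).
# These numbers are illustrative assumptions, not values from the source.
SPEED_LIMITS = {
    ("expressway", "day"): 120.0,
    ("expressway", "night"): 100.0,
    ("urban", "day"): 60.0,
    ("urban", "night"): 50.0,
}


@dataclass
class Fix:
    road_type: str    # road type of the matched segment (map-matching output)
    hour: int         # local hour of the positioning record, 0-23
    speed_kmh: float  # speed reported in the positioning record


def time_band(hour: int) -> str:
    # Assumed split: night runs from 22:00 to 05:59.
    return "night" if hour >= 22 or hour < 6 else "day"


def is_overspeed(fix: Fix) -> bool:
    # Look up the limit for this place and time, then compare.
    limit = SPEED_LIMITS[(fix.road_type, time_band(fix.hour))]
    return fix.speed_kmh > limit
```

Keying the limit on both road type and time band is one simple way to make the limit depend on space and time; a production model would likely use per-segment limits from the map data.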

Further, a support vector machine algorithm is implemented on the big data platform as follows. First, the raw data is standardized to eliminate the influence of dimensions. Second, the average speed, average acceleration, and starting acceleration are calculated from the time, speed, and mileage in the raw data and used as training samples. Third, a large number of training samples are uploaded to HDFS and partitioned into blocks; each node starts a Hadoop Map task that runs sequential minimal optimization (SMO for short) training on the data in its block to compute the support vectors of that block, and the support vectors trained at each node are then input into a Hadoop Reduce task and submitted to SMO training to obtain the final support vectors. Fourth, a classifier is obtained from the support vectors of the third step and used to judge whether the vehicle is overloaded.
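
The feature preparation that precedes SVM training can be sketched in Python as follows. The function names are hypothetical, and two points are assumptions: "starting acceleration" is taken as the acceleration over the first sampling interval, and z-score standardization is applied to the derived features rather than to the raw records. The SMO training itself is not shown.

```python
from statistics import mean, pstdev


def trip_features(times, speeds, mileages):
    """Derive [average speed, average acceleration, starting acceleration].

    times in seconds, speeds in m/s, mileages as cumulative metres.
    """
    duration = times[-1] - times[0]
    avg_speed = (mileages[-1] - mileages[0]) / duration
    avg_accel = (speeds[-1] - speeds[0]) / duration
    # Assumption: starting acceleration is measured over the first interval.
    start_accel = (speeds[1] - speeds[0]) / (times[1] - times[0])
    return [avg_speed, avg_accel, start_accel]


def standardize(rows):
    """Z-score each column so that features with different units are comparable."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # A constant column has zero deviation; divide by 1.0 to avoid a zero division.
    sigmas = [pstdev(c) or 1.0 for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]
```

Each standardized row would then serve as one training sample for the classifier.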

Further, a linear regression model is established on the big data platform as follows. First, the big data platform is used to find the driving data of all vehicles on the road segment, and the speed difference between every two adjacent data points is calculated. Second, a statistical linear regression algorithm is used to obtain a model S(Δv, v0) relating the driving distance of a vehicle on the road segment to its speed change Δv and initial speed v0. Third, the regression equation is solved. Finally, the equation is used to judge the probability that a vehicle is accelerating or decelerating rapidly. The rule is: the larger the error e = ||S - Si|| between the distance Si of the i-th vehicle and the model distance S, the larger the probability of rapid acceleration or rapid deceleration; the smaller the error, the smaller the probability.

Specifically, the big data platform is used to find, in the raw data, the driving data of all vehicles at all times on a given road segment. The data is processed to obtain the speed difference between every two adjacent points, and a regression model is then obtained by regression analysis. The regression model yields the probability that a vehicle on the road segment is accelerating or decelerating rapidly. Let Δv be the speed difference between two time points on the same road segment, and v0 be the initial speed of the vehicle at each time point. Since the distance traveled by the vehicle over a period of time is linear in the speed difference and the initial speed, the regression model can be written as

S = b0 + b1·Δv + b2·v0

where b0, b1, and b2 are the regression coefficients, with b1 and b2 the partial regression coefficients of the variables Δv and v0.

The values of the regression coefficients are obtained from training data by the least squares principle, and the resulting model is tested by analysis of variance. Finally, the probability of rapid acceleration or rapid deceleration is obtained from the size of the error between the actual and the fitted driving distance, where S is the actual driving distance of the vehicle and the fitted value is the driving distance that satisfies the regression model.
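
The least squares fit of a model linear in Δv and v0, and the distance error used to judge rapid acceleration or deceleration, can be sketched in plain Python. This is a single-machine illustration, not the platform's distributed implementation; the function names are hypothetical, and how the error maps to a probability is not specified in the source, so the sketch stops at the error itself.

```python
def fit_distance_model(samples):
    """Fit S = b0 + b1*delta_v + b2*v0 by least squares.

    samples: list of (delta_v, v0, s) tuples. Returns (b0, b1, b2).
    Assumes the samples are not degenerate, so the 3x3 system is solvable.
    """
    rows = [(1.0, dv, v0) for dv, v0, _ in samples]
    ys = [s for _, _, s in samples]
    # Normal equations: (X^T X) b = X^T y.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    m = [a[i] + [rhs[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    # Back substitution.
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        b[i] = (m[i][3] - sum(m[i][j] * b[j] for j in range(i + 1, 3))) / m[i][i]
    return tuple(b)


def distance_error(model, delta_v, v0, s_actual):
    """Absolute gap between the actual distance and the fitted distance."""
    b0, b1, b2 = model
    return abs(s_actual - (b0 + b1 * delta_v + b2 * v0))
```

A larger `distance_error` for a given vehicle would indicate a higher likelihood of rapid acceleration or deceleration under the rule stated above.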

Further, calculating the vehicle risk probability comprises: first, measuring the risk probability; second, establishing a risk probability model. Historical data is preprocessed and clustered to obtain l clusters, Cn = {I1, I2, ..., Il}, where Ii is one of the clusters. A prediction model is then constructed from the clustering result, which includes calculating the average risk probability of each cluster and the similarity coefficient between each feature vector and the average vector, and the risk probability is predicted.

Specifically, the risk probability is the probability, expressed as a percentage, that a risk occurs; it measures how likely a given state is to occur. Driving data from vehicles in the same area is highly similar, while driving data from different areas differs considerably. Taking each area as one cluster, the vehicle driving data is clustered by feature points to obtain the driving data of the different areas. Each feature point (feature vector) corresponds to a point in a cluster. Assuming that the i-th feature vector in the cluster is Xi (i ∈ N), the risk probability can be expressed as the ratio of U(Nr(Xi)) to ||Nr(Xi)||,

where Nr(Xi) is the neighborhood of the point Xi within radius r, ||Nr(Xi)|| is the number of all points in that neighborhood, and U(Nr(Xi)) is the number of safety risk occurrences in the cluster. It can be seen from equation (2) that the risk probability is the average probability of all points in the cluster. It can also be expressed as

where the additional term represents the similarity between the feature point Xi and the average vector of all points.

The determination of the risk probability prediction model comprises three parts:

Preprocessing of historical data

(1) Data attributes are abstracted from the historical data. The vehicle driving behavior data comprises the license plate number, longitude and latitude, driving speed, driving direction, and driving time.

(2) The historical data is clustered to obtain l clusters, Cn = {I1, I2, ..., Il}, where Ii is one of the clusters. Assume that cluster Ii contains k feature vectors, with Xij denoting the j-th feature vector in the cluster.

Construction of the prediction model

After clustering, the risk probability of the feature vectors in each cluster is calculated as follows:

(1) Calculate the average risk probability in cluster Ii.

where |Ii| is the number of elements in cluster Ii, and the indicator for a point equals 1 if that point represents a risk occurrence and 0 otherwise.

(2) Calculate the average feature vector of all points in cluster Ii.

(3) Calculate the similarity coefficient of each feature vector in the cluster.

(4) Calculate the risk probability of each feature vector.
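
Steps (1) and (2) can be illustrated with a short Python sketch. The function names are hypothetical, and each point is assumed to carry a 0/1 flag marking whether it is a recorded risk occurrence. Steps (3) and (4) are not shown because their formulas are not reproduced in this text.

```python
def average_risk(flags):
    """Average risk probability of a cluster.

    flags: one entry per point, 1 if the point is a risk occurrence, else 0.
    """
    return sum(flags) / len(flags)


def mean_vector(vectors):
    """Component-wise average of the feature vectors in a cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]
```

For a cluster of four points with flags `[1, 0, 0, 1]`, `average_risk` returns 0.5, meaning that half of the points in the cluster are recorded risk occurrences.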

Risk probability prediction

Once the risk probability model has been computed, new vehicle data can be predicted as follows:

(1) Perform cluster analysis on the new vehicle driving data Xp.

(2) Clustering assigns the new data to one of the clusters; the similarity coefficient between the data and the average vector of that cluster is then calculated.

(3) The risk probability of the new data point is then calculated as:

Further, in the step of clustering the data, a Hadoop-based self-organizing map neural network clustering algorithm is adopted, and the data is computed with the MapReduce parallel computing framework.

Further, in the step of computing the data with the MapReduce parallel computing framework, the winning neuron and the weight corrections are calculated in the Map stage; in the Reduce stage, the total update of each weight connected to the same winning neuron is accumulated, and the weights are then updated.

Further, in the algorithm implementing the support vector machine, the similarity between two vectors is measured with the Euclidean distance metric.

Specifically, the Hadoop-based self-organizing map (SOM) neural network clustering algorithm described above uses competitive learning to adapt and reorganize its internal structure on its own. After receiving external input, the network automatically divides into different regions, each with different response characteristics to the input, and this process is completed automatically. The steps of the algorithm are as follows:

(1) Initialization and normalization. The current input vector and the neuron weight vectors Wi in the competition layer are normalized, an initial winning neighborhood is established, and the gain μ(0) is assigned an initial value.

(2) Finding the winning neuron. The input vector is compared for similarity with all the corresponding weight vectors in the competition layer, and the neuron with the most similar weight vector is taken as the winning neuron. Mathematically, the more similar two vectors are, the greater their dot product, which can be expressed by the following equation:

(3) Adjusting the node weights within the winning neighborhood:

where μ(t) is the gain, which decreases to 0 over time.

(4) If the gain has dropped to the set threshold μmin, the process ends; otherwise, the next input is taken and a new round of calculation starts from the first step.
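
The four steps above can be sketched as a small single-machine SOM in Python. Several details are assumptions because the source's equations are not reproduced here: the competition layer is one-dimensional, the winning neighborhood is a fixed radius around the winner, the update is w + μ(x - w) followed by renormalization, and the gain decays geometrically. All names and default values are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def find_winner(weights, x):
    """Index of the weight vector with the largest dot product with x."""
    return max(range(len(weights)),
               key=lambda i: sum(w * a for w, a in zip(weights[i], x)))


def train_som(samples, n_neurons, mu0=0.5, mu_min=0.01, decay=0.9,
              radius=1, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0])
    # Step 1: normalize the inputs and the randomly initialized weights.
    weights = [normalize([rng.random() for _ in range(dim)])
               for _ in range(n_neurons)]
    data = [normalize(s) for s in samples]
    mu = mu0
    # Step 4: stop once the gain has dropped to the threshold.
    while mu > mu_min:
        for x in data:
            win = find_winner(weights, x)                 # step 2
            lo = max(0, win - radius)
            hi = min(n_neurons - 1, win + radius)
            for i in range(lo, hi + 1):                   # step 3
                moved = [w + mu * (a - w) for w, a in zip(weights[i], x)]
                weights[i] = normalize(moved)
        mu *= decay  # assumed geometric decay of the gain
    return weights
```

After training, each weight vector acts as a cluster center, and `find_winner` assigns a normalized sample to its cluster.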

The algorithm runs on the Hadoop platform, and the implementation steps are as follows:

Map phase

Input the key-value pair <key, value>, where key is the sample sequence number and value is the sample feature set.

Calculate the output values of the competition layer according to the competitive learning principle.

Determine the winning neuron, set its state to 1 and the states of all other neurons to 0, and set flag to the position of the winning neuron in the competition layer.

Calculate the update for each weight connected to the winning neuron, leaving the other weights unchanged.

Output the key-value pair, where key is flag and value is the update of the weights connected to the winning neuron.

Reduce phase

Accept the output of the map function as the input of the reduce function, where the input key is the position of the winning neuron and values is the collection of weight updates connected to that winning neuron.

Accumulate the updates to obtain the total update of each weight, and calculate the average update.

Adjust the corresponding network weights.

Output the key-value pair, where key is the weight update state and value is the updated network weight.

Update the global vector and repeat the MapReduce process until the error between the weight vectors and the winning samples is smaller than a set threshold.
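
One Map/Reduce round of this scheme can be simulated in plain Python without Hadoop. The function names are hypothetical, the weight update μ(x - w) is an assumed form, and the reduce output is keyed by the winner position rather than by a "weight update state", which the source does not define further.

```python
from collections import defaultdict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def som_map(sample, weights, mu):
    """Map: find the winner and emit (flag, weight update) for one sample."""
    flag = max(range(len(weights)), key=lambda i: dot(weights[i], sample))
    delta = [mu * (x - w) for x, w in zip(sample, weights[flag])]
    return flag, delta


def som_reduce(flag, deltas, weights):
    """Reduce: average all updates for one winner and apply them."""
    n = len(deltas)
    avg = [sum(col) / n for col in zip(*deltas)]
    return flag, [w + d for w, d in zip(weights[flag], avg)]


def som_iteration(samples, weights, mu):
    """Run one simulated MapReduce round and return the new global weights."""
    grouped = defaultdict(list)  # stands in for the shuffle: group by winner
    for s in samples:
        flag, delta = som_map(s, weights, mu)
        grouped[flag].append(delta)
    new_weights = [list(w) for w in weights]
    for flag, deltas in grouped.items():
        _, new_weights[flag] = som_reduce(flag, deltas, weights)
    return new_weights
```

Calling `som_iteration` repeatedly, with the returned weights fed into the next round, mirrors the repeated MapReduce process described above.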

It can be seen from the above steps that, through learning, the SOM network adaptively clusters the samples according to their intrinsic relationships. The output neuron weight vectors Wi gradually approach the winning samples, so the set of weight vectors {Wi | i = 1, 2, ..., l} can be regarded as the cluster centers of all samples. The similarity measure is described as follows:

Suppose a feature vector X = {x1, x2, ..., xn} belongs to cluster Ii in the feature space. The similarity coefficient between the feature vector and the average vector (cluster center) can be expressed as

It can be seen from equation (9) that the similarity between vectors is measured using the Euclidean distance: the more similar the two vectors, the larger the value, with a maximum of 1.
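
Equation (9) itself is not reproduced in this text. As an illustration only, the Python sketch below uses the assumed form 1 / (1 + d), where d is the Euclidean distance. This form has the stated properties (larger for more similar vectors, maximum of 1 for identical vectors) but may differ from the formula in the original patent.

```python
import math


def similarity(x, center):
    """Similarity in (0, 1] based on Euclidean distance.

    Assumed form 1 / (1 + d); not taken from the source.
    """
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, center)))
    return 1.0 / (1.0 + d)
```

For example, a vector identical to the cluster center has similarity 1.0, and the value shrinks toward 0 as the distance grows.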

Compared with the prior art, the invention has a positive and evident effect. The invention collects the Beidou/GPS navigation data of the vehicle, preprocesses the collected data on the big data platform, and analyzes and judges the driving behavior of the driver through further data mining.

Description of the drawings:

Fig. 1 is a framework diagram of the big data public service platform for vehicle driving behavior analysis in the Internet+ era of the present invention.

Fig. 2 is a flow chart of the self-organizing map neural network algorithm of the present invention.

Detailed description of the embodiments:

Example 1:

As shown in Fig. 1 and Fig. 2, the big data public service platform for vehicle driving behavior analysis in the Internet+ era of the present invention comprises four parts: a data storage layer, a data processing layer, a data analysis layer, and a data visualization layer.

To judge whether a vehicle is overspeeding: first, a hidden Markov algorithm is used to find the optimal road segment on which the vehicle is traveling; second, the map data is used to find the speed limit corresponding to the road type of that segment; and third, the driving speed in the collected data is compared with the speed limit of the road segment to judge whether the vehicle is overspeeding.

To judge whether a vehicle is overloaded: first, the raw data is normalized to eliminate the influence of dimensions; second, the average speed, average acceleration, and starting acceleration are calculated from the time, speed, and mileage in the raw data and used as training samples; third, a classifier is trained with the training samples from the second step; and fourth, the classifier is used to judge whether the vehicle is overloaded.

To judge rapid acceleration and rapid deceleration: the big data platform is first used to find, in the raw data, the driving data of all vehicles at all times on a given road segment. The data is processed to obtain the speed difference between every two adjacent points, and a regression model is then obtained by regression analysis. The regression model yields the probability that a vehicle on the road segment is accelerating or decelerating rapidly. Let Δv be the speed difference between two time points on the same road segment, and v0 be the initial speed of the vehicle at each time point. Because the distance traveled by the vehicle over a period of time is linear in the speed difference and the initial speed, the regression model is:

S = b0 + b1·Δv + b2·v0

The risk probability is the probability, expressed as a percentage, that a risk occurs; it measures how likely a given state is to occur. Driving data from vehicles in the same area is highly similar, while driving data from different areas differs considerably. Taking each area as one cluster, the vehicle driving data is clustered by feature points to obtain the driving data of the different areas. Each feature point (feature vector) corresponds to a point in a cluster. Assuming that the i-th feature vector in the cluster is Xi (i ∈ N), the risk probability can be expressed as the ratio of U(Nr(Xi)) to ||Nr(Xi)||,

where Nr(Xi) is the neighborhood of the point Xi within radius r, ||Nr(Xi)|| is the number of all points in that neighborhood, and U(Nr(Xi)) is the number of safety risk occurrences in the cluster. It can be seen from (1) that the risk probability is the average probability of all points in the cluster. It can also be expressed as

The risk probability prediction model is divided into three parts:

Preprocessing of historical data

Data attributes are abstracted from the historical data. The vehicle driving behavior data comprises the license plate number, longitude and latitude, driving speed, driving direction, and driving time.

The historical data is clustered to obtain l clusters, Cn = {I1, I2, ..., Il}, where Ii is one of the clusters. Assume that cluster Ii contains k feature vectors, with Xij denoting the j-th feature vector in the cluster.

Construction of the prediction model

(1) Calculate the average risk probability in cluster Ii.

(2) Calculate the average feature vector of all points in cluster Ii.

(4) Calculate the risk probability of each feature vector.

Risk probability prediction

Perform cluster analysis on the new vehicle driving data Xp.

Clustering assigns the new data to one of the clusters; the similarity coefficient between the data and the average vector of that cluster is then calculated.

The risk probability of the new data point is calculated as:

The key to solving the risk probability model of the data points is how to cluster the historical data with a clustering algorithm. A self-organizing map (SOM) neural network clustering algorithm is adopted here. The algorithm uses competitive learning to adapt and reorganize its internal structure on its own: after receiving external input, the network automatically divides into different regions, each with different response characteristics to the input, and this process is completed automatically. The steps of the algorithm are as follows:

Initialization and normalization. The current input vector and the neuron weight vectors Wi in the competition layer are normalized, an initial winning neighborhood Ni(0) is established, and the gain μ(0) is assigned an initial value.

Finding the winning neuron. The input vector is compared for similarity with all the corresponding weight vectors in the competition layer, and the neuron with the most similar weight vector is taken as the winning neuron. Mathematically, the more similar two vectors are, the greater their dot product, which can be expressed by the following equation:

Adjusting the node weights within the winning neighborhood:

where μ(t) is the gain, which decreases to 0 over time.

If the gain has dropped to the set threshold μmin, the process ends; otherwise, the next input is taken and a new round of calculation starts from the first step.

Map phase

Input the key-value pair <key, value>, where key is the sample sequence number and value is the sample feature set.

Reduce phase

Adjust the corresponding network weights.

It can be seen from the above steps that, through learning, the SOM network adaptively clusters the samples according to their intrinsic relationships. The output neuron weight vectors Wi gradually approach the winning samples, so the set of weight vectors {Wi | i = 1, 2, ..., l} can be regarded as the cluster centers of all samples. The similarity measure is as follows:

Suppose a feature vector X = {x1, x2, ..., xn} belongs to cluster Ii in the feature space. The similarity coefficient between the feature vector and the average vector (cluster center) can be expressed as

Claims

1. A big data public service platform for vehicle driving behavior analysis in the Internet+ era, characterized in that the analysis of vehicle driving behavior and vehicle driving risk probability comprises the following steps:

Step three: after the abnormal driving behaviors have been counted, calculating the vehicle risk probability with a self-organizing map neural network algorithm and predicting the subsequent driving risk of the vehicle;

wherein: first, a data storage layer is established, with a storage scheme based on the Hadoop distributed file system and a storage scheme based on the Hadoop distributed file system plus a distributed column-oriented open-source database; second, a data processing layer is established, in which the raw data is processed with a large-scale parallel computing framework; third, a data analysis layer is established, which further analyzes, models, and mines the data; fourth, data visualization is established; fifth, a security layer is established, which provides a security protection system for the big data platform; a unified service interface is provided to the outside, and users access the big data service platform through a unified data access interface or a unified portal page;

a hidden Markov algorithm is used to find the road network segment on which the vehicle is traveling, the information for that road segment is looked up in the map data, and finally the driving speed from the positioning data is input into a space-time speed limit model to determine whether the vehicle is overspeeding;

a support vector machine algorithm is implemented on the big data platform, comprising: first, standardizing the raw data to eliminate the influence of dimensions; second, calculating the average speed, average acceleration, and starting acceleration from the time, speed, and mileage in the raw data and using them as training samples; third, uploading a large number of training samples to the Hadoop distributed file system and partitioning them into blocks, each node starting a Map task of the Hadoop distributed processing platform that runs sequential minimal optimization training on the data in its block to compute the support vectors of that block, and finally inputting the support vectors trained at each node into a Reduce task of the Hadoop distributed processing platform and submitting them to SMO training to obtain the final support vectors; fourth, obtaining a classifier from the support vectors of the third step and using the classifier to judge whether the vehicle is overloaded;

a linear regression model is established on the big data platform, comprising: first, using the big data platform to find the driving data of all vehicles on the road segment and calculating the speed difference between every two adjacent data points; second, using a statistical linear regression algorithm whose inputs are the initial speed v0 of a vehicle on a given road and its speed change Δv from that initial speed to the next driving time, and whose output is the driving distance S of the vehicle in that time, with the data of all vehicles that have driven on the road being counted;

calculating the vehicle risk probability comprises: first, measuring the risk probability; second, establishing a risk probability model, in which historical data is preprocessed and clustered to obtain l clusters, Cn = {I1, I2, ..., Il}, where Ii is one of the clusters; a prediction model is constructed from the clustering result, which includes calculating the average risk probability of each cluster and the similarity coefficient between each feature vector and the average vector; and the risk probability is predicted.

2. The big data public service platform for vehicle driving behavior analysis in the Internet+ era according to claim 1, characterized in that: in the step of clustering the data, a self-organizing map neural network clustering algorithm is adopted, and the data is computed with the large-scale parallel computing framework MapReduce.

3. The big data public service platform for vehicle driving behavior analysis in the Internet+ era according to claim 2, characterized in that: in the step of computing the data with the large-scale parallel computing framework MapReduce, the winning neuron and the weight corrections are calculated in the Map stage; in the Reduce stage, the total update of each weight connected to the same winning neuron is accumulated, and the weights are then updated.

4. The big data public service platform for vehicle driving behavior analysis in the Internet+ era according to claim 1, characterized in that: in the algorithm implementing the support vector machine, the similarity between two vectors is measured with the Euclidean distance metric.