CN109977149A

CN109977149A - Crime big data point pattern analysis method based on the G-function and an improved KD tree

Info

Publication number: CN109977149A
Application number: CN201910204662.3A
Authority: CN
Inventors: 何雨情; 杨立涛; 白璐斌; 黄舒哲
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2019-03-18
Filing date: 2019-03-18
Publication date: 2019-07-05

Abstract

The invention discloses a crime big data point pattern analysis method based on the G-function and improved KD tree parallel computation. To meet the current demand for processing crime big data, the invention combines improved KD tree parallel computation with the nearest-neighbor distance point pattern analysis method (G-function) to provide a method that can rapidly analyze the spatial distribution pattern of crime. The method partitions the crime point events in space into clusters, builds a KD tree for each cluster, and computes in parallel the nearest-neighbor distances of the crime point events in each KD tree. Breaking the whole data set into parts and processing the blocks in parallel speeds up computation and improves the utilization of computing resources.

Description

Crime big data point pattern analysis method based on the G-function and an improved KD tree

Technical field

The invention belongs to the field of big data mining and relates to a crime big data point pattern analysis method, in particular to a novel crime big data point pattern analysis method based on the G-function and improved KD tree parallel computation.

Background art

With the development of Internet technology, the world has entered the era of big data. In form, big data is a collection of massive related data; in practice, it refers to the ability to collect and analyze large amounts of information. Over the past decade, the informatization of public security organs has advanced rapidly: a police information network has been established that reaches down through every administrative level and across every department, every branch of police work is now managed through information systems, and a massive body of basic operational data has accumulated. Crime data in particular is large in volume, scattered, and complex in composition, which makes information hard to extract. Traditional modes of crime event analysis and management are overburdened and urgently need a thorough transformation. By collecting, organizing, classifying, and analyzing massive data, spatial distribution characteristics of crime that traditional means cannot detect can be derived, and the great value contained in the data can be mined.

The most famous example of spatial point pattern research in geography is Snow's cholera map, which ended the cholera epidemic that broke out in London in 1853. Quantitative analysis of spatial point distribution patterns became prevalent during the quantitative revolution of the 1960s and was widely applied in earth science research, for example to settlement density (Dacey, 1962; King, 1962) and to the distribution of drumlins in glaciated areas (Trenhaile, 1971). With the rise of geographic information system technology, point pattern analysis, as an important part of spatial analysis, has been studied in depth and widely applied, and models such as the nearest-neighbor distance algorithm G-function have emerged.

The nearest-neighbor distance algorithm G-function can analyze the spatial distribution characteristics of crime event points, but computing it requires finding the nearest neighbor of every point event. Faced with today's massive raw crime data, the traditional traversal search must compute, one by one, the distance between the center point and every other point in range whenever it computes a nearest-neighbor distance. The computation takes too long, wastes resources, and is inefficient.

Summary of the invention

To solve the above technical problem, the present invention provides a novel crime big data point pattern analysis method based on the G-function and improved KD tree parallel computation.

The technical solution adopted by the invention is a crime big data point pattern analysis method based on the G-function and improved KD tree parallel computation, comprising the following steps:

Step 1: data preprocessing;

Input the coordinates of all crime event points to be processed and divide the points into several clusters with a clustering algorithm. A threshold is set to judge whether a cluster is too large; if it is, the cluster is clustered again into sub-clusters until the number of points in each cluster is appropriate. A KD tree is then built for each cluster using a parallel computing strategy.

Step 2: search for nearest neighbors;

For each point to be computed, look up the cluster that contains it and determine the KD tree of that cluster. Then search the KD tree for the point's nearest neighbor and compute the nearest-neighbor distance d_min. Repeat until all input points have been processed, which yields the nearest-neighbor distance of every point.

Step 3: compute the G-function;

Sort the nearest-neighbor distances of all points by magnitude and compute the range R and the class interval D of the nearest-neighbor distances, where R = max(d_min) - min(d_min). Accumulate the number of points up to the upper limit of each class interval and compute the cumulative frequency G(d).

Step 4: perform a significance test and obtain the analysis result;

The Monte Carlo stochastic simulation method is used. If the probability that the stochastic simulation distribution function exceeds the upper bound U(d) and the probability that it falls below the lower bound L(d) satisfy the test condition [formulas omitted in the source text], the calculated result meets the significance test index. The curve of G(d) against the distance d is then output and the spatial distribution pattern of the point data is judged. As the distance d changes, the cumulative frequency of crime events changes: if the point events tend to be clustered in space, the G-function value rises rapidly over short distances; if the events in the point pattern tend to be dispersed, the G-function value increases more slowly.

The present invention improves on the traditional traversal search. It searches for nearest neighbors with an improved KD tree algorithm (a KD tree is a tree data structure that partitions a K-dimensional data space), processes the partitioned crime data point sets in parallel, computes the corresponding function, performs a significance test, and judges the spatial distribution pattern of crime point events from the resulting function curve. This greatly increases computation speed, addresses the heavy computation posed by today's massive crime data, and further improves the utilization of computing resources.

Description of the drawings

Fig. 1 is a flow chart of an embodiment of the present invention;

Fig. 2 is the G-function analysis graph of crime data from California, USA, in an embodiment of the present invention.

Detailed description of the embodiments

To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.

Analyzing the spatial distribution pattern of crime events with the nearest-neighbor distance algorithm G-function, one of the spatial point pattern analysis methods, requires finding the nearest neighbor of every crime point. The traditional traversal search visits all elements of a data structure systematically in a fixed order. It is suitable only for small data volumes; when the data volume is large, computation time becomes long and efficiency low.
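For comparison, the traditional traversal search described above can be sketched as follows. This is an illustrative Python baseline, not code from the invention; the function name `brute_force_nn_distances` and the sample coordinates are assumptions. Every point is compared against every other point, so the cost grows with the square of the number of points.

```python
import math

def brute_force_nn_distances(points):
    """Nearest-neighbor distance of every 2-D point by full traversal (O(n^2))."""
    result = []
    for i, (xi, yi) in enumerate(points):
        best = math.inf
        for j, (xj, yj) in enumerate(points):
            if i != j:  # a point is not its own neighbor
                best = min(best, math.hypot(xi - xj, yi - yj))
        result.append(best)
    return result

# Hypothetical event coordinates, for illustration only.
points = [(0.0, 0.0), (3.0, 4.0), (3.0, 5.0), (10.0, 0.0)]
d_min = brute_force_nn_distances(points)
```

With n points the double loop performs n(n - 1) distance evaluations, which is the cost that the clustered KD tree approach is meant to avoid.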

To meet the current demand for processing crime big data, the present invention combines improved KD tree parallel computation with the nearest-neighbor distance point pattern analysis method (G-function) to provide a method that can rapidly analyze the spatial distribution pattern of crime: the crime big data point pattern analysis method based on the G-function and improved KD tree parallel computation. The method partitions the crime point events in space into clusters, builds a KD tree for each cluster, and computes in parallel the nearest-neighbor distances of the crime point events in each KD tree. Breaking the whole data set into parts and processing the blocks in parallel speeds up computation and improves the utilization of computing resources.

Referring to Fig. 1, the crime big data point pattern analysis method based on the G-function and improved KD tree parallel computation provided by the invention comprises the following steps:

Step 1: data preprocessing;

Input the coordinates of all crime event points to be processed and divide the points into several clusters with a clustering algorithm. A threshold is set to judge whether a cluster is too large; if it is, the cluster is clustered again into sub-clusters until the number of points in each cluster is appropriate. A KD tree is then built for each cluster using a parallel computing strategy, so that the nearest neighbor of each point lies within its own cluster.
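The recursive sub-clustering of Step 1 might look like the sketch below. The source does not name the clustering algorithm or the threshold value, so the plain 2-means split, the helper names `two_means` and `split_until_small`, and the parameter `max_size` are all assumptions made for illustration.

```python
import random

def two_means(points, iters=20, seed=0):
    """Split 2-D points into two groups with a plain 2-means (Lloyd) iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)          # assumed initialisation: two random points
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d0 = (p[0] - centers[0][0]) ** 2 + (p[1] - centers[0][1]) ** 2
            d1 = (p[0] - centers[1][0]) ** 2 + (p[1] - centers[1][1]) ** 2
            groups[0 if d0 <= d1 else 1].append(p)
        if not groups[0] or not groups[1]:
            break                            # degenerate split, handled by the caller
        centers = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
                   for g in groups]
    return groups

def split_until_small(points, max_size):
    """Keep splitting any cluster larger than max_size (the threshold)."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if len(points) <= max_size:
        return [points]
    a, b = two_means(points)
    if not a or not b:                       # e.g. many identical coordinates
        ordered = sorted(points)
        half = len(ordered) // 2
        a, b = ordered[:half], ordered[half:]
    return split_until_small(a, max_size) + split_until_small(b, max_size)

# Two hypothetical groups of events, far apart from each other.
points = [(i * 0.1, 0.0) for i in range(10)] + [(100 + i * 0.1, 50.0) for i in range(10)]
clusters = split_until_small(points, max_size=6)
```

Each resulting cluster is small enough to receive its own KD tree. A density-based clustering algorithm could be substituted for the 2-means split without changing the surrounding logic.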

In this embodiment, the KD tree is built as follows: first compute the variance of all data in the cluster along each dimension; then, in the dimension with the largest variance, take the median of all data as the splitting hyperplane, which becomes the root node; finally determine the left and right subtrees. The procedure is applied recursively down to the leaf nodes.
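The construction rule in this paragraph (split on the dimension of largest variance, at the median) can be sketched in Python as below. The class name `KDNode`, the function name `build_kdtree`, and the sample points are assumptions; taking the upper median for an even number of points is also an assumption, since the source does not say how ties are resolved.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Optional, Tuple

@dataclass
class KDNode:
    point: Tuple[float, ...]           # the median point stored at this node
    axis: int                          # dimension used for the split
    left: Optional["KDNode"] = None    # points at or below the median on `axis`
    right: Optional["KDNode"] = None   # points at or above the median on `axis`

def build_kdtree(points):
    """Recursively build a KD tree, splitting on the dimension of largest variance."""
    if not points:
        return None
    dims = range(len(points[0]))
    # Variance of the data along every dimension; pick the largest.
    axis = max(dims, key=lambda a: pvariance([p[a] for p in points]))
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2            # upper median when the count is even
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid]),
                  build_kdtree(ordered[mid + 1:]))

# Hypothetical cluster of six events.
cluster = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
root = build_kdtree(cluster)
```

For this sample the x coordinates vary more than the y coordinates, so the root splits on dimension 0 at the point (7.0, 2.0); the left subtree then splits on dimension 1.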

Step 2: search for nearest neighbors;

In this embodiment, nearest neighbors are searched for in the KD trees using multi-threaded parallel search.

The improved KD tree parallel algorithm clusters the massive data points automatically according to their spatial density, and a KD tree is built quickly and in parallel for each resulting cluster, which saves construction time. Once the KD trees are built, the data are processed in parallel with multiple threads, which greatly reduces processing time. Instead of one huge KD tree, the method builds several smaller KD trees, which shortens both tree construction and nearest-neighbor search and improves data processing efficiency.
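A minimal sketch of the per-cluster parallel search is given below. It is not the patented implementation: the function names are hypothetical, the split axis simply alternates to keep the example short (rather than following the largest variance), and coordinates within a cluster are assumed to be distinct. As in the text, each point's nearest neighbor is assumed to lie inside its own cluster.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def build(points, depth=0):
    """KD tree as nested tuples (point, axis, left, right); axis alternates for brevity."""
    if not points:
        return None
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (ordered[mid], axis,
            build(ordered[:mid], depth + 1),
            build(ordered[mid + 1:], depth + 1))

def nearest(node, query, best=math.inf):
    """Distance from `query` to its nearest other point in the tree."""
    if node is None:
        return best
    point, axis, left, right = node
    if point != query:                    # skip the query point itself
        best = min(best, math.dist(point, query))
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    if abs(diff) < best:                  # the far side may still hold a closer point
        best = nearest(far, query, best)
    return best

def cluster_nn_distances(cluster):
    """Build one small KD tree and query it for every point of the cluster."""
    tree = build(cluster)
    return [nearest(tree, p) for p in cluster]

def parallel_nn_distances(clusters, workers=4):
    """Process the clusters concurrently, one task per cluster."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_cluster = list(pool.map(cluster_nn_distances, clusters))
    return [d for dists in per_cluster for d in dists]

# Two hypothetical clusters of events.
clusters = [[(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)],
            [(10.0, 10.0), (10.0, 13.0), (14.0, 10.0)]]
d_min = parallel_nn_distances(clusters)
```

In standard CPython, threads running pure Python code share the global interpreter lock, so a process pool (`concurrent.futures.ProcessPoolExecutor`) or a compiled KD tree library may be needed to see an actual speed-up. Points close to a cluster boundary may also have their true nearest neighbor in an adjacent cluster, which this sketch does not check.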

Step 3: compute the G-function;

In this embodiment, the cumulative frequency G(d) of the nearest-neighbor distances is constructed from the distances of all nearest-neighbor events:

$$G(d) = \frac{\#\left(d_{\min}(s_i) \le d\right)}{n}$$

where s_i is an event in the study region; n is the number of events; d is the distance; and #(d_min(s_i) ≤ d) is the number of points whose nearest-neighbor distance does not exceed d.
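The cumulative frequency can be evaluated at the upper limit of each class interval, using the range R and class interval D of Step 3. The sketch below assumes that D is obtained by dividing R into a fixed number of classes; the source does not state how D is chosen, so `num_classes` and the function name `g_function` are assumptions.

```python
from bisect import bisect_right

def g_function(d_min, num_classes=5):
    """Return (upper_limits, G), where G[k] is the share of points with d_min <= upper_limits[k]."""
    ordered = sorted(d_min)
    n = len(ordered)
    r = ordered[-1] - ordered[0]           # range R = max(d_min) - min(d_min)
    width = r / num_classes                # class interval D (assumed: R / number of classes)
    uppers = [ordered[0] + width * (k + 1) for k in range(num_classes)]
    uppers[-1] = ordered[-1]               # guard against floating-point rounding
    # bisect_right counts the values that are <= the upper limit.
    g = [bisect_right(ordered, u) / n for u in uppers]
    return uppers, g

# Hypothetical nearest-neighbor distances.
uppers, g = g_function([1.0, 1.0, 2.0, 3.0, 3.0, 4.0], num_classes=3)
```

Here R = 3 and D = 1, so G is evaluated at d = 2, 3 and 4, giving 3/6, 5/6 and 6/6 of the points respectively.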

Step 4: perform a significance test and obtain the analysis result;

The Monte Carlo stochastic simulation method is used. If the probability that the stochastic simulation distribution function exceeds the upper bound U(d) and the probability that it falls below the lower bound L(d) satisfy the test condition [formulas omitted in the source text], the calculated result meets the significance test index. The curve of G(d) against the distance d is then output and the spatial distribution pattern of the point data is judged. As the distance d changes, the cumulative frequency of crime events changes: if the point events tend to be clustered in space, the G-function value rises rapidly over short distances; if the events in the point pattern tend to be dispersed, the G-function value increases more slowly.

In this embodiment, the significance test uses the Monte Carlo stochastic simulation method:

1. Generate m CSR (complete spatial randomness) point patterns and estimate the theoretical distribution [formula omitted in the source text], whose terms are the m independent random simulation functions obtained by simulating n CSR events;

2. Compute the upper bound U(d) and the lower bound L(d) of the stochastic simulation distribution function;

3. Compute separately the probability of exceeding the upper bound U(d) of the stochastic simulation distribution function and the probability of falling below its lower bound L(d) [formulas omitted in the source text];

4. If the test condition [formula omitted in the source text] is satisfied, the calculated result meets the significance test index, and the G-function result curve is output.
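Because the formulas for this test are not reproduced in the text, the following Python sketch follows the common simulation-envelope procedure for the G-function and should be read as an assumption about what the steps amount to, not as the patent's exact test. It simulates m CSR patterns of n points in a rectangle, takes the mean of the simulated curves as the theoretical estimate, and uses their pointwise maximum and minimum as U(d) and L(d). Under that procedure, a pattern that is itself CSR exceeds U(d) at a fixed d with probability 1/(m + 1), and likewise for L(d). All function names, the study-area bounds, and the sample data are hypothetical.

```python
import math
import random

def nn_distances(points):
    """Nearest-neighbor distance of every point (brute force, for brevity)."""
    return [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

def g_at(d_min, distances):
    """Cumulative frequency G(d) evaluated at each distance in `distances`."""
    n = len(d_min)
    return [sum(1 for v in d_min if v <= d) / n for d in distances]

def csr_envelope(n, m, bounds, distances, seed=0):
    """Mean, lower bound L(d) and upper bound U(d) over m simulated CSR patterns."""
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bounds
    sims = []
    for _ in range(m):
        pts = [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n)]
        sims.append(g_at(nn_distances(pts), distances))
    columns = list(zip(*sims))             # one column of m values per distance
    mean = [sum(col) / m for col in columns]
    lower = [min(col) for col in columns]
    upper = [max(col) for col in columns]
    return mean, lower, upper

def classify(observed, lower, upper):
    """Label each distance by where the observed G(d) falls relative to the envelope."""
    return ["clustered" if o > u else "dispersed" if o < l else "consistent with CSR"
            for o, l, u in zip(observed, lower, upper)]

# Hypothetical, tightly clustered events inside the unit square.
events = [(0.5 + 0.001 * i, 0.5) for i in range(20)]
distances = [0.01, 0.05, 0.1]
observed = g_at(nn_distances(events), distances)
mean, lower, upper = csr_envelope(n=20, m=19, bounds=(0.0, 0.0, 1.0, 1.0),
                                  distances=distances)
labels = classify(observed, lower, upper)
```

With m = 19 simulations, the one-sided pointwise level is 1/20 = 0.05. An observed curve above U(d) at short distances corresponds to the rapid rise that the text associates with clustered events.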

Taking crime data from California, USA, as an example, the computed G-function curve is shown in Fig. 2. It indicates that the crime point events in this area follow a spatially clustered distribution pattern. Crime "hot spots" with a clustered distribution can be identified as areas where crime occurs frequently; there, police strength should be increased, patrol frequency raised, and the public advised to pay attention to personal and property safety.

Crime events are abstracted as points in space. Analysis with this method can yield three crime distribution patterns: clustered, uniform, and random. Mining the spatial distribution pattern of crime is significant in several ways:

1) It helps police professionals analyze crime data comprehensively, identify crime hot spots, and allocate police strength rationally and in a planned way to reduce the occurrence of certain crimes;

2) In areas where crime shows a clustered distribution pattern, publicity to raise legal awareness can be strengthened and advice given to the public to avoid unnecessary personal and property losses, improving the level of urban public safety;

3) Based on the point pattern analysis results, the future amount of crime in the city can be further analyzed and predicted and the temporal regularity of crime mined, helping the government grasp the overall urban public safety situation and make decisions.

It should be understood that the parts not elaborated in this specification belong to the prior art.

It should be understood that the above description of the preferred embodiments is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the invention. Those of ordinary skill in the art, inspired by the present invention, may make substitutions or modifications without departing from the scope protected by the claims, and these all fall within the scope of protection of the invention. The scope of protection claimed is determined by the appended claims.

Claims

1. A crime big data point pattern analysis method based on the G-function and an improved KD tree, characterized by comprising the following steps:

Step 1: data preprocessing;

Input the coordinates of all crime event points to be processed and divide the points into several clusters with a clustering algorithm; set a threshold to judge whether a cluster is too large and, if it is, cluster it again into sub-clusters until the number of points in each cluster is appropriate; then build a KD tree for each cluster using a parallel computing strategy;

Step 2: search for nearest neighbors;

For each point to be computed, look up the cluster that contains it and determine the KD tree of that cluster; then search the KD tree for the point's nearest neighbor and compute the nearest-neighbor distance d_min; repeat until all input points have been processed, yielding the nearest-neighbor distance of every point;

Step 3: compute the G-function;

Sort the nearest-neighbor distances of all points by magnitude and compute the range R and the class interval D of the nearest-neighbor distances, where R = max(d_min) - min(d_min); accumulate the number of points up to the upper limit of each class interval and compute the cumulative frequency G(d);

Step 4: perform a significance test and obtain the analysis result;

If the calculated result meets the significance test index, output the curve of G(d) against the distance d and judge the spatial distribution pattern of the point data; as the distance d changes, the cumulative frequency of crime events changes: if the point events tend to be clustered in space, the G-function value rises rapidly over short distances; if the events in the point pattern tend to be dispersed, the G-function value increases more slowly.

2. The crime big data point pattern analysis method based on the G-function and an improved KD tree according to claim 1, characterized in that: to build the KD tree in step 1, first compute the variance of all data in each cluster along each dimension; then, in the dimension with the largest variance, take the median of all data as the splitting hyperplane, which becomes the root node; finally determine the left and right subtrees; the procedure is applied recursively down to the leaf nodes.

3. The crime big data point pattern analysis method based on the G-function and an improved KD tree according to claim 1, characterized in that: in step 2, nearest neighbors are searched for in the KD trees using multi-threaded parallel search.

4. The crime big data point pattern analysis method based on the G-function and an improved KD tree according to claim 1, characterized in that: in step 3, the cumulative frequency G(d) of the nearest-neighbor distances is constructed from the distances of all nearest-neighbor events:

$$G(d) = \frac{\#\left(d_{\min}(s_i) \le d\right)}{n}$$

where s_i is an event in the study region; n is the number of events; d is the distance; and #(d_min(s_i) ≤ d) is the number of points whose nearest-neighbor distance does not exceed d.

5. The crime big data point pattern analysis method based on the G-function and an improved KD tree according to any one of claims 1-4, characterized in that: in step 4, the Monte Carlo stochastic simulation method is used; if the probability that the stochastic simulation distribution function exceeds the upper bound U(d) and the probability that it falls below the lower bound L(d) satisfy the test condition [formulas omitted in the source text], the calculated result meets the significance test index.