Disclosure of Invention
The embodiments of the invention aim to provide a crime location identification method based on a discrete choice model that accurately identifies offenders' choice of crime location, improves the effectiveness and accuracy of identification, and provides an important reference for police prevention and control.
To achieve the above object, an embodiment of the present invention provides a crime location identification method based on a discrete choice model, comprising the following steps:
acquiring multi-source spatiotemporal data, and integrating the multi-source spatiotemporal data to generate a community attribute set for each community;
obtaining a geocoding result corresponding to the address text of the area to be identified, optimizing the geocoding result through a rule-and-cluster-based multi-source geocoding classification optimization model, and generating coding information corresponding to the optimized address text data to construct a crime location data set;
matching the crime location data set against the community attribute sets of all communities to obtain a first community attribute set for the crime locations, and fusing the first community attribute set with the crime location data set to generate a model sample;
performing collinearity diagnosis on the model sample to obtain variables that meet the requirements, fitting offenders' crime location choice preferences with a discrete choice model to obtain the expected utility of each offender's choice, and identifying the community in which the offender obtains the maximum expected utility from offending;
fitting the results obtained above according to a conditional logit model and the discrete choice model, constructing a crime location choice probability function, and obtaining the probability of crime location choice from that function.
Further, the multi-source spatiotemporal data comprise raw public security department data, mobile phone signaling data and census data.
Further, integrating the multi-source spatiotemporal data specifically comprises:
calculating the base station population of each community from the mobile phone signaling data;
generating the Thiessen polygons corresponding to each community from the base station populations, and calculating the complete area of each Thiessen polygon;
clipping the Thiessen polygons with the shp file of each community to obtain the fragmented Thiessen polygons corresponding to each community;
calculating the crowd flow density of each community from the crowd flow density formula, the base station population of each community, the complete Thiessen polygon areas and the fragmented Thiessen polygons corresponding to each community;
and calculating the socioeconomic heterogeneity of each community from a socioeconomic heterogeneity formula and the number of different socioeconomic groups in the census data.
Further, the crowd flow density formula is:
$$D_i = \frac{1}{S_i} \sum_{k=1}^{n} P_k \frac{S_{ji}}{S_{ki}}$$
wherein $D_i$ is the crowd flow density of the i-th community, $n$ is the number of Thiessen polygons related to the i-th community, $P_k$ is the population of the k-th Thiessen polygon, $S_{ji}$ is the area of fragment j in the i-th community, $S_{ki}$ is the total area of the k-th Thiessen polygon associated with the i-th community, and $S_i$ is the area of the i-th community.
Further, the socioeconomic heterogeneity formula is:
$$SE_i = 1 - \sum_{k=1}^{n} P_{ki}^{2}$$
wherein $n$ is the number of different socioeconomic groups and $P_{ki}$ is the proportion of the k-th socioeconomic group living in the i-th community; the larger the value of $SE_i$, the more heterogeneous the community population.
Further, performing collinearity diagnosis on the model sample to obtain variables that meet the requirements, fitting offenders' crime location choice preferences with a discrete choice model to obtain the expected utility of each offender's choice, and identifying the community in which the offender obtains the maximum expected utility from offending specifically comprises:
screening the variables that affect crime location choice from the model sample, and performing collinearity diagnosis on the variables with a variance inflation factor formula to retain the variables whose variance inflation factor is less than 10; the variables comprise built environment factors, social environment factors, crowd flow environment factors and crime prevention and control factors;
and calculating the expected utility of each community for the offender with an expected utility formula to obtain the community in which the offender's expected utility from offending is greatest.
Further, the variance inflation factor formula is:
$$\mathrm{VIF} = \frac{1}{1 - R_i^{2}}$$
wherein $R_i$ is the multiple correlation coefficient obtained by regressing variable $x_i$ on the remaining variables, and VIF is the variance inflation factor.
Further, the expected utility formula is:
$$U_{ij} = \beta x_{ij} + \varepsilon_{ij}$$
wherein $U_{ij}$ is the expected utility to the i-th offender of offending in the j-th community, $x_{ij}$ is the value of the explanatory variables related to the i-th offender and the j-th community, $\beta$ is the estimated coefficient of the explanatory variables, and $\varepsilon_{ij}$ is the random error of the model.
Further, the crime location choice probability function is the Prob probability function.
Further, the Prob probability function is calculated as:
$$\mathrm{Prob}(Y_i = j) = \frac{e^{\beta x_{ij}}}{\sum_{m=1}^{J} e^{\beta x_{im}}}$$
wherein $Y_i$ is the choice made by the i-th offender, $x_{ij}$ is the value of the explanatory variables related to the i-th offender and the j-th community, $\beta$ is the estimated coefficient of the explanatory variables, and $J$ is the number of candidate communities.
The embodiment of the invention has the following beneficial effects:
Compared with the prior art, the crime location identification method based on a discrete choice model provided by the embodiment of the invention acquires multi-source spatiotemporal data and integrates it to generate a community attribute set for each community; then obtains a geocoding result corresponding to the address text of the area to be identified, optimizes the geocoding result through a rule-and-cluster-based multi-source geocoding classification optimization model, and generates coding information corresponding to the optimized address text data to construct a crime location data set; matches the crime location data set against the community attribute sets of all communities to obtain a first community attribute set for the crime locations, and fuses the first community attribute set with the crime location data set to generate a model sample; then performs collinearity diagnosis on the model sample to obtain variables that meet the requirements, fits offenders' crime location choice preferences with a discrete choice model to obtain the expected utility of each offender's choice, and identifies the community in which the offender obtains the maximum expected utility from offending; and finally fits the results according to a conditional logit model and the discrete choice model, constructs a crime location choice probability function, and obtains the probability of crime location choice from that function.
In the embodiment provided by the invention, multi-source spatiotemporal data such as police incident data, arrest data, POI data and mobile phone signaling data are fused, and, based on the discrete choice model principle, the quantitative model of offenders' crime location choice is optimized by adding data on the crowd flow environment and the crime prevention and control environment. Compared with the basic model, the fitting accuracy of the full model for crime location choice is improved by 8.63%.
The embodiment provided by the invention also calculates, through the utility function and the probability function, the expected utility an offender obtains from a community and the probability that the offender chooses it. Offenders' preferences in choosing a crime location are thereby identified accurately, the effectiveness and accuracy of identification are improved, and the method provides an important reference for police prevention and control.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of protection of the invention.
Referring to FIG. 1, which is a schematic flow chart of an embodiment of the crime location identification method based on a discrete choice model provided by the invention, the method comprises steps S1 to S5:
S1: acquiring multi-source spatiotemporal data, and integrating the multi-source spatiotemporal data to generate a community attribute set for each community.
In this embodiment, the multi-source spatiotemporal data comprise raw public security department data (police incident data and arrest data), built environment data, mobile phone signaling data, census data and other data, and ArcGIS software is used to integrate this multi-source heterogeneous big data and generate a community attribute set for each community.
The public security department data comprise case information data and offender information data. The case information includes the report time, address text, case record, incident address, place of residence, time of the offense, time of arrest, robbery target, robbery place, crime tools, robbery method and property robbed; the offender information includes name, ethnicity, sex, place of origin, date of birth, occupation and drug use.
Step S1 specifically comprises: calculating the base station population of each community from the mobile phone signaling data; generating the Thiessen polygons corresponding to each community from the base station populations, and calculating the complete area of each Thiessen polygon; clipping the Thiessen polygons with the shp file of each community to obtain the fragmented Thiessen polygons corresponding to each community; calculating the crowd flow density of each community from the crowd flow density formula, the base station population of each community, the complete Thiessen polygon areas and the fragmented Thiessen polygons corresponding to each community; and calculating the socioeconomic heterogeneity of each community from a socioeconomic heterogeneity formula and the number of different socioeconomic groups in the census data.
In one embodiment of the present invention, the crowd flow density is measured from the mobile phone signaling data and used as the index of the crowd flow environment. Thiessen polygons are first generated from the base station populations in different time periods, and the complete area of each Thiessen polygon is calculated. The Thiessen polygons are then clipped with the shp files of the communities to obtain the fragmented Thiessen polygons and the community to which each fragment belongs, and the area of each fragment is calculated. Finally, the population apportioned to each fragment by area is summed and divided by the community area to obtain the crowd flow density. The calculation formula is as follows:
$$D_i = \frac{1}{S_i} \sum_{k=1}^{n} P_k \frac{S_{ji}}{S_{ki}}$$
wherein $D_i$ is the crowd flow density of the i-th community, $n$ is the number of Thiessen polygons related to the i-th community, $P_k$ is the population of the k-th Thiessen polygon, $S_{ji}$ is the area of fragment j in the i-th community, $S_{ki}$ is the total area of the k-th Thiessen polygon associated with the i-th community, and $S_i$ is the area of the i-th community.
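The variable definitions here describe an area-weighted apportionment of polygon populations to a community. The following Python sketch illustrates one reading of that calculation, assuming each polygon's population is spread uniformly over its area. The `Fragment` class, the `crowd_flow_density` function and all numbers are hypothetical and are not part of the described method.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Fragment:
    """One piece of a Thiessen polygon that falls inside a community."""
    polygon_population: float  # P_k: population of the parent Thiessen polygon
    fragment_area: float       # S_ji: area of the fragment inside the community
    polygon_area: float        # S_ki: full area of the parent Thiessen polygon


def crowd_flow_density(fragments: Sequence[Fragment], community_area: float) -> float:
    """Apportion each polygon's population by area share, then divide by S_i."""
    if community_area <= 0:
        raise ValueError("community_area must be positive")
    apportioned = sum(
        f.polygon_population * f.fragment_area / f.polygon_area for f in fragments
    )
    return apportioned / community_area


# Hypothetical community of area 2.0 covered by pieces of two polygons.
fragments = [
    Fragment(polygon_population=1000.0, fragment_area=0.5, polygon_area=2.0),
    Fragment(polygon_population=600.0, fragment_area=1.5, polygon_area=3.0),
]
density = crowd_flow_density(fragments, community_area=2.0)
print(density)
```

In this made-up case the first polygon contributes 250 people and the second 300, so the community holds 550 apportioned people over an area of 2.0.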
Socioeconomic heterogeneity is then characterized by the differences in housing type recorded in the census data, calculated as follows:
$$SE_i = 1 - \sum_{k=1}^{n} P_{ki}^{2}$$
wherein $n$ is the number of different socioeconomic groups and $P_{ki}$ is the proportion of the k-th socioeconomic group living in the i-th community; the larger the value of $SE_i$, the more heterogeneous the community population.
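As a hedged illustration, the sketch below assumes the heterogeneity index takes the common form of one minus the sum of squared group proportions, which matches the stated property that a larger value means a more heterogeneous population. The function name and the household counts are hypothetical.

```python
from typing import Sequence


def socioeconomic_heterogeneity(group_counts: Sequence[float]) -> float:
    """Return 1 - sum(p_k^2), where p_k is the share of group k in the community."""
    total = sum(group_counts)
    if total <= 0:
        raise ValueError("community has no residents")
    return 1.0 - sum((count / total) ** 2 for count in group_counts)


# Hypothetical household counts by housing type in three communities.
homogeneous = socioeconomic_heterogeneity([100, 0, 0, 0])    # one group only
two_groups = socioeconomic_heterogeneity([50, 50])           # even split of 2
four_groups = socioeconomic_heterogeneity([25, 25, 25, 25])  # even split of 4
print(homogeneous, two_groups, four_groups)
```

Under this assumed form the index is 0 for a community with a single group and approaches 1 as residents are spread evenly over more groups.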
S2: obtaining a geocoding result corresponding to the address text of the area to be identified, optimizing the geocoding result through a rule-and-cluster-based multi-source geocoding classification optimization model, and generating coding information corresponding to the optimized address text data to construct a crime location data set.
S3: matching the crime location data set against the community attribute sets of all communities to obtain a first community attribute set for the crime locations, and fusing the first community attribute set with the crime location data set to generate a model sample.
S4: performing collinearity diagnosis on the model sample to obtain variables that meet the requirements, fitting offenders' crime location choice preferences with a discrete choice model to obtain the expected utility of each offender's choice, and identifying the community in which the offender obtains the maximum expected utility from offending.
Step S4 specifically comprises: screening the variables that affect crime location choice from the model sample, and performing collinearity diagnosis on the variables with a variance inflation factor formula to retain the variables whose variance inflation factor is less than 10, the variables comprising built environment factors, social environment factors, crowd flow environment factors and crime prevention and control factors; and calculating the expected utility of each community for the offender with an expected utility formula to obtain the community in which the offender's expected utility from offending is greatest.
In another embodiment of the invention, the independent variables affecting crime location choice are screened and collinearity diagnosis is performed. The independent variables include built environment, social environment, crowd flow environment and crime prevention and control factors. The collinearity diagnosis uses the variance inflation factor, calculated as follows:
$$\mathrm{VIF} = \frac{1}{1 - R_i^{2}}$$
wherein $R_i$ is the multiple correlation coefficient obtained by regressing variable $x_i$ on the remaining variables, and VIF is the variance inflation factor, which is required to be less than 10 and is preferably as small as possible.
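The screening can be illustrated with a small pure-Python sketch. It relies on the standard definition VIF = 1 / (1 - R²) and on the equivalent identity that the VIF of variable i is the i-th diagonal element of the inverse correlation matrix. The function names, variable names and data are hypothetical; a statistics package would normally be used instead.

```python
import math
from typing import List, Sequence


def correlation_matrix(columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pearson correlation matrix of the given variables (one sequence each)."""
    n = len(columns[0])
    centered = [[v - sum(c) / n for v in c] for c in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    if any(norm == 0 for norm in norms):
        raise ValueError("a constant variable has no defined correlation")
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix: List[List[float]]) -> List[List[float]]:
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("variables are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns: Sequence[Sequence[float]]) -> List[float]:
    """VIF_i = 1 / (1 - R_i^2), read off the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Hypothetical data: x3 is almost exactly twice x1, so both are flagged.
names = ["x1", "x2", "x3"]
data = [[1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0], [2.0, 4.0, 6.0, 8.1]]
vifs = variance_inflation_factors(data)
kept = [name for name, vif in zip(names, vifs) if vif < 10]
print(kept)
```

This sketch applies the VIF < 10 threshold in a single pass; whether variables are dropped one at a time or all at once is a design choice the text does not specify.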
Then, using the independent variables retained by the above screening, the expected utility of each community for the offender is calculated with a utility function based on the discrete choice model principle, giving the community in which the offender's expected utility from offending is greatest. The calculation formula is as follows:
$$U_{ij} = \beta x_{ij} + \varepsilon_{ij}$$
wherein $U_{ij}$ is the expected utility to the i-th offender of offending in the j-th community, $x_{ij}$ is the value of the explanatory variables related to the i-th offender and the j-th community, $\beta$ is the estimated coefficient of the explanatory variables, and $\varepsilon_{ij}$ is the random error of the model.
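A minimal sketch of the utility comparison follows. Because the random error is unobserved, the sketch ranks communities by the systematic part βx only, which is an assumption of this example. The coefficients, community labels and attribute values are hypothetical.

```python
from typing import Dict, Sequence


def systematic_utility(beta: Sequence[float], x: Sequence[float]) -> float:
    """Deterministic part of U_ij: the inner product of beta and x_ij."""
    if len(beta) != len(x):
        raise ValueError("beta and x must have the same length")
    return sum(b * v for b, v in zip(beta, x))


def best_community(beta: Sequence[float],
                   attributes: Dict[str, Sequence[float]]) -> str:
    """Return the community whose systematic utility is largest."""
    return max(attributes, key=lambda c: systematic_utility(beta, attributes[c]))


# Hypothetical coefficients for two explanatory variables.
beta = [0.8, -0.5]
# Hypothetical attribute vectors x_ij for three candidate communities.
attributes = {"A": [1.0, 2.0], "B": [2.0, 1.0], "C": [0.5, 0.2]}
utilities = {c: systematic_utility(beta, x) for c, x in attributes.items()}
chosen = best_community(beta, attributes)
print(utilities, chosen)
```

With these made-up values community B has the highest systematic utility and is the one selected.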
S5: fitting the results obtained above according to the conditional logit model and the discrete choice model, constructing a crime location choice probability function, and obtaining the probability of crime location choice from that function.
In this embodiment, the crime location choice probability function is the Prob probability function, calculated as:
$$\mathrm{Prob}(Y_i = j) = \frac{e^{\beta x_{ij}}}{\sum_{m=1}^{J} e^{\beta x_{im}}}$$
wherein $Y_i$ is the choice made by the i-th offender, $x_{ij}$ is the value of the explanatory variables related to the i-th offender and the j-th community, $\beta$ is the estimated coefficient of the explanatory variables, and $J$ is the number of candidate communities.
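The sketch below assumes the usual conditional logit form, in which the probability of choosing a community is the exponential of its systematic utility divided by the sum of exponentials over all candidate communities. The function name and the utilities are hypothetical.

```python
import math
from typing import List, Sequence


def choice_probabilities(utilities: Sequence[float]) -> List[float]:
    """Conditional logit: P(j) = exp(V_j) / sum over m of exp(V_m)."""
    if not utilities:
        raise ValueError("at least one candidate community is required")
    shift = max(utilities)  # subtracting the max keeps exp() from overflowing
    weights = [math.exp(v - shift) for v in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical systematic utilities beta * x_ij for three candidate communities.
probs = choice_probabilities([0.0, math.log(3.0), 0.0])
print(probs)
```

Subtracting the largest utility before exponentiating leaves the probabilities unchanged, since the shift cancels between numerator and denominator.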
To better illustrate the principle of the method, a specific example of the method provided by the invention follows:
Referring to FIG. 2, data such as ZG city public security department data, built environment data, mobile phone signaling data and census data are first obtained. The classified ZG city crime data, built environment data, social environment data, crowd flow data and crime prevention and control data are computed using the crowd flow density formula, the socioeconomic heterogeneity formula and so on, and ArcGIS software is used to integrate the multi-source heterogeneous big data and generate the attribute set of each community in the area to be identified. The data used include, but are not limited to, the following.
The ZG city public security department provided two types of data. The first is police incident data on street robberies in ZG city from 2012 to 2016, comprising 85,898 records of case information such as robbery time, robbery address, robbery target and lost property. The second is arrest data on street robbers in ZG city from 2012 to 2016, comprising 14,863 records of detailed individual information and case information. The individual information includes the robber's name, sex, ethnicity, registered household address, date of birth, education level and whether the robber uses drugs; the case information includes the robbery time, the robbery address, the robber's residence at the time of the robbery, the crime target, the crime place, the crime tools, the crime method, and the property or money lost.
The ZG city built environment data comprise 2014 POI data (point coordinates), 2014 traffic network vector map data, 2015 bus stop data (point coordinates) retrieved through the Baidu API, and so on. The mobile phone signaling data are 2G and 3G mobile phone signaling data for ZG city from May 12 to May 18, 2016, obtained from a communications company in ZG city, China.
The census data are the ZG city data from the Sixth National Population Census of 2010. They are organized by community and mainly record social attributes such as each community's total population, non-local population, juvenile population, per capita housing area and housing type.
The geocoding result of the address text is then obtained with a Geocoding API, its credibility is preliminarily judged with simple rules, and the result is cleaned and optimized with the rule-and-cluster-based multi-source geocoding classification optimization model, yielding the coding information for the optimized address text data that forms the crime location data set.
Following the principle shown in FIG. 3, the data set intersection function of the Stata software is used to match each crime location to the attribute set of the community it belongs to, and the community attribute set is fused with the crime location data set to construct the sample. Each street robber in ZG city chooses 1 of 1,971 communities in which to offend; the chosen community is labelled "1" and every unchosen community is labelled "0". Nineteen attributes are selected for each community. The 1,971 communities are paired one-to-one with the 27,949 cases by cross-coding in Stata, so that 1,971 × 27,949 = 55,087,479 samples are constructed.
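The cross-coding step can be sketched in Python as follows. This is only an illustration of pairing every case with every candidate community and flagging the chosen one; it is not the Stata procedure used in the example, and the case identifiers, community identifiers and function name are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def build_choice_samples(
    case_to_community: Dict[str, str], community_ids: Sequence[str]
) -> List[Tuple[str, str, int]]:
    """Pair every case with every community; flag the chosen community with 1."""
    rows = []
    for case_id, community_id in product(case_to_community, community_ids):
        chosen = 1 if case_to_community[case_id] == community_id else 0
        rows.append((case_id, community_id, chosen))
    return rows


# Hypothetical toy data: two cases, three candidate communities.
cases = {"case_1": "c2", "case_2": "c3"}
communities = ["c1", "c2", "c3"]
samples = build_choice_samples(cases, communities)
for row in samples:
    print(row)

# Sample count at the scale reported in the text: communities times cases.
full_scale_rows = 1971 * 27949
```

Each case contributes exactly one row flagged "1", and the total row count is the number of cases multiplied by the number of communities, which is how the 55,087,479 figure arises.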
Collinearity diagnosis is used to screen the variables satisfying VIF < 10, offenders' crime location choice preferences are fitted with the discrete choice model, and the expected utility of the i-th offender choosing the j-th community as the crime location is obtained.
The Prob probability function is obtained from the conditional logit model form, and the probability that an offender chooses to offend in the j-th community is calculated.
The method fits a full model (FM) of offenders' crime location choice and compares its pseudo R-squared goodness of fit with that of a basic model (NM) and of two combined models (CM1 and CM2), which add to the basic model only the crowd flow environment variables and only the crime prevention and control variables, respectively. The goodness-of-fit test requires a pseudo R-squared greater than 0.20; otherwise the model is not considered to fit well. The goodness of fit of each discrete-choice-based model, compared with the basic model, is shown in the following table:
In summary, compared with the prior art, the crime location identification method based on a discrete choice model provided by the embodiment of the invention has the following beneficial effects:
(1) By fusing multi-source spatiotemporal data such as police incident data, arrest data, POI data and mobile phone signaling data, and adding data on the crowd flow environment and the crime prevention and control environment, the quantitative model of offenders' crime location choice is optimized. Compared with the basic model, the fitting accuracy of the full model for crime location choice is improved by 8.63%, which is an important improvement in fitting accuracy.
(2) Based on the discrete choice model, the expected utility and the probability of an offender choosing a community are calculated with the utility function and the probability function, so that crime location choice is identified accurately, effectiveness and accuracy are improved, and the method provides an important reference for police prevention and control.
Those skilled in the art will appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program stored on a computer-readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing describes preferred embodiments of the present invention. Those skilled in the art will appreciate that changes and modifications may be made without departing from the principles of the invention, and such changes and modifications are also intended to fall within the scope of the invention.