CN110097126B

CN110097126B - Method for checking important personnel and house missing registration based on DBSCAN clustering algorithm

Info

Publication number: CN110097126B
Application number: CN201910374115.XA
Authority: CN
Inventors: 许正; 朱哲辰; 黄泷; 闫子为; 高子康
Original assignee: Jiangsu Ugs Information Technology Co ltd
Current assignee: Jiangsu Ugs Information Technology Co ltd
Priority date: 2019-05-07
Filing date: 2019-05-07
Publication date: 2023-04-21
Anticipated expiration: 2039-05-07
Also published as: CN110097126A

Abstract

The invention relates to a method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm. Population and house data sets collected by the police are preprocessed, including missing-value filling, discretization of categorical variables, and standardization of numerical variables. Samples that are not core points are classified on the "key personnel and houses" data set using the DBSCAN clustering algorithm, and the clustering result is analyzed. All data labeled "key personnel and houses" are then fixed as core points, and the non-core samples of the population and house data sets are classified by a DBSCAN clustering algorithm with adaptive feature weights to obtain a clustering result, from which a check list of suspected unregistered "key personnel and houses" is finally generated. Labeled key personnel and houses thus serve as cores, persons and houses similar to them are screened out by a density clustering algorithm, and the range of suspected key personnel and houses to be checked is narrowed, which can effectively improve the accuracy and efficiency of police checks.

Description

Method for checking important personnel and house missing registration based on DBSCAN clustering algorithm

Technical Field

The invention relates to a method for checking missed registrations of key personnel and houses, and in particular to such a method based on a DBSCAN clustering algorithm.

Background

In the era of big data, data mining plays a major role in many fields. Optimizing the traditional community police checking workflow with big data and algorithms can address a range of quality problems in grassroots public security work. Public security is a natural first field of application: its data are massive and varied, comprising not only traditional structured data but also large amounts of unstructured data. Population and house management is a heavy workload, and with such a large base of people and houses the traditional police checking workflow struggles to meet the needs of public security population and house services.

For the investigation of key personnel and houses in public security data, partitioning and hierarchical methods such as K-means and K-modes are basic clustering methods that were proposed early and are relatively effective. However, they are designed to find spherical clusters, have difficulty finding clusters of arbitrary shape, and require the value of K to be set in advance. Density-based clustering such as DBSCAN treats clusters as dense regions of the data space separated by sparse regions; it can find clusters of arbitrary shape, automatically excludes noise points, and does not require the number of clusters to be specified.

However, the core points obtained by the classical density clustering algorithm include every point that satisfies the neighborhood condition. In population and house clustering, non-key personnel and houses can therefore become core points, and the clustering result is unsatisfactory. In addition, assigning equal weights to all features when computing distance-based similarity weakens the feature attributes that are strongly related to key personnel and houses, so samples may be clustered incorrectly.

In view of the above drawbacks, the designers, through active research and innovation, devised a method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm that has greater practical value in industry.

Disclosure of Invention

To solve the above technical problems, the invention aims to provide a method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm.

The method of the invention for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm comprises the following steps:

Step one: preprocessing the population and house data sets collected by the police, wherein the preprocessing comprises missing-value filling, discretization of categorical variables, and standardization of numerical variables;

Step two: dividing the data labeled "key personnel and houses" into known "key personnel and houses" and unknown "key personnel and houses", fixing the data samples of the known "key personnel and houses" as core points of the density clustering, and separating out the non-core points;

Step three: setting the neighborhood parameters (ε, MinPts), where ε is the neighborhood distance threshold of a sample and MinPts is the threshold on the number of samples within the ε-neighborhood of a sample; classifying the non-core samples on the "key personnel and houses" data set using the DBSCAN clustering algorithm, and analyzing the clustering result;

Step four: fixing all data labeled "key personnel and houses" as core points, classifying the non-core samples of the population and house data sets using a DBSCAN clustering algorithm with adaptive feature weights, and obtaining a clustering result;

Step five: counting and evaluating the clustering result of step four, and finally generating a check list of suspected unregistered "key personnel and houses".

Further, in step one, the data preprocessing is applied to the population- and house-related data features in the public security population and house databases. It comprises one-hot encoding of the categorical features and dimensionless processing of the numerical feature variables; missing values are filled with the mode for categorical features and with the mean for numerical feature variables.

Further, in the above method, the categorical features include gender and marital status, and the numerical feature variables include age and the latitude and longitude of the address.

Further, in the above method, the discretization of categorical variables in step one is as follows: if a feature has N qualitative values, it is expanded into N features; when the original feature value is the i-th qualitative value, the i-th expanded feature is assigned 1 and the other expanded features are assigned 0. Standardization of numerical variables requires the mean $\bar{x}$ and the standard deviation $S$ of each feature dimension; the standardized value is

$$x' = \frac{x - \bar{x}}{S}$$

Further, in the above method, in step two, core points are fixed for the density clustering by assigning weights to the "key personnel and houses" data samples during the distance calculation: a larger positive weight makes a sample more likely to become a core point, and a smaller, negative weight prevents a sample from becoming a core point.

Further, in step three, the DBSCAN algorithm measures the similarity between samples by the Euclidean distance: the smaller the distance, the more similar the samples. Suppose the n samples are divided into K clusters containing $n_1, n_2, \ldots, n_K$ samples respectively. Then $d_p$ denotes the sum of the intra-class distances of all K clusters on the j-th feature, where $x_{ij}$ is the value of the j-th feature of the i-th sample and $m_{kj}$ is the mean of cluster k on the j-th feature.

Likewise, $d_q$ denotes the sum of the inter-class distances of all K clusters on the j-th feature, where $m_j$ is the mean of the data set on the j-th feature. The contribution $c_j$ of feature j to the clustering is then calculated, and finally the feature weight $w_j$ of the j-th feature is obtained, where m is the dimension of the sample features.

This yields a weighted Euclidean distance formula, which gives the similarity d(m, n) between samples.
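The formulas for these quantities are not legible in this text. One plausible reading, consistent with the symbol definitions above, is sketched below. The squared deviations, the ratio form of $c_j$, the normalization of $w_j$, and the cluster index set $C_k$ are all assumptions rather than the patent's confirmed expressions. The two samples are written $a$ and $b$ instead of the source's $m$ and $n$ to avoid a clash with the feature dimension $m$.

```latex
% Assumed reconstruction; C_k is the set of samples in cluster k, |C_k| = n_k
d_p = \sum_{k=1}^{K} \sum_{i \in C_k} \left( x_{ij} - m_{kj} \right)^2
\qquad
d_q = \sum_{k=1}^{K} \left( m_{kj} - m_j \right)^2

c_j = \frac{d_q}{d_p}
\qquad
w_j = \frac{c_j}{\sum_{l=1}^{m} c_l}

d(a, b) = \sqrt{ \sum_{j=1}^{m} w_j \left( x_{aj} - x_{bj} \right)^2 }
```

Under this reading, a feature that separates the cluster means well while keeping each cluster compact receives a large weight.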

Further, in step four, the core-point fixing process uses the scikit-learn machine learning framework to find all core points according to the given neighborhood parameters.

Further, in step four, the feature weights are optimized as follows: the data labeled "key personnel and houses" are divided into known and unknown "key personnel and houses"; the known "key personnel and houses" data samples are fixed as core points of the density clustering according to the core-point fixing step; suitable neighborhood parameters are set; the non-core samples are then classified on the "key personnel and houses" data set using the DBSCAN clustering algorithm; and the contribution of each attribute to the clustering is calculated from the classification result and the feature weights are updated.

Further, in step five, let N be the number of labeled "key personnel and houses" in each class of the clustering result, and determine whether N is greater than or equal to a preset threshold T. If N ≥ T, the unlabeled persons and houses in that class are highly likely to be unregistered "key personnel and houses", and they are entered in the check list of suspected unregistered "key personnel and houses"; otherwise, the class is unlikely to contain unregistered "key personnel and houses", and manual judgment is required.

By means of the scheme, the invention has at least the following advantages:

1. Labeled key personnel and houses are used as cores, and persons and houses similar to them are screened out by a density clustering algorithm. This narrows the range of suspected key personnel and houses to be checked and can effectively improve the accuracy and efficiency of police checks.

2. The adaptive feature-weight mechanism assigns different weights to the attributes, so that the similarity between samples is reflected more accurately and clustering performance improves.

3. The intelligent, recommendation-based police checking mode allows the objects to be checked to be predicted and assessed in advance, making the checking work more scientific and accurate and police work more proactive and safer.

4. The method promotes the transformation and upgrading of policing modes and has practical significance for improving the public security management of houses.

The foregoing is only an overview of the technical solution of the invention. To make the technical means of the invention clearer, preferred embodiments are described in detail below with reference to the accompanying drawings.

Drawings

FIG. 1 is a flow chart of feature weight optimization.

FIG. 2 is a flow chart of a DBSCAN clustering algorithm that adapts feature weights.

Detailed Description

The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

As shown in FIG. 1 and FIG. 2, the method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm comprises the following steps:

First, the population and house data sets collected by the police are preprocessed, including missing-value filling, discretization of categorical variables, and standardization of numerical variables. Specifically, the population- and house-related data features in the public security population and house databases are preprocessed by one-hot encoding the categorical features and applying dimensionless processing to the numerical feature variables. For convenient classification, the categorical features include gender and marital status, and the numerical feature variables include age and the latitude and longitude of the address.

Dimensionless processing removes differences in scale between features; without it, features with larger numerical ranges would dominate the distance calculation. Missing values are filled with the mode for categorical features and with the mean for numerical feature variables. Filling is used instead of discarding because dropping key features that contain missing values could strongly affect the clustering result.
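The filling rule can be illustrated with a minimal standard-library sketch. The function names, the use of `None` for a missing value, and the tie-breaking rule (first most frequent value encountered) are assumptions made for illustration; the source does not say how ties between modes are resolved.

```python
from statistics import fmean, multimode

def fill_categorical(values):
    """Replace None with the mode of the observed values (first mode on ties)."""
    observed = [v for v in values if v is not None]
    mode = multimode(observed)[0]
    return [mode if v is None else v for v in values]

def fill_numeric(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = fmean(observed)
    return [mean if v is None else v for v in values]

# Hypothetical columns, not taken from the source data
marital_status = fill_categorical(["married", None, "married", "single"])
age = fill_numeric([30.0, None, 50.0, 40.0])
```
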

The categorical variables are discretized as follows: if a feature has N qualitative values, it is expanded into N features; when the original feature value is the i-th qualitative value, the i-th expanded feature is assigned 1 and the other expanded features are assigned 0.
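A minimal sketch of this expansion follows. The function name `one_hot` and the explicit category list are illustrative assumptions; the gender coding (1 = male, 2 = female) is the one used in the document's example data.

```python
def one_hot(value, categories):
    """Expand one categorical value into len(categories) binary features."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # The position matching the value gets 1, every other position gets 0
    return [1 if value == c else 0 for c in categories]

GENDER_CODES = [1, 2]  # 1 = male, 2 = female
female = one_hot(2, GENDER_CODES)
male = one_hot(1, GENDER_CODES)
```
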

Standardization of numerical variables requires the mean $\bar{x}$ and the standard deviation $S$ of each feature dimension; the standardized value is

$$x' = \frac{x - \bar{x}}{S}$$

Next, the data labeled "key personnel and houses" are divided into known "key personnel and houses" and unknown "key personnel and houses"; the known "key personnel and houses" data samples are fixed as core points of the density clustering, and the non-core points are separated out.

Core points are fixed for the density clustering by assigning weights to the "key personnel and houses" data samples during the distance calculation: a larger positive weight makes a sample more likely to become a core point, and a smaller, negative weight prevents a sample from becoming a core point.
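This weighting matches the documented behavior of the `sample_weight` argument of scikit-learn's `DBSCAN.fit`: a sample whose weight is at least `min_samples` is a core sample by itself, and a negative weight can inhibit nearby samples from being core. The sketch below assumes that mechanism is what is meant. The toy coordinates, `eps`, `min_pts`, and the choice of weight 1 for unlabeled samples are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical standardized features: rows 0-1 are known key samples, rows 2-5 unlabeled
X = np.array([
    [0.0, 0.0],
    [5.0, 5.0],
    [0.1, 0.0],
    [5.1, 5.0],
    [9.0, 9.0],
    [9.1, 9.0],
])
known_key = np.array([True, True, False, False, False, False])

eps, min_pts = 0.5, 3

# A weight of min_pts guarantees that each known key sample is a core sample
weights = np.where(known_key, float(min_pts), 1.0)

weighted = DBSCAN(eps=eps, min_samples=min_pts).fit(X, sample_weight=weights)
plain = DBSCAN(eps=eps, min_samples=min_pts).fit(X)

print(weighted.labels_)  # clusters form around the two known key samples
print(plain.labels_)     # without weights, no neighborhood reaches min_pts
```

Note that in scikit-learn an unlabeled sample lying within `eps` of a heavily weighted sample also reaches the weighted threshold, so this mechanism guarantees that known samples are core but does not by itself exclude others.
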

Next, the neighborhood parameters (ε, MinPts) are set, where ε is the neighborhood distance threshold of a sample and MinPts is the threshold on the number of samples within the ε-neighborhood of a sample.
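The role of the two parameters can be made concrete with a small pure-Python sketch of the core-point test. The helper names are hypothetical, and counting the sample itself as part of its own neighborhood is an assumption that follows the scikit-learn convention.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all samples within distance eps of sample i (sample i included)."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

def is_core(points, i, eps, min_pts):
    """A sample is a core point if its eps-neighborhood holds at least min_pts samples."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

# Hypothetical 2-D samples: three close together and one isolated
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (3.0, 3.0)]
dense_is_core = is_core(points, 0, eps=0.5, min_pts=3)
isolated_is_core = is_core(points, 3, eps=0.5, min_pts=3)
```
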

The non-core samples are then classified on the "key personnel and houses" data set using the DBSCAN clustering algorithm, and the clustering result is analyzed.

Specifically, the DBSCAN algorithm measures the similarity between samples by the Euclidean distance: the smaller the distance, the more similar the samples. Suppose the n samples are divided into K clusters containing $n_1, n_2, \ldots, n_K$ samples respectively. Then $d_p$ denotes the sum of the intra-class distances of all K clusters on the j-th feature, where $x_{ij}$ is the value of the j-th feature of the i-th sample and $m_{kj}$ is the mean of cluster k on the j-th feature.

Likewise, $d_q$ denotes the sum of the inter-class distances of all K clusters on the j-th feature, where $m_j$ is the mean of the data set on the j-th feature. The contribution $c_j$ of feature j to the clustering is then calculated, where m is the dimension of the sample features.
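The following NumPy sketch shows one way these quantities could be computed. Because the formulas are not legible in this text, several choices are assumptions and not the patent's confirmed expressions: squared deviations for both sums, the contribution taken as the ratio of inter-class to intra-class sum, weights normalized to sum to 1, noise samples (label -1) skipped, and a small constant `tiny` guarding against division by zero.

```python
import numpy as np

def feature_weights(X, labels, tiny=1e-12):
    """Per-feature weights from intra-class (d_p) and inter-class (d_q) sums (assumed forms)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m_j = X.mean(axis=0)                 # mean of the data set on each feature
    d_p = np.zeros(X.shape[1])
    d_q = np.zeros(X.shape[1])
    for k in np.unique(labels):
        if k == -1:                      # assumption: noise points form no cluster
            continue
        members = X[labels == k]
        m_kj = members.mean(axis=0)      # mean of cluster k on each feature
        d_p += ((members - m_kj) ** 2).sum(axis=0)
        d_q += (m_kj - m_j) ** 2
    c_j = d_q / np.maximum(d_p, tiny)    # assumed contribution: separation over compactness
    return c_j / c_j.sum()               # normalize so the weights sum to 1

def weighted_euclidean(a, b, w):
    """Weighted Euclidean distance between two samples."""
    a, b, w = (np.asarray(v, dtype=float) for v in (a, b, w))
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))

# Hypothetical data: feature 0 separates the two clusters, feature 1 barely does
X = [[0, 0], [1, 2], [10, 1], [11, 3]]
labels = [0, 0, 1, 1]
w = feature_weights(X, labels)
```

In this toy data the first feature receives almost all of the weight, which is the intended effect: attributes that separate clusters dominate the distance.
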

Then, all data labeled "key personnel and houses" are fixed as core points, and the non-core samples of the population and house data sets are classified by a DBSCAN clustering algorithm with adaptive feature weights to obtain a clustering result.

Specifically, the core-point fixing process uses the scikit-learn machine learning framework to find all core points according to the given neighborhood parameters (ε, MinPts). To optimize the feature weights, the data labeled "key personnel and houses" are divided into known and unknown "key personnel and houses", the known "key personnel and houses" data samples are fixed as core points of the density clustering according to the core-point fixing step, and suitable neighborhood parameters are set.

The non-core samples are then classified on the "key personnel and houses" data set using the DBSCAN clustering algorithm, the contribution of each attribute to the clustering is calculated from the classification result, and the feature weights are updated.
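Once feature weights are available, a weighted Euclidean distance can be used with scikit-learn's `DBSCAN` by multiplying each feature by the square root of its weight, since the ordinary Euclidean distance on the rescaled data equals the weighted distance on the original data. The sketch below assumes that distance form. The function name `weighted_dbscan`, the toy data, and the weight values are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def weighted_dbscan(X, w, eps, min_pts):
    """DBSCAN under the distance sqrt(sum_j w_j * (a_j - b_j)^2), via feature rescaling."""
    X = np.asarray(X, dtype=float)
    scaled = X * np.sqrt(np.asarray(w, dtype=float))
    return DBSCAN(eps=eps, min_samples=min_pts).fit_predict(scaled)

# Hypothetical data: feature 0 carries the cluster structure, feature 1 is mostly noise
X = np.array([[0.0, 0.0], [0.0, 3.0], [4.0, 0.0], [4.0, 3.0]])

plain = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
weighted = weighted_dbscan(X, w=[0.99, 0.01], eps=1.0, min_pts=2)

print(plain)     # equal weights: every sample is noise
print(weighted)  # down-weighting feature 1 pairs samples that share feature 0
```

In an adaptive scheme, the weights passed to such a function would be recomputed from each clustering result and the clustering repeated.
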

Finally, the clustering result is counted and evaluated, and a check list of suspected unregistered "key personnel and houses" is generated. Specifically, let N be the number of labeled "key personnel and houses" in each class of the clustering result, and determine whether N is greater than or equal to a preset threshold T.

If N ≥ T, the unlabeled persons and houses in that class are highly likely to be unregistered "key personnel and houses", and they are entered in the check list of suspected unregistered "key personnel and houses". Otherwise, the class is unlikely to contain unregistered "key personnel and houses", and manual judgment is required: an expert re-evaluates the class based on experience, which reduces manual misjudgment. The checking task can also be pushed to police officers.
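The decision rule can be sketched as follows. The function name, the returned pair of lists, and the handling of noise samples (label -1 is ignored) are assumptions made for illustration; the source only specifies the comparison of N with T for each class.

```python
from collections import defaultdict

def build_checklist(labels, is_labeled_key, threshold):
    """Split unlabeled samples into a check list (N >= T) and a manual-review list (N < T)."""
    members = defaultdict(list)
    for idx, cluster in enumerate(labels):
        if cluster != -1:                 # assumption: noise samples belong to no class
            members[cluster].append(idx)

    checklist, manual_review = [], []
    for cluster, idxs in sorted(members.items()):
        n_key = sum(1 for i in idxs if is_labeled_key[i])   # N for this class
        unlabeled = [i for i in idxs if not is_labeled_key[i]]
        if n_key >= threshold:
            checklist.extend(unlabeled)
        else:
            manual_review.extend(unlabeled)
    return checklist, manual_review

# Hypothetical clustering result: class 0 holds two labeled key samples, class 1 holds one
labels = [0, 0, 0, 1, 1, -1]
is_labeled_key = [True, True, False, True, False, False]
checklist, manual_review = build_checklist(labels, is_labeled_key, threshold=2)
```
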

The method thus takes into account the complexity of personnel and houses and the shortage of police officers in public security population management. Through data screening and cluster analysis, suitable clustering features are selected with full regard to the characteristics of key personnel and houses, so that key personnel and houses are identified accurately. A check list of suspected missed registrations is also generated, which makes it convenient for the public security authorities to push checking tasks. The density clustering algorithm narrows the range of unregistered key personnel and houses to be checked, ultimately improves the accuracy of population and house checks, and provides decision support for other areas of public security management and control.

The working principle of the invention is as follows:

Table 1 shows the population data before preprocessing (house data are handled similarly):

No.	Age	Gender	Education level	Latitude	Longitude
1	73	2	Null	31.323771	120.666739
2	53	1	Null	31.315803	120.665558
3	46	2	Null	31.317036	120.747582
4	29	2	70	34.646452	116.912783
5	21	1	40	32.066899	118.193343
6	46	1	70	27.221181	111.248449
7	44	1	20	31.319655	120.731328
8	62	2	Null	31.320779	120.665973
9	31	1	60	35.828924	116.013732
10	32	1	60	34.357221	115.363676

Where:

Gender: 1 = male, 2 = female. Education level: 20 = undergraduate, 40 = secondary vocational school, 60 = senior high school, 70 = junior high school, Null = missing.

After treatment by the method of the present invention, table 2 was obtained as follows:

Taking record No. 1 as an example, the data preprocessing proceeds as follows:

Discretization of categorical variables: the gender feature has 2 qualitative values, male and female, so it is expanded into 2 features. The gender value of record No. 1 is the 2nd qualitative value, so the 2nd expanded feature is assigned 1 and the other expanded feature is assigned 0; after discretization the gender feature is represented as (0, 1).

Standardization of numerical variables: first, the mean and standard deviation of the age feature are calculated as 43.7 and 15.23187447 respectively; the standardized age is then obtained from the standardization formula:

(73 - 43.7) / 15.23187447 = 1.923598

The standardized latitude and longitude values are obtained in the same way.

Missing-value filling: the education level contains the missing value Null, which is filled with the mode of the observed values, taken here as 60 (in Table 1 the codes 60 and 70 each occur twice); the feature is then discretized in the same way as the gender feature.
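The age figures in this worked example can be reproduced with the standard library. The sketch uses the Age column of Table 1 and assumes the population standard deviation (dividing by the number of samples), which is the variant that matches the reported value of 15.23187447.

```python
import statistics

# Age column of Table 1, records No. 1 to No. 10
ages = [73, 53, 46, 29, 21, 46, 44, 62, 31, 32]

mean_age = statistics.fmean(ages)
std_age = statistics.pstdev(ages)   # population standard deviation (divides by n)

def standardize(x, mean, std):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    return (x - mean) / std

z_first = standardize(ages[0], mean_age, std_age)

# The source reports 43.7, 15.23187447 and 1.923598 for these three values
print(round(mean_age, 1), round(std_age, 8), round(z_first, 6))
```
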

As can be seen from the above description and the accompanying drawings, the invention has the following advantages:

1. Labeled key personnel and houses are used as cores, and persons and houses similar to them are screened out by a density clustering algorithm. This narrows the range of suspected key personnel and houses to be checked and can effectively improve the accuracy and efficiency of police checks.

The above describes only preferred embodiments of the invention and is not intended to limit it. Those skilled in the art may make improvements and modifications without departing from the technical principle of the invention, and such improvements and modifications shall also fall within the scope of protection of the invention.

Claims

1. A method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm, characterized by comprising the following steps:

step three, setting the neighborhood parameters (ε, MinPts), where ε is the neighborhood distance threshold of a sample and MinPts is the threshold on the number of samples within the ε-neighborhood of a sample, classifying the non-core samples on the "key personnel and houses" data set using the DBSCAN clustering algorithm, and analyzing the clustering result,

in step three, the DBSCAN algorithm measures the similarity between samples by the Euclidean distance, the smaller the distance, the more similar the samples; the n samples are divided into K clusters containing $n_1, n_2, \ldots, n_K$ samples respectively; $d_p$ denotes the sum of the intra-class distances of all K clusters on the j-th feature;

$m_j$ is the mean of the data set on the j-th feature, and the contribution $c_j$ of feature j to the clustering is calculated;

m is the dimension of the sample features,

step four, fixing all data labeled "key personnel and houses" as core points, and classifying the non-core samples of the population and house data sets using a DBSCAN clustering algorithm with adaptive feature weights to obtain a clustering result,

in step four, the core-point fixing process uses the scikit-learn machine learning framework to find all core points according to the given neighborhood parameters;

in step four, the feature weights are optimized: the data labeled "key personnel and houses" are divided into known and unknown "key personnel and houses", the known "key personnel and houses" data samples are fixed as core points of the density clustering according to the core-point fixing step, suitable neighborhood parameters are set, the non-core samples are then classified on the "key personnel and houses" data set using the DBSCAN clustering algorithm, the contribution of each attribute to the clustering is calculated from the classification result, and the feature weights are updated;

2. The method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm according to claim 1, characterized in that: in step one, the data preprocessing is applied to the population- and house-related data features in the public security population and house databases, and comprises one-hot encoding of the categorical features and dimensionless processing of the numerical feature variables, wherein missing values are filled with the mode for categorical features and with the mean for numerical feature variables.

3. The method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm according to claim 1, characterized in that: the categorical features include gender and marital status, and the numerical feature variables include age and the latitude and longitude of the address.

4. The method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm according to claim 1, characterized in that: in step one, the categorical variables are discretized as follows: if a feature has N qualitative values, it is expanded into N features; when the original feature value is the i-th qualitative value, the i-th expanded feature is assigned 1 and the other expanded features are assigned 0; and standardization of numerical variables requires the mean $\bar{x}$ and the standard deviation $S$ of each feature dimension, the standardized value being

$$x' = \frac{x - \bar{x}}{S}$$

5. The method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm according to claim 1, characterized in that: in step two, core points are fixed for the density clustering by assigning weights to the "key personnel and houses" data samples during the distance calculation, wherein a larger positive weight makes a sample more likely to become a core point, and a smaller, negative weight prevents a sample from becoming a core point.

6. The method for checking missed registrations of key personnel and houses based on a DBSCAN clustering algorithm according to claim 1, characterized in that: in step five, N is the number of labeled "key personnel and houses" in each class of the clustering result, and it is determined whether N is greater than or equal to a preset threshold T; if N ≥ T, the unlabeled persons and houses in that class are highly likely to be unregistered "key personnel and houses", and a check list of suspected unregistered "key personnel and houses" is finally generated; otherwise, the class is unlikely to contain unregistered "key personnel and houses", and manual judgment is required.