
CN114398934A - High-risk area identification method based on clustering algorithm

Info

Publication number: CN114398934A
Application number: CN202111229509.XA
Authority: CN
Inventors: 程涛; 廖毅; 李英; 罗龑
Original assignee: Chinaccs Information Industry Co ltd
Current assignee: Chinaccs Information Industry Co ltd
Priority date: 2021-10-21
Filing date: 2021-10-21
Publication date: 2022-04-26

Abstract

The invention provides a high-risk area identification method based on a clustering algorithm and belongs to the technical field of high-risk area identification. The technical scheme is as follows: the method comprises interfacing with the information systems of relevant departments, obtaining historical case data, and generating a training data set; extracting address information and high-risk features and generating a high-risk area feature vector set; processing the high-risk area feature vector set with a clustering algorithm, training a clustering model, and generating a model library; and extracting residence information according to the identity information of a target user and judging whether the target user comes from an area with high-risk characteristics. The beneficial effects of the invention are as follows: historical case data from relevant departments is processed, and regions and high-risk features are clustered using automatic feature extraction and an unsupervised clustering machine learning algorithm, so that high-risk areas are identified automatically.

Description

High-risk area identification method based on clustering algorithm

Technical Field

The invention relates to the technical field of high-risk area identification, in particular to a high-risk area identification method based on a clustering algorithm.

Background

A high-risk area is defined as follows: when persons with certain high-risk characteristics frequently appear within a certain address or area range (the high-risk characteristics should be defined according to the identification requirements and relevant regulations), that address area can be designated a high-risk area with those characteristics. In the daily management work of relevant departments, when a person's place of origin or residence has certain high-risk area characteristics, the corresponding measures for that high-risk area are adopted to apply focused prevention and control to the person.

At present, the identification of high-risk areas is mainly realized by adopting the following two modes:

Experience: identification relies on business experience accumulated over long-term work, so the probability of errors and omissions is relatively high;

Rule engine: once the experience is digitized, it can be converted into rules and matched automatically by a rule engine. A rule engine is simple to compute and efficient; however, the rules still need manual maintenance, and if they are not updated in time they cannot reflect changes in objective conditions.

Disclosure of Invention

In view of the above problems in the prior art, an object of the present invention is to provide a high-risk area identification method based on a clustering algorithm, which generates a training data set by processing historical information from related systems and identifies high-risk areas automatically by clustering areas and high-risk features using automatic feature extraction and an unsupervised clustering machine learning algorithm.

The invention is realized by the following technical scheme: a high-risk area identification method based on a clustering algorithm comprises the following steps:

interfacing with the information systems of the relevant departments, obtaining historical case data, and generating from the case data an associated data set of case event information, address information and high-risk characteristics, which is used as the training data set; the case text is converted into features by Chinese word segmentation, and the high-risk feature words are likewise extracted as features; the wording of the high-risk feature words follows the conventions of relevant laws and regulations, such as theft and robbery, and the address information corresponds to the residential addresses and household registration addresses of the persons involved in the cases;

extracting address information in the training data set, coding the address information, generating an address vector corresponding to each address, and finally forming an address vector set;

merging the address vectors whose similarity exceeds a set threshold in the address vector set;

extracting the high-risk features in the training data set and encoding them to form a high-risk feature vector set; the high-risk features in all samples are extracted and the text is indexed to form the final high-risk feature codes, for example theft -> 1 and robbery -> 2;

associating the address vector set with the high-risk feature vector set to obtain a high-risk area feature vector set; for example, when {xx province, xx city, xx county, theft} is converted into a high-risk area feature vector, the result may be {1, 2, 5, 6, 9};

calculating the feature vector set of the high-risk area by using a clustering algorithm, performing clustering model training, and generating a model library;

extracting residence information according to the identity data of a target user, and encoding the residence information to generate an address code to be identified;

and matching the address code to be identified with the model library, and judging, after model prediction, whether the target user comes from an area with high-risk characteristics.
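The feature indexing and association steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `build_feature_index` and `associate` are hypothetical, the address vector is assumed to be already encoded as a list of integers, and plain concatenation is assumed as the association rule because the source gives only the example result.

```python
from typing import Dict, List, Sequence


def build_feature_index(samples: Sequence[str]) -> Dict[str, int]:
    """Assign each distinct high-risk feature word a 1-based integer code,
    in order of first appearance (e.g. theft -> 1, robbery -> 2)."""
    index: Dict[str, int] = {}
    for word in samples:
        if word not in index:
            index[word] = len(index) + 1
    return index


def associate(address_vector: Sequence[int], feature_code: int) -> List[int]:
    """Join an address vector and a feature code into one area feature vector.
    Concatenation is an assumption; the source does not fix the layout."""
    return list(address_vector) + [feature_code]


features = ["theft", "robbery", "theft"]
feature_index = build_feature_index(features)
# [3, 7, 12] stands in for an already encoded province/city/county address.
region_vector = associate([3, 7, 12], feature_index["robbery"])
print(feature_index)   # {'theft': 1, 'robbery': 2}
print(region_vector)   # [3, 7, 12, 2]
```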

Further, an update period is set; newly added case data is acquired periodically and an incremental data set with the same format as the training data set is generated; the address vector set and high-risk feature vector set corresponding to the incremental data set are extracted, associated, and merged into the current high-risk area feature vector set; clustering model training is then performed again and the model library is updated.

Further, encoding the address specifically comprises: first, the address is segmented using the national standard geographic information base, and each word is assigned a numeric index, thereby vectorizing the address. To improve generalization, addresses are resolved only to the city or county level.
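As an illustration of this encoding, the sketch below segments an address by forward maximum matching against a small place-name dictionary and returns the numeric indices. The dictionary `GAZETTEER`, the function name `encode_address`, the matching strategy, and the `max_levels` cut-off are all assumptions made for the example; the source states only that the national standard geographic information base is used for segmentation and indexing.

```python
from typing import Dict, List

# Hypothetical gazetteer: place name -> numeric index. A real system would load
# these entries from the national standard geographic information base.
GAZETTEER: Dict[str, int] = {
    "山东省": 1,
    "济南市": 2,
    "青岛市": 3,
    "历下区": 4,
}


def encode_address(address: str, gazetteer: Dict[str, int], max_levels: int = 3) -> List[int]:
    """Segment an address by forward maximum matching against the gazetteer and
    return the indices of the matched place names, truncated to max_levels."""
    longest = max(len(name) for name in gazetteer)
    codes: List[int] = []
    pos = 0
    while pos < len(address) and len(codes) < max_levels:
        # Try the longest candidate first so a full place name wins over a prefix.
        for size in range(min(longest, len(address) - pos), 0, -1):
            candidate = address[pos:pos + size]
            if candidate in gazetteer:
                codes.append(gazetteer[candidate])
                pos += size
                break
        else:
            pos += 1  # No gazetteer entry starts here; skip one character.
    return codes


address_vector = encode_address("山东省济南市历下区某某路1号", GAZETTEER)
print(address_vector)  # [1, 2, 4]
```

Stopping after `max_levels` matches is one way to keep the address at county level; street and house number never reach the vector.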

Furthermore, the similarity of address vectors is calculated using the Euclidean distance, and addresses whose similarity exceeds the threshold are merged over multiple rounds of iteration.

The similarity of the address vectors is calculated as follows: the distance $\rho(A, B)$ between $A = (a[1], a[2], \dots, a[n])$ and $B = (b[1], b[2], \dots, b[n])$ is defined by the following formula:

$$\rho(A, B) = \sqrt{\sum_{i=1}^{n} \left(a[i] - b[i]\right)^2}$$

where a smaller value of $\rho(A, B)$ indicates a higher degree of similarity between the two address vectors $A$ and $B$.
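The distance calculation and threshold-based merging can be sketched as below. Because a smaller distance means a higher similarity, "similarity above a threshold" is expressed here as "distance at or below `max_distance`". The names `euclidean_distance` and `merge_similar`, and the choice to keep the first vector of each group as its representative, are assumptions; the source does not say how merged addresses are represented.

```python
import math
from typing import List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_similar(vectors: Sequence[Sequence[float]], max_distance: float) -> List[List[float]]:
    """Keep one representative per group of near-identical address vectors.
    A vector is dropped when it lies within max_distance of a kept vector."""
    representatives: List[List[float]] = []
    for vector in vectors:
        if all(euclidean_distance(vector, kept) > max_distance for kept in representatives):
            representatives.append(list(vector))
    return representatives


merged = merge_similar([[1, 2, 4], [1, 2, 5], [1, 9, 30]], max_distance=1.5)
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(merged)                              # [[1, 2, 4], [1, 9, 30]]
```

With fixed representatives a single pass is enough; the multiple rounds mentioned in the text would matter if the merged vector were recomputed after each merge.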

Further, the training of the clustering model specifically comprises: the clustering algorithm is a K-means algorithm realized based on Spark; calculating a K value; inputting the calculated K value and the feature vector; the calculated results are stored in a model library.

Further, the clustering (K-Means) algorithm process is as follows:

1. Given an initial data set $D = \{x_1, x_2, \dots, x_m\}$, K-Means divides the data into K clusters, each cluster representing a different category;

2. From the training set $D$, K centroids are randomly selected, denoted $\mu_1, \mu_2, \dots, \mu_K$, and the clusters are initialized as $C_j = \emptyset$ for $j = 1, \dots, K$;

3. For each sample $x_i$, calculate the distance $d_{ij} = \lVert x_i - \mu_j \rVert_2$ to each centroid vector $\mu_j$, assign $x_i$ to the class $C_m$ for which $d_{ij}$ is minimal, and update $C_m = C_m \cup \{x_i\}$;

4. Recalculate the centroid of each cluster $C_j$:

$$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$$

5. Repeat steps 3 and 4 until the K centroid vectors no longer change or the maximum number of iterations is reached.
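The five steps can be sketched on a single machine in plain Python as follows. This is an illustrative stand-in and not the Spark-based implementation the source refers to; the function name `k_means`, the `seed` parameter, and the handling of empty clusters (the old centroid is kept) are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def k_means(data: Sequence[Sequence[float]], k: int,
            max_iter: int = 100, seed: int = 0) -> Tuple[List[Vector], List[int]]:
    """Return (centroids, labels) after running the K-Means steps on data."""
    rng = random.Random(seed)
    # Step 2: pick K distinct samples as the initial centroids.
    centroids = [list(p) for p in rng.sample(list(data), k)]
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Step 3: assign every sample to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(x, centroids[j])) for x in data]
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids: List[Vector] = []
        for j in range(k):
            members = [x for x, label in zip(data, labels) if label == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # Empty cluster: keep old centroid.
        # Step 5: stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [[1, 1], [1, 2], [2, 1], [9, 9], [9, 10], [10, 9]]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialization.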

When the K-Means clustering algorithm is used, the K value must be obtained manually or by calculation, and its accuracy directly affects the final clustering result. In general, the K value is selected by combining manual estimation with calculation: the K value is first estimated manually and then verified with the Elbow method. The Elbow method computes the value of the loss function for different K values; the K value at which the rate of change of the loss function changes sharply is a suitable K value. After the K value is determined, the K-means algorithm is implemented on Spark, the calculated K value and the feature vectors are input, and the result is stored in the model library.

During clustering, if the clustering result is poor, the K value must be adjusted and the feature encoding algorithm modified.
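One simple way to locate the elbow is to take the K with the largest second difference of the loss curve, that is, the point where the loss stops dropping quickly. The sketch below assumes the loss has already been computed for several consecutive K values; the function name `elbow_k` and the loss values are made up for illustration, and the second-difference rule is only one of several possible elbow criteria.

```python
from typing import Dict


def elbow_k(loss_by_k: Dict[int, float]) -> int:
    """Return the K at which the decrease in loss slows down the most,
    measured by the largest second difference of the loss curve."""
    ks = sorted(loss_by_k)
    if len(ks) < 3:
        raise ValueError("need losses for at least three K values")
    best_k, best_bend = ks[1], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = loss_by_k[prev_k] - loss_by_k[k]
        drop_after = loss_by_k[k] - loss_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Illustrative loss values only; real values come from clustering runs.
losses = {1: 1000.0, 2: 600.0, 3: 150.0, 4: 120.0, 5: 100.0}
print(elbow_k(losses))  # 3
```

A manually estimated K can then be compared with this value, in line with the manual-plus-calculation approach described above.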

A high-risk area identification system based on a clustering algorithm comprises: a first acquisition unit, used for interfacing with the information systems of relevant departments, acquiring historical case data, and generating from the case data an associated data set comprising case information, address information and high-risk characteristics to serve as a training data set;

the address vector generating unit is used for extracting the address information in the training data set, coding addresses, generating an address vector corresponding to each address, and finally forming an address vector set;

the address vector merging unit is used for merging the address vectors with similarity exceeding a set threshold in the address vector set;

the second acquisition unit is used for interfacing with the information systems of relevant departments and acquiring an incremental data set by means of real-time stream processing, the incremental data set being new data that is continuously generated over time;

the high-risk feature vector generating unit is used for extracting high-risk features in the training data set and the incremental data set, and encoding the high-risk features to form a high-risk feature vector set;

the vector merging unit is used for obtaining a high-risk area feature vector set after associating the address vector set with the high-risk feature vector set;

the model base generation unit is used for calculating the high-risk area feature vector set by using a clustering algorithm, carrying out clustering model training and generating a model base;

the identification unit is used for extracting the residence information of the target user and encoding the residence information to generate an address code to be identified;

and the model prediction unit is used for matching the address code to be identified with the model library and predicting and judging whether the target user is from an area with high-risk characteristics.
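The matching performed by the model prediction unit is not specified in detail. One possible reading, sketched below, treats the model library as a list of cluster centroids whose leading components are the address part of the area feature vector, and reports the nearest centroid when it lies within a distance limit. The function name `match_region`, the centroid layout, and the `max_distance` parameter are all assumptions made for this example.

```python
import math
from typing import Optional, Sequence


def match_region(address_code: Sequence[float],
                 model_library: Sequence[Sequence[float]],
                 max_distance: float) -> Optional[int]:
    """Return the index of the nearest centroid whose address part lies within
    max_distance of address_code, or None when no centroid is close enough."""
    best_index: Optional[int] = None
    best_distance = max_distance
    n = len(address_code)
    for index, centroid in enumerate(model_library):
        # Compare only the address components of the centroid (assumed to come first).
        distance = math.dist(address_code, centroid[:n])
        if distance <= best_distance:
            best_index, best_distance = index, distance
    return best_index


# Hypothetical model library: three address components followed by a feature code.
library = [[1.0, 2.0, 4.0, 1.0], [1.0, 9.0, 30.0, 2.0]]
print(match_region([1, 2, 4], library, max_distance=1.0))  # 0
print(match_region([5, 5, 5], library, max_distance=1.0))  # None
```

A result of `None` would mean that the address does not fall into any area the model associates with high-risk characteristics.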

Further, the system further comprises an updating unit, configured to update the model library, specifically: setting an updating period, periodically acquiring newly added case data, generating an incremental data set with the same format as the training data set, extracting and associating an address vector set and a high-risk feature vector set corresponding to the incremental data set, updating to the current high-risk area feature vector set, performing clustering model training again, and updating the model library.

The invention has the following beneficial effects: the method uses unsupervised learning, does not need a large amount of labeled data, and has a low training cost; it also uses Spark distributed computation for training, so training is fast, larger data sets can be used, and the model can be verified quickly. Once the model is trained, it can support the daily administrative and social management work of relevant departments: the high-risk area characteristics of a target person's residence and household registration location can be judged quickly without relying on experience, which enables quick response, improves the work efficiency of relevant departments, and reduces the cost of analysis and judgment.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a flow chart of a clustering process;

FIG. 3 is a table of associated data sets.

Detailed Description

In order to clearly illustrate the technical features of the present solution, the present solution is explained below by way of specific embodiments.

In the first embodiment, referring to FIGS. 1-3, a high-risk area identification method based on a clustering algorithm comprises the following steps:

interfacing with the information systems of the relevant departments, obtaining historical case data, and generating from the case data an associated data set comprising case information, address information and high-risk characteristics, which is used as the training data set; the case text is converted into features by Chinese word segmentation, and the high-risk feature words are likewise extracted as features; the wording of the high-risk feature words follows the conventions of relevant laws and regulations, such as theft and robbery, and the address information corresponds to the residential addresses and household registration addresses of the persons involved in the cases;

merging the address vectors with similarity exceeding a set threshold in the address vector set;

Setting an updating period, periodically acquiring newly added case data, generating an incremental data set with the same format as the training data set, extracting and associating an address vector set and a high-risk feature vector set corresponding to the incremental data set, updating to the current high-risk area feature vector set, performing clustering model training again, and updating the model library.

The encoding of the address specifically comprises: first, the address is segmented using the national standard geographic information base, and each word is assigned a numeric index, thereby vectorizing the address. To improve generalization, addresses are resolved only to the city or county level.

The similarity of address vectors is calculated using the Euclidean distance, and addresses whose similarity exceeds the threshold are merged over multiple rounds of iteration.

The clustering model training specifically comprises the following steps: the clustering algorithm is a K-means algorithm realized based on Spark; calculating a K value; inputting the calculated K value and the feature vector; the calculated results are stored in a model library.

The clustering (K-Means) algorithm procedure is as follows:

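1. Given an initial data set $D = \{x_1, x_2, \dots, x_m\}$, K-Means divides the data into K clusters, each cluster representing a different category;

2. From the training set $D$, K centroids are randomly selected, denoted $\mu_1, \mu_2, \dots, \mu_K$, and the clusters are initialized as $C_j = \emptyset$ for $j = 1, \dots, K$;

3. For each sample $x_i$, calculate the distance $d_{ij} = \lVert x_i - \mu_j \rVert_2$ to each centroid vector $\mu_j$, assign $x_i$ to the class $C_m$ for which $d_{ij}$ is minimal, and update $C_m = C_m \cup \{x_i\}$;

4. Recalculate the centroid of each cluster $C_j$:

$$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$$

5. Repeat steps 3 and 4 until the K centroid vectors no longer change or the maximum number of iterations is reached.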

In a second embodiment, a high-risk area identification system based on a clustering algorithm comprises a first acquisition unit, used for interfacing with the information systems of relevant departments, acquiring historical case data, and generating from the case data an associated data set comprising case information, address information and high-risk characteristics to serve as a training data set;

The system further comprises an updating unit for updating the model library, specifically: an update period is set; newly added case data is acquired periodically and an incremental data set with the same format as the training data set is generated; the address vector set and high-risk feature vector set corresponding to the incremental data set are extracted, associated, and merged into the current high-risk area feature vector set; clustering model training is then performed again and the model library is updated.

In the description of the present invention, the foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. To the extent that such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, those skilled in the art will appreciate that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of different hardware, software, firmware, or virtually any combination thereof.

There is little difference between hardware and software implementations of aspects of the system; the use of hardware or software is typically (but not always, since in some scenarios the choice between hardware and software may become important) a design choice representing a cost versus efficiency tradeoff. There are various means (e.g., hardware, software, and/or firmware) by which processes and/or systems and/or other techniques described herein can be implemented, and the preferred means will vary from one scenario in which processes and/or systems and/or other techniques are deployed to another. For example, if the implementer determines that speed and accuracy are paramount, the implementer may opt for a mainly hardware and/or firmware approach; if flexibility is paramount, the implementer may opt for a mainly software implementation; or, yet again alternatively, the implementer may opt for some combination of hardware, software, and/or firmware.

Technical features of the present invention that are not described in the above embodiments may be implemented using the prior art and are not described again here. The above description is not intended to limit the present invention, and the present invention is not limited to the above examples; variations, modifications, additions or substitutions made by those skilled in the art within the spirit and scope of the present invention also fall within its protection scope.

Claims

1. A high-risk area identification method based on a clustering algorithm is characterized by comprising the following steps:

interfacing with the information systems of relevant departments, obtaining historical case data, and generating from the case data an associated data set comprising case information, address information and high-risk characteristics, which is used as a training data set;

extracting high-risk features in the training data set, and coding the high-risk features to form a high-risk feature vector set;

associating the address vector set with the high-risk feature vector set to obtain a high-risk region feature vector set;

calculating the feature vector set of the high-risk region by using a clustering algorithm, performing clustering model training, and generating a model library;

2. The method for identifying high-risk regions based on clustering algorithm as claimed in claim 1, wherein an update cycle is set, newly added case data is periodically obtained, an incremental data set with the same format as the training data set is generated, an address vector set and a high-risk feature vector set corresponding to the incremental data set are extracted and associated, and updated to the current high-risk region feature vector set, clustering model training is performed again, and the model base is updated.

3. The high-risk region identification method based on clustering algorithm according to claim 1, wherein the encoding of the address specifically comprises: firstly, the national standard geographic information base is adopted for word segmentation, and each word is subjected to digital indexing, so that address vectorization is realized.

4. The method for identifying high-risk regions based on clustering algorithm as claimed in claim 3, wherein the similarity of the address vectors is calculated using the Euclidean distance, and addresses whose similarity exceeds a threshold are merged over multiple rounds of iteration.

5. The high-risk region identification method based on the clustering algorithm as claimed in claim 4, wherein the similarity of the address vectors is calculated as follows: the distance $\rho(A, B)$ between $A = (a[1], a[2], \dots, a[n])$ and $B = (b[1], b[2], \dots, b[n])$ is defined by the following formula:

$$\rho(A, B) = \sqrt{\sum_{i=1}^{n} \left(a[i] - b[i]\right)^2}$$

6. The high-risk region identification method based on the clustering algorithm as claimed in claim 4, wherein the clustering model training is specifically: the clustering algorithm is a K-means algorithm realized based on Spark; calculating a K value; inputting the calculated K value and the feature vector; the calculated results are stored in a model library.

7. A high-risk area identification system based on a clustering algorithm, characterized by comprising a first acquisition unit, used for interfacing with the information systems of relevant departments, acquiring historical case data, and generating from the case data an associated data set comprising case information, address information and high-risk characteristics to serve as a training data set;

8. The high-risk region identification system based on clustering algorithm according to claim 7, further comprising an updating unit for updating the model base, specifically: setting an updating period, periodically acquiring newly added case data, generating an incremental data set with the same format as the training data set, extracting and associating an address vector set and a high-risk feature vector set corresponding to the incremental data set, updating to the current high-risk area feature vector set, performing clustering model training again, and updating the model library.