CN111460052B - Low-security fund supervision method and system based on supervised data correlation analysis - Google Patents

Publication number
CN111460052B
Authority
CN
China
Prior art keywords
data
low
security
fund
clustering center
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010275707.9A
Other languages
Chinese (zh)
Other versions
CN111460052A (en)
Inventor
云静
赵禹萌
王永生
刘利民
许志伟
张紫婷
翟娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology
Priority to CN202010275707.9A
Publication of CN111460052A
Application granted
Publication of CN111460052B
Active legal status
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/285: Clustering or classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/26: Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A low-security fund supervision method based on supervision data association analysis comprises the following steps: step 1, acquiring low-security household information data and low-security fund issuance data, performing data persistence processing, and counting the total amount and total number of times each low-security household received each subsidy within a period of time; step 2, through data collision and decision analysis between the associated tables, obtaining problem data in which beneficiaries who are not in the low-security household information data table received low-security funds, and extracting the data of genuine low-security households receiving low-security funds; and step 3, in the data of genuine low-security households receiving low-security funds, calculating the total amount and total number of times received by each low-security household in one year, and calculating the data clustering centers of the total amount and of the total number of times separately, to obtain the problem data that departs from the clustering centers, namely abnormal annual amount data or abnormal annual times data. The invention achieves efficient supervision of minimum living allowance ("low-security") funds through association analysis of supervision data.

Description

Low-security fund supervision method and system based on supervised data correlation analysis
Technical Field
The invention belongs to the technical field of big data analysis and application, and particularly relates to a low-security fund supervision method and system based on supervision data association analysis.
Background
The minimum living allowance (the "low-security" allowance) bears on the basic food, clothing, and warmth of people in difficulty and on social harmony, stability, and fairness. It is a basic, foundational institutional arrangement that safeguards people's livelihood and promotes fairness, and it plays an active role in guaranteeing the basic living of people in difficulty and in maintaining social stability. When grassroots departments record subsidy items in the database, some subsidy items have no suitable filing category, so all of them are filed as minimum living allowance. These special circumstances make the corresponding supervision work more laborious. At present, supervision departments rely mainly on manual analysis to supervise low-security funds. However, faced with massive low-security fund data from many sources and with complex content, manual analysis alone can hardly find all the integrity problems hidden in the civil-affairs data, and it is difficult to monitor and supervise low-security issuance data comprehensively.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a low-security fund supervision method and system based on supervised data association analysis that, while addressing problems in massive low-security issuance data such as non-standard subsidy issuance and the lack of obvious common features in the data, effectively discovers integrity problems in the low-security field and improves the efficiency of supervision work.
In order to achieve the purpose, the invention adopts the technical scheme that:
A low-security fund supervision method based on supervision data association analysis comprises the following steps:
Step 1, acquiring low-security household information data and one-card subsidy issuance data, and importing both into an Oracle database for data persistence processing; extracting low-security fund issuance data from the one-card data; using a preprocessing method oriented to the low-security household information data and the low-security fund issuance data, removing interference information from the low-security fund issuance data, which contains many kinds of subsidy items, and extracting the key fields of the low-security fund issuance data; in the low-security household information data, first filtering out records in which the household head's identity number or a household member's identity number is missing; thereby obtaining a low-security household information data table and a low-security fund issuance data table, and counting the total amount and total number of times each low-security household received each subsidy within a period of time;
This step mainly extracts the useful information related to low-security data from the massive civil-affairs fund data, so as to ensure the computational efficiency and analysis precision of the subsequent analysis.
Step 2, through data collision and decision analysis between the associated tables, comparing the beneficiary identity numbers in the low-security fund issuance data table with the household head identity numbers and household member identity numbers in the low-security household information data table, to obtain problem data in which beneficiaries who are not in the low-security household information data table received low-security funds, and extracting the data of genuine low-security households receiving low-security funds;
This step mainly ensures that all data used in the subsequent analysis belong to genuine low-security households registered by the relevant departments, and it finds possible cases of non-low-security households receiving low-security funds.
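The comparison in step 2 can be read as a membership test of each beneficiary identity number against the set of registered household head and member identity numbers. The following is a minimal sketch under that reading; the field names `head_id`, `member_ids`, and `beneficiary_id`, the function `collide`, and the sample records are hypothetical and do not come from the patent.

```python
def collide(issuance_rows, registry_rows):
    """Split issuance rows by whether the beneficiary ID is registered.

    Field names are illustrative assumptions, not taken from the patent.
    """
    registered_ids = set()
    for row in registry_rows:
        registered_ids.add(row["head_id"])         # household head ID
        registered_ids.update(row["member_ids"])   # household member IDs

    genuine, problem = [], []
    for row in issuance_rows:
        if row["beneficiary_id"] in registered_ids:
            genuine.append(row)    # genuine low-security household
        else:
            problem.append(row)    # beneficiary absent from the registry
    return genuine, problem


# Made-up sample data for illustration only.
registry = [
    {"head_id": "A001", "member_ids": ["A002", "A003"]},
    {"head_id": "B001", "member_ids": []},
]
issuance = [
    {"beneficiary_id": "A002", "amount": 500.0},
    {"beneficiary_id": "B001", "amount": 450.0},
    {"beneficiary_id": "X999", "amount": 600.0},
]
genuine, problem = collide(issuance, registry)
```

Only the `genuine` rows would be passed on to step 3; the `problem` rows correspond to beneficiaries who are not in the household information table.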
Step 3, in the data of genuine low-security households receiving low-security funds, calculating the total amount and total number of times received by each low-security household in one year, and calculating the data clustering centers of the total amount and of the total number of times separately, to obtain the problem data that departs from the clustering centers, namely abnormal annual amount data or abnormal annual times data.
This step mainly clusters low-security issuance data with the same characteristics accurately and accurately classifies behavior of receiving low-security funds in violation of the rules within a period of time.
In step 1, for the massive low-security household information data and low-security fund issuance data acquired, the preprocessing method oriented to these data screens out irrelevant interference information according to the low-security fund issuance rules and extracts the key fields as follows:
1) uniformly converting the low-security household information data and the one-card subsidy issuance data into CSV format, and performing data persistence processing with an Oracle database;
2) extracting the low-security fund issuance data from the one-card data, and from it extracting the records whose key fields are complete, the key fields including amount received, issuance date, beneficiary identity card number, city, county, district, and village group; and extracting from the low-security household information data the records in which the identity card numbers of the household head and household members are complete;
3) screening the data processed in the two preceding steps with SQL in the Oracle database by issuance year, subsidy project code, city, and county fields, to obtain a data table of the recipients of a given subsidy project in a given county of a given city in a given year, and counting the total amount and total number of times each low-security household received the subsidy within a period of time.
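The completeness filter in item 2) can be illustrated with a short sketch that reads CSV text and keeps only records whose required fields are non-empty. The column names, the choice of required fields, the function `complete_rows`, and the sample rows are assumptions made for illustration; the patent performs this step in an Oracle database rather than in Python.

```python
import csv
import io

# Assumed column names; these five fields are treated as required here.
ISSUANCE_KEYS = ("amount", "issue_date", "beneficiary_id", "city", "county")


def complete_rows(csv_text, required):
    """Return the CSV records in which every required field is non-empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if all((row.get(name) or "").strip() for name in required)
    ]


# Made-up sample: rows 2 and 3 lack a required field, row 4 lacks only
# the optional village field and is therefore kept.
ISSUANCE_CSV = """amount,issue_date,beneficiary_id,city,county,village
500,2018-01-15,A001,CityX,CountyY,V1
,2018-02-15,A001,CityX,CountyY,V1
450,2018-01-15,,CityX,CountyY,V2
450,2018-01-15,B001,CityX,CountyY,
"""
kept = complete_rows(ISSUANCE_CSV, ISSUANCE_KEYS)
```

The same helper could be applied to the household information data with the household head and member identity numbers as the required fields.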
In step 3, the common features of normal data are found through clustering, and problem data that departs from the clustering centers is found using an unsupervised classification method based on cluster analysis.
The method for discovering abnormal annual amount data comprises the following steps:
Step one: Let $L = \{l_1, l_2, \dots, l_i, \dots, l_n\}$ be the low-security fund issuance data set, where $\{1, 2, \dots, i, \dots, n\}$ are the sequence numbers of the data, $n$ is the number of records in $L$, $l_i$ denotes the $i$-th record, and $l_n$ denotes the $n$-th record. Randomly select $K$ values $\{\eta_1, \eta_2, \dots, \eta_i, \dots, \eta_K\}$ from the annual total amounts received as the $K$ initial cluster centers of the annual total amount, where $\{1, \dots, i, \dots, K\}$ are the sequence numbers, $\eta_i$ denotes the $i$-th initial cluster center, and $\eta_K$ denotes the $K$-th initial cluster center. Define the values of the neighborhoods $\{U_1, U_2, \dots, U_i, \dots, U_K\}$ of the initial cluster centers $\{\eta_1, \eta_2, \dots, \eta_i, \dots, \eta_K\}$ as $\{\lambda_1, \lambda_2, \dots, \lambda_i, \dots, \lambda_K\}$: the neighborhood $U_i$ of the initial cluster center $\eta_i$ comprises the subset of samples in $L$ whose distance from $\eta_i$ is not more than $\lambda_i$, i.e. $U_i = \{\, l_i \in L \mid \mathrm{dis}(l_i, \eta_i) \le \lambda_i \,\}$; $U_K$ denotes the neighborhood of the initial cluster center $\eta_K$, $\lambda_i$ denotes the $i$-th value, and $\lambda_K$ denotes the $K$-th value. The centers $\{\eta_1, \eta_2, \dots, \eta_i, \dots, \eta_K\}$ and the corresponding $\{\lambda_1, \lambda_2, \dots, \lambda_i, \dots, \lambda_K\}$ form the clusters $\{Z_1, Z_2, \dots, Z_i, \dots, Z_K\}$, where the $i$-th cluster is $Z_i = \{\eta_i \pm \lambda_i\}$ and the $K$-th cluster is $Z_K = \{\eta_K \pm \lambda_K\}$. For each remaining value of the annual total amount, calculate its distance to each of $\{\eta_1, \eta_2, \dots, \eta_i, \dots, \eta_K\}$ and assign the value to the nearest cluster; then calculate the mean of each cluster, which is its cluster center, and repeat the calculation of the cluster centers until they no longer change;
Step two: Calculate the set $A_{\text{amount}}$ of all data in $L = \{l_1, l_2, \dots, l_i, \dots, l_n\}$ that are not in any of the $K$ clusters, where
[formula image defining $A_{\text{amount}}$: not reproduced in this text]
$\varnothing$ denotes the empty set, and $A_{\text{amount}}$ is the set of data with a problematic annual amount received;
The method for discovering abnormal annual times data comprises the following steps:
Step one: Randomly select $K$ values $\{\eta_1', \eta_2', \dots, \eta_i', \dots, \eta_K'\}$ from the annual total numbers of times received as the $K$ initial cluster centers of the annual total number of times, where $\{1, \dots, i, \dots, K\}$ are the sequence numbers, $\eta_i'$ denotes the $i$-th initial cluster center, and $\eta_K'$ denotes the $K$-th initial cluster center. Define the values of the neighborhoods $\{U_1', U_2', \dots, U_i', \dots, U_K'\}$ of the initial cluster centers $\{\eta_1', \eta_2', \dots, \eta_i', \dots, \eta_K'\}$ as $\{\lambda_1', \lambda_2', \dots, \lambda_i', \dots, \lambda_K'\}$: the neighborhood $U_i'$ of the initial cluster center $\eta_i'$ comprises the subset of samples in $L$ whose distance from $\eta_i'$ is not more than $\lambda_i'$, i.e. $U_i' = \{\, l_i \in L \mid \mathrm{dis}(l_i, \eta_i') \le \lambda_i' \,\}$; $U_K'$ denotes the neighborhood of the initial cluster center $\eta_K'$, $\lambda_i'$ denotes the $i$-th value, and $\lambda_K'$ denotes the $K$-th value. The centers $\{\eta_1', \eta_2', \dots, \eta_i', \dots, \eta_K'\}$ and the corresponding $\{\lambda_1', \lambda_2', \dots, \lambda_i', \dots, \lambda_K'\}$ form the clusters $\{Z_1', Z_2', \dots, Z_i', \dots, Z_K'\}$, where the $i$-th cluster is $Z_i' = \{\eta_i' \pm \lambda_i'\}$ and the $K$-th cluster is $Z_K' = \{\eta_K' \pm \lambda_K'\}$. For each remaining value of the annual total number of times, calculate its distance to each of $\{\eta_1', \eta_2', \dots, \eta_i', \dots, \eta_K'\}$ and assign the value to the nearest cluster; then calculate the mean of each cluster, which is its cluster center, and repeat the calculation of the cluster centers until they no longer change;
Step two: Calculate the set $A_{\text{times}}$ of all data in $L = \{l_1, l_2, \dots, l_i, \dots, l_n\}$ that are not in any of the $K$ clusters, where
[formula image defining $A_{\text{times}}$: not reproduced in this text]
$A_{\text{times}}$ is the set of data with a problematic annual number of times received.
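The distance formula used when calculating the cluster means is as follows:
[formula image defining $\mathrm{dis}(l_i, l_k)$: not reproduced in this text]
where $\mathrm{dis}(l_i, l_k)$ is the distance between the $i$-th record $l_i$ and the $k$-th record $l_k$, and $x_i$ and $x_k$ are the numerical values of $l_i$ and $l_k$ respectively.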
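The two steps described above, iterating the cluster centers and then collecting the values that lie outside every neighborhood, can be sketched for one-dimensional totals as follows. This is an illustrative reading, not the patented implementation: the distance is assumed to be the absolute difference of the two values because the original formula image is not available, the initial centers are passed in explicitly rather than chosen at random so that the run is repeatable, and the function names and sample totals are hypothetical.

```python
def dis(x_i, x_k):
    """Distance between two scalar values (assumed: absolute difference)."""
    return abs(x_i - x_k)


def cluster_centers(values, centers, max_iter=100):
    """Assign each value to its nearest center, recompute the cluster means,
    and repeat until the centers no longer change."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for x in values:
            nearest = min(range(len(centers)), key=lambda i: dis(x, centers[i]))
            groups[nearest].append(x)
        # An empty cluster keeps its previous center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


def outside_all_clusters(values, centers, radii):
    """Values farther than the neighborhood value from every cluster center."""
    return [x for x in values
            if all(dis(x, c) > r for c, r in zip(centers, radii))]


# Made-up annual totals; 8800 sits between the two groups.
totals = [5900, 6000, 6100, 8800, 11900, 12000, 12100]
centers = cluster_centers(totals, [6000, 12000])
problem = outside_all_clusters(totals, centers, [1000, 500])
```

In this toy run the stray value pulls the first center upward to 6700, yet it still falls outside both neighborhoods and is the only value reported. The neighborhood values are chosen by the analyst and directly control how many records are flagged.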
The invention also provides a low-security fund supervision system based on supervised data association analysis, comprising a data preprocessing module, an associated-data collision and decision analysis module, and an unsupervised classification module, which execute step 1, step 2, and step 3 respectively.
Compared with the prior art, the invention has the following advantages. Massive low-security fund issuance data may contain cases of non-low-security households receiving low-security funds, and issuance standards differ between cities and between years when grassroots departments record the data. To address this, the invention uses a preprocessing method oriented to the low-security household information data and the low-security fund issuance data to remove interference information from the issuance data and extract its key fields; uses data collision and decision analysis between the associated tables to obtain the problem data in which beneficiaries who are not in the low-security household information data table received low-security funds, and to extract the data of genuine low-security households receiving low-security funds; and uses a cluster-analysis-based method for analyzing problem data within a period of time to extract, from the low-security issuance data, the data suspected of receiving low-security funds in violation of the rules during one period. The invention thereby addresses problems in existing low-security fund issuance data such as non-uniform subsidy issuance standards, irregular issuance times, and temporary subsidies all being filed as low-security allowance, and improves the efficiency with which supervision departments find problems in the issuance of civil-affairs funds.
Drawings
FIG. 1 is a schematic view of the main process of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 presents a schematic flow chart diagram in accordance with an embodiment of the present invention. In summary, the method comprises:
Step 1: Acquire the low-security household information data and the one-card subsidy issuance data, uniformly convert them into CSV format, and perform data persistence processing with an Oracle database. Using SQL in the Oracle database, extract the low-security fund issuance data from the one-card data; then, using a preprocessing method oriented to the low-security household information data and the low-security fund issuance data, remove interference information from the low-security fund issuance data, which contains many kinds (300 kinds) of subsidy items, and extract the key fields. Specifically, records missing the amount received, issuance date, beneficiary number, city, or county field are removed, and the records with complete key fields (amount received, issuance date, beneficiary number (namely, the beneficiary identity number), city, county, district, village group, and so on) are extracted from the low-security fund issuance data. In the low-security household information data, records missing the identity number of the household head or of a household member are first filtered out, and the records with complete identity numbers are extracted. This yields the low-security household information data table and the low-security fund issuance data table. SQL is then used in the Oracle database to screen the processed data by issuance year, subsidy project code, city, county, and other fields, giving a data table of the recipients of a given subsidy project in a given county of a given city in a given year. Finally, the data obtained through these steps is processed with SQL in the Oracle database to count the total amount and total number of times a given low-security household received each kind of subsidy within a period of time.
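The screening and counting at the end of step 1 amounts to a filtered `GROUP BY` query. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the Oracle database named in the patent, so date handling and SQL dialect details would differ in Oracle; the table name `issuance`, its columns, and the sample rows are hypothetical, and grouping by beneficiary rather than by household is a simplification.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Oracle.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE issuance (
    beneficiary_id TEXT, city TEXT, county TEXT,
    project_code TEXT, amount REAL, issue_date TEXT)""")
conn.executemany("INSERT INTO issuance VALUES (?, ?, ?, ?, ?, ?)", [
    ("A001", "CityX", "CountyY", "P01", 500.0, "2018-01-15"),
    ("A001", "CityX", "CountyY", "P01", 500.0, "2018-02-15"),
    ("B001", "CityX", "CountyY", "P01", 450.0, "2018-01-15"),
    ("B001", "CityX", "CountyZ", "P01", 450.0, "2018-02-15"),  # other county
    ("A001", "CityX", "CountyY", "P02", 300.0, "2018-03-15"),  # other project
])

# Screen by year, subsidy project code, city, and county, then count the
# total amount and total number of receipts per beneficiary.
QUERY = """
SELECT beneficiary_id, SUM(amount) AS total_amount, COUNT(*) AS total_times
FROM issuance
WHERE substr(issue_date, 1, 4) = ?
  AND project_code = ? AND city = ? AND county = ?
GROUP BY beneficiary_id
ORDER BY beneficiary_id
"""
result = conn.execute(QUERY, ("2018", "P01", "CityX", "CountyY")).fetchall()
```

Each resulting row carries the two quantities that step 3 clusters separately: the annual total amount and the annual total number of receipts.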
Generally, the low-security household information data can be provided by the civil affairs department, and the one-card subsidy issuance data, which contains the low-security fund issuance data, can be provided by the finance department. The formats of the low-security fund issuance data and the low-security household information data differ greatly, which makes direct association analysis difficult, and each data set has its own specific characteristics. The invention therefore uniformly converts the low-security fund data from each department into CSV format and uses an Oracle database for data persistence processing.
Step 2: According to the minimum living allowance fund issuance rules, the beneficiary in the low-security fund issuance data is a member of a low-security household. To detect possible cases of non-low-security households receiving low-security funds, the beneficiary identity numbers in the low-security fund issuance data table are compared, through data collision and decision analysis between the associated tables, with the household head identity numbers and household member identity numbers in the low-security household information data table. This yields the problem data in which beneficiaries who are not in the low-security household information data table received low-security funds, and the data of genuine low-security households receiving low-security funds is extracted.
Step 3: When grassroots departments record low-security fund issuance data, two special situations arise. First, some issued subsidy items have no suitable filing category in the database, so grassroots departments file all of them as minimum living allowance. Second, temporary subsidies that local governments and civil affairs departments issue at irregular times, such as price-increase subsidies and winter heating subsidies, are also all filed as minimum living allowance when recorded in the database. As a result, the low-security fund issuance standards cannot be used to judge whether a record is problematic. In view of these characteristics of low-security fund issuance, in the data of genuine low-security households receiving low-security funds, the total amount and total number of times received by each low-security household in one year are calculated, and the data clustering centers of the total amount and of the total number of times are calculated separately, to obtain the problem data that departs from the clustering centers, namely abnormal annual amount data or abnormal annual times data.
Specifically, common features of normal data are found through clustering, and problem data which are separated from a clustering center are found by using an unsupervised classification method based on clustering analysis.
The method for discovering abnormal annual amount data comprises the following steps:
Step one: Let $L = \{l_1, l_2, \dots, l_i, \dots, l_n\}$ be the low-security fund issuance data set, where $\{1, 2, \dots, i, \dots, n\}$ are the sequence numbers of the data, $n$ is the number of records in $L$, $l_i$ denotes the $i$-th record, and $l_n$ denotes the $n$-th record. Randomly select $K$ values $\{\eta_1, \eta_2, \dots, \eta_i, \dots, \eta_K\}$ from the annual total amounts received as the $K$ initial cluster centers of the annual total amount, where $\{1, \dots, i, \dots, K\}$ are the sequence numbers, $\eta_i$ denotes the $i$-th initial cluster center, and $\eta_K$ denotes the $K$-th initial cluster center. Define the values of the neighborhoods $\{U_1, U_2, \dots, U_i, \dots, U_K\}$ of the initial cluster centers $\{\eta_1, \eta_2, \dots, \eta_i, \dots, \eta_K\}$ as $\{\lambda_1, \lambda_2, \dots, \lambda_i, \dots, \lambda_K\}$: the neighborhood $U_i$ of the initial cluster center $\eta_i$ comprises the subset of samples in $L$ whose distance from $\eta_i$ is not more than $\lambda_i$, i.e. $U_i = \{\, l_i \in L \mid \mathrm{dis}(l_i, \eta_i) \le \lambda_i \,\}$; $U_K$ denotes the neighborhood of the initial cluster center $\eta_K$, $\lambda_i$ denotes the $i$-th value, and $\lambda_K$ denotes the $K$-th value. The centers $\{\eta_1, \eta_2, \dots, \eta_i, \dots, \eta_K\}$ and the corresponding $\{\lambda_1, \lambda_2, \dots, \lambda_i, \dots, \lambda_K\}$ form the clusters $\{Z_1, Z_2, \dots, Z_i, \dots, Z_K\}$, where the $i$-th cluster is $Z_i = \{\eta_i \pm \lambda_i\}$ and the $K$-th cluster is $Z_K = \{\eta_K \pm \lambda_K\}$. For each remaining value of the annual total amount, calculate its distance to each of $\{\eta_1, \eta_2, \dots, \eta_i, \dots, \eta_K\}$ and assign the value to the nearest cluster; then calculate the mean of each cluster, which is its cluster center, and repeat the calculation of the cluster centers until they no longer change;
Step two: Calculate the set $A_{\text{amount}}$ of all data in $L = \{l_1, l_2, \dots, l_i, \dots, l_n\}$ that are not in any of the $K$ clusters, where
[formula image defining $A_{\text{amount}}$: not reproduced in this text]
$\varnothing$ denotes the empty set, and $A_{\text{amount}}$ is the set of data with a problematic annual amount received;
The method for discovering abnormal annual times data comprises the following steps:
Step one: Randomly select $K$ values $\{\eta_1', \eta_2', \dots, \eta_i', \dots, \eta_K'\}$ from the annual total numbers of times received as the $K$ initial cluster centers of the annual total number of times, where $\{1, \dots, i, \dots, K\}$ are the sequence numbers, $\eta_i'$ denotes the $i$-th initial cluster center, and $\eta_K'$ denotes the $K$-th initial cluster center. Define the values of the neighborhoods $\{U_1', U_2', \dots, U_i', \dots, U_K'\}$ of the initial cluster centers $\{\eta_1', \eta_2', \dots, \eta_i', \dots, \eta_K'\}$ as $\{\lambda_1', \lambda_2', \dots, \lambda_i', \dots, \lambda_K'\}$: the neighborhood $U_i'$ of the initial cluster center $\eta_i'$ comprises the subset of samples in $L$ whose distance from $\eta_i'$ is not more than $\lambda_i'$, i.e. $U_i' = \{\, l_i \in L \mid \mathrm{dis}(l_i, \eta_i') \le \lambda_i' \,\}$; $U_K'$ denotes the neighborhood of the initial cluster center $\eta_K'$, $\lambda_i'$ denotes the $i$-th value, and $\lambda_K'$ denotes the $K$-th value. The centers $\{\eta_1', \eta_2', \dots, \eta_i', \dots, \eta_K'\}$ and the corresponding $\{\lambda_1', \lambda_2', \dots, \lambda_i', \dots, \lambda_K'\}$ form the clusters $\{Z_1', Z_2', \dots, Z_i', \dots, Z_K'\}$, where the $i$-th cluster is $Z_i' = \{\eta_i' \pm \lambda_i'\}$ and the $K$-th cluster is $Z_K' = \{\eta_K' \pm \lambda_K'\}$. For each remaining value of the annual total number of times, calculate its distance to each of $\{\eta_1', \eta_2', \dots, \eta_i', \dots, \eta_K'\}$ and assign the value to the nearest cluster; then calculate the mean of each cluster, which is its cluster center, and repeat the calculation of the cluster centers until they no longer change;
Step two: Calculate the set $A_{\text{times}}$ of all data in $L = \{l_1, l_2, \dots, l_i, \dots, l_n\}$ that are not in any of the $K$ clusters, where
[formula image defining $A_{\text{times}}$: not reproduced in this text]
$A_{\text{times}}$ is the set of data with a problematic annual number of times received.
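In the invention, the distance formula used when calculating the cluster means is as follows:
[formula image defining $\mathrm{dis}(l_i, l_k)$: not reproduced in this text]
where $\mathrm{dis}(l_i, l_k)$ is the distance between the $i$-th record $l_i$ and the $k$-th record $l_k$, and $x_i$ and $x_k$ are the numerical values of $l_i$ and $l_k$ respectively.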
The invention also provides a low-security fund supervision system based on supervised data association analysis, comprising a data preprocessing module, an associated-data collision and decision analysis module, and an unsupervised classification module, which execute step 1, step 2, and step 3 respectively.
The overall implementation of the above method is illustrated by a specific example.
This embodiment is built on a cloud computing platform consisting of 15 servers, with VMware ESXi 5, 20 TB disk arrays, and 1000 Mbit/s network switches, on which a Hadoop cluster is deployed. The embodiment takes the 2018 urban minimum living allowance fund issuance data as the object of analysis, giving the 2018 low-security fund issuance data table. For convenience of presentation, the step of extracting the low-security fund issuance data from the one-card data and the step of screening by city and county are omitted. A small sample of the 2018 low-security fund issuance data is shown in the table below, after deleting records missing fields such as issuance amount, issuance date, city, county, and subsidy project code. The low-security fund issuance data includes 12 fields, respectively: {city, county, town street, village, group, subsidy object code, subsidy object, beneficiary code, subsidy project name, issuance amount, issuance date}.
[Table images: sample of the 2018 low-security fund issuance data, not reproduced in this text]
Step one: Perform association analysis and data collision between the 2018 low-security fund issuance data table and the low-security household information table, comparing the beneficiary codes in the 2018 low-security fund issuance data table with the household member identity card numbers in the low-security household information table, to obtain the problem data in which beneficiaries who are not in the low-security household information data table received low-security funds; and extract the data of genuine low-security households receiving low-security funds from the 2018 low-security fund issuance data table.
Step two: Calculate the total number of times and the total amount of low-security funds received by every low-security household in 2018. Cluster the annual total amounts: set the number of cluster centers to $K = 3$ and the initial cluster centers to $\eta_1 = 6000$, $\eta_2 = 12000$, $\eta_3 = 17000$; calculate the distance of every record from the three initial cluster centers and assign the records by distance to the three initial clusters $Z_1$, $Z_2$, $Z_3$; then calculate the mean within each cluster, which gives the final cluster centers $\eta''_1 = 6300$, $\eta''_2 = 12225$, $\eta''_3 = 16751$. Set the values $\lambda_1$, $\lambda_2$, $\lambda_3$ of the neighborhoods $U_1$, $U_2$, $U_3$ of $\eta''_1$, $\eta''_2$, $\eta''_3$ to 2000, 1500, and 1000 respectively. This gives three clusters, $Z''_1 = 6300 \pm 2000$, $Z''_2 = 12225 \pm 1500$, and $Z''_3 = 16751 \pm 1000$. The data not in any of the three clusters is then calculated, which is the suspected problem data for low-security funds received in 2018. Problem data in the annual total number of times can be calculated in a similar way and is not described again here. In this embodiment, 762708 records from 2018 were preprocessed; 22 households were found that may be non-low-security households receiving low-security funds; and in this county, 411 suspected problem records were found in the total amount of low-security funds received in 2018, 2501 suspected problem records in the total annual number of times, and 384 records that are problematic in both the annual total amount and the annual total number of times.
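The cluster bounds in this embodiment follow directly from the stated centers and neighborhood values. The sketch below works out the three intervals and tests a few annual totals against them; the centers and neighborhood values are those given in the embodiment, while the function names and the sample totals are made up for illustration.

```python
CENTERS = (6300, 12225, 16751)  # final cluster centers from the embodiment
RADII = (2000, 1500, 1000)      # neighborhood values for the three clusters


def cluster_intervals(centers, radii):
    """Turn each 'center ± value' cluster into a (low, high) interval."""
    return [(c - r, c + r) for c, r in zip(centers, radii)]


def is_suspect(total, intervals):
    """True when an annual total lies outside every cluster interval."""
    return not any(low <= total <= high for low, high in intervals)


intervals = cluster_intervals(CENTERS, RADII)

# Hypothetical annual totals, not taken from the patent's data.
samples = [6000, 9500, 12225, 14500, 18000]
suspects = [t for t in samples if is_suspect(t, intervals)]
```

The intervals are 4300 to 8300, 10725 to 13725, and 15751 to 17751, so annual totals in the gaps between them, or beyond either end, are the ones reported as suspected problem data.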
Through the implementation steps of the invention, it can be seen that 762708 pieces of civil fund issuing data in the one-card data for issuing subsidies in the embodiment are subjected to screening out useless information, extracting key fields, performing association table collision analysis and clustering analysis methods, so that the data dimension is reduced, and the problems that the subsidies issuing in the existing low-security fund issuing data is not unified in standard, the issuing time is not fixed, all temporary subsidies are filed as the lowest life guarantee and the like are solved; and finds that there may be non-low-security households claiming 22 households with low-security impressions; and finding that 411 pieces of suspected problem data of the total amount of the low-security fund in 2018 of the county, 2501 pieces of suspected problem data of the total annual times of the low-security fund, 384 pieces of problem data of the total amount of the low-security fund taken annually and the total annual times of the low-security fund taken annually exist. The minimum life insurance fund supervision method based on the supervision data association analysis comprehensively considers the characteristics of low insurance fund issuing rules and low insurance fund data. In the practical example, the method and the system process mass low-security fund release data, and the efficiency of finding suspected problems is improved by nearly 1000 times compared with the efficiency of a traditional manual method.
In conclusion, on the basis of preprocessing massive low-security data and of associated data collision and decision analysis, the method analyzes the common characteristics of the data and finds, in one stage, suspected violations in the receipt of low-security funds, thereby realizing efficient minimum living guarantee fund supervision based on supervised data association analysis.
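The associated data collision mentioned above amounts to checking each beneficiary identity number against the identity numbers registered in the household information table. A minimal Python sketch under stated assumptions follows; the function name `split_by_registration` and the field names `head_id`, `member_ids` and `beneficiary_id` are hypothetical and are not taken from the patent.

```python
def split_by_registration(payments, households):
    """Compare each beneficiary identity number with the household head and
    member identity numbers; return (genuine, suspected) payment lists."""
    registered = set()
    for h in households:
        registered.add(h["head_id"])
        registered.update(h["member_ids"])

    genuine, suspected = [], []
    for p in payments:
        if p["beneficiary_id"] in registered:
            genuine.append(p)      # beneficiary is a registered household head or member
        else:
            suspected.append(p)    # beneficiary not found in the household table
    return genuine, suspected


if __name__ == "__main__":
    # Invented data, for demonstration only.
    households = [{"head_id": "H1", "member_ids": ["M1", "M2"]}]
    payments = [
        {"beneficiary_id": "H1", "amount": 500},
        {"beneficiary_id": "M2", "amount": 500},
        {"beneficiary_id": "Z9", "amount": 500},
    ]
    genuine, suspected = split_by_registration(payments, households)
    print(len(genuine), len(suspected))  # 2 1
```

Only the payments returned in the first list would then be passed on to the clustering step, since the second list is already flagged as suspected problem data.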
Although the present invention has been described by way of preferred embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims (5)

1. A low-security fund supervision method based on supervision data association analysis, characterized by comprising the following steps:
step 1, acquiring low-security user information data and one-card subsidy issuing data, importing both into an Oracle database for data persistence processing, extracting low-security fund issuing data from the one-card data, using a preprocessing method oriented to the low-security user information data and the low-security fund issuing data to remove interference information from the low-security fund issuing data, which contains various subsidy items, and extracting the key fields in the low-security fund issuing data; first filtering out the records in the low-security user information data in which the household head identity number or the household member identity number is missing, then obtaining a low-security user information data table and a low-security fund issuing data table, and counting the total amount and the total number of times each subsidy is received by each low-security user in a period of time;
step 2, for the phenomenon that non-low-security users receive low-security funds, comparing, through data collision and decision analysis among the association tables, the beneficiary identity number in the low-security fund issuing data table with the household head identity number and the household member identity number in the low-security user information data table, to obtain the problem data in which a beneficiary who received low-security funds does not appear in the low-security user information data table, and extracting the data in which real low-security users received low-security funds;
step 3, in the data in which real low-security users received low-security funds, calculating the total amount received and the total number of receipts of each low-security user in one year, and calculating the data clustering centers of the total amount and of the total number of receipts respectively, to obtain the problem data that deviates from the clustering centers, namely the abnormal annual amount data or the abnormal annual receipt-count data;
the method for discovering the abnormal data of the annual amount received comprises the following steps:
step one: there is a low-security fund release data set $L = \{l_1, l_2, \ldots, l_i, \ldots, l_n\}$, where $\{1, 2, \ldots, i, \ldots, n\}$ are the sequence numbers of the data, $n$ is the number of data in the data set $L$, $l_i$ denotes the $i$-th datum and $l_n$ denotes the $n$-th datum; randomly select $K$ values $\{\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_K\}$ from the annual total amounts received, respectively denoting the $K$ initial clustering centers of the annual total amount received, where $\{1, \ldots, i, \ldots, K\}$ are the sequence numbers, $\eta_i$ denotes the $i$-th initial clustering center and $\eta_K$ denotes the $K$-th initial clustering center; define the values of the neighborhoods $\{U_1, U_2, \ldots, U_i, \ldots, U_K\}$ of the initial clustering centers $\{\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_K\}$ as $\{\lambda_1, \lambda_2, \ldots, \lambda_i, \ldots, \lambda_K\}$; the neighborhood $U_i$ of the initial clustering center $\eta_i$ comprises the subset of the sample set $L$ whose distance from $\eta_i$ is not more than $\lambda_i$, i.e. $U_i = \{l_i \in L \mid dis(l_i, \eta_i) \le \lambda_i\}$; $U_K$ denotes the neighborhood of the initial clustering center $\eta_K$, $\lambda_i$ denotes the $i$-th value and $\lambda_K$ denotes the $K$-th value; $\{\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_K\}$ and the corresponding $\{\lambda_1, \lambda_2, \ldots, \lambda_i, \ldots, \lambda_K\}$ form the class clusters $\{Z_1, Z_2, \ldots, Z_i, \ldots, Z_K\}$, where the $i$-th class cluster is $Z_i = \{\eta_i \pm \lambda_i\}$ and the $K$-th class cluster is $Z_K = \{\eta_K \pm \lambda_K\}$; for each remaining value in the annual total amounts received, calculate the distance between the value and each of $\{\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_K\}$ and assign the value to the nearest cluster; then calculate the mean of each cluster, namely the clustering center, and repeat the calculation of the clustering centers of the clusters until the clustering centers no longer change;
step two: calculate the set $A_{\text{amount}}$ of all data in $L = \{l_1, l_2, \ldots, l_i, \ldots, l_n\}$ that are not in the $K$ clusters, wherein
Figure FDA0003220050810000021
and the symbol
Figure FDA0003220050810000022
denotes the empty set; $A_{\text{amount}}$ is the set of problem data for the annual amount received;
the method for discovering the abnormal data of the annual number of receipts comprises the following steps:
step one: randomly select $K$ values $\{\eta'_1, \eta'_2, \ldots, \eta'_i, \ldots, \eta'_K\}$ from the annual total numbers of receipts, respectively denoting the $K$ initial clustering centers of the annual total number of receipts, where $\{1, \ldots, i, \ldots, K\}$ are the sequence numbers, $\eta'_i$ denotes the $i$-th initial clustering center and $\eta'_K$ denotes the $K$-th initial clustering center; define the values of the neighborhoods $\{U'_1, U'_2, \ldots, U'_i, \ldots, U'_K\}$ of the initial clustering centers $\{\eta'_1, \eta'_2, \ldots, \eta'_i, \ldots, \eta'_K\}$ as $\{\lambda'_1, \lambda'_2, \ldots, \lambda'_i, \ldots, \lambda'_K\}$; the neighborhood $U'_i$ of the initial clustering center $\eta'_i$ comprises the subset of the sample set $L$ whose distance from $\eta'_i$ is not more than $\lambda'_i$, i.e. $U'_i = \{l_i \in L \mid dis(l_i, \eta'_i) \le \lambda'_i\}$; $U'_K$ denotes the neighborhood of the initial clustering center $\eta'_K$, $\lambda'_i$ denotes the $i$-th value and $\lambda'_K$ denotes the $K$-th value; $\{\eta'_1, \eta'_2, \ldots, \eta'_i, \ldots, \eta'_K\}$ and the corresponding $\{\lambda'_1, \lambda'_2, \ldots, \lambda'_i, \ldots, \lambda'_K\}$ form the class clusters $\{Z'_1, Z'_2, \ldots, Z'_i, \ldots, Z'_K\}$, where the $i$-th class cluster is $Z'_i = \{\eta'_i \pm \lambda'_i\}$ and the $K$-th class cluster is $Z'_K = \{\eta'_K \pm \lambda'_K\}$; for each remaining value in the annual total numbers of receipts, calculate the distance between the value and each of $\{\eta'_1, \eta'_2, \ldots, \eta'_i, \ldots, \eta'_K\}$ and assign the value to the nearest cluster; then calculate the mean of each cluster, namely the clustering center, and repeat the calculation of the clustering centers of the clusters until the clustering centers no longer change;
step two: calculate the set $A_{\text{times}}$ of all data in $L = \{l_1, l_2, \ldots, l_i, \ldots, l_n\}$ that are not in the $K$ clusters, wherein
Figure FDA0003220050810000023
$A_{\text{times}}$ is the set of problem data for the annual number of receipts.
2. The low-security fund supervision method based on supervised data correlation analysis as recited in claim 1, wherein in step 1, for the obtained low-security user information data and low-security fund release data, a preprocessing method oriented to the low-security user information data and the low-security fund release data is used, according to the low-security fund release rule, to screen out irrelevant interference information, and the method for extracting the key fields comprises the following steps:
step 1), uniformly converting the low-security information data and the one-card subsidy issuing data into CSV format, and performing data persistence processing with an Oracle database;
step 2), extracting low-security fund issuing data from the one-card data, and extracting from the low-security fund issuing data the records in which the key fields are complete, the key fields including the amount received, the issuing date, the beneficiary identity card number, the city, the county, the district and the village group; extracting from the low-security information data the records in which the identity card numbers of the household head and the household members are complete;
step 3), screening the data processed in step 1) and step 2) in the Oracle database with SQL according to the release year, the subsidy item code, the city field and the county field, to obtain a data table of the receipt of a given subsidy item in a given city and county in a given year, and counting the total amount and the total number of times each low-security user receives the subsidy in a period of time.
3. The low-security fund supervision method based on supervised data association analysis as recited in claim 1, wherein in step 3, the common features of normal data are found through clustering, and the problem data that deviates from the clustering centers is then found by an unsupervised classification method based on clustering analysis.
4. The low-security fund supervision method based on supervised data association analysis as recited in claim 1, wherein the distance formula used in calculating the mean of the clusters is:
Figure FDA0003220050810000031
where $dis(l_i, l_k)$ is the distance between the $i$-th datum $l_i$ and the $k$-th datum $l_k$, and $x_i$ and $x_k$ are the numerical values of $l_i$ and $l_k$, respectively.
5. A low-security fund supervision system based on supervision data association analysis, characterized by comprising:
a data preprocessing module, used for importing the acquired low-security user information data and one-card subsidy issuing data into an Oracle database for data persistence processing, extracting low-security fund issuing data from the one-card data, using a preprocessing method oriented to the low-security user information data and the low-security fund issuing data to remove interference information from the low-security fund issuing data, which contains various subsidy items, and extracting the key fields in the low-security fund issuing data; first filtering out the records in the low-security user information data in which the household head identity number or the household member identity number is missing, then obtaining a low-security user information data table and a low-security fund issuing data table, and counting the total amount and the total number of times each subsidy is received by each low-security user in a period of time;
an associated data collision and decision analysis module, used for, with respect to the phenomenon that non-low-security users receive low-security funds, comparing, through data collision and decision analysis among the association tables, the beneficiary identity number in the low-security fund issuing data table with the household head identity number and the household member identity number in the low-security user information data table, to obtain the problem data in which a beneficiary who received low-security funds does not appear in the low-security user information data table, and extracting the data in which real low-security users received low-security funds;
an unsupervised classification module, used for calculating, in the data in which real low-security users received low-security funds, the total amount received and the total number of receipts of each low-security user in one year, and calculating the data clustering centers of the total amount and of the total number of receipts respectively, to obtain the problem data that deviates from the clustering centers, namely the abnormal annual amount data or the abnormal annual receipt-count data;
the method for discovering the abnormal data of the annual amount received comprises the following steps:
step one: there is a low-security fund release data set $L = \{l_1, l_2, \ldots, l_i, \ldots, l_n\}$, where $\{1, 2, \ldots, i, \ldots, n\}$ are the sequence numbers of the data, $n$ is the number of data in the data set $L$, $l_i$ denotes the $i$-th datum and $l_n$ denotes the $n$-th datum; randomly select $K$ values $\{\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_K\}$ from the annual total amounts received, respectively denoting the $K$ initial clustering centers of the annual total amount received, where $\{1, \ldots, i, \ldots, K\}$ are the sequence numbers, $\eta_i$ denotes the $i$-th initial clustering center and $\eta_K$ denotes the $K$-th initial clustering center; define the values of the neighborhoods $\{U_1, U_2, \ldots, U_i, \ldots, U_K\}$ of the initial clustering centers $\{\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_K\}$ as $\{\lambda_1, \lambda_2, \ldots, \lambda_i, \ldots, \lambda_K\}$; the neighborhood $U_i$ of the initial clustering center $\eta_i$ comprises the subset of the sample set $L$ whose distance from $\eta_i$ is not more than $\lambda_i$, i.e. $U_i = \{l_i \in L \mid dis(l_i, \eta_i) \le \lambda_i\}$; $U_K$ denotes the neighborhood of the initial clustering center $\eta_K$, $\lambda_i$ denotes the $i$-th value and $\lambda_K$ denotes the $K$-th value; $\{\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_K\}$ and the corresponding $\{\lambda_1, \lambda_2, \ldots, \lambda_i, \ldots, \lambda_K\}$ form the class clusters $\{Z_1, Z_2, \ldots, Z_i, \ldots, Z_K\}$, where the $i$-th class cluster is $Z_i = \{\eta_i \pm \lambda_i\}$ and the $K$-th class cluster is $Z_K = \{\eta_K \pm \lambda_K\}$; for each remaining value in the annual total amounts received, calculate the distance between the value and each of $\{\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_K\}$ and assign the value to the nearest cluster; then calculate the mean of each cluster, namely the clustering center, and repeat the calculation of the clustering centers of the clusters until the clustering centers no longer change;
step two: calculate the set $A_{\text{amount}}$ of all data in $L = \{l_1, l_2, \ldots, l_i, \ldots, l_n\}$ that are not in the $K$ clusters, wherein
Figure FDA0003220050810000051
and the symbol
Figure FDA0003220050810000052
denotes the empty set; $A_{\text{amount}}$ is the set of problem data for the annual amount received;
the method for discovering the abnormal data of the annual number of receipts comprises the following steps:
step one: randomly select $K$ values $\{\eta'_1, \eta'_2, \ldots, \eta'_i, \ldots, \eta'_K\}$ from the annual total numbers of receipts, respectively denoting the $K$ initial clustering centers of the annual total number of receipts, where $\{1, \ldots, i, \ldots, K\}$ are the sequence numbers, $\eta'_i$ denotes the $i$-th initial clustering center and $\eta'_K$ denotes the $K$-th initial clustering center; define the values of the neighborhoods $\{U'_1, U'_2, \ldots, U'_i, \ldots, U'_K\}$ of the initial clustering centers $\{\eta'_1, \eta'_2, \ldots, \eta'_i, \ldots, \eta'_K\}$ as $\{\lambda'_1, \lambda'_2, \ldots, \lambda'_i, \ldots, \lambda'_K\}$; the neighborhood $U'_i$ of the initial clustering center $\eta'_i$ comprises the subset of the sample set $L$ whose distance from $\eta'_i$ is not more than $\lambda'_i$, i.e. $U'_i = \{l_i \in L \mid dis(l_i, \eta'_i) \le \lambda'_i\}$; $U'_K$ denotes the neighborhood of the initial clustering center $\eta'_K$, $\lambda'_i$ denotes the $i$-th value and $\lambda'_K$ denotes the $K$-th value; $\{\eta'_1, \eta'_2, \ldots, \eta'_i, \ldots, \eta'_K\}$ and the corresponding $\{\lambda'_1, \lambda'_2, \ldots, \lambda'_i, \ldots, \lambda'_K\}$ form the class clusters $\{Z'_1, Z'_2, \ldots, Z'_i, \ldots, Z'_K\}$, where the $i$-th class cluster is $Z'_i = \{\eta'_i \pm \lambda'_i\}$ and the $K$-th class cluster is $Z'_K = \{\eta'_K \pm \lambda'_K\}$; for each remaining value in the annual total numbers of receipts, calculate the distance between the value and each of $\{\eta'_1, \eta'_2, \ldots, \eta'_i, \ldots, \eta'_K\}$ and assign the value to the nearest cluster; then calculate the mean of each cluster, namely the clustering center, and repeat the calculation of the clustering centers of the clusters until the clustering centers no longer change;
step two: calculate the set $A_{\text{times}}$ of all data in $L = \{l_1, l_2, \ldots, l_i, \ldots, l_n\}$ that are not in the $K$ clusters, wherein
Figure FDA0003220050810000053
$A_{\text{times}}$ is the set of problem data for the annual number of receipts.
CN202010275707.9A 2020-04-09 2020-04-09 Low-security fund supervision method and system based on supervised data correlation analysis Active CN111460052B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010275707.9A CN111460052B (en) 2020-04-09 2020-04-09 Low-security fund supervision method and system based on supervised data correlation analysis

Publications (2)

Publication Number Publication Date
CN111460052A CN111460052A (en) 2020-07-28
CN111460052B true CN111460052B (en) 2021-10-01

Family

ID=71682370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010275707.9A Active CN111460052B (en) 2020-04-09 2020-04-09 Low-security fund supervision method and system based on supervised data correlation analysis

Country Status (1)

Country Link
CN (1) CN111460052B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112785246A (en) * 2020-12-30 2021-05-11 杭州天阙科技有限公司 Low-income crowd auditing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599144A (en) * 2009-03-19 2009-12-09 杭州思锐网络有限公司 Network type civil affairs multi-department information integrated assistance platform
CN107346516A (en) * 2017-07-17 2017-11-14 山东浪潮云服务信息科技有限公司 A kind of insurance fraud identification data analysis system and method
CN109614496A (en) * 2018-09-27 2019-04-12 长威信息科技发展股份有限公司 A kind of minimum living discrimination method of knowledge based map
CN109711200A (en) * 2018-12-29 2019-05-03 百度在线网络技术(北京)有限公司 Accurate poverty alleviation method, apparatus, equipment and medium based on block chain
CN110019467A (en) * 2017-12-01 2019-07-16 广州明领基因科技有限公司 For the big data integration system of social security information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170032464A1 (en) * 2015-08-02 2017-02-02 Jeffrey A. Killian Automated social security disability insurance eligibility process for people with deafness and/or blindness

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
民政低保信息系统总体框架研究及数据交换技术实现;宋晓虹;《中国优秀硕士学位论文全文数据库信息科技辑》;20110315;I138-828 *

Also Published As

Publication number Publication date
CN111460052A (en) 2020-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Yun Jing, Zhao Yumeng, Wang Yongsheng, Liu Limin, Xu Zhiwei, Zhang Ziting, Zhai Na
Inventor before: Yun Jing, Zhao Yumeng, Liu Limin, Xu Zhiwei, Zhang Ziting, Zhai Na
GR01 Patent grant