CN114066606A

CN114066606A - System and method for identifying false data based on converting text into GPS distance

Info

Publication number: CN114066606A
Application number: CN202111359532.0A
Authority: CN
Inventors: 王萍; 张卓; 贾坤
Original assignee: Sichuan XW Bank Co Ltd
Current assignee: Sichuan XW Bank Co Ltd
Priority date: 2021-11-17
Filing date: 2021-11-17
Publication date: 2022-02-18

Abstract

The invention discloses a system and a method for identifying false data based on converting text into GPS distance, belonging to the technical field of big data. It addresses the problem that, in the prior art, financial institutions find it difficult to verify the authenticity of data submitted by clients and to quantify the likelihood that the data is false. The invention performs homogeneous clustering on all address-like information of a client, converts the text address information into GPS longitude and latitude, calculates the distance between every two sets of coordinates to construct a GPS distance feature set, and builds false-data models such as a false-company-name model and a false-home-address model, so as to quantify the risk that the data is false and to identify and intercept such clients before a loan is granted.

Description

System and method for identifying false data based on converting text into GPS distance

Technical Field

The invention belongs to the technical field of big data, and particularly relates to a system and a method for identifying false data based on converting text into GPS distance.

Background

With the emergence of internet finance, its openness and convenience allow those who need funds and those who supply them to screen information, match and transact more quickly through online platforms, achieving transparent and accurate matching. Funds flow to the people who need them, which benefits the real economy and steers capital away from speculation. At the same time, however, this contactless trust model poses a serious challenge to the anti-fraud risk control of financial institutions. Many black-market operators target internet credit products and apply for loans with packaged or false application data in order to defraud financial institutions of funds. How a financial institution can verify the authenticity of the data submitted by a client is therefore an important research topic for every such institution. This data is all text. In current industry practice, text information is rarely exploited: the usual approach is to test whether two texts are exactly equal or whether one contains the other, and quantifying the risk has long been a difficult problem. A search found no similar patent literature.

On this basis, in order to fully mine the information contained in the text, a method and a system for identifying false data based on converting text into GPS distance are provided. All address-like information of a client is homogeneously clustered, the text address information is converted into GPS longitude and latitude, the distance between every two sets of coordinates is calculated to construct a GPS distance feature set, and false-data models such as a false-company-name model and a false-home-address model are built. The risk that the data is false is thereby quantified, so that clients with false data can be identified and intercepted before credit is granted.

Disclosure of Invention

To address the problem that, in the prior art, financial institutions find it difficult to verify the authenticity of data submitted by clients and to quantify the risk that the data is false, the invention provides a system and a method for identifying false data based on converting text into GPS distance. The aims are to fully mine the information contained in the text, to quantify the risk that the data is false, and to identify and intercept such clients before credit is granted.

The technical scheme adopted by the invention is as follows:

A system for identifying false data based on converting text into GPS distance, comprising:

an address-like data batch module: used for aggregating the client's address-like text information from various sources;

a homogeneous information clustering module: used for performing homogeneous clustering on the address-like text information;

a text-to-GPS longitude and latitude conversion module: used for converting each class of address-like text information into GPS longitude and latitude;

a longitude and latitude distance calculation module: used for collecting all GPS longitude and latitude pairs and calculating the distance between any two of them;

a false-data model module: used for constructing false-data models from the distance features calculated by the longitude and latitude distance calculation module;

a decision module: used for assessing the client's risk according to the results output by the false-data models for that client, and outputting a risk result.

The invention also discloses a method for identifying false data based on converting text into GPS distance, which comprises the following steps:

step 1: when a client initiates a credit application, the address-like data batch module collects the client's address-like text information from each source;

step 2: the homogeneous information clustering module performs homogeneous clustering on all address-like text information and divides it into IP information, mobile phone number information, company name information and address detail information;

step 3: the text-to-GPS longitude and latitude conversion module converts each class of address-like text information into GPS longitude and latitude;

step 4: the longitude and latitude distance calculation module collects the GPS longitude and latitude of all clusters and calculates the driving distance between any two GPS longitude and latitude pairs as the distance between the two addresses;

step 5: the false-data model module constructs false-data models based on the distance features calculated in step 4;

step 6: the decision module assesses the client's risk according to the results output by the false-data models and outputs a risk result.

The invention performs homogeneous clustering on all address-like information of a client, converts the text address information into GPS longitude and latitude, calculates the distance between every two sets of coordinates to construct a GPS distance feature set, and builds false-data models such as a false-company-name model and a false-home-address model, so as to quantify the risk that the data is false and to identify and intercept such clients before a loan is granted.

Preferably, the address-like text information in step 1 includes: information filled in by the client, information collected by the mobile terminal device, and information obtained from external third-party data providers.

Preferably, the IP information in step 2 includes: the IP address at registration, the IP address at application, the IP address at withdrawal, the IP address at login, the IP address when retrieving the login password, and the IP address at card binding; the mobile phone number information includes: the client's registered mobile phone number, the mobile phone number reserved with the bank card, the first contact's mobile phone number, the associated car dealer's mobile phone number, the associated salesperson's mobile phone number, the mobile phone number of the legal representative or actual controller, the spouse's telephone number, and the work unit telephone, residence telephone and mobile phone number list from the People's Bank of China (PBOC) credit report; the company name information includes: the company name filled in by the client, the company's registered name in the industrial and commercial registry, the associated dealer's name, the housing provident fund contributing unit, the social security unit name, and the work unit list from the PBOC credit report; the address detail information includes: the company address filled in by the client, the client's ID card address, the registered address of the client's company in the industrial and commercial registry, the address of the ID card issuing authority, the home address filled in by the client, the GPS address at the client's credit application, the GPS address at the client's loan request, and the work unit address list from the PBOC credit report.

Preferably, converting the address-like text information into GPS longitude and latitude in step 3 specifically includes the following:

converting IP information into GPS longitude and latitude: the ordinary IP positioning service of a map service provider is called; the API returns the approximate location of the internet IP specified in the request parameters, including its longitude and latitude; this longitude and latitude, which is that of the center point of the IP's city, is used as the longitude and latitude of the IP address;

converting mobile phone number information into GPS longitude and latitude: the home city of each mobile phone number is obtained by calling a mobile phone number home-location query interface, and the longitude and latitude of the center point of that city is used as the longitude and latitude of the mobile phone number;

converting company name information into GPS longitude and latitude: the company name information is converted into longitude and latitude by calling the address coding service provided by a map service provider;

converting address detail information into GPS longitude and latitude: the address detail information is converted into longitude and latitude by calling the address coding service provided by a map service provider.

Preferably, step 4 specifically comprises: calling the route planning service interface provided by a map service provider, which can return the public transit distance, the cycling distance and the driving distance between any two GPS longitude and latitude pairs; collecting the GPS longitude and latitude of all clusters; and taking the driving distance between any two GPS longitude and latitude pairs as the distance between the two addresses.

Preferably, step 5 specifically comprises the following steps:

step 5.1: determining the samples; a false-company-name target variable, a false-company-address target variable and a false-home-address target variable are defined and denoted Y₁, Y₂ and Y₃ respectively; when the client's information in the i-th dimension is false, the client is a false-data client, that is, Yᵢ = 1; otherwise Yᵢ = 0; where i = 1, 2, 3; stratified sampling is carried out based on Yᵢ to construct three sample sets;

step 5.2: training the models; a false-data model is built for each target variable on its own sample set using a supervised learning method, and the best false-data model is selected by means of the confusion matrix;

step 5.3: outputting the models; the output format is as follows:

ModelResult =

{subModel₁: {subScore₁: 20, subRiskGrade₁: "no risk"},

subModel₂: {subScore₂: 100, subRiskGrade₂: "high risk"},

subModel₃: {subScore₃: 85, subRiskGrade₃: "high risk"}};

where ModelResult denotes the result set of the false-data models, subModelᵢ denotes the name of the i-th sub-model, subScoreᵢ denotes the model score of the i-th sub-model, which ranges from 0 to 100, a larger score indicating a higher probability that the data is false, and subRiskGradeᵢ denotes the risk level of the i-th sub-model, with i = 1, 2, 3.

Preferably, the supervised learning methods used in step 5.2 include logistic regression, GBDT, XGBoost and LightGBM.

Preferably, the risk level in step 5.3 is determined as follows: the cut-off value is found from the ROC curve, the point on the ROC curve closest to the upper left corner being the cut-off value; the cut-off point divides the scores of each model into two segments, a no-risk segment and a high-risk segment; if the model score falls in the no-risk segment, the risk level of the model is no risk, and if the model score falls in the high-risk segment, the risk level of the model is high risk.

Preferably, step 6 specifically comprises: the decision module assesses the client's risk according to the result set of the false-data models; if the risk level of any false-data model in the result set is high risk, the client is rejected; otherwise the client passes.

In summary, by adopting the above technical solution, the invention has the following beneficial effects:

1. The invention performs homogeneous clustering on all address-like information of a client, converts the text address information into GPS longitude and latitude, calculates the distance between every two sets of coordinates to construct a GPS distance feature set, and builds false-data models such as a false-company-name model and a false-home-address model, so as to quantify the risk that the data is false and to identify and intercept such clients before a loan is granted.

2. The invention provides a method for quantifying text information that combines natural language processing, map algorithms and machine learning algorithms, giving the credit industry a new way to verify the authenticity of application data.

3. The invention provides a method for constructing quantitative features from text information, offering a new approach to deriving model variables.

Drawings

The invention will now be described, by way of example, with reference to the accompanying drawings, in which:

Fig. 1 is a schematic structural diagram of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

In the description of the embodiments of the present application, it should be noted that the terms "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings or orientations or positional relationships that the products of the present invention are usually placed in when used, and are only used for convenience of description and simplicity of description, but do not indicate or imply that the devices or elements that are referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present application. Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

The present invention is described in detail below with reference to Fig. 1.

Step 1: When a client initiates a credit application, the address-like data batch module collects the client's address-like text information from each source. A batch run is batch processing, that is, processing a batch of objects together. Address-like text information is text that can be converted into an address in some way, and mainly includes: information filled in by the client, information collected by the mobile terminal device, and information obtained from external third-party data providers.

Step 2: The homogeneous information clustering module performs homogeneous clustering on all address-like text information and divides it into several clusters: IP information, mobile phone number information, company name information and address detail information. A cluster is a set of samples produced by clustering; samples within the same cluster are similar to one another and different from samples in other clusters.

The IP information is the text information of all IP addresses associated with the client. An IP address is an Internet Protocol address: a uniform address format provided by the IP protocol, which assigns a logical address to every network and every host on the internet and thereby hides differences in physical addresses. Here, the IP information consists of the IP addresses of the user's important operations: the IP address at registration, the IP address at application, the IP address at withdrawal, the IP address at login, the IP address when retrieving the login password, and the IP address at card binding. The IP information comprises 6 pieces of text information.

The company name information is the text information of all company names associated with the client. It includes the company name filled in by the client, the company's registered name in the industrial and commercial registry, the associated dealer's name, the housing provident fund contributing unit, the social security unit name, and the work unit list from the PBOC credit report (at most 5 work units, sorted in ascending order of time, namely the first to fifth work units in the PBOC credit report). The company name information comprises 10 pieces of text information.

The address detail information is the text information of all detailed addresses associated with the client. It includes the company address filled in by the client, the client's ID card address, the registered address of the client's company in the industrial and commercial registry, the address of the ID card issuing authority, the home address filled in by the client, the GPS address at the client's credit application, the GPS address at the client's loan request, the work unit address list from the PBOC credit report (at most 5 work unit addresses, sorted in ascending order of time, namely the first to fifth work unit addresses in the PBOC credit report), the residential address list from the PBOC credit report (at most 5 residential addresses, sorted in ascending order of time, namely the first to fifth residential addresses in the PBOC credit report), and the household registration address list from the PBOC credit report (at most 5 household registration addresses, sorted in ascending order of time, namely the first to fifth household registration addresses in the PBOC credit report). The address detail information comprises 22 pieces of text information.
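As a minimal sketch of the grouping described in step 2, the fragment below assigns each collected field to one of the four clusters by field name. The field names and the mapping table are hypothetical and chosen only for illustration; the patent does not specify how fields are keyed.

```python
from collections import defaultdict

# Hypothetical mapping from source field name to cluster. The field names are
# illustrative assumptions and do not come from the patent.
FIELD_TO_CLUSTER = {
    "register_ip": "ip",
    "login_ip": "ip",
    "register_mobile": "mobile",
    "spouse_phone": "mobile",
    "filled_company_name": "company_name",
    "social_security_unit": "company_name",
    "filled_home_address": "address_detail",
    "id_card_address": "address_detail",
}


def cluster_fields(record):
    """Group a client's non-empty address-like text fields by cluster."""
    clusters = defaultdict(list)
    for field, value in record.items():
        cluster = FIELD_TO_CLUSTER.get(field)
        # Fields that are not address-like, or that are empty, are ignored.
        if cluster is not None and value:
            clusters[cluster].append((field, value))
    return dict(clusters)


record = {
    "register_ip": "203.0.113.7",
    "register_mobile": "13800000000",
    "filled_company_name": "Example Trading Co.",
    "filled_home_address": "1 Example Road, Chengdu",
    "customer_name": "not address-like",
    "login_ip": "",
}
clusters = cluster_fields(record)
```

Grouping by type first means each cluster can later be handed to the conversion service suited to it.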

Step 3: The text-to-GPS longitude and latitude conversion module converts each class of address-like text information into GPS longitude and latitude.

Converting IP information into GPS longitude and latitude: the ordinary IP positioning service of a provider such as Baidu or Gaode can be called directly; the service interface is an open interface. Ordinary IP positioning is a set of lightweight positioning interfaces provided over HTTP/HTTPS, through which a user can obtain an approximate location for an IP. Calling the API returns the approximate location (generally at city level) of the internet IP specified in the request parameters, including longitude and latitude, province, city and so on. The longitude and latitude returned is that of the center point of the city, and it is used as the longitude and latitude of the IP address. Here 6 IP address longitude and latitude pairs can be obtained.

Converting mobile phone number information into GPS longitude and latitude: mobile phone numbers in China (China Mobile, China Unicom and China Telecom) are allocated by fixed region, called the home location, so each mobile phone number corresponds to a fixed home city. The home city of each mobile phone number is obtained by calling a mobile phone number home-location query interface. The longitude and latitude of the center point of the home city is then looked up and used as the longitude and latitude of the mobile phone number. Here 15 mobile phone number longitude and latitude pairs can be obtained.

Converting company name information into GPS longitude and latitude: map service providers such as Baidu or Gaode provide open address coding services, which convert structured address data into the corresponding longitude and latitude. By calling the address coding service, the company name information is converted into longitude and latitude; here 10 company name longitude and latitude pairs can be obtained.

Converting address detail information into GPS longitude and latitude: the address detail information is converted into longitude and latitude by calling the address coding service provided by a map service provider; here 22 address detail longitude and latitude pairs can be obtained.
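The four conversions above share one shape: each cluster is passed to a service that returns a longitude and latitude. The sketch below illustrates that dispatch under stated assumptions. The resolver functions are local stand-ins, not real Baidu or Gaode API calls, the city-center coordinates are approximate, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

LatLng = Tuple[float, float]
Resolver = Callable[[str], Optional[LatLng]]

# Approximate city-center coordinates, for illustration only.
CITY_CENTERS: Dict[str, LatLng] = {
    "Chengdu": (30.57, 104.07),
    "Beijing": (39.90, 116.40),
}


def fake_ip_locator(ip: str) -> Optional[LatLng]:
    """Stand-in for an IP positioning service: returns a city center."""
    return CITY_CENTERS["Chengdu"] if ip.startswith("203.") else None


def fake_phone_locator(number: str) -> Optional[LatLng]:
    """Stand-in for a home-location query followed by a city-center lookup."""
    return CITY_CENTERS["Beijing"] if number.startswith("138") else None


def fake_geocoder(text: str) -> Optional[LatLng]:
    """Stand-in for an address coding service: matches the trailing city name."""
    return CITY_CENTERS.get(text.split(",")[-1].strip())


RESOLVERS: Dict[str, Resolver] = {
    "ip": fake_ip_locator,
    "mobile": fake_phone_locator,
    "company_name": fake_geocoder,
    "address_detail": fake_geocoder,
}


def to_gps(clusters: Dict[str, List[Tuple[str, str]]],
           resolvers: Dict[str, Resolver] = RESOLVERS) -> Dict[str, LatLng]:
    """Convert every (field, text) pair using the resolver of its cluster."""
    points: Dict[str, LatLng] = {}
    for cluster, items in clusters.items():
        resolve = resolvers[cluster]
        for field, text in items:
            coords = resolve(text)
            if coords is not None:  # unresolved text is skipped, not guessed
                points[field] = coords
    return points


clusters = {
    "ip": [("register_ip", "203.0.113.7"), ("login_ip", "198.51.100.9")],
    "mobile": [("register_mobile", "13800000000")],
    "address_detail": [("filled_home_address", "1 Example Road, Chengdu")],
}
points = to_gps(clusters)
```

Passing the resolvers in as plain callables keeps the conversion logic independent of any particular map provider, which would be swapped in for the stand-ins in practice.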

Step 4: The longitude and latitude distance calculation module collects the GPS longitude and latitude of all clusters and calculates the driving distance between any two GPS longitude and latitude pairs as the distance between the two addresses.

Map service providers such as Baidu or Gaode provide open route planning services (also called the Direction API), a set of REST-style web service APIs offered over HTTP/HTTPS. At present, the Direction API supports public transit, cycling and driving route planning. By calling the route planning service interface, the public transit distance, the cycling distance and the driving distance between any two GPS longitude and latitude pairs can be returned. The GPS longitude and latitude of all clusters are collected, and the driving distance between any two GPS longitude and latitude pairs is taken as the distance between the two addresses. The 4 clusters contain 53 GPS longitude and latitude pairs in total, so calculating the distance between every two of them yields C(53, 2) = 1378 distance features.
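The pairwise step can be sketched as follows. The patent uses driving distance returned by a route planning service; since that requires an external call, this sketch substitutes the great-circle (haversine) distance as a stand-in, which is an assumption made here and not the patent's method. The point names are hypothetical. With 53 coordinate pairs, taking every unordered pair gives C(53, 2) = 1378 distances.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    lat1, lng1 = map(math.radians, a)
    lat2, lng2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def pairwise_distances(points, distance=haversine_km):
    """Return {(name_a, name_b): distance} for every unordered pair of points.

    `distance` can be replaced by a function that calls a driving-route
    service; haversine is only a local stand-in.
    """
    return {
        (a, b): distance(points[a], points[b])
        for a, b in combinations(sorted(points), 2)
    }


# 53 synthetic points, only to check the feature count.
synthetic = {f"p{n:02d}": (30.0 + 0.01 * n, 104.0) for n in range(53)}
features = pairwise_distances(synthetic)

# Approximate city centers, for a sanity check of the distance itself.
chengdu_beijing = haversine_km((30.57, 104.07), (39.90, 116.40))
```

Here `len(features)` equals `math.comb(53, 2)`, matching the count of distance features, and `chengdu_beijing` comes out at roughly 1500 km.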

Step 5: Based on the distance features calculated by the longitude and latitude distance calculation module, false-data models are constructed. Different models are built for different kinds of information, and a false-data model is finally output for each dimension, for example a false-home-address model, a false-company-name model, a false-contact model and so on. This specifically includes the following steps:

Step 5.1: Determining the samples. A false-company-name target variable, a false-company-address target variable and a false-home-address target variable are defined and denoted Y₁, Y₂ and Y₃ respectively. When the client's information in the i-th dimension is false, the client is a false-data client, that is, Yᵢ = 1; otherwise Yᵢ = 0; where i = 1, 2, 3. Stratified sampling is carried out based on Yᵢ to construct three sample sets.
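Stratified sampling on Yᵢ can be sketched in a few lines: rows are grouped by label and the same fraction is drawn from each group, so the share of false-data clients is preserved in the sample. The sampling fraction, the seed and the synthetic rows are assumptions made for this illustration; the patent does not state them.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, fraction, seed=0):
    """Draw the same fraction of rows from each label group (stratum)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        k = max(1, round(len(group) * fraction))  # keep at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Synthetic labelled rows: Y1, Y2, Y3 are the three target variables.
rows = [
    {"id": n, "Y1": int(n % 10 == 0), "Y2": int(n % 5 == 0), "Y3": int(n % 2 == 0)}
    for n in range(100)
]

# One sample set per target variable, as in step 5.1.
sample_sets = {key: stratified_sample(rows, key, 0.5) for key in ("Y1", "Y2", "Y3")}
```

With 10 positive and 90 negative rows for Y1, a fraction of 0.5 draws 5 positives and 45 negatives, so the 10% positive rate carries over to the sample.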

Step 5.2: Training the models. A false-data model is built for each target variable on its own sample set using a supervised learning method, and the best false-data model is selected by means of the confusion matrix. The general idea of the confusion matrix is to count the number of times an instance of class A is predicted (classified) as class B. Recall and precision are two metrics widely used in statistical classification to assess the quality of the results. In retrieval terms, recall is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the collection, and measures how completely the system retrieves; precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved, and measures how accurately the system retrieves. Supervised learning methods that may be used include logistic regression, GBDT (Gradient Boosting Decision Tree), XGBoost (eXtreme Gradient Boosting), LightGBM (Light Gradient Boosting Machine) and similar algorithms.
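For a binary target such as Yᵢ, the confusion matrix reduces to four counts, from which precision and recall follow directly. The sketch below uses made-up labels and predictions purely to show the arithmetic; how the patent weighs the two metrics when choosing the best model is not stated.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = false data)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def precision_recall(cm):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    precision = cm["tp"] / predicted_pos if predicted_pos else 0.0
    recall = cm["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)
precision, recall = precision_recall(cm)
```

In this example three of the four flagged clients are truly false (precision 0.75) and three of the four truly false clients are flagged (recall 0.75).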

Step 5.3: Outputting the models. One model result is output for each target variable. Each target variable model outputs a model score between 0 and 100; the larger the score, the higher the probability that the data is false. The optimal cut-off point (that is, the critical point) is found from the ROC curve. The criterion is simple: the point on the ROC curve closest to the upper left corner is the cut-off value. The cut-off point divides the scores of each model into two segments: a no-risk segment and a high-risk segment. If the score output by a target variable model falls in the no-risk segment, the risk level of that false-data model is no risk; if it falls in the high-risk segment, the risk level is high risk.
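The cut-off rule can be sketched as follows: each distinct score is tried as a threshold, the false positive rate and true positive rate are computed, and the threshold whose ROC point lies closest to the corner (FPR = 0, TPR = 1) is chosen. The scores and labels are made up for illustration, and treating a score equal to the cut-off as high risk is an assumption, since the source does not say which segment includes the boundary.

```python
import math


def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for each distinct score used as a threshold.

    A client is predicted positive (false data) when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to build a ROC curve")
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def closest_to_top_left(points):
    """Pick the threshold whose (fpr, tpr) is nearest to the corner (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[1], 1.0 - p[2]))[0]


def risk_grade(score, cutoff):
    """Assumption: scores at or above the cut-off fall in the high-risk segment."""
    return "high risk" if score >= cutoff else "no risk"


scores = [10, 20, 35, 40, 60, 70, 85, 95]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
cutoff = closest_to_top_left(roc_points(scores, labels))
```

On this data the threshold 40 catches all four positives with one false positive, which is the ROC point nearest the upper left corner.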

The format of the result output by the false-data models is as follows:

ModelResult =

{subModel₁: {subScore₁: 20, subRiskGrade₁: "no risk"},

subModel₂: {subScore₂: 100, subRiskGrade₂: "high risk"},

subModel₃: {subScore₃: 85, subRiskGrade₃: "high risk"}};

Step 6: The decision module assesses the client's risk according to the results output by the false-data models and outputs a risk result. Specifically, the decision module assesses the client's risk according to the result set of the false-data models; if the risk level of any false-data model in the result set is high risk, the client is rejected; otherwise the client passes.

In conclusion, the address-like data batch module, the homogeneous information clustering module, the text-to-GPS longitude and latitude conversion module, the longitude and latitude distance calculation module, the false-data model module and the decision module together form a closed loop, providing the credit industry with a quantitative method for verifying the authenticity of application data.

The above embodiments express only specific implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make several changes and modifications without departing from the technical idea of the present application, all of which fall within the protection scope of the present application.
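The decision rule in step 6 is a simple "any high risk" check over the model result set. A minimal sketch follows; the dictionary keys and sub-model names are hypothetical and only mirror the shape of the output format described above.

```python
def decide(model_result):
    """Reject the client if any sub-model reports high risk; otherwise pass."""
    high_risk_models = [
        name for name, sub in model_result.items()
        if sub["risk_grade"] == "high risk"
    ]
    return {
        "decision": "reject" if high_risk_models else "pass",
        "high_risk_models": high_risk_models,
    }


# Hypothetical result set with one entry per sub-model.
model_result = {
    "company_name_model": {"score": 20, "risk_grade": "no risk"},
    "company_address_model": {"score": 100, "risk_grade": "high risk"},
    "home_address_model": {"score": 85, "risk_grade": "high risk"},
}
outcome = decide(model_result)

all_clear = decide({
    "company_name_model": {"score": 20, "risk_grade": "no risk"},
    "home_address_model": {"score": 5, "risk_grade": "no risk"},
})
```

Returning the list of high-risk sub-models alongside the decision makes it visible which dimension triggered the rejection.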

Claims

1. A system for identifying false data based on converting text into GPS distance, comprising:

2. A method for identifying false data based on converting text into GPS distance, characterized by comprising the following steps:

3. The method for identifying false data based on converting text into GPS distance according to claim 2, wherein the address-like text information in step 1 includes: information filled in by the client, information collected by the mobile terminal device, and information obtained from external third-party data providers.

4. The method for identifying false data based on converting text into GPS distance according to claim 2, wherein the IP information in step 2 includes: the IP address at registration, the IP address at application, the IP address at withdrawal, the IP address at login, the IP address when retrieving the login password, and the IP address at card binding; the mobile phone number information includes: the client's registered mobile phone number, the mobile phone number reserved with the bank card, the first contact's mobile phone number, the associated car dealer's mobile phone number, the associated salesperson's mobile phone number, the mobile phone number of the legal representative or actual controller, the spouse's telephone number, and the work unit telephone, residence telephone and mobile phone number list from the People's Bank of China (PBOC) credit report; the company name information includes: the company name filled in by the client, the company's registered name in the industrial and commercial registry, the associated dealer's name, the housing provident fund contributing unit, the social security unit name, and the work unit list from the PBOC credit report; the address detail information includes: the company address filled in by the client, the client's ID card address, the registered address of the client's company in the industrial and commercial registry, the address of the ID card issuing authority, the home address filled in by the client, the GPS address at the client's credit application, the GPS address at the client's loan request, and the work unit address list from the PBOC credit report.

5. The method for identifying false data based on converting text into GPS distance according to claim 2, wherein converting the address-like text information into GPS longitude and latitude in step 3 specifically includes the following:

6. The method for identifying false data based on converting text into GPS distance according to claim 2, wherein step 4 specifically comprises: calling the route planning service interface provided by a map service provider, which can return the public transit distance, the cycling distance and the driving distance between any two GPS longitude and latitude pairs; collecting the GPS longitude and latitude of all clusters; and taking the driving distance between any two GPS longitude and latitude pairs as the distance between the two addresses.

7. The method for identifying false data based on converting text into GPS distance according to claim 2, wherein step 5 specifically comprises the following steps:

step 5.3: outputting the models; the output format is as follows:

ModelResult =

{subModel₁: {subScore₁: 20, subRiskGrade₁: "no risk"},

subModel₂: {subScore₂: 100, subRiskGrade₂: "high risk"},

subModel₃: {subScore₃: 85, subRiskGrade₃: "high risk"}};

8. The method for identifying false data based on converting text into GPS distance according to claim 7, wherein the supervised learning methods used in step 5.2 include logistic regression, GBDT, XGBoost and LightGBM.

9. The method for identifying false data based on converting text into GPS distance according to claim 7, wherein the risk level in step 5.3 is determined as follows: the cut-off value is found from the ROC curve, the point on the ROC curve closest to the upper left corner being the cut-off value; the cut-off point divides the scores of each model into two segments, a no-risk segment and a high-risk segment; if the model score falls in the no-risk segment, the risk level of the model is no risk, and if the model score falls in the high-risk segment, the risk level of the model is high risk.

10. The method for identifying false data based on converting text into GPS distance according to claim 7, wherein step 6 specifically comprises: the decision module assesses the client's risk according to the result set of the false-data models; if the risk level of any false-data model in the result set is high risk, the client is rejected; otherwise the client passes.