CN111241077A

CN111241077A - Financial fraud behavior identification method based on internet data

Info

Publication number: CN111241077A
Application number: CN202010003646.0A
Authority: CN
Inventors: 翟恩荣
Original assignee: Sichuan XW Bank Co Ltd
Current assignee: Sichuan XW Bank Co Ltd
Priority date: 2020-01-03
Filing date: 2020-01-03
Publication date: 2020-06-05
Anticipated expiration: 2040-01-03
Also published as: CN111241077B

Abstract

The invention relates to a financial fraud behavior identification method based on internet data, comprising the following steps: A. collecting data on the Internet in real time, the data comprising at least data from news portal websites, financial forums and financial communities; B. cleaning the collected data and normalizing the heterogeneous, multi-source dirty data to obtain structured data; C. identifying negative public sentiment in the structured data by a deep-learning-based sentiment analysis method; D. calculating a public sentiment index according to configuration information; E. identifying financial fraud behaviors on the Internet according to the public sentiment index and issuing an early warning. The invention can monitor public websites, communities, forums and the like on the Internet in real time without requiring optimization of the intelligent configuration, can give early warning of financial fraud on them at the earliest opportunity, and, by configuring the monitored content, can monitor different fields and different platforms and identify financial fraud on them.

Description

Financial fraud behavior identification method based on internet data

Technical Field

The invention relates to a financial fraud identification method, in particular to a financial fraud identification method based on internet data.

Background

In the current period of compliance-focused regulation, consumer finance institutions face new opportunities and challenges, including how to move smoothly from a phase of unchecked expansion to a phase of steady development, and how to cope with declining consumption among normal customers and a sharp increase in credit customers. None of these problems can be avoided by consumer finance institutions. Against this industry background, bank risks need to be comprehensively understood so that an effective early warning system can be built, risks reduced and crises avoided.

Existing financial fraud identification methods generally deal with fraud only after it has occurred, so the fraud cannot be prevented, and the slow response after the fact often causes significant losses. When financial fraud occurs, the fraudulent behavior can be identified effectively by verifying the user's identity and by checking identifying information such as fingerprints. However, neither of these approaches can identify sudden "wool-pulling" (bonus-abuse) attacks by the black market; when a sudden fault or other vulnerability appears in a system, criminals exploit the vulnerability to attack the system, which often causes great losses.

Disclosure of Invention

The invention provides a financial fraud behavior identification method based on internet data which, by monitoring fraud-related activity on the Internet, gives timely early warning of intermediary (agency) attacks, black-market attacks and the like that surface in online communities or forums.

The invention relates to a financial fraud behavior recognition method based on internet data, which comprises the following steps:

A. collecting data on the Internet in real time, wherein the data at least comprises data of a news portal website, a financial forum and a financial community;

B. cleaning the acquired data, and carrying out normalization processing on heterogeneous multi-source dirty data to obtain structured data; the structured data is relational model data.

C. identifying negative public sentiment in the structured data by a deep-learning-based sentiment analysis method. Traditional sentiment analysis uses classical machine learning algorithms such as SVM (support vector machine) and CRF (conditional random field) to analyze text sentiment from manually labeled sentiment features; because this supervised learning depends on a large amount of manually labeled data, systems built on it incur high labeling costs. The deep-learning-based method uses a recurrent neural network to discover task-relevant features, which avoids task-specific manual feature design, and introduces a sentiment polarity transfer model to better capture textual relevance from the contextual relations between the words of a sentence. Its performance is comparable to that of current methods that rely on manually labeled sentiment features, while saving a large amount of manual labeling work. Deep-learning-based sentiment analysis methods are widely published in the prior art, for example in the published patent applications nos. 201711417352.7, 201811617266.5 and 201810290094.9; deep learning is a mature approach to sentiment analysis, is not the point of innovation of the present invention, and is not described in detail here.

D. calculating public sentiment indexes according to the configuration information, so that different types of public sentiment index can be calculated, such as a fraud attack index, a platform "thunderstorm" (platform collapse) index, a black-market activity index and the like;

E. identifying financial fraud behaviors on the Internet according to the public sentiment index, issuing an early warning, and notifying the relevant managers of the financial fraud so that appropriate measures can be taken in time to stop it.

The identification method of the invention monitors, in real time, public internet data crawled from external sources, generates the corresponding public sentiment index, and issues a real-time early warning notification when the index is abnormal, thereby identifying black-market attacks, monitoring "thunderstorm" (platform collapse) public sentiment, calculating the fraud attack index, and so on. Real-time early warning allows risks to be anticipated, measures to be taken in advance and strategies to be adjusted, keeping losses to a minimum.

Further, when data on the Internet is collected in real time in step A, the dynamic webpages on the Internet are first captured in a distributed manner; during capture, the master node of the distributed architecture is responsible for scheduling and the slave nodes are responsible for capturing. Structured data is then extracted from the captured HTML code, converting the semi-structured HTML into the required structured data. Semi-structured data is data that does not follow a relational model but has a basically fixed structural pattern. The distributed architecture increases capture speed and supports horizontal scaling, so that the field information of different websites can be located quickly, different information can be extracted for different sources, and the results are finally stored in a relational database.
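The master/slave division of labor described above can be sketched as follows. This is only a minimal single-process illustration, not the patented implementation: threads stand in for slave nodes, a shared queue stands in for the master's scheduler, and `fetch_page` is a hypothetical placeholder for a real fetch with JS rendering.

```python
import queue
import threading


def fetch_page(url):
    # Hypothetical stand-in for a real HTTP fetch plus JS rendering.
    return f"<html><body>content of {url}</body></html>"


def worker(tasks, results, lock):
    # Slave node: repeatedly take a URL from the master's queue and capture it.
    while True:
        url = tasks.get()
        if url is None:  # sentinel from the master: no more work
            break
        html = fetch_page(url)
        with lock:
            results[url] = html


def run_master(urls, num_workers=3):
    # Master node: only schedules URLs; it never captures pages itself.
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per slave
    for t in threads:
        t.join()
    return results
```

Raising `num_workers` is the single-process analogue of the horizontal scaling mentioned in the text; in a real deployment the queue would be a networked service shared by separate machines.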

Furthermore, during distributed capture the dynamic webpage is rendered by a JS engine, so that the HTML code of the finally displayed page is obtained.

Further, in step B, data cleaning converts unstructured data into structured data and performs data deduplication and data cleansing. Unstructured data refers to data without a fixed pattern, such as Word, PDF, PPT and Excel files, pictures in various formats, video, etc.

Specifically, data deduplication is URL-based and is implemented with a Bloom filter (a binary vector data structure); data cleansing is configuration-based and converts JSON-format data into formatted data. JSON data is a data structure assembled according to a specific format; it keeps the data compact, compressing the useful data as much as possible, but the data must be formatted before it can be conveniently viewed.
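URL-based deduplication with a Bloom filter might look like the sketch below. The bit-array size, the number of hash functions and the salted SHA-256 hashing scheme are all illustrative assumptions; the source names only the data structure, not its parameters.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit vector with k salted hash functions (parameters assumed)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def deduplicate(urls, seen=None):
    # Keep only URLs the filter has not (probably) seen before.
    if seen is None:
        seen = BloomFilter()
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A Bloom filter never reports a seen URL as new, but it can occasionally report a new URL as already seen, so a small fraction of pages may be skipped; that is the usual trade-off for its compact memory footprint.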

Further, step D includes:

D1. setting a keyword list related to financial fraud, and marking an article corresponding to the data when the collected data has the same keywords as the keywords in the keyword list;

D2. calculating the public sentiment index of the current article: public sentiment index of the current article = intercept + score, where the intercept is the sentiment index obtained for the current article by the deep-learning-based sentiment analysis method of step C, and the score is a value obtained by applying a predetermined calculation to the attributes of the current article;

D3. calculating the platform public sentiment index: calculating the public sentiment index $A_i$ of each of the n articles retrieved by the platform within a set time range, where i is the article number and $i \le n$; the article with the highest public sentiment index $A_i$ is given weight n and the article with the lowest public sentiment index $A_i$ is given weight 1, and the platform public sentiment index is $\sum_{i=1}^{n} [\,n - \mathrm{top}(A_i)\,] \times A_i \,/\, (1 + 2 + \dots + n)$, where $\mathrm{top}(A_i)$ is the rank value of the public sentiment index $A_i$ of the current i-th article.
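The rank-weighted average in step D3 can be sketched as below. One assumption is needed to make the formula agree with the stated weights: the rank value top(A_i) is taken to be 0 for the highest-scoring article and n - 1 for the lowest, so that n - top(A_i) runs from n down to 1. The function name is illustrative.

```python
def platform_index(article_indices):
    """Rank-weighted average of per-article public sentiment indexes.

    Assumes top(A_i) is a 0-based rank (0 = highest index), so the highest
    article gets weight n and the lowest gets weight 1.
    """
    n = len(article_indices)
    if n == 0:
        return 0.0
    ranked = sorted(article_indices, reverse=True)
    weighted_sum = sum((n - rank) * a for rank, a in enumerate(ranked))
    return weighted_sum / (n * (n + 1) / 2)  # denominator is 1 + 2 + ... + n
```

Because the higher-scoring articles carry larger weights, a few strongly negative articles pull the platform index up more than a plain average would.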

On this basis, in step E, when the public sentiment index of a single article or the platform public sentiment index reaches a set threshold, the relevant managers are notified by SMS and/or e-mail.

The method for identifying financial fraud behavior based on internet data can monitor public websites, communities, forums and the like on the Internet in real time without requiring optimization of the intelligent configuration, can give early warning of financial fraud on the Internet at the earliest opportunity, and, by configuring the monitored content, can monitor different fields and different platforms and identify financial fraud on them.

The present invention will be described in further detail with reference to the following examples. This should not be understood as limiting the scope of the above-described subject matter of the present invention to the following examples. Various substitutions and alterations according to the general knowledge and conventional practice in the art are intended to be included within the scope of the present invention without departing from the technical spirit of the present invention as described above.

Drawings

FIG. 1 is a flow chart of the method for identifying financial fraud based on Internet data according to the present invention.

Fig. 2 is a flow chart of real-time data acquisition over the internet.

Detailed Description

The method for identifying financial fraud based on internet data of the invention as shown in fig. 1 comprises:

A. As shown in fig. 2, data on the Internet, including at least data from news portal sites, financial forums and financial communities, is collected in real time by a web crawler. During collection, dynamic webpages on the Internet are captured in a distributed manner, each page being rendered by a JS engine so that the HTML code of the finally displayed page is obtained. During capture, the master node of the distributed architecture is responsible for scheduling and the slave nodes are responsible for capturing. Structured data is then extracted from the captured HTML code, converting the semi-structured HTML into the required structured data and yielding the web page source file and its corresponding URL. The distributed architecture increases capture speed and supports horizontal scaling, so that the field information of different websites can be located quickly, different information can be extracted for different sources, and the results are finally stored in a relational database.
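Converting semi-structured HTML into a structured record can be illustrated with Python's standard `html.parser` module. The choice of fields (`title`, `paragraphs`, `url`) and the class and function names are assumptions made for this sketch; the source does not say which fields are extracted.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "paragraphs": []}
        self._current = None  # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "title":
                self.record["title"] = text
            elif text:
                self.record["paragraphs"].append(text)
            self._current = None


def extract_record(html, url):
    # Pair the extracted fields with the source URL, ready for a database row.
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    record = dict(parser.record)
    record["url"] = url
    return record
```

In practice each source site would need its own extraction rules, which is what the text means by extracting different information for different sources.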

B. The collected data is cleaned: heterogeneous, multi-source dirty data is normalized, unstructured data is converted into structured data, and data deduplication and data cleansing are performed. Data deduplication is URL-based and is implemented with a Bloom filter (a binary vector data structure). Data cleansing is configuration-based and converts JSON-format data into formatted data.
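Configuration-based cleansing of JSON data might be sketched as a field-mapping table that drives the conversion. The configuration layout and every field name below (`post_title`, `views`, `replies` and their targets) are hypothetical; the source states only that cleansing is driven by configuration.

```python
import json

# Hypothetical cleansing configuration: target column -> source key, type, default.
FIELD_CONFIG = {
    "title": {"source": "post_title", "type": str, "default": ""},
    "read_count": {"source": "views", "type": int, "default": 0},
    "comment_count": {"source": "replies", "type": int, "default": 0},
}


def clean_record(raw_json, config=FIELD_CONFIG):
    # Map one raw JSON document onto the configured, typed columns.
    raw = json.loads(raw_json)
    cleaned = {}
    for target, rule in config.items():
        value = raw.get(rule["source"], rule["default"])
        try:
            value = rule["type"](value)
        except (TypeError, ValueError):
            value = rule["default"]  # fall back when the raw value is unusable
        if isinstance(value, str):
            value = value.strip()
        cleaned[target] = value
    return cleaned
```

Supporting a new source then only requires a new configuration table rather than new code.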

C. In the field of financial risk control, the most important application scenario for natural language processing is public sentiment analysis. Text is analyzed and mined with techniques such as text clustering and sentiment analysis in order to discover and track negative public sentiment. Identifying negative public sentiment requires considering both scale and degree of negativity, and requires finding negative sentiment that rises quickly over a period of time or involves a large number of participants. Scale can be judged from the number of related web pages after text clustering, and the degree of negativity is identified by sentiment analysis of the text.

Negative public sentiment in the structured data is identified by a deep-learning-based sentiment analysis method. This method uses a recurrent neural network to discover task-relevant features, which avoids task-specific manual feature design, and introduces a sentiment polarity transfer model to better capture textual relevance from the contextual relations between the words of a sentence. Its performance is comparable to that of current methods that rely on manually labeled sentiment features, while saving a large amount of manual labeling work. Deep-learning-based sentiment analysis methods are widely published in the prior art, for example in patent applications nos. 201711417352.7, 201811617266.5 and 201810290094.9; deep learning is a mature approach to sentiment analysis, is not the point of innovation of the present invention, and is not described in detail here.

D. Public sentiment indexes are calculated according to the configuration information, so that different types of public sentiment index can be calculated, such as a fraud attack index, a platform "thunderstorm" (platform collapse) index, a black-market activity index and the like. This step specifically comprises:

D1. Set a keyword list related to financial fraud, and mark the article corresponding to the data when the collected data contains a keyword from the keyword list. The keyword list is for example:

D2. Calculate the public sentiment index of the current article: public sentiment index of the current article = intercept + score, where the intercept is the sentiment index obtained for the current article by the deep-learning-based sentiment analysis method of step C, and the score is a value obtained by applying a preset calculation to the attributes of the current article. The attributes include: reading volume, comment volume, type (original or reprinted), number of days since the article was released, and the number of keywords contained in the article. The public sentiment index of the current article is calculated as follows:

a. For example, let n be the number of times the keyword "kouzi" appears in the article: when n < 1, this item scores 0; when n = 1, this item scores 20; when n ≥ 2, this item scores 100;

b. Let n be the number of times the keyword "roll" appears in the article: when n ≥ 1, this item scores 100; when n < 1, this item scores 0;

c. Let n be the visit count of the article: when n < 10, this item scores 5; when 10 ≤ n < 100, this item scores 20; when 100 ≤ n < 1000, this item scores 50; when n > 1000, this item scores 100;

d. Let n be the comment count of the article: when n < 10, this item scores 5; when n ≥ 10, this item scores 100;

e. Let n be the number of days since the article was released: when n < 2, this item scores 100; when 2 ≤ n < 7, this item scores 70; when n ≥ 7, this item scores 20.

In the calculation:

(1) Multiple keywords can be defined; for each keyword defined, the number of times it appears in the current article must be counted.

(2) The intercept is limited to the range (0, 30) points, with a maximum of 30 points;

(3) The score is limited to the range (0, 70) points, with a maximum of 70 points;
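The item rules a to e and the two limits can be combined as in the following sketch. The source does not say how the five item scores are merged into a score of at most 70 points, so this sketch assumes they are averaged and scaled linearly (an item average of 100 maps to 70); it also assumes a visit count of exactly 1000 falls in the top band, and it omits the article type because no rule is given for it. All function names are illustrative.

```python
def kouzi_item(n):
    # Rule a: occurrences of the keyword "kouzi".
    if n < 1:
        return 0
    return 20 if n == 1 else 100


def roll_item(n):
    # Rule b: occurrences of the keyword "roll".
    return 100 if n >= 1 else 0


def visits_item(n):
    # Rule c: visit count (n == 1000 is assumed to score 100).
    if n < 10:
        return 5
    if n < 100:
        return 20
    if n < 1000:
        return 50
    return 100


def comments_item(n):
    # Rule d: comment count.
    return 5 if n < 10 else 100


def days_item(n):
    # Rule e: days since release; fresher articles score higher.
    if n < 2:
        return 100
    return 70 if n < 7 else 20


def article_index(sentiment_index, kouzi, roll, visits, comments, days):
    items = [
        kouzi_item(kouzi),
        roll_item(roll),
        visits_item(visits),
        comments_item(comments),
        days_item(days),
    ]
    # Assumed combination: average of the items, scaled so that 100 -> 70.
    score = min(70.0, sum(items) * 70 / (100 * len(items)))
    # The intercept is the step C sentiment index, capped at 30 points.
    intercept = max(0.0, min(30.0, float(sentiment_index)))
    return intercept + score
```

With these caps the article index cannot exceed 100 points, which is consistent with the 80-point alert example given later in the description.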

D3. Calculate the platform public sentiment index: a platform public sentiment index is generated every hour of every day. Calculate the public sentiment index $A_i$ of each of the n articles retrieved by the platform in each hour, where i is the article number and $i \le n$. The article with the highest public sentiment index $A_i$ is given weight n and the article with the lowest public sentiment index $A_i$ is given weight 1, and the platform public sentiment index is $\sum_{i=1}^{n} [\,n - \mathrm{top}(A_i)\,] \times A_i \,/\, (1 + 2 + \dots + n)$, where $\mathrm{top}(A_i)$ is the rank value of the public sentiment index $A_i$ of the current i-th article.
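As a worked illustration with made-up numbers, suppose the platform retrieves n = 3 articles in one hour with public sentiment indexes 90, 60 and 30. Assuming the rank value top(A_i) counts from 0 for the highest article (the reading under which the highest article receives weight n and the lowest weight 1), the weights are 3, 2 and 1:

$$
\frac{(3-0)\times 90 + (3-1)\times 60 + (3-2)\times 30}{1+2+3} = \frac{270 + 120 + 30}{6} = 70
$$

The plain average of the three indexes would be 60; the rank weighting raises the platform index to 70 because the most negative article counts the most.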

E. Financial fraud on the Internet is identified according to the public sentiment index. When any one of the public sentiment index, the score or the sentiment index of a single article reaches a set threshold, or the platform public sentiment index reaches its threshold, or the platform public sentiment index is abnormal (for example, the current platform public sentiment index is greater than 1.5 times yesterday's maximum platform public sentiment index, or greater than 1.5 times the average platform public sentiment index over the previous 7 days), the relevant managers are notified by SMS and/or e-mail so that appropriate measures can be taken in time to stop the fraud.
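The platform-level alert conditions of step E can be sketched as a single check that returns the reasons an alert should fire. The default threshold of 80 points is a hypothetical value chosen for illustration, and the anomaly rules are read as "greater than 1.5 times yesterday's maximum" and "greater than 1.5 times the 7-day average"; the function and reason names are assumptions.

```python
def alert_reasons(current, yesterday_values, last_7_days_values,
                  threshold=80.0, factor=1.5):
    """Return the list of triggered conditions; an empty list means no alert."""
    reasons = []
    # Fixed threshold (the value 80 is hypothetical, set by configuration).
    if current >= threshold:
        reasons.append("threshold")
    # Anomaly rule 1: compare against yesterday's maximum hourly index.
    if yesterday_values and current > factor * max(yesterday_values):
        reasons.append("yesterday_max")
    # Anomaly rule 2: compare against the average over the previous 7 days.
    if last_7_days_values:
        average = sum(last_7_days_values) / len(last_7_days_values)
        if current > factor * average:
            reasons.append("7_day_average")
    return reasons
```

The relative rules let a platform whose index is normally low still raise an alert on a sudden jump, even when the absolute threshold is not reached.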

For a single article, the alert content includes: time, alert content and the article link, for example: [xx Bank] [2019-01-18 12:39:40] [financial fraud attack hint: the financial fraud attack index is 80 points] https://xx.

For the platform, the alert content includes: time, content, and links to the three highest-ranked individual articles, for example: [xx Bank] [2019-01-18 12:39:40] [financial fraud attack hint: the financial fraud attack index is 80 points] https://xx.cc.com, https://pp.mm.com, https://gg.uu.com.
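Assembling the alert text in the bracketed layout of the examples above could look like the following sketch. The exact bracket placement is a reconstruction from those examples, the function name is illustrative, and the links are assumed to arrive already sorted from the highest-ranked article downward.

```python
from datetime import datetime


def format_alert(bank, when, index, links):
    # Links are assumed pre-sorted by article index; at most three are included.
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    body = f"financial fraud attack hint: the financial fraud attack index is {index} points"
    return f"[{bank}] [{stamp}] [{body}] " + ", ".join(links[:3])
```

For a single-article alert the same function can be called with a one-element list of links.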

Claims

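1. A method for identifying financial fraud behavior based on internet data, characterized by comprising the following steps:

A. collecting data on the Internet in real time, wherein the data at least comprises data of a news portal website, a financial forum and a financial community;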

B. cleaning the acquired data, and carrying out normalization processing on heterogeneous multi-source dirty data to obtain structured data;

C. identifying negative public sentiments in the structured data through an emotion analysis method based on deep learning;

D. calculating a public sentiment index according to the configuration information;

E. identifying financial fraud behaviors on the Internet according to the public sentiment index, and issuing an early warning.

2. A method for identifying internet data based financial fraud as claimed in claim 1, characterized by: in step A, when data on the Internet is collected in real time, dynamic webpages on the Internet are first captured in a distributed manner, and during capture the master node of the distributed architecture is responsible for scheduling and the slave nodes are responsible for capturing; structured data is then extracted from the captured HTML code, converting the semi-structured HTML code into the required structured data.

3. A method for identifying internet data based financial fraud as claimed in claim 2, characterized by: the distributed capture of the dynamic webpage renders the dynamic webpage through a JS engine, thereby obtaining the HTML code of the finally displayed page.

4. A method for identifying internet data based financial fraud as claimed in claim 1, characterized by: in step B, during data cleaning, unstructured data is converted into structured data, and data deduplication and data cleansing are performed.

5. A method for identifying internet data based financial fraud as claimed in claim 4, characterized by: the data deduplication is URL-based and is implemented with a Bloom filter data structure; the data cleansing is configuration-based and converts JSON-format data into formatted data.

6. A method for identifying internet data based financial fraud as claimed in claim 1, characterized by: the step D comprises the following steps:

D3. calculating the platform public sentiment index: calculating the public sentiment index $A_i$ of each of the n articles retrieved by the platform within a set time range, where i is the article number and $i \le n$; the article with the highest public sentiment index $A_i$ is given weight n and the article with the lowest public sentiment index $A_i$ is given weight 1, and the platform public sentiment index is $\sum_{i=1}^{n} [\,n - \mathrm{top}(A_i)\,] \times A_i \,/\, (1 + 2 + \dots + n)$, where $\mathrm{top}(A_i)$ is the rank value of the public sentiment index $A_i$ of the current i-th article.

7. A method for identifying internet data based financial fraud as claimed in claim 6, characterized by: in step E, when the public sentiment index of a single article or the platform public sentiment index reaches a set threshold, the relevant managers are notified by SMS and/or e-mail.