CN109636352A

CN109636352A - A distributed content duplicate-checking early-warning system based on financial big data

Info

Publication number: CN109636352A
Application number: CN201811562264.0A
Authority: CN
Inventors: 李景龙
Original assignee: Hunan Long Hui Group Ltd By Share Ltd
Current assignee: Hunan Long Hui Group Ltd By Share Ltd
Priority date: 2018-12-20
Filing date: 2018-12-20
Publication date: 2019-04-16

Abstract

The invention discloses a distributed content duplicate-checking early-warning system and method based on financial big data, comprising a project application system, a content early-warning model center, a content analysis engine, a big data management platform, an information push center, and a task scheduling center. The advantages of the invention are as follows. Based on the big data management system, a unified application project library and a unified industrial and commercial library are established. The content analysis engine, built on distributed computing technology, supports fast duplicate-checking analysis of large-scale application content against the project library and the industrial and commercial library, and uses the computing power of multiple servers to calculate the similarity value of application content quickly. The system offers high availability, efficient duplicate checking, and safe, reliable results.

Description

A distributed content duplicate-checking early-warning system based on financial big data

Technical field

The present invention relates to a distributed content duplicate-checking early-warning system based on financial big data.

Background technique

With the continued development of the information industry, finance departments have built a number of application systems for managing special funds. These systems moved work from paper to online processing and improved office efficiency. However, as government support for enterprises keeps growing, finance departments must handle a large number of enterprise applications for special support funds and review a large volume of application content. To cope with this, the system needs to be more intelligent: it should run duplicate-checking analysis on application content and warn managers based on the analysis results. Because e-government construction has lacked unified planning and has mostly proceeded through independent, scattered projects, information resources cannot be shared and used effectively, and the data integration problem is difficult to solve by simple upgrades.

With the development of big data and distributed computing technology, establishing a unified project application big data management platform has become a way to address identical content in special fund project applications. Existing financial information early-warning platform products can perform duplicate-checking early warning on application content and decide whether to send a warning notification by setting a content-similarity warning threshold, but they have the following main problems: 1) Facing large-scale data content, the computing power of a single server is limited. Even in the simplest case, with two pieces of data only 20 characters long, computing their similarity in a loop 1,000,000 times takes 4000 ms or more. Suppose 1,000,000 comparisons are needed in one day: merely checking whether 1,000,000 items are duplicates takes 4 s. Even at 4 s per document, a single thread handles only 15 documents per minute, or 900 per hour, and if one application document reaches several hundred megabytes in size, efficiency drops further; 2) Data storage is relatively scattered. Data is not stored centrally on a unified data platform, so information resources cannot be shared and used effectively, and the dedicated platforms of finance departments at every level must each repeat the duplicate-checking computation on application content data; 3) No unified industrial and commercial database has been established. Because the legal representative or a shareholder of an applicant may own several enterprises, and several of those enterprises may apply for the same project, multiple applications for one project can arise, and duplicate application content cannot be fully and effectively prevented. In addition, because analysis precision is low and the system architecture stores data on a single node without support for distributed computing, computational efficiency on massive application content is too low, results cannot be fed back to users in time, and problematic approvals can easily result.

It is therefore necessary to provide a distributed content duplicate-checking early-warning system and method based on financial big data to solve the above problems.

Summary of the invention

The purpose of the present invention is to provide a distributed content duplicate-checking early-warning system based on financial big data that is more efficient, secure, and reliable: an efficient early-warning information platform that computes content similarity in a distributed manner and analyzes similar content, built on document text-image recognition, a Chinese word segmentation algorithm, and financial big data.

One object of the invention is to provide a distributed content duplicate-checking early-warning system based on financial big data, comprising a project application module, a content early-warning module, a content analysis engine, a financial big database, an information push center, and a task scheduling center, wherein:

The project application module is used by users to apply for special fund projects;

The content early-warning module sets the warning-line values for content-similarity early warning and the corresponding warning levels;

The content analysis engine currently consists of two parts: a Chinese word segmentation algorithm and a content similarity algorithm. The Chinese word segmentation algorithm splits the sentences of the entire submitted document into words (lexemes, that is, the words that make up a sentence). The similarity algorithm computes the similarity value of the two application documents being compared; the similarity algorithm is the SimHash algorithm;

The financial big database communicates with the industrial and commercial database and the project application database. It cleans, processes, and classifies the collected industrial and commercial data of project applicants and the project application data to form an industrial and commercial theme library and a project theme library;

The information push center pushes warning information precisely according to the different requirements of management;

The task scheduling center schedules the corresponding processing algorithms and functions to execute tasks.

Another object of the present invention is to provide a distributed content duplicate-checking early-warning method based on financial big data that uses the above system, comprising the following steps:

S1. Establish the financial big database and, using the configured algorithm model, clean, process, and classify the collected industrial and commercial data and project application data to form the industrial and commercial theme library and the project theme library;

S2. The enterprise fills in the special fund application content through the project application module and submits a special fund project application request to the server; the server receives the project application request sent by the client and begins receiving data;

S3. The content analysis engine calls the segmentation algorithm interface to perform lexical analysis on the project application content and splits the sentences into lexemes; the storage-layer interface is called to store the lexemes in the financial big database, and the applicant's project application content is stored as documents in HDFS and MongoDB;

S4. The task scheduling center calls its task interface to publish the similarity calculation task and the industrial and commercial library business-relationship link calculation task, then calls the interface of the distributed computing tool Spark to execute the calculation tasks. Using the computing power of multiple servers, it quickly performs the similarity duplicate-checking analysis of the application content against the project theme library and the industrial and commercial theme library;

S5. The calculation result is fed back to the content early-warning model center, and the model judges whether the result triggers the warning threshold; if the warning value is exceeded, step S6 starts, otherwise the project content early-warning calculation process ends;

S6. The early-warning model center writes a warning log to the warning table and calls the warning-result message push interface. Messages are pushed mainly by email, in-site message, SMS, and app, and the push method can be configured dynamically;

S7. The information push center pushes the warning result message. The user opens the message notification and views the duplicate-checking result, in which the duplicated content is marked and displayed.
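Steps S5 to S7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described by the invention: the `WarningRule` class, the threshold values, the level names, the `handle_result` function, and the `send` callback are all hypothetical, and similarity is assumed to be a score between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarningRule:
    threshold: float  # warning-line value (hypothetical scale: 0..1)
    level: str        # warning level reported when the threshold is exceeded


# Hypothetical warning lines; a real deployment would configure these.
RULES = [
    WarningRule(0.9, "red"),
    WarningRule(0.7, "orange"),
    WarningRule(0.5, "yellow"),
]


def evaluate(similarity, rules=RULES):
    """S5: return the highest warning level whose threshold is exceeded, else None."""
    for rule in sorted(rules, key=lambda r: r.threshold, reverse=True):
        if similarity > rule.threshold:
            return rule.level
    return None


def handle_result(project_id, similarity, channels, warning_log, send):
    """S5-S7: judge the result, log the warning, and push it on each channel."""
    level = evaluate(similarity)
    if level is None:
        return None  # no warning line exceeded: the flow ends here
    # S6: write the warning record (a list stands in for the warning table).
    warning_log.append(
        {"project_id": project_id, "similarity": similarity, "level": level}
    )
    # S7: push over the configured channels (mail, in-site message, SMS, app).
    message = f"Project {project_id}: similarity {similarity:.2f}, level {level}"
    for channel in channels:
        send(channel, message)
    return level


log, sent = [], []
handle_result("P-001", 0.82, ["mail", "sms"], log, lambda ch, msg: sent.append(ch))
handle_result("P-002", 0.30, ["mail", "sms"], log, lambda ch, msg: sent.append(ch))
```

Passing the channel list and the `send` callback as arguments mirrors the statement that the push method can be configured dynamically; only the first call above produces a log entry and pushes.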

The segmentation algorithm in S3 is based on forward matching. Specifically, the intelligent segmentation mode (smart mode) is used, in which the system's segmentation engine (segmenter) outputs the single segmentation result it considers most reasonable according to its built-in method. The algorithm also introduces the concepts of the lexeme and the lexeme chain. A lexeme chain is a chain structure formed by the segmentation results in order of occurrence; in essence it is an ordered set of intersecting lexemes. Each lexeme object records the position of the lexeme within the whole chain, which is used for disambiguation.
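The forward-matching idea can be illustrated with a small sketch. It assumes a greedy forward maximum matching strategy over a toy dictionary and records each lexeme's start position and length; the function name, the dictionary, and the `max_len` parameter are hypothetical, and the disambiguation performed in smart mode is not modeled.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position the longest dictionary word (up to max_len characters)
    is taken; characters not covered by the dictionary become single-character
    lexemes. Returns an ordered list of (start, length, word) triples, a
    simplified stand-in for a lexeme chain.
    """
    chain = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is None:
            matched = text[i]  # unknown character: emit it on its own
        chain.append((i, len(matched), matched))
        i += len(matched)
    return chain


# Toy dictionary (hypothetical); a real segmenter loads a large lexicon.
DICTIONARY = {"专项", "资金", "专项资金", "项目", "申报"}
chain = forward_max_match("专项资金项目申报", DICTIONARY)
words = [word for _, _, word in chain]
print(words)
```

Because the longest match wins, "专项资金" is kept as one lexeme rather than being split into "专项" and "资金".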

The similarity duplicate-checking analysis in S4 preferably uses the SimHash similarity algorithm, which proceeds as follows:

1) Perform keyword extraction on Doc (including word segmentation and weight calculation) to obtain n (keyword, weight) pairs, that is, the (feature, weight) pairs in the figure. These are denoted feature_weight_pairs = [fw1, fw2, …, fwn], where fwn = (feature_n, weight_n);

2) hash_weight_pairs = [(hash(feature), weight) for feature, weight in feature_weight_pairs] generates the (hash, weight) pairs in the figure; assume the hash produces bits_count = 6 bits;

3) Accumulate hash_weight_pairs bit by bit (column-wise): if the bit is 1, add weight; if it is 0, subtract weight. This yields bits_count numbers, whose values depend on the algorithm used by the hash function;

4) Convert the generated numbers to a binary fingerprint, for example 110001: a positive value gives 1 and a negative value gives 0.
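Steps 1) to 4) can be sketched as follows. This is a minimal sketch under stated assumptions: the patent does not name a hash function, so MD5 truncated to `bits_count` bits is used here purely for illustration, and the function names, the toy hash table, and the example weights are hypothetical. A column total of exactly zero, which the text does not cover, is mapped to 0.

```python
import hashlib


def hash_bits(feature, bits_count):
    """Illustrative hash: the low bits_count bits of the feature's MD5 digest."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits_count) - 1)


def simhash(feature_weight_pairs, bits_count=64, hash_fn=hash_bits):
    """Return (column totals, fingerprint) for a list of (feature, weight) pairs."""
    totals = [0] * bits_count
    # Steps 2 and 3: hash each feature, then add or subtract its weight per bit.
    for feature, weight in feature_weight_pairs:
        h = hash_fn(feature, bits_count)
        for pos in range(bits_count):
            bit = (h >> (bits_count - 1 - pos)) & 1  # pos 0 is the leftmost bit
            totals[pos] += weight if bit else -weight
    # Step 4: positive totals become 1, all others 0.
    fingerprint = 0
    for total in totals:
        fingerprint = (fingerprint << 1) | (1 if total > 0 else 0)
    return totals, fingerprint


# Toy 6-bit hashes (hypothetical) so the arithmetic can be followed by hand.
TOY_HASHES = {"fund": 0b110101, "project": 0b100011, "subsidy": 0b010001}
pairs = [("fund", 4), ("project", 3), ("subsidy", 2)]
totals, fp = simhash(pairs, bits_count=6, hash_fn=lambda f, n: TOY_HASHES[f])
print(totals, format(fp, "06b"))
```

With the toy hashes, the leftmost column sums to 4 + 3 - 2 = 5 and the third column to -4 - 3 - 2 = -9, so the column totals are [5, 3, -9, -1, -3, 9] and the fingerprint is 110001.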

With the distributed content duplicate-checking early-warning system and method based on financial big data provided by the invention, users submit project application forms through the platform, and all application data is stored as documents in HDFS and MongoDB. Using the configured algorithm model, the SimHash similarity algorithm, the submitted project data is cleaned, processed, classified, and stored in structured form so that it can be searched and read efficiently. The invention eliminates the data-silo problem caused by scattered resources, tracks the entire life cycle of a project application, and provides full-cycle monitoring. This helps ensure that project funds are used scientifically and reasonably, prevents fraudulent duplicate applications as far as possible, and avoids wasting fiscal funds, thereby promoting the rapid development of enterprises.

Brief description of the drawings

Fig. 1 is a system structure diagram of the invention.

Fig. 2 is a distributed computing flow chart of the invention.

Fig. 3 is a schematic diagram of the SimHash calculation of the invention.

Specific embodiment

Fig. 1 shows the system structure of the invention. The distributed content duplicate-checking early-warning system based on financial big data provided by the invention comprises a financial big database, a project application module, an early-warning model center, a content analysis engine, an information push center, and a task scheduling center, wherein:

The financial big database communicates with the industrial and commercial database and the project application database. It cleans, processes, and classifies the collected industrial and commercial data of project applicants and the project application data to form an industrial and commercial theme library and a project theme library;

The project application module is used by users to apply for special fund projects from a terminal;

The early-warning model center sets the warning-line values for content-similarity early warning and the corresponding warning levels;

The content analysis engine currently consists of two parts: a Chinese word segmentation algorithm and a content similarity algorithm. The Chinese word segmentation algorithm splits the sentences of the entire submitted document into words (lexemes, that is, the words that make up a sentence). The similarity algorithm computes the similarity value of the two application documents being compared; the similarity algorithm is the SimHash algorithm;

The information push center pushes warning information precisely according to the different requirements of management;

The distributed content duplicate-checking early-warning method based on financial big data of this embodiment comprises the following steps:

S4. The task scheduling center calls its task interface, which covers publishing the similarity calculation task and the industrial and commercial library business-relationship link calculation task, then calls the interface of the distributed computing tool Spark. The similarity calculation engine in the content analysis engine executes the calculation tasks across multiple server nodes (see Fig. 2), quickly performing the similarity duplicate-checking analysis of the application content against the project theme library and the industrial and commercial theme library;

S5. The calculation result is fed back to the content early-warning model center, and the model judges whether the result triggers the warning threshold; if the warning value is exceeded, step S6 starts, otherwise the project content early-warning calculation process ends;
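The partition-and-merge shape of the S4 calculation can be illustrated locally. The sketch below is a stand-in under loud assumptions: it uses a Python thread pool rather than Spark, so it shows only how the library is split, scanned in parallel, and merged, not real multi-server execution. The function names and the toy 6-bit fingerprints are hypothetical, and the distance limit of 3 follows the Hamming-distance rule this embodiment states for comparing SimHash values.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def scan_partition(task):
    """Compare one new fingerprint against one partition of the library."""
    new_fp, partition, max_distance = task
    hits = []
    for doc_id, fp in partition:
        distance = hamming_distance(new_fp, fp)
        if distance <= max_distance:
            hits.append((doc_id, distance))
    return hits


def find_similar(new_fp, library, workers=4, max_distance=3):
    """Split the library into partitions, scan them in parallel, merge the hits."""
    items = list(library.items())
    partitions = [items[i::workers] for i in range(workers)]
    tasks = [(new_fp, part, max_distance) for part in partitions]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(scan_partition, tasks)
    return sorted(hit for part in results for hit in part)


# Toy project theme library: document id -> fingerprint (hypothetical values).
LIBRARY = {"A": 0b110001, "B": 0b110011, "C": 0b001110, "D": 0b111111}
matches = find_similar(0b110001, LIBRARY)
print(matches)
```

In a cluster setting each partition would be processed on a different node; here the threads only demonstrate that the per-partition scans are independent and their results can be merged at the end.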

In this embodiment, the similarity duplicate-checking analysis in S4 uses the SimHash similarity algorithm (see Fig. 3), which proceeds as follows:

1) Perform keyword extraction on Doc (including word segmentation and weight calculation) to obtain n (keyword, weight) pairs, that is, the (feature, weight) pairs in the figure. These are denoted feature_weight_pairs = [fw1, fw2, …, fwn], where fwn = (feature_n, weight_n);

3) Accumulate hash_weight_pairs bit by bit (column-wise): if the bit is 1, add weight; if it is 0, subtract weight. This yields bits_count numbers, [13, 108, -22, -5, -32, 55] in the figure; the values generated depend on the algorithm used by the hash function. Each word is hashed to a 64-bit binary value, giving (for 20 words) 20 binary sequences of length 64. Where a hash bit is 1, it is replaced by the positive weight; where it is 0, by the negative weight. This gives 20 lists of length 64 of the form [weight, -weight, weight, …, weight]. Summing the 20 lists column by column gives a single list, so each document yields one list of length 64.

4) Evaluate this list: a positive value gives 1 and a negative value gives 0. For example, [13, 108, -22, -5, -32, 55] gives 110001, which is the SimHash value of the document. Two SimHash values are compared with an XOR operation (Hamming distance): if the number of 1 bits in the XOR result is more than 3, the documents are dissimilar; if it is less than or equal to 3, they are similar.
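The sign rule and the XOR comparison in step 4) can be checked with a short sketch. The function names are hypothetical; the column totals are the ones quoted from the figure, and the second fingerprint is an invented value used only to exercise the comparison.

```python
def fingerprint_from_totals(totals):
    """Map each column total to a bit: positive -> 1, otherwise -> 0."""
    value = 0
    for total in totals:
        value = (value << 1) | (1 if total > 0 else 0)
    return value


def hamming_distance(a, b):
    """Count the 1 bits in the XOR of two fingerprints."""
    return bin(a ^ b).count("1")


def is_similar(a, b, max_distance=3):
    """Documents are treated as similar when at most max_distance bits differ."""
    return hamming_distance(a, b) <= max_distance


doc_fp = fingerprint_from_totals([13, 108, -22, -5, -32, 55])
other_fp = 0b110011  # invented fingerprint for illustration
print(format(doc_fp, "06b"), hamming_distance(doc_fp, other_fp), is_similar(doc_fp, other_fp))
```

The totals [13, 108, -22, -5, -32, 55] have signs +, +, -, -, -, +, so the fingerprint is 110001; it differs from 110011 in one bit, which is within the limit of 3.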

Claims

1. A distributed content duplicate-checking early-warning system based on financial big data, characterized by comprising a financial big database, a project application module, a content early-warning module, a content analysis engine, an information push center, and a task scheduling center, wherein:

The project application module is used by users to apply for special fund projects;

The content early-warning module sets the warning-line values for content-similarity early warning and the corresponding warning levels;

The content analysis engine currently consists of two parts: a Chinese word segmentation algorithm and a content similarity algorithm;

2. A distributed content duplicate-checking early-warning method based on financial big data, characterized by comprising the following steps:

3. The distributed content duplicate-checking early-warning method based on financial big data according to claim 2, characterized in that the segmentation algorithm in S3 is based on forward matching, specifically: the intelligent segmentation mode (smart mode) is used, in which the system's segmentation engine (segmenter) outputs the single segmentation result it considers most reasonable according to its built-in method; the algorithm also introduces the concepts of the lexeme and the lexeme chain, a lexeme chain being a chain structure formed by the segmentation results in order of occurrence, in essence an ordered set of intersecting lexemes; each lexeme object records the position of the lexeme within the whole chain, which is used for disambiguation.

4. The distributed content duplicate-checking early-warning method based on financial big data according to claim 2, characterized in that the similarity duplicate-checking analysis in S4 uses the SimHash similarity algorithm.