CN107798068A - Method, system and related apparatus for processing defaulter data - Google Patents

Method, system and related apparatus for processing defaulter data

Info

Publication number
CN107798068A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710881910.9A
Other languages
Chinese (zh)
Inventor
肖宇涵
王黎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Win Win Information Technology Co Ltd
Original Assignee
Zhejiang Win Win Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Win Win Information Technology Co Ltd
Priority to CN201710881910.9A
Publication of CN107798068A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

This application discloses a method for processing defaulter data, comprising: crawling defaulter data from public defaulter databases or blacklist databases using web crawlers; selecting an extraction mode according to the record form of the defaulter data and extracting keyword information to obtain processed keywords; attributing the content data corresponding to each processed keyword to the user it belongs to, obtaining target defaulter data, and using it to build a defaulter information database; and determining whether a content-data anomaly exists in the defaulter information database and, if so, correcting the anomalous content data according to the priority order of the processed keywords. Related data can be crawled through multiple channels, the data processing is more reasonable, convenient and fast, and the data is compared and processed, so that the resulting defaulter data is of higher quality. The application also discloses a system, device and computer-readable storage medium for processing defaulter data, which have the same beneficial effects.

Description

Method, system and related apparatus for processing defaulter data
Technical field
This application relates to network data processing technology, and in particular to a method, system, device and computer-readable storage medium for processing defaulter data.
Background
The financial industry is growing rapidly, and finance is inseparable from risk control. Risk control can be called the lifeline of a financial institution, and this is especially true of internet finance, which pursues inclusiveness and efficiency. In the era of internet finance, efficiency has genuinely improved, but controlling fraud risk has become another challenge that financial institutions must face. An institution must not only examine the credit risk of loan applicants but also guard against fraud, and the risk-control blacklist (defaulter data) has become an important part of risk control at many financial institutions.
A blacklist lets a financial institution judge more simply and directly whether a person poses a fraud risk. The "risk-control blacklist" that financial institutions, and micro-credit institutions in particular, currently use is a general term covering many dimensions beyond the central bank's credit reporting system: lists of card numbers involved in black-card transactions, IP addresses and mobile phone numbers with a history of malicious transactions or orders, users with overdue micro-loans, users who have failed to repay P2P loans, court lists of dishonest judgment debtors, card theft, extortion, criminal penalties, and records of bad behavior on social networking platforms. The blacklisted users referred to today are generally people with a bad credit record in their history, for example users with overdue credit cards or overdue loans.
Because defaulter data is currently scattered across the internet, and each source has its own storage format and keyword names, high-quality defaulter data cannot be obtained by simple crawling. As a result, defaulter queries return a certain amount of erroneous content data, and the practical effect is poor.
Therefore, how to provide a mechanism for processing defaulter data that is reasonable, convenient and fast, yields higher-quality defaulter data, and supports continuous updating is a problem that those skilled in the art urgently need to solve.
Summary of the invention
The purpose of this application is to provide a method, system, device and computer-readable storage medium for processing defaulter data. Related data is crawled through multiple channels and handled with a more reasonable data processing approach and a more convenient automated processing flow, and the data obtained from the different channels is compared and screened, so that the resulting defaulter data is of higher quality.
To solve the above technical problem, this application provides a method for processing defaulter data, the method comprising:
crawling defaulter data from public defaulter databases or blacklist databases using web crawlers;
selecting a keyword extraction mode corresponding to the record form of the defaulter data and extracting keyword information, to obtain processed keywords in a unified format, wherein the record form includes at least table, picture and text;
attributing the content data corresponding to each processed keyword to the user it belongs to, to obtain target defaulter data, and building a defaulter information database from all the target defaulter data;
determining whether a content-data anomaly exists under the same user in the defaulter information database and, if so, correcting the anomalous content data according to the priority order of the processed keywords under that user.
Optionally, crawling defaulter data from public defaulter databases or blacklist databases using web crawlers comprises:
setting a different target network address and crawl identification parameters for each web crawler;
each web crawler using its crawl identification parameters at its target network address to identify target data among the data on the website, and obtaining the target data.
Optionally, selecting a keyword extraction mode corresponding to the record form of the defaulter data and extracting keyword information, to obtain processed keywords in a unified format, comprises:
identifying the actual record form of the defaulter data;
when the defaulter data is in table form, matching the text of the table's header row against a preset recognition keyword lexicon to obtain first matched data, and saving the first matched data to a preset path;
when the defaulter data is in picture form, obtaining the text in the picture using OCR character recognition, matching that text against the recognition keyword lexicon to obtain second matched data, and saving the second matched data to the preset path;
when the defaulter data is in text form, performing word segmentation and sentence segmentation using an NLP (natural language processing) algorithm to obtain valid information, matching the valid information against the recognition keyword lexicon to obtain third matched data, and saving the third matched data to the preset path.
Optionally, attributing the content data corresponding to each processed keyword to the user it belongs to, to obtain target defaulter data, and building a defaulter information database from all the target defaulter data, comprises:
uniquely determining a target user using ID card information and name, and collecting the content data corresponding to all processed keywords belonging to that target user into the target user's information set;
building the defaulter information database from all target users and their corresponding information sets.
Optionally, correcting the anomalous content data, if an anomaly exists, according to the priority order of the processed keywords under the same user comprises:
if information overlaps, assigning priorities to the sources of the overlapping information and retaining the content data from the highest-priority source;
if the crawled content data is incomplete, supplementing the content data according to the degree of incompleteness.
Optionally, the method further comprises:
when records share the same name and the corresponding ID card information is missing, determining whether the content data corresponding to more than a preset number of other processed keywords is identical;
if so, determining that the records belong to the same user and deleting the duplicate data.
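The duplicate check above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the field names (`name`, `id_number`, `phone`, `court`, `case_no`), the sample values and the threshold of 2 are all assumptions chosen for the example.

```python
def likely_same_user(a, b, min_matching_fields=2):
    """Return True when two records appear to describe one person.

    Hypothetical rule: same name, at least one ID number missing, and more
    than `min_matching_fields` other fields holding identical values.
    """
    if a.get("name") != b.get("name"):
        return False
    if a.get("id_number") and b.get("id_number"):
        # Both ID numbers are present, so the ID number alone decides.
        return a["id_number"] == b["id_number"]
    shared = [
        key for key in a.keys() & b.keys()
        if key not in ("name", "id_number") and a[key] == b[key]
    ]
    return len(shared) > min_matching_fields


# Illustrative records (all values are made up).
record_a = {"name": "张三", "id_number": "330102199001011234",
            "phone": "13800000000", "court": "Hangzhou", "case_no": "2017-001"}
record_b = {"name": "张三", "phone": "13800000000",
            "court": "Hangzhou", "case_no": "2017-001"}
record_c = {"name": "张三", "phone": "13911111111", "court": "Hangzhou"}
```

With these sample records, `record_a` and `record_b` share three identical fields besides the name and are treated as one user, while `record_c` shares only one and is kept separate.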
Optionally, the method further comprises:
recording the process and results of keyword extraction and generating an extraction log, so that the recognition algorithm can later be improved using the extraction log.
This application also provides a system for processing defaulter data, the system comprising:
a defaulter data crawling unit, configured to crawl defaulter data from public defaulter databases or blacklist databases using web crawlers;
an extraction and processing unit, configured to select a keyword extraction mode corresponding to the record form of the defaulter data and extract keyword information, to obtain processed keywords in a unified format, wherein the record form includes at least table, picture and text;
an attribution and database building unit, configured to attribute the content data corresponding to each processed keyword to the user it belongs to, to obtain target defaulter data, and to build a defaulter information database from all the target defaulter data;
an anomaly determination and correction unit, configured to determine whether a content-data anomaly exists under the same user in the defaulter information database and, if so, to correct the anomalous content data according to the priority order of the processed keywords under that user.
This application also provides a device for processing defaulter data, the device comprising:
a memory for storing a computer program;
a processor which, when executing the computer program, implements the steps of the method for processing defaulter data described above.
This application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for processing defaulter data described above.
In the method for processing defaulter data provided by this application, defaulter data is crawled from public defaulter databases or blacklist databases using web crawlers; a keyword extraction mode corresponding to the record form of the defaulter data is selected and keyword information is extracted, to obtain processed keywords in a unified format, wherein the record form includes at least table, picture and text; the content data corresponding to each processed keyword is attributed to the user it belongs to, to obtain target defaulter data, and a defaulter information database is built from all the target defaulter data; and it is determined whether a content-data anomaly exists under the same user in the defaulter information database and, if so, the anomalous content data is corrected according to the priority order of the processed keywords under that user.
In the technical solution provided by this application, multiple web crawlers crawl related data through multiple channels, so that the data sources are broader. A more reasonable data processing approach and a more convenient automated processing flow are used, and the data obtained from the different channels is compared and screened, so that the resulting defaulter data is of higher quality. This application also provides a system, device and computer-readable storage medium for processing defaulter data, which have the same beneficial effects and are not described again here.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for processing defaulter data provided by an embodiment of this application;
Fig. 2 is a flowchart of another method for processing defaulter data provided by an embodiment of this application;
Fig. 3 is a structural block diagram of a system for processing defaulter data provided by an embodiment of this application.
Detailed description
The core of this application is to provide a method, system, device and computer-readable storage medium for processing defaulter data. Multiple web crawlers crawl related data through multiple channels, so that the data sources are broader. A more reasonable data processing approach and a more convenient automated processing flow are used, and the data obtained from the different channels is compared and screened, so that the resulting defaulter data is of higher quality.
To make the purpose, technical solutions and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.
Referring to Fig. 1, Fig. 1 is a flowchart of a method for processing defaulter data provided by an embodiment of this application.
The method specifically includes the following steps:
S101: Crawl defaulter data from public defaulter databases or blacklist databases using web crawlers;
This step uses multiple web crawlers to crawl defaulter data from public defaulter databases or blacklist databases. As for choosing crawl targets, besides new blacklist pages found manually, some websites publish relevant blacklists from time to time. For example, many government websites publish lists of contract breaches, defaults or other financial crimes, and Toutiao (今日头条) also publishes "laolai" (deadbeat debtor) lists from time to time. In other words, the public defaulter databases or blacklist databases can be chosen and configured freely, and no specific limitation is imposed here.
A web crawler is a program that automatically extracts web page content. It downloads web pages from the World Wide Web for a search engine and is an important component of a search engine. A traditional crawler starts from the URLs (Uniform Resource Locators) of one or several initial pages, obtains the URLs on those pages, and, while crawling, continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it filters out links unrelated to the topic using a web page analysis algorithm, keeps the useful links and puts them into the queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a search strategy and repeats the process until a system condition is reached.
There are many ways to crawl defaulter data with a web crawler. Usually it is enough to set a target network address for the crawler; with no other settings, it crawls all data at the target network address. To exclude useless data as far as possible, crawl identification parameters can also be set for the crawler, so that it crawls data only when it finds content at the target network address that matches the crawl identification parameters. This can greatly reduce the storage space taken up by useless data.
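As a rough sketch of crawling with identification parameters, the example below filters already-downloaded text blocks rather than fetching pages, so it runs without network access. The `CrawlerConfig` class, the example URL and the identification terms are hypothetical; the patent does not specify how the parameters are represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerConfig:
    target_url: str              # hypothetical target network address
    identification_terms: tuple  # hypothetical crawl identification parameters


def select_matching_blocks(blocks, config):
    """Keep only text blocks that contain at least one identification term."""
    return [
        block for block in blocks
        if any(term in block for term in config.identification_terms)
    ]


config = CrawlerConfig(
    target_url="https://example.org/blacklist",
    identification_terms=("dishonest judgment debtor", "overdue"),
)

# Text blocks standing in for content fetched from config.target_url.
blocks = [
    "Site navigation | About us",
    "Zhang San, overdue loan, 2017-03",
    "List of dishonest judgment debtors: ...",
]
kept = select_matching_blocks(blocks, config)
```

Here the navigation block is dropped and only the two blocks that match an identification term are kept, which is the storage saving the paragraph above describes.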
S102: Select a keyword extraction mode corresponding to the record form of the defaulter data and extract keyword information, to obtain processed keywords in a unified format;
Building on S101, this step processes the defaulter data from each channel and in each format with the processing mode that corresponds to its record form, with the goal of obtaining processed keywords in a unified format.
In the web pages we see every day, data is commonly stored as tables, pictures or plain text. For various reasons, websites process data to meet their own requirements and therefore store it in different record forms. This application must process the data crawled from every channel so that all of it can be integrated in a unified format rather than simply piled together.
A corresponding keyword extraction mode is needed because Chinese has many ways of expressing the same thing. ID card information, for example, may appear as "ID card number", "identity card ID", "second-generation ID card", "second-generation ID card of the People's Republic of China" and other variants that all mean the same thing. Which wording a data source uses depends entirely on the preference of each website, so a common keyword must be extracted from these variants to determine whether they describe the same data content.
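One simple way to map such variants to a common keyword is a lookup table. The sketch below assumes a small hand-written variant list and the canonical names `id_number` and `name`; neither comes from the patent.

```python
# Hypothetical variant lists; a real lexicon would be maintained by administrators.
KEYWORD_VARIANTS = {
    "id_number": ["身份证号码", "身份证ID", "第二代身份证", "中华人民共和国第二代身份证"],
    "name": ["姓名", "名字"],
}


def build_lookup(variants):
    """Invert {canonical: [variants]} into {normalized variant: canonical}."""
    lookup = {}
    for canonical, names in variants.items():
        for variant in names:
            lookup[variant.strip().casefold()] = canonical
    return lookup


LOOKUP = build_lookup(KEYWORD_VARIANTS)


def normalize_keyword(raw):
    """Return the canonical keyword for a raw label, or None if unknown."""
    return LOOKUP.get(raw.strip().casefold())
```

Labels that differ only in surrounding whitespace or letter case resolve to the same canonical keyword, and unknown labels return `None` so they can be logged for later review.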
S103: Attribute the content data corresponding to each processed keyword to the user it belongs to, to obtain target defaulter data, and build a defaulter information database from all the target defaulter data;
Building on S102, this step attributes the content data corresponding to each processed keyword to the user it belongs to, obtains target defaulter data, and builds a defaulter information database from all the target defaulter data.
Attribution is needed because, in actual use, queries are made with identifying information about the target user. All processed keywords belonging to a user, and the content corresponding to those keywords, must therefore be attributed to that user. In other words, all data is organized per user, so that any processed keyword can be used to retrieve the other related content.
Once the target defaulter data is obtained, the defaulter information database is built from the target defaulter data of all users.
There are many ways to perform the attribution and build the defaulter information database. For example, the data can be divided not only by the user's identity information but also in combination with other key processed keywords. The database can be built with an existing database program, or that program can be modified to create a new form of information database as needed. No specific limitation is imposed here; any approach that achieves the purpose can be chosen after weighing the relevant factors in the actual situation.
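A minimal in-memory sketch of the attribution step is shown below, assuming each extracted record is a dictionary that already carries `id_number` and `name` fields. The field names and sample records are illustrative; a real system would likely use a database rather than nested dictionaries.

```python
from collections import defaultdict


def build_user_database(records):
    """Group field values under a (id_number, name) key, one entry per user."""
    database = defaultdict(lambda: defaultdict(list))
    for record in records:
        user_key = (record["id_number"], record["name"])
        for field, value in record.items():
            if field in ("id_number", "name"):
                continue
            # Keep each distinct value once, in order of arrival.
            if value not in database[user_key][field]:
                database[user_key][field].append(value)
    return {user: dict(fields) for user, fields in database.items()}


# Illustrative records from different sources (all values are made up).
records = [
    {"id_number": "330102199001011234", "name": "张三", "overdue_amount": "5000"},
    {"id_number": "330102199001011234", "name": "张三", "court": "Hangzhou"},
    {"id_number": "110101198501014321", "name": "李四", "overdue_amount": "800"},
]
database = build_user_database(records)
```

The two records for 张三 end up in a single information set, so a query on any one of that user's fields can reach the others.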
S104: Determine whether a content-data anomaly exists under the same user in the defaulter information database;
Building on the defaulter information database established in S103, this step determines whether a content-data anomaly exists under each user in the database. Such anomalies take many forms. The data crawled from different channels is not necessarily identical: different sites emphasize different aspects of the same list and process it differently, so the resulting data also differs. Anomalies such as duplicate data, missing key data, and errors in data under the same name therefore inevitably occur. Handling these anomalies is very important and is a key step in determining the quality of the resulting defaulter information database.
There are many ways to resolve these anomalies. For example, one can specify which processed keywords to retain, choose by data source channel, choose by the importance of the missing information, decide by the priority of each processed keyword, or decide by the amount of missing information. Different users make different decisions in different usage scenarios. No specific limitation is imposed here; the most suitable way of handling data anomalies can be chosen after weighing the relevant factors in the actual situation, so as to maximize the data quality of the defaulter information database.
S105: Correct the anomalous content data according to the priority order of the processed keywords under the same user.
Building on S104, this application corrects anomalous content data according to the priority order of the processed keywords under the same user. How that priority order is set depends on various factors in practice. The priority order is not necessarily the same in every application scenario, and there is no need to seek a universal one; it should be adapted to the actual situation so that the correction is made in the most suitable way. In particular, external special requirements sometimes apply, and when they must be met the priority order should be reconsidered and changed accordingly.
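The overlap case can be illustrated with a priority table over data sources, which is one of the options the summary mentions for overlapping information. The source names, their ranking and the sample values below are assumptions made for the example, since the text deliberately leaves the priority order to each deployment.

```python
# Hypothetical ranking: a higher number means a more trusted source.
SOURCE_PRIORITY = {"court_list": 3, "lending_platform": 2, "news_site": 1}


def correct_conflicts(observations):
    """Pick one value per field, preferring the highest-priority source.

    `observations` is a list of (field, value, source) tuples for one user.
    Unknown sources rank lowest; ties keep the value seen first.
    """
    best = {}
    for field, value, source in observations:
        rank = SOURCE_PRIORITY.get(source, 0)
        if field not in best or rank > best[field][0]:
            best[field] = (rank, value)
    return {field: value for field, (rank, value) in best.items()}


observations = [
    ("overdue_amount", "5000", "news_site"),
    ("overdue_amount", "5200", "court_list"),
    ("phone", "13800000000", "lending_platform"),
]
corrected = correct_conflicts(observations)
```

In this example the court list outranks the news site, so its `overdue_amount` value is the one retained. Changing `SOURCE_PRIORITY` is enough to adapt the rule to a different scenario.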
Further, completing the defaulter information database does not mean that it is perfect and needs no modification. The above process can be recorded in corresponding log files so that each stage can later be analyzed from the logs, in order to update individual steps, add new web crawlers, extend the data sources, and so on, continually improving the soundness of each step and the data quality of the defaulter information database.
Further, judgment tasks that the system cannot complete can be sent to administrators through a specific path, so that they can be handled using the administrators' judgment, and subsequent processing is carried out according to the administrators' decision.
Based on the above technical solution, in the method for processing defaulter data provided by this embodiment, multiple web crawlers crawl related data through multiple channels, so that the data sources are broader. A more reasonable data processing approach and a more convenient automated processing flow are used, and the data obtained from the different channels is compared and screened, so that the resulting defaulter data is of higher quality.
Referring to Fig. 2, Fig. 2 is a flowchart of another method for processing defaulter data provided by an embodiment of this application.
The method specifically includes the following steps:
S201: Set a different target network address and crawl identification parameters for each web crawler;
S202: Each web crawler uses its crawl identification parameters at its target network address to identify target data among the data on the website, and obtains the target data;
S201 and S202 set a different target network address and crawl identification parameters for each web crawler, and each crawler uses its configured crawl identification parameters at its target network address to identify target data among the data on the website, thereby obtaining the target data.
Further, because target network addresses differ, slightly different crawl identification parameters may need to be set for each web crawler in order to crawl the target data better.
S203: Identify the actual record form of the defaulter data;
This embodiment covers the three most common record forms, table, picture and plain text, together with the corresponding processing in the subsequent steps.
S204: Match the text of the table's header row against the preset recognition keyword lexicon to obtain first matched data, and save the first matched data to the preset path;
This step applies when S203 recognizes the record form as a table. It matches the text of the table's header against the preset recognition keyword lexicon, obtains the first matched data, and saves it to the preset path. The header of a table usually records the keyword for its column or row, that is, the kind of content data it holds, and the header is usually the first row or first column of the table, which makes it easy to identify.
The preset recognition keyword lexicon is a lexicon set up in advance by administrators for data matching. After a successful match, the matched data is saved to the preset path so that, together with the other data, it forms the defaulter information database.
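The table case might look like the following sketch, which assumes the header is the first row (the first-column layout mentioned above is not handled) and that the table has already been parsed into a list of rows. The lexicon entries and canonical field names are hypothetical.

```python
# Hypothetical recognition keyword lexicon: header text -> canonical keyword.
HEADER_LEXICON = {
    "姓名": "name",
    "被执行人姓名": "name",
    "身份证号码": "id_number",
    "证件号": "id_number",
}


def table_to_records(rows):
    """Convert a parsed table into records keyed by matched header keywords.

    Columns whose header is not in the lexicon are skipped.
    """
    header, *body = rows
    columns = [HEADER_LEXICON.get(cell.strip()) for cell in header]
    records = []
    for row in body:
        record = {
            key: cell.strip()
            for key, cell in zip(columns, row)
            if key is not None
        }
        records.append(record)
    return records


# Illustrative table (all values are made up).
table = [
    ["被执行人姓名", "证件号", "备注"],
    ["张三", "330102199001011234", "无"],
    ["李四", "110101198501014321", "无"],
]
first_matched = table_to_records(table)
```

The unmatched `备注` column is dropped, and each remaining cell is stored under its canonical keyword, giving records in a unified format regardless of the site's own header wording.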
S205: Obtain the text in the picture using OCR character recognition, match that text against the recognition keyword lexicon to obtain second matched data, and save the second matched data to the preset path;
This step applies when S203 recognizes the record form as a picture. It obtains the text in the picture using OCR character recognition, matches that text against the recognition keyword lexicon, obtains the second matched data, and saves it to the preset path.
OCR (Optical Character Recognition) is the process by which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using character recognition methods. That is, for printed characters, the text in a paper document is optically converted into a black-and-white bitmap image file, and recognition software converts the text in the image into a text format that word processing software can further edit. It is a widely used technology.
After the text in the picture is obtained using OCR, processing continues as in S204: the text is matched against the preset recognition keyword lexicon and the result is saved.
S206:Carry out segmenting punctuate processing using NLP natural language processings algorithm, effective information is obtained, by effective information Matched in keywords database is identified, obtain the 3rd matched data, and the 3rd matched data is preserved to preset path;
The foundation of this step is on the basis of S203 recognition result is the record form of common language, it is intended to using NLP certainly Right Language Processing algorithm carries out segmenting punctuate processing, obtains effective information, effective information is carried out in keywords database is identified Match somebody with somebody, obtain the 3rd matched data, and the 3rd matched data is preserved to preset path.
Why this step is different from S204 and S205 processing mode, be since it is considered that some websites extract it is general Logical text information can obtain the punctuate of mistake and a company very long, in the absence of punctuate on punctuate or punctuate information crawler mistake String literal, face two ways untreated keyword can not be directly obtained directly up, and be carried out with default identification keywords database Matching, it is also necessary to made pauses in reading unpunctuated ancient writings, segmented by NLP (Natural Language Processing, natural language processing) etc. Reason, matched after the effective information after being handled, then with default identification keywords database.
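As a rough illustration of why segmentation is needed, the sketch below splits a run of text at known field labels. It uses a regular expression as a simple stand-in for an NLP segmenter, which the source does not specify; the labels in `FIELD_LABELS` and the field names are assumptions.

```python
import re

# Hypothetical labels that may appear in crawled text, mapped to canonical fields.
FIELD_LABELS = {"name": "name", "id card": "id_card", "reason": "reason"}


def segment(text):
    """Split text at known labels and return {field: value}.

    Works whether or not the text carries punctuation, because the labels
    themselves act as boundaries. A label that also occurs inside a value
    would be split incorrectly; a real segmenter would need to handle that.
    """
    labels = sorted(FIELD_LABELS, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels), re.IGNORECASE)
    matches = list(pattern.finditer(text))
    record = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Strip separators (ASCII and full-width) left around the value.
        value = text[m.end():end].strip(" :;,.\t\n:;,。")
        record[FIELD_LABELS[m.group().lower()]] = value
    return record
```

The same function handles both `Name:Wang Er,ID card:xxx;Reason:xxxx` and the unpunctuated form of that text, which is the case this step is meant to cover.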
S207: Uniquely determine a target user using ID card information and name, and gather the content data corresponding to all processed keywords belonging to that target user into the target user's information set;
After the uniformly formatted processed keywords have been obtained in S204, S205 and S206, this step uses ID card information and name to uniquely determine a target user, and gathers the content data corresponding to all processed keywords belonging to that user into the user's information set. ID card information and name are chosen because, under China's current system, a unique ID card number corresponds to exactly one person, and the name serves as a secondary verification parameter.
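The grouping in S207 can be sketched as below, assuming each record is a dict with hypothetical `id_card` and `name` keys. Skipping records that lack either key is an assumption made here for illustration; the source handles such records in its later anomaly steps.

```python
from collections import defaultdict


def group_by_user(records):
    """Collect records under the (id_card, name) pair that identifies a user."""
    users = defaultdict(list)
    for record in records:
        key = (record.get("id_card"), record.get("name"))
        if None in key:
            # The user cannot be identified uniquely; leave for anomaly handling.
            continue
        users[key].append(record)
    return dict(users)
```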
S208: Build the dishonest-user information base from all target users and their corresponding information sets;
S209: Judge whether a content data anomaly occurs under the same user in the dishonest-user information base;
S210: Judge whether the anomaly is information overlap or incomplete crawling of content data;
This step applies when the judgment in S209 is that a content data anomaly exists. It further determines whether the anomaly is specifically information overlap or incomplete crawling of content data.
S211: Retain the content data from the highest-priority source according to the priority ranking of the sources of the overlapping information;
This step applies when the judgment in S210 is information overlap. The content data from the highest-priority source is retained according to the priority ranking of the sources of the overlapping information. Sources differ: some data may come from the websites of government bodies, some from well-recognized private websites, and some from obscure small websites. When information overlaps, records are retained or deleted in order of the recognition priority of their sources.
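A minimal sketch of the selection in S211 follows. The three source categories mirror the ones named in the text, but the numeric priorities and the `source_type` key are assumptions.

```python
# Hypothetical priorities: a higher number means a more trusted source.
SOURCE_PRIORITY = {"government": 3, "recognized_private": 2, "unknown": 1}


def keep_highest_priority(overlapping):
    """Return the overlapping record whose source has the highest priority.

    Sources missing from SOURCE_PRIORITY are treated as priority 0.
    """
    return max(overlapping, key=lambda r: SOURCE_PRIORITY.get(r["source_type"], 0))
```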
S212: Compensate the content data through the preset compensation mechanism corresponding to the degree of incompleteness.
This step applies when the judgment in S210 is that the content data was crawled incompletely. It compensates the content data through the preset compensation mechanism that corresponds to the degree of incompleteness. The degree of incompleteness can be determined in many ways. For example, different weights can be assigned to the processed keywords according to some policy, and the corresponding degree of incompleteness computed from those weights. If a group of data lacks both the ID card number and the name, its degree of incompleteness can be regarded as the highest, because the identity of the target user cannot be determined, and compensating the data in that case is very difficult. No specific limitation is imposed here: administrators can assign the weights according to the requirements of the actual situation, the degree can be computed accordingly, and the data can be compensated by the corresponding mechanism, so as to improve the data quality of the dishonest-user information base as far as possible.
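One way to compute a weighted degree of incompleteness is sketched below. The weights, field names and the three compensation labels are hypothetical; the source leaves all of them to the administrator.

```python
# Hypothetical integer weights (summing to 100) for each expected field.
FIELD_WEIGHTS = {"id_card": 40, "name": 30, "phone": 10, "residence": 10, "reason": 10}


def incompleteness(record):
    """Sum the weights of all expected fields that are missing or empty."""
    return sum(w for f, w in FIELD_WEIGHTS.items() if not record.get(f))


def choose_compensation(record):
    """Pick a (hypothetical) compensation mechanism from the missing fields."""
    if incompleteness(record) == 0:
        return "none"
    if not record.get("id_card") and not record.get("name"):
        # Identity cannot be determined, so automatic compensation is impractical.
        return "manual_review"
    return "fill_from_other_sources"
```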
Based on the above technical solution, the processing method for dishonest-user data provided by the embodiments of this application crawls related data from multiple channels using multiple web crawlers, so the data sources are broader. It uses a more reasonable data processing scheme with a more convenient automated processing flow, and it compares and screens the data obtained from the various channels, so that the resulting dishonest-user data is of higher quality.
Because actual situations are complex, they cannot all be enumerated here. Those skilled in the art will recognize that many examples can be derived by combining the basic method principles provided in this application with actual conditions; such examples, obtained without creative effort, fall within the protection scope of this application.
Refer now to Fig. 3, which is a structural block diagram of a processing system for dishonest-user data provided by an embodiment of this application.
The processing system may include:
A dishonest-data crawling unit 100, configured to crawl dishonest data 200 from public dishonesty databases or blacklist databases using web crawlers;
An extraction and processing unit 300, configured to select the keyword extraction mode corresponding to the record form of the dishonest data and extract keyword information, obtaining processed keywords in a unified format, wherein the record form comprises at least table, picture and text;
An ownership division and base establishment unit 400, configured to assign the content data corresponding to each processed keyword to its owning user, obtaining target dishonest data, and to establish the dishonest-user information base from all the target dishonest data;
An anomaly judgment and correction unit 500, configured to judge whether a content data anomaly occurs under the same user in the dishonest-user information base and, if so, to correct the anomalous content data according to the priority order of the processed keywords under the same user.
The above units can be applied in the following concrete example:
Step 1: Selecting the web pages to crawl. Besides manually finding new blacklist web pages, note that some websites publish related blacklists at irregular intervals. For example, many government websites publish lists of contract breaches, dishonesty or other financial offences, and Jinri Toutiao (Today's Headlines) also publishes "Lao Lai" (deadbeat debtor) lists from time to time. For this latter class of website, a crawler periodically (for example, every N days) fetches the relevant URLs and collects the returned response information, or collects the JSON returned by RESTful APIs; at first, mainly the page titles are collected. The collected titles and similar data are compared with key words preset in the database (such as "blacklist" and "dishonest"); if the relevant rules are satisfied, the information on that web page is then read and collected;
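The title screening in Step 1 can be illustrated with a small filter. This sketch assumes the crawler has already produced `(url, title)` pairs; the keyword tuple and the simple "contains any keyword" rule are assumptions, since the source only says the titles must satisfy "the relevant rules".

```python
# Hypothetical preset key words, taken from the examples in the text.
TITLE_KEYWORDS = ("blacklist", "dishonest")


def select_pages(pages, keywords=TITLE_KEYWORDS):
    """Return the URLs whose page title contains at least one preset key word."""
    selected = []
    for url, title in pages:
        lowered = title.lower()
        if any(keyword in lowered for keyword in keywords):
            selected.append(url)
    return selected
```

Only the pages returned here would then be fetched in full, which keeps the periodic crawl cheap.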
Step 2: The blacklists published on web pages usually take one of three forms: table, picture, and ordinary text (for example, Name: Wang Er; ID card: xxx; Reason for dishonesty: xxxx).
First: the table form is the easiest to handle. The crawler compares the text in the table's header row or header column with the keywords preset in the database. The preset fields must be rich enough to cover the various ways of expressing the same thing, for example ID card, ID card number, ID number, and second-generation resident identity card of the People's Republic of China. The data in the table is finally converted into a form that fits the database structure, as in Table 1 below;
Table 1: Format conforming to the database structure
Second: for pictures, the relevant OCR character recognition technology is applied first, and the subsequent processing is as above;
Third: for ordinary text, an NLP (natural language processing) algorithm splits the text into sentences to obtain the effective information, which is finally imported into the database.
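The three branches of Step 2 can be wired together as a simple dispatcher. This is only a sketch: the parser and OCR functions are passed in as hypothetical callables rather than tied to any real library, and routing OCR output into the text parser is an assumption made here.

```python
def make_extractor(parse_table, run_ocr, parse_text):
    """Build an extract(record_form, payload) function from three callables.

    parse_table: handles a list of table rows.
    run_ocr:     turns image data into a string (any OCR engine could sit here).
    parse_text:  handles ordinary text.
    """
    handlers = {
        "table": parse_table,
        "picture": lambda image: parse_text(run_ocr(image)),
        "text": parse_text,
    }

    def extract(record_form, payload):
        if record_form not in handlers:
            raise ValueError(f"unsupported record form: {record_form!r}")
        return handlers[record_form](payload)

    return extract
```

Injecting the OCR and parsing functions keeps the dispatcher testable without an OCR engine installed.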
Step 3: The above process includes an outlier-handling procedure for abnormal data, for example an ID card number that is not 18 characters long, a mobile phone number that begins with 9, or a province name that is not in the list of Chinese province names. Such data may result from automatic machine recognition, for example errors in OCR text recognition or in NLP sentence segmentation. This data is not entirely worthless and can be used in the following two ways:
Mode one: record the information source and then compare manually, so that important data is not lost;
Mode two: analyze the causes of the recognition errors to train and improve the NLP and OCR algorithms, so that recognition accuracy keeps improving.
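The three example checks in Step 3 can be sketched as a validator that returns the list of problems found, so the record and its source can be queued for manual comparison. The field names are hypothetical and `PROVINCES` is deliberately a short illustrative subset, not the full list.

```python
# Illustrative subset only; a real check would use the complete province list.
PROVINCES = {"Zhejiang", "Beijing", "Guangdong"}


def find_anomalies(record):
    """Return labels for each rule from Step 3 that the record violates."""
    problems = []
    id_card = record.get("id_card")
    if id_card is not None and len(id_card) != 18:
        problems.append("id_card_length")
    phone = record.get("phone")
    if phone is not None and phone.startswith("9"):
        problems.append("phone_prefix")
    province = record.get("province")
    if province is not None and province not in PROVINCES:
        problems.append("province_unknown")
    return problems
```

Returning labels rather than a single boolean preserves the cause of each failure, which is what Mode two needs in order to analyze recognition errors.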
Step 4: Because the credibility and the level of detail of information disclosed on the network are uneven, the crawled blacklists suffer from incomplete information and duplicated entries, so a key data comparison process is needed.
Duplicate judgment: first determine whether records are duplicates by assigning weights to the categorized data. The ID card is the first weight: if the ID cards in two blacklists are identical, the entries can be deemed the same person. Several other fields are the second weight, such as name, mobile phone number, residence and employer. If ID card information is missing and N items (N>1) of the second-weight fields are all equal across the two blacklists, the two entries can also be deemed the same person.
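The two-tier duplicate rule can be sketched as follows. The field names and the default threshold `n=2` are assumptions, as is treating two records with different, non-empty ID cards as different people; the source only states the positive cases.

```python
# Hypothetical second-weight fields, following the examples in the text.
SECOND_WEIGHT_FIELDS = ("name", "phone", "residence", "employer")


def is_same_person(a, b, n=2):
    """Apply the first-weight (ID card) rule, then the second-weight rule."""
    id_a, id_b = a.get("id_card"), b.get("id_card")
    if id_a and id_b:
        # First weight: the ID card alone decides when both records have one.
        return id_a == id_b
    # Second weight: count non-empty fields that agree between the records.
    matches = sum(
        1 for field in SECOND_WEIGHT_FIELDS
        if a.get(field) and a.get(field) == b.get(field)
    )
    return matches >= n
```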
Non-duplicate items are stored directly: the organized data is stored in the database as is.
Duplicate handling: one of the records is not simply deleted. Instead, the records are compared so that the information about the same person complements itself. If the information about the same person differs between blacklists, the credibility of the data sources must be compared and the data most likely to be true is chosen. Weights can be set in two ways: (1) manually, for example giving government websites the highest weight so that, when their data differs from that of other websites, the government website's data is chosen; (2) by reading website data and comparing weights, for example choosing the data of the website with the higher traffic, or of the website whose linking websites have higher traffic.
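A field-by-field merge that implements this complementary behaviour might look like the sketch below. The `source` key and the `source_weight` mapping are assumptions; the weights could come from either of the two methods described above.

```python
def merge_records(records, source_weight):
    """Merge duplicate records of one person into a single record.

    Empty fields are filled from whichever record has a value; when two
    records disagree, the value from the higher-weighted source wins.
    """
    merged = {}
    best_weight = {}
    for record in records:
        weight = source_weight.get(record["source"], 0)
        for field, value in record.items():
            if field == "source" or value in (None, ""):
                continue
            if field not in merged or weight > best_weight[field]:
                merged[field] = value
                best_weight[field] = weight
    return merged
```

Because each field tracks the weight of the source it came from, the result does not depend on the order in which the duplicate records are supplied, provided the source weights are distinct.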
Step 5: The rules and algorithms in the above system must be continuously optimized in operation to improve accuracy and to crawl as much valuable list information as possible.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each particular application, but such implementations should not be considered to go beyond the scope of this application.
Specific examples are used herein to explain the principles and implementations of this application; the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to this application without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of this application.
It should also be noted that, in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises that element.

Claims (10)

  1. A processing method for dishonest-user data, characterized by comprising:
    crawling dishonest data from public dishonesty databases or blacklist databases using web crawlers;
    selecting the keyword extraction mode corresponding to the record form of the dishonest data and extracting keyword information to obtain processed keywords in a unified format, wherein the record form comprises at least table, picture and text;
    assigning the content data corresponding to each processed keyword to its owning user to obtain target dishonest data, and establishing a dishonest-user information base from all the target dishonest data;
    judging whether a content data anomaly occurs under the same user in the dishonest-user information base and, if so, correcting the anomalous content data according to the priority order of the processed keywords under the same user.
  2. The processing method according to claim 1, characterized in that crawling dishonest data from public dishonesty databases or blacklist databases using web crawlers comprises:
    setting a different target network address and crawl identification parameter for each web crawler;
    each web crawler identifying target data among the website data at its corresponding target network address using the crawl identification parameter, to obtain the target data.
  3. The processing method according to claim 2, characterized in that selecting the keyword extraction mode corresponding to the record form of the dishonest data and extracting keyword information to obtain processed keywords in a unified format comprises:
    identifying the actual record form of the dishonest data;
    when the dishonest data is in table form, matching the text of the table's header against a preset identification keyword database to obtain first matched data, and saving the first matched data under a preset path;
    when the dishonest data is in picture form, obtaining the text in the picture using OCR character recognition, matching the text in the picture against the identification keyword database to obtain second matched data, and saving the second matched data under the preset path;
    when the dishonest data is in text form, performing word and sentence segmentation using an NLP natural language processing algorithm to obtain effective information, matching the effective information against the identification keyword database to obtain third matched data, and saving the third matched data under the preset path.
  4. The processing method according to claim 3, characterized in that assigning the content data corresponding to each processed keyword to its owning user to obtain target dishonest data, and establishing a dishonest-user information base from all the target dishonest data, comprises:
    uniquely determining a target user using ID card information and name, and gathering the content data corresponding to all processed keywords belonging to the target user into the target user's information set;
    building the dishonest-user information base from all target users and their corresponding information sets.
  5. The processing method according to claim 4, characterized in that, if an anomaly occurs, correcting the anomalous content data according to the priority order of the processed keywords under the same user comprises:
    if information overlap occurs, retaining the content data from the highest-priority source according to the priority ranking of the sources of the overlapping information;
    if the content data was crawled incompletely, compensating the content data through the preset compensation mechanism corresponding to the degree of incompleteness.
  6. The processing method according to claim 5, characterized by further comprising:
    when records share the same name and the corresponding ID card information is missing, judging whether the content data corresponding to more than a preset number of other processed keywords is identical;
    if so, determining that the records belong to the same user and deleting the duplicate data.
  7. The processing method according to claim 6, characterized by further comprising:
    recording the extraction process and results of the keyword information and generating an extraction log, so that the extraction log can subsequently be used to improve the recognition algorithm.
  8. A processing system for dishonest-user data, characterized by comprising:
    a dishonest-data crawling unit, configured to crawl dishonest data from public dishonesty databases or blacklist databases using web crawlers;
    an extraction and processing unit, configured to select the keyword extraction mode corresponding to the record form of the dishonest data and extract keyword information, obtaining processed keywords in a unified format, wherein the record form comprises at least table, picture and text;
    an ownership division and base establishment unit, configured to assign the content data corresponding to each processed keyword to its owning user, obtaining target dishonest data, and to establish a dishonest-user information base from all the target dishonest data;
    an anomaly judgment and correction unit, configured to judge whether a content data anomaly occurs under the same user in the dishonest-user information base and, if so, to correct the anomalous content data according to the priority order of the processed keywords under the same user.
  9. A processing device for dishonest-user data, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement, when executing the computer program, the steps of the processing method for dishonest-user data according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the processing method for dishonest-user data according to any one of claims 1 to 7.
CN201710881910.9A 2017-09-26 2017-09-26 A kind of processing method, system and the relevant apparatus of user data of breaking one's promise Pending CN107798068A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710881910.9A CN107798068A (en) 2017-09-26 2017-09-26 A kind of processing method, system and the relevant apparatus of user data of breaking one's promise

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710881910.9A CN107798068A (en) 2017-09-26 2017-09-26 A kind of processing method, system and the relevant apparatus of user data of breaking one's promise

Publications (1)

Publication Number Publication Date
CN107798068A true CN107798068A (en) 2018-03-13

Family

ID=61532399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710881910.9A Pending CN107798068A (en) 2017-09-26 2017-09-26 A kind of processing method, system and the relevant apparatus of user data of breaking one's promise

Country Status (1)

Country Link
CN (1) CN107798068A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180313