CN107329969A - A data information update system and method based on repeated verification - Google Patents

Info

Publication number
CN107329969A
Authority
CN
China
Prior art keywords
data
data message
multiplicity
preliminary
data information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710367303.0A
Other languages
Chinese (zh)
Inventor
周钰徐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Intellectual Property Mdt Infotech Ltd
Original Assignee
Hefei Intellectual Property Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-05-23
Filing date
2017-05-23
Publication date
2017-11-07
Application filed by Hefei Intellectual Property Mdt Infotech Ltd
Priority to CN201710367303.0A
Publication of CN107329969A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating

Abstract

The invention discloses a data information update system and method based on repeated verification. The system comprises: a data acquisition module, configured to obtain data information using a web crawler; a first verification module, configured to perform a preliminary check on the data information obtained by the web crawler within a preset time, to obtain a preliminary data information set; a second verification module, configured to perform a duplication check of the data information in the preliminary data information set against a preset data information database; a removal module, configured to delete from the preliminary data information set the data information that fails the duplication check; and an update module, configured to add the data information that passes the duplication check to the preset data information database, thereby updating it. In this way, a similarity check is applied after the web crawler obtains a large volume of data information, and highly similar data information obtained in real time is removed. This avoids keeping duplicate crawled data information, reduces the amount of data information, and improves the efficiency of the duplication check.

Description

A data information update system and method based on repeated verification
Technical field
The present invention relates to the technical field of data verification, and in particular to a data information update system and method based on repeated verification.
Background
With the explosive growth of internet information, the volume of data information on the internet grows geometrically every day, and users looking for the data information they need are often buried in large amounts of useless, duplicated information. Obtaining data information of interest through a search engine is currently the convenient approach favored by most users. The web crawler, one of the basic components of a search engine, must fetch data information from the internet in order to supply it to users. However, because internet content is freely reposted and published across many websites, whether the data information a crawler obtains is rich, and whether its similarity and overlap are high, is closely tied to the crawler's efficiency. To increase crawl speed, crawlers usually work in parallel, which introduces new problems. The first is duplication: crawlers or crawl threads running in parallel add duplicate pages as they run. The second is quality: when running in parallel, each crawler or crawl thread can fetch only part of the pages, so page quality declines.
Existing database duplicate checking must work against a very large database, so as the amount of crawled data information grows, the workload becomes heavy and the algorithm becomes less efficient.
Summary of the invention
In view of the technical problems described in the background, the present invention proposes a data information update system and method based on repeated verification.
The data information update system based on repeated verification proposed by the present invention comprises:
a data acquisition module, configured to obtain data information using a web crawler;
a first verification module, configured to perform a preliminary check on the data information obtained by the web crawler within a preset time, to obtain a preliminary data information set;
a second verification module, configured to compare the data information in the preliminary data information set with a preset data information database and to determine whether each piece of data information passes verification;
a removal module, configured to delete from the preliminary data information set the data information that fails the duplication check;
an update module, configured to add the data information in the preliminary data information set that passes the duplication check to the preset data information database, thereby updating the preset data information database.
Preferably, the first verification module is specifically configured to:
compare the data information obtained by the web crawler within the preset time with one another;
for any two pieces of data information whose similarity is greater than a preset similarity value, delete one of them;
after all the data information has been compared with one another, obtain the preliminary data information set.
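The preliminary check described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not name a similarity measure, so `difflib.SequenceMatcher` stands in for it, and the function names and the 0.8 preset similarity value are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1]; SequenceMatcher is an assumed stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def preliminary_check(items: list[str], preset_similarity: float = 0.8) -> list[str]:
    """Return the preliminary set: of any two overly similar items, keep only the first."""
    kept: list[str] = []
    for item in items:
        # Keep the item only if it is not too similar to anything already kept.
        if all(similarity(item, k) <= preset_similarity for k in kept):
            kept.append(item)
    return kept


batch = ["apple pie recipe", "apple pie recipe!", "stock market news"]
print(preliminary_check(batch))
```

Keeping the earlier of two similar items is a design choice made for the sketch; the text only requires that one of the pair be deleted.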
Preferably, the second verification module is specifically configured to:
take one piece of data information from the preliminary data information set;
compare that data information with the preset data information in the preset data information database to obtain a duplication value, and determine from the duplication value and a preset threshold whether the duplication check succeeds: when the duplication value is greater than the preset threshold, the duplication check for that data information fails; otherwise, it succeeds. The duplication check is complete once every piece of data information in the preliminary data information set has been compared.
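A minimal sketch of this duplication check is given below. The source does not define how the duplication value is computed, so two assumptions are made and labeled in the code: the value is the Jaccard overlap of word sets, and an item's value against the database is the maximum over all stored items. The names and the 0.5 threshold are hypothetical.

```python
def duplication_value(item: str, stored: str) -> float:
    """Jaccard overlap of word sets (an assumed measure; the source names none)."""
    a, b = set(item.lower().split()), set(stored.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def duplication_check(preliminary, database, preset_threshold=0.5):
    """Split the preliminary set into items that pass and items that fail the check."""
    passed, failed = [], []
    for item in preliminary:
        # Assumption: compare against every stored item and keep the highest value.
        value = max((duplication_value(item, s) for s in database), default=0.0)
        # The check fails only when the value is strictly greater than the threshold.
        (failed if value > preset_threshold else passed).append(item)
    return passed, failed


database = ["crawler fetches web pages"]
preliminary = ["crawler fetches web pages daily", "patent filing deadline extended"]
passed, failed = duplication_check(preliminary, database)
print(passed, failed)
```

The first item shares four of five distinct words with the stored entry (value 0.8), so it fails; the second shares none and passes.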
Preferably, the data acquisition module comprises multiple data acquisition submodules.
Preferably, the system further comprises a data information push module, configured to push the data information added to the preset data information database to users.
A data information update method based on repeated verification comprises:
S1: obtaining data information using a web crawler;
S2: performing a preliminary check on the data information obtained by the web crawler within a preset time, to obtain a preliminary data information set;
S3: comparing the data information in the preliminary data information set with a preset data information database and determining whether each piece of data information passes verification;
S4: deleting from the preliminary data information set the data information that fails the duplication check;
S5: adding the data information in the preliminary data information set that passes the duplication check to the preset data information database, thereby updating the preset data information database.
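Steps S1 to S5 can be read as a single pass over one batch of crawled data. The sketch below wires the steps together with pluggable callables; the function name, the parameters, and the exact-match stand-ins in the usage example are all hypothetical and are not taken from the source.

```python
from typing import Callable, Iterable


def update_database(
    crawl: Callable[[], Iterable[str]],
    preliminary_check: Callable[[list[str]], list[str]],
    passes_duplication_check: Callable[[str, list[str]], bool],
    database: list[str],
) -> list[str]:
    """Run S1 to S5 once and return the items added to the database."""
    crawled = list(crawl())                                   # S1: obtain data
    preliminary = preliminary_check(crawled)                  # S2: preliminary set
    verdicts = [(item, passes_duplication_check(item, database))
                for item in preliminary]                      # S3: judge each item
    preliminary = [item for item, ok in verdicts if ok]       # S4: drop failures
    database.extend(preliminary)                              # S5: update database
    return preliminary


# Usage with exact-match stand-ins for both checks.
db = ["a"]
added = update_database(
    crawl=lambda: ["a", "b", "b", "c"],
    preliminary_check=lambda items: list(dict.fromkeys(items)),
    passes_duplication_check=lambda item, stored: item not in stored,
    database=db,
)
print(added, db)
```

All verdicts in S3 are computed before the database is modified, so S5 never influences the check of items from the same batch.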
Preferably, step S2 specifically comprises:
comparing the data information obtained by the web crawler within the preset time with one another;
for any two pieces of data information whose similarity is greater than a preset similarity value, deleting one of them;
after all the data information has been compared with one another, obtaining the preliminary data information set.
Preferably, step S3 specifically comprises:
S31: taking one piece of data information from the preliminary data information set;
S32: comparing that data information with the preset data information in the preset data information database to obtain a duplication value, and determining from the duplication value and a preset threshold whether the duplication check succeeds; when the duplication value is greater than the preset threshold, the duplication check for that data information fails; otherwise, it succeeds;
S33: repeating steps S31 and S32 until every piece of data information in the preliminary data information set has been compared.
Preferably, in step S1, data information is obtained using multiple web crawlers.
Preferably, the method further comprises step S6: pushing the data information added to the preset data information database to users.
The present invention performs a preliminary check on the data information obtained by multiple web crawlers: for any two pieces of crawled data information whose similarity is greater than the preset similarity value, one is filtered out, yielding the preliminary data information set. Each piece of data information in the preliminary data information set is then compared with the preset data information in the preset data information database, and whether the duplication check succeeds is determined from the duplication value and the preset threshold. Data information that fails the duplication check is deleted; data information that passes is added to the preset data information database. In this way, a similarity check is first applied to the large volume of data information obtained by multiple web crawlers, and highly similar data information obtained in real time is removed. This avoids keeping duplicate crawled data information and reduces the amount of data information, which improves the efficiency of the duplication check. During the duplication check, data information with a high duplication value is deleted, reducing overlap among data information, and data information with a low duplication value is added to the preset data information database, improving the timeliness of the duplication check.
Brief description of the drawings
Fig. 1 is a module diagram of the data information update system based on repeated verification proposed by the present invention;
Fig. 2 is a flow diagram of the data information update method based on repeated verification proposed by the present invention.
Detailed description
As shown in Fig. 1, Fig. 1 is a module diagram of the data information update system based on repeated verification proposed by the present invention.
Referring to Fig. 1, the data information update system based on repeated verification proposed by the present invention comprises:
A data acquisition module, configured to obtain data information using a web crawler.
In a specific implementation, the data acquisition module may comprise multiple data acquisition submodules, and each data acquisition submodule may use multiple web crawlers to obtain data. The web crawlers collect various kinds of information according to the information gathering and analysis targets.
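Running several crawlers side by side might look like the sketch below. It is an assumption-laden illustration: the crawlers are stand-ins that return canned pages rather than touching the network, and `make_crawler` and `acquire` are hypothetical names. Only the standard-library `concurrent.futures.ThreadPoolExecutor` is used.

```python
from concurrent.futures import ThreadPoolExecutor


def make_crawler(pages: list[str]):
    """Build a stand-in crawler that returns canned pages instead of using the network."""
    def crawl() -> list[str]:
        return list(pages)
    return crawl


def acquire(crawlers) -> list[str]:
    """Run every crawler concurrently and merge the results in crawler order."""
    with ThreadPoolExecutor(max_workers=max(1, len(crawlers))) as pool:
        futures = [pool.submit(c) for c in crawlers]
        merged: list[str] = []
        for future in futures:
            merged.extend(future.result())
    return merged


crawlers = [make_crawler(["page A", "page B"]), make_crawler(["page B", "page C"])]
print(acquire(crawlers))
```

Because the two stand-in crawlers overlap on "page B", the merged result contains a duplicate, which is the situation the preliminary check is meant to handle.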
A first verification module, configured to perform a preliminary check on the data information obtained by the web crawler within a preset time, to obtain a preliminary data information set.
In a specific implementation, the first verification module is configured to: compare the data information obtained by the web crawler within the preset time with one another; for any two pieces of data information whose similarity is greater than the preset similarity value, delete one of them; and, after all the data information has been compared with one another, obtain the preliminary data information set. Performing this preliminary check on the data information obtained by multiple web crawlers filters out one of any two pieces of crawled data information whose similarity is greater than the preset similarity value, which yields the preliminary data information set, avoids keeping duplicate crawled data information, and reduces the amount of duplicate data information.
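The source does not say how "within a preset time" is realized. One possible reading, shown here purely as an assumption, is that crawled items carry a timestamp and are grouped into consecutive fixed-length windows, with the pairwise comparison then applied inside each window. The function name and the 60-second window are hypothetical.

```python
from collections import defaultdict


def batch_by_window(timed_items, window_seconds: float):
    """Group (timestamp, item) pairs into consecutive windows of fixed length."""
    batches = defaultdict(list)
    for timestamp, item in timed_items:
        # Items whose timestamps fall in the same window share an integer key.
        batches[int(timestamp // window_seconds)].append(item)
    return [batches[key] for key in sorted(batches)]


timed = [(0.5, "a"), (12.0, "b"), (61.0, "c"), (59.9, "d")]
print(batch_by_window(timed, 60))  # [['a', 'b', 'd'], ['c']]
```

Bounding each batch by time also bounds the number of pairwise comparisons per batch, at the cost of missing duplicates that arrive in different windows; those would be left to the later check against the database.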
A second verification module, configured to compare the data information in the preliminary data information set with a preset data information database and to determine whether each piece of data information passes verification.
In a specific implementation, the second verification module is configured to: take one piece of data information from the preliminary data information set; compare that data information with the preset data information in the preset data information database to obtain a duplication value, and determine from the duplication value and a preset threshold whether the duplication check succeeds. When the duplication value is greater than the preset threshold, the duplication check for that data information fails; otherwise, it succeeds. The duplication check is complete once every piece of data information in the preliminary data information set has been compared.
A removal module, configured to delete from the preliminary data information set the data information that fails the duplication check.
An update module, configured to add the data information in the preliminary data information set that passes the duplication check to the preset data information database, thereby updating the preset data information database.
In a specific implementation, each piece of data information in the preliminary data information set is compared with the preset data information in the preset data information database, and whether the duplication check succeeds is determined from the duplication value and the preset threshold. When a piece of data information in the preliminary data information set fails the duplication check, it is deleted, which reduces overlap among data information. When a piece of data information passes the duplication check, it is added to the preset data information database, which improves the timeliness of the duplication check.
A data information push module, configured to push the data information added to the preset data information database to users.
In a specific implementation, after the two checks, the data information newly added to the preset data information database is pushed to users.
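The source does not describe the push mechanism. As a minimal sketch under that caveat, the push module could be modeled as a list of subscriber callbacks that are invoked once for every newly added item; the class and method names are hypothetical.

```python
class PushModule:
    """Deliver newly added items to every registered subscriber callback."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callable that accepts one data item."""
        self._subscribers.append(callback)

    def push(self, new_items):
        """Send each new item to each subscriber, in insertion order."""
        for item in new_items:
            for callback in self._subscribers:
                callback(item)


inbox = []
push_module = PushModule()
push_module.subscribe(inbox.append)
push_module.push(["fresh item 1", "fresh item 2"])
print(inbox)
```

Only items that survived both checks and were added to the database would be passed to `push`, so users receive each item once.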
As shown in Fig. 2, Fig. 2 is a flow diagram of the data information update method based on repeated verification proposed by the present invention.
Referring to Fig. 2, the data information update method based on repeated verification proposed by the present invention comprises:
S1: obtaining data information using a web crawler.
In a specific implementation, multiple web crawlers may be set up to obtain data, and the web crawlers collect various kinds of information according to the information gathering and analysis targets.
S2: performing a preliminary check on the data information obtained by the web crawler within a preset time, to obtain a preliminary data information set. This specifically comprises: comparing the data information obtained by the web crawler within the preset time with one another; for any two pieces of data information whose similarity is greater than the preset similarity value, deleting one of them; and, after all the data information has been compared with one another, obtaining the preliminary data information set.
In a specific implementation, performing the preliminary check on the data information obtained by multiple web crawlers filters out one of any two pieces of crawled data information whose similarity is greater than the preset similarity value, which yields the preliminary data information set, avoids keeping duplicate crawled data information, and reduces the amount of duplicate data information.
S3: comparing the data information in the preliminary data information set with a preset data information database and determining whether each piece of data information passes verification. This specifically comprises: S31: taking one piece of data information from the preliminary data information set; S32: comparing that data information with the preset data information in the preset data information database to obtain a duplication value, and determining from the duplication value and a preset threshold whether the duplication check succeeds, where the check fails when the duplication value is greater than the preset threshold and succeeds otherwise; S33: repeating steps S31 and S32 until every piece of data information in the preliminary data information set has been compared.
S4: deleting from the preliminary data information set the data information that fails the duplication check.
S5: adding the data information in the preliminary data information set that passes the duplication check to the preset data information database, thereby updating the preset data information database.
In a specific implementation, each piece of data information in the preliminary data information set is compared with the preset data information in the preset data information database, and whether the duplication check succeeds is determined from the duplication value and the preset threshold. When a piece of data information in the preliminary data information set fails the duplication check, it is deleted, which reduces overlap among data information. When a piece of data information passes the duplication check, it is added to the preset data information database, which improves the timeliness of the duplication check.
The method further comprises step S6: pushing the data information added to the preset data information database to users.
In a specific implementation, after the two checks, the data information newly added to the preset data information database is pushed to users.
This embodiment performs a preliminary check on the data information obtained by multiple web crawlers: for any two pieces of crawled data information whose similarity is greater than the preset similarity value, one is filtered out, yielding the preliminary data information set. Each piece of data information in the preliminary data information set is then compared with the preset data information in the preset data information database, and whether the duplication check succeeds is determined from the duplication value and the preset threshold. Data information that fails the duplication check is deleted; data information that passes is added to the preset data information database. In this way, a similarity check is first applied to the large volume of data information obtained by multiple web crawlers, and highly similar data information obtained in real time is removed. This avoids keeping duplicate crawled data information and reduces the amount of data information, which improves the efficiency of the duplication check. During the duplication check, data information with a high duplication value is deleted, reducing overlap among data information, and data information with a low duplication value is added to the preset data information database, improving the timeliness of the duplication check.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent substitution or modification that a person skilled in the art makes within the technical scope disclosed by the present invention, according to the technical solution of the present invention and its inventive concept, shall fall within the scope of protection of the present invention.

Claims (10)

1. A data information update system based on repeated verification, characterized by comprising:
a data acquisition module, configured to obtain data information using a web crawler;
a first verification module, configured to perform a preliminary check on the data information obtained by the web crawler within a preset time, to obtain a preliminary data information set;
a second verification module, configured to compare the data information in the preliminary data information set with a preset data information database and to determine whether each piece of data information passes verification;
a removal module, configured to delete from the preliminary data information set the data information that fails the duplication check;
an update module, configured to add the data information in the preliminary data information set that passes the duplication check to the preset data information database, thereby updating the preset data information database.
2. The data information update system based on repeated verification according to claim 1, characterized in that the first verification module is specifically configured to:
compare the data information obtained by the web crawler within the preset time with one another;
for any two pieces of data information whose similarity is greater than a preset similarity value, delete one of them;
after all the data information has been compared with one another, obtain the preliminary data information set.
3. The data information update system based on repeated verification according to claim 1, characterized in that the second verification module is specifically configured to:
take one piece of data information from the preliminary data information set;
compare that data information with the preset data information in the preset data information database to obtain a duplication value, and determine from the duplication value and a preset threshold whether the duplication check succeeds; when the duplication value is greater than the preset threshold, the duplication check for that data information fails; otherwise, it succeeds;
complete the duplication check once every piece of data information in the preliminary data information set has been compared.
4. The data information update system based on repeated verification according to claim 1, characterized in that the data acquisition module comprises multiple data acquisition submodules.
5. The data information update system based on repeated verification according to claim 1, characterized by further comprising a data information push module, configured to push the data information added to the preset data information database to users.
6. A data information update method based on repeated verification, characterized by comprising:
S1: obtaining data information using a web crawler;
S2: performing a preliminary check on the data information obtained by the web crawler within a preset time, to obtain a preliminary data information set;
S3: comparing the data information in the preliminary data information set with a preset data information database and determining whether each piece of data information passes verification;
S4: deleting from the preliminary data information set the data information that fails the duplication check;
S5: adding the data information in the preliminary data information set that passes the duplication check to the preset data information database, thereby updating the preset data information database.
7. The data information update method based on repeated verification according to claim 6, characterized in that step S2 specifically comprises:
comparing the data information obtained by the web crawler within the preset time with one another;
for any two pieces of data information whose similarity is greater than a preset similarity value, deleting one of them;
after all the data information has been compared with one another, obtaining the preliminary data information set.
8. The data information update method based on repeated verification according to claim 6, characterized in that step S3 specifically comprises:
S31: taking one piece of data information from the preliminary data information set;
S32: comparing that data information with the preset data information in the preset data information database to obtain a duplication value, and determining from the duplication value and a preset threshold whether the duplication check succeeds; when the duplication value is greater than the preset threshold, the duplication check for that data information fails; otherwise, it succeeds;
S33: repeating steps S31 and S32 until every piece of data information in the preliminary data information set has been compared.
9. The data information update method based on repeated verification according to claim 6, characterized in that, in step S1, data information is obtained using multiple web crawlers.
10. The data information update method based on repeated verification according to claim 6, characterized by further comprising step S6: pushing the data information added to the preset data information database to users.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710367303.0A CN107329969A (en) 2017-05-23 2017-05-23 A data information update system and method based on repeated verification

Publications (1)

Publication Number Publication Date
CN107329969A (en) 2017-11-07

Family

ID=60193860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710367303.0A Pending CN107329969A (en) 2017-05-23 2017-05-23 A data information update system and method based on repeated verification

Country Status (1)

Country Link
CN (1) CN107329969A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228431A (en) * 2018-01-04 2018-06-29 北京中关村科金技术有限公司 A kind of method and system of configurationization reptile quality-monitoring
CN108932285A (en) * 2018-05-22 2018-12-04 北京工业大学 A kind of data grab method and system based on browser extension

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110035367A1 (en) * 2009-08-07 2011-02-10 Gupta Ankur K Methods And System For Efficient Crawling Of Advertiser Landing Page URLs
CN102932448A (en) * 2012-10-30 2013-02-13 工业和信息化部电信传输研究所 Distributed network crawler URL (uniform resource locator) duplicate removal system and method
CN105468683A (en) * 2015-11-16 2016-04-06 孙宝文 Method and device for carrying out duplicate checking to network address
CN105760514A (en) * 2016-02-24 2016-07-13 西安交通大学 Method for automatically obtaining short text of knowledge domain from community question-and-answer website
CN105897841A (en) * 2015-12-11 2016-08-24 乐视网信息技术(北京)股份有限公司 Scheduling method, device and system for network resource processing and sub scheduler
CN105956068A (en) * 2016-04-27 2016-09-21 湖南蚁坊软件有限公司 Webpage URL repetition elimination method based on distributed database
CN106407485A (en) * 2016-12-20 2017-02-15 福建六壬网安股份有限公司 URL de-repetition method and system based on similarity comparison
CN106598984A (en) * 2015-10-16 2017-04-26 北京国双科技有限公司 Data processing method and device of web crawler

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination