CN109635182A - Parallelization data tracking method based on educational information theme

Publication number
CN109635182A
Authority
CN
China
Prior art keywords
url
page
address
educational
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811571552.2A
Other languages
Chinese (zh)
Inventor
陈炽昌
杨帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
All Pass Education Group (guangdong) Ltd By Share Ltd
Original Assignee
All Pass Education Group (guangdong) Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by All Pass Education Group (guangdong) Ltd By Share Ltd
Priority to CN201811571552.2A
Publication of CN109635182A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061: Partitioning or combining of resources
    • G06F9/5066: Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Abstract

The invention discloses a parallelization data tracking method based on educational information themes that acquires multiple educational information themes simultaneously, improves acquisition efficiency, and tracks updates to the web page information already acquired. The method comprises the steps of: constructing multiple parallel acquisition threads between the Web pages and the Spider acquisition database; acquiring web pages in parallel; parsing and downloading each page and extracting its information; removing the pages and URLs unrelated to all educational themes; deduplicating the remaining pages and URLs; saving the deduplicated web pages in the educational information database; and extracting the URLs of the deduplicated pages and putting them into the acquired-URL sequence, which is then supplied to a collector so that the web pages are acquired again. The method can effectively improve acquisition efficiency and can improve the accuracy and validity of theme information acquisition.

Description

Parallelization data tracking method based on educational information theme
Technical field
The present invention relates to the technical field of information processing, and in particular to a parallelization data tracking method based on educational information themes.
Background art
As is well known, in recent years, with the development of the Internet and cloud computing technology, the speed and scale at which data is generated have greatly exceeded those of the past. Massive data contains a great deal of value, and how to use that data quickly and effectively is a major challenge of the big data era.
Parallel computing refers to the process of using multiple computing resources simultaneously to solve a computational problem, and is an effective means of improving the computing speed and processing capacity of a computer system. Its basic idea is to use multiple processors to solve the same problem cooperatively: the problem to be solved is decomposed into several parts, and each part is computed in parallel by an independent processor. A parallel computing system may be either a specially designed supercomputer containing multiple processors or a cluster formed by several independent computers interconnected in some way. The parallel computing cluster processes the data and then returns the results to the user.
Existing focused crawler systems for educational information can generally acquire only a single educational theme. To acquire multiple educational themes, each theme must be acquired separately and the per-theme databases then merged to form a larger educational information database. Because the information for each educational theme is acquired separately, acquisition efficiency is low. In addition, web pages whose information has been updated after acquisition cannot be acquired again, so the acquired web pages cannot be tracked.
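The decomposition idea described above can be illustrated with a minimal Python sketch. The function names and the choice of summation as the "problem" are assumptions made for illustration only, and threads merely stand in for the independent processors mentioned in the text.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(items, n_parts):
    """Decompose the problem: cut `items` into at most `n_parts` contiguous chunks."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_sum(items, n_workers=4):
    """Sum each chunk on its own worker, then combine the partial results."""
    parts = split_into_parts(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, parts))
    return sum(partial_sums)
```

Each worker handles one part independently, and the partial results are merged at the end, which mirrors the "decompose, compute in parallel, return the result" flow.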
Summary of the invention
The technical problem to be solved by the invention is to provide a parallelization data tracking method based on educational information themes that acquires multiple educational information themes simultaneously, improves acquisition efficiency, and tracks updates to the web page information already acquired.
The technical solution adopted by the present invention to solve this technical problem is a parallelization data tracking method based on educational information themes, comprising the following steps:
S1, constructing multiple parallel acquisition threads between the Web pages and the Spider acquisition database;
S2, for each acquisition thread, constructing two buffer pools, a theme-class buffer pool positivePool and a non-theme-class buffer pool negtivePool, for storing URL-class entities, i.e. the URL addresses in the URL address set; both buffer pools are initialized to the empty set;
S3, selecting seed websites according to the educational information theme that each acquisition thread is to acquire, which constitute the initial set of the search program Spider, i.e. the URL address set;
S4, performing Spider acquisition on the Web pages simultaneously through the multiple acquisition threads;
S5, parsing and downloading the acquired web pages, and extracting the URL addresses and text information of each page;
S6, calculating the correlation of each acquired web page with all educational themes, and calculating the correlation of each acquired page URL address with all educational information themes;
When calculating the educational-theme correlation of a web page:
first, the correlation of the acquired web page with each educational information theme is calculated one by one; each page relevant to an educational theme is then stored in the educational information theme database corresponding to that theme, until the correlation with all educational themes has been calculated, so that web pages unrelated to all educational themes are filtered out;
When calculating the educational-theme correlation of a web page URL address:
first, the correlation of the acquired web page URL address with each educational information theme is calculated one by one; each page URL address relevant to an educational theme is then stored in the educational information theme buffer pool corresponding to that theme, until the correlation with all educational themes has been calculated, so that web page URL addresses unrelated to all educational themes are filtered out;
S7, deduplicating the web pages in all educational information theme databases, i.e. deduplicating the web pages in each educational information theme database independently and deleting identical pages within that database; and deduplicating the URL addresses in all theme-class buffer pools, i.e. deduplicating the URL address sequence in each theme-class buffer pool independently;
S8, extracting all URL addresses of the deduplicated pages in the theme databases; adding the extracted URL addresses to the URL address sequence in the theme-class buffer pool and deduplicating, with the URL address sequence in each theme-class buffer pool deduplicated independently; and storing the deduplicated web pages in the corresponding educational information theme database.
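As a rough illustration of steps S1 to S4 and the two buffer pools of step S2, the following Python sketch runs one thread per theme. It is a simplified sketch under stated assumptions, not the patented implementation: `ThemeWorker`, `crawl_in_parallel`, and the caller-supplied `fetch` and `is_relevant` functions are hypothetical names, and each thread here checks only its own theme rather than all themes as step S6 requires.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ThemeWorker:
    theme: str
    seeds: list                                      # S3: seed URLs for this theme
    positivePool: set = field(default_factory=set)   # S2: theme-class URLs, starts empty
    negtivePool: set = field(default_factory=set)    # S2: non-theme-class URLs, starts empty

    def run(self, fetch, is_relevant):
        """Fetch each seed URL and sort it into one of the two pools."""
        for url in self.seeds:
            text = fetch(url)
            relevant = is_relevant(self.theme, text)
            pool = self.positivePool if relevant else self.negtivePool
            pool.add(url)


def crawl_in_parallel(workers, fetch, is_relevant):
    """S1/S4: start one thread per theme, then wait for all of them."""
    threads = [threading.Thread(target=w.run, args=(fetch, is_relevant))
               for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return workers
```

Because every worker owns its own pools, the threads do not share mutable state and need no locking in this sketch.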
Further, step S3 also comprises: recording the corresponding page addresses by combining a site strategy with a buffer pool strategy; all page addresses recorded by the module are finally supplied to the acquisition module for page acquisition. The site strategy is to search only within manually selected websites; the buffer pool strategy is to put the acquired page addresses into a buffer pool.
Preferably, in steps S7 and S8 the URL address sequence is deduplicated using a hash table:
all URL addresses are stored in a hashmap container, and the hash value of each URL is then calculated by the strhash function;
the hashmap container is searched according to the calculated hash value of the URL, and if the hash value already exists, the URL address is deleted.
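The hash-table deduplication described above might be sketched as follows. The text names a `strhash` function without defining it, so the version here is a hypothetical stand-in built on SHA-256, and a Python dict plays the role of the hashmap container.

```python
import hashlib


def strhash(url):
    """Hypothetical stand-in for the strhash function named in the text."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def dedupe_urls(urls):
    """Keep the first occurrence of each URL, preserving order."""
    seen = {}                      # plays the role of the hashmap container
    unique = []
    for url in urls:
        key = strhash(url)
        if key in seen:            # hash value already present: drop the URL
            continue
        seen[key] = url
        unique.append(url)
    return unique
```

Lookup by hash value is a constant-time operation on average, so the cost of deduplication grows roughly linearly with the number of URLs. A production system would usually also compare the stored URL on a hash match to guard against collisions.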
Further, in step S4 thresholds for page correlation and URL correlation are first set separately; the correlation between a page and a theme is calculated using a semantic-based vector space model method; the URL correlation is calculated by the pagerank algorithm; pages and URLs whose correlation is below the threshold are deleted.
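A vector space model comparison with a threshold can be sketched as below. This is a plain term-frequency cosine similarity, which is an assumption: the text calls for a semantic-based variant but gives no details, and the function names and the default threshold of 0.2 are illustrative only.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_page_relevant(page_text, theme_text, threshold=0.2):
    """Keep the page only if its similarity to the theme reaches the threshold."""
    score = cosine_similarity(term_vector(page_text), term_vector(theme_text))
    return score >= threshold
```

Pages scoring below the threshold would be deleted, matching the rule stated above.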
The beneficial effects of the present invention are as follows. In the parallelization data tracking method based on educational information themes according to the present invention, multiple parallel acquisition threads are constructed between the Web pages and the Spider acquisition database, so that information on multiple educational themes can be acquired simultaneously through these threads.
Meanwhile, the acquired pages are filtered against the multiple educational information themes one by one, which filters out the web pages unrelated to all educational themes, and the acquired page URL addresses are filtered against the multiple educational information themes, which filters out the web page URL addresses unrelated to all educational themes; this ensures the validity of the acquired web pages.
Secondly, the URL addresses of the pages that pass correlation filtering are extracted, deduplicated, and stored in the URL address sequence in the theme-class buffer pool, so that the web pages corresponding to these URL addresses are more highly correlated with the theme information; and the web pages corresponding to these URL addresses can be acquired again when their information is updated, which realizes tracking of the acquired web pages.
In conclusion, the parallelization data tracking method based on educational information themes according to the present invention can improve the efficiency of theme information acquisition and can improve its accuracy and validity.
Description of the drawings
Fig. 1 is a flow chart of the parallelization data tracking method based on educational information themes in an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawing and the embodiments.
As shown in Fig. 1, the parallelization data tracking method based on educational information themes according to the present invention comprises the following steps:
S1, constructing multiple parallel acquisition threads between the Web pages and the Spider acquisition database;
S2, for each acquisition thread, constructing two buffer pools, a theme-class buffer pool positivePool and a non-theme-class buffer pool negtivePool, for storing URL-class entities, i.e. the URL addresses in the URL address set; both buffer pools are initialized to the empty set;
S3, selecting seed websites according to the educational information theme that each acquisition thread is to acquire, which constitute the initial set of the search program Spider, i.e. the URL address set;
S4, performing Spider acquisition on the Web pages simultaneously through the multiple acquisition threads;
S5, parsing and downloading the acquired web pages, and extracting the URL addresses and text information of each page;
S6, calculating the correlation of each acquired web page with all educational themes, and calculating the correlation of each acquired page URL address with all educational information themes;
When calculating the educational-theme correlation of a web page:
first, the correlation of the acquired web page with each educational information theme is calculated one by one; each page relevant to an educational theme is then stored in the educational information theme database corresponding to that theme, until the correlation with all educational themes has been calculated, so that web pages unrelated to all educational themes are filtered out;
When calculating the educational-theme correlation of a web page URL address:
first, the correlation of the acquired web page URL address with each educational information theme is calculated one by one; each page URL address relevant to an educational theme is then stored in the educational information theme buffer pool corresponding to that theme, until the correlation with all educational themes has been calculated, so that web page URL addresses unrelated to all educational themes are filtered out;
S7, deduplicating the web pages in all educational information theme databases, i.e. deduplicating the web pages in each educational information theme database independently and deleting identical pages within that database; and deduplicating the URL addresses in all theme-class buffer pools, i.e. deduplicating the URL address sequence in each theme-class buffer pool independently;
S8, extracting all URL addresses of the deduplicated pages in the theme databases; adding the extracted URL addresses to the URL address sequence in the theme-class buffer pool and deduplicating, with the URL address sequence in each theme-class buffer pool deduplicated independently; and storing the deduplicated web pages in the corresponding educational information theme database.
In step S1, multiple parallel acquisition threads are constructed between the Web pages and the Spider acquisition database, which provides the infrastructure for parallelized acquisition of educational theme information.
Steps S2 to S4 realize the acquisition of Web pages by educational information theme. In step S5 the web pages are parsed and downloaded, and their URL addresses and text information are extracted, which provides the basis for the subsequent correlation filtering of the web pages.
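The extraction of URL addresses and text information in step S5 could look roughly like the following sketch, which uses only the Python standard library. The class and function names are assumptions, and a real crawler would also need downloading, character-set detection, and error handling, none of which are shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects absolute link URLs and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip = 0                      # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    """Return (list of absolute URLs, page text) for one downloaded page."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```

The two return values feed the two filters of step S6: the text goes to the page correlation calculation and the links go to the URL correlation calculation.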
In step S6, the acquired pages are filtered against the multiple educational information themes one by one, which filters out the web pages unrelated to all educational themes; the acquired page URL addresses are likewise filtered against the multiple educational information themes, which filters out the web page URL addresses unrelated to all educational themes. Page filtering and URL filtering together achieve complete filtering of the web pages, ensuring the validity of the acquired web pages and their correlation with the themes.
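The one-by-one comparison against every theme can be sketched as a routing function. In this illustrative Python sketch the relevance measure is passed in as a `score` callable, because the text leaves its exact form to the preferred embodiment; the names and the default threshold are assumptions.

```python
def route_page(text, themes, score, threshold):
    """Return the themes whose relevance score for `text` reaches `threshold`."""
    return [theme for theme in themes if score(text, theme) >= threshold]


def filter_pages(pages, themes, score, threshold=0.5):
    """pages: {url: text}. Build one database per theme; drop pages matching no theme."""
    databases = {theme: {} for theme in themes}
    discarded = []
    for url, text in pages.items():
        matched = route_page(text, themes, score, threshold)
        if not matched:
            discarded.append(url)          # unrelated to all themes: filtered out
        for theme in matched:
            databases[theme][url] = text   # stored in each matching theme database
    return databases, discarded
```

A page relevant to several themes lands in several theme databases, which is why the later deduplication is carried out independently per database.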
In step S7, the web pages are deduplicated and the URLs are deduplicated as well, which reduces data redundancy in the system, prevents efficiency from degrading after the system has run for a long time, and avoids storing invalid data.
In step S8, the URL addresses of the deduplicated pages in the theme databases are extracted; the extracted URL addresses are added to the deduplicated URL address sequence in the theme-class buffer pool and deduplicated again; the deduplicated URL address sequence is stored in the theme-class buffer pool. This makes it possible to acquire the theme information of the new web page after the page at a URL address has been updated, which improves the comprehensiveness and accuracy of acquisition. The educational information theme databases obtained in step S8 can be published to provide services for education.
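One way to picture the tracking behaviour of step S8 is to re-fetch the URLs held in the theme-class buffer pool and compare a content digest with the one stored earlier. The text does not say how an update is detected, so the digest comparison, the function names, and the `fetch` callable are all assumptions in this sketch.

```python
import hashlib


def digest(text):
    """Fingerprint of a page body, used to detect changes between passes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def recollect(pool_urls, fetch, known_digests):
    """Re-fetch every URL in the pool and return the pages that are new or changed.

    known_digests maps url -> digest of the last stored version; it is updated in place.
    """
    updated = {}
    for url in pool_urls:
        text = fetch(url)
        if text is None:
            continue                         # page unavailable on this pass
        d = digest(text)
        if known_digests.get(url) != d:      # new page or changed content
            known_digests[url] = d
            updated[url] = text
    return updated
```

Running `recollect` periodically over the buffer pool yields only the pages whose content differs from the stored version, so the theme database can be refreshed without reprocessing unchanged pages.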
In conclusion the parallelization data tracking method of the present invention based on educational information theme, due in Web net Multiple parallel acquisition threads are constructed between page and Spider acquisition database;By multiple parallel collecting threads so as to reality The now acquisition to multiple educational topics information simultaneously;
Meanwhile the filtering by carrying out multiple educational information themes one by one to the collected page, to filter out and institute There is a webpage that educational topics are unrelated, and by the filtering to collecting the address page URL and carrying out multiple educational information themes, from And the webpage URL address unrelated with all educational topics is filtered out, so as to guarantee to collect the validity of webpage.
Secondly, the address URL of the page by correlation filtering is extracted, theme class is slow after being then stored in duplicate removal It rushes in the URL address sequence in pond so that the corresponding webpage in the address URL is higher with subject information correlation;And it can The update for corresponding to webpage information to the address URL resurveys, so as to realize the tracking to acquisition webpage.
Therefore, the parallelization data tracking method of the present invention based on educational information theme can be improved theme letter The efficiency for ceasing acquisition, can be improved the accuracy and validity of topic information acquisition.
Further, step S3 also comprises: recording the corresponding page addresses by combining a site strategy with a buffer pool strategy; all recorded page addresses are finally supplied to the acquisition module for page acquisition. The site strategy is to search only within manually selected websites; the buffer pool strategy is to put the acquired page addresses into a buffer pool, which speeds up duplicate checking during acquisition.
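The combination of the two strategies can be sketched in Python as a host check followed by insertion into a set. Treating "within a manually selected website" as "same host as a seed URL" is an assumption, as are the function names.

```python
from urllib.parse import urlparse


def allowed_hosts(seed_urls):
    """Hosts of the manually selected seed websites."""
    return {urlparse(u).netloc.lower() for u in seed_urls}


def in_site(url, hosts):
    """Site strategy: accept a URL only if it belongs to a selected website."""
    return urlparse(url).netloc.lower() in hosts


def record_addresses(candidate_urls, seed_urls, buffer_pool):
    """Buffer pool strategy: store accepted addresses in a set for fast duplicate checks."""
    hosts = allowed_hosts(seed_urls)
    for url in candidate_urls:
        if in_site(url, hosts):
            buffer_pool.add(url)
    return buffer_pool
```

Holding the addresses in an in-memory set is what makes the duplicate check fast: membership tests do not require scanning the addresses already recorded.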
To facilitate deduplication and improve its efficiency, preferably, in steps S7 and S8 the URL address sequence is deduplicated using a hash table:
all URL addresses are stored in a hashmap container, and the hash value of each URL is then calculated by the strhash function;
the hashmap container is searched according to the calculated hash value of the URL, and if the hash value already exists, the URL address is deleted.
To improve the accuracy of the correlation calculation and ensure that the acquired pages are correlated with the themes, preferably, in step S6 thresholds for page correlation and URL correlation are first set separately; the page correlation is calculated using a semantic-based vector space model method; the URL correlation is calculated by the pagerank algorithm; pages and URLs whose correlation is below the threshold are deleted.
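For the URL side, a generic PageRank power iteration with a threshold filter might look like the sketch below. The text names the pagerank algorithm but gives no parameters, so the damping factor of 0.85, the iteration count, the handling of dangling nodes, and the function names are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {url: [outgoing urls]}. Returns {url: rank}; the ranks sum to 1."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:                          # dangling node: spread its rank evenly
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


def filter_urls_by_rank(links, threshold):
    """Keep only URLs whose rank reaches the threshold, in sorted order."""
    return sorted(u for u, r in pagerank(links).items() if r >= threshold)
```

URLs that few or no pages link to receive a low rank and fall below the threshold, which corresponds to the deletion rule stated above.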

Claims (4)

1. A parallelization data tracking method based on educational information themes, characterized by comprising the following steps:
S1, constructing multiple parallel acquisition threads between the Web pages and the Spider acquisition database;
S2, for each acquisition thread, constructing two buffer pools, a theme-class buffer pool positivePool and a non-theme-class buffer pool negtivePool, for storing URL-class entities, i.e. the URL addresses in the URL address set; both buffer pools are initialized to the empty set;
S3, selecting seed websites according to the educational information theme that each acquisition thread is to acquire, which constitute the initial set of the search program Spider, i.e. the URL address set;
S4, performing Spider acquisition on the Web pages simultaneously through the multiple acquisition threads;
S5, parsing and downloading the acquired web pages, and extracting the URL addresses and text information of each page;
S6, calculating the correlation of each acquired web page with all educational themes, and calculating the correlation of each acquired page URL address with all educational information themes;
When calculating the educational-theme correlation of a web page:
first, the correlation of the acquired web page with each educational information theme is calculated one by one; each page relevant to an educational theme is then stored in the educational information theme database corresponding to that theme, until the correlation with all educational themes has been calculated, so that web pages unrelated to all educational themes are filtered out;
When calculating the educational-theme correlation of a web page URL address:
first, the correlation of the acquired web page URL address with each educational information theme is calculated one by one; each page URL address relevant to an educational theme is then stored in the educational information theme buffer pool corresponding to that theme, until the correlation with all educational themes has been calculated, so that web page URL addresses unrelated to all educational themes are filtered out;
S7, deduplicating the web pages in all educational information theme databases, i.e. deduplicating the web pages in each educational information theme database independently and deleting identical pages within that database; and deduplicating the URL addresses in all theme-class buffer pools, i.e. deduplicating the URL address sequence in each theme-class buffer pool independently;
S8, extracting all URL addresses of the deduplicated pages in the theme databases; adding the extracted URL addresses to the URL address sequence in the theme-class buffer pool and deduplicating, with the URL address sequence in each theme-class buffer pool deduplicated independently; and storing the deduplicated web pages in the corresponding educational information theme database.
2. The parallelization data tracking method based on educational information themes according to claim 1, characterized in that step S3 further comprises: recording the corresponding page addresses by combining a site strategy with a buffer pool strategy; all page addresses recorded by the module are finally supplied to the acquisition module for page acquisition; the site strategy is to search only within manually selected websites; the buffer pool strategy is to put the acquired page addresses into a buffer pool.
3. The parallelization data tracking method based on educational information themes according to claim 2, characterized in that in steps S7 and S8 the URL address sequence is deduplicated using a hash table:
all URL addresses are stored in a hashmap container, and the hash value of each URL is then calculated by the strhash function;
the hashmap container is searched according to the calculated hash value of the URL, and if the hash value already exists, the URL address is deleted.
4. The parallelization data tracking method based on educational information themes according to claim 3, characterized in that in step S4 thresholds for page correlation and URL correlation are first set separately; the correlation between a page and a theme is calculated using a semantic-based vector space model method; the URL correlation is calculated by the pagerank algorithm; pages and URLs whose correlation is below the threshold are deleted.
CN201811571552.2A 2018-12-21 2018-12-21 Parallelization data tracking method based on educational information theme Pending CN109635182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811571552.2A CN109635182A (en) 2018-12-21 2018-12-21 Parallelization data tracking method based on educational information theme

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811571552.2A CN109635182A (en) 2018-12-21 2018-12-21 Parallelization data tracking method based on educational information theme

Publications (1)

Publication Number Publication Date
CN109635182A true CN109635182A (en) 2019-04-16

Family

ID=66076350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811571552.2A Pending CN109635182A (en) 2018-12-21 2018-12-21 Parallelization data tracking method based on educational information theme

Country Status (1)

Country Link
CN (1) CN109635182A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561814A (en) * 2009-05-08 2009-10-21 华中科技大学 Topic crawler system based on social labels
CN101605129A (en) * 2009-06-23 2009-12-16 北京理工大学 A kind of URL lookup method that is used for the url filtering system
CN102662954A (en) * 2012-03-02 2012-09-12 杭州电子科技大学 Method for implementing topical crawler system based on learning URL string information
CN103310013A (en) * 2013-07-02 2013-09-18 北京航空航天大学 Subject-oriented web page collection system
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN104809182A (en) * 2015-04-17 2015-07-29 东南大学 Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190416)