CN109635182A

CN109635182A - Parallelized data tracking method based on educational information topics

Info

Publication number: CN109635182A
Application number: CN201811571552.2A
Authority: CN
Inventors: 陈炽昌; 杨帆
Original assignee: All Pass Education Group (guangdong) Ltd By Share Ltd
Current assignee: All Pass Education Group (guangdong) Ltd By Share Ltd
Priority date: 2018-12-21
Filing date: 2018-12-21
Publication date: 2019-04-16

Abstract

The invention discloses a parallelized data tracking method based on educational information topics that collects multiple educational information topics simultaneously, improves collection efficiency, and tracks updates to the collected web page information. The method comprises the steps of: constructing multiple parallel collection threads between the Web pages and the Spider collection database; collecting Web pages in parallel; parsing and downloading each page and extracting its information; removing the pages and URLs unrelated to every educational topic; deduplicating the remaining pages and URLs; saving the deduplicated pages to the educational information database; and extracting the URLs of the deduplicated pages, placing them in the collected-URL sequence, and supplying that sequence to the collector so that the pages are collected again. The method can effectively improve collection efficiency and can improve the accuracy and validity of topic information collection.

Description

Parallelized data tracking method based on educational information topics

Technical field

The present invention relates to the technical field of information processing, and in particular to a parallelized data tracking method based on educational information topics.

Background art

As is well known, with the development of the internet and cloud computing technology in recent years, the speed and scale at which data is generated have far exceeded those of the past. Massive data contains a great deal of value, and using that data quickly and effectively is a major challenge of the big data era.

Parallel computing refers to the process of using multiple computing resources simultaneously to solve a computational problem; it is an effective means of improving the computing speed and processing capacity of a computer system. The basic idea is to have multiple processors cooperate on the same problem: the problem to be solved is decomposed into several parts, and each part is computed in parallel by an independent processor. A parallel computing system can be either a specially designed supercomputer containing multiple processors or a cluster of several independent computers interconnected in some way. The parallel computing cluster processes the data and then returns the result to the user.

Existing focused crawler systems for educational information can generally collect only a single educational topic. To collect multiple educational topics, each topic must be collected separately and the per-topic databases then merged to form a larger educational information database. Because the information for each educational topic is collected separately, collection efficiency is low. Moreover, once a page has been collected, later updates to that page cannot be collected, so the collected pages cannot be tracked.

Summary of the invention

The technical problem to be solved by the invention is to provide a parallelized data tracking method based on educational information topics that collects multiple educational information topics simultaneously, improves collection efficiency, and tracks updates to the collected web page information.

The technical solution adopted by the invention to solve this technical problem is a parallelized data tracking method based on educational information topics, comprising the following steps:

S1. Construct multiple parallel collection threads between the Web pages and the Spider collection database;

S2. For each collection thread, construct two buffer pools, a topic-class buffer pool positivePool and a non-topic-class buffer pool negtivePool, to store URL entities, i.e. the URL addresses in the URL address set; both buffer pools are initialized to the empty set;

S3. Select seed websites according to the educational information topic assigned to each collection thread; these form the initial set of the search program Spider, i.e. the URL address set;

S4. Perform Spider collection on the Web pages with the multiple collection threads simultaneously;

S5. Parse and download the collected Web pages, and extract the URL address and text information of each page;

S6. Calculate the relevance of each collected Web page to all educational topics, and calculate the relevance of each collected page URL address to all educational information topics;

When calculating the relevance of a Web page to the educational topics:

First, calculate the relevance of the collected web page to each educational information topic in turn; then store each page that is relevant to an educational topic in the corresponding educational information topic database. Once the relevance to all educational topics has been calculated, filter out the web pages unrelated to every educational topic;

When calculating the relevance of a Web page URL address to the educational topics:

First, calculate the relevance of the collected web page URL address to each educational information topic in turn; then store each page URL address that is relevant to an educational topic in the corresponding educational information topic buffer pool. Once the relevance to all educational topics has been calculated, filter out the web page URL addresses unrelated to every educational topic;

S7. Deduplicate the web pages in all educational information topic databases, i.e. deduplicate the pages in each topic database independently and delete identical pages within it; also deduplicate the URL addresses in all topic-class buffer pools, i.e. deduplicate the URL address sequence in each topic-class buffer pool independently;

S8. Extract the URL addresses of all deduplicated pages in the topic databases; add the extracted URL addresses to the URL address sequence in the topic-class buffer pool and deduplicate it, handling the URL address sequence in each topic-class buffer pool independently; store the deduplicated web pages in the corresponding educational information topic database.
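As an illustration only, the per-thread state set up in steps S2 and S3 could be sketched in Python as follows. The class name `CollectorState`, the `frontier` field, and the example seed URL are assumptions made for this sketch; only `positivePool` and `negtivePool` are names taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class CollectorState:
    """State held by one collection thread (hypothetical structure)."""
    topic: str                                       # topic assigned to this thread
    seeds: list                                      # manually selected seed sites (S3)
    positivePool: set = field(default_factory=set)   # topic-class URLs, empty at start (S2)
    negtivePool: set = field(default_factory=set)    # non-topic-class URLs, empty at start (S2)
    frontier: list = field(default_factory=list)     # URLs waiting to be collected

    def __post_init__(self):
        # The seed sites form the Spider's initial URL address set.
        self.frontier = list(self.seeds)


state = CollectorState(topic="mathematics", seeds=["http://example.edu/math"])
```

Each thread would own one such object, so the pools of different topics never share state.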

Further, step S3 also comprises recording the corresponding page addresses by combining a site strategy with a buffer pool strategy; the page addresses recorded by the module are finally all supplied to the collection module, which collects the pages. The site strategy performs in-site search only on manually selected websites; the buffer pool strategy places the collected page addresses in the buffer pool.

Preferably, in steps S7 and S8 the URL address sequence is deduplicated using a hash table:

All URL addresses are stored in a hashmap container, and the hash value of each URL is then calculated with the strhash function;

The hashmap container is searched for the calculated hash value of the URL; if the hash value already exists, the URL address is deleted.
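The hash-table deduplication described above might look roughly like the following Python sketch. The text names a `strhash` function but does not define it, so the version here, which uses SHA-1 from the standard library, is an assumption, as is the helper name `dedupe_urls`.

```python
import hashlib


def strhash(url: str) -> str:
    # Stand-in for the strhash function named in the text (assumed: SHA-1 hex digest).
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def dedupe_urls(urls):
    """Return the URLs in their original order with repeated entries removed."""
    seen = {}          # plays the role of the hashmap container: hash value -> URL
    unique = []
    for url in urls:
        h = strhash(url)
        if h in seen:  # hash value already exists, so this URL is dropped
            continue
        seen[h] = url
        unique.append(url)
    return unique
```

A dictionary lookup takes constant time on average, which is why a hash table suits a URL sequence that keeps growing during collection.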

Further, in step S4 thresholds for page relevance and URL relevance are first set separately; the relevance between a page and a topic is calculated using a semantic-based vector space model method; URL relevance is calculated with the pagerank algorithm; pages and URLs whose relevance is below the threshold are deleted.
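For readers unfamiliar with vector space models, the sketch below shows a plain term-frequency cosine similarity compared against a threshold. It is a simplification: the text specifies a semantic-based model without further detail, and the function names and the default threshold of 0.2 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_text: str, topic_text: str, threshold: float = 0.2) -> bool:
    # Pages whose relevance falls below the threshold would be deleted.
    return cosine_similarity(page_text, topic_text) >= threshold
```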

The beneficial effects of the invention are as follows. In the parallelized data tracking method based on educational information topics of the present invention, multiple parallel collection threads are constructed between the Web pages and the Spider collection database, so that information on multiple educational topics can be collected simultaneously.

At the same time, the collected pages are filtered against each of the educational information topics in turn, which removes the web pages unrelated to every educational topic; the collected page URL addresses are filtered against the educational information topics in the same way, which removes the URL addresses unrelated to every educational topic. This guarantees the validity of the collected web pages.

Next, the URL addresses of the pages that pass the relevance filter are extracted and, after deduplication, stored in the URL address sequence in the topic-class buffer pool, so that the web pages corresponding to those URL addresses are highly relevant to the topic information. Updated content on the web pages at those URL addresses can then be collected again, which realizes tracking of the collected web pages.

In summary, the parallelized data tracking method based on educational information topics of the present invention can improve the efficiency of topic information collection and can improve the accuracy and validity of topic information collection.

Brief description of the drawings

Fig. 1 is a flow chart of the parallelized data tracking method based on educational information topics in an embodiment of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the drawing and the embodiment.

As shown in Fig. 1, the parallelized data tracking method based on educational information topics of the present invention comprises the following steps:

In step S1, multiple parallel collection threads are constructed between the Web pages and the Spider collection database, which provides the infrastructure for parallel collection of educational topic information.
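One way to picture the parallel collection threads of step S1 is a thread pool with one task per topic, as in the Python sketch below. The function names and the injected `fetch` callable are assumptions; a real collector would download pages over the network instead of calling a stand-in function.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_topic(topic, seeds, fetch):
    """Collect every seed URL of one topic; runs inside a single worker thread."""
    return topic, [fetch(url) for url in seeds]


def collect_all(topic_seeds, fetch):
    """Run one collection task per topic in parallel and gather the results."""
    workers = max(1, len(topic_seeds))  # one thread per topic, at least one
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(collect_topic, topic, seeds, fetch)
            for topic, seeds in topic_seeds.items()
        ]
        return dict(f.result() for f in futures)


# Usage with a stand-in fetch function that only upper-cases the URL.
pages = collect_all({"math": ["u1"], "physics": ["u2", "u3"]}, lambda u: u.upper())
```

Because page downloads are I/O-bound, threads can overlap their waiting time even in Python.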

Steps S2 to S4 realize the collection of educational information topics from the Web pages. In step S5 the web pages are parsed and downloaded, and the URL address and text information of each web page are extracted, which provides the basis for the subsequent relevance filtering of the web pages.

In step S6, the collected pages are filtered against each of the educational information topics in turn, which removes the web pages unrelated to every educational topic, and the collected page URL addresses are filtered in the same way, which removes the URL addresses unrelated to every educational topic. Page filtering and URL filtering together realize complete filtering of the web pages, guaranteeing both the validity of the collected pages and their relevance to the topics.
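The topic-by-topic filtering of step S6 can be sketched as routing each page to every topic database it matches and discarding pages that match none. In the Python sketch below, the keyword-overlap score is a deliberately simple stand-in for the relevance calculation, and all names and the threshold of 0.5 are assumptions.

```python
def keyword_overlap(text, keywords):
    """Fraction of the topic keywords that occur in the page text (stand-in score)."""
    words = set(text.lower().split())
    if not keywords:
        return 0.0
    return sum(1 for k in keywords if k in words) / len(keywords)


def filter_pages(pages, topics, threshold=0.5):
    """Route each (url, text) page to the matching topic databases.

    Returns the per-topic databases and the URLs unrelated to every topic.
    """
    databases = {name: [] for name in topics}
    discarded = []
    for url, text in pages:
        # Compare the page with each topic in turn.
        matched = [n for n, kws in topics.items() if keyword_overlap(text, kws) >= threshold]
        if not matched:
            discarded.append(url)  # unrelated to all topics, filtered out
        for name in matched:
            databases[name].append(url)
    return databases, discarded


topics = {"math": ["algebra", "geometry"], "physics": ["force", "energy"]}
pages = [("u1", "algebra and geometry lesson"), ("u2", "energy drink advert"), ("u3", "celebrity news")]
databases, discarded = filter_pages(pages, topics)
```

Here `u2` passes the keyword test for physics even though it is an advert, which shows why the text calls for a semantic model and a tuned threshold.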

In step S7 the web pages are deduplicated, and the URLs are deduplicated as well, which reduces data redundancy in the system, prevents efficiency from degrading after the system has run for a long time, and avoids storing invalid data.

In step S8 the URL addresses of the deduplicated pages in the topic databases are extracted; the extracted URL addresses are added to the deduplicated URL address sequence in the topic-class buffer pool, and the sequence is deduplicated again; the deduplicated URL address sequence is stored in the topic-class buffer pool, so that topic information can be collected from the new Web page after the page at a URL address is updated. This improves the comprehensiveness and accuracy of collection. The educational information topic databases obtained in step S8 can be published to provide services for education.
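The text does not say how an updated page is recognized on re-collection. One possible approach, shown here purely as an assumption, is to store a content digest per tracked URL and compare it each time the URL is collected again; the function names and the `fetch` callable are hypothetical.

```python
import hashlib


def content_digest(text: str) -> str:
    """Digest of the page text, used to detect changes between collections."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def recollect(tracked, fetch):
    """Re-collect every tracked URL and report those whose content changed.

    tracked maps URL -> digest from the previous collection and is updated in place.
    """
    changed = []
    for url in list(tracked):
        digest = content_digest(fetch(url))
        if digest != tracked[url]:
            tracked[url] = digest
            changed.append(url)
    return changed


# Usage with an in-memory stand-in for the web.
web = {"u1": "old lesson plan", "u2": "syllabus"}
tracked = {url: content_digest(text) for url, text in web.items()}
web["u1"] = "new lesson plan"          # the page behind u1 is updated
first_pass = recollect(tracked, web.__getitem__)
second_pass = recollect(tracked, web.__getitem__)
```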

In summary, in the parallelized data tracking method based on educational information topics of the present invention, multiple parallel collection threads are constructed between the Web pages and the Spider collection database, so that information on multiple educational topics can be collected simultaneously.

Therefore, the parallelized data tracking method based on educational information topics of the present invention can improve the efficiency of topic information collection and can improve the accuracy and validity of topic information collection.

Further, step S3 also comprises recording the corresponding page addresses by combining a site strategy with a buffer pool strategy; the recorded page addresses are finally all supplied to the collection module, which collects the pages. The site strategy performs in-site search only on manually selected websites; the buffer pool strategy places the collected page addresses in the buffer pool, which speeds up duplicate checking during collection.
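A minimal sketch of the two strategies, assuming that "in-site" means the link's host is one of the manually selected hosts or a subdomain of one, could look like this in Python. The function names and the example hosts are hypothetical.

```python
from urllib.parse import urlparse


def in_selected_sites(url, selected_hosts):
    """Site strategy: accept only links that stay inside a manually selected site."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in selected_hosts)


def buffer_in_site_links(links, selected_hosts, buffer_pool):
    """Buffer pool strategy: record accepted page addresses in the buffer pool."""
    for link in links:
        if in_selected_sites(link, selected_hosts):
            buffer_pool.add(link)  # a set makes later duplicate checks cheap
    return buffer_pool


pool = buffer_in_site_links(
    ["http://www.example.edu/a", "http://other.com/b", "http://www.example.edu/a"],
    {"example.edu"},
    set(),
)
```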

To facilitate deduplication and improve its efficiency, the URL address sequence in steps S7 and S8 is preferably deduplicated using a hash table.

To improve the accuracy of the relevance calculation and guarantee the relevance of the collected pages to the topics, preferably, in step S6 thresholds for page relevance and URL relevance are first set separately; page relevance is calculated using a semantic-based vector space model method; URL relevance is calculated with the pagerank algorithm; pages and URLs whose relevance is below the threshold are deleted.
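For reference, a textbook power-iteration form of PageRank is sketched below in Python. How the method maps PageRank scores to a per-topic URL relevance is not described in the text, so this shows only the generic algorithm; the damping factor of 0.85 and the iteration count are conventional assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping URL -> list of outgoing URLs."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

URLs whose score falls below the chosen URL relevance threshold would then be removed from the buffer pool.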

Claims

1. A parallelized data tracking method based on educational information topics, characterized by comprising the following steps:

S2. For each collection thread, construct two buffer pools, a topic-class buffer pool positivePool and a non-topic-class buffer pool negtivePool, to store URL entities, i.e. the URL addresses in the URL address set; both buffer pools are initialized to the empty set;

S3. Select seed websites according to the educational information topic assigned to each collection thread; these form the initial set of the search program Spider, i.e. the URL address set;

S6. Calculate the relevance of each collected Web page to all educational topics, and calculate the relevance of each collected page URL address to all educational information topics;

First, calculate the relevance of the collected web page to each educational information topic in turn; then store each page that is relevant to an educational topic in the corresponding educational information topic database. Once the relevance to all educational topics has been calculated, filter out the web pages unrelated to every educational topic;

First, calculate the relevance of the collected web page URL address to each educational information topic in turn; then store each page URL address that is relevant to an educational topic in the corresponding educational information topic buffer pool. Once the relevance to all educational topics has been calculated, filter out the web page URL addresses unrelated to every educational topic;

S7. Deduplicate the web pages in all educational information topic databases, i.e. deduplicate the pages in each topic database independently and delete identical pages within it; also deduplicate the URL addresses in all topic-class buffer pools, i.e. deduplicate the URL address sequence in each topic-class buffer pool independently;

S8. Extract the URL addresses of all deduplicated pages in the topic databases; add the extracted URL addresses to the URL address sequence in the topic-class buffer pool and deduplicate it, handling the URL address sequence in each topic-class buffer pool independently; store the deduplicated web pages in the corresponding educational information topic database.

2. The parallelized data tracking method based on educational information topics according to claim 1, characterized in that step S3 also comprises recording the corresponding page addresses by combining a site strategy with a buffer pool strategy; the page addresses recorded by the module are finally all supplied to the collection module, which collects the pages; the site strategy performs in-site search only on manually selected websites; the buffer pool strategy places the collected page addresses in the buffer pool.

3. The parallelized data tracking method based on educational information topics according to claim 2, characterized in that in steps S7 and S8 the URL address sequence is deduplicated using a hash table;

All URL addresses are stored in a hashmap container, and the hash value of each URL is then calculated with the strhash function;

The hashmap container is searched for the calculated hash value of the URL; if the hash value already exists, the URL address is deleted.

4. The parallelized data tracking method based on educational information topics according to claim 3, characterized in that in step S4 thresholds for page relevance and URL relevance are first set separately; the relevance between a page and a topic is calculated using a semantic-based vector space model method; URL relevance is calculated with the pagerank algorithm; pages and URLs whose relevance is below the threshold are deleted.