CN109635182A - Parallelization data tracking method based on educational information theme - Google Patents
- Publication number
- CN109635182A (application number CN201811571552.2A)
- Authority
- CN
- China
- Prior art keywords
- url
- page
- address
- educational
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
Abstract
The invention discloses a parallelized data tracking method based on educational information themes. The method collects multiple educational information themes simultaneously, improves collection efficiency, and tracks updates to the web pages already collected. The method comprises the steps of: constructing multiple parallel collection threads between the Web pages and the Spider collection database; collecting Web pages in parallel; parsing and downloading each page and extracting its information; removing the pages and URLs unrelated to all educational themes; deduplicating the remaining pages and URLs; saving the deduplicated pages to the educational information database; and extracting the URLs of the deduplicated pages, placing them in the collected-URL sequence, and supplying that sequence to a collector so that the pages are collected again. The method effectively improves collection efficiency and improves the accuracy and validity of theme information collection.
Description
Technical field
The present invention relates to the technical field of information processing, and in particular to a parallelized data tracking method based on educational information themes.
Background
As is well known, in recent years, with the development of the Internet and cloud computing technology, data has been generated at a speed and scale far beyond anything seen before. Massive data contains a great deal of value, and using that data quickly and effectively is a major challenge of the big data era.
Parallel computing is the process of using multiple computing resources simultaneously to solve a computational problem, and it is an effective means of improving the computing speed and processing capacity of a computer system. Its basic idea is to have multiple processors cooperate on the same problem: the problem is decomposed into several parts, and each part is computed in parallel by an independent processor. A parallel computing system can be either a specially designed supercomputer containing multiple processors or a cluster of several independent computers interconnected in some way. The parallel computing cluster processes the data and then returns the result to the user.
Existing focused crawler systems for educational information can generally collect only a single educational theme. To collect multiple educational themes, each theme must be collected separately and the per-theme databases then merged into a larger educational information database. Because the information for each theme is collected separately, collection efficiency is low. In addition, pages whose content is updated after collection cannot be collected again, so the collected pages cannot be tracked.
Summary of the invention
The technical problem to be solved by the invention is to provide a parallelized data tracking method based on educational information themes that collects multiple educational information themes simultaneously, improves collection efficiency, and tracks updates to the collected web page information.
The technical solution adopted by the invention to solve this technical problem is a parallelized data tracking method based on educational information themes, comprising the following steps:
S1. Construct multiple parallel collection threads between the Web pages and the Spider collection database.
S2. For each collection thread, construct two classes of buffer pool, a theme buffer pool positivePool and a non-theme buffer pool negtivePool, to store the URL entities, that is, the URL addresses in the URL address set. Both buffer pools are initialized to the empty set.
S3. Select seed sites according to the educational information theme assigned to each collection thread. These form the initial set of the search program Spider, that is, the URL address set.
S4. Perform Spider collection on the Web pages simultaneously through the multiple collection threads.
S5. Parse and download each collected Web page, extracting the URL addresses and the text information of the page.
S6. Calculate the relevance of each collected Web page to all educational themes, and calculate the relevance of each collected page URL address to all educational information themes.
When calculating the relevance of a Web page to the educational themes:
First calculate the relevance of the collected page to each educational information theme in turn. Then store each page relevant to an educational theme in the corresponding educational information theme database. Once the relevance to all educational themes has been calculated, filter out the pages unrelated to all educational themes.
When calculating the relevance of a Web page URL address to the educational themes:
First calculate the relevance of the collected page URL address to each educational information theme in turn. Then store each page URL address relevant to an educational theme in the corresponding educational information theme buffer pool. Once the relevance to all educational themes has been calculated, filter out the URL addresses unrelated to all educational themes.
S7. Deduplicate the pages in all educational information theme databases, that is, deduplicate the pages in each theme database independently and delete identical pages from it. Also deduplicate the URL addresses in all theme buffer pools, that is, deduplicate the URL address sequence in each theme buffer pool independently.
S8. Extract the URL addresses of all deduplicated pages in the theme databases, add the extracted URL addresses to the URL address sequence in the theme buffer pool, and deduplicate again, processing the URL address sequence in each theme buffer pool independently. Store the deduplicated pages in the corresponding educational information theme database.
Further, step S3 also comprises combining a fixed-site strategy with a buffer pool strategy to record the corresponding page addresses. All page addresses recorded by the module are finally supplied to the collection module for page collection. The fixed-site strategy searches only within manually selected websites. The buffer pool strategy places the collected page addresses in a buffer pool.
Preferably, in steps S7 and S8 the URL address sequence is deduplicated using a hash table:
All URL addresses are stored in a hashmap container, and the hash value of each URL is calculated by the strhash function.
The hashmap container is searched for the calculated hash value of the URL. If the hash value already exists, the URL address is deleted.
Further, in step S4 thresholds are first set for page relevance and for URL relevance. The relevance of a page to a theme is calculated using a semantic-based vector space model method, and URL relevance is calculated by the pagerank algorithm. Pages and URLs whose relevance is below the threshold are deleted.
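The source names the pagerank algorithm for URL relevance but gives no parameters. The following is a minimal sketch only, assuming a small in-memory link graph, a damping factor of 0.85, a fixed iteration count, and an arbitrary threshold of 0.2. The function names, the graph, and all URLs are hypothetical and are not taken from the patent.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {url: [outgoing urls]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                # Split this node's rank evenly among its outgoing links.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


def filter_urls(links, threshold):
    """Keep only URLs whose rank reaches the (assumed) threshold."""
    rank = pagerank(links)
    return {u for u, r in rank.items() if r >= threshold}


# Hypothetical link graph; the URLs are placeholders.
graph = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": ["http://c.example/"],
}
ranks = pagerank(graph)
kept = filter_urls(graph, threshold=0.2)
```

In this toy graph the page with no incoming links ends up below the threshold and is dropped, which mirrors the deletion of low-relevance URLs described above. A real threshold would have to be tuned to the size of the link graph, since ranks shrink as the number of pages grows.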
The beneficial effects of the invention are as follows. Because the parallelized data tracking method based on educational information themes constructs multiple parallel collection threads between the Web pages and the Spider collection database, the information of multiple educational themes can be collected simultaneously.
In addition, the collected pages are filtered against the multiple educational information themes one by one, which filters out the pages unrelated to all educational themes, and the collected page URL addresses are filtered against the multiple educational information themes, which filters out the URL addresses unrelated to all educational themes. This ensures the validity of the collected pages.
Furthermore, the URL addresses of the pages that pass relevance filtering are extracted, deduplicated, and stored in the URL address sequence of the theme buffer pool, so that the pages these URL addresses point to are more relevant to the theme information. The updated content of the pages at these URL addresses can also be collected again, which makes it possible to track the collected pages.
In summary, the parallelized data tracking method based on educational information themes improves the efficiency of theme information collection and improves the accuracy and validity of theme information collection.
Brief description of the drawings
Fig. 1 is a flowchart of the parallelized data tracking method based on educational information themes in an embodiment of the invention.
Detailed description
The invention is further described below with reference to the drawing and an embodiment.
As shown in Fig. 1, the parallelized data tracking method based on educational information themes according to the invention comprises the following steps:
S1. Construct multiple parallel collection threads between the Web pages and the Spider collection database.
S2. For each collection thread, construct two classes of buffer pool, a theme buffer pool positivePool and a non-theme buffer pool negtivePool, to store the URL entities, that is, the URL addresses in the URL address set. Both buffer pools are initialized to the empty set.
S3. Select seed sites according to the educational information theme assigned to each collection thread. These form the initial set of the search program Spider, that is, the URL address set.
S4. Perform Spider collection on the Web pages simultaneously through the multiple collection threads.
S5. Parse and download each collected Web page, extracting the URL addresses and the text information of the page.
S6. Calculate the relevance of each collected Web page to all educational themes, and calculate the relevance of each collected page URL address to all educational information themes.
When calculating the relevance of a Web page to the educational themes:
First calculate the relevance of the collected page to each educational information theme in turn. Then store each page relevant to an educational theme in the corresponding educational information theme database. Once the relevance to all educational themes has been calculated, filter out the pages unrelated to all educational themes.
When calculating the relevance of a Web page URL address to the educational themes:
First calculate the relevance of the collected page URL address to each educational information theme in turn. Then store each page URL address relevant to an educational theme in the corresponding educational information theme buffer pool. Once the relevance to all educational themes has been calculated, filter out the URL addresses unrelated to all educational themes.
S7. Deduplicate the pages in all educational information theme databases, that is, deduplicate the pages in each theme database independently and delete identical pages from it. Also deduplicate the URL addresses in all theme buffer pools, that is, deduplicate the URL address sequence in each theme buffer pool independently.
S8. Extract the URL addresses of all deduplicated pages in the theme databases, add the extracted URL addresses to the URL address sequence in the theme buffer pool, and deduplicate again, processing the URL address sequence in each theme buffer pool independently. Store the deduplicated pages in the corresponding educational information theme database.
In step S1, multiple parallel collection threads are constructed between the Web pages and the Spider collection database, which provides the infrastructure for collecting educational theme information in parallel.
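Steps S1 to S4 can be pictured with a short sketch. It is an illustration under stated assumptions rather than the patented implementation: the `Collector` class, the `fake_fetch` stand-in for a real download, the theme names, and the seed URLs are all hypothetical. Only the pool names positivePool and negtivePool come from the source.

```python
import threading


class Collector(threading.Thread):
    """One collection thread per educational theme (step S1)."""

    def __init__(self, theme, seeds, fetch):
        super().__init__(name=f"collector-{theme}")
        self.theme = theme
        # Step S2: both buffer pools start as empty sets.
        self.positivePool = set()   # theme-related URLs
        self.negtivePool = set()    # non-theme URLs (spelling as in the source)
        # Step S3: the seed sites form the initial URL address set.
        self.frontier = list(seeds)
        self.fetch = fetch
        self.pages = {}

    def run(self):
        # Step S4: collect every URL in this thread's initial set.
        for url in self.frontier:
            self.pages[url] = self.fetch(url)


def fake_fetch(url):
    # Stand-in for a real HTTP download; returns placeholder content.
    return f"<html>content of {url}</html>"


seeds_by_theme = {
    "exams": ["http://exams.example/index"],
    "courses": ["http://courses.example/a", "http://courses.example/b"],
}
collectors = [Collector(t, s, fake_fetch) for t, s in seeds_by_theme.items()]
for c in collectors:
    c.start()
for c in collectors:
    c.join()
collected = {c.theme: sorted(c.pages) for c in collectors}
```

Each thread owns its pools and its page store, so the threads in this sketch share no mutable state and need no locking. A design that shares pools across threads would need synchronization.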
Steps S2 to S4 collect the educational information themes from the Web pages. In step S5 each page is parsed and downloaded, and its URL addresses and text information are extracted, which provides the basis for the subsequent page relevance filtering.
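The extraction in step S5 could look roughly like the following, which uses Python's standard `html.parser` module. The source does not say how pages are parsed, so the `PageParser` class, the `parse_page` helper, and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []
        self.text_parts = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.urls.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    """Return (list of absolute URLs, text of the page)."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls, " ".join(parser.text_parts)


sample = ('<html><body><h1>Course list</h1>'
          '<a href="/math">Math</a> <a href="http://other.example/x">X</a>'
          '<script>var a = 1;</script></body></html>')
urls, text = parse_page("http://edu.example/index.html", sample)
```

The two outputs correspond to the two inputs of step S6: the text feeds the page relevance calculation, and the URL list feeds the URL relevance calculation.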
In step S6, the collected pages are filtered against the multiple educational information themes one by one, which filters out the pages unrelated to all educational themes, and the collected page URL addresses are filtered against the multiple educational information themes, which filters out the URL addresses unrelated to all educational themes. Page filtering and URL filtering together achieve complete filtering of the pages, ensuring the validity of the collected pages and their relevance to the themes.
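The one-by-one comparison in step S6 can be sketched as a routing loop. The keyword-overlap score below is a deliberately simple stand-in for the relevance calculation, and the theme names, keywords, threshold, and URLs are hypothetical.

```python
def relevance(text, keywords):
    """Fraction of theme keywords found in the text (a stand-in score)."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)


def route_pages(pages, themes, threshold):
    """Store each page under every theme it is relevant to (step S6)."""
    databases = {name: {} for name in themes}
    discarded = []
    for url, text in pages.items():
        matched = False
        for name, keywords in themes.items():   # compare with themes one by one
            if relevance(text, keywords) >= threshold:
                databases[name][url] = text
                matched = True
        if not matched:
            discarded.append(url)               # unrelated to all themes
    return databases, discarded


themes = {
    "exams": ["exam", "score", "admission"],
    "courses": ["course", "lesson", "syllabus"],
}
pages = {
    "http://edu.example/1": "national exam score release",
    "http://edu.example/2": "new course syllabus and lesson plan",
    "http://edu.example/3": "weather forecast for tomorrow",
}
databases, discarded = route_pages(pages, themes, threshold=0.5)
```

A page is discarded only after it has been compared with every theme, which matches the requirement that filtering happens once the relevance to all educational themes has been calculated. The same loop applies to URL addresses, with the theme buffer pools taking the place of the theme databases.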
In step S7 the pages are deduplicated and, at the same time, the URLs are deduplicated. This reduces data redundancy in the system, prevents efficiency from degrading as the system runs for a long time, and avoids storing invalid data.
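The source does not state how identical pages are recognized in step S7. One plausible approach, shown here purely as an assumption, is to compare SHA-256 digests of the page content within each theme database. The function names and sample data are hypothetical.

```python
import hashlib


def dedup_pages(database):
    """Remove pages whose content is identical to an earlier page."""
    seen = set()
    unique = {}
    for url, content in database.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[url] = content
    return unique


def dedup_all(databases):
    """Deduplicate every theme database independently (step S7)."""
    return {theme: dedup_pages(db) for theme, db in databases.items()}


databases = {
    "exams": {
        "http://edu.example/exam": "exam schedule",
        "http://mirror.example/exam": "exam schedule",   # identical content
    },
    "courses": {
        "http://edu.example/exam": "exam schedule",      # same page, other theme
    },
}
cleaned = dedup_all(databases)
```

Because each database is processed on its own, a page stored under two themes survives in both, while an identical copy within one theme is removed.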
In step S8 the URL addresses of the deduplicated pages in the theme databases are extracted, added to the deduplicated URL address sequence in the theme buffer pool, and deduplicated again. The deduplicated URL address sequence is stored in the theme buffer pool, so that the theme information of a page can be collected again after the page at a URL address is updated, which improves the comprehensiveness and accuracy of collection. The educational information theme databases obtained in step S8 can be published to provide services for education.
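The re-collection that enables tracking might be sketched as follows. The source does not describe how an updated page is detected, so comparing content digests is an assumption, as are the `track_updates` helper, the `fake_fetch` stand-in, and the sample data.

```python
import hashlib


def digest(content):
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def track_updates(database, positive_pool, fetch):
    """Re-collect pooled URLs and refresh pages whose content changed."""
    # Step S8: add the URLs of retained pages to the pool; a set deduplicates them.
    positive_pool.update(database)
    updated = []
    for url in sorted(positive_pool):
        new_content = fetch(url)
        old_content = database.get(url)
        if old_content is None or digest(old_content) != digest(new_content):
            database[url] = new_content
            updated.append(url)
    return updated


# Hypothetical stored pages and their current live versions.
database = {"http://edu.example/a": "v1", "http://edu.example/b": "same"}
pool = {"http://edu.example/a"}
live = {"http://edu.example/a": "v2", "http://edu.example/b": "same"}


def fake_fetch(url):
    # Stand-in for downloading the current version of a page.
    return live[url]


updated = track_updates(database, pool, fake_fetch)
```

Only the page whose live content differs from the stored copy is rewritten, so repeated runs over an unchanged site leave the database untouched.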
In summary, because the parallelized data tracking method based on educational information themes constructs multiple parallel collection threads between the Web pages and the Spider collection database, the information of multiple educational themes can be collected simultaneously.
In addition, the collected pages are filtered against the multiple educational information themes one by one, which filters out the pages unrelated to all educational themes, and the collected page URL addresses are filtered against the multiple educational information themes, which filters out the URL addresses unrelated to all educational themes. This ensures the validity of the collected pages.
Furthermore, the URL addresses of the pages that pass relevance filtering are extracted, deduplicated, and stored in the URL address sequence of the theme buffer pool, so that the pages these URL addresses point to are more relevant to the theme information. The updated content of the pages at these URL addresses can also be collected again, which makes it possible to track the collected pages.
Therefore, the parallelized data tracking method based on educational information themes improves the efficiency of theme information collection and improves the accuracy and validity of theme information collection.
Further, step S3 also comprises combining a fixed-site strategy with a buffer pool strategy to record the corresponding page addresses. All page addresses recorded by the module are finally supplied to the collection module for page collection. The fixed-site strategy searches only within manually selected websites. The buffer pool strategy places the collected page addresses in a buffer pool, which speeds up duplicate checking during collection.
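As a rough illustration of the two strategies, the sketch below keeps only addresses whose host belongs to a manually chosen list and records them in a set that serves as the buffer pool. The host list, the URLs, and the function names are hypothetical, and matching on the exact hostname is an assumption about what searching within a site means.

```python
from urllib.parse import urlparse


def in_site(url, allowed_hosts):
    """Fixed-site strategy: accept only URLs on manually selected hosts."""
    host = urlparse(url).hostname or ""
    return host in allowed_hosts


def record_addresses(candidates, allowed_hosts, buffer_pool):
    """Buffer pool strategy: record accepted addresses in the pool."""
    for url in candidates:
        if in_site(url, allowed_hosts):
            buffer_pool.add(url)   # set membership makes duplicate checks fast
    return buffer_pool


allowed = {"edu.example", "courses.example"}
candidates = [
    "http://edu.example/a",
    "http://ads.example/banner",
    "http://courses.example/list",
    "http://edu.example/a",        # repeated address
]
pool = record_addresses(candidates, allowed, set())
```

Using a set for the pool means that checking whether an address has already been recorded takes constant time on average, which is one way to obtain the faster duplicate checking mentioned above.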
To simplify deduplication and improve its efficiency, the URL address sequence is preferably deduplicated using a hash table in steps S7 and S8:
All URL addresses are stored in a hashmap container, and the hash value of each URL is calculated by the strhash function.
The hashmap container is searched for the calculated hash value of the URL. If the hash value already exists, the URL address is deleted.
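The hash-table deduplication can be sketched in a few lines. The source names a strhash function without defining it, so the polynomial hash below (multiplier 131, 32-bit result) is an assumed stand-in, and a Python dict plays the role of the hashmap container. The URLs are placeholders.

```python
def strhash(s, seed=131, bits=32):
    """Polynomial string hash; a stand-in for the unspecified strhash."""
    mask = (1 << bits) - 1
    h = 0
    for ch in s:
        h = (h * seed + ord(ch)) & mask
    return h


def dedup_urls(urls):
    """Drop every URL whose hash value is already in the hashmap."""
    hashmap = {}
    for url in urls:
        h = strhash(url)
        if h in hashmap:
            continue          # hash already present: treat as duplicate
        hashmap[h] = url
    return list(hashmap.values())


sequence = [
    "http://edu.example/a",
    "http://edu.example/b",
    "http://edu.example/a",
]
unique = dedup_urls(sequence)
```

As in the description above, a URL is dropped as soon as its hash value is found, so two distinct URLs that happen to collide would be treated as duplicates. Storing the URL alongside the hash and comparing the strings on a match would avoid that at a small cost.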
To improve the accuracy of the relevance calculation and ensure that the collected pages are relevant to the themes, preferably, in step S6 thresholds are first set for page relevance and for URL relevance. Page relevance is calculated using a semantic-based vector space model method, and URL relevance is calculated by the pagerank algorithm. Pages and URLs whose relevance is below the threshold are deleted.
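For the page relevance side, a vector space model comparison can be illustrated with term-frequency vectors and cosine similarity. This is a simplified sketch: it uses plain word counts and does not implement the semantic component the source mentions, and the theme description, threshold, and URLs are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector of a text, keyed by lowercase word."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def filter_pages(pages, theme_text, threshold):
    """Keep pages whose similarity to the theme reaches the threshold."""
    theme_vec = tf_vector(theme_text)
    return {url: text for url, text in pages.items()
            if cosine(tf_vector(text), theme_vec) >= threshold}


theme_text = "exam admission score university"
pages = {
    "http://edu.example/1": "university admission exam score announced",
    "http://edu.example/2": "football match result",
}
kept = filter_pages(pages, theme_text, threshold=0.3)
self_similarity = cosine(tf_vector(theme_text), tf_vector(theme_text))
```

A page sharing most of its vocabulary with the theme description scores close to 1, and a page with no shared words scores 0 and is deleted. The threshold decides how much overlap is required.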
Claims (4)
1. A parallelized data tracking method based on educational information themes, characterized by comprising the following steps:
S1. Construct multiple parallel collection threads between the Web pages and the Spider collection database.
S2. For each collection thread, construct two classes of buffer pool, a theme buffer pool positivePool and a non-theme buffer pool negtivePool, to store the URL entities, that is, the URL addresses in the URL address set. Both buffer pools are initialized to the empty set.
S3. Select seed sites according to the educational information theme assigned to each collection thread. These form the initial set of the search program Spider, that is, the URL address set.
S4. Perform Spider collection on the Web pages simultaneously through the multiple collection threads.
S5. Parse and download each collected Web page, extracting the URL addresses and the text information of the page.
S6. Calculate the relevance of each collected Web page to all educational themes, and calculate the relevance of each collected page URL address to all educational information themes.
When calculating the relevance of a Web page to the educational themes:
First calculate the relevance of the collected page to each educational information theme in turn. Then store each page relevant to an educational theme in the corresponding educational information theme database. Once the relevance to all educational themes has been calculated, filter out the pages unrelated to all educational themes.
When calculating the relevance of a Web page URL address to the educational themes:
First calculate the relevance of the collected page URL address to each educational information theme in turn. Then store each page URL address relevant to an educational theme in the corresponding educational information theme buffer pool. Once the relevance to all educational themes has been calculated, filter out the URL addresses unrelated to all educational themes.
S7. Deduplicate the pages in all educational information theme databases, that is, deduplicate the pages in each theme database independently and delete identical pages from it. Also deduplicate the URL addresses in all theme buffer pools, that is, deduplicate the URL address sequence in each theme buffer pool independently.
S8. Extract the URL addresses of all deduplicated pages in the theme databases, add the extracted URL addresses to the URL address sequence in the theme buffer pool, and deduplicate again, processing the URL address sequence in each theme buffer pool independently. Store the deduplicated pages in the corresponding educational information theme database.
2. The parallelized data tracking method based on educational information themes according to claim 1, characterized in that step S3 also comprises combining a fixed-site strategy with a buffer pool strategy to record the corresponding page addresses. All page addresses recorded by the module are finally supplied to the collection module for page collection. The fixed-site strategy searches only within manually selected websites. The buffer pool strategy places the collected page addresses in a buffer pool.
3. The parallelized data tracking method based on educational information themes according to claim 2, characterized in that in steps S7 and S8 the URL address sequence is deduplicated using a hash table:
All URL addresses are stored in a hashmap container, and the hash value of each URL is calculated by the strhash function.
The hashmap container is searched for the calculated hash value of the URL. If the hash value already exists, the URL address is deleted.
4. The parallelized data tracking method based on educational information themes according to claim 3, characterized in that in step S4 thresholds are first set for page relevance and for URL relevance. The relevance of a page to a theme is calculated using a semantic-based vector space model method, and URL relevance is calculated by the pagerank algorithm. Pages and URLs whose relevance is below the threshold are deleted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811571552.2A CN109635182A (en) | 2018-12-21 | 2018-12-21 | Parallelization data tracking method based on educational information theme |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811571552.2A CN109635182A (en) | 2018-12-21 | 2018-12-21 | Parallelization data tracking method based on educational information theme |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109635182A true CN109635182A (en) | 2019-04-16 |
Family
ID=66076350
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811571552.2A Pending CN109635182A (en) | 2018-12-21 | 2018-12-21 | Parallelization data tracking method based on educational information theme |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109635182A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561814A (en) * | 2009-05-08 | 2009-10-21 | 华中科技大学 | Topic crawler system based on social labels |
CN101605129A (en) * | 2009-06-23 | 2009-12-16 | 北京理工大学 | A kind of URL lookup method that is used for the url filtering system |
CN102662954A (en) * | 2012-03-02 | 2012-09-12 | 杭州电子科技大学 | Method for implementing topical crawler system based on learning URL string information |
CN103310013A (en) * | 2013-07-02 | 2013-09-18 | 北京航空航天大学 | Subject-oriented web page collection system |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN104809182A (en) * | 2015-04-17 | 2015-07-29 | 东南大学 | Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190416 |