CN105320740B

CN105320740B - Method and system for acquiring WeChat articles and official accounts

Info

Publication number: CN105320740B
Application number: CN201510609672.7A
Authority: CN
Inventors: 薛波; 薛一波; 易成岐; 郭泽豪
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2015-09-22
Filing date: 2015-09-22
Publication date: 2018-10-16
Anticipated expiration: 2035-09-22
Also published as: CN105320740A

Abstract

The present invention provides a method and system for acquiring WeChat articles and official accounts. On top of normal crawling, the crawler connects to a third-party platform to recognize CAPTCHAs, which solves the CAPTCHA problem that arises when querying Sogou search and keeps the crawler running stably. In addition, the crawler performs Bloom filtering on article IDs and official-account WeChat IDs, so that the WeChat crawler does not fail when URLs on the Sogou search platform change. An incremental list records the state of the crawler's last update, which enables incremental crawling and improves crawler efficiency. The invention can crawl WeChat official accounts and articles efficiently, stably, and comprehensively, and has good usability.

Description

Method and system for acquiring WeChat articles and official accounts

Technical field

The invention belongs to the technical field of data acquisition, and in particular relates to a method and system for acquiring WeChat articles and official accounts.

Background technology

The 2015 WeChat performance report published by Tencent shows that WeChat has more than 500 million monthly active users, covering more than 200 countries and more than 20 languages. Official accounts are one of WeChat's main businesses: in November 2013 the number of WeChat official accounts exceeded 2 million, in July 2014 it reached 5.8 million, in December 2014 the total exceeded 8 million, and it now exceeds 10 million. Official accounts grow their follower base mainly by pushing articles, so that advertisers can place advertisements in accounts that attract relatively high attention. According to statistics, nearly 80% of WeChat users follow official accounts. Most of these users follow the official accounts of enterprises and media, a share of 73.4%. 41.1% of users follow official accounts to obtain information, 36.9% for convenience in daily life, and 13.7% to learn. How to extract and make effective use of WeChat data is both an opportunity and a challenge.

WeChat data acquisition is the basis of WeChat data analysis. WeChat data mainly comprise official-account information and article information, and are mainly collected by web crawlers. A web crawler, also known as a web robot or web spider, is a script or program that automatically fetches Internet resources according to a given strategy.

Sogou WeChat Search is a search engine for WeChat official accounts that Sogou launched on June 9, 2014. It supports keyword search over WeChat official accounts and the articles they push. With Sogou search formally connected to WeChat official accounts, official accounts were displayed on the open web for the first time.

In summary, WeChat extends users' social circles as a social platform, official accounts are one of its main businesses, and the very large number of official accounts has great potential research value. The connection of Sogou search to WeChat data also makes it possible to acquire that data. However, there is currently no technical solution that acquires WeChat articles and official accounts efficiently, stably, and comprehensively.

Summary of the invention

(1) Technical problem to be solved

The technical problem to be solved by the present invention is how to acquire WeChat articles and official accounts efficiently, stably, and comprehensively.

(2) Technical solution

To solve the above technical problem, the present invention provides a method for acquiring WeChat articles and official accounts, the method comprising the following steps:

S1. Obtain the keywords required for WeChat retrieval; for each keyword, build one or more search URLs, and put the constructed search URLs into a request queue;

S2. Start the crawler component and, for a keyword, crawl each search URL and the uncrawled URLs on the search URL pages:

S21. Determine whether the currently crawled page is a CAPTCHA page; if it is, execute step S22; otherwise execute step S23;

S22. Obtain the CAPTCHA of the current page and upload it to a third-party platform, which recognizes the CAPTCHA; then submit the CAPTCHA by simulating the CAPTCHA submission form, and execute step S21;

S23. Determine whether the URL of the currently crawled page is one of the search URLs corresponding to the current keyword; if so, execute step S24; otherwise execute step S30;

S24. Use Bloom filtering to screen out the URLs of the uncrawled WeChat articles on the currently crawled page and the URLs of the corresponding WeChat official accounts, and put them into the request queue; for each of these article URLs and official-account URLs, execute step S21;

S25. Determine whether the currently crawled page is the page corresponding to the first search URL of the current keyword; if so, execute step S26; otherwise execute step S27;

S26. Obtain the ID of the first article on the currently crawled page and update it into the incremental list, then execute step S27; the incremental list stores, for each keyword, the ID of the first article on the first page;

S27. Determine whether the currently crawled page is the page corresponding to the last search URL of the current keyword; if so, crawling of the current keyword is complete and step S29 is executed; otherwise execute step S28;

S28. Determine whether the page corresponding to the next search URL after the currently crawled page has already been crawled; if so, execute step S29; otherwise put that next search URL into the request queue and execute step S21;

S29. Determine whether the current keyword is the last keyword; if so, terminate the crawler; otherwise execute step S21 to crawl the search URLs of the next keyword and the uncrawled URLs on their pages;

S30. Parse the WeChat official account or WeChat article from the currently crawled page, process the parsed account or article, and then store it.

Preferably, the method further comprises a step of loading the incremental list and a step of initializing the Bloom filter.

Preferably, in step S1 the search URLs of the first keyword are put into the request queue, and in step S27, after crawling of the current keyword is complete, the search URLs of the next keyword are put into the request queue.

Preferably, in step S24, before Bloom filtering is used to screen out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, the method further comprises the following step:

storing the URL of the currently crawled page and the current keyword in a breakpoint log.

Preferably, in step S26, before the ID of the first article on the currently crawled page is obtained, the method further comprises the following step:

downloading the incremental list to a local file, obtaining from the downloaded incremental list the keyword stored last time and the ID of the first article on the page of its first search URL, and storing them in memory.

A system for acquiring WeChat articles and official accounts, the system comprising:

A search URL building module, for obtaining the keywords required for WeChat retrieval, building multiple search URLs for each keyword, and putting the constructed search URLs into a request queue;

A page parsing module, for crawling, for a keyword, each search URL and the uncrawled URLs on the search URL pages. The page parsing module determines whether the URL of the currently crawled page is one of the search URLs corresponding to the current keyword; if so, it uses the Bloom filter to screen out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, puts them into the request queue, and then crawls them; if not, it parses the currently crawled page to obtain a WeChat official account or WeChat article. The page parsing module also determines whether the currently crawled page is the page of the first search URL of the current keyword; if so, it obtains the ID of the first article on that page and updates it into the incremental list; if not, it determines whether the currently crawled page is the page of the last search URL of the current keyword. If it is, crawling of the current keyword is complete and crawling proceeds to the search URLs of the next keyword and the uncrawled URLs on their pages; if it is not, the module determines whether the page of the next search URL has already been crawled. If it has, crawling of the current keyword is complete and crawling proceeds to the next keyword; otherwise the next search URL is put into the request queue and crawled;

A parse-processing module, for processing the WeChat official account or WeChat article parsed by the page parsing module;

A database module, for storing the data processed by the parse-processing module;

A CAPTCHA-solving module, for determining whether the currently crawled page is a CAPTCHA page and, if it is, obtaining the CAPTCHA of the current page, uploading it to a third-party platform that recognizes the CAPTCHA, and then submitting the CAPTCHA by simulating the CAPTCHA submission form.

Preferably, the system further comprises an initialization module for initializing the Bloom filter, the search URL building module, the page parsing module, the parse-processing module, the database module, and the CAPTCHA-solving module.

Preferably, the search URL building module first builds the search URLs of the first keyword and puts them into the request queue, and builds the search URLs of the next keyword and puts them into the request queue only after crawling of a keyword is complete.

Preferably, the system further comprises a log module for storing the URL of the currently crawled page and the current keyword.

Preferably, the system further comprises an incremental file for storing the ID of the first article on the first page of each keyword.

(3) Beneficial effects

The present invention provides a method and system for acquiring WeChat articles and official accounts. It uses an incremental list to ensure incremental crawling of WeChat articles and official accounts; it uses a third-party platform to solve the CAPTCHA problem that arises when querying Sogou search, which keeps the crawler running stably; and the crawler performs Bloom filtering on article IDs and official-account WeChat IDs, so that the WeChat crawler system does not fail when URLs change. The method and system can crawl WeChat official accounts and articles efficiently, stably, and comprehensively, and have good usability.

Description of the drawings

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Figures 1A and 1B are flow charts of the method for acquiring WeChat articles and official accounts according to a preferred embodiment of the invention;

Figure 2 is a flow chart of the method for acquiring WeChat articles and official accounts according to another preferred embodiment of the invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the invention but are not intended to limit its scope.

A method for acquiring WeChat articles and official accounts, as shown in Figures 1A and 1B, comprises the following steps:

S1. Obtain the keywords required for WeChat retrieval; for each keyword, build multiple search URLs, and put the constructed search URLs into a request queue. Here a search URL is a URL that retrieves a list of WeChat articles for a keyword.

S24. Perform Bloom filtering on WeChat article IDs and WeChat IDs to screen out the URLs of the uncrawled WeChat articles on the currently crawled page and the URLs of the corresponding WeChat official accounts, and put them into the request queue; for each of these article URLs and official-account URLs, execute step S21;

The above method uses an incremental list to ensure incremental crawling of WeChat articles and official accounts. It uses a third-party platform to solve the CAPTCHA problem that arises when querying Sogou search, which keeps the crawler running stably. In addition, the crawler performs Bloom filtering on article IDs and official-account WeChat IDs, so that the WeChat crawler system does not fail when URLs on the Sogou search platform change.

Further, the method also comprises a step of loading the incremental list and a step of initializing the Bloom filter.

Further, in step S1 the search URLs of the first keyword are put into the request queue, and in step S27, after crawling of the current keyword is complete, the search URLs of the next keyword are put into the request queue. In this refinement, one keyword is crawled to completion before the URLs of another keyword are put into the request queue and processed, so that web page data can be crawled more accurately and completely. The search URL of the first keyword in step S1 can therefore be called the seed URL; the seed URL is a URL for searching WeChat articles.
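The keyword-by-keyword queueing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the endpoint `https://search.example.com/weixin`, the query parameter names, and the helper names `build_search_urls` and `seed_queue` are all assumptions, since the text does not give the real search URL format.

```python
from collections import deque
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names; the source does not specify them.
SEARCH_ENDPOINT = "https://search.example.com/weixin"


def build_search_urls(keyword, max_pages):
    """Return one search URL per result page for the given keyword."""
    return [
        SEARCH_ENDPOINT + "?" + urlencode({"query": keyword, "page": page})
        for page in range(1, max_pages + 1)
    ]


def seed_queue(keywords, max_pages):
    """Queue only the first keyword's search URLs (the seed URLs).

    Returns the request queue and the keywords still waiting, so the
    caller can queue the next keyword once the current one is finished.
    """
    remaining = deque(keywords)
    queue = deque(build_search_urls(remaining.popleft(), max_pages))
    return queue, remaining
```

Holding back the later keywords keeps the request queue limited to one keyword at a time, which matches the ordering the paragraph describes.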

Further, in step S24, before Bloom filtering is used to screen out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, the method also comprises the following step: storing the URL of the currently crawled page and the current keyword in a breakpoint log. The breakpoint log is a file that records the running state of the crawler; it is used for failure recovery, so that the crawler can resume crawling from the breakpoint.
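A breakpoint log of this kind could be kept as a small JSON file, as in the sketch below. The file format and the function names `save_breakpoint` and `load_breakpoint` are assumptions made for illustration; the text only says that the page URL and the keyword are recorded.

```python
import json
import os
import tempfile


def save_breakpoint(path, keyword, page_url):
    """Record the keyword and page URL being crawled (assumed JSON format).

    The state is written to a temporary file and then renamed, so a crash
    during the write does not leave a half-written log behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"keyword": keyword, "url": page_url}, handle, ensure_ascii=False)
    os.replace(tmp_path, path)


def load_breakpoint(path):
    """Return the last recorded state, or None if no breakpoint log exists."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

On restart, a crawler could call `load_breakpoint` first and, if it returns a state, resume from that keyword and page instead of starting over.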

Further, in step S26, before the ID of the first article on the currently crawled page is obtained, the method also comprises the following step: downloading the incremental list to a local file, obtaining from the downloaded incremental list the keyword stored last time and the ID of the first article on the page of its first search URL, and storing them in memory.

Corresponding to the above method, there is a system for acquiring WeChat articles and official accounts, the system comprising:

A page parsing module, that is, a component that parses web page content by means of selectors. It crawls, for a keyword, each search URL and the uncrawled URLs on the search URL pages. The page parsing module determines whether the URL of the currently crawled page is one of the search URLs corresponding to the current keyword; if so, it uses the Bloom filter to screen out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, puts them into the request queue, and then crawls them; if not, it parses the currently crawled page to obtain a WeChat official account or WeChat article. The page parsing module also determines whether the currently crawled page is the page of the first search URL of the current keyword; if so, it obtains the ID of the first article on that page and updates it into the incremental list; if not, it determines whether the currently crawled page is the page of the last search URL of the current keyword. If it is, crawling of the current keyword is complete and crawling proceeds to the search URLs of the next keyword and the uncrawled URLs on their pages; if it is not, the module determines whether the page of the next search URL has already been crawled. If it has, crawling of the current keyword is complete and crawling proceeds to the next keyword; otherwise the next search URL is put into the request queue and crawled;

A database module, for storing the data processed by the parse-processing module; it is the component that persists data, and is also used for data query, data storage, and batch storage.

A CAPTCHA-solving module, for determining whether the currently crawled page is a CAPTCHA page and, if it is, obtaining the CAPTCHA of the current page, uploading it to a third-party platform that recognizes the CAPTCHA, and then submitting the CAPTCHA by simulating the CAPTCHA submission form; if verification fails, the program keeps retrying until verification succeeds.
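The retry behaviour of the CAPTCHA-solving module can be sketched as a loop around four caller-supplied callables. Everything here is hypothetical: `download`, `is_captcha_page`, `recognize` (standing in for the third-party platform), and `submit` (standing in for the simulated form submission) are placeholders, and the `max_attempts` bound is an added safety limit that the text does not mention.

```python
def fetch_past_captcha(url, download, is_captcha_page, recognize, submit,
                       max_attempts=5):
    """Download `url`, solving CAPTCHA pages until a normal page comes back.

    Hypothetical callables supplied by the caller:
      download(url) -> page
      is_captcha_page(page) -> bool
      recognize(page) -> str       # third-party recognition service
      submit(page, code) -> None   # simulated CAPTCHA form submission
    """
    for _ in range(max_attempts):
        page = download(url)
        if not is_captcha_page(page):
            return page
        # CAPTCHA page: recognize the code, submit it, then request again.
        submit(page, recognize(page))
    raise RuntimeError("CAPTCHA not solved after %d attempts" % max_attempts)
```

The text describes retrying until verification succeeds; bounding the number of attempts is a design choice of this sketch, intended to avoid an endless loop when the recognition service is unavailable.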

The page parsing module is mainly responsible for executing different page-parsing logic for different URLs. URLs generally include list URLs (that is, search URLs) and detail URLs (that is, WeChat article URLs and WeChat official-account URLs). Detail URLs are parsed out of list URLs and put into the request queue, and detailed content is parsed from detail URLs. This is the most important module: it covers web page parsing, incremental update, breakpoint logging, and Bloom filtering, and it must also apply the appropriate handling logic for normal URLs and for various abnormal URLs.
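The list-versus-detail dispatch can be illustrated with a short routing function. The names `handle_page`, `extract_detail_urls`, and `parse_detail` are placeholders invented for this sketch; the actual selectors and parsing rules are not given in the text.

```python
def handle_page(url, page, search_urls, extract_detail_urls, parse_detail,
                request_queue, results):
    """Route a downloaded page to list parsing or detail parsing.

    `extract_detail_urls(page)` and `parse_detail(page)` are hypothetical
    callables supplied by the caller.
    """
    if url in search_urls:
        # List URL (search URL): queue the detail URLs found on the page.
        request_queue.extend(extract_detail_urls(page))
    else:
        # Detail URL: parse the article or official-account content.
        results.append(parse_detail(page))
```

Abnormal URLs, which the paragraph says need their own handling, are left out of this sketch.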

Further, the system also comprises an initialization module for initializing the Bloom filter, the search URL building module, the page parsing module, the parse-processing module, the database module, and the CAPTCHA-solving module.

Further, the Bloom filter is a component comprising a fixed-length binary vector and a series of random mapping functions. Because URLs on the Sogou search platform may change while WeChat article IDs and official-account WeChat IDs do not, the Bloom algorithm is applied to article IDs and official-account WeChat IDs to deduplicate them and thereby find the WeChat article URLs and official-account URLs that have not yet been crawled.
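A minimal Bloom filter keyed on stable IDs rather than URLs might look like the following. The bit-array size, the number of hash functions, the use of salted SHA-256 as the mapping functions, and the helper `filter_uncrawled` are all assumptions of this sketch; the text specifies only a fixed-length binary vector and a set of mapping functions.

```python
import hashlib


class BloomFilter:
    """Fixed-length bit vector with k hash functions (sizes are assumptions)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def filter_uncrawled(items, seen):
    """Given (stable_id, url) pairs, return the URLs whose ID is not yet seen."""
    fresh = []
    for stable_id, url in items:
        if stable_id not in seen:
            seen.add(stable_id)
            fresh.append(url)
    return fresh
```

Because membership is tested on the article ID or WeChat ID, the same article reached through a changed URL is still recognized as already crawled. A Bloom filter can report false positives, so a small fraction of new items may be skipped; that is the usual trade-off for its compact memory use.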

Further, the search URL building module first builds the search URLs of the first keyword and puts them into the request queue, and builds the search URLs of the next keyword and puts them into the request queue only after crawling of a keyword is complete.

Further, the system also comprises a log module for storing the URL of the currently crawled page and the current keyword; it is initialized by the initialization module described above.

Further, the system also comprises an incremental file for storing the ID of the first article on the first page of each keyword. Each time the crawler finishes updating a keyword, the incremental file is synchronized with the incremental list (an in-memory linked list that, at program initialization, holds the state of the crawler's last update); it is used for the crawler's incremental updates.
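The interplay between the incremental file and the in-memory incremental list could be sketched as below. Two things are assumptions here and are not stated in the text: the file is stored as JSON mapping each keyword to its first article ID, and search results are ordered newest first, so that everything listed before the previously recorded first ID is new. The function names are likewise illustrative.

```python
import json
import os


def load_incremental_list(path):
    """Load {keyword: first_article_id} saved by the previous run, if any."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def articles_since_last_run(article_ids, last_first_id):
    """Return the IDs listed before the previously recorded first ID.

    Assumes newest-first ordering; if the recorded ID is absent (or None),
    every ID is treated as new.
    """
    fresh = []
    for article_id in article_ids:
        if article_id == last_first_id:
            break
        fresh.append(article_id)
    return fresh


def sync_incremental_list(path, incremental):
    """Write the in-memory incremental list back to the incremental file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(incremental, handle, ensure_ascii=False)
```

A run would load the file once at start-up, update the in-memory entry for a keyword after its first page is crawled, and call `sync_incremental_list` when that keyword is finished.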

Further, the system also comprises a download component, that is, a component that downloads web page content for a URL. It configures and creates a request from the request headers, obtains the response content, and sets the response headers as the request header information of the next request, including the cookie information; it is initialized by the initialization module described above.
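Carrying cookies from one response into the next request can be illustrated with the standard-library `http.cookies.SimpleCookie` parser. The function name `next_request_headers` and the idea of passing in the raw `Set-Cookie` values are assumptions of this sketch; the text does not say how the download component represents headers.

```python
from http.cookies import SimpleCookie


def next_request_headers(base_headers, set_cookie_values):
    """Build headers for the next request from the previous response's cookies.

    `set_cookie_values` is the list of raw Set-Cookie header values taken
    from the previous response.
    """
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # attributes such as Path are parsed and not re-sent
    headers = dict(base_headers)
    if jar:
        headers["Cookie"] = "; ".join(
            f"{name}={morsel.value}" for name, morsel in jar.items()
        )
    return headers
```

In a real crawler the same effect is often obtained by letting an HTTP client manage a cookie jar; the explicit version above just makes the header hand-off visible.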

The above method and system can acquire information on WeChat articles and WeChat official accounts efficiently, stably, and comprehensively.

The method of the present invention is illustrated below with a specific embodiment:

A method for acquiring WeChat articles and official accounts, as shown in Figure 2, comprises the following steps:

S1. Obtain the keywords required for the WeChat search;

S2. Initialize the seed URL, the download component, the page parsing module, the parse-processing module, the database module, the Bloom filter, and so on, and load the log module and the incremental file;

S3. Assemble the crawler components (crawler initialization) and start the crawler;

S4. Determine whether the crawled page is a CAPTCHA page; if so, go to S5; otherwise go to S8;

S5. Upload the CAPTCHA to the third-party platform for recognition;

S6. Submit the CAPTCHA by simulating the CAPTCHA submission form;

S7. Request and download the page again, and go to step S4;

S8. Determine whether the URL of the crawled page is a search URL; if so, go to S9; otherwise go to S20;

S9. Record the current running state of the crawler by storing the page number currently being crawled and the keyword in the breakpoint log;

S10. Perform Bloom filtering on article IDs and official-account WeChat IDs, and put the URLs of the uncrawled WeChat articles and official accounts on this page into the request queue;

S11. Determine whether the crawled page is the first page; if so, go to S12; otherwise go to S15;

S12. Store the incremental list in a local file;

S13. Obtain from the incremental list the ID of the first article on the keyword's first page at the time of the last update, and cache it in a memory variable;

S14. Obtain the ID of the first article on the keyword's first page and update the incremental list;

S15. Determine whether the crawled page is the last page; if so, go to S17; otherwise go to S16;

S16. Determine whether the next page has already been crawled; if so, go to S17; otherwise go to S18;

S17. Determine whether the keyword is the last keyword; if so, end; otherwise go to S19;

S18. Put the keyword's next-page URL into the request queue, and return to step S4;

S19. Put the search URL of the next keyword into the request queue, and return to step S4;

S20. Parse the WeChat official account or WeChat article from the crawled page;

S21. Store the WeChat official account or WeChat article;

S22. Synchronize the local deduplication file.
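The control flow of steps S8 to S21 can be approximated by the following in-memory sketch. It is a simplified reading of the flow chart, not the patented implementation: CAPTCHA handling (S4 to S7), the breakpoint log (S9), and file synchronization (S12, S22) are omitted, a plain `set` stands in for the Bloom filter, and `pages_for` and `fetch` are hypothetical callables supplied by the caller.

```python
from collections import deque


def crawl(keywords, pages_for, fetch, seen_ids):
    """Keyword-by-keyword crawl over caller-supplied data.

    pages_for(keyword) -> list of search URLs for that keyword
    fetch(url) -> list of (stable_id, detail_url) pairs for a search URL,
                  or a parsed record for a detail URL
    seen_ids -> set-like store with `in` and `add` (a Bloom filter in the text)
    """
    incremental, stored = {}, []
    for keyword in keywords:                       # S17/S19: next keyword
        search_urls = pages_for(keyword)
        queue = deque(search_urls[:1])             # seed URL of this keyword
        crawled = set()
        while queue:
            url = queue.popleft()
            crawled.add(url)
            content = fetch(url)
            if url not in search_urls:             # S8 -> S20, S21
                stored.append(content)
                continue
            for stable_id, detail_url in content:  # S10: dedup on stable IDs
                if stable_id not in seen_ids:
                    seen_ids.add(stable_id)
                    queue.append(detail_url)
            index = search_urls.index(url)
            if index == 0 and content:             # S11, S14: first page
                incremental[keyword] = content[0][0]
            if index + 1 < len(search_urls):       # S15: not the last page
                next_url = search_urls[index + 1]
                if next_url not in crawled:        # S16, S18
                    queue.append(next_url)
    return stored, incremental
```

With two result pages that share one article ID, the shared article is fetched once and the first ID on the first page is recorded for the keyword.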

The above embodiments merely illustrate the present invention and do not limit it. Although the invention has been described in detail with reference to the embodiments, a person of ordinary skill in the art will understand that combinations, modifications, or equivalent replacements of the technical solutions of the invention that do not depart from the spirit and scope of those technical solutions shall all fall within the scope of the claims of the invention.

Claims

1. A method for acquiring WeChat articles and official accounts, characterized in that the method comprises the following steps:

S1. Obtain the keywords required for WeChat retrieval; for each keyword, build one or more search URLs, and put the constructed search URLs into a request queue;

S2. Start the crawler component and crawl each search URL of each keyword and the uncrawled URLs on the pages corresponding to the search URLs; the crawling in step S2 specifically comprises:

S21. Determine whether the currently crawled page is a CAPTCHA page; if it is, execute step S22; otherwise execute step S23;

S22. Obtain the CAPTCHA of the current page and upload it to a third-party platform, which recognizes the CAPTCHA; then submit the CAPTCHA by simulating the CAPTCHA submission form, and execute step S21;

S23. Determine whether the URL of the currently crawled page is one of the search URLs corresponding to the current keyword; if so, execute step S24; otherwise execute step S30;

S24. Perform Bloom filtering on WeChat article IDs and WeChat IDs to screen out the URLs of the uncrawled WeChat articles on the currently crawled page and the URLs of the corresponding WeChat official accounts, and put them into the request queue; for each of these article URLs and official-account URLs, execute step S21;

S25. Determine whether the currently crawled page is the page corresponding to the first search URL of the current keyword; if so, execute step S26; otherwise execute step S27;

S26. Obtain the ID of the first article on the currently crawled page and update it into the incremental list, then execute step S27; the incremental list stores, for each keyword, the ID of the first article on the first page;

S27. Determine whether the currently crawled page is the page corresponding to the last search URL of the current keyword; if so, crawling of the current keyword is complete and step S29 is executed; otherwise execute step S28;

S28. Determine whether the page corresponding to the next search URL after the currently crawled page has already been crawled; if so, execute step S29; otherwise put that next search URL into the request queue and execute step S21;

S29. Determine whether the current keyword is the last keyword; if so, terminate the crawler; otherwise execute step S21 to crawl the search URLs of the next keyword and the uncrawled URLs on their pages;

S30. Parse the WeChat official account or WeChat article from the currently crawled page, process the parsed account or article, and then store it.

2. The method according to claim 1, characterized in that, before the incremental list is updated in step S26, the method further comprises a step of loading the incremental list.

3. The method according to claim 1, characterized in that in step S1 the search URLs of the first keyword are put into the request queue, and in step S27, after crawling of the current keyword is complete, the search URLs of the next keyword are put into the request queue.

4. The method according to claim 1, characterized in that, in step S24, before Bloom filtering is used to screen out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, the method further comprises the following step:

storing the URL of the currently crawled page and the current keyword in a breakpoint log.

5. The method according to claim 1, characterized in that, in step S26, before the ID of the first article on the currently crawled page is obtained, the method further comprises the following step:

downloading the incremental list to a local file, obtaining from the downloaded incremental list the keyword stored last time and the ID of the first article on the page of its first search URL, and storing them in memory.

6. A system for acquiring WeChat articles and official accounts, characterized in that the system comprises:

A search URL building module, for obtaining the keywords required for WeChat retrieval, building multiple search URLs for each keyword, and putting the constructed search URLs into a request queue;

A page parsing module, for crawling, for a keyword, each search URL and the uncrawled URLs on the search URL pages; the page parsing module determines whether the URL of the currently crawled page is one of the search URLs corresponding to the current keyword; if it is, the module uses a Bloom filter to screen out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, puts them into the request queue, and then crawls them; if it is not, the module parses the currently crawled page to obtain a WeChat official account or WeChat article; the page parsing module also determines whether the currently crawled page is the page corresponding to the first search URL of the current keyword; if it is, the module obtains the ID of the first article on the currently crawled page and updates it into the incremental list; if it is not, the module determines whether the currently crawled page is the page corresponding to the last search URL of the current keyword; if it is, crawling of the current keyword is complete and crawling proceeds to the search URLs of the next keyword and the uncrawled URLs on their pages; if it is not, the module determines whether the page corresponding to the next search URL after the currently crawled page has already been crawled; if it has, crawling of the current keyword is complete and crawling proceeds to the search URLs of the next keyword and the uncrawled URLs on their pages; if it has not, the next search URL is put into the request queue and crawled;

A parse-processing module, for processing the WeChat official account or WeChat article parsed by the page parsing module;

A CAPTCHA-solving module, for determining whether the currently crawled page is a CAPTCHA page and, if it is, obtaining the CAPTCHA of the current page, uploading it to a third-party platform that recognizes the CAPTCHA, and then submitting the CAPTCHA by simulating the CAPTCHA submission form.

7. The system according to claim 6, characterized in that the system further comprises an initialization module for initializing the Bloom filter, the search URL building module, the page parsing module, the parse-processing module, the database module, and the CAPTCHA-solving module.

8. The system according to claim 6, characterized in that the search URL building module first builds the search URLs of the first keyword and puts them into the request queue, and builds the search URLs of the next keyword and puts them into the request queue only after crawling of a keyword is complete.

9. The system according to claim 6, characterized in that the system further comprises a log module for storing the URL of the currently crawled page and the current keyword.

10. The system according to claim 6, characterized in that the system further comprises an incremental file for storing the ID of the first article on the first page of each keyword.