CN105320740B - Method and system for acquiring WeChat articles and official accounts


Info

Publication number
CN105320740B
Authority
CN
China
Prior art keywords
url
page
keyword
wechat
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510609672.7A
Other languages
Chinese (zh)
Other versions
CN105320740A (en
Inventor
薛波
薛一波
易成岐
郭泽豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201510609672.7A
Publication of CN105320740A
Application granted
Publication of CN105320740B
Active legal status (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and a system for acquiring WeChat articles and official accounts. On top of normal crawling, the invention connects to a third-party platform to recognize CAPTCHAs, which solves the CAPTCHA problem that arises during Sogou searches and keeps the crawler running stably. In addition, the crawler performs Bloom filtering on article IDs and official-account WeChat IDs, so that the WeChat crawler does not fail when the URLs of the Sogou search platform change. An incremental list records the state of the crawler's last update, which ensures incremental crawling and improves the crawler's efficiency. The invention can crawl WeChat official accounts and articles efficiently, stably, and comprehensively, and has good usability.

Description

Method and system for acquiring WeChat articles and official accounts
Technical field
The invention belongs to the technical field of data acquisition, and more particularly relates to a method and a system for acquiring WeChat articles and official accounts.
Background
The 2015 WeChat performance report published by Tencent shows that WeChat has more than 500 million monthly active users, covering more than 200 countries and more than 20 languages. Official accounts are one of WeChat's main businesses: in November 2013 the number of WeChat official accounts exceeded 2 million, in July 2014 it reached 5.8 million, in December 2014 the total exceeded 8 million, and at present it exceeds 10 million. Official accounts mainly grow their follower counts by pushing articles, so that advertisers can place advertisements in official accounts that attract relatively high attention. According to statistics, nearly 80% of WeChat users follow official accounts. Most users follow the official accounts of enterprises and media, a proportion as high as 73.4%. 41.1% of users follow official accounts to obtain information, 36.9% to make daily life more convenient, and 13.7% to acquire knowledge. How to extract and effectively use WeChat data is both an opportunity and a challenge.
WeChat data acquisition is the basis of WeChat data analysis. WeChat data mainly comprises official-account information and article information, and it is mainly collected by web crawlers. A web crawler, also known as a web robot or web spider, is a script or program that automatically fetches Internet resources according to certain strategies.
Sogou WeChat Search is a search engine for WeChat official accounts that Sogou released on June 9, 2014. It supports searching by keyword for WeChat official accounts and for the articles they push. With the formal integration of WeChat official accounts into Sogou Search, official accounts were displayed on the open web (the "outer net") for the first time.
In summary, WeChat as a social platform has expanded users' social circles; official accounts are one of its main businesses, their number is huge, and they have great potential research value. The integration of WeChat data into Sogou Search also makes it possible to acquire that data. However, there is currently no technical solution for acquiring WeChat articles and official accounts efficiently, stably, and comprehensively.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is how to acquire WeChat articles and official accounts efficiently, stably, and comprehensively.
(2) Technical solution
To solve the above technical problem, the present invention provides a method for acquiring WeChat articles and official accounts, the method comprising the following steps:
S1, obtain the keywords required for WeChat retrieval, build one or more search URLs for each keyword, and put the built search URLs into a request queue;
S2, start the crawler component and, for one keyword, crawl each search URL and the uncrawled URLs on the pages of those search URLs:
S21, determine whether the currently crawled page is a CAPTCHA page; if so, execute step S22; otherwise execute step S23;
S22, obtain the CAPTCHA of the current page and upload it to a third-party platform, which recognizes the CAPTCHA; then submit the recognized CAPTCHA by simulating the CAPTCHA submission form, and then execute step S21;
S23, determine whether the URL of the currently crawled page is one of the search URLs corresponding to the current keyword; if so, execute step S24; otherwise execute step S30;
S24, use Bloom filtering to screen out the URLs of the WeChat articles on the currently crawled page that have not been crawled and the URLs of the corresponding official accounts, and put them into the request queue; execute step S21 for each of these article URLs and official-account URLs;
S25, determine whether the currently crawled page is the page corresponding to the first search URL of the current keyword; if so, execute step S26; otherwise execute step S27;
S26, obtain the ID number of the first article on the currently crawled page and update it into the incremental list, then execute step S27; wherein the incremental list stores, for each keyword, the ID number of the first article on its first page;
S27, determine whether the currently crawled page is the page corresponding to the last search URL of the current keyword; if so, the crawl of the current keyword is complete, and step S29 is executed; otherwise execute step S28;
S28, determine whether the page corresponding to the next search URL after the currently crawled page has already been crawled; if so, execute step S29; otherwise, put that next search URL into the request queue and execute step S21;
S29, determine whether the current keyword is the last keyword; if so, terminate the crawler; otherwise, execute step S21 to crawl the search URLs of the next keyword and the uncrawled URLs on their pages.
S30, parse the currently crawled page to obtain a WeChat official account or WeChat article, process the parsed official account or article, and then store it.
Preferably, the method further comprises a step of loading the incremental list, and a step of initializing the Bloom filter.
Preferably, in step S1 the search URL of the first keyword is put into the request queue, and in step S27, after the crawl of the current keyword is complete, the search URL of another keyword is selected and put into the request queue.
Preferably, in step S24, before the Bloom filtering screens out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, the method further comprises the following step:
storing the URL of the currently crawled page and the current keyword in a breakpoint log.
Preferably, in step S26, before the ID number of the first article on the currently crawled page is obtained, the method further comprises the following step:
downloading the incremental list to a local file, obtaining from the downloaded incremental list the keyword stored last time and the ID number of the first article on the page of its first search URL, and storing them in memory.
A system for acquiring WeChat articles and official accounts, the system comprising:
a search URL building module, configured to obtain the keywords required for WeChat retrieval, build multiple search URLs for each keyword, and put the built search URLs into a request queue;
a page parsing module, configured to crawl, for one keyword, each search URL and the uncrawled URLs on the pages of those search URLs. The page parsing module determines whether the URL of the currently crawled page is one of the search URLs corresponding to the current keyword. If so, it uses the Bloom filter to screen out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, puts them into the request queue, and then crawls them; if not, it parses the currently crawled page to obtain a WeChat official account or WeChat article. The page parsing module also determines whether the currently crawled page is the page corresponding to the first search URL of the current keyword; if so, it obtains the ID number of the first article on that page and updates it into the incremental list. If not, the page parsing module determines whether the currently crawled page is the page corresponding to the last search URL of the current keyword; if so, the crawl of the current keyword is complete, and the search URLs of the next keyword and the uncrawled URLs on their pages are crawled. If not, the page parsing module determines whether the page corresponding to the next search URL after the currently crawled page has already been crawled; if so, the crawl of the current keyword is complete, and the search URLs of the next keyword and the uncrawled URLs on their pages are crawled; otherwise, that next search URL is put into the request queue and crawled;
a parsing-and-processing module, configured to process the WeChat official accounts or WeChat articles parsed out by the page parsing module;
a database module, configured to store the data processed by the parsing-and-processing module;
a CAPTCHA-solving module, configured to determine whether the currently crawled page is a CAPTCHA page and, if so, to obtain the CAPTCHA of the current page, upload it to a third-party platform, which recognizes the CAPTCHA, and then submit the recognized CAPTCHA by simulating the CAPTCHA submission form.
Preferably, the system further comprises an initialization module, configured to initialize the Bloom filter, the search URL building module, the page parsing module, the parsing-and-processing module, the database module, and the CAPTCHA-solving module.
Preferably, the search URL building module first builds the search URL of the first keyword and puts it into the request queue, and builds the search URL of the next keyword and puts it into the request queue only after the crawl of the current keyword is complete.
Preferably, the system further comprises a log module, configured to store the URL of the currently crawled page and the current keyword.
Preferably, the system further comprises an incremental file, configured to store, for each keyword, the ID number of the first article on its first page.
(3) Beneficial effects
The present invention provides a method and a system for acquiring WeChat articles and official accounts. The invention uses an incremental list to ensure that WeChat articles and official accounts are crawled incrementally. It uses a third-party platform to solve the CAPTCHA problem that arises during Sogou searches, which keeps the crawler running stably. In addition, the crawler performs Bloom filtering on article IDs and official-account WeChat IDs, so that the WeChat crawler system does not fail when URLs change. The method and system of the present invention can crawl WeChat official accounts and articles efficiently, stably, and comprehensively, and have good usability.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figs. 1A and 1B are flow charts of the method for acquiring WeChat articles and official accounts according to a preferred embodiment of the present invention;
Fig. 2 is a flow chart of the method for acquiring WeChat articles and official accounts according to another preferred embodiment of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present invention but do not limit its scope.
A method for acquiring WeChat articles and official accounts, as shown in Figs. 1A and 1B, comprises the following steps:
S1, obtain the keywords required for WeChat retrieval, build multiple search URLs for each keyword, and put the built search URLs into a request queue; here a search URL is a URL that searches the WeChat article list by keyword.
S2, start the crawler component and, for one keyword, crawl each search URL and the uncrawled URLs on the pages of those search URLs:
S21, determine whether the currently crawled page is a CAPTCHA page; if so, execute step S22; otherwise execute step S23;
S22, obtain the CAPTCHA of the current page and upload it to a third-party platform, which recognizes the CAPTCHA; then submit the recognized CAPTCHA by simulating the CAPTCHA submission form, and then execute step S21;
S23, determine whether the URL of the currently crawled page is one of the search URLs corresponding to the current keyword; if so, execute step S24; otherwise execute step S30;
S24, perform Bloom filtering on the WeChat article ID numbers and WeChat IDs, screen out the URLs of the WeChat articles on the currently crawled page that have not been crawled and the URLs of the corresponding official accounts, and put them into the request queue; execute step S21 for each of these article URLs and official-account URLs;
S25, determine whether the currently crawled page is the page corresponding to the first search URL of the current keyword; if so, execute step S26; otherwise execute step S27;
S26, obtain the ID number of the first article on the currently crawled page and update it into the incremental list, then execute step S27; wherein the incremental list stores, for each keyword, the ID number of the first article on its first page;
S27, determine whether the currently crawled page is the page corresponding to the last search URL of the current keyword; if so, the crawl of the current keyword is complete, and step S29 is executed; otherwise execute step S28;
S28, determine whether the page corresponding to the next search URL after the currently crawled page has already been crawled; if so, execute step S29; otherwise, put that next search URL into the request queue and execute step S21;
S29, determine whether the current keyword is the last keyword; if so, terminate the crawler; otherwise, execute step S21 to crawl the search URLs of the next keyword and the uncrawled URLs on their pages.
S30, parse the currently crawled page to obtain a WeChat official account or WeChat article, process the parsed official account or article, and then store it.
The above method uses an incremental list to ensure that WeChat articles and official accounts are crawled incrementally. It uses a third-party platform to solve the CAPTCHA problem that arises during Sogou searches, which keeps the crawler running stably. In addition, the crawler performs Bloom filtering on article IDs and official-account WeChat IDs, so that the WeChat crawler system does not fail when the URLs of the Sogou search platform change.
Further, the method also comprises a step of loading the incremental list, and a step of initializing the Bloom filter.
Further, in step S1 the search URL of the first keyword is put into the request queue, and in step S27, after the crawl of the current keyword is complete, the search URL of another keyword is selected and put into the request queue. In this improved scheme one keyword is crawled completely before the URL of another keyword is put into the request queue for processing, so that web data can be crawled more accurately and completely. The search URL of the first keyword in step S1 can therefore be called the seed URL; the seed URL is a URL that searches for WeChat articles.
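The keyword-by-keyword seeding described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the endpoint `SEARCH_BASE`, the query parameter names, and the helper names `build_search_urls` and `seed_next_keyword` are all assumptions, since the text does not give the real search URL format.

```python
from collections import deque
from urllib.parse import urlencode

# Hypothetical endpoint: the source does not specify the real search URL format.
SEARCH_BASE = "https://search.example/weixin"


def build_search_urls(keyword, pages):
    """Return one list-page (search) URL per result page for a keyword."""
    return [
        f"{SEARCH_BASE}?{urlencode({'query': keyword, 'page': page})}"
        for page in range(1, pages + 1)
    ]


def seed_next_keyword(queue, pending_keywords):
    """Enqueue only the first-page (seed) URL of the next keyword, if any.

    Returns the keyword that was seeded, or None when no keyword is left.
    """
    if not pending_keywords:
        return None
    keyword = pending_keywords.popleft()
    queue.append(build_search_urls(keyword, pages=1)[0])
    return keyword


pending = deque(["crawler", "privacy"])
queue = deque()
current = seed_next_keyword(queue, pending)  # seeds "crawler" only
```

Calling `seed_next_keyword` again only after the current keyword is finished keeps the request queue restricted to a single keyword at a time, which is the behaviour the paragraph above describes.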
Further, in step S24, before the Bloom filtering screens out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, the method also comprises the following step: storing the URL of the currently crawled page and the current keyword in a breakpoint log. The breakpoint log is a file that records the crawler's running state; it is used for recovery after a failure, so that the crawler can resume crawling from the breakpoint.
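A breakpoint log of this kind might look like the sketch below. The JSON layout, the atomic write through a temporary file, and the function names `save_breakpoint` and `load_breakpoint` are assumptions made for illustration; the text only states that the current page URL and keyword are recorded.

```python
import json
import os
import tempfile


def save_breakpoint(path, keyword, url):
    """Record the keyword and list-page URL being crawled.

    The state is written to a temporary file first and then moved into
    place, so a crash mid-write does not corrupt the previous log.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"keyword": keyword, "url": url}, fh, ensure_ascii=False)
    os.replace(tmp, path)


def load_breakpoint(path):
    """Return the last recorded state, or None if no log exists yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return None
```

On restart, a crawler could call `load_breakpoint` and, if a state is returned, put the recorded URL back into the request queue instead of starting from the first keyword.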
Further, in step S26, before the ID number of the first article on the currently crawled page is obtained, the method also comprises the following step: downloading the incremental list to a local file, obtaining from the downloaded incremental list the keyword stored last time and the ID number of the first article on the page of its first search URL, and storing them in memory.
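One way to read the incremental list is sketched below. It assumes that search results are ordered newest first and that the crawl of a keyword can stop once the first-article ID recorded by the previous run is met again; neither assumption is stated explicitly in the text, and the function names are hypothetical.

```python
def update_incremental_list(incremental, keyword, first_page_ids):
    """Store the newest first-article ID for a keyword.

    Returns the ID recorded by the previous run (None on the first run),
    which the caller keeps in memory for the rest of the crawl.
    """
    previous = incremental.get(keyword)
    if first_page_ids:
        incremental[keyword] = first_page_ids[0]
    return previous


def ids_newer_than(page_ids, previous_first_id):
    """Return the article IDs that appear before the previously recorded one.

    Assumes newest-first ordering; everything from the previously recorded
    ID onward is treated as already crawled.
    """
    fresh = []
    for article_id in page_ids:
        if article_id == previous_first_id:
            break
        fresh.append(article_id)
    return fresh
```

For example, if the previous run recorded `"a3"` for a keyword and the first page now lists `["a5", "a4", "a3", "a2"]`, only `"a5"` and `"a4"` would be treated as new.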
Corresponding to the above method, there is a system for acquiring WeChat articles and official accounts, the system comprising:
a search URL building module, configured to obtain the keywords required for WeChat retrieval, build multiple search URLs for each keyword, and put the built search URLs into a request queue;
a page parsing module, which is the component that parses web page content by means of selectors. It crawls, for one keyword, each search URL and the uncrawled URLs on the pages of those search URLs. The page parsing module determines whether the URL of the currently crawled page is one of the search URLs corresponding to the current keyword. If so, it uses the Bloom filter to screen out the URLs of the uncrawled WeChat articles and official accounts on the currently crawled page, puts them into the request queue, and then crawls them; if not, it parses the currently crawled page to obtain a WeChat official account or WeChat article. The page parsing module also determines whether the currently crawled page is the page corresponding to the first search URL of the current keyword; if so, it obtains the ID number of the first article on that page and updates it into the incremental list. If not, the page parsing module determines whether the currently crawled page is the page corresponding to the last search URL of the current keyword; if so, the crawl of the current keyword is complete, and the search URLs of the next keyword and the uncrawled URLs on their pages are crawled. If not, the page parsing module determines whether the page corresponding to the next search URL after the currently crawled page has already been crawled; if so, the crawl of the current keyword is complete, and the search URLs of the next keyword and the uncrawled URLs on their pages are crawled; otherwise, that next search URL is put into the request queue and crawled;
a parsing-and-processing module, configured to process the WeChat official accounts or WeChat articles parsed out by the page parsing module;
a database module, configured to store the data processed by the parsing-and-processing module; it is the component that performs data persistence, and it is also used for data queries, data storage, and batch storage.
a CAPTCHA-solving module, configured to determine whether the currently crawled page is a CAPTCHA page and, if so, to obtain the CAPTCHA of the current page, upload it to a third-party platform, which recognizes the CAPTCHA, and then submit the recognized CAPTCHA by simulating the CAPTCHA submission form; if verification fails, the program keeps retrying until verification succeeds.
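The retry behaviour of the CAPTCHA-solving module can be sketched as a small loop. The callables `fetch`, `recognise`, and `submit` are placeholders for the downloader, the third-party recognition platform, and the simulated form submission; their signatures and the dictionary page format are assumptions. The text retries until verification succeeds, whereas this sketch adds a `max_attempts` bound as a safety measure.

```python
def solve_until_verified(fetch, recognise, submit, url, max_attempts=5):
    """Fetch `url`, solving CAPTCHA pages until a normal page comes back.

    fetch(url)       -> dict with "is_captcha" and, if set, "captcha_image"
    recognise(image) -> recognised text (stands in for the third-party platform)
    submit(code)     -> submits the simulated CAPTCHA form
    """
    for _ in range(max_attempts):
        page = fetch(url)
        if not page.get("is_captcha"):
            return page
        code = recognise(page["captcha_image"])
        submit(code)
    raise RuntimeError(f"CAPTCHA not solved after {max_attempts} attempts: {url}")
```

Keeping the three operations as injected callables means the loop can be exercised with fakes and does not tie the crawler to any particular recognition service.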
The page parsing module described above is mainly responsible for executing different page-parsing logic for different URLs. URLs generally comprise list URLs (that is, search URLs) and detail URLs (that is, WeChat article URLs and official-account URLs). Detail URLs are parsed out of list pages and put into the request queue, and detailed content is parsed out of detail pages. This module is the most important one: it covers web page parsing, incremental updating, breakpoint logging, and Bloom filtering, and it must also apply the appropriate handling logic to normal URLs and to the various kinds of abnormal URLs.
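The list-versus-detail dispatch can be sketched in a few lines. The function name `handle_page` and the injected parsers `parse_list` and `parse_detail` are hypothetical; real parsers would use selectors on the HTML, which is outside the scope of this illustration.

```python
def handle_page(url, html, search_urls, parse_list, parse_detail, queue):
    """Dispatch a downloaded page according to the kind of URL it came from.

    List (search) pages yield detail URLs, which are queued for crawling.
    Detail pages yield a parsed record, which is returned for storage.
    """
    if url in search_urls:
        queue.extend(parse_list(html))
        return None
    return parse_detail(html)
```

A caller would store any non-`None` return value through the database module and keep draining the queue.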
Further, the system also comprises an initialization module, configured to initialize the Bloom filter, the search URL building module, the page parsing module, the parsing-and-processing module, the database module, and the CAPTCHA-solving module.
Further, the Bloom filter is a component consisting of a fixed-length binary vector and a series of random mapping functions. Because the URLs of the Sogou search platform may change while WeChat article IDs and the WeChat IDs of official accounts do not, the Bloom algorithm is applied to the article IDs and the official-account WeChat IDs. This deduplicates article IDs and official-account WeChat IDs and thereby identifies the WeChat article URLs and official-account URLs that have not yet been crawled.
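A minimal Bloom filter keyed on stable IDs rather than URLs is sketched below. The bit-vector size, the number of hash functions, and the use of salted SHA-256 digests as the "random mapping functions" are illustrative choices, not taken from the text; `unseen_urls` is a hypothetical helper.

```python
import hashlib


class BloomFilter:
    """Fixed-length bit vector plus several hash functions."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def unseen_urls(entries, seen):
    """Keep the URL of each (stable_id, url) pair whose ID is not yet in `seen`.

    The filter is keyed on the article ID or official-account WeChat ID,
    so a changed URL for a known ID is still recognised as a duplicate.
    """
    fresh = []
    for stable_id, url in entries:
        if stable_id not in seen:
            seen.add(stable_id)
            fresh.append(url)
    return fresh
```

A Bloom filter can report false positives (an unseen ID judged as seen) but never false negatives, so an occasional new article may be skipped in exchange for a small, fixed memory footprint.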
Further, the search URL building module first builds the search URL of the first keyword and puts it into the request queue, and builds the search URL of the next keyword and puts it into the request queue only after the crawl of the current keyword is complete.
Further, the system also comprises a log module, configured to store the URL of the currently crawled page and the current keyword; it is initialized by the initialization module described above.
Further, the system also comprises an incremental file, configured to store, for each keyword, the ID number of the first article on its first page. Each time the crawler finishes updating a keyword, the incremental file is synchronized with the incremental list (an in-memory linked list that, when the program starts, holds the state of the crawler's last update), for use in the crawler's incremental updates.
Further, the system also comprises a download component, which is the component that downloads web page content for a given URL. It configures and creates a request according to the request headers, obtains the response content, and uses the response headers to set the header information of the next request, including the cookie information. It is initialized by the initialization module described above.
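The header carry-over performed by the download component can be sketched as a pure function. It uses the standard-library `http.cookies.SimpleCookie` parser; the function name `next_request_headers` and the representation of response headers as a list of `(name, value)` pairs are assumptions for illustration, and the actual network request is deliberately left out.

```python
from http.cookies import SimpleCookie


def next_request_headers(request_headers, response_headers):
    """Build the headers of the next request from the previous exchange.

    request_headers:  dict of headers sent with the previous request
    response_headers: list of (name, value) pairs from the response

    Cookies already being sent are kept, and every Set-Cookie header in
    the response adds to or overrides them.
    """
    jar = SimpleCookie()
    jar.load(request_headers.get("Cookie", ""))
    for name, value in response_headers:
        if name.lower() == "set-cookie":
            jar.load(value)

    headers = dict(request_headers)
    if jar:
        headers["Cookie"] = "; ".join(
            f"{key}={morsel.value}" for key, morsel in jar.items()
        )
    return headers
```

In practice the same effect can be obtained by letting an HTTP client with a cookie jar manage the session; the explicit version here only makes the step described above visible.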
The above method and system can acquire the information of WeChat articles and WeChat official accounts efficiently, stably, and comprehensively.
The method of the present invention is illustrated below with reference to a specific embodiment:
A method for acquiring WeChat articles and official accounts, as shown in Fig. 2, comprises the following steps:
S1. Obtain the keywords required for the WeChat search;
S2. Initialize the seed URL, the download component, the page parsing module, the parsing-and-processing module, the database module, the Bloom filter, and so on, and load the log module and the incremental file;
S3. Assemble the crawler components (crawler initialization) and start the crawler;
S4. Determine whether the crawled page is a CAPTCHA page; if so, go to S5; otherwise go to S8;
S5. Upload the CAPTCHA to the third-party platform for recognition;
S6. Submit the CAPTCHA by simulating the CAPTCHA submission form;
S7. Request and download the page again, and go to step S4;
S8. Determine whether the URL of the crawled page is a search URL; if so, go to S9; otherwise go to S20;
S9. Record the crawler's current running state by storing the page number currently being crawled and the keyword in the breakpoint log;
S10. Perform Bloom filtering on the article IDs and the WeChat IDs of the official accounts, and put the WeChat article URLs and official-account URLs on this page that have not been crawled into the request queue.
S11. Determine whether the crawled page is the first page; if so, go to S12; otherwise go to S15;
S12. Store the incremental list in a local file;
S13. Obtain from the incremental list the ID of the first article on the keyword's first page at the time of the last update, and cache it in an in-memory variable;
S14. Obtain the ID of the first article on the keyword's first page and update the incremental list;
S15. Determine whether the crawled page is the last page; if so, go to S17; otherwise go to S16;
S16. Determine whether the next page has already been crawled; if so, go to S17; otherwise go to S18;
S17. Determine whether the keyword is the last keyword; if so, terminate; otherwise go to S19;
S18. Put the URL of the keyword's next page into the request queue, and return to step S4;
S19. Put the search URL of the next keyword into the request queue, and return to step S4;
S20. Parse the WeChat official account or WeChat article from the crawled page.
S21. Store the WeChat official account or WeChat article.
S22. Synchronize the local deduplication file.
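The control flow of the steps above can be condensed into the following simplified sketch. It is not the flow of Fig. 2 itself: a plain `set` stands in for the Bloom filter, the breakpoint log and incremental list (S9, S12 to S14, S22) are omitted, and the page format returned by `fetch` (keys `captcha`, `kind`, `items`, `next`, `data`) is invented for the example.

```python
from collections import deque


def crawl(keywords, seed_url, fetch, solve_captcha, seen_ids, visited_pages):
    """Walk the list and detail pages of each keyword, one keyword at a time."""
    stored = []
    for keyword in keywords:                      # S17/S19: next keyword
        queue = deque([seed_url(keyword)])
        while queue:
            url = queue.popleft()
            page = fetch(url)
            while page["captcha"]:                # S4-S7: solve, then re-fetch
                solve_captcha(page)
                page = fetch(url)
            if page["kind"] == "detail":          # S8 -> S20/S21: parse, store
                stored.append(page["data"])
                continue
            visited_pages.add(url)
            for stable_id, detail_url in page["items"]:   # S10: dedup by ID
                if stable_id not in seen_ids:
                    seen_ids.add(stable_id)
                    queue.append(detail_url)
            next_url = page["next"]               # S15/S16/S18: pagination
            if next_url is not None and next_url not in visited_pages:
                queue.append(next_url)
    return stored
```

Because deduplication is keyed on the stable ID, an article that reappears on a later list page under a different URL is not queued a second time.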
The above embodiments merely illustrate the present invention and do not limit it. Although the present invention has been described in detail with reference to the embodiments, a person of ordinary skill in the art will understand that any combination, modification, or equivalent replacement of the technical solution of the present invention that does not depart from its spirit and scope shall fall within the scope of the claims of the present invention.

Claims (10)

1. the acquisition methods of a kind of wechat article and public platform, which is characterized in that the described method comprises the following steps:
S1, keyword needed for wechat retrieval is obtained, for each keyword, builds one or more search URL for it, and will The described search URL of structure is put into request queue;
S2, start reptile component, to each search URL and corresponding to the page of described search URL of each keyword On the URL that does not crawl crawled;Crawling step in the wherein described step S2 is specially:
S21: determine whether the page currently being crawled is a CAPTCHA (verification code) page; if so, execute step S22; otherwise, execute step S23;
S22: obtain the CAPTCHA of the current page and upload it to a third-party platform, which recognizes the CAPTCHA; then submit the recognized CAPTCHA by simulating submission of the CAPTCHA form, and then execute step S21;
S23: determine whether the URL of the page currently being crawled is one of the plurality of search URLs corresponding to the current keyword; if so, execute step S24; otherwise, execute step S30;
S24: apply Bloom filtering keyed on the WeChat article ID number and the WeChat ID to select the URLs of WeChat articles on the page currently being crawled that have not yet been crawled, together with the URLs of the corresponding WeChat public accounts, and put them into the request queue; for each of these WeChat article URLs and WeChat public account URLs, execute step S21;
S25: determine whether the page currently being crawled is the page corresponding to the first search URL of the current keyword; if so, execute step S26; otherwise, execute step S27;
S26: obtain the ID number of the first article on the page currently being crawled and update it into the offset list, then execute step S27; wherein the offset list stores, for each keyword, the ID number of the first article on its first page;
S27: determine whether the page currently being crawled is the page corresponding to the last search URL of the current keyword; if so, the crawl of the current keyword is complete and step S29 is executed; otherwise, execute step S28;
S28: determine whether the page corresponding to the next search URL after the page currently being crawled has already been passed; if so, execute step S29; otherwise, put the next search URL after the page currently being crawled into the request queue and execute step S21;
S29: determine whether the current keyword is the last keyword; if so, terminate the crawler; otherwise, execute step S21 to crawl the search URLs of the next keyword and the not-yet-crawled URLs on those search URL pages;
S30: parse the page currently being crawled to obtain a WeChat public account or WeChat article, process the WeChat public account or WeChat article obtained by parsing, and then store it.
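The de-duplication in step S24 can be sketched as follows. This is a minimal illustration only, not the patented implementation: the BloomFilter class, its bit-array size and hash count, the select_uncrawled helper, and the dictionary keys are all hypothetical names chosen for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; the sizes are illustrative, not taken from the claims."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def select_uncrawled(links, seen):
    """Return the URLs to enqueue for links that the filter has not seen.

    `links` is a hypothetical list of dicts with the keys
    'wechat_id', 'article_id', 'article_url' and 'account_url'.
    """
    to_enqueue = []
    for link in links:
        article_key = f"article:{link['wechat_id']}:{link['article_id']}"
        if article_key in seen:
            continue  # probably crawled already (false positives are possible)
        seen.add(article_key)
        to_enqueue.append(link["article_url"])
        account_key = f"account:{link['wechat_id']}"
        if account_key not in seen:
            seen.add(account_key)
            to_enqueue.append(link["account_url"])
    return to_enqueue
```

A Bloom filter can report a key as seen when it is not, so a small fraction of new articles may be skipped; it never reports a seen key as new, which is the property that prevents repeat crawling.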
2. The method according to claim 1, characterized in that, before the offset list is updated in step S26, the method further comprises a step of loading the offset list.
3. The method according to claim 1, characterized in that, in step S1, the search URLs of the first keyword are put into the request queue, and in step S27, after the crawl of the current keyword is complete, the search URLs of the next keyword are put into the request queue.
4. The method according to claim 1, characterized in that, in step S24, before the Bloom filtering selects the URLs of the not-yet-crawled WeChat articles and the URLs of the WeChat public accounts on the page currently being crawled, the method further comprises the following step:
storing the URL of the page currently being crawled and the current keyword in a breakpoint log.
5. The method according to claim 1, characterized in that, in step S26, before the ID number of the first article on the page currently being crawled is obtained, the method further comprises the following step:
downloading the offset list to a local file, obtaining from the downloaded offset list each previously stored keyword and the ID number of the first article on the page of its corresponding first search URL, and storing them in memory.
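The offset list of claims 2 and 5 and the breakpoint log of claim 4 can be sketched as two small local files. The claims do not specify file formats or where the offset list is downloaded from, so the JSON layouts, the function names, and the paths below are assumptions made for illustration; the download step itself is omitted.

```python
import json
import os


def load_offsets(path):
    """Load the keyword -> first-article-ID mapping into memory (empty if absent)."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def update_offset(offsets, keyword, first_article_id, path):
    """Record the first article ID on the keyword's first search page and persist it."""
    offsets[keyword] = first_article_id
    with open(path, "w", encoding="utf-8") as f:
        json.dump(offsets, f, ensure_ascii=False)


def log_breakpoint(log_path, page_url, keyword):
    """Append the page being crawled and its keyword to the breakpoint log."""
    record = {"url": page_url, "keyword": keyword}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def last_breakpoint(log_path):
    """Return the most recent breakpoint record, or None if nothing was logged."""
    if not os.path.exists(log_path):
        return None
    last = None
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                last = json.loads(line)
    return last
```

On restart, a crawler built this way could call last_breakpoint to find the page and keyword it was working on, and load_offsets to recover the first-article IDs recorded by the previous run.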
6. A system for acquiring WeChat articles and public accounts, characterized in that the system comprises:
a search URL construction module for obtaining the keywords required for WeChat retrieval, constructing a plurality of search URLs for each keyword, and putting the constructed search URLs into a request queue;
a page parsing module for crawling each search URL of a keyword and the not-yet-crawled URLs on the search URL pages, wherein the page parsing module is configured to determine whether the URL of the page currently being crawled is one of the plurality of search URLs corresponding to the current keyword; if it is, to use a Bloom filter to select the URLs of WeChat articles on the page currently being crawled that have not yet been crawled and the URLs of WeChat public accounts, put them into the request queue, and then crawl them; if it is not, the page currently being crawled is parsed to obtain a WeChat public account or WeChat article; the page parsing module is further configured to determine whether the page currently being crawled is the page corresponding to the first search URL of the current keyword; if it is, to obtain the ID number of the first article on the page currently being crawled and update it into the offset list; if it is not, the page parsing module is further configured to determine whether the page currently being crawled is the page corresponding to the last search URL of the current keyword; if it is, the crawl of the current keyword is complete, and the search URLs of the next keyword and the not-yet-crawled URLs on those search URL pages are crawled; if it is not, the page parsing module is further configured to determine whether the page corresponding to the next search URL after the page currently being crawled has already been passed; if it has, the crawl of the current keyword is complete, and the search URLs of the next keyword and the not-yet-crawled URLs on those search URL pages are crawled; if it has not, the next search URL after the page currently being crawled is put into the request queue and crawled;
a parsing-processing module for processing the WeChat public account or WeChat article parsed by the page parsing module;
a database module for storing the data processed by the parsing-processing module;
a CAPTCHA module for determining whether the page currently being crawled is a CAPTCHA page and, if it is, obtaining the CAPTCHA of the current page and uploading it to a third-party platform, which recognizes the CAPTCHA, and then submitting the recognized CAPTCHA by simulating submission of the CAPTCHA form.
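The behaviour of the CAPTCHA module can be sketched as a retry loop. This is an illustration under stated assumptions: the four callables are hypothetical stand-ins for page detection, image extraction, the third-party recognition service, and the simulated form submission, none of which are specified in the claims. The max_attempts limit is also an addition; the claimed method simply returns to the CAPTCHA check after each submission.

```python
def pass_captcha(page, is_captcha_page, extract_captcha, recognize, submit_form,
                 max_attempts=3):
    """Repeat the check-and-submit loop until a non-CAPTCHA page is returned.

    All four callables are injected stand-ins (assumptions of this sketch):
      is_captcha_page(page) -> bool
      extract_captcha(page) -> CAPTCHA image data
      recognize(image)      -> recognized text from a third-party service
      submit_form(page, code) -> the page returned after submitting the form
    """
    for _ in range(max_attempts):
        if not is_captcha_page(page):
            return page
        image = extract_captcha(page)
        code = recognize(image)
        page = submit_form(page, code)
    if is_captcha_page(page):
        raise RuntimeError("CAPTCHA not solved within max_attempts")
    return page
```

Passing the collaborators in as functions keeps the loop testable without network access and leaves the choice of recognition service open.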
7. The system according to claim 6, characterized in that the system further comprises an initialization module for initializing the Bloom filter, the search URL construction module, the page parsing module, the parsing-processing module, the database module and the CAPTCHA module.
8. The system according to claim 6, characterized in that the search URL construction module is configured to first construct the search URLs of the first keyword and put them into the request queue, and, after the crawl of a keyword is complete, to construct the search URLs of the next keyword and put them into the request queue.
9. The system according to claim 6, characterized in that the system further comprises a log module for storing the URL of the page currently being crawled and the current keyword.
10. The system according to claim 6, characterized in that the system further comprises an incremental file for storing the ID number of the first article on the first page of each keyword.
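The keyword-by-keyword queueing of claim 8 can be sketched as follows. The URL template is hypothetical, since the claims do not give the real search endpoint, and the KeywordScheduler class and its method names are likewise assumptions made for this example.

```python
from collections import deque
from urllib.parse import quote

# Hypothetical URL template; the real search endpoint is not given in the claims.
SEARCH_TEMPLATE = "https://search.example.com/weixin?query={query}&page={page}"


def build_search_urls(keyword, num_pages):
    """Build one search URL per result page for a keyword."""
    return [
        SEARCH_TEMPLATE.format(query=quote(keyword), page=page)
        for page in range(1, num_pages + 1)
    ]


class KeywordScheduler:
    """Queue the search URLs of one keyword at a time."""

    def __init__(self, keywords, num_pages):
        self.pending = deque(keywords)
        self.num_pages = num_pages
        self.request_queue = deque()

    def enqueue_next_keyword(self):
        """Queue the next keyword's search URLs; return it, or None when done."""
        if not self.pending:
            return None
        keyword = self.pending.popleft()
        self.request_queue.extend(build_search_urls(keyword, self.num_pages))
        return keyword
```

A crawler would call enqueue_next_keyword once at start-up and again each time the crawl of the current keyword is complete; a None result corresponds to the last keyword having been processed.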
CN201510609672.7A 2015-09-22 2015-09-22 The acquisition methods and acquisition system of wechat article and public platform Active CN105320740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510609672.7A CN105320740B (en) 2015-09-22 2015-09-22 The acquisition methods and acquisition system of wechat article and public platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510609672.7A CN105320740B (en) 2015-09-22 2015-09-22 The acquisition methods and acquisition system of wechat article and public platform

Publications (2)

Publication Number Publication Date
CN105320740A CN105320740A (en) 2016-02-10
CN105320740B true CN105320740B (en) 2018-10-16

Family

ID=55248127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510609672.7A Active CN105320740B (en) 2015-09-22 2015-09-22 The acquisition methods and acquisition system of wechat article and public platform

Country Status (1)

Country Link
CN (1) CN105320740B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Data capture method based on reptile and device
CN108090072A (en) * 2016-11-22 2018-05-29 上海看榜信息科技有限公司 A kind of minute grade monitoring system for graph text information
CN106789559B (en) * 2016-12-02 2019-09-24 上海智臻智能网络科技股份有限公司 Information processing method, device and system for wechat public platform
CN106909637A (en) * 2017-02-14 2017-06-30 国家计算机网络与信息安全管理中心 The influence power analysis method and system of wechat public number
CN107180103A (en) * 2017-05-31 2017-09-19 成都明途科技有限公司 The more fast and convenient interactive system of search
CN107948052A (en) * 2017-11-14 2018-04-20 福建中金在线信息科技有限公司 Information crawler method, apparatus, electronic equipment and system
CN108038221B (en) * 2017-12-22 2021-10-15 新奥(中国)燃气投资有限公司 Information capturing method and device
CN108038233B (en) * 2017-12-26 2021-07-23 福建中金在线信息科技有限公司 Method and device for collecting articles, electronic equipment and storage medium
CN108491434A (en) * 2018-02-09 2018-09-04 深圳前海道己社文化有限公司 Article methods of exhibiting, device and terminal based on wechat public platform
CN110555146A (en) * 2018-03-29 2019-12-10 中国科学院信息工程研究所 method and system for generating network crawler camouflage data
CN109284431A (en) * 2018-08-09 2019-01-29 国家计算机网络与信息安全管理中心 A method of finding specific area wechat public platform from wechat
CN109388735A (en) * 2018-09-13 2019-02-26 广州丰石科技有限公司 A method of crawling wechat public platform information
CN110188257B (en) * 2019-04-16 2021-12-31 国家计算机网络与信息安全管理中心 Mobile application data acquisition method and device
CN112579850A (en) * 2019-09-29 2021-03-30 北京国双科技有限公司 Breakpoint recovery method and device
CN112256959B (en) * 2020-06-11 2022-11-08 国家计算机网络与信息安全管理中心 Method for analyzing information collected by WeChat public number small program
CN113987320B (en) * 2021-11-24 2024-06-04 宁波深擎信息科技有限公司 Real-time information crawler method, device and equipment based on intelligent page analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253991A (en) * 2011-05-25 2011-11-23 北京星网锐捷网络技术有限公司 Uniform resource locator (URL) storage method, web filtering method, device and system
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
US8577878B1 (en) * 2006-05-09 2013-11-05 Google Inc. Filtering search results using annotations
CN103455615A (en) * 2013-09-10 2013-12-18 中国地质大学(武汉) Method for sequencing filtering and retrieving WeChat accounts
CN104794193A (en) * 2015-04-17 2015-07-22 南京大学 Webpage increment capture method for valid link acquisition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
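Content routing and lookup schemes using global bloom filter for content-delivery-as-a-service; Y Jin; IEEE Systems Journal; 20141231 (No. 8); full text *
Design and Implementation of a Deep Web Collection System; Song Yu (宋宇); China Excellent Master's Theses Full-text Database, Information Science and Technology; 20130515 (No. 05, 2013); full text *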

Also Published As

Publication number Publication date
CN105320740A (en) 2016-02-10

Similar Documents

Publication Publication Date Title
CN105320740B (en) The acquisition methods and acquisition system of wechat article and public platform
Middleton et al. Heroku: up and running: effortless application deployment and scaling
CN104065565B (en) The method of PUSH message, server, client terminal device and system
US8555018B1 (en) Techniques for storing data
CN105243159A (en) Visual script editor-based distributed web crawler system
CN104598518B (en) Content pushing method and device
CN102662966B (en) Method and system for obtaining subject-oriented dynamic page content
US20120124028A1 (en) Unified Application Discovery across Application Stores
CN107145556B (en) Universal distributed acquisition system
CN101635655A (en) Method, device and system for page performance test
CN107809383A (en) A kind of map paths method and device based on MVC
CN106126648A (en) A kind of based on the distributed merchandise news reptile method redo log
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN108011931A (en) Web data acquisition method and web data acquisition system
CN110535974A (en) Method for pushing, driving means, equipment and the storage medium of resource to be put
CN107783850A (en) A kind of node tree chooses analytic method, device, server and the system of record
CN108549714A (en) A kind of data processing method and device
CN111026945B (en) Multi-platform crawler scheduling method, device and storage medium
CN105069004A (en) Patent information automatic collection method
CN112671878A (en) Block chain information subscription method, device, server and storage medium
CN116155597A (en) Access request processing method and device and computer equipment
CN112784114A (en) Relation map updating method based on Neo4j high-performance map database
CN106611027A (en) Website ranking data processing method and device
CN113934428A (en) Log analysis method and device
US8909611B2 (en) Content management system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant