CN109388735A

CN109388735A - A method for crawling WeChat official account information

Info

Publication number: CN109388735A
Application number: CN201811070426.9A
Authority: CN
Inventors: 陈曦; 蓝志坚; 潘健; 曾伟杰
Original assignee: Guangzhou Feng Shi Technology Co Ltd
Current assignee: Guangzhou Feng Shi Technology Co Ltd
Priority date: 2018-09-13
Filing date: 2018-09-13
Publication date: 2019-02-26

Abstract

The present invention provides a method for crawling WeChat official account information. The main steps are: S1: determine the name of the official account to be crawled and store its account ID in the database; S2: according to preset crawler matching rules, crawl the official account by its name to obtain its related information; S3: pre-process the related information obtained by the crawler and store the processed information in the database; S4: set a scheduled execution time for the crawler, periodically crawl the articles of the official account according to its related information, and store the articles obtained by the crawler in the database; S5: display the official account articles. The invention uses the crawling facilities of the Scrapy framework to crawl the historical content of the target WeChat official accounts, crawls daily so that local storage stays synchronized with the content of the target accounts, and uses JavaWeb technologies to search the locally stored official account content and display it in a front-end interface.

Description

A method for crawling WeChat official account information

Technical field

The present invention relates to the field of Internet communication technology, and more particularly to a method for crawling WeChat official account information.

Background

At present, there are two main ways to view the content of a WeChat official account: through the search function of Sogou WeChat search, or through the WeChat mobile app. Most existing crawlers for WeChat official accounts work by searching for the accounts through Sogou WeChat search, which is inefficient. Sogou WeChat search also enforces anti-crawler rules, so the amount of official account content that can be retrieved is limited, and a large volume of content cannot be obtained in a short time.

Summary of the invention

To overcome at least one of the above deficiencies of the prior art, the present invention provides a method for crawling WeChat official account information.

To solve the above technical problem, the technical solution of the present invention is as follows:

A method for crawling WeChat official account information, characterized in that it comprises the following steps:

S1: determine the name of the official account to be crawled, and store the account ID of that official account in the database;

S2: according to preset crawler matching rules, crawl the official account by its name to obtain the related information of the official account to be crawled;

S3: pre-process the related information obtained by the crawler, and store the processed related information in the database;

S4: set a scheduled execution time for the crawler, periodically crawl the articles of the official account according to its related information, and store the articles obtained by the crawler in the database;

S5: display the official account articles.

Further, step S1 covers the following two cases:

S11: fuzzy matching, where the user does not know the exact name of the official account to be crawled; once the account has been tentatively identified, its account ID is stored in the database as an official account to be crawled;

S12: exact matching, where the user knows the exact name of the official account to be crawled; the user types in the exact account name or account ID, which is stored directly in the database as an official account to be crawled.
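The two cases of step S1 can be illustrated with a minimal Python sketch. The function names `resolve_account` and `enqueue`, the in-memory dictionary of known accounts, and the list used as the crawl queue are all assumptions made for illustration; the patent does not specify how candidates are looked up or where the queue is held.

```python
def resolve_account(query, known_accounts, exact=False):
    """Return (name, account_id) candidates for a user query.

    known_accounts maps account name -> account ID (hypothetical data).
    """
    q = query.strip().lower()
    if exact:
        # S12: the user typed the exact account name or the exact account ID.
        return [(name, wx_no) for name, wx_no in known_accounts.items()
                if q == name.lower() or q == wx_no.lower()]
    # S11: fuzzy matching keeps every account whose name contains the query.
    return [(name, wx_no) for name, wx_no in known_accounts.items()
            if q in name.lower()]


def enqueue(account_id, queue):
    """Record the ID as an account to be crawled, ignoring repeats."""
    if account_id not in queue:
        queue.append(account_id)
    return queue
```

In the fuzzy case the user would pick one candidate from the returned list before its ID is enqueued; in the exact case the single match can be enqueued directly.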

Further, the preset crawler matching rules in step S2 are set up as follows: the HTML source of the specified page is crawled with the Scrapy framework, and matching rules are written according to the database structure and the style sheet rules; the related information of the official account to be crawled includes its account ID, its name, the titles of its published articles, and the published articles themselves.
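As a rough illustration of what a matching rule does, the sketch below maps style class names found in a page to the fields to be stored. It uses Python's standard `html.parser` as a stand-in so that it runs without dependencies; in a Scrapy project the same rules would normally be expressed as selectors inside a spider. The class names `account-name`, `account-id` and `article-title` are hypothetical, and real rules have to be written against the actual page source.

```python
from html.parser import HTMLParser


class AccountParser(HTMLParser):
    """Collect account name, account ID and article titles from a page."""

    # Matching rules: style class in the page -> field to fill (hypothetical).
    RULES = {"account-name": "name", "account-id": "wx_no",
             "article-title": "titles"}

    def __init__(self):
        super().__init__()
        self.item = {"name": None, "wx_no": None, "titles": []}
        self._field = None  # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self._field = self.RULES.get(css_class)

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "titles":
            self.item["titles"].append(text)
        else:
            self.item[self._field] = text
        self._field = None


def parse_account(html):
    """Apply the matching rules to one page and return the extracted item."""
    parser = AccountParser()
    parser.feed(html)
    return parser.item
```

Keeping the rules in one table (`RULES`) means that a change in the page layout only requires editing the mapping, not the parsing logic.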

Further, the storage processing in step S3 consists of processing and pre-processing the data, specifically: removing duplicates, formatting the data according to a specified format, and recording the time of entry into the database.
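The three pre-processing operations (deduplication, formatting, recording the entry time) might look like the following sketch. It uses an in-memory SQLite table as a stand-in for the database so that the example is self-contained; the column names, the choice of the article URL as the deduplication key, and the timestamp format are assumptions, not details given in the patent.

```python
import sqlite3
from datetime import datetime


def prepare(record, now=None):
    """Normalize one crawled record and stamp its entry time."""
    now = now or datetime.now()
    return {
        "wx_no": record["wx_no"].strip(),
        "title": " ".join(record["title"].split()),  # collapse whitespace
        "url": record["url"].strip(),
        "create_date": now.strftime("%Y-%m-%d %H:%M:%S"),
    }


def store(conn, records, now=None):
    """Insert records, skipping articles whose URL is already stored.

    Returns the number of rows actually inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS detail ("
        "wx_no TEXT, title TEXT, url TEXT PRIMARY KEY, create_date TEXT)"
    )
    before = conn.total_changes
    for record in records:
        # The primary key on url makes the database reject duplicates.
        conn.execute(
            "INSERT OR IGNORE INTO detail "
            "VALUES (:wx_no, :title, :url, :create_date)",
            prepare(record, now),
        )
    conn.commit()
    return conn.total_changes - before
```

Letting the database enforce uniqueness keeps repeated daily crawls idempotent: an article that was already stored on a previous day is simply skipped.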

Further, the scheduled execution interval in step S4 is set to one day.
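A one-day interval can be realized in several ways (an operating-system scheduler, or a loop inside the crawler process). The sketch below only computes when the next daily run is due; the 02:00 run time and the function names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def next_run(now, hour=2, minute=0):
    """Return the next daily run time strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so schedule tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def seconds_until_next_run(now, hour=2, minute=0):
    """Number of seconds to wait before the next daily crawl."""
    return (next_run(now, hour, minute) - now).total_seconds()
```

A long-running process could sleep for `seconds_until_next_run(datetime.now())` seconds and then start the crawl, repeating after each run.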

Further, displaying the official account articles in step S5 means using JavaWeb technologies to search the locally stored official account content and display it on a front-end page.

Further, the data source, Sogou WeChat search, has mechanisms that restrict crawlers and filter crawler requests; for example, frequent access from the same IP address is denied. The following conditions must therefore be satisfied for the crawler to run normally:

P1: the browser header information must be switched continually while the crawler runs; the browser header information contained in the HTTP request headers can be switched through the internal API provided by Scrapy;

P2: the interval between crawler requests must not be too short. Because a large amount of content has to be fetched and Sogou WeChat search does not allow multiple requests within a short time, it is preferable to leave a certain interval between requests; the specific interval is determined by testing;

P3: different crawlers should, as far as possible, not be executed asynchronously. Although operating-system multithreading could be used, network bandwidth is limited, and crawlers executed asynchronously compete for bandwidth, which leads to unexpected results;

P4: an IP proxy pool is used. Because Sogou WeChat search forbids excessive access to resources from the same IP address, the crawler can be run through IP proxies.
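Conditions P1 to P4 map naturally onto a Scrapy downloader middleware plus a few Scrapy settings. The following is a sketch under stated assumptions, not the patent's implementation: the User-Agent strings, the proxy addresses, the 5-second delay and the module path `myproject.middlewares` are all placeholders. A downloader middleware is a plain class with a `process_request` method, so the block itself does not need to import Scrapy.

```python
import random

# Hypothetical values; real header strings and proxies must be supplied.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


class RotateIdentityMiddleware:
    """Downloader middleware that varies the User-Agent and proxy per request."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES, rng=random):
        self.user_agents = user_agents
        self.proxies = proxies
        self.rng = rng

    def process_request(self, request, spider):
        # P1: switch the browser header carried in the HTTP request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # P4: route the request through a proxy taken from the pool.
        request.meta["proxy"] = self.rng.choice(self.proxies)
        return None  # let Scrapy continue processing the request


# Scrapy settings covering P2 and P3 (values are assumptions to be tuned).
CRAWL_SETTINGS = {
    "DOWNLOAD_DELAY": 5,               # P2: seconds to wait between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,  # P2: vary the wait around that value
    "CONCURRENT_REQUESTS": 1,          # P3: no competing parallel requests
    "DOWNLOADER_MIDDLEWARES": {
        "myproject.middlewares.RotateIdentityMiddleware": 543,
    },
}
```

The settings dictionary could be assigned to a spider's `custom_settings` attribute, or its entries could be copied into the project's `settings.py` as module-level variables.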

Compared with the prior art, the beneficial effects of the technical solution of the present invention are: the crawling facilities provided by the Scrapy framework are used to crawl the historical content of the target WeChat official accounts; crawling is performed daily so that the stored content stays synchronized with the target accounts; and JavaWeb technologies are used to search the locally stored official account content and display it on a front-end page.

Brief description of the drawings

Fig. 1 is the flow chart of the embodiment of the present invention.

Detailed description of the embodiments

The drawings are for illustrative purposes only and shall not be construed as limiting this patent;

To better illustrate this embodiment, some components in the drawings may be omitted, enlarged or reduced, and do not represent the dimensions of the actual product;

Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the drawings and embodiments.

As shown in Fig. 1, this embodiment uses a Python 3.5 environment. First, the Scrapy dependency packages are downloaded and the scrapy command is run to create a crawler project; next, a Python editing tool is installed and the necessary dependency packages are installed with the pip command. The specific implementation steps of the method for crawling WeChat official account information according to the present invention are as follows:

S1: according to user input, determine the name of the official account to be crawled, and store the account ID of that official account in the database;

S5: display the official account articles.

Specifically, step S1 covers the two cases described above: fuzzy matching (S11) and exact matching (S12).

Specifically, the preset crawler matching rules in step S2 are set up as follows: the HTML source of the specified page is crawled with the Scrapy framework, and matching rules are written according to the database structure and the style sheet rules; the related information of the official account to be crawled includes its account ID, its name, the titles of its published articles, and the published articles themselves.

Specifically, the storage processing in step S3 consists of processing and pre-processing the data, specifically: removing duplicates, formatting the data according to a specified format, and recording the time of entry into the database.

Specifically, the scheduled execution interval in step S4 is set to one day.

Specifically, displaying the official account articles in step S5 means using JavaWeb technologies to search the locally stored official account content and display it on a front-end page.

Specifically, the data source, Sogou WeChat search, has mechanisms that restrict crawlers and filter crawler requests; for example, frequent access from the same IP address is denied. Conditions P1 to P4 described above must therefore be satisfied for the crawler to run normally.

Specifically, this embodiment uses a MySQL database to store the data. The official account search table has the following fields:

User: user name, varchar type;

WxName: official account name, varchar type;

WxNo: official account ID, varchar type;

ImgURL: permanent link to the official account image, varchar type;

CreateDate: data acquisition time;

UpdateDate: data update time;

WxNameURL: WeChat article link address;

The official account detail table (detail) has the following fields:

Introduction: article introduction, varchar type;

Title: article title, varchar type;

WxName: official account name, varchar type;

WxNo: official account ID, varchar type;

CreateDate: data acquisition time, varchar type;

UpdateDate: data update time, varchar type;

WxNameURL: WeChat article link address, varchar type;

ContentLink: link to the original article, varchar type;

Html: HTML text of the article content, mediumtext type.
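The two tables can be written down as DDL. The sketch below creates them in an in-memory SQLite database so that it runs without a MySQL server; the column names follow the field list above, while the table name `search`, all column lengths, and the varchar type chosen for the two date columns of the search table are assumptions. The `search_articles` helper is a hypothetical example of the kind of lookup a display layer could run against the detail table.

```python
import sqlite3

# Column names come from the field list; lengths are assumed.
SCHEMA = """
CREATE TABLE search (
    User       VARCHAR(64),
    WxName     VARCHAR(128),
    WxNo       VARCHAR(64),
    ImgURL     VARCHAR(512),
    CreateDate VARCHAR(32),
    UpdateDate VARCHAR(32),
    WxNameURL  VARCHAR(512)
);
CREATE TABLE detail (
    Introduction VARCHAR(1024),
    Title        VARCHAR(256),
    WxName       VARCHAR(128),
    WxNo         VARCHAR(64),
    CreateDate   VARCHAR(32),
    UpdateDate   VARCHAR(32),
    WxNameURL    VARCHAR(512),
    ContentLink  VARCHAR(512),
    Html         MEDIUMTEXT
);
"""


def create_tables(conn):
    """Create the search and detail tables on the given connection."""
    conn.executescript(SCHEMA)


def search_articles(conn, keyword):
    """Return (Title, ContentLink) rows whose title contains the keyword."""
    return conn.execute(
        "SELECT Title, ContentLink FROM detail "
        "WHERE Title LIKE ? ORDER BY CreateDate DESC",
        ("%" + keyword + "%",),
    ).fetchall()
```

Storing the full article HTML in its own large text column keeps the metadata columns small, so list views can query titles and links without loading article bodies.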

The same or similar reference numerals correspond to the same or similar components;

The terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent;

Obviously, the above embodiments are merely examples given to illustrate the present invention clearly, and do not limit the ways in which the invention may be implemented. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A method for crawling WeChat official account information, characterized in that it comprises the following steps:

S5: display the official account articles.

2. The method for crawling WeChat official account information according to claim 1, characterized in that step S1 covers the following two cases:

3. The method for crawling WeChat official account information according to claim 1, characterized in that the preset crawler matching rules in step S2 are set up as follows: the HTML source of the specified page is crawled with the Scrapy framework, and matching rules are written according to the database structure and the style sheet rules; the related information of the official account to be crawled includes its account ID, its name, the titles of its published articles, and the published articles themselves.

4. The method for crawling WeChat official account information according to claim 1, characterized in that the storage processing in step S3 consists of processing and pre-processing the data, specifically: removing duplicates, formatting the data according to a specified format, and recording the time of entry into the database.

5. The method for crawling WeChat official account information according to claim 1, characterized in that the scheduled execution interval in step S4 is set to one day.

6. The method for crawling WeChat official account information according to claim 1, characterized in that displaying the official account articles in step S5 means using JavaWeb technologies to search the locally stored official account content and display it on a front-end page.