CN109388735A - A method for crawling WeChat official account information - Google Patents

Info

Publication number
CN109388735A
Authority
CN
China
Prior art keywords
public platform
crawled
crawler
article
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811070426.9A
Other languages
Chinese (zh)
Inventor
陈曦
蓝志坚
潘健
曾伟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Feng Shi Technology Co Ltd
Original Assignee
Guangzhou Feng Shi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Feng Shi Technology Co Ltd
Priority to CN201811070426.9A
Publication of CN109388735A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention proposes a method for crawling WeChat official account information, the main steps of which are: S1: determine the name of the official account to be crawled, and store the account ID of that account in the database; S2: according to preset crawler matching rules, crawl the official account by its name to obtain its related information; S3: pre-process the related information obtained by the crawler for storage, and store the processed information in the database; S4: set a scheduled execution time for the crawler, periodically crawl the articles of the official account according to its related information, and store the articles obtained by the crawler in the database; S5: display the official account articles. The invention uses the crawling technology provided by scrapy to crawl the historical content of the target WeChat official account, crawls daily so that the content of the target official account is synchronized to local storage, and uses JavaWeb technologies to retrieve the locally stored official account content and display it in a front-end interface.

Description

A method for crawling WeChat official account information
Technical field
The present invention relates to the field of Internet communication technology, and more particularly to a method for crawling WeChat official account information.
Background art
At present, there are two main ways to view the content of a WeChat official account: through the search function of Sogou WeChat, or through the WeChat mobile app. Most existing official account crawlers work by searching for accounts on Sogou WeChat. This is inefficient, and because Sogou WeChat search applies anti-crawler rules, the amount of official account content that can be retrieved is limited, so a large volume of content cannot be obtained in a short time.
Summary of the invention
To overcome at least one of the above shortcomings of the prior art, the present invention provides a method for crawling WeChat official account information.
To solve the above technical problem, the technical solution of the present invention is as follows:
A method for crawling WeChat official account information, characterized in that it comprises the following steps:
S1: determine the name of the official account to be crawled, and store the account ID of that account in the database;
S2: according to preset crawler matching rules, crawl the official account by its name to obtain its related information;
S3: pre-process the related information obtained by the crawler for storage, and store the processed information in the database;
S4: set a scheduled execution time for the crawler, periodically crawl the articles of the official account according to its related information, and store the articles obtained by the crawler in the database;
S5: display the official account articles.
Further, step S1 covers the following two cases:
S11: fuzzy matching, where the user does not know the exact name of the official account to be crawled; once the account has been tentatively identified, its account ID is stored in the database as an account to be crawled;
S12: exact matching, where the user knows the exact name of the official account to be crawled; the user enters the exact account name or account ID, which is stored directly in the database as an account to be crawled.
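By way of illustration only, the two cases of step S1 can be sketched as follows. The sketch assumes that candidate accounts are already available as (name, account ID) pairs and that a dictionary stands in for the database; the function names `find_candidates` and `register` and the sample data in the usage note are hypothetical and do not come from the patent.

```python
def find_candidates(query, candidates, exact=True):
    """Return the (name, account_id) pairs that match the user's query.

    exact=True  corresponds to S12: the query equals the name or the account ID.
    exact=False corresponds to S11: the query is a fragment of the name.
    """
    if exact:
        return [(name, acc) for name, acc in candidates if query in (name, acc)]
    needle = query.lower()
    return [(name, acc) for name, acc in candidates if needle in name.lower()]


def register(account, registry):
    """Store the account ID as an account to be crawled.

    `registry` maps account_id -> name and stands in for the database.
    Returns True if the account was newly stored, False if already present.
    """
    name, account_id = account
    if account_id in registry:
        return False
    registry[account_id] = name
    return True
```

In the fuzzy case the user would pick one of the returned candidates before `register` is called; in the exact case the single match can be registered directly.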
Further, the preset crawler matching rules in step S2 are set as follows: the HTML source of the specified page is crawled with the scrapy framework, and the matching rules are written according to the database structure and the style rule table; the related information of the official account to be crawled includes its account ID, its name, the titles of its published articles, and the published articles themselves.
Further, the storage processing in step S3 is pre-processing before storage, which specifically includes deduplication, formatting according to a specified layout, and recording the insertion time.
Further, the scheduled execution interval in step S4 is set to one day.
Further, displaying the official account articles in step S5 means using JavaWeb technologies to retrieve the locally stored official account content and display it on a front-end page.
Further, the data source, Sogou WeChat, contains mechanisms that restrict crawlers and can filter out crawler requests; for example, frequent access from the same IP address is refused. The following conditions must therefore be met for the crawler to run normally:
P1: the browser header information must be switched continually while the crawler runs; the browser header information contained in the HTTP request headers can be switched through the internal API provided by scrapy;
P2: the interval between crawler requests must not be too short; a relatively large amount of content has to be fetched, and Sogou WeChat does not allow many requests to be sent in a short time, so a certain interval should be left between requests, with the specific interval tuned by experiment;
P3: different crawlers should, as far as possible, not run asynchronously; although operating-system multithreading could be used, network bandwidth is limited, and crawlers running asynchronously compete for bandwidth, which leads to unexpected results;
P4: use an IP proxy pool; because Sogou WeChat forbids excessive access to resources from the same IP address, the crawler can be run through IP proxies.
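By way of illustration only, conditions P2 and P3 could be expressed as a scrapy settings fragment such as the one below. `DOWNLOAD_DELAY`, `RANDOMIZE_DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS` and `CONCURRENT_REQUESTS_PER_DOMAIN` are standard scrapy settings, but the patent does not say which settings it relies on, and the values shown are assumptions that would be tuned as P2 describes.

```python
# settings.py fragment (illustrative values, not taken from the patent)

# P2: wait between consecutive requests, in seconds.
DOWNLOAD_DELAY = 5

# Vary the wait between 0.5x and 1.5x DOWNLOAD_DELAY so requests are not evenly spaced.
RANDOMIZE_DOWNLOAD_DELAY = True

# P3: keep a single request in flight so crawling does not compete for bandwidth.
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

These settings only limit the requests of one crawler. Keeping different crawlers from running at the same time is a scheduling decision, for example starting them one after another.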
Compared with the prior art, the technical solution of the present invention has the following beneficial effects: the crawling technology provided by the scrapy framework is used to crawl the historical content of the target WeChat official account; crawling is performed daily so that the content of the target official account is kept synchronized; and JavaWeb technologies are used to retrieve the locally stored official account content and display it on a front-end page.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the present invention.
Detailed description of the embodiments
The drawings are for illustrative purposes only and shall not be construed as limiting this patent;
To better illustrate this embodiment, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of the actual product;
Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.
The technical solution of the present invention is further described below with reference to the drawings and embodiments.
As shown in Fig. 1, this embodiment uses a Python 3.5 environment. First, the scrapy package and its dependencies are downloaded, and the scrapy command is run to create a crawler project; next, a Python editing tool is installed, and the required dependency packages are installed with the pip command. The specific implementation steps of the method for crawling WeChat official account information according to the present invention are as follows:
S1: according to the user's input, determine the name of the official account to be crawled, and store the account ID of that account in the database;
S2: according to preset crawler matching rules, crawl the official account by its name to obtain its related information;
S3: pre-process the related information obtained by the crawler for storage, and store the processed information in the database;
S4: set a scheduled execution time for the crawler, periodically crawl the articles of the official account according to its related information, and store the articles obtained by the crawler in the database;
S5: display the official account articles.
Specifically, step S1 covers the following two cases:
S11: fuzzy matching, where the user does not know the exact name of the official account to be crawled; once the account has been tentatively identified, its account ID is stored in the database as an account to be crawled;
S12: exact matching, where the user knows the exact name of the official account to be crawled; the user enters the exact account name or account ID, which is stored directly in the database as an account to be crawled.
Specifically, the preset crawler matching rules in step S2 are set as follows: the HTML source of the specified page is crawled with the scrapy framework, and the matching rules are written according to the database structure and the style rule table; the related information of the official account to be crawled includes its account ID, its name, the titles of its published articles, and the published articles themselves.
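A matching rule of this kind can be illustrated with a small extraction sketch. In a scrapy project the rule would typically be written with the selectors scrapy provides (for example `response.css(...)`); to stay self-contained, the sketch below uses only the standard-library `html.parser` module instead. The sample markup, the class name `title` and the names `TitleLinkParser` and `extract_articles` are assumptions for illustration and do not describe the actual Sogou WeChat pages.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the real page structure is not given in the patent.
SAMPLE_HTML = """
<ul class="news-list">
  <li><a class="title" href="/a/1">First article</a></li>
  <li><a class="title" href="/a/2">Second article</a></li>
  <li><a class="other" href="/about">About</a></li>
</ul>
"""


class TitleLinkParser(HTMLParser):
    """Collect (title, href) for every <a class="title"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None  # href of the matching <a> currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "title":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_articles(html):
    """Apply the matching rule to an HTML string and return the matches."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.items
```

Calling `extract_articles(SAMPLE_HTML)` returns the two title links and skips the link whose class does not match the rule.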
Specifically, the storage processing in step S3 is pre-processing before storage, which specifically includes deduplication, formatting according to a specified layout, and recording the insertion time.
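The three operations named in step S3 can be sketched as below. This is a minimal sketch under assumptions: records are dictionaries with `title` and `link` keys, duplicates are detected by link, and the "specified layout" is taken to mean trimmed links and whitespace-normalized titles. The function name `prepare_for_storage` and these field names are hypothetical; the patent does not specify the deduplication key or the format.

```python
from datetime import datetime, timezone


def prepare_for_storage(records, seen_links, now=None):
    """Deduplicate by link, normalize fields, and stamp the insertion time.

    records:    iterable of dicts with "title" and "link" keys (assumed shape).
    seen_links: set of links already stored; updated in place.
    now:        insertion time; defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    rows = []
    for rec in records:
        link = rec["link"].strip()
        if link in seen_links:
            continue  # deduplication: this article was stored earlier
        seen_links.add(link)
        rows.append({
            "title": " ".join(rec["title"].split()),  # collapse runs of whitespace
            "link": link,
            "createDate": stamp,  # insertion time
        })
    return rows
```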
Specifically, the scheduled execution interval in step S4 is set to one day.
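A minimal sketch of the one-day interval, assuming the time of the last completed run is recorded somewhere; the function names `next_run` and `is_due` are hypothetical. The patent does not say how the schedule is triggered, and an operating-system scheduler such as cron could serve equally well.

```python
from datetime import datetime, timedelta

ONE_DAY = timedelta(days=1)


def next_run(last_run, interval=ONE_DAY):
    """Return the earliest time at which the crawler should run again."""
    return last_run + interval


def is_due(last_run, now, interval=ONE_DAY):
    """Return True once at least one interval has passed since the last run."""
    return now >= next_run(last_run, interval)
```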
Specifically, displaying the official account articles in step S5 means using JavaWeb technologies to retrieve the locally stored official account content and display it on a front-end page.
Specifically, the data source, Sogou WeChat, contains mechanisms that restrict crawlers and can filter out crawler requests; for example, frequent access from the same IP address is refused. The following conditions must therefore be met for the crawler to run normally:
P1: the browser header information must be switched continually while the crawler runs; the browser header information contained in the HTTP request headers can be switched through the internal API provided by scrapy;
P2: the interval between crawler requests must not be too short; a relatively large amount of content has to be fetched, and Sogou WeChat does not allow many requests to be sent in a short time, so a certain interval should be left between requests, with the specific interval tuned by experiment;
P3: different crawlers should, as far as possible, not run asynchronously; although operating-system multithreading could be used, network bandwidth is limited, and crawlers running asynchronously compete for bandwidth, which leads to unexpected results;
P4: use an IP proxy pool; because Sogou WeChat forbids excessive access to resources from the same IP address, the crawler can be run through IP proxies.
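One possible way to realize P1 and P4 is sketched below under stated assumptions. Scrapy downloader middlewares are plain classes whose `process_request(request, spider)` method may modify each outgoing request, and scrapy's built-in proxy support reads `request.meta["proxy"]`. The class names, the user-agent strings and the proxy addresses are placeholders and are not values from this embodiment.

```python
import random

# Placeholder values; a real deployment would supply its own lists.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]
PROXIES = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RotateUserAgentMiddleware:
    """P1: give each request a User-Agent header picked from a pool."""

    def __init__(self, user_agents=USER_AGENTS, rng=random):
        self.user_agents = list(user_agents)
        self.rng = rng

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None  # None tells scrapy to continue handling the request


class RotateProxyMiddleware:
    """P4: route each request through a proxy picked from a pool."""

    def __init__(self, proxies=PROXIES, rng=random):
        self.proxies = list(proxies)
        self.rng = rng

    def process_request(self, request, spider):
        request.meta["proxy"] = self.rng.choice(self.proxies)
        return None
```

Such classes would be enabled through the `DOWNLOADER_MIDDLEWARES` setting of the crawler project. Choosing a header and a proxy per request, rather than per run, spreads requests across identities, at the cost of depending on the quality of the proxy pool.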
Specifically, this embodiment uses a MySQL database to store the related data. The official account search table has the following fields:
User: user name, varchar type;
WxName: official account name, varchar type;
WxNo: official account ID, varchar type;
ImgURL: permanent link to the official account's picture, varchar type;
CreateDate: data acquisition time;
UpdateDate: data update time;
WxNameURL: WeChat article link address;
The official account details table, detail, has the following fields:
Introduction: article introduction, varchar type;
Title: article title, varchar type;
WxName: official account name, varchar type;
WxNo: official account ID, varchar type;
CreateDate: data acquisition time, varchar type;
UpdateDate: data update time, varchar type;
WxNameURL: WeChat article link address, varchar type;
ContentLink: link to the original article, varchar type;
Html: HTML content of the article page, mediumtext type.
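The two tables can be sketched as follows. The embodiment uses MySQL; to keep the sketch self-contained and runnable it uses the standard-library `sqlite3` module as a stand-in, so the column types are approximations. The table name `wx_search`, the column lengths, the types of the three search-table columns whose types are not stated, and the functions `open_store` and `search_titles` are assumptions for illustration.

```python
import sqlite3

# Column names follow the field lists above; lengths and unstated types are assumed.
SCHEMA = """
CREATE TABLE wx_search (
    User       VARCHAR(64),
    WxName     VARCHAR(128),
    WxNo       VARCHAR(64),
    ImgURL     VARCHAR(512),
    CreateDate VARCHAR(32),
    UpdateDate VARCHAR(32),
    WxNameURL  VARCHAR(512)
);
CREATE TABLE detail (
    Introduction VARCHAR(1024),
    Title        VARCHAR(256),
    WxName       VARCHAR(128),
    WxNo         VARCHAR(64),
    CreateDate   VARCHAR(32),
    UpdateDate   VARCHAR(32),
    WxNameURL    VARCHAR(512),
    ContentLink  VARCHAR(512),
    Html         TEXT
);
"""


def open_store():
    """Create an in-memory database containing both tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def search_titles(conn, keyword):
    """Return (Title, ContentLink) rows whose title contains the keyword, newest first."""
    cur = conn.execute(
        "SELECT Title, ContentLink FROM detail "
        "WHERE Title LIKE ? ORDER BY CreateDate DESC",
        ("%" + keyword + "%",),
    )
    return cur.fetchall()
```

A query like `search_titles` is the kind of retrieval that the display step S5 would run against the locally stored content before rendering the results on a page.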
The same or similar reference numerals correspond to the same or similar components;
The terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent;
Obviously, the above embodiment of the present invention is merely an example given to illustrate the invention clearly, and does not limit the ways in which the invention may be implemented. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims (6)

1. A method for crawling WeChat official account information, characterized in that it comprises the following steps:
S1: determine the name of the official account to be crawled, and store the account ID of that account in the database;
S2: according to preset crawler matching rules, crawl the official account by its name to obtain its related information;
S3: pre-process the related information obtained by the crawler for storage, and store the processed information in the database;
S4: set a scheduled execution time for the crawler, periodically crawl the articles of the official account according to its related information, and store the articles obtained by the crawler in the database;
S5: display the official account articles.
2. The method for crawling WeChat official account information according to claim 1, characterized in that step S1 covers the following two cases:
S11: fuzzy matching, where the user does not know the exact name of the official account to be crawled; once the account has been tentatively identified, its account ID is stored in the database as an account to be crawled;
S12: exact matching, where the user knows the exact name of the official account to be crawled; the user enters the exact account name or account ID, which is stored directly in the database as an account to be crawled.
3. The method for crawling WeChat official account information according to claim 1, characterized in that the preset crawler matching rules in step S2 are set as follows: the HTML source of the specified page is crawled with the scrapy framework, and the matching rules are written according to the database structure and the style rule table; the related information of the official account to be crawled includes its account ID, its name, the titles of its published articles, and the published articles themselves.
4. The method for crawling WeChat official account information according to claim 1, characterized in that the storage processing in step S3 is pre-processing before storage, which specifically includes deduplication, formatting according to a specified layout, and recording the insertion time.
5. The method for crawling WeChat official account information according to claim 1, characterized in that the scheduled execution interval in step S4 is set to one day.
6. The method for crawling WeChat official account information according to claim 1, characterized in that displaying the official account articles in step S5 means using JavaWeb technologies to retrieve the locally stored official account content and display it on a front-end page.
Application number: CN201811070426.9A; priority date: 2018-09-13; filing date: 2018-09-13; title: A method for crawling WeChat official account information; status: Pending; publication: CN109388735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811070426.9A CN109388735A (en) 2018-09-13 2018-09-13 A method of crawling wechat public platform information

Publications (1)

Publication Number Publication Date
CN109388735A (en) 2019-02-26

Family

Family ID: 65418683

Country Status (1)

Country Link
CN (1) CN109388735A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190226