CN109388735A - A method of crawling wechat public platform information - Google Patents
A method of crawling wechat public platform information
- Publication number
- CN109388735A (application number CN201811070426.9A)
- Authority
- CN
- China
- Prior art keywords
- public platform
- crawled
- crawler
- article
- storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention proposes a method for crawling WeChat official account information. The main steps are: S1: determine the name of the official account to be crawled and store its account ID in the database; S2: according to preset crawler matching rules, crawl the official account by its name to obtain its related information; S3: preprocess the related information obtained by the crawler and store the processed information in the database; S4: set a scheduled execution interval for the crawler, periodically crawl the account's articles based on the stored related information, and store the articles obtained; S5: display the official account articles. The invention uses the crawling facilities of the Scrapy framework to crawl the historical content of target WeChat official accounts and repeats the crawl daily so that local storage stays synchronized with the target accounts. Java Web technologies are then used to search the locally stored official account content and display it in a front-end interface.
Description
Technical field
The present invention relates to the field of Internet communication technology, and more particularly to a method for crawling WeChat official account information.
Background technique
At present, there are two main ways to view the content of a WeChat official account: through the search function of Sogou WeChat, or through the WeChat mobile app. Most existing WeChat official account crawlers work by searching for official accounts on Sogou WeChat, which is inefficient. Sogou WeChat search also applies anti-crawler rules, so only a limited amount of official account content can be retrieved, and a large amount of content cannot be obtained in a short time.
Summary of the invention
To overcome at least one of the above drawbacks of the prior art, the present invention provides a method for crawling WeChat official account information.
To solve the above technical problem, the technical solution of the present invention is as follows:
A method for crawling WeChat official account information, characterized in that it comprises the following steps:
S1: determine the name of the official account to be crawled, and store the account ID of the official account to be crawled in the database;
S2: according to preset crawler matching rules, crawl the official account by its name and obtain its related information;
S3: preprocess the related information obtained by the crawler, and store the processed related information in the database;
S4: set a scheduled execution interval for the crawler, periodically crawl the articles of the official account according to its related information, and store the articles obtained by the crawler in the database;
S5: display the official account articles.
Further, step S1 covers the following two cases:
S11: fuzzy matching, where the user does not know the exact name of the official account to be crawled; once the target account has been tentatively identified, its account ID is stored and it is treated as the official account to be crawled;
S12: exact matching, where the user knows the exact name of the official account to be crawled; the user enters the exact account name or account ID, which is stored directly, and the account is treated as the official account to be crawled.
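The two cases in step S1 can be sketched roughly as follows. This is only an illustration: the patent does not say how fuzzy candidates are found, so the use of Python's `difflib`, the `KNOWN_ACCOUNTS` directory, and the function name `resolve_account` are all assumptions made for the example.

```python
from difflib import get_close_matches

# Hypothetical local directory of known accounts: name -> account ID.
KNOWN_ACCOUNTS = {
    "Example Tech Daily": "exampletech",
    "Example Tech Weekly": "exampleweekly",
    "City Food Guide": "cityfood",
}


def resolve_account(query, accounts=KNOWN_ACCOUNTS):
    """Return ("exact", [id]) for case S12 or ("fuzzy", [candidate ids]) for S11."""
    # S12: the query equals a known name or a known account ID.
    if query in accounts:
        return "exact", [accounts[query]]
    if query in accounts.values():
        return "exact", [query]
    # S11: otherwise propose the closest names and let the user confirm one.
    names = get_close_matches(query, list(accounts), n=3, cutoff=0.5)
    return "fuzzy", [accounts[name] for name in names]
```

In the fuzzy case the caller would show the candidates to the user and store only the account ID that the user confirms.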
Further, the preset crawler matching rules in step S2 are set up as follows: the HTML source of the specified page is crawled with the Scrapy framework, and matching rules are written according to the structure and style rules found in that source. The related information of the official account to be crawled includes its account ID, its name, the titles of its published articles, and the published articles themselves.
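To make the idea of a matching rule concrete, here is a minimal sketch that maps CSS classes to result fields. It uses Python's standard `html.parser` as a stand-in so that it runs without Scrapy; in a Scrapy project the same rules would more likely be written as selector expressions. The sample markup and the class names `account-name`, `account-id`, and `article-title` are invented for the example and do not describe any real page.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real markup must be inspected before writing rules.
SAMPLE_HTML = """
<div class="account">
  <p class="account-name">Example Tech Daily</p>
  <label class="account-id">exampletech</label>
  <a class="article-title" href="/s/abc">First post</a>
  <a class="article-title" href="/s/def">Second post</a>
</div>
"""


class AccountPageParser(HTMLParser):
    """Collect text for elements whose class attribute matches a rule."""

    # Matching rules: CSS class on the page -> field name in the result.
    RULES = {
        "account-name": "name",
        "account-id": "account_id",
        "article-title": "article_titles",
    }

    def __init__(self):
        super().__init__()
        self.result = {"name": None, "account_id": None, "article_titles": []}
        self._field = None  # field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self._field = self.RULES.get(css_class)

    def handle_data(self, data):
        text = data.strip()
        if not self._field or not text:
            return
        if self._field == "article_titles":
            self.result["article_titles"].append(text)
        else:
            self.result[self._field] = text

    def handle_endtag(self, tag):
        self._field = None


def parse_account_page(html):
    parser = AccountPageParser()
    parser.feed(html)
    return parser.result
```

Keeping the rules in one table means that when the page layout changes, only the `RULES` mapping has to be rewritten.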
Further, the storage processing in step S3 consists of processing and preprocessing the data before it is stored; specifically, it includes deduplication, formatting the data into a specified layout, and recording the time of entry.
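The three preprocessing operations named for step S3 might look like the sketch below. The record keys (`account_id`, `title`, `create_date`), the choice of deduplication key, and the timestamp format are assumptions; the patent does not specify them.

```python
from datetime import datetime, timezone


def prepare_for_storage(raw_records, now=None):
    """Deduplicate by (account_id, title), normalise fields, stamp entry time."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    seen = set()
    prepared = []
    for record in raw_records:
        account_id = record["account_id"].strip()
        # Collapse runs of whitespace so trivially different titles compare equal.
        title = " ".join(record["title"].split())
        key = (account_id, title)
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        prepared.append(
            {"account_id": account_id, "title": title, "create_date": stamp}
        )
    return prepared
```

Passing `now` explicitly keeps the function deterministic, which is convenient when the same batch has to be reprocessed or tested.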
Further, the scheduled execution interval in step S4 is set to one day.
Further, displaying the official account articles in step S5 means using Java Web technologies to search the locally stored official account content and display it on a front-end page.
Further, the data source, Sogou WeChat, contains mechanisms that restrict crawlers and filter out crawler requests; for example, frequent access from the same IP address is denied. The following conditions must therefore be met for the crawler to run normally:
P1: the browser header information must be switched continually while the crawler runs; the browser header information contained in the HTTP request headers can be switched through the internal API provided by Scrapy;
P2: the interval between crawler executions must not be too short; because a fairly large amount of content has to be fetched and Sogou WeChat does not allow many requests in a short time, a certain interval should be left between requests, with the specific interval tuned by experiment;
P3: different crawlers should preferably not run asynchronously; although operating system multithreading is available, network bandwidth is limited, and crawlers running asynchronously compete for it, which can lead to unexpected failures;
P4: use an IP proxy pool; because Sogou WeChat forbids excessive access to its resources from the same IP address, the crawler can be run through IP proxies.
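Conditions P1 and P4 can be illustrated with a class shaped like a Scrapy downloader middleware, which receives each outgoing request in `process_request(request, spider)`. This is a sketch under assumptions: the patent does not name the Scrapy API it uses, "browser header information" is taken here to mean the `User-Agent` header, and the agent strings and proxy addresses are placeholders. The class itself does not import Scrapy, so it can be exercised with any object that has `headers` and `meta` attributes.

```python
import itertools
import random

# Hypothetical pools; real values would come from configuration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


class RotateIdentityMiddleware:
    """Downloader-middleware-shaped class: new User-Agent and proxy per request."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES, seed=None):
        self._agents = itertools.cycle(user_agents)  # P1: round-robin headers
        self._proxies = list(proxies)
        self._rng = random.Random(seed)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = next(self._agents)
        # P4: Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"].
        request.meta["proxy"] = self._rng.choice(self._proxies)
        return None  # let Scrapy continue processing the request
```

In a real project such a class would be registered under `DOWNLOADER_MIDDLEWARES` in the project settings; the pool contents and rotation policy would need tuning against the target site.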
Compared with the prior art, the technical solution of the present invention has the following beneficial effects: the crawling facilities of the Scrapy framework are used to crawl the historical content of target WeChat official accounts, the crawl is repeated daily so that the stored content stays synchronized with the target accounts, and Java Web technologies are used to search the locally stored official account content and display it on a front-end page.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the present invention.
Specific embodiment
The drawings are for illustrative purposes only and shall not be construed as limiting this patent.
To better illustrate this embodiment, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of an actual product.
Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.
The technical solution of the present invention is further described below with reference to the drawings and embodiments.
As shown in Fig. 1, this embodiment uses a Python 3.5 environment. First, the Scrapy dependency packages are downloaded and the scrapy command is run to create a crawler project; next, a Python editing tool is installed and the necessary functional dependency packages are installed with the pip command. The specific implementation steps of the method for crawling WeChat official account information are as follows:
S1: according to user input, determine the name of the official account to be crawled, and store the account ID of the official account to be crawled in the database;
S2: according to preset crawler matching rules, crawl the official account by its name and obtain its related information;
S3: preprocess the related information obtained by the crawler, and store the processed related information in the database;
S4: set a scheduled execution interval for the crawler, periodically crawl the articles of the official account according to its related information, and store the articles obtained by the crawler in the database;
S5: display the official account articles.
Specifically, step S1 covers the following two cases:
S11: fuzzy matching, where the user does not know the exact name of the official account to be crawled; once the target account has been tentatively identified, its account ID is stored and it is treated as the official account to be crawled;
S12: exact matching, where the user knows the exact name of the official account to be crawled; the user enters the exact account name or account ID, which is stored directly, and the account is treated as the official account to be crawled.
Specifically, the preset crawler matching rules in step S2 are set up as follows: the HTML source of the specified page is crawled with the Scrapy framework, and matching rules are written according to the structure and style rules found in that source. The related information of the official account to be crawled includes its account ID, its name, the titles of its published articles, and the published articles themselves.
Specifically, the storage processing in step S3 consists of processing and preprocessing the data before it is stored; it includes deduplication, formatting the data into a specified layout, and recording the time of entry.
Specifically, the scheduled execution interval in step S4 is set to one day.
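The one-day interval can be expressed as a small due-check that a scheduler loop or cron-style job could call before starting a crawl. The function names and the idea of persisting a `last_run` timestamp are assumptions; the patent only states the interval.

```python
from datetime import datetime, timedelta

CRAWL_INTERVAL = timedelta(days=1)  # the one-day interval from step S4


def next_run(last_run, interval=CRAWL_INTERVAL):
    """Return when the crawler should next start."""
    return last_run + interval


def is_due(last_run, now, interval=CRAWL_INTERVAL):
    """True once at least one full interval has passed since the last run."""
    return now >= next_run(last_run, interval)
```

Comparing against the stored time of the previous run, rather than sleeping for a fixed period, keeps the schedule stable even if a crawl itself takes a long time.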
Specifically, displaying the official account articles in step S5 means using Java Web technologies to search the locally stored official account content and display it on a front-end page.
Specifically, the data source, Sogou WeChat, contains mechanisms that restrict crawlers and filter out crawler requests; for example, frequent access from the same IP address is denied. The following conditions must therefore be met for the crawler to run normally:
P1: the browser header information must be switched continually while the crawler runs; the browser header information contained in the HTTP request headers can be switched through the internal API provided by Scrapy;
P2: the interval between crawler executions must not be too short; because a fairly large amount of content has to be fetched and Sogou WeChat does not allow many requests in a short time, a certain interval should be left between requests, with the specific interval tuned by experiment;
P3: different crawlers should preferably not run asynchronously; although operating system multithreading is available, network bandwidth is limited, and crawlers running asynchronously compete for it, which can lead to unexpected failures;
P4: use an IP proxy pool; because Sogou WeChat forbids excessive access to its resources from the same IP address, the crawler can be run through IP proxies.
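For conditions P2 and P3, a Scrapy project would typically express request pacing in its `settings.py`. The fragment below uses standard Scrapy setting names, but every value is an illustrative guess, since the patent leaves the interval to be tuned by experiment. Note also that P3 concerns separate crawlers; these settings only limit concurrency inside one crawler, and running crawlers one after another remains a scheduling decision made outside this file.

```python
# settings.py fragment for a Scrapy project; every value is an illustrative guess.

# P2: wait between consecutive requests to the same site (seconds).
DOWNLOAD_DELAY = 5
# Scrapy then waits a random 0.5x to 1.5x multiple of DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# P3: keep requests sequential instead of competing for bandwidth.
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Optional: let Scrapy lengthen the delay when responses slow down.
AUTOTHROTTLE_ENABLED = True
```

Starting with a conservative delay and lowering it only while requests keep succeeding matches the trial-and-error tuning the text describes.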
Specifically, this embodiment stores the crawled data in a MySQL database. The official account search table has the following fields:
User: user name, varchar type;
WxName: official account name, varchar type;
WxNo: official account ID, varchar type;
ImgURL: permanent link to the official account's picture, varchar type;
CreateDate: data acquisition time;
UpdateDate: data update time;
WxNameURL: WeChat article link address;
Official account details table (detail):
Introduction: article introduction, varchar type;
Title: article title, varchar type;
WxName: official account name, varchar type;
WxNo: official account ID, varchar type;
CreateDate: data acquisition time, varchar type;
UpdateDate: data update time, varchar type;
WxNameURL: WeChat article link address, varchar type;
ContentLink: link to the original article, varchar type;
Html: HTML text of the website article content, media text type.
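The two tables can be sketched as runnable DDL. The embodiment uses MySQL; the example below substitutes Python's built-in `sqlite3` so that it runs without a database server, which means the column types are simplified to `TEXT`. The table name `search`, the `UNIQUE` constraints, and the helper names are assumptions added for the illustration; only the column names and the `detail` table name come from the field list.

```python
import sqlite3

# SQLite stands in for MySQL here; column names follow the field list above.
SCHEMA = """
CREATE TABLE search (
    User TEXT, WxName TEXT, WxNo TEXT, ImgURL TEXT,
    CreateDate TEXT, UpdateDate TEXT, WxNameURL TEXT,
    UNIQUE (WxNo)            -- assumption: one row per account ID
);
CREATE TABLE detail (
    Introduction TEXT, Title TEXT, WxName TEXT, WxNo TEXT,
    CreateDate TEXT, UpdateDate TEXT, WxNameURL TEXT,
    ContentLink TEXT, Html TEXT,
    UNIQUE (WxNo, Title)     -- assumption: used to skip repeated articles
);
"""


def open_store(path=":memory:"):
    """Open a database and create both tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_article(conn, article):
    """Insert one article; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO detail (Title, WxName, WxNo, CreateDate, Html) "
        "VALUES (:Title, :WxName, :WxNo, :CreateDate, :Html)",
        article,
    )
    conn.commit()
    return cur.rowcount == 1
```

Letting the database enforce uniqueness is one way to make the daily crawl idempotent: articles fetched again on a later day are simply ignored.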
The same or similar reference numerals correspond to the same or similar components.
Terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent.
Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the claims of the present invention.
Claims (6)
1. A method for crawling WeChat official account information, characterized in that it comprises the following steps:
S1: determine the name of the official account to be crawled, and store the account ID of the official account to be crawled in the database;
S2: according to preset crawler matching rules, crawl the official account by its name and obtain its related information;
S3: preprocess the related information obtained by the crawler, and store the processed related information in the database;
S4: set a scheduled execution interval for the crawler, periodically crawl the articles of the official account according to its related information, and store the articles obtained by the crawler in the database;
S5: display the official account articles.
2. The method for crawling WeChat official account information according to claim 1, characterized in that step S1 covers the following two cases:
S11: fuzzy matching, where the user does not know the exact name of the official account to be crawled; once the target account has been tentatively identified, its account ID is stored and it is treated as the official account to be crawled;
S12: exact matching, where the user knows the exact name of the official account to be crawled; the user enters the exact account name or account ID, which is stored directly, and the account is treated as the official account to be crawled.
3. The method for crawling WeChat official account information according to claim 1, characterized in that the preset crawler matching rules in step S2 are set up as follows: the HTML source of the specified page is crawled with the Scrapy framework, and matching rules are written according to the structure and style rules found in that source; the related information of the official account to be crawled includes its account ID, its name, the titles of its published articles, and the published articles themselves.
4. The method for crawling WeChat official account information according to claim 1, characterized in that the storage processing in step S3 consists of processing and preprocessing the data, and specifically includes deduplication, formatting the data into a specified layout, and recording the time of entry.
5. The method for crawling WeChat official account information according to claim 1, characterized in that the scheduled execution interval in step S4 is set to one day.
6. The method for crawling WeChat official account information according to claim 1, characterized in that displaying the official account articles in step S5 means using Java Web technologies to search the locally stored official account content and display it on a front-end page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811070426.9A CN109388735A (en) | 2018-09-13 | 2018-09-13 | A method of crawling wechat public platform information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811070426.9A CN109388735A (en) | 2018-09-13 | 2018-09-13 | A method of crawling wechat public platform information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109388735A (en) | 2019-02-26 |
Family
ID=65418683
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811070426.9A Pending CN109388735A (en) | 2018-09-13 | 2018-09-13 | A method of crawling wechat public platform information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109388735A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140229601A1 (en) * | 2011-09-22 | 2014-08-14 | Beijing Qihoo Technology Company Limited | URL Navigation Page Generation Method, Device and Program |
CN104317880A (en) * | 2014-10-22 | 2015-01-28 | 浪潮软件集团有限公司 | Method special for microblog data acquisition mode |
CN105320740A (en) * | 2015-09-22 | 2016-02-10 | 清华大学 | WeChat article and official account acquisition method and acquisition system |
CN107122466A (en) * | 2017-04-28 | 2017-09-01 | 福建中金在线信息科技有限公司 | A kind of web documents querying method and system |
CN107315736A (en) * | 2017-06-22 | 2017-11-03 | 云天弈(北京)信息技术有限公司 | A kind of assisted writing system and method |
CN107948052A (en) * | 2017-11-14 | 2018-04-20 | 福建中金在线信息科技有限公司 | Information crawler method, apparatus, electronic equipment and system |
Non-Patent Citations (2)
Title |
---|
吴霖: "分布式微信公众平台爬虫系统的研究与应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
郑铁男 等: "《数字编辑运营实训教程》", 30 September 2017 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188257A (en) * | 2019-04-16 | 2019-08-30 | 国家计算机网络与信息安全管理中心 | A kind of mobile application collecting method and device |
CN110188257B (en) * | 2019-04-16 | 2021-12-31 | 国家计算机网络与信息安全管理中心 | Mobile application data acquisition method and device |
CN110263266A (en) * | 2019-05-20 | 2019-09-20 | 江苏大学 | A kind of method for exhibiting data based on wechat small routine and crawler |
CN111008319A (en) * | 2019-10-29 | 2020-04-14 | 上海医望网络科技有限公司 | Content management system based on artificial intelligence |
CN112541107A (en) * | 2020-12-25 | 2021-03-23 | 天津浪淘科技股份有限公司 | Page data learning and automatic acquisition method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109388735A (en) | A method of crawling wechat public platform information | |
US7827166B2 (en) | Handling dynamic URLs in crawl for better coverage of unique content | |
US8838527B2 (en) | Virtual environment spanning desktop and cloud | |
US8745082B2 (en) | Methods and apparatus for evaluating XPath filters on fragmented and distributed XML documents | |
US20070143306A1 (en) | Integrated website management system and management method thereof | |
CN103942309B (en) | A kind of implementation method of Network Data Capture equipment, method and acquisition process | |
US9454535B2 (en) | Topical mapping | |
CN109033403B (en) | Method, apparatus and storage medium for searching blockchain data | |
WO2015164108A1 (en) | Decoupling front end page and back end using tags | |
US20150331948A1 (en) | Search infrastructure and method for performing web search | |
WO2020207022A1 (en) | Scrapy-based data crawling method and system, terminal device, and storage medium | |
CN109101607B (en) | Method, apparatus and storage medium for searching blockchain data | |
US20200142674A1 (en) | Extracting web api endpoint data from source code | |
CN109241391A (en) | A kind of anti-crawler method climbed of solution font | |
CN111858255A (en) | User behavior acquisition method based on screenshot and related equipment | |
US20080301541A1 (en) | Online internet navigation system and method | |
EP2018757A1 (en) | A method of rendering at least one element in a client browser | |
CN110532455A (en) | A kind of Web page picture acquisition methods and system based on Chrome browser | |
Suguna et al. | User interest level based preprocessing algorithms using web usage mining | |
CN106919600A (en) | One kind failure network address access method and terminal | |
Kherwa et al. | Data preprocessing: A milestone of web usage mining | |
JP4259858B2 (en) | WWW site history search device, method and program | |
JP2005071319A (en) | Keyword acquiring device for homepage | |
Yang et al. | An architecture for integrating OODBs with WWW | |
Ge et al. | Robots exclusion and guidance protocol |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190226 |