CN107644021A

CN107644021A - Information collecting method and information collecting device

Info

Publication number: CN107644021A
Application number: CN201610575716.3A
Authority: CN
Inventors: 张学颖; 张丹; 于晓明; 杨建武
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Priority date: 2016-07-20
Filing date: 2016-07-20
Publication date: 2018-01-30

Abstract

The present invention provides an information collection method and an information collection device. The information collection method includes: after logging in to the web version of an application, obtaining a cookie of the application and a list page link for the data published on the application using a target account; obtaining a list page according to the cookie and the list page link, and obtaining at least one content page link in the list page; and downloading the content page corresponding to each content page link in the at least one content page link. With the technical solution of the present invention, human behavior can be simulated to collect valuable data from the massive data in an application, thereby improving the efficiency of information collection.

Description

Information collection method and information collection device

Technical field

The present invention relates to the technical field of information processing, and in particular to an information collection method and an information collection device.

Background

At present, WeChat has more than 10 million public accounts and hundreds of millions of articles, and the number of articles grows by millions every day. Because the articles published by public accounts have relatively high data value, collecting articles from WeChat public accounts has become an essential part of massive data collection.

Collecting articles from WeChat public accounts means obtaining the articles published by those accounts in real time. Compared with other collection tasks, WeChat is tied to terminal devices such as mobile phones and tablets, so its collection method is distinctive: it requires simulating human behavior and is subject to very strict blocking.

Therefore, how to simulate human behavior to collect valuable data from the massive data of WeChat, and thereby improve the efficiency of information collection, has become an urgent problem to be solved.

Summary of the invention

In view of the above problem, the present invention provides a new technical solution that can simulate human behavior to collect valuable data from the massive data in an application, thereby improving the efficiency of information collection.

In view of this, a first aspect of the present invention provides an information collection method, including: after logging in to the web version of an application, obtaining a cookie of the application and a list page link for the data published on the application using a target account; obtaining a list page according to the cookie and the list page link, and obtaining at least one content page link in the list page; and downloading the content page corresponding to each content page link in the at least one content page link.

In this technical solution, after logging in to the web version of the application (for example, by using the selenium webdriver tool), the cookie of the application and the list page link for the data published on the application using the target account are obtained in order to obtain the list page. At least one content page link is then obtained from the list page, and finally the data published by the target account on the application can be obtained through the at least one content page link. In this way, human behavior is simulated to collect valuable data from the massive data in the application, which improves the efficiency of information collection.
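As an illustration of the cookie acquisition step, the following minimal Python sketch turns browser cookie records into the value of an HTTP Cookie header. It assumes the records have the shape of the list of dictionaries returned by Selenium's driver.get_cookies() (each with "name" and "value" keys). The function name build_cookie_header and the sample cookie names and values are hypothetical and are not taken from the invention.

```python
from typing import Dict, Iterable


def build_cookie_header(browser_cookies: Iterable[Dict[str, object]]) -> str:
    """Join browser cookie records into the value of an HTTP Cookie header."""
    pairs = []
    for cookie in browser_cookies:
        # Each record is expected to carry at least "name" and "value".
        pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)


# Hypothetical records shaped like Selenium's driver.get_cookies() output;
# the names, values, and domain are made up for illustration.
sample_cookies = [
    {"name": "session_id", "value": "abc123", "domain": "example.invalid"},
    {"name": "user_token", "value": "xyz789", "domain": "example.invalid"},
]
cookie_header = build_cookie_header(sample_cookies)
```

The resulting string could then be sent as the Cookie request header when the list page link is requested outside the browser.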

For example, after logging in to the web version of WeChat with selenium webdriver, the WeChat cookie and the list page link for the articles published by the target public account named "Beijing" are obtained, and the list page is obtained according to the cookie and the list page link. The list page contains articles titled "Beijing 5-Day Travel Guide", "Top 10 Must-Visit Tourist Attractions in Beijing", and "Beijing Food Collection". The list page is parsed to obtain the content page links it contains, that is, the links for accessing the content of these three articles, and the content of the articles can then be obtained through these links.
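The parsing step in the "Beijing" example can be sketched as follows. This is only an illustration: it assumes the list page is JSON (which the description names as a preferred format), and the keys "articles", "title", and "url", the function name extract_content_links, and the example.invalid URLs are all hypothetical, since the actual layout of the list page is not disclosed.

```python
import json
from typing import List, Tuple


def extract_content_links(list_page_text: str) -> List[Tuple[str, str]]:
    """Return (title, url) pairs for every article entry in a JSON list page."""
    page = json.loads(list_page_text)
    links = []
    for entry in page.get("articles", []):  # "articles" is an assumed key
        url = entry.get("url")
        if url:  # skip entries that carry no content page link
            links.append((entry.get("title", ""), url))
    return links


# Hypothetical list page for the target public account named "Beijing".
sample_list_page = json.dumps({
    "articles": [
        {"title": "Beijing 5-Day Travel Guide",
         "url": "https://example.invalid/article/1"},
        {"title": "Top 10 Must-Visit Tourist Attractions in Beijing",
         "url": "https://example.invalid/article/2"},
        {"title": "Beijing Food Collection",
         "url": "https://example.invalid/article/3"},
    ]
})
content_links = extract_content_links(sample_list_page)
```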

In the above technical solution, preferably, the step of obtaining the list page according to the cookie and the list page link further includes: if the list page is not obtained according to the cookie and the list page link, obtaining another cookie of the application, so as to obtain the list page according to the other cookie and the list page link.

In this technical solution, because the cookie of the application is valid only for a limited time, failure to obtain the list page according to the cookie and the list page link indicates that the cookie is invalid. Another cookie is then obtained, and the list page is obtained according to the other cookie and the list page link.
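The cookie fallback described above can be sketched as a bounded retry loop. This is a minimal sketch under stated assumptions: get_cookie and fetch are hypothetical callables supplied by the caller (fetch returns None when the list page cannot be obtained), and the max_attempts limit is an addition for safety that the description does not mention.

```python
from typing import Callable, Optional


def fetch_list_page_with_retry(
    get_cookie: Callable[[], str],
    fetch: Callable[[str, str], Optional[str]],
    list_page_link: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Try to obtain the list page, switching to another cookie on failure."""
    for _ in range(max_attempts):
        cookie = get_cookie()  # obtain a (possibly new) cookie
        page = fetch(list_page_link, cookie)
        if page is not None:  # the list page was obtained, so the cookie works
            return page
        # Otherwise the cookie is treated as invalid and another one is tried.
    return None
```

Bounding the number of attempts keeps a permanently failing account from triggering an endless series of logins.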

In any of the above technical solutions, preferably, the method further includes: parsing the content page to obtain the content in the content page, and converting the content in the content page into data in a target format.

In this technical solution, each item of content in the content page is extracted and converted into data in a unified target format for storage, for example plain text data in TXT or WORD format, which makes it convenient to manage the content of the downloaded content pages in a unified way.
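A conversion to plain text might look like the sketch below. It assumes, purely for illustration, that the content page is a JSON object with a "title" field and a "body_html" field holding HTML markup; these field names, the helper names, and the choice of TXT as the target format are hypothetical.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def content_page_to_text(content_page_text: str) -> str:
    """Convert an assumed JSON content page into plain text, one item per line."""
    page = json.loads(content_page_text)
    extractor = _TextExtractor()
    extractor.feed(page.get("body_html", ""))
    extractor.close()
    lines = [page.get("title", ""), *extractor.chunks]
    return "\n".join(line for line in lines if line)


def save_as_txt(text: str, path: Path) -> None:
    """Store the converted content in the unified TXT target format."""
    path.write_text(text, encoding="utf-8")
```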

In any of the above technical solutions, preferably, the method further includes: after logging in to the web version of the application, periodically refreshing the web page of the application.

In this technical solution, because an application that is not operated for a long time goes offline or is logged out, the web page of the application is refreshed periodically to keep the web version of the application online, which avoids having to log in again after the application goes offline.
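One way to realize the periodic refresh is a background thread that calls a refresh function at a fixed interval. The sketch below is an assumption about how this could be done, not the disclosed implementation; the class name PeriodicRefresher and the interval are hypothetical, and the refresh callable is supplied by the caller.

```python
import threading
from typing import Callable


class PeriodicRefresher:
    """Calls `refresh` every `interval_seconds` until stop() is called."""

    def __init__(self, refresh: Callable[[], None], interval_seconds: float):
        self._refresh = refresh
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self._interval):
            self._refresh()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

With a Selenium browser session, the refresh callable could be the driver's refresh method, for example PeriodicRefresher(driver.refresh, 300), where the 300-second interval is an arbitrary illustrative value.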

In any of the above technical solutions, preferably, the list page and the content page are data in JSON format.

In this technical solution, the obtained list page and content page are data in JSON format, so the at least one content page link in the list page and the content in the content page can be obtained by parsing the list page and the content page with a JSON library.

A second aspect of the present invention provides an information collection device, including: a first obtaining unit, configured to obtain, after logging in to the web version of an application, a cookie of the application and a list page link for the data published on the application using a target account; a second obtaining unit, configured to obtain a list page according to the cookie and the list page link, and to obtain at least one content page link in the list page; and a downloading unit, configured to download the content page corresponding to each content page link in the at least one content page link.

In the above technical solution, preferably, the device further includes: a third obtaining unit, configured to, if the list page is not obtained according to the cookie and the list page link, obtain another cookie of the application, so as to obtain the list page according to the other cookie and the list page link.

In any of the above technical solutions, preferably, the device further includes: a conversion unit, configured to parse the content page to obtain the content in the content page, and to convert the content in the content page into data in a target format.

In any of the above technical solutions, preferably, the device further includes: a refreshing unit, configured to periodically refresh the web page of the application after logging in to the web version of the application.

With the technical solution of the present invention, human behavior can be simulated to collect valuable data from the massive data in an application, thereby improving the efficiency of information collection.

Brief description of the drawings

FIG. 1 shows a schematic flowchart of an information collection method according to an embodiment of the present invention;

FIG. 2 shows a schematic flowchart of an information collection method according to another embodiment of the present invention;

FIG. 3 shows a schematic structural diagram of an information collection device according to an embodiment of the present invention;

FIG. 4 shows a schematic structural diagram of an information collection device according to another embodiment of the present invention.

Detailed description

In order that the above objects, features, and advantages of the present invention may be understood more clearly, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.

Many specific details are set forth in the following description to facilitate a full understanding of the present invention. However, the present invention may also be implemented in ways other than those described here, and therefore the scope of protection of the present invention is not limited by the specific embodiments disclosed below.

FIG. 1 shows a schematic flowchart of an information collection method according to an embodiment of the present invention.

As shown in FIG. 1, the information collection method according to an embodiment of the present invention includes:

Step 102: after logging in to the web version of the application, obtain the cookie of the application and the list page link for the data published on the application using the target account.

Step 104: obtain the list page according to the cookie and the list page link, and obtain at least one content page link in the list page.

In step 104, because the cookie of the application is valid only for a limited time, failure to obtain the list page according to the cookie and the list page link indicates that the cookie is invalid. Therefore, preferably, if the list page is not obtained according to the cookie and the list page link, another cookie of the application is obtained, so that the list page is obtained according to the other cookie and the list page link.

Step 106: download the content page corresponding to each content page link in the at least one content page link.

After step 106, the content page is parsed to obtain the content in the content page, and the content is converted into data in a target format. Each item of content in the content page is thereby stored in a unified format, for example as plain text data in TXT or WORD format, which makes it convenient to manage the content of the downloaded content pages in a unified way.

In addition, the list page and the content page in steps 104 and 106 are data in JSON format, so the at least one content page link in the list page and the content in the content page can be obtained by parsing the list page and the content page with a JSON library.

In this technical solution, after logging in to the web version of the application (the application may be WeChat, Weibo, or QQ), for example by using the selenium webdriver tool, the cookie of the application and the list page link for the data published on the application using the target account are obtained in order to obtain the list page. At least one content page link is then obtained from the list page, and finally the data published by the target account on the application can be obtained through the at least one content page link. In this way, human behavior is simulated to collect valuable data from the massive data in the application, which improves the efficiency of information collection.
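Steps 104 and 106 can be strung together as in the following sketch. It is an illustration under stated assumptions: the fetch callable, the in-memory FAKE_SITE that stands in for the real application, the JSON keys "articles" and "url", and the example.invalid URLs are all hypothetical. Step 102 (login and cookie acquisition) is represented only by the cookie string passed in.

```python
import json
from typing import Callable, Dict


def collect_content_pages(
    fetch: Callable[[str, str], str],
    cookie: str,
    list_page_link: str,
) -> Dict[str, str]:
    """Run steps 104 and 106 for one target account."""
    # Step 104: obtain the list page and the content page links inside it.
    list_page = json.loads(fetch(list_page_link, cookie))
    content_links = [entry["url"] for entry in list_page["articles"]]
    # Step 106: download the content page behind every link.
    return {link: fetch(link, cookie) for link in content_links}


# Hypothetical in-memory "site" standing in for the real application.
FAKE_SITE = {
    "https://example.invalid/list": json.dumps(
        {"articles": [{"url": "https://example.invalid/a/1"},
                      {"url": "https://example.invalid/a/2"}]}
    ),
    "https://example.invalid/a/1": json.dumps({"title": "First article"}),
    "https://example.invalid/a/2": json.dumps({"title": "Second article"}),
}


def fake_fetch(url: str, cookie: str) -> str:
    # A real fetch would send the cookie in the request headers.
    return FAKE_SITE[url]


pages = collect_content_pages(
    fake_fetch, "session_id=abc123", "https://example.invalid/list"
)
```

Passing the fetch function in as a parameter keeps the flow testable without network access; a real deployment would substitute an HTTP client call.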

In the above technical solution, preferably, the method further includes: after logging in to the web version of the application, periodically refreshing the web page of the application.

FIG. 2 shows a schematic flowchart of an information collection method according to another embodiment of the present invention.

As shown in FIG. 2, the information collection method according to another embodiment of the present invention includes:

Step 202: Wechat_server sends a cookie request to Cookie_server.

Step 204: Wechat_server obtains the WeChat cookie from Cookie_server, and obtains the list of list page links for the articles published by the target public account in WeChat.

Step 206: obtain the list page according to the cookie and the list of list page links.

Step 208: determine whether the obtained cookie is valid. If so, go to step 210; otherwise, return to step 202 to obtain another WeChat cookie. If the list page is obtained according to the cookie and the list of list page links, the cookie is determined to be valid; if the list page is not obtained, or other content is obtained instead, the cookie is determined to be invalid.

Step 210: parse the list page to obtain the multiple content page links in the list page.

Step 212: download the content pages corresponding to the multiple content page links.

Step 214: parse the content pages and persist the information. Specifically, each content page is parsed to obtain its content, and the content is converted into usable, formatted plain text data.
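The validity judgment of step 208 can be sketched as a check on what came back from the list page request. The sketch assumes that a genuine list page is a JSON object containing an "articles" list, and that anything else (no response, a login page, an error object) counts as "other content"; the key name and the function name is_cookie_valid are hypothetical.

```python
import json
from typing import Optional


def is_cookie_valid(response_text: Optional[str]) -> bool:
    """Judge the cookie by whether the response is really a list page."""
    if not response_text:
        return False  # the list page was not obtained at all
    try:
        page = json.loads(response_text)
    except json.JSONDecodeError:
        return False  # other content came back, e.g. an HTML login page
    # Only a JSON object with the assumed "articles" list counts as a list page.
    return isinstance(page, dict) and isinstance(page.get("articles"), list)
```

When this check fails, the flow returns to step 202 and requests another cookie.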

FIG. 3 shows a schematic structural diagram of an information collection device according to an embodiment of the present invention.

As shown in FIG. 3, an information collection device 300 according to an embodiment of the present invention includes a first obtaining unit 302, a second obtaining unit 304, and a downloading unit 306. The first obtaining unit 302 is configured to obtain, after logging in to the web version of the application, the cookie of the application and the list page link for the data published on the application using the target account. The second obtaining unit 304 is configured to obtain the list page according to the cookie and the list page link, and to obtain at least one content page link in the list page. The downloading unit 306 is configured to download the content page corresponding to each content page link in the at least one content page link.

In the above technical solution, preferably, the device further includes: a third obtaining unit 308, configured to, if the list page is not obtained according to the cookie and the list page link, obtain another cookie of the application, so as to obtain the list page according to the other cookie and the list page link.

In any of the above technical solutions, preferably, the device further includes: a conversion unit 310, configured to parse the content page to obtain the content in the content page, and to convert the content in the content page into data in a target format.

In any of the above technical solutions, preferably, the device further includes: a refreshing unit 312, configured to periodically refresh the web page of the application after logging in to the web version of the application.

As shown in FIG. 4, an information collection device 400 according to another embodiment of the present invention includes Cookie_server 402 and Wechat_server 404. Cookie_server 402 accesses the address https://wx.qq.com/ through the selenium webdriver tool, simulates a login to the web version of WeChat, and obtains the WeChat cookie, thereby providing a cookie service for Wechat_server 404. When Wechat_server 404 requests a cookie from Cookie_server 402, Wechat_server 404 can obtain the WeChat cookie from Cookie_server 402. Wechat_server 404 then obtains the list page link for the articles published by the target account on WeChat, obtains the list page according to the obtained cookie and the list page link, parses the content page links in the list page, downloads the content pages corresponding to the content page links, and converts the content in the content pages into usable, formatted plain text data.
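The division of work between the two components can be sketched with two small classes. This is a simplified illustration, not the disclosed implementation: the class and method names are hypothetical, the login callable stands in for the simulated browser login, the fetch callable stands in for the HTTP request (returning None when the list page is not obtained), and caching a single cookie is an assumption.

```python
from typing import Callable, Optional


class CookieServer:
    """Performs the (simulated) login and serves the resulting cookie."""

    def __init__(self, login: Callable[[], str]):
        self._login = login
        self._cookie: Optional[str] = None

    def get_cookie(self, force_new: bool = False) -> str:
        # Log in only when no cookie is cached or a new one is demanded.
        if force_new or self._cookie is None:
            self._cookie = self._login()
        return self._cookie


class WechatServer:
    """Requests cookies from the cookie server and fetches list pages."""

    def __init__(self, cookie_server: CookieServer,
                 fetch: Callable[[str, str], Optional[str]]):
        self._cookie_server = cookie_server
        self._fetch = fetch

    def get_list_page(self, list_page_link: str) -> Optional[str]:
        cookie = self._cookie_server.get_cookie()
        page = self._fetch(list_page_link, cookie)
        if page is None:
            # The cookie appears invalid: ask the cookie server for another one.
            cookie = self._cookie_server.get_cookie(force_new=True)
            page = self._fetch(list_page_link, cookie)
        return page
```

Separating the login logic from the download logic means the slower browser-driven login runs only when a cookie is missing or rejected.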

The technical solution of the present invention has been described in detail above with reference to the accompanying drawings. With the technical solution of the present invention, human behavior can be simulated to collect valuable data from the massive data in an application, thereby improving the efficiency of information collection.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. An information collection method, characterized by comprising:

after logging in to the web version of an application, obtaining a cookie of the application and a list page link for the data published on the application using a target account;

obtaining a list page according to the cookie and the list page link, and obtaining at least one content page link in the list page; and

downloading the content page corresponding to each content page link in the at least one content page link.

2. The information collection method according to claim 1, wherein the step of obtaining the list page according to the cookie and the list page link further comprises:

if the list page is not obtained according to the cookie and the list page link, obtaining another cookie of the application, so as to obtain the list page according to the other cookie and the list page link.

3. The information collection method according to claim 1, further comprising:

The content page is parsed to obtain the content in the content page, and the content in the content page is converted into data in a target format.

4. The information collection method according to any one of claims 1 to 3, further comprising:

After logging into the web version of the application, the web page of the application is periodically refreshed.

5. The information collection method according to any one of claims 1 to 3, wherein the list page and the content page are data in JSON format.

6. An information collection device, characterized in that it comprises:

a first obtaining unit, configured to obtain, after logging in to the web version of an application, a cookie of the application and a list page link for the data published on the application using a target account;

A second obtaining unit, configured to obtain a list page according to the cookie and the list page link, and obtain at least one content page link in the list page;

A downloading unit, configured to download a content page corresponding to each content page link in the at least one content page link.

7. The information collection device according to claim 6, further comprising:

a third obtaining unit, configured to, if the list page is not obtained according to the cookie and the list page link, obtain another cookie of the application, so as to obtain the list page according to the other cookie and the list page link.

8. The information collection device according to claim 6, further comprising:

The conversion unit is configured to parse the content page to obtain the content in the content page, and convert the content in the content page into data in a target format.

9. The information collection device according to any one of claims 6 to 8, further comprising:

A refreshing unit, configured to periodically refresh the webpage of the application after logging into the webpage version of the application.

10. The information collection device according to any one of claims 6 to 8, wherein the list page and the content page are data in JSON format.