CN107644021A - Information collecting method and information collecting device - Google Patents
Information collecting method and information collecting device
- Publication number: CN107644021A
- Application number: CN201610575716.3A
- Authority: CN (China)
- Prior art keywords: page, content, application, list page, link
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The present invention provides an information collection method and an information collection device. The information collection method includes: after logging in to the web version of an application, obtaining a cookie of the application and a list page link for the data published on the application by a target account; obtaining a list page according to the cookie and the list page link, and obtaining at least one content page link from the list page; and downloading the content page corresponding to each of the at least one content page link. With this technical solution, human behavior can be simulated to collect valuable data from the massive data of an application, which improves the efficiency of information collection.
Description
Technical Field
The present invention relates to the technical field of information processing, and in particular to an information collection method and an information collection device.
Background
At present, WeChat has more than 10 million public accounts and hundreds of millions of articles, and the number of articles grows by millions every day. Because the articles published by public accounts have relatively high data value, collecting WeChat public account articles has become an indispensable part of massive data collection.
Collecting the articles of a WeChat public account means obtaining, in real time, the articles that the account publishes. Compared with other collection tasks, WeChat is tied to terminal devices such as mobile phones and tablets, so its collection approach is unusual: human behavior must be simulated, and collectors are subject to very strict bans.
Therefore, how to simulate human behavior to collect valuable data from the massive data of WeChat, and thereby improve the efficiency of information collection, has become an urgent problem to be solved.
Summary of the Invention
Based on the above problem, the present invention proposes a new technical solution that can simulate human behavior to collect valuable data from the massive data of an application, thereby improving the efficiency of information collection.
In view of this, a first aspect of the present invention provides an information collection method, comprising: after logging in to the web version of an application, obtaining a cookie of the application and a list page link for the data published on the application by a target account; obtaining a list page according to the cookie and the list page link, and obtaining at least one content page link from the list page; and downloading the content page corresponding to each of the at least one content page link.
In this technical solution, after logging in to the web version of the application (for example, by using the selenium webdriver tool), the cookie of the application and the list page link for the data published by the target account are obtained and used to fetch the list page. At least one content page link is then extracted from the list page, and the data published by the target account on the application is retrieved through those links. Human behavior is thereby simulated to collect valuable data from the application's massive data, which improves the efficiency of information collection.
For example, after logging in to the web version of WeChat with selenium webdriver, the WeChat cookie and the list page link for the articles published by the target public account "Beijing" are obtained, and the list page is fetched according to the cookie and the list page link. The list page contains the article titles "Beijing 5-Day Travel Guide", "Top 10 Must-Visit Tourist Attractions in Beijing", and "Beijing Food Collection". The list page is parsed to obtain its content page links, that is, the links to these three articles, and the content of the articles is then retrieved through those links.
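The three steps of the first aspect can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: `collect`, `http_fetch`, `fake_fetch`, the `example.invalid` URLs, and the one-link-per-line list page layout are all assumptions introduced here so that the sketch runs offline.

```python
import urllib.request
from typing import Callable, Dict, List

# (url, cookie) -> page body
Fetch = Callable[[str, str], str]


def http_fetch(url: str, cookie: str) -> str:
    """One possible real fetcher: send the cookie as a Cookie request header."""
    request = urllib.request.Request(url, headers={"Cookie": cookie})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def collect(cookie: str, list_url: str, fetch: Fetch,
            extract_links: Callable[[str], List[str]]) -> Dict[str, str]:
    """Fetch the list page, extract content page links, download each page."""
    list_page = fetch(list_url, cookie)
    links = extract_links(list_page)
    return {link: fetch(link, cookie) for link in links}


# Hypothetical in-memory "site" standing in for the application.
FAKE_SITE = {
    "https://example.invalid/list":
        "https://example.invalid/a1\nhttps://example.invalid/a2",
    "https://example.invalid/a1": "Beijing 5-Day Travel Guide",
    "https://example.invalid/a2": "Beijing Food Collection",
}


def fake_fetch(url: str, cookie: str) -> str:
    """Offline stand-in for http_fetch; rejects anything but the test cookie."""
    if cookie != "valid":
        raise PermissionError("cookie rejected")
    return FAKE_SITE[url]


# In this toy layout the list page holds one link per line.
pages = collect("valid", "https://example.invalid/list",
                fake_fetch, str.splitlines)
```

Passing the fetcher in as a parameter keeps the collection logic independent of the HTTP client, so `http_fetch` could be swapped in for `fake_fetch` without changing `collect`.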
In the above technical solution, preferably, the step of obtaining the list page according to the cookie and the list page link further includes: if the list page is not obtained according to the cookie and the list page link, obtaining another cookie of the application, so that the list page is obtained according to that cookie and the list page link.
In this technical solution, the cookie of the application is valid only for a limited time. If the list page cannot be obtained according to the cookie and the list page link, the cookie is invalid; another cookie is then obtained, and the list page is obtained according to that cookie and the list page link.
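The fallback to another cookie might look like the sketch below. The function name `fetch_list_page`, the convention that a failed fetch returns `None`, and the sample cookie strings are assumptions made for illustration only.

```python
from typing import Callable, Iterable, Optional, Tuple


def fetch_list_page(list_url: str, cookies: Iterable[str],
                    try_fetch: Callable[[str, str], Optional[str]]
                    ) -> Tuple[str, str]:
    """Try each cookie in turn; return (cookie, page) for the first success."""
    for cookie in cookies:
        page = try_fetch(list_url, cookie)
        if page is not None:      # the list page was obtained: cookie is valid
            return cookie, page
        # Otherwise the cookie is treated as expired and the next one is tried.
    raise RuntimeError("no valid cookie available")


def fake_try_fetch(url: str, cookie: str) -> Optional[str]:
    """Hypothetical fetcher: only the cookie 'fresh' is still accepted."""
    return "LIST PAGE" if cookie == "fresh" else None


result = fetch_list_page("https://example.invalid/list",
                         ["expired", "fresh"], fake_try_fetch)
```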
In any of the above technical solutions, preferably, the method further includes: parsing the content page to obtain the content in the content page, and converting that content into data in a target format.
In this technical solution, the items of content in the content page are extracted and converted into data in a unified target format for storage, for example plain-text data in TXT or Word format, which makes it convenient to manage the content of the downloaded content pages uniformly.
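As one illustration of converting extracted content into a unified plain-text format, the sketch below strips markup from an article body with Python's standard `html.parser` module. It assumes the body arrives as an HTML fragment; the patent does not specify this, and `to_plain_text` is a hypothetical helper.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.parts.append(text)


def to_plain_text(title: str, html_body: str) -> str:
    """Return 'title, blank line, one text block per line' as plain text."""
    parser = _TextExtractor()
    parser.feed(html_body)
    parser.close()
    return title + "\n\n" + "\n".join(parser.parts) + "\n"


text = to_plain_text("Beijing Food Collection",
                     "<p>Roast duck</p><p>Noodles &amp; dumplings</p>")
```

The resulting string could then be written to a `.txt` file with an ordinary `open(path, "w", encoding="utf-8")` call.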
In any of the above technical solutions, preferably, the method further includes: after logging in to the web version of the application, periodically refreshing the web page of the application.
In this technical solution, an application that is not operated for a long time goes offline or is logged out. Periodically refreshing the web page of the application keeps the web version of the application online and avoids having to log in again after it goes offline.
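A periodic refresh can be sketched as a small loop around the browser driver's `refresh()` method (Selenium's WebDriver exposes a method of that name). The `keep_alive` helper, the interval value, and the `FakeDriver` test double are assumptions; the source does not state a refresh period.

```python
import threading


def keep_alive(driver, interval_s: float, stop: threading.Event) -> int:
    """Call driver.refresh() every interval_s seconds until stop is set.

    Returns the number of refreshes performed.
    """
    count = 0
    # Event.wait returns False on timeout, True once the event is set.
    while not stop.wait(interval_s):
        driver.refresh()
        count += 1
    return count


class FakeDriver:
    """Test double: counts refreshes and stops the loop after `limit`."""

    def __init__(self, stop: threading.Event, limit: int) -> None:
        self.stop = stop
        self.limit = limit
        self.refreshes = 0

    def refresh(self) -> None:
        self.refreshes += 1
        if self.refreshes >= self.limit:
            self.stop.set()


stop = threading.Event()
driver = FakeDriver(stop, limit=3)
refresh_count = keep_alive(driver, 0.01, stop)
```

In practice the loop would typically run in a background thread (`threading.Thread(target=keep_alive, ...)`) with an interval of minutes rather than milliseconds.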
In any of the above technical solutions, preferably, the list page and the content page are data in JSON format.
In this technical solution, the obtained list page and content page are data in JSON format, so parsing them with a JSON library yields the at least one content page link in the list page and the content in the content page.
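Parsing a JSON list page with a JSON library might look like the following. The field names `articles`, `title`, and `url` are hypothetical; the patent does not describe the actual JSON schema returned by the application.

```python
import json
from typing import List


def extract_content_links(list_page_json: str) -> List[str]:
    """Return the content page links found in a JSON list page.

    Assumes a layout like {"articles": [{"title": ..., "url": ...}, ...]}.
    """
    data = json.loads(list_page_json)
    return [item["url"] for item in data.get("articles", []) if "url" in item]


sample_list_page = json.dumps({
    "articles": [
        {"title": "Beijing 5-Day Travel Guide",
         "url": "https://example.invalid/a1"},
        {"title": "Beijing Food Collection",
         "url": "https://example.invalid/a2"},
        {"title": "Entry without a link"},
    ]
})

links = extract_content_links(sample_list_page)
```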
A second aspect of the present invention provides an information collection device, comprising: a first obtaining unit configured to, after logging in to the web version of an application, obtain a cookie of the application and a list page link for the data published on the application by a target account; a second obtaining unit configured to obtain a list page according to the cookie and the list page link, and to obtain at least one content page link from the list page; and a download unit configured to download the content page corresponding to each of the at least one content page link.
In this technical solution, after logging in to the web version of the application (for example, by using the selenium webdriver tool), the cookie of the application and the list page link for the data published by the target account are obtained and used to fetch the list page. At least one content page link is then extracted from the list page, and the data published by the target account on the application is retrieved through those links. Human behavior is thereby simulated to collect valuable data from the application's massive data, which improves the efficiency of information collection.
For example, after logging in to the web version of WeChat with selenium webdriver, the WeChat cookie and the list page link for the articles published by the target public account "Beijing" are obtained, and the list page is fetched according to the cookie and the list page link. The list page contains the article titles "Beijing 5-Day Travel Guide", "Top 10 Must-Visit Tourist Attractions in Beijing", and "Beijing Food Collection". The list page is parsed to obtain its content page links, that is, the links to these three articles, and the content of the articles is then retrieved through those links.
In the above technical solution, preferably, the device further includes a third obtaining unit configured to, if the list page is not obtained according to the cookie and the list page link, obtain another cookie of the application, so that the list page is obtained according to that cookie and the list page link.
In this technical solution, the cookie of the application is valid only for a limited time. If the list page cannot be obtained according to the cookie and the list page link, the cookie is invalid; another cookie is then obtained, and the list page is obtained according to that cookie and the list page link.
In any of the above technical solutions, preferably, the device further includes a conversion unit configured to parse the content page to obtain the content in the content page, and to convert that content into data in a target format.
In this technical solution, the items of content in the content page are extracted and converted into data in a unified target format for storage, for example plain-text data in TXT or Word format, which makes it convenient to manage the content of the downloaded content pages uniformly.
In any of the above technical solutions, preferably, the device further includes a refresh unit configured to periodically refresh the web page of the application after logging in to the web version of the application.
In this technical solution, an application that is not operated for a long time goes offline or is logged out. Periodically refreshing the web page of the application keeps the web version of the application online and avoids having to log in again after it goes offline.
In any of the above technical solutions, preferably, the list page and the content page are data in JSON format.
In this technical solution, the obtained list page and content page are data in JSON format, so parsing them with a JSON library yields the at least one content page link in the list page and the content in the content page.
With the technical solution of the present invention, human behavior can be simulated to collect valuable data from the massive data of an application, which improves the efficiency of information collection.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an information collection method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of an information collection method according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an information collection device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an information collection device according to another embodiment of the present invention.
Detailed Description
To make the above objects, features, and advantages of the present invention easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another.
Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention may also be implemented in ways other than those described here, so its scope of protection is not limited by the specific embodiments disclosed below.
FIG. 1 is a schematic flowchart of an information collection method according to an embodiment of the present invention.
As shown in FIG. 1, the information collection method according to an embodiment of the present invention includes:
Step 102: after logging in to the web version of an application, obtain a cookie of the application and a list page link for the data published on the application by a target account.
Step 104: obtain a list page according to the cookie and the list page link, and obtain at least one content page link from the list page.
In step 104, because the cookie of the application is valid only for a limited time, a failure to obtain the list page according to the cookie and the list page link indicates that the cookie is invalid. Preferably, therefore, if the list page is not obtained according to the cookie and the list page link, another cookie of the application is obtained, and the list page is obtained according to that cookie and the list page link.
Step 106: download the content page corresponding to each of the at least one content page link.
After step 106, the content page is parsed to obtain its content, and that content is converted into data in a target format. The items of content in the content page are thus saved in a unified format, for example plain-text data in TXT or Word format, which makes it convenient to manage the content of the downloaded content pages uniformly.
In addition, the list page and the content page in steps 104 and 106 are data in JSON format, so parsing them with a JSON library yields the at least one content page link in the list page and the content in the content page.
In this technical solution, after logging in to the web version of the application (the application may be WeChat, Weibo, or QQ), for example by using the selenium webdriver tool, the cookie of the application and the list page link for the data published by the target account are obtained and used to fetch the list page. At least one content page link is then extracted from the list page, and the data published by the target account on the application is retrieved through those links. Human behavior is thereby simulated to collect valuable data from the application's massive data, which improves the efficiency of information collection.
For example, after logging in to the web version of WeChat with selenium webdriver, the WeChat cookie and the list page link for the articles published by the target public account "Beijing" are obtained, and the list page is fetched according to the cookie and the list page link. The list page contains the article titles "Beijing 5-Day Travel Guide", "Top 10 Must-Visit Tourist Attractions in Beijing", and "Beijing Food Collection". The list page is parsed to obtain its content page links, that is, the links to these three articles, and the content of the articles is then retrieved through those links.
In the above technical solution, preferably, the method further includes: after logging in to the web version of the application, periodically refreshing the web page of the application.
In this technical solution, an application that is not operated for a long time goes offline or is logged out. Periodically refreshing the web page of the application keeps the web version of the application online and avoids having to log in again after it goes offline.
FIG. 2 is a schematic flowchart of an information collection method according to another embodiment of the present invention.
As shown in FIG. 2, the information collection method according to another embodiment of the present invention includes:
Step 202: Wechat_server sends a cookie request to Cookie_server.
Step 204: Wechat_server obtains the WeChat cookie from Cookie_server, and obtains the list page link for the articles published by the target official account in WeChat.
Step 206: the list page is fetched according to the cookie and the list page link.
Step 208: determine whether the obtained cookie is valid. If it is, go to step 210; otherwise, return to step 202 to obtain another WeChat cookie. The cookie is judged valid if the list page is obtained according to the cookie and the list page link, and invalid if the list page is not obtained or some other content is returned instead.
Step 210: parse the list page to obtain the multiple content page links it contains.
Step 212: download the content pages corresponding to the content page links.
Step 214: parse the content pages and persist the information. Specifically, the content of each content page is extracted by parsing and converted into usable, formatted plain-text data.
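The control flow of steps 202 to 214, including the loop back to step 202 when the cookie is judged invalid, can be sketched as below. Everything concrete here is assumed for illustration: the `articles`/`url` JSON fields, the rule that a non-JSON response (such as a login page) marks the cookie invalid, the `max_attempts` limit, and the reduction of step 214 to whitespace trimming.

```python
import json
from typing import Callable, List, Optional


def looks_like_list_page(body: Optional[str]) -> bool:
    """Step 208: the cookie is valid only if a real list page came back."""
    if not body:
        return False
    try:
        data = json.loads(body)
    except ValueError:          # some other content, e.g. a login page
        return False
    return isinstance(data, dict) and "articles" in data


def run_once(request_cookie: Callable[[], str],
             fetch: Callable[[str, str], str],
             list_url: str, max_attempts: int = 3) -> List[str]:
    for _ in range(max_attempts):
        cookie = request_cookie()                        # steps 202 and 204
        body = fetch(list_url, cookie)                   # step 206
        if not looks_like_list_page(body):               # step 208
            continue                                     # back to step 202
        links = [a["url"] for a in json.loads(body)["articles"]]  # step 210
        pages = [fetch(link, cookie) for link in links]           # step 212
        return [page.strip() for page in pages]          # step 214, simplified
    raise RuntimeError("no valid cookie after %d attempts" % max_attempts)


# Hypothetical stand-ins for Cookie_server and for the application.
cookie_pool = iter(["stale", "fresh"])
SITE = {
    "list": json.dumps({"articles": [{"url": "a1"}, {"url": "a2"}]}),
    "a1": "  Beijing 5-Day Travel Guide  ",
    "a2": "  Beijing Food Collection  ",
}


def fake_fetch(url: str, cookie: str) -> str:
    return SITE[url] if cookie == "fresh" else "<html>login</html>"


texts = run_once(lambda: next(cookie_pool), fake_fetch, "list")
```

Bounding the retries is a design choice added in this sketch; the flowchart as described simply returns to step 202.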
FIG. 3 is a schematic structural diagram of an information collection device according to an embodiment of the present invention.
As shown in FIG. 3, an information collection device 300 according to an embodiment of the present invention includes a first obtaining unit 302, a second obtaining unit 304, and a download unit 306. The first obtaining unit 302 is configured to, after logging in to the web version of an application, obtain a cookie of the application and a list page link for the data published on the application by a target account. The second obtaining unit 304 is configured to obtain a list page according to the cookie and the list page link, and to obtain at least one content page link from the list page. The download unit 306 is configured to download the content page corresponding to each of the at least one content page link.
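One way to picture the three units of device 300 in software is as three pluggable callables composed by a single object. This is only an illustrative mapping: the class name `InformationCollector`, the field names, and the lambda stand-ins are hypothetical, and the patent does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class InformationCollector:
    # Unit 302: returns (cookie, list page link) after login.
    first_obtain: Callable[[], Tuple[str, str]]
    # Unit 304: (cookie, list page link) -> content page links.
    second_obtain: Callable[[str, str], List[str]]
    # Unit 306: (cookie, content page link) -> content page.
    download: Callable[[str, str], str]

    def run(self) -> List[str]:
        cookie, list_url = self.first_obtain()
        links = self.second_obtain(cookie, list_url)
        return [self.download(cookie, link) for link in links]


collector = InformationCollector(
    first_obtain=lambda: ("cookie-1", "https://example.invalid/list"),
    second_obtain=lambda cookie, url: [url + "/a1", url + "/a2"],
    download=lambda cookie, link: "content of " + link,
)
downloaded = collector.run()
```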
In this technical solution, after logging in to the web version of the application (for example, by using the selenium webdriver tool), the cookie of the application and the list page link for the data published by the target account are obtained and used to fetch the list page. At least one content page link is then extracted from the list page, and the data published by the target account on the application is retrieved through those links. Human behavior is thereby simulated to collect valuable data from the application's massive data, which improves the efficiency of information collection.
For example, after logging in to the web version of WeChat with selenium webdriver, the WeChat cookie and the list page link for the articles published by the target public account "Beijing" are obtained, and the list page is fetched according to the cookie and the list page link. The list page contains the article titles "Beijing 5-Day Travel Guide", "Top 10 Must-Visit Tourist Attractions in Beijing", and "Beijing Food Collection". The list page is parsed to obtain its content page links, that is, the links to these three articles, and the content of the articles is then retrieved through those links.
In the above technical solution, preferably, the device further includes a third obtaining unit 308 configured to, if the list page is not obtained according to the cookie and the list page link, obtain another cookie of the application, so that the list page is obtained according to that cookie and the list page link.
In this technical solution, the cookie of the application is valid only for a limited time. If the list page cannot be obtained according to the cookie and the list page link, the cookie is invalid; another cookie is then obtained, and the list page is obtained according to that cookie and the list page link.
In any of the above technical solutions, preferably, the device further includes a conversion unit 310 configured to parse the content page to obtain the content in the content page, and to convert that content into data in a target format.
In this technical solution, the items of content in the content page are extracted and converted into data in a unified target format for storage, for example plain-text data in TXT or Word format, which makes it convenient to manage the content of the downloaded content pages uniformly.
In any of the above technical solutions, preferably, the device further includes a refresh unit 312 configured to periodically refresh the web page of the application after logging in to the web version of the application.
In this technical solution, an application that is not operated for a long time goes offline or is logged out. Periodically refreshing the web page of the application keeps the web version of the application online and avoids having to log in again after it goes offline.
In any of the above technical solutions, preferably, the list page and the content page are data in JSON format.
In this technical solution, the obtained list page and content page are data in JSON format, so parsing them with a JSON library yields the at least one content page link in the list page and the content in the content page.
FIG. 4 is a schematic structural diagram of an information collection device according to another embodiment of the present invention.
As shown in FIG. 4, an information collection device 400 according to another embodiment of the present invention includes Cookie_server 402 and Wechat_server 404. Cookie_server 402 uses the selenium webdriver tool to access https://wx.qq.com/, simulates logging in to the web version of WeChat, and obtains the WeChat cookie, thereby providing a cookie service to Wechat_server 404. When Wechat_server 404 requests a cookie, it obtains the WeChat cookie from Cookie_server 402. Wechat_server 404 then obtains the list page link for the articles published by the target account on WeChat, fetches the list page according to the cookie and the list page link, parses the content page links out of the list page, downloads the corresponding content pages, and converts their content into usable, formatted plain-text data.
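The role of Cookie_server 402 can be sketched as a thin wrapper around a browser driver. The sketch relies only on the `get(url)` and `get_cookies()` methods that Selenium's WebDriver provides, the latter returning a list of dicts with `name` and `value` keys. The `CookieServer` class, the `FakeDriver` test double, and the cookie names `sid` and `uin` are assumptions, and the login interaction itself is left out.

```python
from typing import Dict, List


class CookieServer:
    """Holds a logged-in browser session and hands out its cookies."""

    def __init__(self, driver, login_url: str = "https://wx.qq.com/") -> None:
        self.driver = driver
        self.login_url = login_url
        self.logged_in = False

    def login(self) -> None:
        # Open the web version; completing the login is outside this sketch.
        self.driver.get(self.login_url)
        self.logged_in = True

    def get_cookie(self) -> str:
        """Return the session cookies as a single Cookie header value."""
        if not self.logged_in:
            self.login()
        return "; ".join("%s=%s" % (c["name"], c["value"])
                         for c in self.driver.get_cookies())


class FakeDriver:
    """Test double exposing the two WebDriver methods used above."""

    def __init__(self) -> None:
        self.visited: List[str] = []

    def get(self, url: str) -> None:
        self.visited.append(url)

    def get_cookies(self) -> List[Dict[str, str]]:
        return [{"name": "sid", "value": "abc"},
                {"name": "uin", "value": "123"}]


fake = FakeDriver()
server = CookieServer(fake)
cookie_header = server.get_cookie()
```

With Selenium installed, a real driver such as `selenium.webdriver.Chrome()` could be passed in place of `FakeDriver()`; a component playing the Wechat_server role would then call `get_cookie()` whenever it needs a fresh cookie.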
The technical solution of the present invention has been described in detail above with reference to the accompanying drawings. With this technical solution, human behavior can be simulated to collect valuable data from the massive data of an application, which improves the efficiency of information collection.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610575716.3A CN107644021A (en) | 2016-07-20 | 2016-07-20 | Information collecting method and information collecting device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610575716.3A CN107644021A (en) | 2016-07-20 | 2016-07-20 | Information collecting method and information collecting device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107644021A (en) | 2018-01-30 |
Family
ID=61108987
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610575716.3A (Pending), published as CN107644021A (en) | Information collecting method and information collecting device | 2016-07-20 | 2016-07-20 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107644021A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104954234A (en) * | 2015-05-19 | 2015-09-30 | 中国地质大学(北京) | Microblog data acquisition method, microblog data acquisition device and public opinion analysis method |
CN105162676A (en) * | 2015-04-03 | 2015-12-16 | 中国科学院信息工程研究所 | Method and system for acquiring WeChat data |
CN105577528A (en) * | 2015-12-31 | 2016-05-11 | 深圳中泓在线股份有限公司 | Wechat official account data collection method and device based on virtual machine |
CN105631030A (en) * | 2015-12-30 | 2016-06-01 | 福建亿榕信息技术有限公司 | Universal web crawler login simulation method and system |
- 2016-07-20: CN application CN201610575716.3A, published as CN107644021A (status: Pending)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105162676A (en) * | 2015-04-03 | 2015-12-16 | 中国科学院信息工程研究所 | Method and system for acquiring WeChat data |
CN104954234A (en) * | 2015-05-19 | 2015-09-30 | 中国地质大学(北京) | Microblog data acquisition method, microblog data acquisition device and public opinion analysis method |
CN105631030A (en) * | 2015-12-30 | 2016-06-01 | 福建亿榕信息技术有限公司 | Universal web crawler login simulation method and system |
CN105577528A (en) * | 2015-12-31 | 2016-05-11 | 深圳中泓在线股份有限公司 | Wechat official account data collection method and device based on virtual machine |
Non-Patent Citations (3)
Title |
---|
EASTMOUNT: ""[python爬虫] Selenium爬取新浪微博内容及用户信息"", 《HTTPS://BLOG.CSDN.NET/EASTMOUNT/ARTICLE/DETAILS/50720436》 * |
EASTMOUNT: ""[Python爬虫] Selenium爬取新浪微博客户端用户信息、热点话题及评论(上)"", 《HTTPS://BLOG.CSDN.NET/EASTMOUNT/ARTICLE/DETAILS/51231852》 * |
高可用架构: ""为何大量网站不能抓取?爬虫突破封禁的6种常见方法【岂安低调分享】"", 《HTTP://BIGSEC.COM/BIGSEC-NEWS/WECHAT-2016-WEB-CRAWLER》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104063460B (en) | Method and apparatus for loading a webpage in a browser | |
CN103124263B (en) | Advertisement push system, advertisement pushing device, and advertisement server | |
US8825749B2 (en) | Method of tracking offline user interaction in a rendered document on a mobile device | |
CN100462964C (en) | A method for updating and displaying web page data | |
CN103428042B (en) | Method and system for stress testing a server | |
US10025599B1 (en) | Connectivity as a service | |
CN107145556B (en) | Universal distributed acquisition system | |
CN108494860B (en) | WEB access system, WEB access method and device for client | |
CN101178717A (en) | Method for adaptation processing web page and web page adaptation device | |
CN103246963B (en) | Based on the staffs training system of Internet of Things | |
CN111010364B (en) | System for offline object-based storage and simulation of REST responses | |
CN103347092A (en) | Method and device for recognizing cacheable file | |
US9521034B2 (en) | Method and apparatus for generating resource address, and system thereof | |
WO2016173185A1 (en) | Information pushing method and apparatus | |
CN105721578A (en) | User behavior data collection method and system | |
CN110929183A (en) | Data processing method, device and machine readable medium | |
US20140006918A1 (en) | Method and system for web page rearrangement | |
CN104320488A (en) | Proxy server system and proxy service method | |
US9336316B2 (en) | Image URL-based junk detection | |
CN103513986B (en) | Method for implementing a dynamic web server using CGI technology on a device without an operating system | |
CN102999424B (en) | Parallel remote automated testing method | |
CN101895550B (en) | Cache accelerating method for compatibility of dynamic and static contents of internet website | |
CN103605770A (en) | Method and server for generating web page templates | |
CN103853845A (en) | Dynamic analytic method of complex form | |
CN106897313B (en) | Mass user service preference evaluation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180130 |