CN113779357A - Information subscription method based on intelligent crawler

Info

Publication number: CN113779357A
Application number: CN202111074611.7A
Authority: CN
Inventors: 也鹏; 董佳霖; 刘佳浩; 郑羽辰; 陈楚翘; 王锐璇
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2021-12-10

Abstract

The invention belongs to the technical fields of computer web technology and information capture, and in particular relates to an information subscription method based on an intelligent crawler. The method comprises the following steps: step 1, acquiring the website that a user wishes to monitor and the keywords specified by the user, and displaying the target web page; step 2, acquiring the user's monitoring settings, including the monitored web page region, the time interval for checking the web page, and the notification mode to use once the user's keywords are detected; step 3, periodically crawling the target web page with a crawler; and step 4, after each crawl of the target web page, checking whether the user-specified keywords appear in the region selected by the user; if they do not appear, waiting for the user-specified time interval and repeating step 3; if they do appear, sending a notification to the user using the notification mode selected by the user.

Description

Information subscription method based on intelligent crawler

Technical Field

The invention belongs to the technical fields of computer web technology and information capture, and in particular relates to an information subscription method based on an intelligent crawler.

Background

As more and more information is brought into people's lives, it becomes more difficult to obtain accurate and effective information in time. In this era of "information explosion", it is becoming increasingly difficult to accurately obtain a variety of information such as education, shopping, news, etc. that are of interest to individuals.

Although search engines have grown in number and capability over time, they cannot keep pace with the rapid surge of information: a search engine cannot monitor whether a particular piece of information has appeared and promptly notify the user when it does. With data volumes growing exponentially, a user who wants timely information has to reopen a website at intervals and check whether the relevant information has appeared. This is time-consuming and laborious, reduces the user's efficiency in work and study, and can cause irreversible loss when something is missed.

At present, some applications that help users acquire information exist in China and abroad, but each falls short of users' needs in some respect. For example, the Chinese web page monitoring software OpenWebMonitor can help a user monitor changes at a given position on a web page, but it cannot check whether specified information appears on the page; it also has to be downloaded and installed, and it requires the host computer to remain powered on for monitoring, which is inconvenient. The foreign web page monitoring software URLy Warning can also help users learn what has changed on a web page, but its official website is hosted abroad and cannot be accessed from within China, so a legitimate copy is difficult to obtain. Page Monitor, a web monitor implemented as a Chrome browser plug-in, supports separate check intervals and customizable sound alerts, but being a plug-in limits its functionality to some extent.

Compared with the prior art described above, the intelligent-crawler-based information subscription system draws on a variety of web crawling techniques and is closely integrated with user operation. The system can check whether specific information appears in a web page. It is delivered as a web application, so no download or installation is needed and the user's host computer does not have to stay on for a monitoring task to proceed. As long as the task is set up correctly on the web page, the user receives a notification from the system when a specified keyword appears in the page, which improves the user's working efficiency and accuracy and saves the resources and cost of information collection.

Disclosure of Invention

In view of the above problems, the invention provides an information subscription method based on an intelligent crawler, which helps users acquire the information they care about more conveniently and reliably.

To achieve this purpose, the invention adopts the following technical scheme:

an information subscription method based on intelligent crawlers specifically comprises the following steps:

step 1, acquiring the website that a user wishes to monitor and the keywords specified by the user, and displaying the target web page;

step 2, acquiring the user's monitoring settings, including the monitored web page region, the time interval for checking the web page, and the notification mode to use once the user's keywords are detected;

step 3, periodically crawling the target web page with a crawler to obtain the web page html;

step 4, after each crawl of the target web page, checking whether the user-specified keywords appear in the region selected by the user; if they do not appear, waiting for the user-specified time interval and repeating step 3; if they do appear, sending a notification to the user using the notification mode selected by the user.

As a further refinement of the technical scheme, in step 1 the user input is obtained through a form, and the web page corresponding to the website entered by the user is embedded in the page using an iframe.

As a further refinement of the technical scheme, the monitored web page region in step 2 is obtained by crawling the html code of the website entered by the user with a crawler and adding mouse events to the original web page by means of js injection.

As a further refinement of the technical scheme, the mouse events specifically comprise mouse move-in, move-out, and click events. For a mouse move-in event, a green css border is added to the element. For a move-out event, it is judged whether the element is in the selected state; if not, the green border of the element is deleted, and if so, nothing is done. For a mouse click event, it is first judged whether the element has an id attribute in the original html; if so, the id is recorded, and otherwise a random id is generated for the element. It is then judged whether the element is in the selected state: if so, the element is set to the unselected state and its blue border is deleted; if not, the element is set to the selected state and a blue border is added. The selected element list of the web page is then updated dynamically using ajax to display the id of the element.
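The handlers described above run in the browser as injected js. Purely to make the intended state transitions explicit, they can be modelled in Python as follows. This is a sketch under assumptions: the `Element` class, the `borders` set, and the `sel-` id prefix are hypothetical stand-ins, and the click is read as a toggle (a selected element becomes unselected and loses its blue border, and vice versa).

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element in the injected page."""
    id: Optional[str] = None                          # id from the original html, if any
    selected: bool = False                            # toggled by clicks
    borders: Set[str] = field(default_factory=set)    # css borders currently applied


def on_mouse_in(el: Element) -> None:
    # Moving the mouse over an element adds the green border.
    el.borders.add("green")


def on_mouse_out(el: Element) -> None:
    # The green border is removed only if the element is not selected.
    if not el.selected:
        el.borders.discard("green")


def on_click(el: Element) -> str:
    # Reuse the id from the original html, or generate a random one.
    if el.id is None:
        el.id = "sel-" + uuid.uuid4().hex[:8]
    if el.selected:
        # Already selected: deselect and remove the blue border.
        el.selected = False
        el.borders.discard("blue")
    else:
        # Not selected: select and add the blue border.
        el.selected = True
        el.borders.add("blue")
    # The returned id is what the page would display in the selected element list.
    return el.id
```

In the real page, the id returned by the click handler would be sent to the server with ajax; that transport is outside this model.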

As a further refinement of the technical scheme, the js injection in step 2 is performed by adding the path of the injected js file at the end of the crawled html web page.

As a further refinement of the technical scheme, in step 2, after the user clicks to select the web page region, a form is used to obtain the time interval for checking the web page entered by the user and the notification mode to use once the user's keywords are detected.

As a further refinement of the technical scheme, in step 2, after the user selects the monitored web page region, a DOM tree is built from the web page html using python's lxml library, the DOM tree is traversed to find the nodes whose ids correspond to the region selected by the user, and the xpaths of these nodes are obtained and recorded, thereby obtaining the region selected by the user.

As a further refinement of the technical scheme, the periodic crawling of the web page html in step 3 is implemented with the scheduled task function provided by python's apscheduler package: for each web page monitoring task set by the user, the crawler module is called periodically to crawl the web page html.

As a further refinement of the technical scheme, step 4 specifically comprises:

step 4.1, building a DOM tree from the web page html obtained in step 3;

step 4.2, obtaining the node in the DOM tree corresponding to the xpath of the web page block selected by the user;

step 4.3, checking whether the keywords appear in the sub DOM tree rooted at that node;

step 4.4, if the keywords do not appear, returning normally;

step 4.5, if the keywords appear, calling the user notification module, sending a notification to the user in the notification mode selected by the user, and then setting the current monitoring task to the finished state;

step 4.6, if the node corresponding to the xpath is not found in the DOM tree, which indicates that the web page structure has changed and the task cannot continue, setting the task to the failed state and notifying the user.
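Steps 4.1 to 4.6 can be sketched with lxml as follows. This is a minimal illustration rather than the patented implementation: the names `CheckResult` and `check_keywords` are hypothetical, keyword matching is assumed to be a plain substring test on the text of the subtree, and only the first node matched by the xpath is inspected.

```python
from enum import Enum

from lxml import html as lxml_html


class CheckResult(Enum):
    NOT_FOUND = "keyword absent, keep monitoring"       # step 4.4
    FOUND = "keyword present, notify and finish"        # step 4.5
    FAILED = "xpath node missing, structure changed"    # step 4.6


def check_keywords(page_html: str, xpath: str, keywords: list) -> CheckResult:
    # Step 4.1: build a DOM tree from the crawled html.
    root = lxml_html.fromstring(page_html)
    # Step 4.2: look up the node for the recorded xpath.
    nodes = root.xpath(xpath)
    if not nodes:
        # Step 4.6: the recorded node no longer exists.
        return CheckResult.FAILED
    # Step 4.3: collect all text in the subtree rooted at that node.
    text = nodes[0].text_content()
    if any(keyword in text for keyword in keywords):
        return CheckResult.FOUND
    return CheckResult.NOT_FOUND
```

The caller would map `FOUND` to the notification module and the finished state, `FAILED` to the failure notice, and `NOT_FOUND` to simply waiting for the next scheduled run.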

As a further refinement of the technical scheme, the notification modes are: email, WeChat, and SMS.

Compared with the prior art, the beneficial effects of the technical scheme are as follows:

Based on the actual needs of users, the invention meets the requirement of monitoring and collecting internet information such as news and notices. The invention uses intelligent human-computer interaction technology and crawler technology to monitor a specific web page in the manner specified by the user, and can notify the user in several ways once the specified information has been collected.

Compared with some existing information subscription methods, the method is more user-friendly: the user can use the intelligent crawler to obtain web page information conveniently and promptly without any professional knowledge. Selecting a region is also very simple and requires only clicking the specific region in the web page.

Drawings

FIG. 1 is a flow chart of an information subscription method based on an intelligent crawler;

FIG. 2 shows the effect of injecting js into the original web page.

Detailed Description

To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.

Referring to fig. 1, a flowchart of an information subscription method based on an intelligent crawler is shown, where the method specifically includes the following steps:

Step 1, acquiring the website that the user wishes to monitor, and displaying the target web page.

First, on task setting page 1, an html form is used to obtain the web address entered by the user, and the form content is obtained through request.

Step 2, acquiring the user's monitoring settings, including the monitored web page region, the time interval for checking the web page, the keywords to detect, and the notification mode to use once the keywords are detected.

The time interval for checking the web page is obtained from the user in the same way as in step 1. Obtaining the monitored web page region is one of the key steps of the invention; the specific steps are as follows:

First, the html code of the target website is crawled using crawler technology, in the same way as in step 3. The code is then processed: a script file path is added at the end of the html code to perform js injection. When the page is initialized, the injected js file traverses tag elements of specific types, such as div and a. It first adds a css border style to these elements as a placeholder, to prevent layout shifts when borders are displayed later. It then adds the mouse events, namely mouse move-in, move-out, and click events. For a mouse move-in event, a green css border is added to the element. For a move-out event, it is judged whether the element is in the selected state (that is, whether it has been clicked an odd number of times); if not, the green border of the element is deleted, and if so, nothing is done. For a mouse click event, it is first judged whether the element has an id attribute in the original html; if so, the id is recorded, and otherwise a random id is generated for the element. It is then judged whether the element is in the selected state: if so, the element is set to the unselected state and its blue border is deleted; if not, the element is set to the selected state and a blue border is added. The selected element list of the web page is then updated dynamically using ajax to display the id of the element. The effect of injecting js into the user-specified web page is shown in FIG. 2.
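The server-side part of the js injection, adding a script path to the end of the crawled html, can be sketched as below. The function name `inject_script` and the path `./js/select.js` are hypothetical. Placing the tag just before the last closing body tag, with a fallback of appending to the very end, is an assumption about what "at the end of the html code" means in practice.

```python
import html
import re

# Matches a closing body tag in any letter case.
_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def inject_script(page_html: str, script_src: str) -> str:
    """Return page_html with a script tag for script_src added near the end."""
    tag = '<script src="{}"></script>'.format(html.escape(script_src, quote=True))
    matches = list(_BODY_CLOSE.finditer(page_html))
    if not matches:
        # No closing body tag: append the script tag at the very end.
        return page_html + tag
    # Insert the script tag immediately before the last closing body tag.
    pos = matches[-1].start()
    return page_html[:pos] + tag + page_html[pos:]


injected = inject_script("<html><body><p>hi</p></body></html>", "./js/select.js")
```

The injected copy is what would be cached on the server and shown to the user for region selection, while the periodic checks keep crawling the unmodified target page.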

After a user selects a target area, a DOM tree is established according to webpage html by using an lxml library of python, the DOM tree is traversed, nodes with id corresponding to the area selected by the user are found, xpaths of the nodes are obtained and recorded, and therefore the area selected by the user is obtained.
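A minimal sketch of this lookup with lxml is shown below. The function name `xpaths_for_ids` is hypothetical, and the sketch assumes that every selected id is present in the html being parsed; ids that are not found are simply omitted from the result.

```python
from lxml import html as lxml_html


def xpaths_for_ids(page_html: str, selected_ids: list) -> dict:
    """Map each selected element id to the xpath of its node in the DOM tree."""
    root = lxml_html.fromstring(page_html)
    tree = root.getroottree()
    wanted = set(selected_ids)
    found = {}
    # Traverse the DOM tree and record the xpath of each node with a wanted id.
    for node in root.iter():
        if not isinstance(node.tag, str):
            continue  # skip comments and processing instructions
        node_id = node.get("id")
        if node_id in wanted and node_id not in found:
            found[node_id] = tree.getpath(node)
    return found
```

Recording the xpath rather than the id lets later checks locate the region in freshly crawled html, where randomly generated ids would not exist.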

Step 3, crawling the target webpage regularly by using a crawler to obtain webpage html

The crawlers used in this method fall into two main categories. For simple, synchronously loaded web pages, the html can be obtained with python's requests library; for dynamically loaded web pages, selenium is used to simulate clicks and obtain the complete html of the page.

Besides obtaining the html, the method also calls the crawler module on a schedule. apscheduler is used for this purpose. The apscheduler scheduling library is modeled on quartz and can execute tasks at set times. Note that when using apscheduler it is preferable to store the tasks in a database, which makes the scheduled tasks persistent: the original tasks are not lost after the server restarts, and scheduling resumes automatically.

Step 4, after each crawl of the target web page, checking whether the user-specified keywords appear in the region selected by the user; if they do not appear, waiting for the user-specified time interval and repeating step 3; if the keywords appear, or if monitoring fails because the structure of the target web page has changed, sending a notification to the user in the notification mode selected by the user.

After each crawl of the target web page, a DOM tree is built using the method of step 2, and the text content of the node corresponding to the xpath recorded in step 2 is checked. If the corresponding node is not found, the structure of the target web page has changed, monitoring has failed, and the user must be notified. If no keyword appears in the web page region corresponding to the recorded xpath, monitoring continues and step 3 is executed again after the time interval specified by the user. If a keyword appears in the web page region corresponding to the recorded xpath, monitoring ends and the user is notified.

Three notification modes are provided for the user to choose from: WeChat, email, and SMS, each implemented through the API provided by the respective service.

Specific examples are as follows:

1. On task setting page 1, the user enters the target website url https://news.basic.com/ and clicks the load-page button. The page reads the user input, embeds the page corresponding to the url using an iframe, and at the same time displays the task name input field.

The user enters "obtain epidemic information" in the task name input field and clicks the Next button. The method then obtains the html source code of the target website using crawler technology, appends <script src="./js/djl.js"></script> to the end of the source code for js injection, and caches the result on the server.

2. On task setting page 2, the html cached in the previous step is embedded in the page using an iframe. The user selects the monitoring region by clicking; the selected region is outlined with a blue border, and the id corresponding to the region is displayed on the left, as shown in FIG. 3.

After selecting the region, the user enters the monitoring keyword "nucleic acid testing", a refresh interval (in hours) of 3, and the notification mode SMS, and then clicks Confirm.

3. The method then uses python's lxml library to build a DOM tree from the target web page html, and uses the DOM tree to obtain and record the xpath of the region selected by the user. At the same time, apscheduler is used to set up a scheduled task whose execution interval is the refresh interval specified by the user. The scheduled function periodically crawls the source code of the target website, rebuilds the DOM tree, and checks the text in the sub DOM tree rooted at the node corresponding to the recorded xpath. When a keyword specified by the user is found, a notification is sent in the mode specified by the user. If, in any check, the DOM tree contains no node corresponding to the recorded xpath, a monitoring failure message is sent to the user.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrases "comprising … …" or "comprising … …" does not exclude the presence of additional elements in a process, method, article, or terminal that comprises the element. Further, herein, "greater than," "less than," "more than," and the like are understood to exclude the present numbers; the terms "above", "below", "within" and the like are to be understood as including the number.

Although the embodiments have been described, once the basic inventive concept is obtained, other variations and modifications of these embodiments can be made by those skilled in the art, so that the above embodiments are only examples of the present invention, and not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes using the contents of the present specification and drawings, or any other related technical fields, which are directly or indirectly applied thereto, are included in the scope of the present invention.

Claims

1. An information subscription method based on intelligent crawlers is characterized by specifically comprising the following steps:

2. The information subscription method based on the intelligent crawler according to claim 1, wherein step 1 obtains the user input through a form tag, and embeds a web page corresponding to the user input website in the web page by using an iframe tag.

3. The information subscription method based on intelligent crawler according to claim 1, wherein the method for obtaining the monitored webpage area in step 2 is to crawl html codes of a website input by a user through the crawler and add a mouse event in an original webpage by means of js injection.

4. The intelligent crawler-based information subscription method according to claim 3, wherein the mouse events specifically comprise mouse move-in, move-out, and click events; for a mouse move-in event, a green css border is added to the element; for a move-out event, it is judged whether the element is in the selected state, and if not, the green border of the element is deleted, and if so, nothing is done; for a mouse click event, it is first judged whether the element has an id attribute in the original html, and if so, the id is recorded, otherwise a random id is generated for the element; it is then judged whether the element is in the selected state, and if so, the element is set to the unselected state and its blue border is deleted, and if not, the element is set to the selected state and a blue border is added; the selected element list of the web page is then updated dynamically using ajax to display the id of the element.

5. The intelligent crawler-based information subscription method as claimed in claim 3, wherein the specific method of js injection in the step 2 is to add the path of the injected js file to the last part of the crawled html webpage.

6. The intelligent crawler-based information subscription method according to claim 1, wherein in step 2, after the user clicks to select the web page region, a form is used to obtain the time interval for checking the web page entered by the user and the notification mode to use once the user's keywords are detected.

7. The intelligent crawler-based information subscription method according to any one of claims 1-6, wherein after the user selects the monitored webpage region in step 2, a DOM tree is established according to webpage html by using an lxml library of python, the DOM tree is traversed, nodes having id corresponding to the region selected by the user are found, xpath of the nodes is obtained and recorded, and thus the region selected by the user is obtained.

8. The information subscription method based on intelligent crawler according to claim 1, wherein the implementation method of regularly crawling html of the web page in step 3 is a timed task function provided by using an apscheduler package of python, and regularly calls a crawler module to crawl html of the web page for each web page monitoring task set by a user.

9. The information subscription method based on intelligent crawlers according to claim 1, wherein the step 4 specifically comprises:

step 4.4, if the keywords do not appear, returning normally;

10. The intelligent crawler-based information subscription method according to claim 1 or 9, wherein the notification modes are: email, WeChat, and SMS.