CN109246069B - Webpage login method and device and readable storage medium

Info

Publication number: CN109246069B
Application number: CN201810618080.5A
Authority: CN
Inventors: 陈少鹏
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2018-06-15
Filing date: 2018-06-15
Publication date: 2020-10-16
Anticipated expiration: 2038-06-15
Also published as: CN109246069A

Abstract

The application discloses a webpage login method, a webpage login device and a readable storage medium. The method comprises: searching the obtained source code of a page to be logged in for a first tag, wherein the first tag is an input tag with a password input attribute. In conventional page design, the account and the password are generally placed in the same row or the same column. Therefore, after the first tag is found, a second tag related to login is searched for according to the source code of the page to be logged in, wherein the second tag is a parent tag of the first tag. A target login tag used for logging in to the page is then searched for in the second tag and the child tags of the second tag. This realizes the login of the webpage and improves both the login speed and the login success rate.

Description

Webpage login method and device and readable storage medium

Technical Field

The present disclosure relates to the field of communications technologies, and in particular, to a method and an apparatus for web page login, and a readable storage medium.

Background

With the rapid development of networks, the World Wide Web has become a carrier of a large amount of information, and effectively extracting and using that information in a complex network environment has become a great challenge. Web crawlers emerged in response. A web crawler is a program that automatically browses the internet, and is also called an automatic indexer, a web robot and the like. A web crawler starts from an initial Uniform Resource Locator (URL), finds new URLs by continuously analyzing the content of web pages with its algorithm, finds more web pages from the new URLs, and keeps looping to capture web pages until certain conditions set by the system are met. To collect URLs as comprehensively as possible, the crawler needs to log in to different web pages for deep collection, so how to log in to web pages becomes a key factor in the performance of the web crawler.

In the related art, when a web page is to be logged in, a request is sent according to a start URL input by a user, and the response content, such as the Hyper Text Markup Language (HTML) source code, is saved locally. The acquired source code is then matched in full text through a regular expression: for example, specific keywords such as Login, submit and button are used to match tags, the matched tags are bound with the account and the password of the user, and a login is attempted.
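The related-art matching described above can be illustrated roughly as follows. This is only a sketch under stated assumptions: the keyword list, the function name `keyword_match_tags` and the sample page are hypothetical and are not taken from any actual related-art implementation.

```python
import re

# Hypothetical keyword list; the related art only names "Login, submit, button and the like".
KEYWORDS = ("login", "submit", "button")

def keyword_match_tags(html):
    """Scan the full text and return every opening tag whose markup mentions a keyword."""
    tag_pattern = re.compile(r"<[a-zA-Z][^>]*>")
    hits = []
    for tag in tag_pattern.findall(html):
        lowered = tag.lower()
        if any(word in lowered for word in KEYWORDS):
            hits.append(tag)
    return hits

# A page whose login control is a customized tag that uses none of the keywords.
custom_page = '<form><input type="password"><span class="go" onclick="enter()">Go</span></form>'
matched = keyword_match_tags(custom_page)
```

In this sample, `matched` is empty because the customized span tag contains none of the keywords, which is the kind of miss described in the following paragraph.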

Because a fixed set of keywords cannot cover every tag that a login page may contain, and tags customized by a user in particular cannot be matched, the web page login mode provided by the related art is prone to login failure. In addition, full-text matching through regular expressions makes web page login inefficient and inaccurate.

Disclosure of Invention

The present disclosure provides a web page login method, apparatus and readable storage medium to overcome the problems in the related art. The technical scheme is as follows:

In a first aspect, the present disclosure provides a web page login method, including: searching the obtained source code of a page to be logged in for a first tag, wherein the first tag is an input tag with a password input attribute; searching for a second tag related to login according to the source code of the page to be logged in, wherein the second tag is a parent tag of the first tag; and then searching for a target login tag in the second tag and the sub-tags of the second tag, wherein the target login tag is used for logging in to the page to be logged in, so that the login operation of the page to be logged in can be realized.

In conventional page design, the naming rule of tags with the password input attribute is generally fixed, the account and the password are generally in the same row or the same column, and the tags corresponding to the account and the password are close to each other in the source code. Therefore, after the first tag with the password input attribute is found in the source code of the page to be logged in, the second tag related to login is obtained by searching for the parent tag of the first tag, and the target login tag used for logging in to the page is searched for in the second tag and the sub-tags of the second tag. This realizes the login of the webpage and can improve both the login speed and the login success rate.

In a first implementation manner of the first aspect, since the button tag, the a tag, and the input tag are the tags most often used for web page login, the login related tag is one of the button tag, the a tag, and the input tag.

With reference to the first aspect or the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the target login tag is searched for in the second tag and the sub-tag of the second tag, which includes but is not limited to: searching one or more candidate login tags in the second tag and the sub-tags of the second tag; and performing login verification according to each candidate login tag until login succeeds according to a target login tag in one or more candidate login tags.

With reference to the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the searching for one or more candidate login tags in the second tag and the sub-tags of the second tag may be implemented by a candidate dictionary, which includes but is not limited to: if the login related tag is a button tag, searching for a class tag, a name tag, an id tag and an onclick tag in the sub-tags of the button tag, and taking the button tag together with the class tag, the name tag, the id tag and the onclick tag in its sub-tags as candidate login tags; if the login related tag is an a tag, searching for a class tag, a name tag, an id tag, a text tag, an onclick tag and a title tag in the sub-tags of the a tag, and taking the a tag together with the class tag, the name tag, the id tag, the text tag, the onclick tag and the title tag in its sub-tags as candidate login tags; if the login related tag is an input tag, searching for a class tag, a name tag, an id tag, a value tag, a placeholder tag, an onclick tag and a title tag in the sub-tags of the input tag, and taking the input tag together with the class tag, the name tag, the id tag, the value tag, the placeholder tag, the onclick tag and the title tag in its sub-tags as candidate login tags. Searching for the candidate login tags with the candidate dictionary in this way can further improve the search speed.

With reference to any one of the first aspect to the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the second tag is searched according to a source code of the page to be logged in, and a full-text search may be performed in the source code. In addition, in order to increase the search speed, when the second tag is searched according to the source code of the page to be logged in, a Document Object Model (DOM) tree of the page to be logged in can be obtained according to the source code of the page to be logged in, and the second tag is searched according to the DOM tree of the page to be logged in. Each node of the DOM tree is each label in the source code of the page to be logged in, and the DOM tree represents the nesting relation of the labels in the source code of the page to be logged in.

With reference to any one of the first aspect to the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the method further comprises: if the target login tag does not exist in the second tag and the sub-tags of the second tag, that is, in the parent tag of the first tag and the sub-tags of that parent tag, the ancestor tags of the first tag can continue to be traversed. Specifically, a third tag is searched for according to the source code of the page to be logged in, wherein the third tag is the parent tag of the second tag and is a login related tag; and the target login tag is searched for in the third tag and the target sub-tags of the third tag, wherein the target sub-tags are the sub-tags of the third tag other than the second tag.
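The upward traversal of this implementation manner may be sketched as follows. The `Node` class, the helper names and the predicate `is_login_tag` are hypothetical and are introduced only for illustration. The sketch reads "login related tag" loosely as whatever the predicate accepts, and it keeps climbing past the third tag, which goes beyond the two levels explicitly described.

```python
class Node:
    """Hypothetical tree node: a tag name, its attributes and its child tags."""
    def __init__(self, tag, attrs=None, children=()):
        self.tag = tag
        self.attrs = attrs or {}
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def subtree(node, skip=None):
    """Yield node and its descendants, leaving out the subtree rooted at `skip`."""
    if node is skip:
        return
    yield node
    for child in node.children:
        yield from subtree(child, skip)

def find_login_tag(first_tag, is_login_tag):
    """Search the parent of first_tag, then each higher ancestor, skipping searched parts."""
    searched = None            # subtree that earlier rounds already covered
    scope = first_tag.parent   # the "second tag"
    while scope is not None:
        for node in subtree(scope, skip=searched):
            if is_login_tag(node):
                return node
        searched, scope = scope, scope.parent  # move up to the "third tag", and so on
    return None

# Sample page: the password input sits in a row, the button is a sibling of that row.
pwd = Node("input", {"type": "password"})
row = Node("div", children=[pwd])
submit = Node("button", {"id": "go"})
form = Node("form", children=[row, submit])
found = find_login_tag(pwd, lambda n: n.tag == "button")
```

Here the first round searches the row and finds nothing, and the second round searches the form while skipping the row, which corresponds to excluding the second tag from the target sub-tags of the third tag.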

With reference to the fifth implementation manner of the first aspect, in a sixth implementation manner of the first aspect, the searching for the third tag according to the source code of the page to be logged in includes: and searching the third tag according to the DOM tree of the page to be logged in.

With reference to the fifth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the searching for the target login tag in the third tag and the target sub-tag of the third tag includes: searching one or more candidate login tags in the third tag and the target sub-tag of the third tag; and performing login verification according to each candidate login tag until login is successful according to the target login tag in the one or more candidate login tags. When one or more candidate login tags are searched in the third tag and the target sub-tag of the third tag, full-text search can be performed in the source code, and search can also be performed based on the DOM tree, so that the search speed is further increased.

In a second aspect, a web page login apparatus is further provided, and the apparatus includes: an acquisition module, configured to acquire a page to be logged in; and a searching module, configured to search the source code of the page to be logged in for a first tag, wherein the first tag is an input tag with a password input attribute; search for a second tag according to the source code of the page to be logged in, wherein the second tag is a parent tag of the first tag, and the second tag is a login related tag; and search for a target login tag in the second tag and the sub-tags of the second tag, wherein the target login tag is used for logging in to the page to be logged in.

In the first embodiment of the second aspect, the login related tag is one of a button tag, an a tag, and an input tag.

With reference to the second aspect or the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the searching module is configured to search for a target login tag in the second tag and the sub-tags of the second tag, and specifically includes: the search module is used for searching one or more candidate login tags in the second tag and the sub-tags of the second tag; and performing login verification according to each candidate login tag until login succeeds according to a target login tag in one or more candidate login tags.

With reference to the second implementation manner of the second aspect, in a third implementation manner of the second aspect, the searching module is configured to search for one or more candidate login tags in the second tag and the sub-tags of the second tag, which specifically includes: the searching module is configured to, if the login related tag is a button tag, search for a class tag, a name tag, an id tag, and an onclick tag in the sub-tags of the button tag, and take the button tag together with the class tag, the name tag, the id tag, and the onclick tag in its sub-tags as candidate login tags; if the login related tag is an a tag, search for a class tag, a name tag, an id tag, a text tag, an onclick tag and a title tag in the sub-tags of the a tag, and take the a tag together with the class tag, the name tag, the id tag, the text tag, the onclick tag and the title tag in its sub-tags as candidate login tags; if the login related tag is an input tag, search for a class tag, a name tag, an id tag, a value tag, a placeholder tag, an onclick tag and a title tag in the sub-tags of the input tag, and take the input tag together with the class tag, the name tag, the id tag, the value tag, the placeholder tag, the onclick tag and the title tag in its sub-tags as candidate login tags.

With reference to any one of the second aspect to the third implementation manner of the second aspect, in a fourth implementation manner of the second aspect, the searching module is configured to search for the second tag according to the source code of the page to be logged in, and specifically includes: the searching module is used for obtaining a DOM tree of the page to be logged in according to the source code of the page to be logged in, wherein each node of the DOM tree is each label in the source code of the page to be logged in, and the DOM tree represents the nesting relation of the labels in the source code of the page to be logged in; and searching the second tag according to the DOM tree of the page to be logged in.

With reference to any one of the second aspect to the fourth implementation manner of the second aspect, in a fifth implementation manner of the second aspect, the searching module is further configured to: if the target login tag does not exist in the second tag and the sub-tags of the second tag, search for a third tag according to the source code of the page to be logged in, wherein the third tag is a parent tag of the second tag and is a login related tag; and search for the target login tag in the third tag and the target sub-tags of the third tag, wherein the target sub-tags are the sub-tags of the third tag other than the second tag.

With reference to the fifth implementation manner of the second aspect, in a sixth implementation manner of the second aspect, the searching module is configured to search the third tag according to the source code of the page to be logged in, and specifically includes: and the searching module is used for searching the third tag according to the DOM tree of the page to be logged in.

With reference to the fifth implementation manner of the second aspect, in a seventh implementation manner of the second aspect, the searching module is configured to search for the target login tag in the third tag and the target sub-tag of the third tag, and specifically includes: a searching module, configured to search for one or more candidate login tags in the third tag and the target sub-tag of the third tag; and performing login verification according to each candidate login tag until login is successful according to the target login tag in the one or more candidate login tags. When one or more candidate login tags are searched in the third tag and the target sub-tag of the third tag, full-text search can be performed in the source code, and search can also be performed based on the DOM tree, so that the search speed is further increased.

In a third aspect, there is provided a web page login apparatus, the apparatus includes a processor and a memory, the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement any of the above web page login methods.

In a fourth aspect, a computer-readable storage medium is provided, having stored therein at least one instruction, which is loaded and executed by a processor to implement any of the above web page login methods.

In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the above web page login methods.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

FIG. 1 illustrates a crawler architecture diagram of one embodiment of the present disclosure;

FIG. 2 is a schematic diagram illustrating an implementation environment provided by an embodiment of the present disclosure;

FIG. 3 is a flowchart illustrating a web page login method provided by an embodiment of the present disclosure;

FIG. 4 illustrates a DOM tree structure diagram provided by one embodiment of the present disclosure;

FIG. 5 is a flowchart illustrating a web page login method provided by an embodiment of the present disclosure;

FIG. 6 is a flowchart illustrating a web page login method provided by an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a web page login device according to an embodiment of the present disclosure;

FIG. 8 shows a schematic structural diagram of a web page login device according to an embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.

With the rapid development of networks, the World Wide Web has become a carrier of a large amount of information, and effectively extracting and using that information in a complex network environment has become a great challenge. As an information acquisition tool, the web crawler is used in more and more application scenarios. For example, a web crawler, as an important component of a search engine, may crawl information from the World Wide Web for the search engine. For another example, in order to detect vulnerabilities on externally exposed websites or established networks, a web crawler may also be applied to a vulnerability scanner for URL collection. The vulnerability scanner acquires the directory structure and dynamic interaction points of the whole target site according to the URLs collected by the web crawler, and then constructs attack test requests (HTTP or HTTPS) according to the acquired dynamic interaction points, with a vulnerability feature library as reference. The requests are then sent to the target site through the client, and the response body returned by the target site is matched against the feature codes in the vulnerability feature library to judge whether a specific vulnerability exists in the target site. A vulnerability scanner may be used by security auditors for security assessment, by hackers for malicious attacks or unauthorized access to assets, and for pre-application testing. Therefore, the application range of the web crawler is becoming wider and wider.

However, whatever scenario a web crawler is applied to, crawling information is indispensable, and logging in to web pages is one way of crawling information, so logging in to a web page successfully and quickly is very important.

For ease of understanding, the embodiment of the present invention first explains the principle of web crawlers before describing the specific process of web page login. FIG. 1 shows the Scrapy crawler framework, which is used here to describe the principle of the whole crawler. Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used for crawling websites and extracting structured data from pages. Scrapy has a wide range of uses, including data mining, monitoring and automated testing.

In the framework shown in fig. 1, the engine (Scrapy Engine) is responsible for controlling the flow of data among all components in the system and for triggering events when corresponding actions occur. The scheduler (Scheduler) receives requests (Request) from the engine and enqueues them so that they can be provided to the engine later when the engine asks for them. The downloader (Downloader) is responsible for fetching page data and providing it to the engine, which passes it on to the crawler (Spider). A crawler is a class written by a Scrapy user that parses the response (Response) and extracts items (Item) or additional follow-up URLs. Each crawler is responsible for processing a particular web site (or several). The item pipeline (Item Pipeline) is responsible for handling the items extracted by the crawler. Typical processing includes cleaning, validation, and persistence (e.g., storing into a database). Downloader middlewares are specific hooks between the engine and the downloader that process the responses passed by the downloader to the engine. The downloader middleware provides a simple mechanism for extending Scrapy functionality by inserting custom code. Spider middlewares are specific hooks between the engine and the crawler that process the crawler's inputs (responses) and outputs (items and requests). The spider middleware likewise provides a simple mechanism for extending Scrapy functionality by inserting custom code.

When crawling information, the data flow in Scrapy is controlled by the execution engine. The engine opens a website (open a domain), finds the Spider that handles the website, and requests the first URL(s) to crawl from that Spider. The engine gets the first URL to crawl from the Spider and schedules it as a Request in the scheduler (Scheduler). The engine requests the next URL to crawl from the scheduler. The scheduler returns the next URL to be crawled to the engine, which forwards the URL to the downloader (Downloader) via the downloader middleware (request direction). Once the page is downloaded, the downloader generates a Response for the page and sends it to the engine via the downloader middleware (response direction). The engine receives the Response from the downloader and sends it to the Spider for processing via the spider middleware (input direction). The Spider processes the Response and returns the crawled Items and the (follow-up) new Requests to the engine. The engine sends the crawled Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the scheduler. The process repeats from the step of requesting the next URL until there are no more requests in the scheduler, and the engine then closes the website.
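The data flow described above can be imitated with a toy loop. The sketch below does not use the Scrapy API. The in-memory site, the function names and the de-duplication set `seen` are assumptions made for illustration, and the middlewares are omitted.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": ["/"],
}

def download(url):
    """Stand-in for the downloader: return the 'response' for a URL."""
    return SITE.get(url, [])

def spider_parse(url, response):
    """Stand-in for the spider: return one item and the follow-up requests."""
    item = {"url": url, "links": len(response)}
    return item, list(response)

def crawl(start_url):
    scheduler = deque([start_url])   # scheduler: queue of pending requests
    seen = {start_url}               # assumption: each URL is requested once
    pipeline = []                    # item pipeline: collected items
    while scheduler:                 # engine loop: ends when no requests remain
        url = scheduler.popleft()
        response = download(url)
        item, follow_ups = spider_parse(url, response)
        pipeline.append(item)
        for next_url in follow_ups:
            if next_url not in seen:
                seen.add(next_url)
                scheduler.append(next_url)
    return pipeline

items = crawl("/")
```

The loop ends when the queue is empty, which mirrors the condition that the engine closes the website once the scheduler holds no more requests.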

Based on the above description of the working principle of the web crawler, the embodiment of the present invention provides a web page login method, which can be applied to the implementation environment shown in fig. 2. As shown in fig. 2, the implementation environment includes: a terminal 100, a server 102, and a server 104.

The terminal 100 is used to obtain network data from the server 102, for example, the terminal 100 obtains web page data to be accessed from the server 102. The terminal 100 may be an electronic device such as a mobile phone, a tablet computer, a personal computer, and the like.

The server 102 is used for providing network data, and may be one server, a server cluster composed of a plurality of servers, or a cloud computing service center. The server 104 may be provided with a web page login device provided in the embodiment of the present invention, and the web page provided by the server 102 is logged in through the web page login device, so as to perform operations such as vulnerability detection on the website established on the server 102.

The server 102 and the server 104 may be one server or two separate servers. When the server 102 and the server 104 are two independent servers, the server 102 and the server 104 establish a communication connection through a network a106, and the network a106 may be a wired network or a wireless network. The terminal 100 and the server 102 establish a communication connection through a network B108, and the network B108 may be a wired network or a wireless network.

Based on the above implementation environment, the embodiment of the present invention provides a web page login method, which can be applied to the server 104 in the implementation environment shown in fig. 2. Referring to fig. 3, the web page login method provided by the embodiment of the present invention includes the following steps:

step 301, acquiring a page to be logged in.

In this step, the way of triggering acquisition of the page to be logged in differs with the application scenario of web page login. For example, when web page login is performed in a vulnerability detection scenario, acquisition of the page to be logged in may be triggered when a vulnerability detection task is detected. For example, the server sends a HyperText Transfer Protocol (HTTP) request, accesses a start page that needs vulnerability detection, and obtains an HTTP response, which includes the HTML source code. The server stores the acquired HTML source code locally, collects all links (URLs) in the HTML source code and stores the URLs locally. Thereafter, HTTP requests may continue to be sent for the collected URLs, and so on. Each URL acquired in this way corresponds to a page to be logged in.
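Collecting the links from a stored HTML source code may look roughly like the following sketch, which uses only the Python standard library and works on a string instead of sending a real HTTP request. The class name `LinkCollector`, the sample markup and the host `example.test` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.urls:   # store each URL only once
                self.urls.append(absolute)

def collect_urls(base_url, html_source):
    collector = LinkCollector(base_url)
    collector.feed(html_source)
    return collector.urls

html_source = '<a href="/login">Sign in</a><a href="news.html">News</a><a href="/login">Again</a>'
urls = collect_urls("http://example.test/index.html", html_source)
```

Each URL in `urls` would then be requested in turn, and each corresponds to a candidate page to be logged in.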

Besides the above manner of triggering acquisition, the web page to be logged may also be automatically acquired according to a certain period, for example, the web page to be logged is automatically acquired every other day, and the web page logging method provided by the embodiment of the present invention is triggered to be executed. Of course, other ways of triggering acquisition of the page to be logged may also be adopted, and the embodiment of the present invention does not limit the triggering way of acquiring the page to be logged.

It should be noted that the page to be logged in may be a root page, or may be any page acquired after logging in through the root page.

Step 302, searching a first tag in a source code of a page to be logged in, wherein the first tag is an input tag with a password input attribute.

For a page to be logged in, various tags exist in the HTML source code of the page, and an input tag with a password input attribute is generally used for inputting a password. Therefore, after the page to be logged in and its source code are acquired, a first tag with the password input attribute can be searched for in the source code, for example, an input[type=password] tag.
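Step 302 may be sketched with the standard-library HTML parser as follows. The class and function names are hypothetical, and the sketch only collects the attributes of matching tags without recording their position in the page.

```python
from html.parser import HTMLParser

class PasswordInputFinder(HTMLParser):
    """Record the attributes of every input tag whose type is password."""
    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Tag and attribute names arrive lower-cased; the value is compared case-insensitively.
        if tag == "input" and (attributes.get("type") or "").lower() == "password":
            self.matches.append(attributes)

def find_password_inputs(html_source):
    finder = PasswordInputFinder()
    finder.feed(html_source)
    return finder.matches

sample = '<form><input type="text" name="user"><input type="PASSWORD" name="pwd"></form>'
first_tags = find_password_inputs(sample)
```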

Step 303, searching a second tag according to the source code of the page to be logged in, where the second tag is a parent tag of the first tag, and the second tag is a login related tag.

The source code of a page contains many tags. In conventional page design, the naming rule of tags with the password input attribute is generally fixed, the account and the password are generally in the same row or column, and the tags corresponding to the account and the password are close to each other in the source code. Therefore, in the method provided by the embodiment of the present invention, after the first tag having the password input attribute is found in step 302, the second tag related to login can be searched for in the source code of the page, starting from the first tag. Furthermore, some tags in the source code of a page are associated with each other; for example, tags with a nesting relationship may be referred to as a parent tag and a child tag. Therefore, after the first tag is found, the parent tag of the first tag can be searched for according to the source code of the page to be logged in, and the login related tag in the parent tag of the first tag is used as the second tag.

For ease of understanding, a segment of source code is shown as follows:

wherein </body> represents the end of the body tag, </p> represents the end of the p tag, and </a> represents the end of the a tag. Because the p tag and the a tag are both nested in the body tag, the parent tag of the p tag is the body tag, the parent tag of the a tag is also the body tag, and the a tag is a sibling tag of the p tag.
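The source code listing itself is not reproduced in this text. The nesting relationship it describes can be checked with a hypothetical fragment of the same shape, as in the sketch below. The fragment and its contents are assumptions, and an XML parser is used only because this small fragment happens to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a p tag and an a tag nested in a body tag.
SOURCE = "<body><p>Welcome</p><a href='/login'>Log in</a></body>"

root = ET.fromstring(SOURCE)
# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

p_tag = root.find("p")
a_tag = root.find("a")
siblings_of_p = [el.tag for el in parent_of[p_tag] if el is not p_tag]
```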

When the second tag is searched for, a full-text search can be performed on the source code of the page to be logged in. Although full-text search is not fast, the search starts from the parent tag of the first tag, so the tags found have a high login success rate, and this search mode can still guarantee the success rate of web page login.

In order to further improve the search speed, in an optional implementation manner, when the method provided by the embodiment of the present invention searches for the second tag according to the source code of the page to be logged in, a manner of obtaining the DOM tree of the page to be logged in according to the source code of the page to be logged in, and searching for the second tag according to the DOM tree of the page to be logged in is adopted. Each node of the DOM tree is each label in the source code of the page to be logged in, and the DOM tree represents the nesting relation of the labels in the source code of the page to be logged in.

The process of constructing the DOM tree is not described in detail in the embodiments of the present invention. The DOM is the standard model used in web pages to represent the objects of a page, organizing them in a tree structure, and each node in the DOM tree corresponds to a tag in the source code of the page to be logged in. Searching for the second tag through the DOM tree is therefore much faster than a full-text search of the source code of the page to be logged in.

Taking the DOM tree shown in fig. 4 as an example, there are multiple sub-tags, such as sub-tag 0, sub-tag 2, and sub-tag 3, below the root node of the DOM tree. Sub-tag 3 contains two sub-tags: (1) an account login tag and (2) a password tag. If the password tag has been found in step 302, its parent tag, namely sub-tag 3, is found by searching the DOM tree in step 303.
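A DOM-like tree with parent pointers, and the lookup of the parent of the password tag, may be sketched as follows. This is a simplified stand-in and not a real browser DOM. The `Node` and `DomBuilder` classes, the short list of void tags and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}  # tags that never have children

class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = attrs
        self.parent = parent
        self.children = []

class DomBuilder(HTMLParser):
    """Build a tree of Node objects that mirrors the nesting of the tags."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document", {})
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs), self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node          # later tags nest inside this one

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent           # look for the nearest open tag of this name
        if node is not self.root:
            self.current = node.parent

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def find_password_parent(html_source):
    builder = DomBuilder()
    builder.feed(html_source)
    for node in walk(builder.root):
        if node.tag == "input" and (node.attrs.get("type") or "").lower() == "password":
            return node.parent           # one pointer hop, no second text scan
    return None

page = ('<html><body><div id="nav"></div><div id="box">'
        '<input type="text" name="user"><input type="password" name="pwd">'
        '</div></body></html>')
second = find_password_parent(page)
```

Once the tree exists, moving from the password tag to its parent is a single pointer lookup, which is the reason given above for preferring the DOM tree over a repeated full-text search.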

In addition, no matter which way is adopted to search the second tag, in an implementation manner, since the button tag, the a tag, and the input tag are mostly used for web page login, in the method provided by the embodiment of the present invention, the login related tag is one of the button tag, the a tag, and the input tag.

Step 304, searching for a target login tag in the second tag and the sub-tags of the second tag, wherein the target login tag is used for logging in to the page to be logged in.

In one embodiment, searching for the target login tag in the second tag and the sub-tags of the second tag includes: searching one or more candidate login tags in the second tag and the sub-tags of the second tag; and performing login verification according to each candidate login tag until login succeeds according to a target login tag in one or more candidate login tags.

Optionally, when searching for one or more candidate login tags in the second tag and the sub-tags of the second tag, the search may be based on a candidate dictionary, which may be as shown in table 1 below:

TABLE 1

Parent tag	Candidate sub-tags
button tag	class, name, id, onclick
input tag	class, name, id, value, placeholder, onclick, title
a tag	class, name, id, text, onclick, title

Based on the candidate dictionary shown in table 1, the searching for one or more candidate login tags in the second tag and the sub-tags of the second tag includes:

if the login related tag is the button tag, searching a class tag, a name tag, an id tag and an onclick tag in the sub-tags of the button tag, and taking the class tag, the name tag, the id tag and the onclick tag in the button tag and the sub-tags of the button tag as candidate login tags;

if the login related tag is an a tag, searching for a class tag, a name tag, an id tag, a text tag, an onclick tag and a title tag in the sub-tags of the a tag, and taking the class tag, the name tag, the id tag, the text tag, the onclick tag and the title tag in the a tag and the sub-tags of the a tag as candidate login tags;

if the login related tag is an input tag, searching for a class tag, a name tag, an id tag, a value tag, a placeholder tag, an onclick tag and a title tag in the sub-tags of the input tag, and taking the class tag, the name tag, the id tag, the value tag, the placeholder tag, the onclick tag and the title tag in the input tag and the sub-tags of the input tag as candidate login tags.
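The candidate dictionary and the three branches above might be expressed as a single lookup table, as in the sketch below. The names `CANDIDATE_DICT` and `collect_candidates`, the sample markup, and the decision to read each entry as an attribute of the element (with `text` taken as the element's text content) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Candidate dictionary mirroring Table 1: tag name -> entries worth inspecting.
CANDIDATE_DICT = {
    "button": ["class", "name", "id", "onclick"],
    "input": ["class", "name", "id", "value", "placeholder", "onclick", "title"],
    "a": ["class", "name", "id", "text", "onclick", "title"],
}

def collect_candidates(container):
    """List (tag, entry, value) triples found in container and its descendants."""
    candidates = []
    for node in container.iter():  # iter() yields the container itself first
        for attr in CANDIDATE_DICT.get(node.tag, []):
            # "text" is treated here as the element's text content (an assumption).
            value = (node.text or "").strip() if attr == "text" else node.get(attr)
            if value:
                candidates.append((node.tag, attr, value))
    return candidates

# Hypothetical second tag containing a password field, a button, and a link.
box = ET.fromstring(
    '<div><input type="password" name="pwd"/>'
    '<button id="loginBtn" class="btn">Sign in</button>'
    '<a class="forgot" href="/reset">Forgot password</a></div>'
)
for item in collect_candidates(box):
    print(item)
```

Keeping the per-tag lists in one dictionary means that supporting another login related tag only requires a new dictionary entry, not a new branch.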

Further, after the candidate login tags are obtained, login verification is performed according to each candidate login tag as follows: the obtained password is bound directly to the value of the first tag with the password input attribute among the candidate login tags, and the obtained account is bound to the value of the tag with the account input attribute among the candidate login tags; login is then performed based on the tag with the login attribute, the tag with the password input attribute and the tag with the account input attribute. If the login is successful, the candidate login tag used for the successful login is taken as the target login tag, and the web page login is complete.
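The verification loop, which tries candidates one by one until a login succeeds, can be sketched as below. The function `find_target_login_tag`, the `attempt_login` callback, and the stub `fake_attempt` are hypothetical; a real crawler would replace the stub with code that binds the account and password and triggers the candidate control.

```python
def find_target_login_tag(candidates, account, password, attempt_login):
    """Try candidates in order; return the first one whose login attempt succeeds."""
    for candidate in candidates:
        if attempt_login(candidate, account, password):
            return candidate  # this candidate is the target login tag
    return None  # no candidate worked; the caller may widen the search

# Stand-in for a real submission: only the hypothetical "loginBtn" control works.
def fake_attempt(candidate, account, password):
    return candidate == ("button", "id", "loginBtn")

candidates = [("a", "class", "forgot"), ("button", "id", "loginBtn")]
target = find_target_login_tag(candidates, "alice", "s3cret", fake_attempt)
print(target)  # ('button', 'id', 'loginBtn')
```

Returning `None` when every candidate fails gives the caller a clear signal to move on to a wider search scope.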

In addition, login verification can be performed in form mode. In form mode, the account, the password and the submission are all contained in the form: filling the account and the password into the form associated with the submit tag means writing the account and the password into the input tags of the form, and the submission must also submit the associated information in those tags. Therefore, the account tag, the password tag and the submit tag are all indispensable when login verification is performed. For ease of understanding, the following form is used as an example:

The acquired password and the acquired account are filled into the form to complete login verification; when the login is successful, the candidate login tag that was used is taken as the target login tag, and the web page login is complete.
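As a rough illustration of form-mode verification, the sketch below fills an account and a password into the input tags of a form and encodes the result as a request body. The form markup, the field names, the `token` field, and the helper `build_login_request` are hypothetical and are not the form from the original disclosure; the sketch also assumes that the text input carries the account.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical login form; the field names and the hidden token are illustrative only.
FORM = ET.fromstring(
    '<form action="/login" method="post">'
    '<input type="text" name="account"/>'
    '<input type="password" name="pwd"/>'
    '<input type="hidden" name="token" value="abc123"/>'
    '<input type="submit" value="Log in"/>'
    '</form>'
)

def build_login_request(form, account, password):
    """Write the account and password into the form's inputs; return (action, body)."""
    fields = {}
    for node in form.iter("input"):
        name, kind = node.get("name"), node.get("type", "text")
        if not name:
            continue  # unnamed controls (the submit button here) are not sent
        if kind == "password":
            fields[name] = password
        elif kind == "text":
            fields[name] = account  # assumption: the text input is the account field
        else:
            fields[name] = node.get("value", "")  # keep other fields as they are
    return form.get("action"), urlencode(fields)

action, body = build_login_request(FORM, "alice", "s3cret")
print(action, body)  # /login account=alice&pwd=s3cret&token=abc123
```

Carrying the remaining named fields along unchanged reflects the point that the submission must include the associated information in the form, not only the account and the password.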

According to the method provided by the embodiment of the invention, because pages are conventionally designed with the account and the password in the same row or the same column, after the first tag with the password input attribute is found in the source code of the page to be logged in, the login related second tag is obtained by searching for the parent tag of the first tag, and the target login tag used for logging in to the page to be logged in is searched for in the second tag and the sub-tags of the second tag. Web page login is thereby achieved, and both the login speed and the login success rate can be improved.

Based on the embodiment shown in fig. 3, in a case where the target login tag is not found in the second tag and the sub-tags of the second tag, the method provided in the embodiment of the present invention further includes the following optional steps. As shown in fig. 5, on the basis of steps 301 to 304, the method further includes:

Step 305, if the target login tag does not exist in the second tag and the sub-tags of the second tag, searching for a third tag according to the source code of the page to be logged in, wherein the third tag is a parent tag of the second tag, and the third tag is a login related tag.

For this step, if there is no target login tag in the second tag and the child tags of the second tag, the parent tag of the second tag may be searched according to the source code of the page to be logged in, and the login related tag in the parent tag of the second tag is used as the third tag.

As with the search for the second tag, a full-text search of the source code may be performed when searching for the third tag. To improve the search speed, the third tag can instead be searched for in the DOM tree of the page to be logged in. Since the DOM tree of the page to be logged in was already obtained when the second tag was searched for, the search for the third tag can continue directly in the DOM tree.

Step 306, searching for a target login tag in the third tag and the target sub-tag of the third tag, where the target sub-tag is a sub-tag of the third tag except for the second tag.

After the third tag is searched, in an embodiment, searching for the target login tag in the third tag and the sub-tags of the third tag includes: searching one or more candidate login tags in the third tag and the sub-tags of the third tag; and performing login verification according to each candidate login tag until login succeeds according to a target login tag in one or more candidate login tags.

Alternatively, when one or more candidate login tags are searched for in the third tag and the sub-tags of the third tag, the search may also be performed based on the candidate dictionary, for example, the candidate dictionary shown in table 1 above is used to search for the candidate login tags. The process of performing login verification according to each candidate login tag may refer to the content in step 304, which is not described herein again.
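Steps 304 to 306 can be sketched as a search that first looks inside the second tag and then climbs one level, skipping the subtree that has already been examined. The keyword test in `is_login_control`, the function `search_with_fallback`, and the sample markup are hypothetical stand-ins; an actual implementation would use the candidate dictionary and login verification described earlier.

```python
import xml.etree.ElementTree as ET

def is_login_control(node):
    # Hypothetical test: a button, a, or input whose attributes or text mention "login".
    if node.tag not in ("button", "a", "input"):
        return False
    haystack = " ".join(list(node.attrib.values()) + [node.text or ""]).lower()
    return "login" in haystack

def search_with_fallback(root, second_tag):
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # Step 304: look inside the second tag and its sub-tags.
    for node in second_tag.iter():
        if is_login_control(node):
            return node
    # Steps 305-306: climb to the third tag, skipping the subtree already searched.
    third_tag = parent_of.get(second_tag)
    if third_tag is None:
        return None
    if is_login_control(third_tag):
        return third_tag
    for child in third_tag:
        if child is second_tag:
            continue  # target sub-tags exclude the second tag
        for node in child.iter():
            if is_login_control(node):
                return node
    return None

ROOT = ET.fromstring(
    '<form id="f">'
    '<div id="fields"><input type="text" name="account"/>'
    '<input type="password" name="pwd"/></div>'
    '<div id="actions"><button id="login-btn">Go</button></div>'
    '</form>'
)
second = ROOT.find("div[@id='fields']")
target = search_with_fallback(ROOT, second)
print(target.get("id"))  # login-btn
```

Skipping the second tag during the wider pass avoids repeating work that step 304 has already done.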

When multiple web pages are to be logged in, after the login process of one web page is completed based on the embodiment shown in fig. 5, the method provided in the embodiment of the present invention can also be represented by the flow shown in fig. 6. In fig. 6, after the verification is successful, crawling continues with new URLs, and web page login is completed by using the method shown in fig. 5; the process is repeated until no new URL exists, at which point crawling stops, so that URLs are collected as comprehensively as possible.
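The repeated crawl-and-login flow might look like the breadth-first loop below. The in-memory `SITE` table, the `crawl` function, and the `login` callback are hypothetical; they stand in for real page fetching and for the login method of fig. 5 so that the sketch stays self-contained.

```python
from collections import deque

# Hypothetical site: URL -> (requires_login, outgoing links). Stands in for real fetching.
SITE = {
    "/": (False, ["/a", "/b"]),
    "/a": (True, ["/a/1"]),
    "/b": (False, ["/"]),
    "/a/1": (False, []),
}

def crawl(start, login):
    """Visit URLs breadth-first, logging in where required, until no new URL remains."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        needs_login, links = SITE[url]
        if needs_login and not login(url):
            continue  # verification failed: links behind this page stay unreachable
        order.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/", login=lambda url: True))   # ['/', '/a', '/b', '/a/1']
print(crawl("/", login=lambda url: False))  # ['/', '/b']
```

The second call shows why login matters for coverage: when login fails, the URL behind the protected page is never discovered.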

The embodiment of the invention provides a webpage login device which can be used for executing the webpage login method. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the above-described embodiments of the method. Referring to fig. 7, the apparatus includes:

an obtaining module 701, configured to obtain a page to be logged in;

a searching module 702, configured to search for a first tag in the source code of the page to be logged in, where the first tag is an input tag with a password input attribute; search for a second tag according to the source code of the page to be logged in, where the second tag is a parent tag of the first tag, and the second tag is a login related tag; and search for a target login tag in the second tag and the sub-tags of the second tag, where the target login tag is used for logging in to the page to be logged in.

In an alternative embodiment, the login related tag is one of a button tag, an a tag and an input tag.

In an optional implementation manner, the searching module 702 is configured to search for the target login tag in the second tag and the sub-tags of the second tag, and specifically includes: a searching module 702, configured to search for one or more candidate login tags from the second tag and the sub-tags of the second tag; and performing login verification according to each candidate login tag until login succeeds according to a target login tag in one or more candidate login tags.

In an optional implementation manner, the searching module 702 is configured to search for one or more candidate login tags in the second tag and the sub-tags of the second tag, which specifically includes: the searching module 702 is configured to, if the login related tag is a button tag, search for a class tag, a name tag, an id tag, and an onclick tag in the sub-tags of the button tag, and take the class tag, the name tag, the id tag, and the onclick tag in the button tag and the sub-tags of the button tag as candidate login tags; if the login related tag is an a tag, search for a class tag, a name tag, an id tag, a text tag, an onclick tag and a title tag in the sub-tags of the a tag, and take the class tag, the name tag, the id tag, the text tag, the onclick tag and the title tag in the a tag and the sub-tags of the a tag as candidate login tags; if the login related tag is an input tag, search for a class tag, a name tag, an id tag, a value tag, a placeholder tag, an onclick tag and a title tag in the sub-tags of the input tag, and take the class tag, the name tag, the id tag, the value tag, the placeholder tag, the onclick tag and the title tag in the input tag and the sub-tags of the input tag as candidate login tags.

In an optional implementation manner, the searching module 702 is configured to search the second tag according to the source code of the page to be logged in, and specifically includes: the searching module 702 is configured to obtain a DOM tree of the page to be logged according to the source code of the page to be logged, where each node of the DOM tree is each tag in the source code of the page to be logged, and the DOM tree represents a nesting relationship of the tags in the source code of the page to be logged; and searching the second tag according to the DOM tree of the page to be logged in.

In an alternative embodiment, the searching module 702 is further configured to: if the target login tag does not exist in the second tag and the sub-tags of the second tag, search for a third tag according to the source code of the page to be logged in, where the third tag is a parent tag of the second tag, and the third tag is a login related tag; and search for the target login tag in the third tag and a target sub-tag of the third tag, where the target sub-tag is a sub-tag of the third tag other than the second tag.

In an optional implementation manner, the searching module 702 is configured to search for the third tag according to the source code of the page to be logged in, and specifically includes: and the searching module 702 is configured to search for the third tag according to the DOM tree of the page to be logged in.

In an optional implementation manner, the searching module 702 is configured to search for a target login tag in a third tag and a target sub-tag of the third tag, and specifically includes: a searching module 702, configured to search for one or more candidate login tags in the third tag and the target sub-tag of the third tag; and performing login verification according to each candidate login tag until login is successful according to the target login tag in the one or more candidate login tags. When one or more candidate login tags are searched in the third tag and the target sub-tag of the third tag, full-text search can be performed in the source code, and search can also be performed based on the DOM tree, so that the search speed is further increased.

Because account passwords are generally in the same row or the same column in combination with the conventional design of a page, the device provided by the embodiment of the invention searches a first tag with a password input attribute in a source code of a page to be logged in, obtains a second tag related to login by searching a parent tag of the first tag, and searches a target login tag for logging in the page to be logged in from the second tag and a child tag of the second tag, so that the login of a webpage is realized, the login speed of the webpage can be increased, and the success rate of the login of the webpage can be increased.

It should be noted that: in the device provided in the above embodiment, when performing web page login, only the division of the above function modules is exemplified, and in practical applications, the function distribution may be completed by different function modules as needed, that is, the internal structure of the device is divided into different function modules to complete all or part of the functions described above. In addition, the device provided by the above embodiment and the embodiment of the web page login method belong to the same concept, and the specific implementation process thereof is described in the embodiment of the method for details, which is not described herein again.

Referring to fig. 8, a schematic structural diagram of a web page login apparatus provided by an embodiment of the present disclosure is shown. The device may be a server or a terminal, in particular:

computing system 800 includes a Central Processing Unit (CPU)801, a system memory 804 including a Random Access Memory (RAM)802 and a Read Only Memory (ROM)803, and a system bus 805 connecting system memory 804 and central processing unit 801. The computing system 800 also includes a basic input/output system (I/O system) 806, which facilitates transfer of information between devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.

The basic input/output system 806 includes a display 808 for displaying information and an input device 809 such as a mouse, keyboard, etc. for user input of information. Wherein a display 808 and an input device 809 are connected to the central processing unit 801 through an input output controller 810 connected to the system bus 805. The basic input/output system 806 may also include an input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 810 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the computing system 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.

Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 804 and mass storage 807 described above may be collectively referred to as memory.

According to various embodiments of the present disclosure, computing system 800 may also operate through a remote computer connected via a network such as the Internet. That is, the computing system 800 may be connected to the network 812 through the network interface unit 811 coupled to the system bus 805, or the network interface unit 811 may be used to connect to other types of networks or remote computer systems (not shown).

The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU. The one or more programs include instructions for performing the web page login method provided by any of fig. 3, 5 and 6.

The disclosed embodiments also provide a non-transitory computer-readable storage medium, whose instructions, when executed by a processor of a computing system, enable the computing system to perform the web page login method provided by any one of fig. 3, 5, and 6.

The disclosed embodiments also provide a computer program product containing instructions which, when run on a computer, cause the computer to carry out the web page login method provided in any one of figs. 3, 5 and 6.

The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer program instructions which, when loaded and executed on a device, cause a process or function according to an embodiment of the invention to be performed, in whole or in part. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer readable storage medium can be any available medium that can be accessed by the apparatus, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a Digital Video Disk (DVD)), or a semiconductor medium (such as a solid state disk).

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A web page login method, the method comprising:

acquiring a page to be logged in;

searching a first tag in a source code of the page to be logged in, wherein the first tag is an input tag with a password input attribute;

searching a second tag according to the source code of the page to be logged in, wherein the second tag is a father tag of the first tag, and the second tag is a login related tag;

searching a target login tag in the second tag and the sub-tags of the second tag, wherein the target login tag is used for logging in the page to be logged in.

2. The method according to claim 1, wherein the login related tag is one of a button tag, an a tag, and an input tag.

3. The method of claim 1, wherein searching for a target login tag in the second tag and the sub-tags of the second tag comprises:

searching one or more candidate login tags in the second tag and the sub-tags of the second tag;

and performing login verification according to each candidate login tag until login is successful according to the target login tag in the one or more candidate login tags.

4. The method of claim 3, wherein searching for one or more candidate login tags in the second tag and the sub-tags of the second tag comprises:

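if the login related tag is a button tag, searching a class tag, a name tag, an id tag and an onclick tag in the sub-tags of the button tag, and taking the class tag, the name tag, the id tag and the onclick tag in the button tag and the sub-tags of the button tag as candidate login tags;

if the login related tag is an a tag, searching a class tag, a name tag, an id tag, a text tag, an onclick tag and a title tag in the sub-tags of the a tag, and taking the class tag, the name tag, the id tag, the text tag, the onclick tag and the title tag in the a tag and the sub-tags of the a tag as candidate login tags;

if the login related tag is an input tag, searching a class tag, a name tag, an id tag, a value tag, a placeholder tag, an onclick tag and a title tag in the sub-tags of the input tag, and taking the class tag, the name tag, the id tag, the value tag, the placeholder tag, the onclick tag and the title tag in the input tag and the sub-tags of the input tag as candidate login tags.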

5. The method according to any one of claims 1-4, wherein the searching for the second tag according to the source code of the page to be logged comprises:

obtaining a Document Object Model (DOM) tree of the page to be logged in according to the source code of the page to be logged in, wherein each node of the DOM tree is each label in the source code of the page to be logged in, and the DOM tree represents the nesting relation of the labels in the source code of the page to be logged in;

and searching the second tag according to the DOM tree of the page to be logged in.

6. The method according to any one of claims 1-4, further comprising:

if the target login tag does not exist in the second tag and the sub-tags of the second tag, searching a third tag according to a source code of the page to be logged in, wherein the third tag is a parent tag of the second tag, and the third tag is the login related tag;

searching the target login tag in the third tag and a target sub-tag of the third tag, wherein the target sub-tag is a sub-tag of the third tag except the second tag.

7. A web page login apparatus, comprising:

the acquisition module is used for acquiring a page to be logged in;

the search module is used for searching a first tag in the source code of the page to be logged in, wherein the first tag is an input tag with a password input attribute; searching a second tag according to the source code of the page to be logged in, wherein the second tag is a father tag of the first tag, and the second tag is a login related tag; searching a target login tag in the second tag and the sub-tags of the second tag, wherein the target login tag is used for logging in the page to be logged in.

8. The apparatus according to claim 7, wherein the login related tag is one of a button tag, an a tag, and an input tag.

9. The apparatus according to claim 7, wherein the search module is configured to search for the target login tag in the second tag and the sub-tags of the second tag, and specifically includes:

the search module is used for searching one or more candidate login tags in the second tag and the sub-tags of the second tag; and performing login verification according to each candidate login tag until login is successful according to the target login tag in the one or more candidate login tags.

10. The apparatus according to claim 9, wherein the searching module is configured to search for one or more candidate login tags in the second tag and the sub-tags of the second tag, which specifically includes: the searching module is configured to, if the login related tag is a button tag, search for a class tag, a name tag, an id tag, and an onclick tag in the sub-tags of the button tag, and take the class tag, the name tag, the id tag, and the onclick tag in the button tag and the sub-tags of the button tag as candidate login tags; if the login related tag is an a tag, search for a class tag, a name tag, an id tag, a text tag, an onclick tag and a title tag in the sub-tags of the a tag, and take the class tag, the name tag, the id tag, the text tag, the onclick tag and the title tag in the a tag and the sub-tags of the a tag as candidate login tags; if the login related tag is an input tag, search for a class tag, a name tag, an id tag, a value tag, a placeholder tag, an onclick tag and a title tag in the sub-tags of the input tag, and take the class tag, the name tag, the id tag, the value tag, the placeholder tag, the onclick tag and the title tag in the input tag and the sub-tags of the input tag as candidate login tags.

11. The apparatus according to any one of claims 7 to 10, wherein the searching module is configured to search for the second tag according to the source code of the page to be logged in, and specifically includes: the searching module is used for obtaining a Document Object Model (DOM) tree of the page to be logged in according to the source code of the page to be logged in, wherein each node of the DOM tree is each label in the source code of the page to be logged in, and the DOM tree represents the nesting relation of the labels in the source code of the page to be logged in; and searching the second tag according to the DOM tree of the page to be logged in.

12. The apparatus of any of claims 7-10, wherein the search module is further configured to:

if the target login tag does not exist in the second tag and the sub-tags of the second tag, searching a third tag according to a source code of the page to be logged in, wherein the third tag is a parent tag of the second tag, and the third tag is the login related tag; searching the target login tag in the third tag and a target sub-tag of the third tag, wherein the target sub-tag is a sub-tag of the third tag except the second tag.

13. A web page login apparatus, comprising a processor and a memory, wherein the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the web page login method according to any one of claims 1 to 6.

14. A computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor, to implement the web page login method of any one of claims 1 to 6.