CN108268635B - Method and apparatus for acquiring data - Google Patents

Method and apparatus for acquiring data

Info

Publication number
CN108268635B
Authority
CN
China
Prior art keywords
login
page
target
target website
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810044597.8A
Other languages
Chinese (zh)
Other versions
CN108268635A (en)
Inventor
郑志彬
陈坤斌
骆金昌
方军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority to CN201810044597.8A
Publication of CN108268635A
Application granted
Publication of CN108268635B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the application discloses a method and a device for acquiring data. One embodiment of the method comprises: determining whether a target website is not logged in; in response to determining that the target website is not logged in, identifying a login form in a login page of the target website, and determining the category of each field in the login form; for each field in the login form, inputting a value corresponding to the category of the field to login the target website; and acquiring page data of a page presented after the target website is logged in. This embodiment improves the flexibility of information acquisition.

Description

Method and apparatus for acquiring data
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to the technical field of internet, and particularly relates to a method and a device for acquiring data.
Background
With the development of computer technology, in order to better perform data analysis, data needs to be crawled from webpages by using crawler technology. However, many websites, such as social networking websites, require logging in before their content can be obtained. Some non-social websites that have no inherent need for user login also typically require users to log in and verify their identity in order to prevent crawling.
The existing method usually adopts a manual login mode to acquire temporary login data (session cookie), and then transmits the temporary login data to the crawler, so that the crawler acquires the data.
Disclosure of Invention
The embodiment of the application provides a method and a device for acquiring data.
In a first aspect, an embodiment of the present application provides a method for acquiring data, where the method includes: determining whether a target website is not logged in; in response to determining that the target website is not logged in, identifying a login form in a login page of the target website, and determining the category of each field in the login form; for each field in the login form, inputting a value corresponding to the category of the field to login the target website; and acquiring page data of a page presented after the target website is logged in.
In some embodiments, in response to determining that the target website is not logged in, identifying a login form in a login page of the target website and determining a category of various fields in the login form, includes: extracting form information from page data of a login page of the target website in response to determining that the target website is not logged in; inputting the form information into a pre-trained logistic regression model, and identifying a login form in a login page, wherein the logistic regression model is used for identifying the type of the form; extracting the characteristic information of each field in the login form, and inputting the extracted characteristic information into a pre-trained conditional random field model to obtain the category of each field, wherein the conditional random field model is used for identifying the category of each field in the form.
In some embodiments, determining whether the target website is not logged in includes: determining whether temporary login data of a target website is stored locally; and in response to the fact that the temporary login data are stored locally, accessing a target page of the target website, and if the target page is redirected to a login page of the target website, determining that the target website is not logged in, wherein the target page is a page of the data to be acquired.
In some embodiments, determining whether the target website is not logged in includes: and responding to the fact that the temporary login data are not stored locally, and determining that the target website is not logged in.
In some embodiments, determining whether the target website is not logged in comprises: accessing a target page of a target website, wherein the target page is a page of data to be acquired; determining whether a target page contains a target character string and a login link; and responding to the target character string and/or the login link contained in the target page, and determining that the target website is not logged in.
In some embodiments, for each field in the login form, entering a value corresponding to the category of the field to login to the target website includes: for each field in the login form, inputting a value corresponding to the category of the field; and logging in the target website by using a simulated click submission mode.
In a second aspect, an embodiment of the present application provides an apparatus for acquiring data, where the apparatus includes: the determining unit is configured to determine whether the target website is not logged in; the identification unit is configured to, in response to determining that the target website is not logged in, identify a login form in a login page of the target website and determine the category of each field in the login form; the input unit is configured to input, for each field in the login form, a value corresponding to the category of the field so as to log in to the target website; and the acquisition unit is configured to acquire page data of a page presented after the target website is logged in.
In some embodiments, the identification unit comprises: the extraction module is configured to extract form information from page data of a login page of the target website in response to determining that the target website is not logged in; the first input module is configured to input form information to a pre-trained logistic regression model and identify a login form in a login page, wherein the logistic regression model is used for identifying the type of the form; and the second input module is configured to extract the feature information of each field in the login form, and input the extracted feature information into a pre-trained conditional random field model to obtain the category of each field, wherein the conditional random field model is used for identifying the category of the field in the form.
In some embodiments, the determining unit comprises: the first determining module is configured to determine whether temporary login data of the target website is stored locally; and the second determining module is configured to respond to the fact that the temporary login data are stored locally, access a target page of the target website, and determine that the target website is not logged in if the target page is redirected to the login page of the target website, wherein the target page is a page of the data to be acquired.
In some embodiments, the determining unit further comprises: and the third determining module is configured to respond to the fact that the temporary login data are not stored locally, and determine that the target website is not logged in.
In some embodiments, the determining unit comprises: the access module is configured to access a target page of the target website, wherein the target page is a page of data to be acquired; the fourth determining module is configured to determine whether the target page contains the target character string and the login link; and the fifth determining module is configured to determine, in response to the target page containing the target character string and/or the login link, that the target website is not logged in.
In some embodiments, the input unit includes: the third input module is configured to input a value corresponding to the category of each field in the login form; and the login module is configured for logging in the target website by using a simulated click submission mode.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device to store one or more programs that, when executed by one or more processors, cause the one or more processors to implement a method as in any embodiment for obtaining data.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the method for acquiring data as in any of the embodiments.
According to the method and the device for acquiring data, whether a target website is not logged in is determined, then a login form in a login page of the target website is identified after the target website is determined to be not logged in, the category of each field in the login form is determined, then a value corresponding to the category of each field is input for each field in the login form to log in the target website, and finally page data of a page presented after the target website is logged in is acquired.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for acquiring data according to the present application;
FIG. 3 is a flow diagram of yet another embodiment of a method for acquiring data according to the present application;
FIG. 4 is a schematic block diagram illustrating one embodiment of an apparatus for acquiring data according to the present application;
FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which the method for acquiring data or the apparatus for acquiring data of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and servers 103, 104, 105. The network 102 is the medium used to provide communication links between the terminal device 101 and the servers 103, 104, 105. The network 102 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal device 101 to interact with the servers 103, 104, 105 over the network 102 to receive or send messages or the like. Various communication client applications, such as a crawler application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal device 101. The terminal device 101 may be various electronic devices having a display screen and supporting web browsing, including but not limited to a tablet computer, a laptop portable computer, a desktop computer, and the like.
The servers 103, 104, 105 may be servers that provide various services, such as background web servers that provide support for web pages in different websites displayed on the terminal device 101. The background web server may analyze and perform other processing on the received data such as the login request, and feed back a processing result (e.g., a page indicated by the login request) to the terminal device.
It should be noted that the method for acquiring data provided in the embodiment of the present application is generally executed by the terminal device 101, and accordingly, the apparatus for acquiring data is generally disposed in the terminal device 101.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for acquiring data in accordance with the present application is shown. The method for acquiring data comprises the following steps:
Step 201, determining whether the target website is not logged in.
In the present embodiment, the electronic device (for example, the terminal device 101 shown in fig. 1) on which the method for acquiring data is executed may determine in various ways whether the target website is not logged in. The target website may be a website that is specified in advance by a technician, from which data needs to be acquired, and from which data can be acquired only after logging in, for example, a social networking website, a patent retrieval website, and the like.
As an example, the electronic device may store in advance the web address of a target page in the target website, where the target page may be a page, specified in advance by a technician, from which data is to be acquired. For example, if the target website is a patent search website, the result page returned for a certain search term (e.g., "face recognition method") may be used as the target page of the target website. At this time, the electronic device may first access the target page, and determine whether the target website is not logged in based on whether the target page is successfully accessed. That is, if the target page is successfully accessed, it may be determined that the target website is logged in; if the target page fails to be accessed (e.g., a web page jump occurs), it may be determined that the target website is not logged in. In practice, the web address is generally represented by a Uniform Resource Locator (URL).
In practice, if the target website is not logged in, an access to the target page is automatically redirected to the login page of the target website.
In some optional implementation manners of this embodiment, the electronic device may further determine whether the target website is not logged in by: first, the electronic device may determine whether temporary login data (session cookie) of the target website is stored locally. Then, in response to determining that the temporary login data is locally stored, the electronic device may access a target page of the target website, where the target page may be a page of data to be acquired. In practice, the technician may specify the target page in advance. In the process of accessing the target page, if the target page is redirected to a login page of the target website, it may be determined that the target website is not logged in. If the target page is not redirected to other pages, it may be determined that the target website is logged in.
In some optional implementations of the embodiment, the electronic device may determine whether temporary login data of the target website is locally stored, and in response to determining that the temporary login data is not locally stored, the electronic device may determine that the target website is not logged in.
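The two cookie-based checks above can be sketched as a single decision function. This is a minimal sketch, not the patented implementation: the name `is_not_logged_in`, the `Fetcher` callable, and the example URLs are assumptions introduced for illustration, and the actual page access is injected so the logic can be exercised without a network.

```python
from typing import Callable, Mapping, Optional

# Hypothetical type: a fetcher takes (url, cookies) and returns the final URL
# reached after any redirects have been followed.
Fetcher = Callable[[str, Mapping[str, str]], str]


def is_not_logged_in(
    target_url: str,
    login_url: str,
    session_cookies: Optional[Mapping[str, str]],
    fetch_final_url: Fetcher,
) -> bool:
    """Return True when the target website should be treated as not logged in."""
    # No temporary login data (session cookie) stored locally: not logged in.
    if not session_cookies:
        return True
    # Temporary login data exists: access the target page and see where we land.
    final_url = fetch_final_url(target_url, session_cookies)
    # A redirect to the login page means the stored session is no longer valid.
    return final_url.startswith(login_url)


# Usage with stand-in fetchers (assumed behaviour, no real requests are made).
def expired_session(url: str, cookies: Mapping[str, str]) -> str:
    return "https://example.com/login?next=/search"


def valid_session(url: str, cookies: Mapping[str, str]) -> str:
    return url
```

In a real crawler the fetcher would wrap whatever HTTP client or browser driver is in use; only the final URL after redirects matters to this check.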
In some optional implementation manners of this embodiment, the electronic device may further determine whether the target website is not logged in by the following steps: first, the electronic device may access a target page of the target website, where the target page is a page of data to be acquired. In practice, the technician may specify the target page in advance. Then, the electronic device may determine whether the target page includes a target character string and a login link. The target character string may be a character string used for characterizing the login meaning, such as "login", "login_url_pattern", and the like; the login link may be a link to access a login page, such as "https://www.zhihu.com/login/{userName}". In response to determining that the target page includes the target character string and/or the login link, the electronic device may determine that the target website is not logged in.
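The string-and-link check can be illustrated with the standard-library HTML parser. This is a sketch under stated assumptions: `looks_logged_out`, the `_LinkCollector` helper, and the `login_path` parameter are hypothetical names, and the target string is searched in the raw page data (markup included), which is a simplification.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def looks_logged_out(html: str, target_strings: Iterable[str], login_path: str) -> bool:
    """True if the page contains a target string and/or a link to the login page."""
    collector = _LinkCollector()
    collector.feed(html)
    has_login_link = any(login_path in href for href in collector.hrefs)
    lowered = html.lower()
    has_target_string = any(s.lower() in lowered for s in target_strings)
    return has_target_string or has_login_link
```

Because the two signals are combined with "or", either one alone is enough to treat the website as not logged in, matching the "and/or" wording above.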
Step 202, in response to determining that the target website is not logged in, identifying a login form in a login page of the target website, and determining the category of each field in the login form.
In this embodiment, in response to determining that the target website is not logged in, the electronic device may identify a login form in a login page of the target website, and determine a category of each field in the login form. In practice, the login form may be a form to be filled in for logging in to the target website, and the category of the fields in the login form may include, but is not limited to, one or more of the following: a username class (e.g., field "username"), a password class (e.g., field "password"), a confirmation password class (e.g., field "password confirmation"), a mailbox class (e.g., field "email"), a confirmation mailbox class (e.g., field "email confirmation"), a verification code class (e.g., field "captcha"), a remember me class (e.g., field "remember me"), a submit button class (e.g., field "submit button"), a delete button class (e.g., field "cancel button"), a search button class (e.g., field "search query"), etc.
Here, the electronic device may recognize the login form and the category of each field in the login form using various methods. As an example, the login pages of a plurality of websites may be analyzed in advance, the login form in each login page and the field categories of that form may be recorded, and the information of the login form of each website may be stored. The information of the login form of each website may include a login request submission link (post_login_url) of the website, login request related data (in JSON (JavaScript Object Notation) format), and the like. The login request related data may include, but is not limited to, a login user name form field (username_key), a login user name (username_value), a login password form field (password_key), a login password (password_value), and the like, and may further include a CSRF (cross-site request forgery) token, a verification code (captcha), a link to a verification code service, and the like. After determining that the target website is not logged in, the electronic device may extract pre-stored information of a login form of the target website, match the pre-stored information with page data of the target website, and determine the login form of the target website and a category of each field in the login form.
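A minimal sketch of the pre-stored-information approach follows. The key names post_login_url, username_key, and password_key come from the description above; the site "example.com", the field names, the function `match_stored_form`, and the simple substring matching rule are assumptions made for illustration.

```python
import json
from typing import Dict, Optional

# Hypothetical pre-stored record for one website, in JSON format.
STORED_FORMS_JSON = """
{
  "example.com": {
    "post_login_url": "https://example.com/session",
    "username_key": "user",
    "password_key": "pass"
  }
}
"""


def match_stored_form(
    site: str, page_html: str, stored_json: str = STORED_FORMS_JSON
) -> Optional[Dict[str, str]]:
    """Map field name -> category if the stored record matches the page, else None."""
    record = json.loads(stored_json).get(site)
    if record is None:
        return None
    categories = {
        record["username_key"]: "username",
        record["password_key"]: "password",
    }
    # Assumed matching rule: every stored field name must appear in the page
    # data as a name attribute; otherwise the record is treated as stale.
    for field_name in categories:
        if f'name="{field_name}"' not in page_html:
            return None
    return categories
```

The drawback of this approach is visible in the sketch: every website needs its own hand-analyzed record, and a record stops matching as soon as the site renames a field.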
Step 203, for each field in the login form, inputting a value corresponding to the category of the field to login the target website.
In this embodiment, the electronic device may store values corresponding to fields of each category in advance. As an example, the value corresponding to the field whose category is the user name may be a string of the user name. After determining the category of each field in the login form, the electronic device may extract and input a value corresponding to the category of the field. After inputting the values corresponding to the fields in the login form, the target website may be logged in to in various ways. For example, the target website may be logged in to by constructing a POST request.
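The POST-based login can be sketched with the standard library as follows. The stored values, the field names, and the function `build_login_request` are placeholders assumed for illustration; the request is only constructed here, not sent, and sites that require a CSRF token or encrypted form content would need additional fields that this sketch does not produce.

```python
from typing import Dict
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical pre-stored values, keyed by field category.
STORED_VALUES = {"username": "alice", "password": "s3cret"}


def build_login_request(
    post_url: str,
    field_categories: Dict[str, str],
    values: Dict[str, str] = STORED_VALUES,
) -> Request:
    """Fill each field with the value for its category and build a POST request."""
    payload = {
        field_name: values[category]
        for field_name, category in field_categories.items()
        if category in values  # e.g. a submit button has no stored value
    }
    body = urlencode(payload).encode("utf-8")
    return Request(
        post_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it; that call is left out so the example has no network dependency.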
In some optional implementations of this embodiment, for each field in the login form, the electronic device may input a value corresponding to the category of the field, and then log in to the target website by using a simulated click submission mode. In practice, the target website can be logged in to by using JavaScript to simulate a click on the submit button. Here, some website login pages not only perform CSRF token verification but may also encrypt the form content. Logging in by constructing a POST request requires executing that encryption logic, whereas the simulated click submission mode does not, so using it may reduce complexity.
It should be noted that the POST mode and the JavaScript simulated button click are both data submission modes and are well-known technologies widely researched and applied at present, and are not described herein again.
Step 204, acquiring page data of a page presented after the target website is logged in.
In this embodiment, after logging in the target website, the electronic device may obtain page data of a page presented after logging in the target website. In practice, after logging in the target website, the electronic device may automatically jump from the current page to the target page that is originally requested to be accessed, and at this time, the electronic device may crawl page data (e.g., a web page code in an HTML (HyperText Markup Language) format) of the target page by using various existing crawler tools or a crawler application stored in the electronic device.
According to the method provided by the embodiment of the application, whether the target website is not logged in is determined, then the login form in the login page of the target website is identified after the target website is determined to be not logged in, the category of each field in the login form is determined, then a value corresponding to the category of each field in the login form is input for logging in the target website, and finally page data of the page presented after logging in the target website is obtained, so that the user does not need to log in and set temporary login data (session cookie) manually, the user does not need to re-operate after the session (session) expires, and the flexibility of data acquisition is improved.
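The four steps of flow 200 can be summarized as one control function. This is only an illustrative sketch: the function name `acquire_page_data` and all of its callable parameters are hypothetical stand-ins for the checks, recognizers, and submission modes described above.

```python
from typing import Callable, Dict


def acquire_page_data(
    is_not_logged_in: Callable[[], bool],
    identify_fields: Callable[[], Dict[str, str]],
    submit_login: Callable[[Dict[str, str]], None],
    fetch_target_page: Callable[[], str],
    values_by_category: Dict[str, str],
) -> str:
    """Log in only when needed, then return the page data of the target page."""
    if is_not_logged_in():  # step 201
        # Step 202: field name -> category for the identified login form.
        field_categories = identify_fields()
        # Step 203: fill each field with the value stored for its category.
        filled = {
            name: values_by_category[category]
            for name, category in field_categories.items()
            if category in values_by_category
        }
        submit_login(filled)
    # Step 204: acquire the page data presented after login.
    return fetch_target_page()
```

Because the login check runs on every call, an expired session is handled the same way as a first login, which is the source of the flexibility described above.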
With further reference to FIG. 3, a flow 300 of yet another embodiment of a method for acquiring data is shown. The flow 300 of the method for acquiring data comprises the steps of:
Step 301, determining whether the target website is not logged in.
In the present embodiment, an electronic device (e.g., the terminal device 101 shown in fig. 1) on which the method for acquiring data operates may determine in various ways whether a target website is not logged in. For example, it may be determined whether temporary login data of the target website is locally stored. Then, in response to determining that the temporary login data is stored locally, a target page of the target website may be accessed, where the target page may be a page of data to be acquired. In the process of accessing the target page, if the target page is redirected to a login page of the target website, it may be determined that the target website is not logged in. If the target page is not redirected to other pages, it may be determined that the target website is logged in.
Step 302, in response to determining that the target website is not logged in, extracting form information from page data of a login page of the target website.
In this embodiment, in response to determining that the target website is not logged in, the electronic device may extract form information from page data of a login page of the target website. Specifically, the electronic device may first search for form tags (e.g., "<form>" and "</form>") in page data of the login page of the target website; then, the data contained within the form tags can be extracted; thereafter, one or more of the following may be determined from the extracted data as form information: the number of form fields, whether the form submission mode is GET or POST, the text on the submit button, the form name, the CSS (Cascading Style Sheets) class, the CSS ID (identifier), the input tags, and whether the link contains certain character strings (e.g., "login", "search", etc.). The form information is not limited to the above list, and may include other information related to the form.
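The extraction described above can be sketched with the standard-library HTML parser. The class `FormInfoExtractor`, the function `extract_form_info`, and the dictionary keys are names assumed for illustration; only input tags are counted, so submit buttons written as button elements are not handled by this sketch.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class FormInfoExtractor(HTMLParser):
    """Collects one dictionary of form information per <form> element."""

    def __init__(self) -> None:
        super().__init__()
        self.forms: List[Dict[str, Any]] = []
        self._current: Optional[Dict[str, Any]] = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            action = (a.get("action") or "").lower()
            self._current = {
                "method": (a.get("method") or "get").upper(),  # GET or POST
                "name": a.get("name") or "",
                "css_class": a.get("class") or "",
                "css_id": a.get("id") or "",
                "action_has_login": "login" in action,
                "action_has_search": "search" in action,
                "num_fields": 0,
                "submit_text": "",
            }
        elif tag == "input" and self._current is not None:
            if (a.get("type") or "text").lower() == "submit":
                self._current["submit_text"] = a.get("value") or ""
            else:
                self._current["num_fields"] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_info(html: str) -> List[Dict[str, Any]]:
    """Return the form information of every form found in the page data."""
    parser = FormInfoExtractor()
    parser.feed(html)
    return parser.forms
```

Each returned dictionary can then be turned into a numeric feature vector for a form-type classifier.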
Step 303, inputting the form information into a pre-trained logistic regression model, and identifying the login form in the login page.
In this embodiment, the electronic device may input the form information to a pre-trained logistic regression model to identify the login form in the login page, where the pre-trained logistic regression model may be used to identify the type of the form. Note that the form types may include login forms, search forms, registration forms, password recovery forms, contact forms, and the like.
Here, the above-mentioned pre-trained logistic regression model may be trained by: first, a plurality of form information samples (e.g., 1000) may be obtained, where each form information sample has a label, and the label may be used to indicate a type of a form corresponding to the form information sample. Then, by using a machine learning mode, the form information sample is used as the input of a pre-established logistic regression model (various existing logistic regression models can be used), the label carried by the form information sample is used as the output of the pre-established logistic regression model, and the pre-established logistic regression model is subjected to supervised training to obtain the trained logistic regression model. At this time, after the form information of the type to be determined is input to the trained regression model, the trained regression model may output the corresponding form type. In practice, the content output by the regression model may be probabilities belonging to each form type, and the type corresponding to the maximum probability value is the type of the form corresponding to the input form information.
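The prediction step, where the model outputs a probability per form type and the type with the maximum probability is chosen, can be sketched in plain Python. The weights, biases, feature order, and the reduced set of three form types below are hypothetical values chosen for illustration; in the scheme described above they would be obtained by supervised training on labeled form information samples.

```python
import math
from typing import Dict, List, Tuple

FORM_TYPES = ["login", "search", "registration"]

# Hypothetical parameters, one weight row per form type.
# Assumed feature order: [num_fields, is_post, action_has_login, action_has_search]
WEIGHTS = [
    [-0.2, 1.0, 3.0, -2.0],   # login
    [-1.0, -2.0, -1.0, 3.0],  # search
    [0.6, 1.0, -1.0, -2.0],   # registration
]
BIASES = [0.0, 1.0, -2.0]


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_form(features: List[float]) -> Tuple[str, Dict[str, float]]:
    """Return the most probable form type and the probability of every type."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(WEIGHTS, BIASES)
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return FORM_TYPES[best], dict(zip(FORM_TYPES, probs))
```

For example, a two-field POST form whose link contains "login" is encoded as `[2, 1, 1, 0]` and is assigned the login type by these illustrative weights.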
In practice, the form types corresponding to the multiple form information samples used for training the logistic regression model may include login forms, search forms, registration forms, password recovery forms, contact forms, and the like. Different types of fields typically exist for each type of form, e.g., login forms typically contain username fields and password fields, and registration forms typically contain email fields or telephone fields, etc.
Step 304, extracting the characteristic information of each field in the login form, and inputting the extracted characteristic information into a pre-trained conditional random field model to obtain the category of each field.
In this embodiment, the electronic device may first extract feature information of each field in the login form, where the feature information may include, but is not limited to, at least one of the following: the name of the field, the words before and after the field, the CSS class and CSS ID of the field, the title and placeholder attributes of the field, etc. Then, the electronic device may input the extracted feature information of each field into a pre-trained Conditional Random Field (CRF) model to obtain the category of the field, where the conditional random field model may be used to identify the category of the field in the form. It should be noted that the categories of the fields in the login form may include, but are not limited to: username class, password class, confirmation password class, mailbox class, confirmation mailbox class, verification code class, remember me class, submit button class, delete button class, search button class, and the like. In practice, a conditional random field model is a discriminative probabilistic model, a kind of random field, and is commonly used to label or analyze sequence data, such as natural language text. Because the fields of a form also make up a sequence, the trained conditional random field model can be used for identifying the category of each field.
Here, the pre-trained conditional random field model may be trained as follows: first, feature information of a plurality of form samples (e.g., 1000) may be obtained. Then, by using a machine learning mode, the feature information corresponding to each field in each form sample is used as the input of a pre-established conditional random field model (the existing conditional random field model can be used), the category of each field in the form is used as the output of the pre-established conditional random field model, and the pre-established conditional random field model is subjected to supervised training to obtain the trained conditional random field model.
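How a linear-chain conditional random field labels a sequence of fields can be illustrated with Viterbi decoding. This sketch covers decoding only, not training: the three labels, the transition scores, and the cue-based `emission_score` are hand-set, hypothetical stand-ins for parameters that the supervised training described above would learn.

```python
from typing import Dict, List

LABELS = ["username", "password", "submit"]

# Hypothetical transition scores TRANSITIONS[previous][current].
TRANSITIONS = {
    "username": {"username": -1.0, "password": 2.0, "submit": 0.0},
    "password": {"username": -1.0, "password": 0.0, "submit": 2.0},
    "submit": {"username": -2.0, "password": -2.0, "submit": -1.0},
}

# Hypothetical textual cues standing in for learned feature weights.
CUES = {
    "username": ["user", "login", "email"],
    "password": ["pass", "pwd"],
    "submit": ["submit", "sign in"],
}


def emission_score(field: Dict[str, str], label: str) -> float:
    """Score a label for one field from its name, placeholder and type."""
    text = " ".join(
        [field.get("name", ""), field.get("placeholder", ""), field.get("type", "")]
    ).lower()
    return sum(1.5 for cue in CUES[label] if cue in text)


def viterbi(fields: List[Dict[str, str]]) -> List[str]:
    """Return the highest-scoring label sequence for the fields, in form order."""
    if not fields:
        return []
    best = [{lab: emission_score(fields[0], lab) for lab in LABELS}]
    back: List[Dict[str, str]] = []
    for field in fields[1:]:
        scores, pointers = {}, {}
        for lab in LABELS:
            # Best previous label for reaching `lab` at this position.
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANSITIONS[p][lab])
            scores[lab] = best[-1][prev] + TRANSITIONS[prev][lab] + emission_score(field, lab)
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

The transition scores are what distinguish this from classifying each field independently: a field with no recognizable cue can still be labeled as a password when it sits between a username field and a submit button.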
It should be noted that the method for training the logistic regression model and the conditional random field model by using the machine learning method is a well-known technique widely studied and applied at present, and is not described herein again.
Step 305, for each field in the login form, inputting a value corresponding to the category of the field to login the target website.
In this embodiment, the electronic device may store values corresponding to fields of each category in advance. As an example, the value corresponding to the field whose category is the user name may be a string of the user name. After determining the category of each field in the login form, the electronic device may extract and input a value corresponding to the category of the field. After inputting the values corresponding to the fields in the login form, the target website can be logged in by using a simulated click submission mode.
Step 306, acquiring page data of the page presented after logging in to the target website.
In this embodiment, after logging in to the target website, the electronic device may obtain page data of the page presented after the login. In practice, after a successful login, the website typically redirects automatically from the login page to the target page that was originally requested; at this time, the electronic device can crawl the page data of the target page by using any of various existing crawler tools or a crawler application stored in the electronic device.
It should be noted that the operations in step 301, step 305, and step 306 are substantially the same as the operations in step 201, step 203, and step 204, and are not described again here.
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the flow 300 of the method for acquiring data in the present embodiment highlights the step of identifying the login form using the logistic regression model and the step of identifying the category of each field in the login form using the conditional random field model. Therefore, the scheme described in this embodiment does not require the login form of each website to be analyzed manually; it can automatically identify the login form and the categories of the fields in the login form, which further improves the flexibility of data acquisition.
With further reference to fig. 4, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for acquiring data, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 4, the apparatus 400 for acquiring data according to the present embodiment includes: a determination unit 401 configured to determine whether the target website is not logged in; an identification unit 402 configured to identify a login form in a login page of the target website in response to determining that the target website is not logged in, and determine a category of each field in the login form; an input unit 403 configured to input, for each field in the login form, a value corresponding to the category of the field to log in to the target website; and an obtaining unit 404 configured to obtain page data of a page presented after logging in to the target website.
In some optional implementations of the present embodiment, the identification unit 402 includes an extraction module, a first input module, and a second input module (not shown in the figure). The extraction module may be configured to extract form information from page data of a login page of the target website in response to determining that the target website is not logged in. The first input module may be configured to input the form information to a pre-trained logistic regression model, and identify a login form in the login page, where the logistic regression model is used to identify a type of the form. The second input module may be configured to extract feature information of each field in the login form, and input the extracted feature information to a pre-trained conditional random field model to obtain a category of each field, where the conditional random field model is used to identify the category of the field in the form.
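As a non-limiting illustration, the extraction module and the first input module might be sketched in Python as follows. The form information used here (the submission method and the CSS class of the form tag) follows the claims; the weights, the keyword list, the threshold, and all names are hypothetical placeholders standing in for a trained logistic regression model.

```python
import math
from html.parser import HTMLParser


class FormInfoExtractor(HTMLParser):
    """Collects the submission method and CSS class of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            a = dict(attrs)
            self.forms.append({
                "method": (a.get("method") or "get").lower(),
                "css_class": a.get("class") or "",
            })


# Hypothetical weights; a real model would learn these from labeled forms.
WEIGHTS = {"bias": -2.0, "method_is_post": 1.5, "class_mentions_login": 3.0}


def login_probability(info, weights=WEIGHTS):
    """Logistic regression score: sigmoid of a weighted sum of features."""
    css = info["css_class"].lower()
    x = {
        "method_is_post": 1.0 if info["method"] == "post" else 0.0,
        "class_mentions_login": 1.0 if ("login" in css or "signin" in css) else 0.0,
    }
    z = weights["bias"] + sum(weights[k] * v for k, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))


def find_login_form(html, threshold=0.5):
    """Returns the index of the most likely login form, or None."""
    parser = FormInfoExtractor()
    parser.feed(html)
    if not parser.forms:
        return None
    best_p, best_i = max(
        (login_probability(f), i) for i, f in enumerate(parser.forms)
    )
    return best_i if best_p >= threshold else None
```

With these placeholder weights, a form submitted by POST whose class mentions "login" scores well above the threshold, while a GET search form scores below it.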
In some optional implementations of the present embodiment, the determining unit 401 may include a first determining module and a second determining module (not shown in the figure). The first determining module may be configured to determine whether temporary login data of the target website is locally stored. The second determining module may be configured to, in response to determining that the temporary login data is locally stored, access a target page of the target website, and if the target page is redirected to a login page of the target website, determine that the target website is not logged in, where the target page is a page from which data is to be acquired.
In some optional implementations of this embodiment, the determining unit 401 may further include a third determining module (not shown in the figure). The third determining module may be configured to determine that the target website is not logged in response to determining that the temporary login data is not locally stored.
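As a non-limiting illustration, the decision made by the first, second, and third determining modules might be sketched in Python as follows. The function names and the URL comparison rule (same host and path, ignoring the query string) are hypothetical; the network access that yields the final URL after redirection is not shown.

```python
from urllib.parse import urlsplit


def same_page(url_a, url_b):
    """Compares two URLs by host and path, ignoring query and fragment."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.netloc.lower(), a.path.rstrip("/")) == (
        b.netloc.lower(), b.path.rstrip("/"))


def is_not_logged_in(has_temp_login_data, final_url, login_url):
    """Returns True when the target website should be treated as not logged in.

    has_temp_login_data: whether temporary login data (e.g. a session
        cookie) is stored locally.
    final_url: the URL reached after requesting the target page and
        following redirects; unused when no temporary login data exists.
    login_url: the URL of the login page of the target website.
    """
    if not has_temp_login_data:
        return True  # third determining module: nothing stored locally
    # second determining module: redirected to the login page
    return same_page(final_url, login_url)
```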
In some optional implementations of the present embodiment, the determining unit 401 may include an access module, a fourth determining module, and a fifth determining module (not shown in the figure). The access module may be configured to access a target page of the target website, where the target page is a page from which data is to be acquired. The fourth determining module may be configured to determine whether the target page includes a target character string and a login link. The fifth determining module may be configured to determine that the target website is not logged in response to determining that the target page includes the target character string and/or the login link.
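As a non-limiting illustration, the check performed by the fourth and fifth determining modules might be sketched in Python as follows. The default target character string "Please log in", the login path "/login", and all names are hypothetical examples; an actual target string and login link would depend on the target website.

```python
from html.parser import HTMLParser


class LoginHintParser(HTMLParser):
    """Records the visible text and whether any link points to the login path."""

    def __init__(self, login_path):
        super().__init__()
        self.login_path = login_path
        self.has_login_link = False
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.login_path in href:
                self.has_login_link = True

    def handle_data(self, data):
        self.text_parts.append(data)


def looks_logged_out(html, target_string="Please log in", login_path="/login"):
    """True if the page contains the target string and/or a login link."""
    parser = LoginHintParser(login_path)
    parser.feed(html)
    has_target_string = target_string in "".join(parser.text_parts)
    return has_target_string or parser.has_login_link
```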
In some optional implementations of the present embodiment, the input unit 403 may include a third input module and a login module (not shown in the figure). The third input module may be configured to input, for each field in the login form, a value corresponding to a category of the field. The login module may be configured to log in the target website by using a simulated click submission method.
In the apparatus provided by the above embodiment of the present application, the determining unit 401 determines whether the target website is not logged in. After it is determined that the target website is not logged in, the identifying unit 402 identifies the login form in the login page of the target website and determines the category of each field in the login form. The input unit 403 then inputs, for each field in the login form, a value corresponding to the category of the field to log in to the target website, and finally the obtaining unit 404 obtains the page data of the page presented after logging in to the target website. As a result, the user does not need to log in manually and set the temporary login data (session cookie), and does not need to repeat the operation manually after the session expires, thereby improving the flexibility of data acquisition.
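As a non-limiting illustration, the cooperation of the four units might be sketched in Python as follows. Each unit is represented by a caller-supplied function; the function name acquire_data and its parameter names are hypothetical, and the sketch shows only the order in which the units are invoked.

```python
def acquire_data(determine_not_logged_in, identify_form, log_in, fetch_page):
    """Runs the four units in order and returns the acquired page data.

    determine_not_logged_in: callable returning True if login is needed.
    identify_form: callable returning the login form fields with categories.
    log_in: callable that inputs the stored values and submits the form.
    fetch_page: callable returning the page data of the target page.
    """
    if determine_not_logged_in():
        fields_with_categories = identify_form()
        log_in(fields_with_categories)
    return fetch_page()
```

Because login is attempted only when the determination indicates that it is needed, the same flow can be rerun after a session expires without manual intervention.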
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read therefrom is installed into the storage section 508 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including a determination unit, an identification unit, an input unit, and an obtaining unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the determination unit may also be described as a "unit that determines whether the target website is not logged in".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: determine whether a target website is not logged in; in response to determining that the target website is not logged in, identify a login form in a login page of the target website, and determine the category of each field in the login form; for each field in the login form, input a value corresponding to the category of the field to log in to the target website; and acquire page data of a page presented after the target website is logged in.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A method for acquiring data, comprising:
determining whether the target website is not logged in;
in response to determining that the target website is not logged in, identifying a login form in a login page of the target website and determining categories of fields in the login form, including: searching a form label from page data of a login page of the target website; extracting data contained in the form label; determining, from the extracted data, one or more of the following as form information: the form submission mode and the cascading style sheet style class used for setting or returning the control; inputting the form information into a pre-trained logistic regression model, and identifying a login form in the login page, wherein the logistic regression model is used for identifying the type of the form; extracting feature information of each field in the login form, and inputting the extracted feature information into a pre-trained conditional random field model to obtain the category of each field, wherein the conditional random field model is used for identifying the category of the field in the form;
for each field in the login form, inputting a value corresponding to the category of the field to login the target website;
and acquiring page data of a page presented after the target website is logged in.
2. The method for obtaining data of claim 1, wherein the determining whether the target website is not logged in comprises:
determining whether temporary login data of a target website is stored locally;
and in response to determining that the temporary login data is locally stored, accessing a target page of the target website, and if the target page is redirected to a login page of the target website, determining that the target website is not logged in, wherein the target page is a page of data to be acquired.
3. The method for obtaining data of claim 2, wherein the determining whether the target website is not logged in comprises:
in response to determining that the temporary login data is not stored locally, determining that the target website is not logged in.
4. The method for obtaining data of claim 1, wherein said determining whether the target website is not logged in comprises:
accessing a target page of the target website, wherein the target page is a page of data to be acquired;
determining whether the target page contains a target character string and a login link;
and in response to determining that the target character string and/or the login link are contained in the target page, determining that the target website is not logged in.
5. The method for obtaining data according to claim 1, wherein the inputting, for each field in the login form, a value corresponding to a category of the field to log in the target website comprises:
for each field in the login form, inputting a value corresponding to the category of the field;
and logging in the target website by using a simulated click submission mode.
6. An apparatus for acquiring data, comprising:
a determination unit configured to determine whether the target website is not logged in;
an identification unit configured to, in response to determining that the target website is not logged in, identify a login form in a login page of the target website and determine the category of each field in the login form, the identification unit comprising: an extraction module configured to search for a form tag in page data of the login page of the target website, extract data contained in the form tag, and determine, from the extracted data, one or more of the following as form information: the form submission mode and the cascading style sheet style class used for setting or returning the control; a first input module configured to input the form information into a pre-trained logistic regression model and identify the login form in the login page, wherein the logistic regression model is used for identifying the type of a form; and a second input module configured to extract feature information of each field in the login form and input the extracted feature information into a pre-trained conditional random field model to obtain the category of each field, wherein the conditional random field model is used for identifying the category of a field in a form;
an input unit configured to input, for each field in the login form, a value corresponding to the category of the field so as to log in to the target website;
and an acquisition unit configured to acquire page data of a page presented after the target website is logged in.
7. The apparatus for acquiring data of claim 6, wherein the determining unit comprises:
the first determining module is configured to determine whether temporary login data of the target website is stored locally;
and the second determining module is configured to access a target page of the target website in response to determining that the temporary login data is locally stored, and determine that the target website is not logged in if the target page is redirected to a login page of the target website, wherein the target page is a page of data to be acquired.
8. The apparatus for acquiring data of claim 7, wherein the determining unit further comprises:
and the third determining module is configured to respond to the fact that the temporary login data is not stored locally, and determine that the target website is not logged in.
9. The apparatus for acquiring data of claim 6, wherein the determining unit comprises:
an access module configured to access a target page of the target website, wherein the target page is a page from which data is to be acquired;
the fourth determination module is configured to determine whether the target page contains a target character string and a login link;
a fifth determining module, configured to determine that the target website is not logged in response to determining that the target page includes the target character string and/or the login link.
10. The apparatus for acquiring data of claim 6, wherein the input unit comprises:
a third input module configured to input, for each field in the login form, a value corresponding to a category of the field;
and the login module is configured for logging in the target website by utilizing a simulated click submission mode.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201810044597.8A 2018-01-17 2018-01-17 Method and apparatus for acquiring data Active CN108268635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810044597.8A CN108268635B (en) 2018-01-17 2018-01-17 Method and apparatus for acquiring data

Publications (2)

Publication Number Publication Date
CN108268635A CN108268635A (en) 2018-07-10
CN108268635B true CN108268635B (en) 2022-06-24

Family

ID=62775817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810044597.8A Active CN108268635B (en) 2018-01-17 2018-01-17 Method and apparatus for acquiring data

Country Status (1)

Country Link
CN (1) CN108268635B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant