CN113918789A - Web page element searching method and device and computing equipment - Google Patents


Publication number
CN113918789A
Authority
CN
China
Prior art keywords
target field
template
user
target
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111206055.4A
Other languages
Chinese (zh)
Inventor
刘毅
邢万祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hainan Chezhiyi Communication Information Technology Co ltd
Original Assignee
Hainan Chezhiyi Communication Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hainan Chezhiyi Communication Information Technology Co ltd
Priority to CN202111206055.4A
Publication of CN113918789A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web page element searching method, a corresponding apparatus, and a computing device. The method comprises: in response to a user's request on a Chrome browser to search for web page elements, sending the domain name and URL (uniform resource locator) of the current site to a server, and then obtaining a template returned by the server, wherein the template records a target field and an extraction rule for the target field; when the extraction rule of the target field is absolute positioning, extracting the target field from the current page using one of an XPath selector, a CSS selector, and an ID selector; when the extraction rule of the target field is relative positioning, extracting a target container from the current page and extracting the target field from the target container; and using the extracted target field as the search result for the web page element.

Description

Web page element searching method and device and computing equipment
Technical Field
The invention relates to the field of web crawlers, and in particular to a web page element searching method and apparatus and a computing device.
Background
Capturing data from internet websites is a common requirement. The usual practice is to write a program or script that fetches a response from the target website and then parses the required element fields out of the response. When there are many fields and many target websites, and when the target elements change with the data, the difficulty and workload of parsing the element fields increase accordingly.
One way to find web page elements is manual parsing, in which a programmer writes a custom fetcher for each target website, requests a response from the target website over the network, and then parses the required fields one by one from the HTML of the response. Before parsing the fields, the developer must analyze the web page structure of the target website, work out a selector for each field, and then locate the required field through that selector. The disadvantage of this solution is that developers must analyze the page structure and write selectors that work across multiple pages, which is difficult, and it is not easy to test whether a selector also suits other pages of the same type. As front-end technology has developed, front-end/back-end separation has become more common, so many page elements have no fixed ID and no fixed CSS class, and some data is dynamic. This makes it hard to write a good selector. In particular, when extracting a single item from table data it is very easy to get the wrong data. For example, a table shows three fields, as shown in Table 1:
Table 1: A table with three fields
NAME    Zhang San
AGE     20
SEX     Male
Suppose the target element is the AGE value, 20. It is tempting to fetch it by table row index. However, if some page does not show the AGE field, the table data looks like Table 2:
Table 2: A table without the AGE field
NAME    Zhang San
SEX     Male
In this case the row index shifts, and the value is parsed as "Male", which is erroneous data.
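The index shift described for Table 1 and Table 2 can be reproduced in a few lines. This is only an illustrative sketch: the rows are modeled as plain (label, value) pairs, and the names `value_at_row` and `AGE_ROW` are hypothetical, not taken from the patent.

```python
# Rows as (label, value) pairs, mirroring Table 1 and Table 2.
TABLE_1 = [("NAME", "Zhang San"), ("AGE", "20"), ("SEX", "Male")]
TABLE_2 = [("NAME", "Zhang San"), ("SEX", "Male")]  # the AGE row is not shown


def value_at_row(rows, index):
    """Index-based lookup: assumes the wanted row always sits at the same position."""
    if index >= len(rows):
        return None
    return rows[index][1]


AGE_ROW = 1  # position of the AGE row in Table 1 (hypothetical constant)

print(value_at_row(TABLE_1, AGE_ROW))  # 20
print(value_at_row(TABLE_2, AGE_ROW))  # Male, which is not an age
```

The second call returns a value without any error, which is why this kind of mistake is hard to notice during crawling.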
Writing a correct selector takes many attempts, which is time-consuming, and once a selector is written it must still be tested against other pages of the same type. Taking Table 1 as an example, after completing a selector that extracts the age, one has to check whether the age can also be extracted on other pages, and such testing is inconvenient.
The second way of finding web page elements is visual collection with the Portia project. Portia embeds the target website in an iframe, captures the elements of the target web page through visual operations to achieve rapid recording, crawls the pages with Scrapy, and finally parses each captured response using the collected web elements. With this approach the program generates selectors automatically from simple point-and-click operations, which reduces developers' workload. Its drawbacks are poor support for pages with front-end/back-end separation and inflexibility on special pages. Portia integrates the target website into its own operation interface through an iframe, and because of the security restrictions on iframes it uses a back-end proxy to help render the page. As a result, Portia cannot parse or operate normally on pages with front-end/back-end separation. In modern websites, with the popularity of frameworks such as Vue and React, more and more sites adopt front-end/back-end separation and the React framework, but Portia's visual collection does not support React, so its range of application is limited.
The third way of finding web page elements is the XPath Helper Chrome plug-in. XPath Helper is implemented as a Chrome browser plug-in: hovering the mouse over a position displays the XPath of the target object and highlights the target. However, the plug-in has a single function. It only shows the XPath of the element under the mouse, the resulting XPath can only serve as a reference for the spider, and in many cases it cannot locate elements accurately on pages with dynamic layouts. The main reason is that some websites change dynamically, so different data produces a different displayed layout. In modern websites, with the rise of frameworks such as Vue and React, asynchronous loading is widespread and JavaScript can easily add and remove large amounts of content. A sampled XPath is therefore easily invalidated, and once invalidated it can only be resampled and the spider code modified. For example, the XPath collected by XPath Helper sometimes contains randomly generated style names, which change after the website releases a new version and cause the XPath to fail. In addition, the XPath Helper Chrome plug-in has the following problems: it cannot persist its results, so it has no memory and loses the selected content after the page is refreshed; it cannot select several fields at once, so each XPath must be copied manually; and it cannot select overlapping elements, because the element under the mouse pointer is always the topmost one.
In summary, existing crawler technology suffers from problems such as the high difficulty of developing web page element plug-ins and insufficient adaptability to dynamic pages.
Disclosure of Invention
To this end, the present invention provides a method, an apparatus and a computing device for finding web page elements, in an attempt to solve or at least alleviate at least one of the problems presented above.
According to one aspect of the invention, a web page element searching method is provided, comprising: in response to a user's request on a Chrome browser to search for web page elements, sending the domain name and URL (uniform resource locator) of the current site to a server, and obtaining a template returned by the server that matches the domain name and URL of the current site, wherein the template records a target field and an extraction rule for the target field; when the extraction rule of the target field is absolute positioning, extracting the target field from the current page using one of an XPath selector, a CSS selector, and an ID selector; when the extraction rule of the target field is relative positioning, extracting a target container from the current page and extracting the target field from the target container, wherein the target container is an element whose position relative to a preset node satisfies a preset condition; and using the extracted target field as the search result for the web page element.
Optionally, in the web page element searching method according to the present invention, the step of extracting the target field from the current page using one of an XPath selector, a CSS selector, and an ID selector comprises: extracting the target field from the current page using the XPath selector; when extraction with the XPath selector fails, extracting the target field from the current page using the CSS selector; and when extraction with the CSS selector fails, extracting the target field from the current page using the ID selector.
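The XPath, then CSS, then ID fallback order can be sketched as a chain of extractors tried in sequence. This is a minimal sketch under stated assumptions, not the patent's implementation: the page markup and all function names are hypothetical, and the limited XPath subset of Python's standard-library `xml.etree.ElementTree` stands in for all three selector types (in a real Chrome plug-in the DOM calls `document.evaluate`, `document.querySelector`, and `document.getElementById` would play these roles).

```python
import xml.etree.ElementTree as ET

# Hypothetical page: well-formed markup so the standard-library parser accepts it.
PAGE = "<html><body><div><span id='age' class='field'>20</span></div></body></html>"


def make_extractor(path):
    """Build an extractor that returns the stripped text at `path`, or None."""
    def extract(root):
        node = root.find(path)
        if node is None or node.text is None:
            return None
        return node.text.strip() or None
    return extract


def extract_with_fallback(root, extractors):
    """Try each (name, extractor) pair in order and return the first success."""
    for name, extractor in extractors:
        try:
            value = extractor(root)
        except SyntaxError:  # ElementTree raises this for an unsupported path
            value = None
        if value is not None:
            return name, value
    return None, None


extractors = [
    ("xpath", make_extractor("./body/div/p/span")),      # stale path: no <p> on this page
    ("css", make_extractor(".//span[@class='field']")),  # stands in for span.field
    ("id", make_extractor(".//*[@id='age']")),           # stands in for #age
]

root = ET.fromstring(PAGE)
print(extract_with_fallback(root, extractors))  # ('css', '20')
```

Here the first selector has gone stale, so the chain falls through to the second one. Returning the name of the selector that succeeded makes it easy to see which rule is still valid for a given page.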
Optionally, in the web page element searching method according to the present invention, the step of extracting the target container from the current page comprises: extracting the target container from the current page according to the position of the target container relative to the preset node, as recorded in the template.
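One way to read relative positioning is sketched below. The assumptions are loud ones: the preset node is taken to be the cell whose text equals a keyword such as AGE, the preset condition is taken to be "the sibling immediately after it", and the markup and function names are hypothetical. The patent does not specify these details.

```python
import xml.etree.ElementTree as ET

TABLE_WITH_AGE = (
    "<table>"
    "<tr><td>NAME</td><td>Zhang San</td></tr>"
    "<tr><td>AGE</td><td>20</td></tr>"
    "<tr><td>SEX</td><td>Male</td></tr>"
    "</table>"
)
TABLE_WITHOUT_AGE = (
    "<table>"
    "<tr><td>NAME</td><td>Zhang San</td></tr>"
    "<tr><td>SEX</td><td>Male</td></tr>"
    "</table>"
)


def find_target_container(root, keyword, offset=1):
    """Return the sibling `offset` places after the node whose text is `keyword`."""
    for parent in root.iter():
        siblings = list(parent)
        for position, node in enumerate(siblings):
            if (node.text or "").strip() == keyword:
                target = position + offset
                if 0 <= target < len(siblings):
                    return siblings[target]
                return None
    return None


def extract_relative(markup, keyword):
    """Locate the container relative to the keyword node, then read the field from it."""
    container = find_target_container(ET.fromstring(markup), keyword)
    if container is None or container.text is None:
        return None
    return container.text.strip()


print(extract_relative(TABLE_WITH_AGE, "AGE"))     # 20
print(extract_relative(TABLE_WITHOUT_AGE, "AGE"))  # None
```

Because the lookup is anchored on the keyword rather than on a row index, a missing row produces no result instead of a wrong one.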
Optionally, the web page element searching method according to the present invention further comprises: in response to a user's request on the Chrome browser to record a template, saving the field name and field type newly added by the user; in response to a positioning mode selection request from the user on a Chrome browser page, determining the positioning mode selected by the user, the positioning mode being either absolute positioning or relative positioning; in response to a sample acquisition request from the user on a Chrome browser page, saving the element selected by the user on the current page as a sample; locating the sample selected by the user according to the selected positioning mode, and taking the process of locating the sample as the extraction rule of the target field; and in response to a user's request on the Chrome browser to save the template, saving the recorded template and periodically sending the saved template and its URL to the server.
Optionally, the web page element searching method according to the present invention further comprises: taking the rule set by the user for cleaning the web page element as the search expression of the target field.
Optionally, in the web page element searching method according to the present invention, the template further records a search expression for the target field, and the step of using the extracted target field as the search result for the web page element further comprises: cleaning the extracted target field according to the search expression of the target field in the template; and taking the cleaned field as the final search result for the web page element.
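The cleaning step can be illustrated with a short sketch. It assumes the search expression is a regular expression, which is consistent with the regular-expression extraction rule mentioned among the beneficial effects but is not spelled out in this passage. The function name `clean_field` and the sample strings are hypothetical.

```python
import re
from typing import Optional


def clean_field(raw: str, search_expression: Optional[str]) -> Optional[str]:
    """Clean an extracted field with a regular expression (assumed form of the rule).

    With no expression the text is only stripped. With an expression, the first
    capture group is returned if the pattern has one, otherwise the whole match.
    """
    if not search_expression:
        return raw.strip()
    match = re.search(search_expression, raw)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


print(clean_field("Price: 128,000 yuan", r"([\d,]+)\s*yuan"))  # 128,000
print(clean_field("  20 years old ", r"\d+"))                  # 20
print(clean_field("not available", r"\d+"))                    # None
```

Returning `None` on a failed match keeps a cleaning failure distinguishable from an empty field.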
According to another aspect of the present invention, a web page element searching apparatus is provided, comprising: a template acquisition unit adapted to, in response to a user's request on a Chrome browser to search for web page elements, send the domain name and URL of the current site to a server and obtain a template returned by the server that matches the domain name and URL of the current site, wherein the template records a target field and an extraction rule for the target field; an absolute positioning extraction unit adapted to extract the target field from the current page using one of an XPath selector, a CSS selector, and an ID selector when the extraction rule of the target field is absolute positioning; a relative positioning extraction unit adapted to extract a target container from the current page and extract the target field from the target container when the extraction rule of the target field is relative positioning, wherein the target container is an element whose position relative to a preset node satisfies a preset condition; and a presentation unit adapted to use the extracted target field as the search result for the web page element.
Optionally, in the web page element searching apparatus according to the present invention, the absolute positioning extraction unit further comprises: an XPath extraction subunit adapted to extract the target field from the current page using the XPath selector; a CSS extraction subunit adapted to extract the target field from the current page using the CSS selector when extraction with the XPath selector fails; and an ID extraction subunit adapted to extract the target field from the current page using the ID selector when extraction with the CSS selector fails.
Optionally, the web page element searching apparatus according to the present invention further comprises: a field adding unit adapted to, in response to a user's request on the Chrome browser to record a template, save the field name and field type newly added by the user; a positioning selection unit adapted to, in response to a positioning mode selection request from the user on a Chrome browser page, determine the positioning mode selected by the user, the positioning mode being either absolute positioning or relative positioning; a sample acquisition unit adapted to, in response to a sample acquisition request from the user on a Chrome browser page, save the element selected by the user on the current page as a sample; a sample positioning unit adapted to locate the sample selected by the user according to the selected positioning mode and to take the process of locating the sample as the extraction rule of the target field; and a template storage unit adapted to, in response to a user's request on the Chrome browser to save the template, save the recorded template and periodically send the saved template and its URL to the server.
Optionally, the web page element searching apparatus according to the present invention further comprises: a cleaning setting unit adapted to take the rule set by the user for cleaning the web page element as the search expression of the target field.
Optionally, in the web page element searching apparatus according to the present invention, the template further records a search expression for the target field, and the presentation unit further comprises: a cleaning subunit adapted to clean the extracted target field according to the search expression of the target field in the template; and a display subunit adapted to take the cleaned field as the final search result for the web page element.
According to another aspect of the present invention, there is also provided a web page element search system, including: a client adapted to perform the web page element lookup method of the present invention; and the server is connected with the client, is suitable for receiving and storing the template sent by the client, is also suitable for receiving the domain name and the URL of the current site sent by the client, performs template matching according to the template list, the domain name and the URL of the current site, and sends the matched template to the client.
According to another aspect of the present invention, there is also provided a computing device comprising: at least one processor and a memory storing program instructions; the program instructions, when read and executed by a processor, cause a computing device to perform the web page element lookup method as above.
According to still another aspect of the present invention, there is also provided a readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the web page element lookup method as above.
According to the web page element searching method and device and the computing equipment, at least one of the following beneficial effects can be realized:
the field template mode (that is, extracting target elements according to the target fields and extraction rules recorded in the template) improves the universality of the template, and matching templates by URL regular expression improves it further, so that the crawler can extract the required elements through the maintained site and the field template corresponding to the URL;
template recording is visual, requires no editing of software code, and is fast;
collecting web page elements through a Chrome browser plug-in improves the usability and testability of the web page element searching method and is compatible with all web pages;
the search path is generated automatically, taking both absolute positioning and relative positioning into account, and multiple extraction schemes are applied according to rules, which increases the probability of successful extraction;
operation is visual: the user clicks the required elements on the page with the mouse and a selector is generated automatically, which reduces development workload;
with the Chrome plug-in approach, the plug-in's JS code runs in a sandbox, so the target website is not affected and all websites are compatible;
selected elements are prominently marked, giving a what-you-see-is-what-you-get experience with high usability;
the selectors of selected page elements can be persisted, and after a refresh the page elements are re-marked according to the generated selectors, which makes it easy to test whether a selector applies to other pages of the same type: the sample data is redisplayed after the page is refreshed and also after switching to a page of the same type with different data;
the fault-tolerance mechanism uses an ID selector, a CSS selector, and an XPath selector, and extraction counts as successful as long as any one selector can extract the target field, which improves the success rate of element extraction;
for dynamic data, the relative positioning search function can automatically generate an XPath that handles dynamic table data by means of a container whose relative position is fixed and a keyword whose relative position is fixed;
the target element type can be specified (text, numbers, URLs, and so on are supported) and the regular-expression extraction rule can be customized, which refines the target rule and makes the extraction result more accurate;
URL matching is supported: multiple page field templates can be recorded on multiple pages of one site, and at extraction time the corresponding field template is found by URL and used for parsing, which improves template reusability and universality.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a web page element lookup system 100 according to one embodiment of the invention;
FIG. 2 shows a schematic diagram of a computing device 200, according to one embodiment of the invention;
FIG. 3 illustrates a flow diagram of a web page element lookup method 300 according to one embodiment of the invention;
FIG. 4 illustrates a screenshot of a browser page looking up a target field using absolute positioning, according to an embodiment of the invention;
FIG. 5 illustrates a flow diagram of a recording template according to one embodiment of the invention;
FIG. 6 shows a schematic diagram of a web page element finding apparatus 400 according to one embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
To address the problems of existing web page element searching approaches, such as high plug-in development difficulty and poor adaptability to dynamic pages, the invention provides a web page element searching method that reduces the difficulty of plug-in development and widens the plug-in's range of application.
FIG. 1 shows a schematic diagram of a web page element lookup system 100 according to one embodiment of the invention.
As shown in fig. 1, the web page element search system 100 includes a server 102 and a plurality of clients 101, where the server 102 is communicatively connected to the plurality of clients 101, for example, through a wired or wireless network.
In an embodiment of the invention, the client 101 is adapted to perform a web page element searching method. The server 102 may be an application running on a server machine. The server 102 stores a template list for each site, and each template list records all templates of the corresponding site. The server 102 is adapted to receive and store templates sent by the client 101. It is further adapted to receive the domain name and URL of the current site sent by the client 101, search the stored template list of the current site for a template matching that domain name and URL, and send the matched template to the corresponding client 101.
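The server-side matching can be sketched as a lookup in a per-site template list followed by a URL test. This sketch assumes each template carries a regular expression that the page URL must match, in line with the URL regular-expression matching named among the beneficial effects. The `Template` class, the `TEMPLATE_LISTS` store, the domain, and the URL patterns are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Template:
    template_id: int
    url_pattern: str  # regular expression the page URL must match (assumed)


# Hypothetical template lists keyed by domain name, one list per site.
TEMPLATE_LISTS: Dict[str, List[Template]] = {
    "cars.example.com": [
        Template(1, r"^https://cars\.example\.com/detail/\d+$"),
        Template(2, r"^https://cars\.example\.com/dealer/\d+$"),
    ],
}


def match_template(domain: str, url: str) -> Optional[Template]:
    """Return the first template of the site whose URL pattern matches, else None."""
    for template in TEMPLATE_LISTS.get(domain, []):
        if re.search(template.url_pattern, url):
            return template
    return None


print(match_template("cars.example.com", "https://cars.example.com/dealer/42"))
print(match_template("cars.example.com", "https://cars.example.com/about"))  # None
```

Keying the store by domain first keeps the regular-expression tests limited to the templates of one site.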
The web page element lookup method 300 of the present invention will be described in detail below.
In one embodiment, the client 101 of the present invention may be implemented as a computing device such that the web page element lookup method of the present invention may be performed in the computing device. The computing device may be any device with storage and computing capabilities, and may be implemented as, for example, a server, a workstation, or the like, or may be implemented as a personal computer such as a desktop computer or a notebook computer, or may be implemented as a terminal device such as a mobile phone, a tablet computer, a smart wearable device, or an internet of things device, but is not limited thereto.
FIG. 2 shows a schematic diagram of a computing device 200, according to one embodiment of the invention. It should be noted that the computing device 200 shown in FIG. 2 is only an example. In practice, the computing device that implements the web page element searching method of the present invention may be any type of device, and its hardware configuration may be the same as or different from that of the computing device 200 shown in FIG. 2. Such a computing device may add or remove hardware components relative to the computing device 200 shown in FIG. 2, and the present invention does not limit the specific hardware configuration of the computing device.
As shown in FIG. 2, in a basic configuration 202, a computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (µP), a microcontroller (µC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. Example processor cores 214 may include Arithmetic Logic Units (ALUs), Floating Point Units (FPUs), digital signal processing cores (DSP cores), or any combination thereof. The example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. The application 222 is in fact a set of program instructions that direct the processor 204 to perform corresponding operations. In some embodiments, application 222 may be arranged to cause processor 204 to operate with program data 224 on an operating system.
Computing device 200 may also include a storage interface bus 234. The storage interface bus 234 enables communication from the storage devices 232 (e.g., removable storage 236 and non-removable storage 238) to the basic configuration 202 via the bus/interface controller 230. At least a portion of the operating system 220, applications 222, and data 224 may be stored on removable storage 236 and/or non-removable storage 238, and loaded into system memory 206 via storage interface bus 234 and executed by the one or more processors 204 when the computing device 200 is powered on or the applications 222 are to be executed.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 260, which may be arranged to facilitate communications with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied as computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In a computing device 200 according to the invention, the application 222 includes a plurality of program instructions for performing the web page element searching method 300. These instructions direct the processor 204 to perform the method 300, so that the computing device 200 performs the web page element searching method 300 of the invention.
FIG. 3 illustrates a flow diagram of a web page element lookup method 300 according to one embodiment of the invention. The method 300 is performed in a computing device, such as the computing device 200 described above. As shown in fig. 3, the method 300 begins at step S310.
According to an embodiment of the invention, the web page element searching method 300 provides the user with a web page element search function in the form of a browser plug-in; because it is based on a Chrome plug-in, it can be compatible with all web pages. It should be noted that the browser referred to here is the Chrome browser.
In step S310, in response to a request of a user to search for web page elements on the Chrome browser, the domain name and URL (uniform resource locator) of the current site are sent to a server, and a template is then obtained that the server has matched against its template list according to the domain name and URL of the current site and returned. The template records a target field and an extraction rule for the target field.
In step S310, the user clicks the web page element search plug-in on the Chrome browser of the client 101. When the user clicks the search web page element button in the pop-up dialog box, step S310 starts, and the domain name and URL of the current site (which are used to look up the corresponding template) are sent to the server 102. The server 102 stores a site table, a template list, and templates: the site table records the sites to be accessed, the template list records the templates and their numbers, and each template records target fields and the extraction rules of those target fields, for example extraction rules for different types of fields such as price, URL, mailbox, and phone number. A site may have multiple templates, with different templates being applicable to different web pages of the same site, and one template may have multiple target fields and corresponding extraction rules. The server 102 performs template matching in the template list of the current site according to the domain name and URL sent by the client 101 and feeds the matched template back to the client 101. In the following steps, the method 300 searches for and extracts web page elements according to the target field and extraction rule recorded in the template. If the extraction rule of the target field in the returned template is absolute positioning, which is applicable to static pages, step S320 is executed next; if the extraction rule is relative positioning, which is applicable to dynamic pages, step S330 is executed next.
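The server-side matching in step S310 can be illustrated with a minimal sketch. The source does not define a template schema or a matching algorithm, so everything below is an assumption made for illustration: the `Template` fields, the use of a URL regular expression stored with each template, and the sample data are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Template:
    # Hypothetical schema; the source only says a template records
    # target fields and their extraction rules.
    template_id: int
    domain: str
    url_pattern: str   # URL regular expression saved together with the template
    positioning: str   # "absolute" or "relative"
    fields: dict = field(default_factory=dict)  # target field -> extraction rule


def match_template(templates: list, url: str) -> Optional[Template]:
    """Return the first template whose domain and URL pattern match the page."""
    domain = urlparse(url).hostname or ""
    for tpl in templates:
        if tpl.domain == domain and re.fullmatch(tpl.url_pattern, url):
            return tpl
    return None


def next_step(tpl: Optional[Template]) -> Optional[str]:
    """Choose the step that follows S310 based on the extraction rule."""
    if tpl is None:
        return None  # no usable template; the user would record one
    return "S320" if tpl.positioning == "absolute" else "S330"


# Illustrative data only: one site with two templates for different pages.
TEMPLATES = [
    Template(1, "example.com", r"https://example\.com/item/\d+", "absolute",
             {"price": "./body/section[2]/span"}),
    Template(2, "example.com", r"https://example\.com/list\?.*", "relative",
             {"price": {"container": ".//div[@id='rows']", "keyword": "price"}}),
]

matched = match_template(TEMPLATES, "https://example.com/item/42")
print(matched.template_id if matched else None, next_step(matched))
```

Matching on the domain first and then on a per-template URL pattern reflects the statement that one site may have several templates for different pages; a real server could equally index templates by site before testing patterns.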
Retrieving the template from the server 102 makes templates highly reusable: in most cases no new crawler code needs to be developed, and a data capture task can be completed with a common general-purpose crawler alone.
In the absolute positioning mode, the position of the target field refers to its ordinal position within the DOM of the page loaded in the browser. For example, the target field is a title, which is located in the second section.
In step S320, when the extraction rule of the target field is absolute positioning, the target field is extracted from the current page using an Xpath selector, a CSS selector, and an ID selector. The three selectors work together, and a priority is set for each of them, so as to ensure that the required target element is extracted accurately.
In one embodiment, step S320 may include steps S321 to S323.
In step S321, the target field is extracted in the current page using the Xpath selector, that is, the target field is matched in the current page according to the Xpath of the target field recorded in the template.
In step S322, when the target field in the current page fails to be extracted by the Xpath selector, the target field in the current page is extracted by the CSS selector, that is, when the target field in the current page is not matched by the Xpath selector, the target field is matched in the current page by the CSS selector.
In step S323, when the extraction of the target field in the current page by the CSS selector fails, the target field is extracted in the current page by the ID selector, that is, when the target field is not matched in the current page by the CSS selector, the target field is matched in the current page by the ID selector.
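The fallback order of steps S321 to S323 can be sketched as follows. This is an illustration under stated assumptions, not the plug-in's implementation: it runs outside a browser on a small XML document using Python's standard `xml.etree.ElementTree`, and the CSS and ID selectors are emulated with attribute lookups. The rule keys `xpath`, `css_class`, and `id` are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
  <section><h1>Title</h1></section>
  <section><span class="price" id="p1">100</span></section>
</body></html>"""


def by_xpath(root, rule):
    # S321: match by the Xpath recorded in the template.
    path = rule.get("xpath")
    return root.find(path) if path else None


def by_css_class(root, rule):
    # S322: stand-in for a CSS selector such as ".price".
    cls = rule.get("css_class")
    return root.find(f".//*[@class='{cls}']") if cls else None


def by_id(root, rule):
    # S323: stand-in for an ID selector such as "#p1".
    el_id = rule.get("id")
    return root.find(f".//*[@id='{el_id}']") if el_id else None


def extract_absolute(root, rule):
    """Try each selector in priority order and return (selector name, text)."""
    for name, selector in (("xpath", by_xpath), ("css", by_css_class), ("id", by_id)):
        element = selector(root, rule)
        if element is not None:
            return name, (element.text or "").strip()
    return None, None


root = ET.fromstring(PAGE)
# The recorded Xpath points at a third section that does not exist,
# so extraction falls through to the CSS stand-in.
stale_rule = {"xpath": "./body/section[3]/span", "css_class": "price", "id": "p1"}
print(extract_absolute(root, stale_rule))
```

Returning the name of the selector that succeeded is a design choice of this sketch; it makes it easy to see when a template's primary Xpath has gone stale and a lower-priority selector is doing the work.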
In the relative positioning mode, the position of the target field is the position of the target field relative to an element on the DOM.
In step S330, when the extraction rule of the target field is relative positioning, extracting a target container in the current page, and extracting the target field in the target container, where the target container represents an element whose relative position with respect to the preset node satisfies a preset condition.
Taking price as an example, the web page element the user wants to find is a price value; for example, "price 100 yuan" appears in the page, and "100" is the specific price value to be found. In a dynamic web page the content changes, so if absolute positioning is still used, for example positioning to the second section, the extracted element may not be the one the user requires. Assuming that the price in the dynamic web page immediately follows the title, the target element needs to be searched for in the relative positioning mode: the title is located first, and the price is then located according to its position relative to the title. Specifically, the title is set as the reference node, and the relative position between the price and the title is set; the element at that relative position is the target container. When searching for the price, the target container is first found by absolute positioning, and the target element is then found within the target container according to the position of the price relative to the title.
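The price-after-title example can be made concrete with a short sketch. It assumes, for illustration only, that the container is found by a fixed path, that the title is an `h2` element, and that the price is the element immediately following it; none of these specifics come from the source. The two sample pages differ by one inserted element, which is enough to break an absolute path while leaving the relative lookup intact.

```python
import xml.etree.ElementTree as ET


def price_after_title(root, container_path, title_tag):
    """Locate the container by a fixed path, then return the text of the
    element that immediately follows the title inside it."""
    container = root.find(container_path)
    if container is None:
        return None
    children = list(container)
    for i, child in enumerate(children):
        if child.tag == title_tag and i + 1 < len(children):
            return (children[i + 1].text or "").strip()
    return None


# Hypothetical pages: PAGE_B has an extra element inserted before the title.
PAGE_A = "<body><div id='item'><h2>Car</h2><p>price 100 yuan</p></div></body>"
PAGE_B = "<body><div id='item'><p>Ad</p><h2>Car</h2><p>price 100 yuan</p></div></body>"

for page in (PAGE_A, PAGE_B):
    root = ET.fromstring(page)
    absolute = root.find("./div/p[1]").text  # fixed ordinal position
    relative = price_after_title(root, ".//div[@id='item']", "h2")
    print(absolute, "|", relative)
```

On the second page the absolute path returns the inserted element, whereas the lookup relative to the title still returns the price text.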
In step S340, the target fields extracted in step S320 and step S330 are used as search results of web page elements, and the search results are presented to the user.
In one embodiment, the target fields extracted in steps S320 and S330 are marked with colored wire frames on the current page to enable the user to visually see the extraction results.
FIG. 4 illustrates a screenshot of a browser page looking up a target field using absolute positioning, according to an embodiment of the invention. As shown in fig. 4, the target field is price, i.e., "price" in the figure, the target field extraction rule is absolute positioning, and in the current page, the matched element representing the price is marked by a colored wire frame. If the relative positioning mode is adopted for searching, the target container can be marked by a wire frame with one color after the target container is searched, and then the target elements in the target container are marked by a wire frame with another color, wherein the wire frame can be a solid wire frame or a dashed wire frame.
It should be noted that the template in the method 300 should be already stored in the server 102 before the web page element is searched, and if the user finds that the server 102 does not return an available template during the web page element search, the user needs to record one template and send the template to the server 102 for storage.
Fig. 5 shows a flow diagram of a recording template according to one embodiment of the invention. As shown in fig. 5, in one embodiment, the method 300 further includes a method of recording the template, and the method of recording the template includes the following steps S350 to S390.
Step S350, in response to the request of the user for recording the template on the Chrome browser, stores the field name and the field type newly added by the user.
In step S350, when the user clicks the record template button in the web page element search plug-in, the template recording process starts, the user needs to set a new field name and a new field type, clicks the save button, and then proceeds to the next step.
Step S360 responds to a positioning mode selection request of a user on a Chrome browser page, and determines a positioning mode selected by the user, wherein the positioning mode comprises absolute positioning and relative positioning. And after the user selects the field positioning mode, the next step is carried out.
Step S370, in response to a sample collection request of the user on the Chrome browser page, saving an element selected by the user on the current page as a sample, and then proceeding to the next step.
In step S370, the user needs to collect a sample, and during sample collection the selection follows the mouse: the user places the mouse cursor in the current web page, and as the mouse moves, the area where the cursor is located is highlighted. For example, if an element in the current web page is text, the entire row of text where the cursor is located is highlighted.
In one embodiment, the current web page is a static page, the user can click a mouse to select a field, then click a confirm button, the selected field is saved as a sample, and an Xpath of the sample is automatically generated, and the Xpath serves as the position of the sample.
In one embodiment, when a target element to be collected cannot be selected with the mouse, an add-margin function can be used. Adding margin changes the margins of the source website so that gaps appear between "crowded" elements, which allows those elements to be selected with the mouse; after the selection is completed, the restore-margin function is used to restore the original style of the source website.
In one embodiment, the current web page is a dynamic page, for example, a page includes dynamic data or a dynamic table type, a field that does not change with the change of the dynamic data on the page may be searched first, the field is used as a reference node, and a node whose relative position to the reference node meets a preset condition is used as a container. The reference node is located by using an absolute location mode, then keywords are input into a container, a target element is automatically searched through the keywords, a corresponding XPath is generated, and the target element is highlighted.
In the sample collection process, a user can collect the sample only by using a mouse, the whole process is visual, various selectors can be utilized without analyzing a website code structure, the development amount is greatly reduced, and the development efficiency is improved. Samples are collected in multiple modes, so that a standby scheme can be adopted after the selector fails, and the target element grabbing accuracy is improved.
In step S380, the sample selected by the user is located according to the positioning mode selected by the user, and the process of locating the sample is taken as the extraction rule of the target field.
In an embodiment, when the template is recorded by using a relative positioning method, after a user selects a sample, the sample needs to be positioned, and the flow of generating the sample value relative to the Xpath of the keyword node is shown in steps S381 to S386.
In step S381, the position of the keyword is determined from the container and the keyword. The descendant selector of the parent container first uses text() to search for the keyword position (for example, which API is used and how to generate a 'price' field); if the keyword is not found, a string-based match is tried instead. Once the keyword is found, the nearest keyword position, determined according to the depth of the keyword, is taken as the keyword node. For example, if the user needs to find a price value in the current page, the keyword is 'price'. The process then proceeds to the next step.
In step S382, starting from the keyword node, the sample value is first searched for in the keyword node itself; for example, in "price 100 yuan", 100 is the sample value. If the sample value is found, an XPath is generated based on the current query mode. For example, if the sample value is found in the keyword node itself, the XPath of the sample relative to the keyword is:
/self::*[contains(text(),'sampleValue')][1]
and then proceeds to the next step.
In step S383, if no sample value is found in step S382, the sibling nodes are queried backwards from the keyword node. If the sample value is found, the Xpath of the sample value relative to the sibling node is generated, based on the current query mode, as the Xpath of the sample value relative to the keyword node, and the process proceeds to the next step.
In step S384, if no sample value is found in step S383, the sibling nodes are queried forwards from the keyword node. If the sample value is found, the Xpath of the sample value relative to the sibling node is generated, based on the current query mode, as the Xpath of the sample value relative to the keyword node, and the process proceeds to the next step.
In step S385, if no sample value is found in step S384, the descendant nodes of the keyword node are queried. If the sample value is found, the Xpath of the sample value relative to the descendant node is generated, based on the current query mode, as the Xpath of the sample value relative to the keyword node, and the process proceeds to the next step.
In step S386, the container position, the keyword, and the Xpath of the sample value with respect to the keyword node are combined to obtain the final position of the sample value, and according to this position, the user can preview the extracted data directly through the interface.
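The search order of steps S381 to S386 can be sketched as below. Several points are assumptions rather than statements from the source: "backwards" is read as the following siblings and "forwards" as the preceding siblings; the keyword node is simply the first element in document order whose text contains the keyword; and the way the three parts are concatenated in `full_position` is hypothetical. Only the `self::*` expression form is taken from the text; the other three use the standard XPath 1.0 axes `following-sibling`, `preceding-sibling`, and `descendant`.

```python
import xml.etree.ElementTree as ET


def find_keyword_node(container, keyword):
    # S381 (simplified): first element whose own text contains the keyword.
    return next((el for el in container.iter() if keyword in (el.text or "")), None)


def relative_xpath(container, keyword_node, sample_value):
    """Return the XPath of the sample value relative to the keyword node,
    trying self, following siblings, preceding siblings, then descendants."""
    parent_of = {child: parent for parent in container.iter() for child in parent}
    predicate = f"[contains(text(),'{sample_value}')][1]"

    def has_value(el):
        return sample_value in (el.text or "")

    if has_value(keyword_node):                          # S382
        return "/self::*" + predicate
    parent = parent_of.get(keyword_node)
    siblings = list(parent) if parent is not None else []
    idx = siblings.index(keyword_node) if siblings else 0
    if any(has_value(s) for s in siblings[idx + 1:]):    # S383 (assumed: following)
        return "/following-sibling::*" + predicate
    if any(has_value(s) for s in siblings[:idx]):        # S384 (assumed: preceding)
        return "/preceding-sibling::*" + predicate
    descendants = list(keyword_node.iter())[1:]          # skip the node itself
    if any(has_value(d) for d in descendants):           # S385
        return "/descendant::*" + predicate
    return None


def full_position(container_xpath, keyword, rel):
    # S386 (hypothetical concatenation of container, keyword, and relative XPath).
    return f"{container_xpath}//*[contains(text(),'{keyword}')]{rel}"


container = ET.fromstring("<div id='box'><span>price</span><b>100</b></div>")
node = find_keyword_node(container, "price")
rel = relative_xpath(container, node, "100")
print(full_position("//div[@id='box']", "price", rel))
```

Because `ElementTree` elements carry no parent pointer, the sketch builds a child-to-parent map once per call; in a browser the same walk would use the DOM's sibling and parent properties directly.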
The relative positioning mode combines absolute positioning and keyword searching, and solves the problem that part of dynamic data is difficult to capture.
In step S390, in response to the request of the user to save the template on the Chrome browser, the recorded template is saved, and the saved template is sent to the server 102 together with the URL regular expression of the template.
In one embodiment, the method 300 further includes step S3100.
In step S3100, a rule set by the user for cleansing the web page element is taken as a search expression for the target field, and the search expression is recorded in the recorded template.
In the template recording process, after the sample position has been located, the previewed sample data may contain too many characters. For example, the element that the user wants to extract is "100", but the previewed sample data contains additional characters (for example, "price 100 yuan"). In this case, the user may set a price rule to clean the located data, for example to extract the price value from the extracted character string.
Correspondingly, step S340 may also clean the searched target field according to the search expression of the target field in the template, and use the cleaned field as the final search result of the web page element.
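The cleaning step can be sketched with a regular expression. The source does not specify the syntax of the search expression, so treating it as a regular expression is an assumption; the function name `clean_field` and the pattern `PRICE_RULE` are hypothetical.

```python
import re
from typing import Optional


def clean_field(raw: str, search_expression: str) -> Optional[str]:
    """Apply the template's search expression to the extracted text and
    return the first capture group, or the whole match if there is none."""
    match = re.search(search_expression, raw)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


# Hypothetical price rule: an integer or decimal number.
PRICE_RULE = r"(\d+(?:\.\d+)?)"

print(clean_field("price 100 yuan", PRICE_RULE))  # prints 100
```

Returning `None` when nothing matches lets the caller distinguish a field that was located but could not be cleaned from one that cleaned to an empty string.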
In one embodiment, when recording the template, for a form type without keywords, the whole form element can be selected as a sample, and the required field value is then extracted with a custom regular expression.
In one embodiment, after the template is recorded, the page is refreshed to verify the template, if the elements which can be marked are consistent with the target elements to be searched by the user, the selector is indicated to be effective, and then test verification can be performed on other pages of the same type.
In back-end applications, the web page element searching method can be used by a general-purpose crawler, which accesses the corresponding sites according to the sites recorded in the database and automatically collects the required fields according to the template.
Embodiments of the present invention also provide a web page element search apparatus 400, which is capable of performing the steps and processes of the web page element search method 300 as described above. The web page element finding apparatus 400 described above is described below in conjunction with fig. 6.
As shown in fig. 6, the web page element search apparatus 400 includes a template obtaining unit 410, an absolute positioning extracting unit 420, a relative positioning extracting unit 430, and a presentation unit 440.
The template obtaining unit 410 responds to a request of a user for searching web page elements on a Chrome browser, sends a domain name and a URL of a current site to a server, and obtains a template which is returned by the server and matched with the domain name and the URL of the current site, wherein the template records a target field and an extraction rule of the target field.
The absolute positioning extracting unit 420 is adapted to extract the target field in the current page using one of an Xpath selector, a CSS selector, and an ID selector when the extraction rule of the target field is absolute positioning.
The relative positioning extracting unit 430 is adapted to extract a target container in the current page and extract a target field in the target container when the extraction rule of the target field is relative positioning, where the target container represents an element whose relative position with respect to a preset node satisfies a preset condition.
The presentation unit 440 is adapted to use the extracted target field as a result of a search of elements of the web page.
In one embodiment, the absolute positioning extraction unit 420 further includes:
the Xpath extracting subunit is suitable for extracting a target field in the current page by using an Xpath selector;
the CSS extraction subunit is suitable for extracting the target field in the current page by using the CSS selector when the target field is failed to be extracted in the current page by using the Xpath selector; and
an ID extraction subunit adapted to extract the target field in the current page using the ID selector when the extraction of the target field in the current page using the CSS selector fails.
In one embodiment, the web page element lookup apparatus 400 further comprises:
the newly added field unit is suitable for responding to the request of a recording template of a user on the Chrome browser and storing the name and the type of the field newly added by the user;
the positioning selection unit is suitable for responding to a positioning mode selection request of a user on a Chrome browser page, and determining a positioning mode selected by the user, wherein the positioning mode comprises absolute positioning and relative positioning;
the sample acquisition unit is suitable for responding to a sample acquisition request of a user on a Chrome browser page and saving an element selected by the user on the current page as a sample;
the sample positioning unit is suitable for positioning the sample selected by the user according to the positioning mode selected by the user and taking the process of positioning the sample selected by the user as an extraction rule of the target field; and
the template storage unit is suitable for responding to a request of a user for storing the template on the Chrome browser, storing the recorded template, and sending the stored template and the URL regular expression of the template to the server side.
In one embodiment, the web page element lookup apparatus 400 further comprises:
and the cleaning setting unit is suitable for taking the rule set by the user for cleaning the web page element as the search expression of the target field.
In one embodiment, the template further records a search expression of the target field, and the presentation unit further includes:
the cleaning subunit is suitable for cleaning the searched target field according to the search expression of the target field in the template; and
and the display subunit is suitable for taking the cleaned field as a final search result of the web page element.
The principle of the web page element searching device 400 according to the embodiment of the present invention is the same as that of the web page element searching method 300 described above, and the technical effect of the web page element searching method 300 can be achieved, which is not described herein again.
Embodiments of the present invention also provide a readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the web page element lookup method 300 described above.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code; the processor is configured to execute the web page element searching method of the present invention according to instructions in the program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose preferred embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense with respect to the scope of the invention, as defined in the appended claims.

Claims (10)

1. A method of web page element lookup, the method comprising the steps of:
responding to a request of a user for searching web page elements on a Chrome browser, sending a domain name and a URL (uniform resource locator) of a current site to a server, and acquiring a template which is returned by the server and matched with the domain name and the URL of the current site, wherein a target field and an extraction rule of the target field are recorded in the template;
when the extraction rule of the target field is absolute positioning, extracting the target field in the current page by utilizing one of an Xpath selector, a CSS selector and an ID selector;
when the extraction rule of the target field is relative positioning, extracting a target container in the current page, and extracting the target field in the target container, wherein the target container represents an element of which the relative position with a preset node meets a preset condition; and
and using the extracted target field as a search result of the web page element.
2. The method of claim 1, wherein the extracting the target field in the current page using one of an Xpath selector, a CSS selector, and an ID selector comprises:
extracting a target field in the current page by using an Xpath selector;
when the target field is failed to be extracted from the current page by using the Xpath selector, extracting the target field from the current page by using the CSS selector; and
when the extraction of the target field in the current page using the CSS selector fails, the target field is extracted in the current page using the ID selector.
3. The method of claim 1, wherein the extracting the target container in the current page comprises:
and extracting the target container from the current page according to the position of the target container recorded in the template relative to the preset node.
4. The method of any one of claims 1 to 3, wherein the method further comprises the step of:
responding to a request of a recording template of a user on a Chrome browser, and storing a field name and a field type which are newly added by the user;
responding to a positioning mode selection request of a user on a Chrome browser page, and determining a positioning mode selected by the user, wherein the positioning mode comprises absolute positioning and relative positioning;
responding to a sample acquisition request of a user on a Chrome browser page, and saving elements selected by the user on the current page as samples;
positioning the sample selected by the user according to the positioning mode selected by the user, and taking the process of positioning the sample selected by the user as the extraction rule of the target field; and
responding to a request of a user for storing the template on the Chrome browser, storing the recorded template, and sending the stored template and the URL regular expression of the template to the server side.
5. The method of claim 4, wherein the method further comprises the steps of:
and taking the rule set by the user for cleaning the web page element as a search expression of the target field.
6. The method of claim 5, wherein the template further records a search expression for a target field, and the step of using the extracted target field as a search result for the web page element further comprises:
cleaning the searched target field according to the search expression of the target field in the template; and
and taking the cleaned field as a final search result of the web page element.
7. A web page element lookup apparatus comprising:
a template acquisition unit, adapted to, in response to a user's request on a Chrome browser to search for web page elements, send the domain name and the URL of the current site to a server and acquire a template that is returned by the server and matches the domain name and the URL of the current site, wherein the template records a target field and an extraction rule of the target field;
an absolute positioning extraction unit, adapted to extract the target field from the current page by using one of an XPath selector, a CSS selector, and an ID selector when the extraction rule of the target field is absolute positioning;
a relative positioning extraction unit, adapted to, when the extraction rule of the target field is relative positioning, extract a target container from the current page and extract the target field from the target container, wherein the target container is an element whose position relative to a preset node satisfies a preset condition; and
a display unit, adapted to take the extracted target field as the search result for the web page element.
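
The two extraction modes in claim 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: it runs on a small well-formed sample page using Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath and has no CSS selector, so the CSS case is omitted), and the rule dictionaries, their keys, and the "preset condition" (the container's first `span` carries a given label) are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, well-formed sample page used only for this sketch.
PAGE = """
<html><body>
  <h1 id="title">Model X</h1>
  <div class="row"><span>Price</span><span>129,800</span></div>
  <div class="row"><span>Engine</span><span>2.0T</span></div>
</body></html>
"""
root = ET.fromstring(PAGE.strip())


def extract_absolute(page: ET.Element, rule: dict) -> Optional[str]:
    """Absolute positioning: address the target field directly."""
    if rule["selector"] == "id":
        node = page.find(f".//*[@id='{rule['value']}']")
    elif rule["selector"] == "xpath":
        node = page.find(rule["value"])  # ElementTree's XPath subset
    else:
        raise ValueError(f"selector not supported in this sketch: {rule['selector']}")
    return node.text if node is not None else None


def extract_relative(page: ET.Element, rule: dict) -> Optional[str]:
    """Relative positioning: find the container first, then the field in it.

    Assumed preset condition: the container holds a preset node, located by
    anchor_path, whose text equals anchor_text.
    """
    for container in page.iter(rule["container_tag"]):
        anchor = container.find(rule["anchor_path"])
        if anchor is not None and (anchor.text or "").strip() == rule["anchor_text"]:
            field = container.find(rule["field_path"])
            return field.text if field is not None else None
    return None


def extract(page: ET.Element, rule: dict) -> Optional[str]:
    """Dispatch on the extraction rule recorded in the template."""
    if rule["mode"] == "absolute":
        return extract_absolute(page, rule)
    if rule["mode"] == "relative":
        return extract_relative(page, rule)
    raise ValueError(f"unknown positioning mode: {rule['mode']}")


title_rule = {"mode": "absolute", "selector": "id", "value": "title"}
price_rule = {"mode": "relative", "container_tag": "div",
              "anchor_path": "./span[1]", "anchor_text": "Price",
              "field_path": "./span[2]"}
title = extract(root, title_rule)
price = extract(root, price_rule)
```

In this sketch the relative rule does not depend on where the price row sits in the page, only on the row's own label.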
8. The apparatus of claim 7, further comprising:
a field adding unit, adapted to, in response to a user's request on the Chrome browser to record a template, save the name and the type of a field newly added by the user;
a positioning selection unit, adapted to, in response to a user's positioning mode selection request on a Chrome browser page, determine the positioning mode selected by the user, wherein the positioning mode comprises absolute positioning and relative positioning;
a sample acquisition unit, adapted to, in response to a user's sample acquisition request on a Chrome browser page, save an element selected by the user on the current page as a sample;
a sample positioning unit, adapted to position the sample selected by the user according to the positioning mode selected by the user, and to take the process of positioning the sample as the extraction rule of the target field; and
a template storage unit, adapted to, in response to a user's request on the Chrome browser to save the template, save the recorded template and periodically send the saved template and the URL of the template to the server.
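
The recorded template described in claim 8 can be pictured as a small data structure. The sketch below is an assumption-laden illustration: the class names `FieldRule` and `Template`, their attributes, and the JSON payload layout are hypothetical, since the claims do not specify how the template is stored or transmitted.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

POSITIONING_MODES = ("absolute", "relative")


@dataclass
class FieldRule:
    name: str        # field name entered by the user
    field_type: str  # field type entered by the user
    mode: str        # "absolute" or "relative"
    rule: dict       # recorded steps that locate the sample element


@dataclass
class Template:
    url: str
    fields: List[FieldRule] = field(default_factory=list)

    def add_field(self, name: str, field_type: str, mode: str, rule: dict) -> None:
        """Record one target field together with its extraction rule."""
        if mode not in POSITIONING_MODES:
            raise ValueError(f"unknown positioning mode: {mode}")
        self.fields.append(FieldRule(name, field_type, mode, rule))

    def to_payload(self) -> str:
        """Serialize the template; the JSON layout is an assumed format."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


# Hypothetical URL and rule, for illustration only.
template = Template(url="https://example.com/car/1")
template.add_field("price", "text", "absolute", {"selector": "id", "value": "price"})
payload = template.to_payload()
```

A periodic upload, as the claim describes, would send `payload` and the template URL to the server on a timer; that transport step is left out here because the claims give no details of it.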
9. A computing device, comprising:
at least one processor and a memory storing program instructions;
the program instructions, when read and executed by the processor, cause the computing device to perform the web page element lookup method of any one of claims 1-6.
10. A readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the web page element lookup method of any one of claims 1-6.
CN202111206055.4A 2021-10-14 2021-10-14 Web page element searching method and device and computing equipment Pending CN113918789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111206055.4A CN113918789A (en) 2021-10-14 2021-10-14 Web page element searching method and device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111206055.4A CN113918789A (en) 2021-10-14 2021-10-14 Web page element searching method and device and computing equipment

Publications (1)

Publication Number Publication Date
CN113918789A true CN113918789A (en) 2022-01-11

Family

ID=79240746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111206055.4A Pending CN113918789A (en) 2021-10-14 2021-10-14 Web page element searching method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN113918789A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114969610A (en) * 2022-06-21 2022-08-30 中银金融科技有限公司 Page processing method and device
CN115061927A (en) * 2022-06-27 2022-09-16 壹沓科技(上海)有限公司 Webpage element positioning method and device based on RPA and storage medium

Similar Documents

Publication Publication Date Title
US9075873B2 (en) Generation of context-informative co-citation graphs
WO2018133452A1 (en) Webpage rendering method and related device
US9330179B2 (en) Configuring web crawler to extract web page information
US9904936B2 (en) Method and apparatus for identifying elements of a webpage in different viewports of sizes
JP4945813B2 (en) Print structured documents
CN105631393A (en) Information recognition method and device
CN113918789A (en) Web page element searching method and device and computing equipment
JP2014032665A (en) Selective display of ocr'ed text and corresponding images from publications on client device
CN105580384A (en) Actionable content displayed on a touch screen
JP6514244B2 (en) Difference detection device and program
US20180218076A1 (en) Information obtaining method and apparatus
AU2009238294A1 (en) Data transformation based on a technical design document
WO2016095502A1 (en) Mathematical formula processing method, device, apparatus and computer storage medium
Wu Language independent web news extraction system based on text detection framework
US20150106701A1 (en) Input support method and information processing system
WO2017162031A1 (en) Method and device for collecting information, and intelligent terminal
JP6723976B2 (en) Test execution device and program
US20160103799A1 (en) Methods and systems for automated detection of pagination
CN113419711A (en) Page guiding method and device, electronic equipment and storage medium
US20210312141A1 (en) Content management systems for providing automated translation of content items
TW201705021A (en) An information retrieving method utilizing webpage visual features and webpage language features and a system using thereof
CN115061688B (en) Page effect display method, computing device and storage medium
JP2002169637A (en) Document display mode conversion device, document display mode conversion method, recording medium
CN114579461A (en) Browser compatibility detection method and related equipment
CN110147477B (en) Data resource modeling extraction method, device and equipment of Web system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination