WO2022179128A1 - Crawler-based data crawling method and apparatus, computer device, and storage medium - Google Patents


Info

Publication number: WO2022179128A1
Authority: WO (WIPO, PCT)
Prior art keywords: character, web page, target, tag, characters
Application number: PCT/CN2021/124394
Other languages: French (fr), Chinese (zh)
Inventor: Zheng Rugang (郑如刚)
Original Assignee: Shenzhen OneConnect Smart Technology Co., Ltd. (深圳壹账通智能科技有限公司)
Application filed by Shenzhen OneConnect Smart Technology Co., Ltd. (深圳壹账通智能科技有限公司)
Publication of WO2022179128A1 publication Critical patent/WO2022179128A1/en

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F16/00 Information retrieval; Database structures therefor; File system structures therefor › G06F16/90 Details of database functions independent of the retrieved data types › G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL] › G06F16/9562 Bookmark management
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking › G06F16/986 Document structures and storage, e.g. HTML extensions

Definitions

  • The present application relates to the technical field of big-data collection, and in particular to a crawler-based data capture method, apparatus, computer device, and storage medium.
  • In the related art, crawler technology is usually used for data capture.
  • Crawler technology refers to a program that locates web pages through their link addresses and automatically retrieves web page content according to certain rules.
  • The purpose of the embodiments of the present application is to propose a crawler-based data grabbing method, apparatus, computer device, and storage medium, so as to solve the problems in the related art that crawler-based data grabbing involves a huge workload, is time-consuming and labor-intensive, and has low grabbing efficiency.
  • the embodiments of the present application provide a crawler-based data capture method, which adopts the following technical solutions:
  • the embodiments of the present application also provide a crawler-based data grabbing device, which adopts the following technical solutions:
  • an acquisition module used to acquire a target webpage, parse the target webpage, and obtain all tags of the target webpage
  • The traversal module is used to obtain tag pairs according to the tags, traverse all the tag pairs, search for the characters in each tag pair, take the character that satisfies the first preset condition as the first character, and take the character that satisfies the second preset condition as the second character;
  • a reading module configured to take the first character as the starting point and the second character as the ending point, and read out the target character between the starting point and the ending point;
  • the extraction module is used for judging whether the target character satisfies the extraction condition, and extracting the target character as the page content when it is determined that the extraction condition is satisfied.
  • the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
  • The computer device comprises a memory and a processor; the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the steps of the following crawler-based data grabbing method are implemented:
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the steps of the crawler-based data grabbing method described below are implemented:
  • The present application acquires a target web page and parses it to obtain all tags of the target web page; obtains tag pairs according to the tags; traverses all the tag pairs and searches for the characters in each tag pair, taking the character that satisfies the first preset condition as the first character and the character that satisfies the second preset condition as the second character; takes the first character as the starting point and the second character as the end point, and reads out the target characters between them; and judges whether the target characters satisfy the extraction condition, extracting them as the page content when the condition is met. In other words, by traversing all the tag pairs of the target web page, finding the first character and the second character in each tag pair, and extracting the target characters between them that meet the extraction condition, the present application crawls web page data by finding the characters in the tags that meet preset conditions. This avoids having to write different crawling scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data-crawling efficiency.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a crawler-based data capture method according to the present application;
  • Fig. 3 is a flow chart of a specific implementation manner of step S201 in Fig. 2;
  • Fig. 4 is a flow chart of a specific implementation manner of step S202 in Fig. 2;
  • FIG. 5 is a schematic structural diagram of an embodiment of a crawler-based data grabbing device according to the present application.
  • FIG. 6 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • a web crawler also known as a web spider or a web robot, refers to a program or script that automatically grabs web information according to certain rules.
  • HTML: HyperText Markup Language.
  • HTML is a descriptive markup language used to describe how the content in hypertext is displayed.
  • Tags, also known as labels, are an HTML web term; each tag is used to specify a specific meaning.
  • the present application provides a crawler-based data capture method, which can be applied to the system architecture shown in FIG. 1 .
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • The terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptops, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • The crawler-based data grabbing method provided by the embodiment of the present application is generally executed by a server, and accordingly, the crawler-based data grabbing apparatus is generally deployed in the server.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • Referring to FIG. 2, a flowchart of an embodiment of the crawler-based data crawling method according to the present application is shown, including the following steps:
  • Step S201: a target web page is acquired, and the target web page is parsed to obtain all tags of the target web page.
  • The crawler obtains the web page at the address given by the provided entry URL, and this web page is the target web page to be crawled.
  • the crawler will identify the tags of the target webpage and extract the content of the webpage from the tags.
  • The steps of parsing the target web page to obtain the tags of the target web page are as follows:
  • Step S301: a web page parser is used to extract the web page structure of the target web page.
  • an HTML document is obtained by parsing the target web page with a web page parser, the HTML document is parsed to generate a DOM tree structure, and the generated DOM tree structure is used as the web page structure of the target web page.
  • A web page parser is also called an HTML (HyperText Markup Language) parser.
  • jsoup is a Java HTML parser, which can directly parse a URL address and HTML text content.
  • the web page content containing tags can be acquired.
  • the following is an HTML document:
  • Tags appear in pairs; that is, the tags in this embodiment are specifically tag pairs. For example, <html> and </html> are one tag pair, and <head> and </head> are one tag pair. <html> defines the HTML document, and the <html> and </html> tags define the start and end points of the document; <head> defines information about the document; <title> defines the title of the document; <body> defines the body of the document; <p> defines a document paragraph; <b> defines bold font; <a> defines a hyperlink.
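As a minimal illustration (the sample HTML document referenced above is not reproduced in this text, so the snippet below substitutes a hypothetical one), the pairing of start and end tags can be sketched with Python's standard-library HTMLParser; this is an illustrative sketch, not the patent's implementation:

```python
from html.parser import HTMLParser

# Hypothetical sample document using the tags discussed above.
HTML_DOC = """<html><head><title>Demo</title></head>
<body><p>A <b>bold</b> word and a <a href="https://example.com">link</a>.</p></body></html>"""

class PairCollector(HTMLParser):
    """Collects (open, close) tag pairs by matching each end tag
    with the most recent unmatched start tag of the same name."""
    def __init__(self):
        super().__init__()
        self.stack, self.pairs = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            self.pairs.append((f"<{tag}>", f"</{tag}>"))

parser = PairCollector()
parser.feed(HTML_DOC)
print(parser.pairs)  # pairs are recorded in the order their closing tags appear
```

Because pairs are recorded as each closing tag is seen, the innermost pairs (such as `<title>`/`</title>`) appear before the outermost `<html>`/`</html>` pair.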
  • The most important attribute of the <a> tag is the href attribute, which indicates the target of the link.
  • The DOM tree structure is the parse tree output by the HTML parser, including element nodes, text nodes, attribute nodes, and comment nodes. It is the object representation of the HTML document and serves as the external interface to the HTML elements for JavaScript and other callers. There are multiple branches in the DOM tree structure, each branch has multiple layers, and the layer structure reflects the relationships between element nodes.
  • Nodes that do not contain text information (that is, functional code) in the web page can be removed; for example, meaningless HTML tags such as <style>, </style>, <script>, </script>, etc. can be removed.
  • the target webpage is parsed to generate a DOM tree structure, and the tags in the DOM tree structure are used as traversing nodes, so that the tags can be traversed more conveniently.
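A simplified sketch of this step: build a DOM-like tree from the parsed HTML and drop <script>/<style> subtrees, which carry no page text. The dict-based tree and the sample markup here are illustrative assumptions, not the patent's actual data structures:

```python
from html.parser import HTMLParser

SKIP = {"script", "style"}  # functional nodes carrying no page text

class DomBuilder(HTMLParser):
    """Builds a nested dict tree (a simplified DOM) and drops
    <script>/<style> subtrees while parsing."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": [], "text": ""}
        self.stack = [self.root]
        self.skipping = 0  # depth inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
            return
        if self.skipping:
            return
        node = {"tag": tag, "children": [], "text": ""}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skipping -= 1
            return
        if self.skipping:
            return
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if not self.skipping:
            self.stack[-1]["text"] += data.strip()

builder = DomBuilder()
builder.feed("<div><style>p{color:red}</style><p>kept</p></div>")
div = builder.root["children"][0]
print(div["children"][0]["text"])  # the <p> node survives; <style> is dropped
```

The resulting tree can then be walked with the tags as traversal nodes, as the following steps describe.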
  • Step S302: the tags in the web page structure are acquired.
  • The element nodes of the DOM tree structure are the tag elements of the web page, the text nodes are contained within the element nodes, and the attribute nodes are used to describe the elements in detail.
  • For example, the attribute of the <a> tag is the href attribute. It should be understood that not all element nodes contain attributes.
  • the tag of the target webpage can be obtained from the generated DOM tree structure, the performance of parsing the webpage can be improved, and at the same time, the webpage information can be obtained completely and accurately.
  • the web page structure is extracted by the web page parser, and the tags are obtained from the web page structure, so that the tags of the target web page can be obtained more quickly and easily.
  • Step S202: obtain tag pairs according to the tags, traverse all the tag pairs, and search for the characters in each tag pair, taking the character that satisfies the first preset condition as the first character and the character that satisfies the second preset condition as the second character.
  • a web page is a page composed of various tags.
  • the tags displayed on the web page follow the tag specification, and the level of the tags can be distinguished according to the tag elements. Take the following example to illustrate:
  • In the example there are three tag elements, namely <dd>, <ul>, and <li>. It can be seen that the tags are divided into three layers, and each tag element corresponds to one layer.
  • tags exist in the form of tag pairs, and by traversing each tag as a node, the page content of the webpage corresponding to the tag can be obtained.
  • Step S401: the outermost tag in the web page structure is used as an initial node.
  • the tags are used as traversal nodes to perform depth-first traversal on all tag pairs.
  • The outermost tags in the web page structure are taken as the first layer, the tags nested in the first layer are taken as the second layer, and so on, forming a tree structure. The tag of the first layer is taken as the initial node; access starts from the initial node and proceeds to the unvisited adjacent nodes of the initial node, and depth-first traversal is performed until all nodes have been visited.
  • The tag element <dd> is the outermost tag and forms the first layer; the tag element <ul> is nested in <dd> and forms the second layer; the tag element <li> is nested in <ul> and forms the third layer.
  • The adjacent node of <dd> is <ul>, and the adjacent node of <ul> is <li>, so the order of depth-first traversal is: <dd> → <ul> → <li>.
  • Step S402: starting from the initial node, the tags in the DOM tree structure are used as traversal nodes, and depth-first traversal is performed on all tag pairs.
  • The tag elements in the DOM tree structure are used as nodes, and the tag pairs corresponding to the tag elements are traversed depth-first in sequence. For example, if a tag element in the DOM tree structure is ul, the tag pair corresponding to that tag element is <ul> and </ul>.
  • The method of depth-first traversal, starting from the initial node v, is: (a) visit node v and mark it as visited; (b) find the first unvisited adjacent node w of v; (c) if w exists, take w as the new node v and return to step (a); (d) if w does not exist, backtrack to the most recent visited node that still has unvisited adjacent nodes and continue from step (b); if no such node exists, the traversal ends.
  • label A is the first layer; labels nested in label A are label B and label C, then label B and label C are the second layer; labels nested in label B are labels D and label E, the label nested in label C is label F, then label D, label E and label F are the third layer; therefore, label A is used as the initial node, label A has adjacent nodes label B and label C, label B has adjacent node label D and label E, and label C has adjacent node label F.
  • the order of depth-first traversal is: label A ⁇ label B ⁇ label D ⁇ label E ⁇ label C ⁇ label F.
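The traversal order described above can be reproduced with a short depth-first routine. The adjacency table for tags A through F is taken from the example; the function itself is an illustrative sketch:

```python
# Tag tree from the example: A is layer 1; B and C are layer 2;
# D and E nest in B, and F nests in C (layer 3).
TREE = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
        "D": [], "E": [], "F": []}

def dfs(node, tree, order=None):
    """Depth-first traversal: visit a node, then recurse into each
    not-yet-visited child before moving on to the next sibling."""
    if order is None:
        order = []
    order.append(node)
    for child in tree[node]:
        if child not in order:
            dfs(child, tree, order)
    return order

print(dfs("A", TREE))  # ['A', 'B', 'D', 'E', 'C', 'F']
```

The output matches the order stated in the text: tag A → tag B → tag D → tag E → tag C → tag F.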
  • each tag pair is traversed to extract the content of the page.
  • When traversing the tag pairs, the characters in each tag pair are searched; the character that satisfies the first preset condition is taken as the first character, and the character that satisfies the second preset condition is taken as the second character.
  • The first preset condition is specifically the greater-than sign ('>') that appears in the layer corresponding to the currently accessed tag pair; the second preset condition is specifically the less-than sign ('<') that appears next after that '>' character in the same layer. It should be understood that the '>' and '<' characters are tag characters.
  • For example, the '>' character that appears first in the tag pair is taken as a first character, and the '<' character that appears immediately after it is taken as the corresponding second character; the '>' character that appears second is taken as the next first character, and the '<' character that appears immediately after it is taken as the next second character, and so on. The first character and the second character occur in sequence, with the first character in front and the second character behind, and the content between the first character and the second character is the page content that may be extracted.
  • all characters in each label pair can be searched, and the first and second characters found are marked respectively.
  • String search methods include but are not limited to the following methods:
  • indexOf(String str): returns the index of the first occurrence of the specified substring in this string.
  • In a tag pair, the tail character ('>') of the opening tag is the first occurrence of the first character, and the head character ('<') of the closing tag is the last occurrence of the second character.
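A hedged sketch of the character search: scan a fragment for each '>' (a first character) and the next '<' (the matching second character), collecting the text between them as candidate page content. Python's str.find plays the role of the indexOf method mentioned above; the fragment is a made-up example:

```python
def candidate_spans(fragment):
    """Collect the text between each '>' (end of a tag) and the
    next '<' (start of the following tag) in an HTML fragment."""
    spans = []
    i = fragment.find(">")              # str.find mirrors Java's indexOf
    while i != -1:
        j = fragment.find("<", i + 1)   # the next second character
        if j == -1:
            break
        spans.append(fragment[i + 1:j])
        i = fragment.find(">", j)       # move on to the next first character
    return spans

print(candidate_spans("<p>About ADO<b></b> and symbols</p>"))
# ['About ADO', '', ' and symbols']
```

Note that some spans are empty or partial; this is why the following steps filter candidates by character length before extracting them as page content.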
  • When a website is revised, the page content usually does not change greatly, but the tags do. If the crawler obtained data by analyzing the web page tag by tag, the crawling script would need to be rewritten after each revision. In this embodiment, the page content is extracted by determining the character length between the first character and the second character, which avoids having to write different scripts when crawling different websites or when web pages are revised, and enhances the adaptability of the crawler.
  • the first character and the second character in each tag pair are determined by depth-first traversal, which can ensure that the page content is extracted completely and without omission.
  • Step S203: taking the first character as the starting point and the second character as the ending point, the target characters between the starting point and the ending point are read out.
  • Step S204: it is judged whether the target characters satisfy the extraction condition, and when it is determined that the extraction condition is satisfied, the target characters are extracted as the page content.
  • The length of the target characters between the first character and the second character is recorded; if it is determined that the length of the target characters satisfies the extraction length, the target characters are taken as the page content, and the page content is extracted.
  • the extracted page content can be saved for easy viewing by users.
  • The tag format is <label>page content</label>, and the page content between the first character '>' and the second character '<' should be extracted.
  • a tag pair is as follows:
  • the page content "About ADO and Numerical Symbols" needs to be extracted.
  • In this tag pair there are multiple first characters and second characters. Therefore, for each pair of first and second characters, the character length between them is recorded, and whether it is the page content to be extracted is determined according to that length.
  • The extraction length is set according to the actual situation. When it is determined that the target characters between the first character and the second character meet the preset extraction length, the target characters are extracted as the page content.
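The length check in this step might be sketched as follows, with MIN_LEN standing in for the preset extraction length (the value 2 is an arbitrary illustration, and the helper name is hypothetical):

```python
MIN_LEN = 2  # hypothetical extraction length, tuned per site

def extract_page_content(spans, min_len=MIN_LEN):
    """Keep only target-character spans whose stripped length meets
    the extraction length, discarding empty or trivial spans."""
    return [s.strip() for s in spans if len(s.strip()) >= min_len]

spans = ["About ADO", "", " ", "and symbols"]
print(extract_page_content(spans))  # ['About ADO', 'and symbols']
```

Filtering by length discards the empty spans produced by adjacent tags (such as `<b></b>`), which is how useless content is avoided.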
  • the page content can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database, a series of data blocks associated with one another using cryptographic methods. Each data block contains a batch of network transaction information, used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • This embodiment determines whether the page content to be extracted is between the first character and the second character according to the character length, which can avoid extracting useless content, improve the accuracy of crawling data by the crawler, and improve the crawling efficiency.
  • After step S203, the following steps may also be performed:
  • the extracted page content is stored in json format, or stored in a database format, and the stored page content can be directly used for data analysis.
  • the storage methods include, but are not limited to, storing to text files in json format, storing to excel, storing to SQLite (light database), and storing to mySQL (relational database management system) database.
  • JSON, the full name of which is JavaScript Object Notation, is JavaScript object notation; any supported type can be represented by JSON, such as strings, numbers, objects, arrays, etc.
  • a path to a json file for saving page content is set, and the extracted page content is written into the json file through the path for storage.
  • The json format ensures that when the file is opened, the stored data can be checked visually, with one record stored per line.
  • Storing to json text files is suitable for crawling a small amount of data, and subsequent reading and analysis are very convenient. If the crawled data can easily be organized into a table, it can be stored in excel; the data is then convenient to inspect, and excel also supports some simple operations. SQLite is a zero-configuration database that requires no installation; when the amount of crawled data is large and needs persistent storage, and no other database is installed, SQLite can be chosen. mySQL supports remote access, which means data can be stored on a remote server host.
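A minimal sketch of the json storage option, writing one record per line as described above (the file name and record fields are hypothetical):

```python
import json

def save_as_json_lines(records, path):
    """Write one JSON record per line, so the file can be inspected
    visually and parsed line by line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [{"url": "https://example.com", "content": "About ADO"},
           {"url": "https://example.com/2", "content": "and symbols"}]
save_as_json_lines(records, "pages.jsonl")  # hypothetical output path

with open("pages.jsonl", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))  # 2
```

Each line is a complete JSON object, so the stored page content can be read back incrementally for data analysis without loading the whole file.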
  • The present application traverses all the tag pairs of the target web page, finds the first character and the second character in each tag pair, reads out the target characters between the first character (the starting point) and the second character (the end point), and, when the target characters are determined to meet the extraction condition, extracts them as the page content. That is, web page data is crawled by finding the corresponding characters in the tags, which avoids having to write different crawling scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data-capture efficiency.
  • The present application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices, and the like.
  • the application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • The present application provides an embodiment of a crawler-based data grabbing device, which corresponds to the method embodiment shown in FIG. 2. Specifically, the device can be applied to various electronic devices.
  • The crawler-based data capture device 500 in this embodiment includes: an acquisition module 501, a traversal module 502, and an extraction module 503. Specifically:
  • the obtaining module 501 is used to obtain a target web page, parse the target web page, and obtain all tags of the target web page;
  • The traversal module 502 is configured to obtain tag pairs according to the tags, traverse all the tag pairs, search for the characters in each tag pair, take the character that satisfies the first preset condition as the first character, and take the character that satisfies the second preset condition as the second character;
  • a reading module configured to take the first character as the starting point and the second character as the ending point, and read out the target character between the starting point and the ending point;
  • the extraction module 503 is configured to judge whether the target character satisfies the extraction condition, and when it is determined that the extraction condition is satisfied, extract the target character as the page content.
  • the page content can also be stored in a node of a blockchain.
  • The above-mentioned crawler-based data grabbing device traverses all the tag pairs of the target web page, searches for the first character and the second character in each tag pair, reads out the target characters between the first character (the starting point) and the second character (the end point), and, when the target characters are determined to meet the extraction condition, extracts them as the page content. That is, page data is crawled by finding the corresponding characters in the tags, which avoids writing different crawling scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data-scraping efficiency.
  • the acquiring module 501 includes a parsing sub-module and an acquiring sub-module.
  • The parsing sub-module is used to extract the web page structure of the target web page by using a web page parser; the acquiring sub-module is used to acquire the tags in the web page structure.
  • the web page structure is extracted by the web page parser, and the tags are obtained from the web page structure, so that the tags of the target web page can be obtained more quickly and easily.
  • the parsing submodule is further used for:
  • the HTML document is parsed to generate a DOM tree structure, and the generated DOM tree structure is used as the web page structure of the target web page.
  • the performance of parsing the web page can be improved, and at the same time, the web page information can be acquired completely and accurately.
  • the traversal module 502 is further used for:
  • depth-first traversal is performed on all the tag pairs.
  • the traversal module 502 is further configured to:
  • the first character and the second character in each tag pair are determined by depth-first traversal, which can ensure that the page content is extracted completely and without omission.
  • The extraction module 503 includes a recording sub-module and an extraction sub-module; the recording sub-module is used to record the length of the target characters, and the extraction sub-module is used to, when it is determined that the length of the target characters satisfies the extraction length, take the target characters as the page content and extract them.
  • This embodiment determines whether the page content to be extracted is between the first character and the second character according to the character length, which can avoid extracting useless content, improve the accuracy of crawling data by the crawler, and improve the crawling efficiency.
  • the crawler-based data crawling apparatus 500 further includes a storage module, and the storage module is configured to store the page content in json format.
  • the extracted page content is stored in the database, and the stored page content can be directly used for data analysis.
  • FIG. 6 is a block diagram of the basic structure of a computer device according to this embodiment.
  • the computer device 6 includes a memory 61, a processor 62, and a network interface 63 that communicate with each other through a system bus. It should be pointed out that only the computer device 6 with components 61-63 is shown in the figure, but it should be understood that it is not required to implement all of the shown components, and more or less components may be implemented instead.
  • The computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 61 stores computer-readable instructions, and the processor 62 implements the following steps when executing the computer-readable instructions:
  • The memory 61 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6.
  • the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device.
  • the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, such as the computer-readable instructions of the crawler-based data capture method.
  • the memory 61 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 62 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. This processor 62 is typically used to control the overall operation of the computer device 6 . In this embodiment, the processor 62 is configured to execute computer-readable instructions stored in the memory 61 or process data, for example, computer-readable instructions for executing the crawler-based data capture method.
  • the network interface 63 may include a wireless network interface or a wired network interface, and the network interface 63 is generally used to establish a communication connection between the computer device 6 and other electronic devices.
  • the steps of the crawler-based data grabbing method in the above embodiment are implemented: all tag pairs of the target web page are traversed, the first character and the second character in each tag pair are found, the target characters between the first character (as the starting point) and the second character (as the end point) are read out, and when the target characters are determined to satisfy the extraction condition they are extracted as page content. That is, web page data is crawled by finding the corresponding characters in the tags, which avoids writing different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves the efficiency of data crawling.
  • the present application also provides another implementation manner, which is to provide a computer-readable storage medium, where the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores computer-readable instructions that can be executed by at least one processor, so that the at least one processor executes the steps of the crawler-based data scraping method described above: all tag pairs of the target web page are traversed, the first character and the second character in each tag pair are found, the target characters between the first character (as the starting point) and the second character (as the end point) are read out, and when the target characters are determined to satisfy the extraction condition they are extracted as page content. That is, web page data is crawled by finding the corresponding characters in the tags, which avoids writing different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data capture efficiency.
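As a minimal sketch of the steps above — traverse tag pairs, read the characters between each pair, extract those that satisfy the condition — the following Python snippet uses the standard-library `html.parser`. The specific conditions (first/second character positions, the extraction condition being "non-empty text") are illustrative assumptions; the claims do not fix them.

```python
from html.parser import HTMLParser


class TagPairScraper(HTMLParser):
    """Collects the text found between each opening tag and its matching
    closing tag. Only text directly contained in a pair is attributed to
    that pair; the extraction condition is assumed to be "non-empty text"."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._stack = []  # one [tag, text-buffer] entry per open tag pair

    def handle_starttag(self, tag, attrs):
        self._stack.append([tag, ""])

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1] += data  # accumulate the target characters

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            _, text = self._stack.pop()
            target = text.strip()
            if target:  # extraction condition (assumed): non-empty text
                self.results.append(target)


def scrape_page_content(html: str) -> list:
    """Traverse all tag pairs of a page and extract qualifying text."""
    scraper = TagPairScraper()
    scraper.feed(html)
    return scraper.results
```

For example, `scrape_page_content("<html><body><p>Hello</p><b>World</b></body></html>")` yields `["Hello", "World"]`: the empty `<html>` and `<body>` buffers fail the assumed extraction condition and are skipped.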
  • the method of the above embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or CD-ROM) and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) execute the methods described in the various embodiments of this application.


Abstract

A crawler-based data crawling method, comprising: obtaining a target web page, and parsing the target web page to obtain all tags of the target web page (S201); obtaining tag pairs from the tags, traversing all the tag pairs, and searching for characters in each tag pair, taking a character satisfying a first preset condition as a first character and a character satisfying a second preset condition as a second character (S202); taking the first character as a starting point and the second character as an end point, reading the target characters between the starting point and the end point (S203); and determining whether the target characters satisfy an extraction condition and, when the extraction condition is determined to be satisfied, extracting the target characters as page content (S204). The method enhances the adaptability of a crawler, reduces the workload, and improves data crawling efficiency.

Description

Crawler-based data capture method, apparatus, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on February 25, 2021, with application number 202110213211.3 and entitled "Crawler-based data capture method, device, computer equipment and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of big data collection, and in particular to a crawler-based data capture method, apparatus, computer device, and storage medium.
Background Art
In recent years, the Internet has gradually shifted toward big data, and in a big data environment the acquisition of data is crucial. Among data acquisition methods, crawler technology is usually used for data capture. A crawler is a program that finds web pages through their link addresses and automatically obtains web page content according to certain rules.
The inventor found that, at present, a traditional crawler that captures web page data needs to parse the data by locating page elements. That is to say, in the process of capturing data with a crawler, scripts must be written, and different websites require different scripts; likewise, if a web page is redesigned, the capture script must be rewritten. This makes the workload of data capture enormous, time-consuming, labor-intensive, and inefficient.
Summary of the Invention
The purpose of the embodiments of the present application is to propose a crawler-based data capture method, apparatus, computer device, and storage medium, so as to solve the problems in the related art that crawler-based data capture involves an enormous workload, is time-consuming and labor-intensive, and has low capture efficiency.
In order to solve the above technical problem, an embodiment of the present application provides a crawler-based data capture method, which adopts the following technical solution:
obtaining a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs from the tags, traversing all the tag pairs, searching for characters in each tag pair, taking a character satisfying a first preset condition as a first character, and taking a character satisfying a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading out the target characters between the starting point and the end point;
determining whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extracting the target characters as page content.
In order to solve the above technical problem, an embodiment of the present application further provides a crawler-based data capture apparatus, which adopts the following technical solution:
an acquisition module, configured to obtain a target web page and parse the target web page to obtain all tags of the target web page;
a traversal module, configured to obtain tag pairs from the tags, traverse all the tag pairs, search for characters in each tag pair, take a character satisfying a first preset condition as a first character, and take a character satisfying a second preset condition as a second character;
a reading module, configured to take the first character as a starting point and the second character as an end point, and read out the target characters between the starting point and the end point;
an extraction module, configured to determine whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extract the target characters as page content.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solution:
the computer device includes a memory and a processor, the memory stores computer-readable instructions, and the processor implements the following steps of the crawler-based data capture method when executing the computer-readable instructions:
obtaining a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs from the tags, traversing all the tag pairs, searching for characters in each tag pair, taking a character satisfying a first preset condition as a first character, and taking a character satisfying a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading all the target characters between the starting point and the end point;
determining whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extracting the target characters as page content.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solution:
the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the following steps of the crawler-based data capture method are implemented:
obtaining a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs from the tags, traversing all the tag pairs, searching for characters in each tag pair, taking a character satisfying a first preset condition as a first character, and taking a character satisfying a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading all the target characters between the starting point and the end point;
determining whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extracting the target characters as page content.
Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects:
The present application obtains a target web page and parses it to obtain all tags of the target web page; obtains tag pairs from the tags, traverses all the tag pairs, and searches for characters in each tag pair, taking a character satisfying a first preset condition as a first character and a character satisfying a second preset condition as a second character; takes the first character as a starting point and the second character as an end point and reads out the target characters between the starting point and the end point; and determines whether the target characters satisfy an extraction condition, extracting them as page content when they do. In other words, by traversing all the tag pairs of the target web page, finding the first character and the second character in each tag pair, reading out the target characters between them, and extracting the target characters that satisfy the extraction condition, web page data is captured by finding characters in the tags that satisfy preset conditions. This avoids writing different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data capture efficiency.
Brief Description of the Drawings
In order to illustrate the solutions of the present application more clearly, the following briefly introduces the drawings used in describing the embodiments of the present application. Obviously, the drawings described below depict only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
FIG. 2 is a flowchart of an embodiment of the crawler-based data capture method according to the present application;
FIG. 3 is a flowchart of a specific implementation of step S201 in FIG. 2;
FIG. 4 is a flowchart of a specific implementation of step S202 in FIG. 2;
FIG. 5 is a schematic structural diagram of an embodiment of the crawler-based data capture apparatus according to the present application;
FIG. 6 is a schematic structural diagram of an embodiment of a computer device according to the present application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are for the purpose of describing specific embodiments only and are not intended to limit the application. The terms "comprising" and "having" and any variations thereof in the description, claims, and drawings of this application are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the description, claims, or drawings are used to distinguish different objects rather than to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings.
Before the embodiments of the present application are described in further detail, the terms involved in the embodiments are explained; they apply as interpreted below.
1) A crawler (web crawler), also known as a web spider or web robot, is a program or script that automatically captures network information according to certain rules.
2) HTML (HyperText Markup Language) is a descriptive markup language used to describe how the content in hypertext is displayed.
3) A tag, also known as a mark, is an HTML term; each kind of tag specifies a particular meaning.
4) A web page is a page composed of various tags.
In order to solve the problems in the related art that crawler-based data capture involves an enormous workload, is time-consuming and labor-intensive, and has low capture efficiency, the present application provides a crawler-based data capture method, which can be applied to the system architecture 100 shown in FIG. 1. The system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
The server 105 may be a server that provides various services, such as a background server that supports the pages displayed on the terminal devices 101, 102, and 103.
It should be noted that the crawler-based data capture method provided by the embodiments of the present application is generally executed by the server; accordingly, the crawler-based data capture apparatus is generally provided in the server.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
Continuing to refer to FIG. 2, a flowchart of an embodiment of the crawler-based data capture method according to the present application is shown, including the following steps:
Step S201: obtain a target web page, and parse the target web page to obtain all tags of the target web page.
In this embodiment, the crawler obtains the web page at the provided entry URL; this web page is the target web page to be captured. When the crawler captures the target web page, it identifies the tags of the target web page and extracts the content of the web page from the tags.
In some optional implementations of this embodiment, the steps of parsing the target web page to obtain its tags are as follows:
Step S301: use a web page parser to extract the web page structure of the target web page.
Specifically, the target web page is parsed by a web page parser to obtain an HTML document, the HTML document is parsed to generate a DOM tree structure, and the generated DOM tree structure is used as the web page structure of the target web page.
In a web page, the tags are arranged in a tree structure. A web page parser is also called an HTML (HyperText Markup Language) parser; jsoup, for example, is a Java HTML parser that can directly parse a URL address or HTML text content.
In this embodiment, by parsing the target web page with jsoup or another web page parser, the web page content containing the tags can be obtained. For example, the following is an HTML document:
[Example HTML document shown as an image in the original publication: Figure PCTCN2021124394-appb-000001]
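The original publication shows the example document only as an image. A comparable HTML document using the tags discussed below (<html>, <head>, <title>, <body>, <p>, <b>, <a> with an href attribute) can be parsed with, for instance, Python's standard-library parser standing in for jsoup; the sample text and URL are illustrative, not taken from the patent:

```python
from html.parser import HTMLParser

# Illustrative stand-in for the figure: an HTML document with the tags
# described in the text (sample content is an assumption).
SAMPLE_HTML = """\
<html>
  <head><title>Example page</title></head>
  <body>
    <p>This is a <b>sample</b> paragraph.</p>
    <a href="https://example.com">a link</a>
  </body>
</html>"""


class TagCollector(HTMLParser):
    """Records every opening tag (and its attributes) seen while parsing,
    mirroring the 'obtain all tags of the target web page' step."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print([t for t, _ in collector.tags])
# ['html', 'head', 'title', 'body', 'p', 'b', 'a']
```

The `href` attribute of the `<a>` tag is exposed in the collected attribute dictionary, matching the attribute-node description below.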
It should be noted that tags appear in pairs; that is, the tags in this embodiment are specifically tag pairs. For example, <html> and </html> form one tag pair, and <head> and </head> form another. <html> defines the HTML document, and the <html> and </html> tags delimit the start and end points of the document; <head> defines information about the document; <title> defines the title of the document; <body> defines the body of the document; <p> defines a document paragraph; <b> defines bold text; <a> defines a hyperlink, and the most important attribute of the <a> tag is the href attribute, which indicates the target of the link.
The HTML document is converted into a DOM tree structure by the HTML parser. The DOM tree structure is the parse tree output by the HTML parser and contains element nodes, text nodes, attribute nodes, and comment nodes. It is an object representation of the HTML document and serves as the external interface of the HTML elements for JavaScript and other callers. There are multiple branches in the DOM tree structure and multiple layers in each branch; the layer structure is the relationship between element nodes.
As a specific example, when the web page parser parses the HTML document into a DOM tree structure, the nodes that do not contain text information (that is, functional code) can be removed from the web page — for example, meaningless HTML tags such as <style>, </style>, <script>, and </script>.
In this embodiment, the target web page is parsed into a DOM tree structure, and the tags in the DOM tree structure are used as traversal nodes, so the tags can be traversed more conveniently.
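The pruning step described above — dropping functional nodes such as <style> and <script> that carry no page text — might be sketched as follows with the standard-library parser; the skip list is taken from the example tags named in the text:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"style", "script"}  # functional nodes with no visible page text


class TextOnlyExtractor(HTMLParser):
    """Collects visible text while skipping <style>/<script> subtrees."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    extractor = TextOnlyExtractor()
    extractor.feed(html)
    return extractor.chunks
```

For example, `visible_text("<body><script>var x=1;</script><p>Hi</p></body>")` keeps only `["Hi"]`: the script body is discarded before any traversal of the remaining tags.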
Step S302: obtain the tags in the web page structure.
It should be noted that the element nodes of the DOM tree structure are the tag elements of the web page, the text nodes are contained within the element nodes, and the attribute nodes describe the elements in detail; as shown in the above example, the attribute of the <a> tag is the href attribute. It should be understood that not all element nodes contain attributes.
In this embodiment, the tags of the target web page can be obtained from the generated DOM tree structure, which improves the performance of web page parsing while obtaining the web page information completely and accurately.
This embodiment extracts the web page structure with the web page parser and obtains the tags from that structure, so the tags of the target web page can be obtained more quickly and easily.
Step S202: obtain tag pairs from the tags, traverse all the tag pairs, search for characters in each tag pair, take a character satisfying a first preset condition as a first character, and take a character satisfying a second preset condition as a second character.
A web page is a page composed of various tags. The tags displayed on a web page follow the tag specification, and the level of each tag can be distinguished from its tag element. This is illustrated with the following example:
[Example of nested tags shown as images in the original publication: Figures PCTCN2021124394-appb-000002 and PCTCN2021124394-appb-000003]
The above example contains three tag elements, namely <dd>, <ul>, and <li>. It can be seen that the tags are divided into three layers, with each tag element corresponding to one layer.
In this embodiment, tags exist in the form of tag pairs; by traversing with each tag as a node, the page content of the web page corresponding to that tag can be obtained.
Specifically, the steps for traversing all tag pairs are as follows:
Step S401: use the outermost tag in the web page structure as the initial node.
In this embodiment, the tags are used as traversal nodes to perform a depth-first traversal of all tag pairs. Specifically, the outermost tag in the web page structure is taken as the first layer, the tags nested in the first layer as the second layer, and so on, forming a tree structure. With the first-layer tag as the initial node, the traversal starts from the initial node and proceeds depth-first from each unvisited adjacent node of the initial node in turn, until all nodes have been visited.
Taking the above example for further illustration: the tag with element <dd> is the outermost tag and forms the first layer; the tag with element <ul> is nested in <dd> and forms the second layer; the tag with element <li> is nested in <ul> and forms the third layer. Specifically, with <dd> as the initial node, the adjacent node of <dd> is <ul>, and the adjacent node of <ul> is <li>, so the depth-first traversal order is: <dd> → <ul> → <li>.
Step S402: starting from the initial node, perform a depth-first traversal of all tag pairs, using the tags in the DOM tree structure as traversal nodes.
The element nodes contained in the generated DOM tree structure are the tag elements of the web page, so with the tag elements in the DOM tree structure as nodes, the tag pairs corresponding to the tag elements are traversed depth-first in turn. For example, if a tag element in the DOM tree structure is ul, the corresponding tag pair is <ul> and </ul>.
The depth-first traversal proceeds as follows, starting from an initial node v:
a. Visit the initial node v and mark it as visited;
b. Find the first adjacent node w of the initial node v;
c. If node w exists, continue with step d; otherwise the traversal ends;
d. If node w has not been visited, perform a recursive depth-first traversal on node w (that is, treat node w as another initial node v and perform steps a, b and c);
e. Find the next adjacent node of the initial node v and go to step c.
For example, suppose the web page structure is: tag A is the first layer; tags B and C are nested in tag A and form the second layer; tags D and E are nested in tag B, and tag F is nested in tag C, so tags D, E and F form the third layer. Taking tag A as the initial node, tag A has adjacent nodes B and C, tag B has adjacent nodes D and E, and tag C has adjacent node F, so the depth-first traversal order is: tag A → tag B → tag D → tag E → tag C → tag F.
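The traversal order above can be sketched as a small recursive routine. The sketch below is illustrative only: the `TagNode` class and the labels A–F are hypothetical stand-ins for real web page tags, and a real implementation would traverse DOM element nodes instead.

```java
import java.util.ArrayList;
import java.util.List;

public class TagTraversal {
    // Hypothetical tag-tree node; a real implementation would wrap DOM elements.
    static class TagNode {
        final String name;
        final List<TagNode> children = new ArrayList<>();
        TagNode(String name) { this.name = name; }
        TagNode add(TagNode child) { children.add(child); return this; }
    }

    // Depth-first traversal: visit the node, then recurse into each adjacent
    // (nested) node in order, so deeper layers are exhausted before siblings.
    static void depthFirst(TagNode node, List<String> order) {
        order.add(node.name);
        for (TagNode child : node.children) {
            depthFirst(child, order);
        }
    }

    // Builds the three-layer example from the text: A is the first layer,
    // B and C the second, and D, E and F the third.
    static List<String> exampleOrder() {
        TagNode a = new TagNode("A");
        TagNode b = new TagNode("B");
        TagNode c = new TagNode("C");
        a.add(b).add(c);
        b.add(new TagNode("D")).add(new TagNode("E"));
        c.add(new TagNode("F"));
        List<String> order = new ArrayList<>();
        depthFirst(a, order);
        return order;
    }

    public static void main(String[] args) {
        System.out.println(exampleOrder()); // [A, B, D, E, C, F]
    }
}
```

Running the sketch visits the nodes in the order given above, tag A → tag B → tag D → tag E → tag C → tag F.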
Web page tags follow the tag specification, with a format like <lable>page content</lable>. In this embodiment, each tag pair is traversed and the page content is extracted: during the traversal, within the layer corresponding to each tag pair, characters are searched, a character satisfying a first preset condition is taken as a first character, and a character satisfying a second preset condition is taken as a second character.
In this embodiment, the first preset condition is specifically a greater-than character (">") appearing in the layer corresponding to the currently visited tag pair, and the second preset condition is specifically a less-than character ("<") appearing after that greater-than character in the same layer. It should be understood that the ">" and "<" characters are tag characters.
It should be noted that a tag pair may contain multiple ">" characters and multiple "<" characters. The first occurrence of ">" in the tag pair is taken as a first character, and the "<" that follows it is the corresponding second character; the second occurrence of ">" is again a first character, and the "<" that follows it is the corresponding second character, and so on. In this embodiment, the first character and the second character are ordered: the first character precedes the second character, and the content between them is page content that may need to be extracted.
In this embodiment, all characters in each tag pair can be searched, and the first characters and second characters that are found are marked respectively.
Specifically, all characters in a tag pair can be searched using a string search method. String search methods include, but are not limited to, the following:
1. int indexOf(String str): returns the index of the first occurrence of the specified substring in this string.
2. int indexOf(String str, int startIndex): returns the index of the first occurrence of the specified substring in this string, starting the search at the specified index.
3. int lastIndexOf(String str): returns the index of the rightmost occurrence of the specified substring in this string.
4. int lastIndexOf(String str, int startIndex): searches backward starting at the specified index and returns the index of the last occurrence of the specified substring in this string.
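As a rough illustration of how these methods locate the tag characters, the snippet below applies `indexOf` and `lastIndexOf` to a made-up tag pair; the sample string and the helper names are hypothetical, not part of the application.

```java
public class TagCharSearch {
    // Index of the tail character '>' of the head tag: its first occurrence.
    static int headTagEnd(String tagPair) {
        return tagPair.indexOf(">");
    }

    // Index of the head character '<' of the tail tag: its last occurrence.
    static int tailTagStart(String tagPair) {
        return tagPair.lastIndexOf("<");
    }

    public static void main(String[] args) {
        String s = "<ul><li>item</li></ul>";
        // All first/second characters between these two positions delimit
        // candidate page content.
        System.out.println(headTagEnd(s) + " .. " + tailTagStart(s)); // 3 .. 17
        // The startIndex overload continues the search from a given position:
        System.out.println(s.indexOf(">", headTagEnd(s) + 1)); // 7
    }
}
```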
For each tag pair, the tail character of the head tag is the first occurrence of the first character, and the head character of the tail tag is the last occurrence of the second character. Within each tag pair, all first characters and second characters between the first occurrence of the first character and the last occurrence of the second character are found.
When a web page is redesigned, its page content does not change much, but its tags do. If the crawler obtained data by parsing it tag by tag, row by row and column by column, the crawling script would have to be rewritten. In this embodiment, page content is extracted by determining the character length between the first character and the second character, which avoids having to write different scripts for different websites or after a page redesign; no script changes are needed, and the adaptability of the crawler is enhanced.
In this embodiment, the first character and the second character in each tag pair are determined by depth-first traversal, which ensures that the page content is extracted completely and without omission.
Step S203: taking the first character as a starting point and the second character as an end point, read out the target characters between the starting point and the end point.
In this embodiment, a tag pair may contain multiple ">" characters and multiple "<" characters. The first occurrence of ">" is taken as a starting point and the "<" that follows it as the corresponding end point; the second occurrence of ">" is taken as a starting point and the "<" that follows it as the corresponding end point, and so on; the target characters between each starting point and end point are read out.
Step S204: judge whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extract the target characters as page content.
Specifically, the length of the target characters between the first character and the second character is recorded; if the length of the target characters satisfies the extraction length, the target characters are taken as page content and extracted. The extracted page content can be saved for convenient viewing by the user.
In this embodiment, the tag format is <lable>page content</lable>, and the page content between the first character ">" and the second character "<" is to be extracted.
For example, consider the following tag pair:
<li><em>&sdot;</em><a href="/topics/50245210" target="_black">关于ADO和数字符号的问题</a></li>
The page content "关于ADO和数字符号的问题" (a question about ADO and number symbols) needs to be extracted. This tag pair contains multiple first characters and second characters, so the character length between each pair of first and second characters is recorded, and whether a segment is the page content to be extracted is determined according to its length.
The extraction length is set according to the actual situation. When it is determined that the target characters between the first character and the second character satisfy the preset extraction length, the target characters are extracted as page content.
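A minimal sketch of this length-based extraction, applied to the tag pair above, follows. The threshold of 7 characters is an assumed value chosen so that only the topic text survives (the entity "&sdot;" is 6 characters long); it is not a value prescribed by the application.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentExtractor {
    // Collect every character run between a '>' and the next '<', then keep
    // only runs whose length meets the preset extraction length.
    public static List<String> extract(String tagPair, int minLength) {
        List<String> contents = new ArrayList<>();
        int start = tagPair.indexOf('>');              // first character: '>'
        while (start >= 0) {
            int end = tagPair.indexOf('<', start + 1); // second character: '<'
            if (end < 0) break;
            String target = tagPair.substring(start + 1, end); // target characters
            if (target.length() >= minLength) {        // extraction condition
                contents.add(target);
            }
            start = tagPair.indexOf('>', end + 1);
        }
        return contents;
    }

    public static void main(String[] args) {
        String tagPair = "<li><em>&sdot;</em>"
                + "<a href=\"/topics/50245210\" target=\"_black\">关于ADO和数字符号的问题</a></li>";
        System.out.println(extract(tagPair, 7)); // [关于ADO和数字符号的问题]
    }
}
```

With the assumed threshold, the empty runs between adjacent tags and the short "&sdot;" run are discarded, and only the page content is kept.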
It should be emphasized that, in order to further ensure the privacy and security of the above page content, the page content may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
In this embodiment, whether the characters between the first character and the second character are page content to be extracted is determined by character length, which avoids extracting useless content, improves the accuracy of the data crawled by the crawler, and improves crawling efficiency.
In some optional implementations of this embodiment, after step S203, the following step may also be performed:
storing the page content in JSON format.
Specifically, the extracted page content is stored in JSON format, or stored in a database, and the stored page content can be used directly for data analysis. Storage methods include, but are not limited to, storing to a text file in JSON format, storing to Excel, storing to SQLite (a lightweight database), and storing to a MySQL (relational database management system) database.
JSON, short for JavaScript Object Notation, can represent any supported type, such as strings, numbers, objects and arrays. In this embodiment, a path to a JSON file for saving page content is set, and the extracted page content is written to the JSON file through that path for storage.
It should be noted that the JSON format ensures that the stored data can be inspected intuitively when the file is opened, with one record stored per line; this approach suits cases where the amount of crawled data is small, and subsequent reading and analysis are convenient. If the crawled data is easily organized into tabular form, it can be stored in Excel, where the data is easier to inspect and some simple operations can be performed. SQLite requires no installation and is a zero-configuration database; when the volume of crawled data is large, persistent storage is needed and no other database is installed, SQLite can be chosen for storage. MySQL supports remote access, which means the data can be stored on a remote server host.
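The one-record-per-line storage described above can be sketched as follows. The `pageContent` field name and the output path are hypothetical, and the hand-rolled escaping covers only quotes and backslashes; a real implementation would use a JSON library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class JsonLineStore {
    // Minimal JSON string escaping for illustration: quotes and backslashes
    // only (a full implementation would also escape control characters).
    static String toJsonLine(String content) {
        String escaped = content.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"pageContent\":\"" + escaped + "\"}";
    }

    // Appends one JSON record per line to the configured file path.
    static void store(Path path, String content) throws IOException {
        Files.writeString(path, toJsonLine(content) + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        store(Path.of("page_content.json"), "关于ADO和数字符号的问题");
        System.out.println(toJsonLine("关于ADO和数字符号的问题"));
    }
}
```

Each stored line is a self-contained JSON object, so the file can be read back and analyzed record by record.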
In the present application, all tag pairs of a target web page are traversed, the first character and the second character in each tag pair are found, the target characters between the first character (as the starting point) and the second character (as the end point) are read out, and when the target characters are determined to satisfy the extraction condition, they are extracted as page content. That is, web page data is crawled by searching for the corresponding characters in the tags, which avoids having to write different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data crawling efficiency.
The present application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, executed by a computer. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware, and the computer-readable instructions may be stored in a computer-readable storage medium; when the program is executed, it may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM), or the like.
It should be understood that although the steps in the flowcharts of the accompanying drawings are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to FIG. 5, as an implementation of the method shown in FIG. 2, the present application provides an embodiment of a crawler-based data crawling apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
As shown in FIG. 5, the crawler-based data crawling apparatus 500 of this embodiment includes: an acquisition module 501, a traversal module 502 and an extraction module 503, wherein:
the acquisition module 501 is configured to acquire a target web page and parse the target web page to obtain all tags of the target web page;
the traversal module 502 is configured to obtain tag pairs according to the tags, traverse all the tag pairs, search the characters in each tag pair, take a character satisfying a first preset condition as a first character, and take a character satisfying a second preset condition as a second character;
a reading module is configured to take the first character as a starting point and the second character as an end point, and read out the target characters between the starting point and the end point;
the extraction module 503 is configured to judge whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extract the target characters as page content.
It should be emphasized that, in order to further ensure the privacy and security of the above page content, the page content may also be stored in a node of a blockchain.
The above crawler-based data crawling apparatus traverses all tag pairs of a target web page, finds the first character and the second character in each tag pair, reads out the target characters between the first character (as the starting point) and the second character (as the end point), and, when the target characters are determined to satisfy the extraction condition, extracts them as page content. That is, web page data is crawled by searching for the corresponding characters in the tags, which avoids having to write different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data crawling efficiency.
In this embodiment, the acquisition module 501 includes a parsing sub-module and an acquisition sub-module. The parsing sub-module is configured to extract the web page structure of the target web page using a web page parser; the acquisition sub-module is configured to acquire the tags in the web page structure.
In this embodiment, the web page structure is extracted by a web page parser and the tags are obtained from the web page structure, so the tags of the target web page can be obtained more quickly and easily.
In some optional implementations of this embodiment, the parsing sub-module is further configured to:
parse the target web page through the web page parser to obtain an HTML document; and
parse the HTML document to generate a DOM tree structure, and use the generated DOM tree structure as the web page structure of the target web page.
In this embodiment, by parsing the target web page to generate a DOM tree structure, the performance of parsing the web page can be improved and, at the same time, the web page information can be obtained completely and accurately.
In this embodiment, the traversal module 502 is further configured to:
take the outermost tag in the web page structure as an initial node; and
starting from the initial node, perform a depth-first traversal on all the tag pairs, using the tags in the DOM tree structure as traversal nodes.
In some optional implementations of this embodiment, the traversal module 502 is further configured to:
search all the characters in the tag pair using a string search method.
In this embodiment, the first character and the second character in each tag pair are determined by depth-first traversal, which ensures that the page content is extracted completely and without omission.
In this embodiment, the extraction module 503 includes a recording sub-module and an extraction sub-module. The recording sub-module is configured to record the length of the target characters; the extraction sub-module is configured to, upon determining that the length of the target characters satisfies the extraction length, take the target characters as page content and extract the page content.
In this embodiment, whether the characters between the first character and the second character are page content to be extracted is determined by character length, which avoids extracting useless content, improves the accuracy of the data crawled by the crawler, and improves crawling efficiency.
In some optional implementations of this embodiment, the crawler-based data crawling apparatus 500 further includes a storage module configured to store the page content in JSON format.
In this embodiment, the extracted page content is stored in a database, and the stored page content can be used directly for data analysis.
To solve the above technical problems, an embodiment of the present application further provides a computer device. Refer to FIG. 6 for details; FIG. 6 is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 6 includes a memory 61, a processor 62 and a network interface 63 that are communicatively connected to one another through a system bus. It should be noted that the figure shows only a computer device 6 having components 61-63, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device. The computer device may perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, a voice-controlled device or the like.
The memory 61 stores computer-readable instructions, and the processor 62 implements the following steps when executing the computer-readable instructions:
acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character satisfying a first preset condition as a first character, and taking a character satisfying a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading out all target characters between the starting point and the end point; and
judging whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extracting the target characters as page content.
The memory 61 includes at least one type of readable storage medium, and the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the computer device 6. Of course, the memory 61 may also include both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, such as computer-readable instructions of the crawler-based data crawling method. In addition, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 62 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 62 is generally used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to run the computer-readable instructions stored in the memory 61 or to process data, for example, to run the computer-readable instructions of the crawler-based data crawling method.
The network interface 63 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 6 and other electronic devices.
In this embodiment, when the processor executes the computer-readable instructions stored in the memory, the steps of the crawler-based data crawling method of the above embodiments are implemented: all tag pairs of a target web page are traversed, the first character and the second character in each tag pair are found, the target characters between the first character (as the starting point) and the second character (as the end point) are read out, and when the target characters are determined to satisfy the extraction condition, they are extracted as page content. That is, web page data is crawled by searching for the corresponding characters in the tags, which avoids having to write different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data crawling efficiency.
本申请还提供了另一种实施方式,即提供一种计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性。所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令可被至少一个处理器执行,以使所述至少一个处理器执行如上 述的基于爬虫的数据抓取方法的步骤,通过遍历目标网页的所有标签对,查找每个标签对中的第一字符和第二字符,读取出以第一字符为起点,第二字符为终点之间的目标字符,确定目标字符满足提取条件,就将目标字符作为页面内容进行提取,也就是通过查找标签中对应的字符来进行网页页面数据的抓取,可以避免不同网站需要编写不同的脚本进行抓取,增强爬虫的适应性,同时,减少工作量,提高数据抓取效率。The present application also provides another implementation manner, which is to provide a computer-readable storage medium, where the computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the steps of the crawler-based data scraping method as described above, By traversing all the tag pairs of the target web page, finding the first character and the second character in each tag pair, and reading out the target character between the first character as the starting point and the second character as the end point, it is determined that the target character satisfies the extraction condition, the target characters are extracted as page content, that is, the web page data is crawled by finding the corresponding characters in the tags, which can avoid the need for different websites to write different scripts for crawling, enhance the adaptability of the crawler, and at the same time , reduce workload and improve data capture efficiency.
Wherein, when the computer-readable instructions are executed by the processor, the following steps of the crawler-based data scraping method are implemented:
acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character that satisfies a first preset condition as a first character, and taking a character that satisfies a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading all target characters between the starting point and the end point;
determining whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content.
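The steps above can be sketched in Python. This is a minimal illustration only, not the patented implementation: the choice of ">" and "<" as the characters satisfying the first and second preset conditions, and a minimum-length threshold as the extraction condition, are assumptions, since the specification leaves the preset conditions and the extraction condition abstract.

```python
import json

def extract_page_content(html, first_char=">", second_char="<", min_len=5):
    """Scan the raw markup for a character meeting the first preset
    condition (start marker), a character meeting the second preset
    condition (end marker), read the target characters between them,
    and keep the span as page content when it meets the length-based
    extraction condition. Markers and threshold are illustrative."""
    contents = []
    pos = 0
    while True:
        start = html.find(first_char, pos)       # first character (starting point)
        if start == -1:
            break
        end = html.find(second_char, start + 1)  # second character (end point)
        if end == -1:
            break
        target = html[start + 1:end].strip()     # target characters in between
        if len(target) >= min_len:               # extraction condition: length check
            contents.append(target)
        pos = end + 1
    return contents

html = "<div><p>Hello crawler world</p><span>ok</span></div>"
print(json.dumps(extract_page_content(html), ensure_ascii=False))
# prints ["Hello crawler world"]  -- the short "ok" fails the length condition
```

Storing the result with `json.dumps` matches the JSON storage step recited later in claim 7.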
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
Obviously, the embodiments described above are only some, not all, of the embodiments of the present application. The accompanying drawings show preferred embodiments of the present application but do not limit its patent scope. The present application may be embodied in many different forms; rather, these embodiments are provided so that the disclosure of the present application is understood thoroughly and completely. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments, or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the patent protection scope of the present application.

Claims (20)

  1. A crawler-based data scraping method, comprising the following steps:
    acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
    obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character that satisfies a first preset condition as a first character, and taking a character that satisfies a second preset condition as a second character;
    taking the first character as a starting point and the second character as an end point, reading all target characters between the starting point and the end point;
    determining whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content.
  2. The crawler-based data scraping method according to claim 1, wherein the step of parsing the target web page to obtain the tags of the target web page comprises:
    extracting a web page structure of the target web page by using a web page parser;
    obtaining the tags in the web page structure.
  3. The crawler-based data scraping method according to claim 2, wherein the step of extracting the web page structure of the target web page by using a web page parser comprises:
    parsing the target web page by the web page parser to obtain an HTML document;
    parsing the HTML document to generate a DOM tree structure, and using the generated DOM tree structure as the web page structure of the target web page.
  4. The crawler-based data scraping method according to claim 3, wherein the step of traversing all the tag pairs comprises:
    using the outermost tag in the web page structure as an initial node;
    starting from the initial node and using the tags in the DOM tree structure as traversal nodes, performing a depth-first traversal of all the tag pairs.
  5. The crawler-based data scraping method according to claim 1, wherein the step of searching the characters in each tag pair comprises:
    searching all the characters in the tag pair by using a string search method.
  6. The crawler-based data scraping method according to claim 1, wherein the step of determining whether the target characters satisfy the extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content comprises:
    recording the length of the target characters;
    when the length of the target characters is determined to satisfy an extraction length, using the target characters as the page content, and extracting the page content.
  7. The crawler-based data scraping method according to claim 1, wherein, after the step of extracting the target characters as page content, the method further comprises:
    storing the page content in JSON format.
  8. A crawler-based data scraping apparatus, comprising:
    an acquisition module, configured to acquire a target web page and parse the target web page to obtain all tags of the target web page;
    a traversal module, configured to obtain tag pairs according to the tags, traverse all the tag pairs, search the characters in each tag pair, take a character that satisfies a first preset condition as a first character, and take a character that satisfies a second preset condition as a second character;
    a reading module, configured to take the first character as a starting point and the second character as an end point, and read the target characters between the starting point and the end point;
    an extraction module, configured to determine whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extract the target characters as page content.
  9. A computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the following steps of a crawler-based data scraping method are implemented:
    acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
    obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character that satisfies a first preset condition as a first character, and taking a character that satisfies a second preset condition as a second character;
    taking the first character as a starting point and the second character as an end point, reading all target characters between the starting point and the end point;
    determining whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content.
  10. The computer device according to claim 9, wherein the step of parsing the target web page to obtain the tags of the target web page comprises:
    extracting a web page structure of the target web page by using a web page parser;
    obtaining the tags in the web page structure.
  11. The computer device according to claim 10, wherein the step of extracting the web page structure of the target web page by using a web page parser comprises:
    parsing the target web page by the web page parser to obtain an HTML document;
    parsing the HTML document to generate a DOM tree structure, and using the generated DOM tree structure as the web page structure of the target web page.
  12. The computer device according to claim 11, wherein the step of traversing all the tag pairs comprises:
    using the outermost tag in the web page structure as an initial node;
    starting from the initial node and using the tags in the DOM tree structure as traversal nodes, performing a depth-first traversal of all the tag pairs.
  13. The computer device according to claim 9, wherein the step of searching the characters in each tag pair comprises:
    searching all the characters in the tag pair by using a string search method.
  14. The computer device according to claim 9, wherein the step of determining whether the target characters satisfy the extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content comprises:
    recording the length of the target characters;
    when the length of the target characters is determined to satisfy an extraction length, using the target characters as the page content, and extracting the page content.
  15. The computer device according to claim 9, wherein, after the step of extracting the target characters as page content, the method further comprises:
    storing the page content in JSON format.
  16. A computer-readable storage medium, storing computer-readable instructions which, when executed by a processor, implement the following steps of a crawler-based data scraping method:
    acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
    obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character that satisfies a first preset condition as a first character, and taking a character that satisfies a second preset condition as a second character;
    taking the first character as a starting point and the second character as an end point, reading all target characters between the starting point and the end point;
    determining whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content.
  17. The computer-readable storage medium according to claim 16, wherein the step of parsing the target web page to obtain the tags of the target web page comprises:
    extracting a web page structure of the target web page by using a web page parser;
    obtaining the tags in the web page structure.
  18. The computer-readable storage medium according to claim 17, wherein the step of extracting the web page structure of the target web page by using a web page parser comprises:
    parsing the target web page by the web page parser to obtain an HTML document;
    parsing the HTML document to generate a DOM tree structure, and using the generated DOM tree structure as the web page structure of the target web page.
  19. The computer-readable storage medium according to claim 18, wherein the step of traversing all the tag pairs comprises:
    using the outermost tag in the web page structure as an initial node;
    starting from the initial node and using the tags in the DOM tree structure as traversal nodes, performing a depth-first traversal of all the tag pairs.
  20. The computer-readable storage medium according to claim 16, wherein the step of searching the characters in each tag pair comprises:
    searching all the characters in the tag pair by using a string search method.
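Claims 4, 12 and 19 recite taking the outermost tag of the web page structure as the initial node and depth-first traversing the tag pairs of the DOM tree. The traversal order can be sketched as follows; modeling DOM nodes as plain dicts with `tag` and `children` keys is an assumption made purely for illustration, since the claims do not prescribe a node representation.

```python
def dfs_tag_pairs(root):
    """Depth-first traversal of a DOM tree, starting from the outermost
    tag (the initial node) and visiting every tag-pair node once."""
    stack = [root]                                  # explicit stack -> depth-first order
    visited = []
    while stack:
        node = stack.pop()
        visited.append(node["tag"])
        # push children in reverse so the left-most child is visited first
        for child in reversed(node.get("children", [])):
            stack.append(child)
    return visited

# hypothetical DOM tree for <html><body><h1>...</h1><p>...</p></body></html>
dom = {"tag": "html", "children": [
    {"tag": "body", "children": [
        {"tag": "h1"},
        {"tag": "p"},
    ]},
]}
print(dfs_tag_pairs(dom))   # prints ['html', 'body', 'h1', 'p']
```

An explicit stack is used rather than recursion so that deeply nested pages cannot exhaust the call stack; either form yields the same depth-first visiting order.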
PCT/CN2021/124394 2021-02-25 2021-10-18 Crawler-based data crawling method and apparatus, computer device, and storage medium WO2022179128A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110213211.3A CN112925968A (en) 2021-02-25 2021-02-25 Crawler-based data capturing method and device, computer equipment and storage medium
CN202110213211.3 2021-02-25

Publications (1)

Publication Number Publication Date
WO2022179128A1 true WO2022179128A1 (en) 2022-09-01

Family

ID=76171932

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124394 WO2022179128A1 (en) 2021-02-25 2021-10-18 Crawler-based data crawling method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112925968A (en)
WO (1) WO2022179128A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925968A (en) * 2021-02-25 2021-06-08 深圳壹账通智能科技有限公司 Crawler-based data capturing method and device, computer equipment and storage medium
CN116881595B (en) * 2023-09-06 2023-12-15 江西顶易科技发展有限公司 Customizable webpage data crawling method

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
US20170192882A1 (en) * 2016-01-06 2017-07-06 Hcl Technologies Limited Method and system for automatically generating a plurality of test cases for an it enabled application
CN108804458A (en) * 2017-05-02 2018-11-13 阿里巴巴集团控股有限公司 A kind of reptile web retrieval method and apparatus
CN110472126A (en) * 2018-05-10 2019-11-19 中国移动通信集团浙江有限公司 A kind of acquisition methods of page data, device and equipment
CN110764994A (en) * 2019-09-04 2020-02-07 深圳壹账通智能科技有限公司 Page element packaging method and device, electronic equipment and storage medium
CN110874428A (en) * 2019-11-11 2020-03-10 汉口北进出口服务有限公司 Structured data extraction device and method for e-commerce page and readable storage medium
CN111125598A (en) * 2019-12-20 2020-05-08 深圳壹账通智能科技有限公司 Intelligent data query method, device, equipment and storage medium
CN111737623A (en) * 2020-06-19 2020-10-02 深圳市小满科技有限公司 Webpage information extraction method and related equipment
CN111797336A (en) * 2020-07-07 2020-10-20 北京明略昭辉科技有限公司 Webpage parsing method and device, electronic equipment and medium
CN112925968A (en) * 2021-02-25 2021-06-08 深圳壹账通智能科技有限公司 Crawler-based data capturing method and device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314520A (en) * 2011-10-24 2012-01-11 莫雅静 Webpage text extraction method and device based on statistical backtracking positioning
CN103577466B (en) * 2012-08-03 2017-02-15 腾讯科技(深圳)有限公司 Method and device for displaying webpage content in browser
CN104866512B (en) * 2014-02-26 2018-09-07 腾讯科技(深圳)有限公司 Extract the method, apparatus and system of web page contents
CN107861974B (en) * 2017-09-19 2018-12-25 北京金堤科技有限公司 A kind of adaptive network crawler system and its data capture method


Also Published As

Publication number Publication date
CN112925968A (en) 2021-06-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927552

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.12.2023)