WO2022179128A1 - Crawler-based data crawling method and apparatus, computer device, and storage medium - Google Patents


Info

Publication number: WO2022179128A1
Authority: WO (WIPO, PCT)
Prior art keywords: character, web page, target, tag, characters
Application number: PCT/CN2021/124394
Other languages: French (fr), Chinese (zh)
Inventor: Zheng Rugang (郑如刚)
Original Assignee: Shenzhen OneConnect Smart Technology Co., Ltd. (深圳壹账通智能科技有限公司)
Application filed by Shenzhen OneConnect Smart Technology Co., Ltd. (深圳壹账通智能科技有限公司)
Publication of WO2022179128A1 publication Critical patent/WO2022179128A1/en

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F16/00 Information retrieval; Database structures therefor; File system structures therefor › G06F16/90 Details of database functions independent of the retrieved data types › G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL] › G06F16/9562 Bookmark management
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking › G06F16/986 Document structures and storage, e.g. HTML extensions

Definitions

  • The present application relates to the technical field of big-data collection, and in particular to a crawler-based data capture method, apparatus, computer device, and storage medium.
  • In the related art, crawler technology is usually used for data capture.
  • Crawler technology refers to a program that locates web pages through their link addresses and automatically retrieves web page content according to certain rules.
  • The purpose of the embodiments of the present application is to propose a crawler-based data grabbing method, apparatus, computer device, and storage medium, so as to solve the problems in the related art that crawler-based data grabbing involves a huge workload, is time-consuming and labor-intensive, and has low grabbing efficiency.
  • the embodiments of the present application provide a crawler-based data capture method, which adopts the following technical solutions:
  • the embodiments of the present application also provide a crawler-based data grabbing device, which adopts the following technical solutions:
  • an acquisition module used to acquire a target webpage, parse the target webpage, and obtain all tags of the target webpage
  • The traversal module is used to obtain tag pairs according to the tags, traverse all the tag pairs, search for the characters in each tag pair, take the character that satisfies the first preset condition as the first character, and take the character that satisfies the second preset condition as the second character;
  • a reading module configured to take the first character as the starting point and the second character as the ending point, and read out the target character between the starting point and the ending point;
  • the extraction module is used for judging whether the target character satisfies the extraction condition, and extracting the target character as the page content when it is determined that the extraction condition is satisfied.
  • the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
  • The computer device comprises a memory and a processor; the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the steps of the following crawler-based data grabbing method are implemented:
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the steps of the crawler-based data grabbing method described below are implemented:
  • The present application acquires a target web page and parses it to obtain all tags of the target web page; obtains tag pairs according to the tags; traverses all the tag pairs and searches for the characters in each tag pair, taking the character that satisfies the first preset condition as the first character and the character that satisfies the second preset condition as the second character; takes the first character as the starting point and the second character as the end point, and reads out the target characters between them; and judges whether the target characters satisfy the extraction condition, extracting them as the page content when the condition is met. In other words, by traversing all the tag pairs of the target web page, finding the first character and the second character in each tag pair, and extracting the target characters between them that meet the extraction condition, the present application crawls web page data by finding the characters in the tags that meet preset conditions. This avoids having to write different crawling scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data-crawling efficiency.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a crawler-based data capture method according to the present application;
  • Fig. 3 is a flow chart of a specific implementation manner of step S201 in Fig. 2;
  • Fig. 4 is a flow chart of a specific implementation manner of step S202 in Fig. 2;
  • FIG. 5 is a schematic structural diagram of an embodiment of a crawler-based data grabbing device according to the present application.
  • FIG. 6 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • a web crawler also known as a web spider or a web robot, refers to a program or script that automatically grabs web information according to certain rules.
  • HTML: HyperText Markup Language.
  • HTML is a descriptive markup language used to describe how the content in hypertext is displayed.
  • Tags, also known as labels, are an HTML web term; each tag is used to specify a specific meaning.
  • the present application provides a crawler-based data capture method, which can be applied to the system architecture shown in FIG. 1 .
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • The terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptops, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • The crawler-based data grabbing method provided by the embodiment of the present application is generally executed by a server, and accordingly, the crawler-based data grabbing apparatus is generally deployed in the server.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • Referring to FIG. 2, a flowchart of an embodiment of the crawler-based data crawling method according to the present application is shown, including the following steps:
  • Step S201: a target web page is acquired, and the target web page is parsed to obtain all tags of the target web page.
  • The crawler obtains the web page at the address given by the provided entry URL, and this web page is the target web page to be crawled.
  • the crawler will identify the tags of the target webpage and extract the content of the webpage from the tags.
  • The steps of parsing the target web page to obtain the tags of the target web page are as follows:
  • Step S301: a web page parser is used to extract the web page structure of the target web page.
  • an HTML document is obtained by parsing the target web page with a web page parser, the HTML document is parsed to generate a DOM tree structure, and the generated DOM tree structure is used as the web page structure of the target web page.
  • A web page parser is also called an HTML (HyperText Markup Language) parser.
  • jsoup is a Java HTML parser, which can directly parse a URL address and HTML text content.
  • the web page content containing tags can be acquired.
  • the following is an HTML document:
  • Tags appear in pairs; that is, the tags in this embodiment are specifically tag pairs. For example, <html> and </html> are one tag pair, and <head> and </head> are one tag pair. <html> defines the HTML document, and the <html> and </html> tags define the start and end points of the document; <head> defines information about the document; <title> defines the title of the document; <body> defines the body of the document; <p> defines a document paragraph; <b> defines bold font; <a> defines a hyperlink.
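As a minimal illustration (the sample HTML document referenced above is not reproduced in this text, so the snippet below substitutes a hypothetical one), the pairing of start and end tags can be sketched with Python's standard-library HTMLParser; this is an illustrative sketch, not the patent's implementation:

```python
from html.parser import HTMLParser

# Hypothetical sample document using the tags discussed above.
HTML_DOC = """<html><head><title>Demo</title></head>
<body><p>A <b>bold</b> word and a <a href="https://example.com">link</a>.</p></body></html>"""

class PairCollector(HTMLParser):
    """Collects (open, close) tag pairs by matching each end tag
    with the most recent unmatched start tag of the same name."""
    def __init__(self):
        super().__init__()
        self.stack, self.pairs = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            self.pairs.append((f"<{tag}>", f"</{tag}>"))

parser = PairCollector()
parser.feed(HTML_DOC)
print(parser.pairs)  # pairs are recorded in the order their closing tags appear
```

Because pairs are recorded as each closing tag is seen, the innermost pairs (such as `<title>`/`</title>`) appear before the outermost `<html>`/`</html>` pair.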
  • The most important attribute of the <a> tag is the href attribute, which indicates the target of the link.
  • The DOM tree structure is the parse tree output by the HTML parser, including element nodes, text nodes, attribute nodes, and comment nodes. It is the object representation of the HTML document and serves as the external interface to the HTML elements for JavaScript and other callers. There are multiple branches in the DOM tree structure, each branch has multiple layers, and the layer structure reflects the relationships between element nodes.
  • Nodes that do not contain text information (that is, functional code) in the web page can be removed; for example, meaningless HTML tags such as <style>, </style>, <script>, </script>, etc. can be removed.
  • the target webpage is parsed to generate a DOM tree structure, and the tags in the DOM tree structure are used as traversing nodes, so that the tags can be traversed more conveniently.
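A simplified sketch of this step: build a DOM-like tree from the parsed HTML and drop <script>/<style> subtrees, which carry no page text. The dict-based tree and the sample markup here are illustrative assumptions, not the patent's actual data structures:

```python
from html.parser import HTMLParser

SKIP = {"script", "style"}  # functional nodes carrying no page text

class DomBuilder(HTMLParser):
    """Builds a nested dict tree (a simplified DOM) and drops
    <script>/<style> subtrees while parsing."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": [], "text": ""}
        self.stack = [self.root]
        self.skipping = 0  # depth inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
            return
        if self.skipping:
            return
        node = {"tag": tag, "children": [], "text": ""}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skipping -= 1
            return
        if self.skipping:
            return
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if not self.skipping:
            self.stack[-1]["text"] += data.strip()

builder = DomBuilder()
builder.feed("<div><style>p{color:red}</style><p>kept</p></div>")
div = builder.root["children"][0]
print(div["children"][0]["text"])  # the <p> node survives; <style> is dropped
```

The resulting tree can then be walked with the tags as traversal nodes, as the following steps describe.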
  • Step S302: the tags in the web page structure are acquired.
  • The element nodes of the DOM tree structure are the tag elements of the web page, the text nodes are contained within the element nodes, and the attribute nodes are used to describe the elements in detail.
  • For example, the attribute of the <a> tag is the href attribute. It should be understood that not all element nodes contain attributes.
  • the tag of the target webpage can be obtained from the generated DOM tree structure, the performance of parsing the webpage can be improved, and at the same time, the webpage information can be obtained completely and accurately.
  • the web page structure is extracted by the web page parser, and the tags are obtained from the web page structure, so that the tags of the target web page can be obtained more quickly and easily.
  • Step S202: obtain tag pairs according to the tags, traverse all the tag pairs, and search for the characters in each tag pair, taking the character that satisfies the first preset condition as the first character and the character that satisfies the second preset condition as the second character.
  • a web page is a page composed of various tags.
  • the tags displayed on the web page follow the tag specification, and the level of the tags can be distinguished according to the tag elements. Take the following example to illustrate:
  • In the example there are three tag elements, namely <dd>, <ul>, and <li>. It can be seen that the tags are divided into three layers, and each tag element corresponds to one layer.
  • tags exist in the form of tag pairs, and by traversing each tag as a node, the page content of the webpage corresponding to the tag can be obtained.
  • Step S401: the outermost tag in the web page structure is used as an initial node.
  • the tags are used as traversal nodes to perform depth-first traversal on all tag pairs.
  • The outermost tags in the web page structure are taken as the first layer, the tags nested in the first layer are taken as the second layer, and so on, forming a tree structure. The tag of the first layer is taken as the initial node; access starts from the initial node and proceeds to the unvisited adjacent nodes of the initial node, and depth-first traversal is performed until all nodes have been visited.
  • The tag element <dd> is the outermost tag and forms the first layer; the tag element <ul> is nested in <dd> and forms the second layer; the tag element <li> is nested in <ul> and forms the third layer.
  • The adjacent node of <dd> is <ul>, and the adjacent node of <ul> is <li>, so the order of depth-first traversal is: <dd> → <ul> → <li>.
  • Step S402: starting from the initial node, the tags in the DOM tree structure are used as traversal nodes, and depth-first traversal is performed on all tag pairs.
  • The tag elements in the DOM tree structure are used as nodes, and the tag pairs corresponding to the tag elements are traversed depth-first in sequence. For example, if a tag element in the DOM tree structure is ul, the tag pair corresponding to that tag element is <ul> and </ul>.
  • The method of depth-first traversal, starting from the initial node v, is: (a) visit node v and mark it as visited; (b) find the first unvisited adjacent node w of v; (c) if w exists, take w as the new node v and return to step (a); (d) if w does not exist, backtrack to the most recent visited node that still has unvisited adjacent nodes and continue from step (b); if no such node exists, the traversal ends.
  • label A is the first layer; labels nested in label A are label B and label C, then label B and label C are the second layer; labels nested in label B are labels D and label E, the label nested in label C is label F, then label D, label E and label F are the third layer; therefore, label A is used as the initial node, label A has adjacent nodes label B and label C, label B has adjacent node label D and label E, and label C has adjacent node label F.
  • the order of depth-first traversal is: label A ⁇ label B ⁇ label D ⁇ label E ⁇ label C ⁇ label F.
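The traversal order described above can be reproduced with a short depth-first routine. The adjacency table for tags A through F is taken from the example; the function itself is an illustrative sketch:

```python
# Tag tree from the example: A is layer 1; B and C are layer 2;
# D and E nest in B, and F nests in C (layer 3).
TREE = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
        "D": [], "E": [], "F": []}

def dfs(node, tree, order=None):
    """Depth-first traversal: visit a node, then recurse into each
    not-yet-visited child before moving on to the next sibling."""
    if order is None:
        order = []
    order.append(node)
    for child in tree[node]:
        if child not in order:
            dfs(child, tree, order)
    return order

print(dfs("A", TREE))  # ['A', 'B', 'D', 'E', 'C', 'F']
```

The output matches the order stated in the text: tag A → tag B → tag D → tag E → tag C → tag F.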
  • each tag pair is traversed to extract the content of the page.
  • When traversing the tag pairs, the characters in each tag pair are searched; the character that satisfies the first preset condition is taken as the first character, and the character that satisfies the second preset condition is taken as the second character.
  • The first preset condition is specifically the greater-than sign ('>') that appears in the layer corresponding to the currently accessed tag pair; the second preset condition is specifically the less-than sign ('<') that appears next after that '>' character in the same layer. It should be understood that the '>' and '<' characters are tag characters.
  • For example, the '>' character that appears first in the tag pair is taken as a first character, and the '<' character that appears immediately after it is taken as the corresponding second character; the '>' character that appears second is taken as the next first character, and the '<' character that appears immediately after it is taken as the next second character, and so on. The first character and the second character occur in sequence, with the first character in front and the second character behind, and the content between the first character and the second character is the page content that may be extracted.
  • all characters in each label pair can be searched, and the first and second characters found are marked respectively.
  • String search methods include but are not limited to the following methods:
  • indexOf(String str): returns the index of the first occurrence of the specified substring in this string.
  • In a tag pair, the tail character ('>') of the opening tag is the first occurrence of the first character, and the head character ('<') of the closing tag is the last occurrence of the second character.
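A hedged sketch of the character search: scan a fragment for each '>' (a first character) and the next '<' (the matching second character), collecting the text between them as candidate page content. Python's str.find plays the role of the indexOf method mentioned above; the fragment is a made-up example:

```python
def candidate_spans(fragment):
    """Collect the text between each '>' (end of a tag) and the
    next '<' (start of the following tag) in an HTML fragment."""
    spans = []
    i = fragment.find(">")              # str.find mirrors Java's indexOf
    while i != -1:
        j = fragment.find("<", i + 1)   # the next second character
        if j == -1:
            break
        spans.append(fragment[i + 1:j])
        i = fragment.find(">", j)       # move on to the next first character
    return spans

print(candidate_spans("<p>About ADO<b></b> and symbols</p>"))
# ['About ADO', '', ' and symbols']
```

Note that some spans are empty or partial; this is why the following steps filter candidates by character length before extracting them as page content.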
  • When a website is revised, the page content usually does not change greatly, but the tags do. If the crawler obtained data by analyzing the web page tag by tag, the crawling script would need to be rewritten after each revision. In this embodiment, the page content is extracted by determining the character length between the first character and the second character, which avoids having to write different scripts when crawling different websites or when web pages are revised, and enhances the adaptability of the crawler.
  • the first character and the second character in each tag pair are determined by depth-first traversal, which can ensure that the page content is extracted completely and without omission.
  • Step S203: taking the first character as the starting point and the second character as the ending point, the target characters between the starting point and the ending point are read out.
  • Step S204: it is judged whether the target characters satisfy the extraction condition, and when it is determined that the extraction condition is satisfied, the target characters are extracted as the page content.
  • The length of the target characters between the first character and the second character is recorded; if it is determined that the length of the target characters satisfies the extraction length, the target characters are taken as the page content, and the page content is extracted.
  • the extracted page content can be saved for easy viewing by users.
  • The tag format is <label>page content</label>, and the page content between the first character '>' and the second character '<' should be extracted.
  • a tag pair is as follows:
  • the page content "About ADO and Numerical Symbols" needs to be extracted.
  • In this tag pair there are multiple first characters and second characters. Therefore, for each pair of first and second characters, the character length between them is recorded, and whether it is the page content to be extracted is determined according to that length.
  • The extraction length is set according to the actual situation. When it is determined that the target characters between the first character and the second character meet the preset extraction length, the target characters are extracted as the page content.
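The length check in this step might be sketched as follows, with MIN_LEN standing in for the preset extraction length (the value 2 is an arbitrary illustration, and the helper name is hypothetical):

```python
MIN_LEN = 2  # hypothetical extraction length, tuned per site

def extract_page_content(spans, min_len=MIN_LEN):
    """Keep only target-character spans whose stripped length meets
    the extraction length, discarding empty or trivial spans."""
    return [s.strip() for s in spans if len(s.strip()) >= min_len]

spans = ["About ADO", "", " ", "and symbols"]
print(extract_page_content(spans))  # ['About ADO', 'and symbols']
```

Filtering by length discards the empty spans produced by adjacent tags (such as `<b></b>`), which is how useless content is avoided.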
  • the page content can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database, a series of data blocks associated with one another using cryptographic methods. Each data block contains a batch of network transaction information, used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • This embodiment determines whether the page content to be extracted is between the first character and the second character according to the character length, which can avoid extracting useless content, improve the accuracy of crawling data by the crawler, and improve the crawling efficiency.
  • After step S203, the following steps may also be performed:
  • the extracted page content is stored in json format, or stored in a database format, and the stored page content can be directly used for data analysis.
  • the storage methods include, but are not limited to, storing to text files in json format, storing to excel, storing to SQLite (light database), and storing to mySQL (relational database management system) database.
  • JSON, the full name of which is JavaScript Object Notation, is JavaScript object notation; any supported type can be represented by JSON, such as strings, numbers, objects, arrays, etc.
  • a path to a json file for saving page content is set, and the extracted page content is written into the json file through the path for storage.
  • The json format ensures that when the file is opened, the stored data can be checked visually, with one record stored per line.
  • Storing to json text files is suitable for crawling a small amount of data, and subsequent reading and analysis are very convenient. If the crawled data can easily be organized into a table, it can be stored in excel; the data is then convenient to inspect, and excel also supports some simple operations. SQLite is a zero-configuration database that requires no installation; when the amount of crawled data is large and needs persistent storage, and no other database is installed, SQLite can be chosen. mySQL supports remote access, which means data can be stored on a remote server host.
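A minimal sketch of the json storage option, writing one record per line as described above (the file name and record fields are hypothetical):

```python
import json

def save_as_json_lines(records, path):
    """Write one JSON record per line, so the file can be inspected
    visually and parsed line by line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [{"url": "https://example.com", "content": "About ADO"},
           {"url": "https://example.com/2", "content": "and symbols"}]
save_as_json_lines(records, "pages.jsonl")  # hypothetical output path

with open("pages.jsonl", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))  # 2
```

Each line is a complete JSON object, so the stored page content can be read back incrementally for data analysis without loading the whole file.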
  • The present application traverses all the tag pairs of the target web page, finds the first character and the second character in each tag pair, reads out the target characters between the first character (the starting point) and the second character (the end point), and, when the target characters are determined to meet the extraction condition, extracts them as the page content. That is, web page data is crawled by finding the corresponding characters in the tags, which avoids having to write different crawling scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data-capture efficiency.
  • The present application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices, and the like.
  • the application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • The present application provides an embodiment of a crawler-based data grabbing device, which corresponds to the method embodiment shown in FIG. 2. Specifically, the device can be applied to various electronic devices.
  • The crawler-based data capture device 500 in this embodiment includes: an acquisition module 501, a traversal module 502, and an extraction module 503. Specifically:
  • the obtaining module 501 is used to obtain a target web page, parse the target web page, and obtain all tags of the target web page;
  • The traversal module 502 is configured to obtain tag pairs according to the tags, traverse all the tag pairs, search for the characters in each tag pair, take the character that satisfies the first preset condition as the first character, and take the character that satisfies the second preset condition as the second character;
  • a reading module configured to take the first character as the starting point and the second character as the ending point, and read out the target character between the starting point and the ending point;
  • the extraction module 503 is configured to judge whether the target character satisfies the extraction condition, and when it is determined that the extraction condition is satisfied, extract the target character as the page content.
  • the page content can also be stored in a node of a blockchain.
  • The above-mentioned crawler-based data grabbing device traverses all the tag pairs of the target web page, searches for the first character and the second character in each tag pair, reads out the target characters between the first character (the starting point) and the second character (the end point), and, when the target characters are determined to meet the extraction condition, extracts them as the page content. That is, page data is crawled by finding the corresponding characters in the tags, which avoids writing different crawling scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data-scraping efficiency.
  • the acquiring module 501 includes a parsing sub-module and an acquiring sub-module.
  • The parsing sub-module is used to extract the web page structure of the target web page by using a web page parser; the acquiring sub-module is used to acquire the tags in the web page structure.
  • the web page structure is extracted by the web page parser, and the tags are obtained from the web page structure, so that the tags of the target web page can be obtained more quickly and easily.
  • the parsing submodule is further used for:
  • the HTML document is parsed to generate a DOM tree structure, and the generated DOM tree structure is used as the web page structure of the target web page.
  • the performance of parsing the web page can be improved, and at the same time, the web page information can be acquired completely and accurately.
  • the traversal module 502 is further used for:
  • depth-first traversal is performed on all the tag pairs.
  • the traversal module 502 is further configured to:
  • the first character and the second character in each tag pair are determined by depth-first traversal, which can ensure that the page content is extracted completely and without omission.
  • The extraction module 503 includes a recording sub-module and an extraction sub-module; the recording sub-module is used to record the length of the target characters, and the extraction sub-module is used to, when it is determined that the length of the target characters satisfies the extraction length, take the target characters as the page content and extract them.
  • This embodiment determines whether the page content to be extracted is between the first character and the second character according to the character length, which can avoid extracting useless content, improve the accuracy of crawling data by the crawler, and improve the crawling efficiency.
  • the crawler-based data crawling apparatus 500 further includes a storage module, and the storage module is configured to store the page content in json format.
  • the extracted page content is stored in the database, and the stored page content can be directly used for data analysis.
  • FIG. 6 is a block diagram of the basic structure of a computer device according to this embodiment.
  • the computer device 6 includes a memory 61, a processor 62, and a network interface 63 that communicate with each other through a system bus. It should be pointed out that only the computer device 6 with components 61-63 is shown in the figure, but it should be understood that it is not required to implement all of the shown components, and more or less components may be implemented instead.
  • The computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 61 stores computer-readable instructions, and the processor 62 implements the following steps when executing the computer-readable instructions:
  • The memory 61 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6.
  • the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device.
  • the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, such as the computer-readable instructions of the crawler-based data capture method.
  • the memory 61 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 62 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. This processor 62 is typically used to control the overall operation of the computer device 6 . In this embodiment, the processor 62 is configured to execute computer-readable instructions stored in the memory 61 or process data, for example, computer-readable instructions for executing the crawler-based data capture method.
  • the network interface 63 may include a wireless network interface or a wired network interface, and the network interface 63 is generally used to establish a communication connection between the computer device 6 and other electronic devices.
  • the steps of the crawler-based data grabbing method in the above embodiment are implemented: all tag pairs of the target web page are traversed, the first character and the second character in each tag pair are found, the target characters between the first character (as the starting point) and the second character (as the end point) are read out, and when the target characters are determined to satisfy the extraction condition they are extracted as page content. That is, web page data is crawled by finding the corresponding characters in the tags, which avoids writing different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves the efficiency of data crawling.
  • the present application also provides another implementation manner, which is to provide a computer-readable storage medium, where the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores computer-readable instructions that can be executed by at least one processor, so that the at least one processor executes the steps of the crawler-based data scraping method described above: all tag pairs of the target web page are traversed, the first character and the second character in each tag pair are found, the target characters between the first character (as the starting point) and the second character (as the end point) are read out, and when the target characters are determined to satisfy the extraction condition they are extracted as page content. That is, web page data is crawled by finding the corresponding characters in the tags, which avoids writing different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data capture efficiency.
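As a minimal sketch of the steps above — traverse tag pairs, read the characters between each pair, extract those that satisfy the condition — the following Python snippet uses the standard-library `html.parser`. The specific conditions (first/second character positions, the extraction condition being "non-empty text") are illustrative assumptions; the claims do not fix them.

```python
from html.parser import HTMLParser


class TagPairScraper(HTMLParser):
    """Collects the text found between each opening tag and its matching
    closing tag. Only text directly contained in a pair is attributed to
    that pair; the extraction condition is assumed to be "non-empty text"."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._stack = []  # one [tag, text-buffer] entry per open tag pair

    def handle_starttag(self, tag, attrs):
        self._stack.append([tag, ""])

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1] += data  # accumulate the target characters

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            _, text = self._stack.pop()
            target = text.strip()
            if target:  # extraction condition (assumed): non-empty text
                self.results.append(target)


def scrape_page_content(html: str) -> list:
    """Traverse all tag pairs of a page and extract qualifying text."""
    scraper = TagPairScraper()
    scraper.feed(html)
    return scraper.results
```

For example, `scrape_page_content("<html><body><p>Hello</p><b>World</b></body></html>")` yields `["Hello", "World"]`: the empty `<html>` and `<body>` buffers fail the assumed extraction condition and are skipped.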
  • the method of the above embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or CD-ROM) and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) execute the methods described in the various embodiments of this application.


Abstract

A crawler-based data crawling method, comprising: obtaining a target web page, and parsing the target web page to obtain all tags of the target web page (S201); obtaining tag pairs from the tags, traversing all the tag pairs, and searching for characters in each tag pair, taking a character satisfying a first preset condition as a first character and a character satisfying a second preset condition as a second character (S202); taking the first character as a starting point and the second character as an end point, reading the target characters between the starting point and the end point (S203); and determining whether the target characters satisfy an extraction condition and, when the extraction condition is determined to be satisfied, extracting the target characters as page content (S204). The method enhances the adaptability of a crawler, reduces the workload, and improves data crawling efficiency.

Description

Crawler-based data capture method, apparatus, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on February 25, 2021, with application number 202110213211.3 and entitled "Crawler-based data capture method, device, computer equipment and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of big data collection, and in particular to a crawler-based data capture method, apparatus, computer device, and storage medium.
Background Art
In recent years, the Internet has gradually shifted toward big data, and in a big data environment the acquisition of data is crucial. Among data acquisition methods, crawler technology is usually used for data capture. A crawler is a program that finds web pages through their link addresses and automatically obtains web page content according to certain rules.
The inventor found that, at present, a traditional crawler that captures web page data needs to parse the data by locating page elements. That is to say, in the process of capturing data with a crawler, scripts must be written, and different websites require different scripts; likewise, if a web page is redesigned, the capture script must be rewritten. This makes the workload of data capture enormous, time-consuming, labor-intensive, and inefficient.
Summary of the Invention
The purpose of the embodiments of the present application is to propose a crawler-based data capture method, apparatus, computer device, and storage medium, so as to solve the problems in the related art that crawler-based data capture involves an enormous workload, is time-consuming and labor-intensive, and has low capture efficiency.
In order to solve the above technical problem, an embodiment of the present application provides a crawler-based data capture method, which adopts the following technical solution:
obtaining a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs from the tags, traversing all the tag pairs, searching for characters in each tag pair, taking a character satisfying a first preset condition as a first character, and taking a character satisfying a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading out the target characters between the starting point and the end point;
determining whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extracting the target characters as page content.
In order to solve the above technical problem, an embodiment of the present application further provides a crawler-based data capture apparatus, which adopts the following technical solution:
an acquisition module, configured to obtain a target web page and parse the target web page to obtain all tags of the target web page;
a traversal module, configured to obtain tag pairs from the tags, traverse all the tag pairs, search for characters in each tag pair, take a character satisfying a first preset condition as a first character, and take a character satisfying a second preset condition as a second character;
a reading module, configured to take the first character as a starting point and the second character as an end point, and read out the target characters between the starting point and the end point;
an extraction module, configured to determine whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extract the target characters as page content.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solution:
the computer device includes a memory and a processor, the memory stores computer-readable instructions, and the processor implements the following steps of the crawler-based data capture method when executing the computer-readable instructions:
obtaining a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs from the tags, traversing all the tag pairs, searching for characters in each tag pair, taking a character satisfying a first preset condition as a first character, and taking a character satisfying a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading all the target characters between the starting point and the end point;
determining whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extracting the target characters as page content.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solution:
the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the following steps of the crawler-based data capture method are implemented:
obtaining a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs from the tags, traversing all the tag pairs, searching for characters in each tag pair, taking a character satisfying a first preset condition as a first character, and taking a character satisfying a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading all the target characters between the starting point and the end point;
determining whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extracting the target characters as page content.
Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects:
The present application obtains a target web page and parses it to obtain all tags of the target web page; obtains tag pairs from the tags, traverses all the tag pairs, and searches for characters in each tag pair, taking a character satisfying a first preset condition as a first character and a character satisfying a second preset condition as a second character; takes the first character as a starting point and the second character as an end point and reads out the target characters between the starting point and the end point; and determines whether the target characters satisfy an extraction condition, extracting them as page content when they do. In other words, by traversing all the tag pairs of the target web page, finding the first character and the second character in each tag pair, reading out the target characters between them, and extracting the target characters that satisfy the extraction condition, web page data is captured by finding characters in the tags that satisfy preset conditions. This avoids writing different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data capture efficiency.
Brief Description of the Drawings
In order to illustrate the solutions of the present application more clearly, the following briefly introduces the drawings used in describing the embodiments of the present application. Obviously, the drawings described below depict only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
FIG. 2 is a flowchart of an embodiment of the crawler-based data capture method according to the present application;
FIG. 3 is a flowchart of a specific implementation of step S201 in FIG. 2;
FIG. 4 is a flowchart of a specific implementation of step S202 in FIG. 2;
FIG. 5 is a schematic structural diagram of an embodiment of the crawler-based data capture apparatus according to the present application;
FIG. 6 is a schematic structural diagram of an embodiment of a computer device according to the present application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are for the purpose of describing specific embodiments only and are not intended to limit the application. The terms "comprising" and "having" and any variations thereof in the description, claims, and drawings of this application are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the description, claims, or drawings are used to distinguish different objects rather than to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings.
Before the embodiments of the present application are described in further detail, the terms involved in the embodiments are explained; they apply as interpreted below.
1) A crawler (web crawler), also known as a web spider or web robot, is a program or script that automatically captures network information according to certain rules.
2) HTML (HyperText Markup Language) is a descriptive markup language used to describe how the content in hypertext is displayed.
3) A tag, also known as a mark, is an HTML term; each kind of tag specifies a particular meaning.
4) A web page is a page composed of various tags.
In order to solve the problems in the related art that crawler-based data capture involves an enormous workload, is time-consuming and labor-intensive, and has low capture efficiency, the present application provides a crawler-based data capture method, which can be applied to the system architecture 100 shown in FIG. 1. The system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
The server 105 may be a server that provides various services, such as a background server that supports the pages displayed on the terminal devices 101, 102, and 103.
It should be noted that the crawler-based data capture method provided by the embodiments of the present application is generally executed by the server; accordingly, the crawler-based data capture apparatus is generally provided in the server.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
Continuing to refer to FIG. 2, a flowchart of an embodiment of the crawler-based data capture method according to the present application is shown, including the following steps:
Step S201: obtain a target web page, and parse the target web page to obtain all tags of the target web page.
In this embodiment, the crawler obtains the web page at the provided entry URL; this web page is the target web page to be captured. When the crawler captures the target web page, it identifies the tags of the target web page and extracts the content of the web page from the tags.
In some optional implementations of this embodiment, the steps of parsing the target web page to obtain its tags are as follows:
Step S301: use a web page parser to extract the web page structure of the target web page.
Specifically, the target web page is parsed by a web page parser to obtain an HTML document, the HTML document is parsed to generate a DOM tree structure, and the generated DOM tree structure is used as the web page structure of the target web page.
In a web page, the tags are arranged in a tree structure. A web page parser is also called an HTML (HyperText Markup Language) parser; jsoup, for example, is a Java HTML parser that can directly parse a URL address or HTML text content.
In this embodiment, by parsing the target web page with jsoup or another web page parser, the web page content containing the tags can be obtained. For example, the following is an HTML document:
[Example HTML document shown as an image in the original publication: Figure PCTCN2021124394-appb-000001]
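The original publication shows the example document only as an image. A comparable HTML document using the tags discussed below (<html>, <head>, <title>, <body>, <p>, <b>, <a> with an href attribute) can be parsed with, for instance, Python's standard-library parser standing in for jsoup; the sample text and URL are illustrative, not taken from the patent:

```python
from html.parser import HTMLParser

# Illustrative stand-in for the figure: an HTML document with the tags
# described in the text (sample content is an assumption).
SAMPLE_HTML = """\
<html>
  <head><title>Example page</title></head>
  <body>
    <p>This is a <b>sample</b> paragraph.</p>
    <a href="https://example.com">a link</a>
  </body>
</html>"""


class TagCollector(HTMLParser):
    """Records every opening tag (and its attributes) seen while parsing,
    mirroring the 'obtain all tags of the target web page' step."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print([t for t, _ in collector.tags])
# ['html', 'head', 'title', 'body', 'p', 'b', 'a']
```

The `href` attribute of the `<a>` tag is exposed in the collected attribute dictionary, matching the attribute-node description below.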
It should be noted that tags appear in pairs; that is, the tags in this embodiment are specifically tag pairs. For example, <html> and </html> form one tag pair, and <head> and </head> form another. <html> defines the HTML document, and the <html> and </html> tags delimit the start and end points of the document; <head> defines information about the document; <title> defines the title of the document; <body> defines the body of the document; <p> defines a document paragraph; <b> defines bold text; <a> defines a hyperlink, and the most important attribute of the <a> tag is the href attribute, which indicates the target of the link.
The HTML document is converted into a DOM tree structure by the HTML parser. The DOM tree structure is the parse tree output by the HTML parser and contains element nodes, text nodes, attribute nodes, and comment nodes. It is an object representation of the HTML document and serves as the external interface of the HTML elements for JavaScript and other callers. There are multiple branches in the DOM tree structure and multiple layers in each branch; the layer structure is the relationship between element nodes.
As a specific example, when the web page parser parses the HTML document into a DOM tree structure, the nodes that do not contain text information (that is, functional code) can be removed from the web page — for example, meaningless HTML tags such as <style>, </style>, <script>, and </script>.
In this embodiment, the target web page is parsed into a DOM tree structure, and the tags in the DOM tree structure are used as traversal nodes, so the tags can be traversed more conveniently.
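The pruning step described above — dropping functional nodes such as <style> and <script> that carry no page text — might be sketched as follows with the standard-library parser; the skip list is taken from the example tags named in the text:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"style", "script"}  # functional nodes with no visible page text


class TextOnlyExtractor(HTMLParser):
    """Collects visible text while skipping <style>/<script> subtrees."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    extractor = TextOnlyExtractor()
    extractor.feed(html)
    return extractor.chunks
```

For example, `visible_text("<body><script>var x=1;</script><p>Hi</p></body>")` keeps only `["Hi"]`: the script body is discarded before any traversal of the remaining tags.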
Step S302: obtain the tags in the web page structure.
It should be noted that the element nodes of the DOM tree structure are the tag elements of the web page, the text nodes are contained within the element nodes, and the attribute nodes describe the elements in detail; as shown in the above example, the attribute of the <a> tag is the href attribute. It should be understood that not all element nodes contain attributes.
In this embodiment, the tags of the target web page can be obtained from the generated DOM tree structure, which improves the performance of web page parsing while obtaining the web page information completely and accurately.
This embodiment extracts the web page structure with the web page parser and obtains the tags from that structure, so the tags of the target web page can be obtained more quickly and easily.
Step S202: obtain tag pairs from the tags, traverse all the tag pairs, search for characters in each tag pair, take a character satisfying a first preset condition as a first character, and take a character satisfying a second preset condition as a second character.
A web page is a page composed of various tags. The tags displayed on a web page follow the tag specification, and the level of each tag can be distinguished from its tag element. This is illustrated with the following example:
[Example of nested tags shown as images in the original publication: Figures PCTCN2021124394-appb-000002 and PCTCN2021124394-appb-000003]
The above example contains three tag elements, namely <dd>, <ul>, and <li>. It can be seen that the tags are divided into three layers, with each tag element corresponding to one layer.
In this embodiment, tags exist in the form of tag pairs; by traversing with each tag as a node, the page content of the web page corresponding to that tag can be obtained.
Specifically, the steps for traversing all tag pairs are as follows:
Step S401: use the outermost tag in the web page structure as the initial node.
In this embodiment, the tags are used as traversal nodes to perform a depth-first traversal of all tag pairs. Specifically, the outermost tag in the web page structure is taken as the first layer, the tags nested in the first layer as the second layer, and so on, forming a tree structure. With the first-layer tag as the initial node, the traversal starts from the initial node and proceeds depth-first from each unvisited adjacent node of the initial node in turn, until all nodes have been visited.
Taking the above example for further illustration: the tag with element <dd> is the outermost tag and forms the first layer; the tag with element <ul> is nested in <dd> and forms the second layer; the tag with element <li> is nested in <ul> and forms the third layer. Specifically, with <dd> as the initial node, the adjacent node of <dd> is <ul>, and the adjacent node of <ul> is <li>, so the depth-first traversal order is: <dd> → <ul> → <li>.
Step S402: starting from the initial node, perform a depth-first traversal of all tag pairs, using the tags in the DOM tree structure as traversal nodes.
The element nodes contained in the generated DOM tree structure are the tag elements of the web page, so with the tag elements in the DOM tree structure as nodes, the tag pairs corresponding to the tag elements are traversed depth-first in turn. For example, if a tag element in the DOM tree structure is ul, the corresponding tag pair is <ul> and </ul>.
The depth-first traversal proceeds as follows, starting from an initial node v:
a. Visit the initial node v and mark it as visited;
b. Find the first adjacent node w of the initial node v;
c. If node w exists, continue with step d; otherwise the traversal ends;
d. If node w has not been visited, perform a recursive depth-first traversal on node w (that is, treat node w as another initial node v and perform steps a, b and c);
e. Find the next adjacent node of the initial node v and go to step c.
For example, suppose the web page structure is: tag A is the first layer; tags B and C are nested in tag A and form the second layer; tags D and E are nested in tag B, and tag F is nested in tag C, so tags D, E and F form the third layer. Taking tag A as the initial node, tag A has adjacent nodes B and C, tag B has adjacent nodes D and E, and tag C has adjacent node F, so the depth-first traversal order is: tag A → tag B → tag D → tag E → tag C → tag F.
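The traversal order above can be sketched as a small recursive routine. The sketch below is illustrative only: the `TagNode` class and the labels A–F are hypothetical stand-ins for real web page tags, and a real implementation would traverse DOM element nodes instead.

```java
import java.util.ArrayList;
import java.util.List;

public class TagTraversal {
    // Hypothetical tag-tree node; a real implementation would wrap DOM elements.
    static class TagNode {
        final String name;
        final List<TagNode> children = new ArrayList<>();
        TagNode(String name) { this.name = name; }
        TagNode add(TagNode child) { children.add(child); return this; }
    }

    // Depth-first traversal: visit the node, then recurse into each adjacent
    // (nested) node in order, so deeper layers are exhausted before siblings.
    static void depthFirst(TagNode node, List<String> order) {
        order.add(node.name);
        for (TagNode child : node.children) {
            depthFirst(child, order);
        }
    }

    // Builds the three-layer example from the text: A is the first layer,
    // B and C the second, and D, E and F the third.
    static List<String> exampleOrder() {
        TagNode a = new TagNode("A");
        TagNode b = new TagNode("B");
        TagNode c = new TagNode("C");
        a.add(b).add(c);
        b.add(new TagNode("D")).add(new TagNode("E"));
        c.add(new TagNode("F"));
        List<String> order = new ArrayList<>();
        depthFirst(a, order);
        return order;
    }

    public static void main(String[] args) {
        System.out.println(exampleOrder()); // [A, B, D, E, C, F]
    }
}
```

Running the sketch visits the nodes in the order given above, tag A → tag B → tag D → tag E → tag C → tag F.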
Web page tags follow the tag specification, with a format like <lable>page content</lable>. In this embodiment, each tag pair is traversed and the page content is extracted: during the traversal, within the layer corresponding to each tag pair, characters are searched, a character satisfying a first preset condition is taken as a first character, and a character satisfying a second preset condition is taken as a second character.
In this embodiment, the first preset condition is specifically a greater-than character (">") appearing in the layer corresponding to the currently visited tag pair, and the second preset condition is specifically a less-than character ("<") appearing after that greater-than character in the same layer. It should be understood that the ">" and "<" characters are tag characters.
It should be noted that a tag pair may contain multiple ">" characters and multiple "<" characters. The first occurrence of ">" in the tag pair is taken as a first character, and the "<" that follows it is the corresponding second character; the second occurrence of ">" is again a first character, and the "<" that follows it is the corresponding second character, and so on. In this embodiment, the first character and the second character are ordered: the first character precedes the second character, and the content between them is page content that may need to be extracted.
In this embodiment, all characters in each tag pair can be searched, and the first characters and second characters that are found are marked respectively.
Specifically, all characters in a tag pair can be searched using a string search method. String search methods include, but are not limited to, the following:
1. int indexOf(String str): returns the index of the first occurrence of the specified substring in this string.
2. int indexOf(String str, int startIndex): returns the index of the first occurrence of the specified substring in this string, starting the search at the specified index.
3. int lastIndexOf(String str): returns the index of the rightmost occurrence of the specified substring in this string.
4. int lastIndexOf(String str, int startIndex): searches backward starting at the specified index and returns the index of the last occurrence of the specified substring in this string.
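As a rough illustration of how these methods locate the tag characters, the snippet below applies `indexOf` and `lastIndexOf` to a made-up tag pair; the sample string and the helper names are hypothetical, not part of the application.

```java
public class TagCharSearch {
    // Index of the tail character '>' of the head tag: its first occurrence.
    static int headTagEnd(String tagPair) {
        return tagPair.indexOf(">");
    }

    // Index of the head character '<' of the tail tag: its last occurrence.
    static int tailTagStart(String tagPair) {
        return tagPair.lastIndexOf("<");
    }

    public static void main(String[] args) {
        String s = "<ul><li>item</li></ul>";
        // All first/second characters between these two positions delimit
        // candidate page content.
        System.out.println(headTagEnd(s) + " .. " + tailTagStart(s)); // 3 .. 17
        // The startIndex overload continues the search from a given position:
        System.out.println(s.indexOf(">", headTagEnd(s) + 1)); // 7
    }
}
```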
For each tag pair, the tail character of the head tag is the first occurrence of the first character, and the head character of the tail tag is the last occurrence of the second character. Within each tag pair, all first characters and second characters between the first occurrence of the first character and the last occurrence of the second character are found.
When a web page is redesigned, its page content does not change much, but its tags do. If the crawler obtained data by parsing it tag by tag, row by row and column by column, the crawling script would have to be rewritten. In this embodiment, page content is extracted by determining the character length between the first character and the second character, which avoids having to write different scripts for different websites or after a page redesign; no script changes are needed, and the adaptability of the crawler is enhanced.
In this embodiment, the first character and the second character in each tag pair are determined by depth-first traversal, which ensures that the page content is extracted completely and without omission.
Step S203: taking the first character as a starting point and the second character as an end point, read out the target characters between the starting point and the end point.
In this embodiment, a tag pair may contain multiple ">" characters and multiple "<" characters. The first occurrence of ">" is taken as a starting point and the "<" that follows it as the corresponding end point; the second occurrence of ">" is taken as a starting point and the "<" that follows it as the corresponding end point, and so on; the target characters between each starting point and end point are read out.
Step S204: judge whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extract the target characters as page content.
Specifically, the length of the target characters between the first character and the second character is recorded; if the length of the target characters satisfies the extraction length, the target characters are taken as page content and extracted. The extracted page content can be saved for convenient viewing by the user.
In this embodiment, the tag format is <lable>page content</lable>, and the page content between the first character ">" and the second character "<" is to be extracted.
For example, consider the following tag pair:
<li><em>&sdot;</em><a href="/topics/50245210" target="_black">关于ADO和数字符号的问题</a></li>
The page content "关于ADO和数字符号的问题" (a question about ADO and number symbols) needs to be extracted. This tag pair contains multiple first characters and second characters, so the character length between each pair of first and second characters is recorded, and whether a segment is the page content to be extracted is determined according to its length.
The extraction length is set according to the actual situation. When it is determined that the target characters between the first character and the second character satisfy the preset extraction length, the target characters are extracted as page content.
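A minimal sketch of this length-based extraction, applied to the tag pair above, follows. The threshold of 7 characters is an assumed value chosen so that only the topic text survives (the entity "&sdot;" is 6 characters long); it is not a value prescribed by the application.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentExtractor {
    // Collect every character run between a '>' and the next '<', then keep
    // only runs whose length meets the preset extraction length.
    public static List<String> extract(String tagPair, int minLength) {
        List<String> contents = new ArrayList<>();
        int start = tagPair.indexOf('>');              // first character: '>'
        while (start >= 0) {
            int end = tagPair.indexOf('<', start + 1); // second character: '<'
            if (end < 0) break;
            String target = tagPair.substring(start + 1, end); // target characters
            if (target.length() >= minLength) {        // extraction condition
                contents.add(target);
            }
            start = tagPair.indexOf('>', end + 1);
        }
        return contents;
    }

    public static void main(String[] args) {
        String tagPair = "<li><em>&sdot;</em>"
                + "<a href=\"/topics/50245210\" target=\"_black\">关于ADO和数字符号的问题</a></li>";
        System.out.println(extract(tagPair, 7)); // [关于ADO和数字符号的问题]
    }
}
```

With the assumed threshold, the empty runs between adjacent tags and the short "&sdot;" run are discarded, and only the page content is kept.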
It should be emphasized that, in order to further ensure the privacy and security of the above page content, the page content may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
In this embodiment, whether the characters between the first character and the second character are page content to be extracted is determined by character length, which avoids extracting useless content, improves the accuracy of the data crawled by the crawler, and improves crawling efficiency.
In some optional implementations of this embodiment, after step S203, the following step may also be performed:
storing the page content in JSON format.
Specifically, the extracted page content is stored in JSON format, or stored in a database, and the stored page content can be used directly for data analysis. Storage methods include, but are not limited to, storing to a text file in JSON format, storing to Excel, storing to SQLite (a lightweight database), and storing to a MySQL (relational database management system) database.
JSON, short for JavaScript Object Notation, can represent any supported type, such as strings, numbers, objects and arrays. In this embodiment, a path to a JSON file for saving page content is set, and the extracted page content is written to the JSON file through that path for storage.
It should be noted that the JSON format ensures that the stored data can be inspected intuitively when the file is opened, with one record stored per line; this approach suits cases where the amount of crawled data is small, and subsequent reading and analysis are convenient. If the crawled data is easily organized into tabular form, it can be stored in Excel, where the data is easier to inspect and some simple operations can be performed. SQLite requires no installation and is a zero-configuration database; when the volume of crawled data is large, persistent storage is needed and no other database is installed, SQLite can be chosen for storage. MySQL supports remote access, which means the data can be stored on a remote server host.
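The one-record-per-line storage described above can be sketched as follows. The `pageContent` field name and the output path are hypothetical, and the hand-rolled escaping covers only quotes and backslashes; a real implementation would use a JSON library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class JsonLineStore {
    // Minimal JSON string escaping for illustration: quotes and backslashes
    // only (a full implementation would also escape control characters).
    static String toJsonLine(String content) {
        String escaped = content.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"pageContent\":\"" + escaped + "\"}";
    }

    // Appends one JSON record per line to the configured file path.
    static void store(Path path, String content) throws IOException {
        Files.writeString(path, toJsonLine(content) + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        store(Path.of("page_content.json"), "关于ADO和数字符号的问题");
        System.out.println(toJsonLine("关于ADO和数字符号的问题"));
    }
}
```

Each stored line is a self-contained JSON object, so the file can be read back and analyzed record by record.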
In the present application, all tag pairs of a target web page are traversed, the first character and the second character in each tag pair are found, the target characters between the first character (as the starting point) and the second character (as the end point) are read out, and when the target characters are determined to satisfy the extraction condition, they are extracted as page content. That is, web page data is crawled by searching for the corresponding characters in the tags, which avoids having to write different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data crawling efficiency.
The present application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, executed by a computer. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware, and the computer-readable instructions may be stored in a computer-readable storage medium; when the program is executed, it may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM), or the like.
It should be understood that although the steps in the flowcharts of the accompanying drawings are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to FIG. 5, as an implementation of the method shown in FIG. 2, the present application provides an embodiment of a crawler-based data crawling apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
As shown in FIG. 5, the crawler-based data crawling apparatus 500 of this embodiment includes: an acquisition module 501, a traversal module 502 and an extraction module 503, wherein:
the acquisition module 501 is configured to acquire a target web page and parse the target web page to obtain all tags of the target web page;
the traversal module 502 is configured to obtain tag pairs according to the tags, traverse all the tag pairs, search the characters in each tag pair, take a character satisfying a first preset condition as a first character, and take a character satisfying a second preset condition as a second character;
a reading module is configured to take the first character as a starting point and the second character as an end point, and read out the target characters between the starting point and the end point;
the extraction module 503 is configured to judge whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extract the target characters as page content.
It should be emphasized that, in order to further ensure the privacy and security of the above page content, the page content may also be stored in a node of a blockchain.
The above crawler-based data crawling apparatus traverses all tag pairs of a target web page, finds the first character and the second character in each tag pair, reads out the target characters between the first character (as the starting point) and the second character (as the end point), and, when the target characters are determined to satisfy the extraction condition, extracts them as page content. That is, web page data is crawled by searching for the corresponding characters in the tags, which avoids having to write different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data crawling efficiency.
In this embodiment, the acquisition module 501 includes a parsing sub-module and an acquisition sub-module. The parsing sub-module is configured to extract the web page structure of the target web page using a web page parser; the acquisition sub-module is configured to acquire the tags in the web page structure.
In this embodiment, the web page structure is extracted by a web page parser and the tags are obtained from the web page structure, so the tags of the target web page can be obtained more quickly and easily.
In some optional implementations of this embodiment, the parsing sub-module is further configured to:
parse the target web page through the web page parser to obtain an HTML document; and
parse the HTML document to generate a DOM tree structure, and use the generated DOM tree structure as the web page structure of the target web page.
In this embodiment, by parsing the target web page to generate a DOM tree structure, the performance of parsing the web page can be improved and, at the same time, the web page information can be obtained completely and accurately.
In this embodiment, the traversal module 502 is further configured to:
take the outermost tag in the web page structure as an initial node; and
starting from the initial node, perform a depth-first traversal on all the tag pairs, using the tags in the DOM tree structure as traversal nodes.
In some optional implementations of this embodiment, the traversal module 502 is further configured to:
search all the characters in the tag pair using a string search method.
In this embodiment, the first character and the second character in each tag pair are determined by depth-first traversal, which ensures that the page content is extracted completely and without omission.
In this embodiment, the extraction module 503 includes a recording sub-module and an extraction sub-module. The recording sub-module is configured to record the length of the target characters; the extraction sub-module is configured to, upon determining that the length of the target characters satisfies the extraction length, take the target characters as page content and extract the page content.
In this embodiment, whether the characters between the first character and the second character are page content to be extracted is determined by character length, which avoids extracting useless content, improves the accuracy of the data crawled by the crawler, and improves crawling efficiency.
In some optional implementations of this embodiment, the crawler-based data crawling apparatus 500 further includes a storage module configured to store the page content in JSON format.
In this embodiment, the extracted page content is stored in a database, and the stored page content can be used directly for data analysis.
To solve the above technical problems, an embodiment of the present application further provides a computer device. Refer to FIG. 6 for details; FIG. 6 is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 6 includes a memory 61, a processor 62 and a network interface 63 that are communicatively connected to one another through a system bus. It should be noted that the figure shows only a computer device 6 having components 61-63, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device. The computer device may perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, a voice-controlled device or the like.
The memory 61 stores computer-readable instructions, and the processor 62 implements the following steps when executing the computer-readable instructions:
acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character satisfying a first preset condition as a first character, and taking a character satisfying a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading out all target characters between the starting point and the end point; and
judging whether the target characters satisfy an extraction condition, and when it is determined that the extraction condition is satisfied, extracting the target characters as page content.
The memory 61 includes at least one type of readable storage medium, and the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the computer device 6. Of course, the memory 61 may also include both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, such as computer-readable instructions of the crawler-based data crawling method. In addition, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 62 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 62 is generally used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to run the computer-readable instructions stored in the memory 61 or to process data, for example, to run the computer-readable instructions of the crawler-based data crawling method.
The network interface 63 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 6 and other electronic devices.
In this embodiment, when the processor executes the computer-readable instructions stored in the memory, the steps of the crawler-based data crawling method of the above embodiments are implemented: all tag pairs of a target web page are traversed, the first character and the second character in each tag pair are found, the target characters between the first character (as the starting point) and the second character (as the end point) are read out, and when the target characters are determined to satisfy the extraction condition, they are extracted as page content. That is, web page data is crawled by searching for the corresponding characters in the tags, which avoids having to write different scripts for different websites, enhances the adaptability of the crawler, reduces the workload, and improves data crawling efficiency.
本申请还提供了另一种实施方式,即提供一种计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性。所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令可被至少一个处理器执行,以使所述至少一个处理器执行如上 述的基于爬虫的数据抓取方法的步骤,通过遍历目标网页的所有标签对,查找每个标签对中的第一字符和第二字符,读取出以第一字符为起点,第二字符为终点之间的目标字符,确定目标字符满足提取条件,就将目标字符作为页面内容进行提取,也就是通过查找标签中对应的字符来进行网页页面数据的抓取,可以避免不同网站需要编写不同的脚本进行抓取,增强爬虫的适应性,同时,减少工作量,提高数据抓取效率。The present application also provides another implementation manner, which is to provide a computer-readable storage medium, where the computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the steps of the crawler-based data scraping method as described above, By traversing all the tag pairs of the target web page, finding the first character and the second character in each tag pair, and reading out the target character between the first character as the starting point and the second character as the end point, it is determined that the target character satisfies the extraction condition, the target characters are extracted as page content, that is, the web page data is crawled by finding the corresponding characters in the tags, which can avoid the need for different websites to write different scripts for crawling, enhance the adaptability of the crawler, and at the same time , reduce workload and improve data capture efficiency.
Wherein, when the computer-readable instructions are executed by the processor, the following steps of the crawler-based data scraping method are implemented:
acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character that satisfies a first preset condition as a first character, and taking a character that satisfies a second preset condition as a second character;
taking the first character as a starting point and the second character as an end point, reading all target characters between the starting point and the end point;
determining whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content.
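The steps above can be sketched in Python. This is a minimal illustration only, not the patented implementation: the choice of ">" and "<" as the characters satisfying the first and second preset conditions, and a minimum-length threshold as the extraction condition, are assumptions, since the specification leaves the preset conditions and the extraction condition abstract.

```python
import json

def extract_page_content(html, first_char=">", second_char="<", min_len=5):
    """Scan the raw markup for a character meeting the first preset
    condition (start marker), a character meeting the second preset
    condition (end marker), read the target characters between them,
    and keep the span as page content when it meets the length-based
    extraction condition. Markers and threshold are illustrative."""
    contents = []
    pos = 0
    while True:
        start = html.find(first_char, pos)       # first character (starting point)
        if start == -1:
            break
        end = html.find(second_char, start + 1)  # second character (end point)
        if end == -1:
            break
        target = html[start + 1:end].strip()     # target characters in between
        if len(target) >= min_len:               # extraction condition: length check
            contents.append(target)
        pos = end + 1
    return contents

html = "<div><p>Hello crawler world</p><span>ok</span></div>"
print(json.dumps(extract_page_content(html), ensure_ascii=False))
# prints ["Hello crawler world"]  -- the short "ok" fails the length condition
```

Storing the result with `json.dumps` matches the JSON storage step recited later in claim 7.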
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
Obviously, the embodiments described above are only some, not all, of the embodiments of the present application. The accompanying drawings show preferred embodiments of the present application but do not limit its patent scope. The present application may be embodied in many different forms; rather, these embodiments are provided so that the disclosure of the present application is understood thoroughly and completely. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments, or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the patent protection scope of the present application.

Claims (20)

  1. A crawler-based data scraping method, comprising the following steps:
    acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
    obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character that satisfies a first preset condition as a first character, and taking a character that satisfies a second preset condition as a second character;
    taking the first character as a starting point and the second character as an end point, reading all target characters between the starting point and the end point;
    determining whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content.
  2. The crawler-based data scraping method according to claim 1, wherein the step of parsing the target web page to obtain the tags of the target web page comprises:
    extracting a web page structure of the target web page by using a web page parser;
    obtaining the tags in the web page structure.
  3. The crawler-based data scraping method according to claim 2, wherein the step of extracting the web page structure of the target web page by using a web page parser comprises:
    parsing the target web page by the web page parser to obtain an HTML document;
    parsing the HTML document to generate a DOM tree structure, and using the generated DOM tree structure as the web page structure of the target web page.
  4. The crawler-based data scraping method according to claim 3, wherein the step of traversing all the tag pairs comprises:
    using the outermost tag in the web page structure as an initial node;
    starting from the initial node and using the tags in the DOM tree structure as traversal nodes, performing a depth-first traversal of all the tag pairs.
  5. The crawler-based data scraping method according to claim 1, wherein the step of searching the characters in each tag pair comprises:
    searching all the characters in the tag pair by using a string search method.
  6. The crawler-based data scraping method according to claim 1, wherein the step of determining whether the target characters satisfy the extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content comprises:
    recording the length of the target characters;
    when the length of the target characters is determined to satisfy an extraction length, using the target characters as the page content, and extracting the page content.
  7. The crawler-based data scraping method according to claim 1, wherein, after the step of extracting the target characters as page content, the method further comprises:
    storing the page content in JSON format.
  8. A crawler-based data scraping apparatus, comprising:
    an acquisition module, configured to acquire a target web page and parse the target web page to obtain all tags of the target web page;
    a traversal module, configured to obtain tag pairs according to the tags, traverse all the tag pairs, search the characters in each tag pair, take a character that satisfies a first preset condition as a first character, and take a character that satisfies a second preset condition as a second character;
    a reading module, configured to take the first character as a starting point and the second character as an end point, and read the target characters between the starting point and the end point;
    an extraction module, configured to determine whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extract the target characters as page content.
  9. A computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the following steps of a crawler-based data scraping method are implemented:
    acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
    obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character that satisfies a first preset condition as a first character, and taking a character that satisfies a second preset condition as a second character;
    taking the first character as a starting point and the second character as an end point, reading all target characters between the starting point and the end point;
    determining whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content.
  10. The computer device according to claim 9, wherein the step of parsing the target web page to obtain the tags of the target web page comprises:
    extracting a web page structure of the target web page by using a web page parser;
    obtaining the tags in the web page structure.
  11. The computer device according to claim 10, wherein the step of extracting the web page structure of the target web page by using a web page parser comprises:
    parsing the target web page by the web page parser to obtain an HTML document;
    parsing the HTML document to generate a DOM tree structure, and using the generated DOM tree structure as the web page structure of the target web page.
  12. The computer device according to claim 11, wherein the step of traversing all the tag pairs comprises:
    using the outermost tag in the web page structure as an initial node;
    starting from the initial node and using the tags in the DOM tree structure as traversal nodes, performing a depth-first traversal of all the tag pairs.
  13. The computer device according to claim 9, wherein the step of searching the characters in each tag pair comprises:
    searching all the characters in the tag pair by using a string search method.
  14. The computer device according to claim 9, wherein the step of determining whether the target characters satisfy the extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content comprises:
    recording the length of the target characters;
    when the length of the target characters is determined to satisfy an extraction length, using the target characters as the page content, and extracting the page content.
  15. The computer device according to claim 9, wherein, after the step of extracting the target characters as page content, the method further comprises:
    storing the page content in JSON format.
  16. A computer-readable storage medium, storing computer-readable instructions which, when executed by a processor, implement the following steps of a crawler-based data scraping method:
    acquiring a target web page, and parsing the target web page to obtain all tags of the target web page;
    obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking a character that satisfies a first preset condition as a first character, and taking a character that satisfies a second preset condition as a second character;
    taking the first character as a starting point and the second character as an end point, reading all target characters between the starting point and the end point;
    determining whether the target characters satisfy an extraction condition, and when the extraction condition is determined to be satisfied, extracting the target characters as page content.
  17. The computer-readable storage medium according to claim 16, wherein the step of parsing the target web page to obtain the tags of the target web page comprises:
    extracting a web page structure of the target web page by using a web page parser;
    obtaining the tags in the web page structure.
  18. The computer-readable storage medium according to claim 17, wherein the step of extracting the web page structure of the target web page by using a web page parser comprises:
    parsing the target web page by the web page parser to obtain an HTML document;
    parsing the HTML document to generate a DOM tree structure, and using the generated DOM tree structure as the web page structure of the target web page.
  19. The computer-readable storage medium according to claim 18, wherein the step of traversing all the tag pairs comprises:
    using the outermost tag in the web page structure as an initial node;
    starting from the initial node and using the tags in the DOM tree structure as traversal nodes, performing a depth-first traversal of all the tag pairs.
  20. The computer-readable storage medium according to claim 16, wherein the step of searching the characters in each tag pair comprises:
    searching all the characters in the tag pair by using a string search method.
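Claims 4, 12 and 19 recite taking the outermost tag of the web page structure as the initial node and depth-first traversing the tag pairs of the DOM tree. The traversal order can be sketched as follows; modeling DOM nodes as plain dicts with `tag` and `children` keys is an assumption made purely for illustration, since the claims do not prescribe a node representation.

```python
def dfs_tag_pairs(root):
    """Depth-first traversal of a DOM tree, starting from the outermost
    tag (the initial node) and visiting every tag-pair node once."""
    stack = [root]                                  # explicit stack -> depth-first order
    visited = []
    while stack:
        node = stack.pop()
        visited.append(node["tag"])
        # push children in reverse so the left-most child is visited first
        for child in reversed(node.get("children", [])):
            stack.append(child)
    return visited

# hypothetical DOM tree for <html><body><h1>...</h1><p>...</p></body></html>
dom = {"tag": "html", "children": [
    {"tag": "body", "children": [
        {"tag": "h1"},
        {"tag": "p"},
    ]},
]}
print(dfs_tag_pairs(dom))   # prints ['html', 'body', 'h1', 'p']
```

An explicit stack is used rather than recursion so that deeply nested pages cannot exhaust the call stack; either form yields the same depth-first visiting order.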
PCT/CN2021/124394 2021-02-25 2021-10-18 Crawler-based data crawling method and apparatus, computer device, and storage medium WO2022179128A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110213211.3A CN112925968A (en) 2021-02-25 2021-02-25 Crawler-based data capturing method and device, computer equipment and storage medium
CN202110213211.3 2021-02-25

Publications (1)

Publication Number Publication Date
WO2022179128A1 true WO2022179128A1 (en) 2022-09-01

Family

ID=76171932

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124394 WO2022179128A1 (en) 2021-02-25 2021-10-18 Crawler-based data crawling method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112925968A (en)
WO (1) WO2022179128A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925968A (en) * 2021-02-25 2021-06-08 深圳壹账通智能科技有限公司 Crawler-based data capturing method and device, computer equipment and storage medium
CN116881595B (en) * 2023-09-06 2023-12-15 江西顶易科技发展有限公司 Customizable webpage data crawling method

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
US20170192882A1 (en) * 2016-01-06 2017-07-06 Hcl Technologies Limited Method and system for automatically generating a plurality of test cases for an it enabled application
CN108804458A (en) * 2017-05-02 2018-11-13 阿里巴巴集团控股有限公司 A kind of reptile web retrieval method and apparatus
CN110472126A (en) * 2018-05-10 2019-11-19 中国移动通信集团浙江有限公司 A kind of acquisition methods of page data, device and equipment
CN110764994A (en) * 2019-09-04 2020-02-07 深圳壹账通智能科技有限公司 Page element packaging method and device, electronic equipment and storage medium
CN110874428A (en) * 2019-11-11 2020-03-10 汉口北进出口服务有限公司 Structured data extraction device and method for e-commerce page and readable storage medium
CN111125598A (en) * 2019-12-20 2020-05-08 深圳壹账通智能科技有限公司 Intelligent data query method, device, equipment and storage medium
CN111737623A (en) * 2020-06-19 2020-10-02 深圳市小满科技有限公司 Webpage information extraction method and related equipment
CN111797336A (en) * 2020-07-07 2020-10-20 北京明略昭辉科技有限公司 Webpage parsing method and device, electronic equipment and medium
CN112925968A (en) * 2021-02-25 2021-06-08 深圳壹账通智能科技有限公司 Crawler-based data capturing method and device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314520A (en) * 2011-10-24 2012-01-11 莫雅静 Webpage text extraction method and device based on statistical backtracking positioning
CN103577466B (en) * 2012-08-03 2017-02-15 腾讯科技(深圳)有限公司 Method and device for displaying webpage content in browser
CN104866512B (en) * 2014-02-26 2018-09-07 腾讯科技(深圳)有限公司 Extract the method, apparatus and system of web page contents
CN107861974B (en) * 2017-09-19 2018-12-25 北京金堤科技有限公司 A kind of adaptive network crawler system and its data capture method


Also Published As

Publication number Publication date
CN112925968A (en) 2021-06-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927552

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.12.2023)