CN112925968A - Crawler-based data capturing method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN112925968A
Authority
CN
China
Prior art keywords
characters
target
character
webpage
crawler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110213211.3A
Other languages
Chinese (zh)
Inventor
郑如刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202110213211.3A
Publication of CN112925968A
Priority to PCT/CN2021/124394 (WO2022179128A1)
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562: Bookmark management
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986: Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the application belong to the field of big-data acquisition and relate to a crawler-based data capture method. The method comprises: obtaining a target webpage and parsing it to obtain all tags of the target webpage; obtaining tag pairs from the tags; traversing all the tag pairs and searching the characters in each tag pair, taking characters that meet a first preset condition as first characters and characters that meet a second preset condition as second characters; reading the target characters between a starting point and an end point, with the first character as the starting point and the second character as the end point; and extracting the target characters as page content when they are determined to meet the extraction condition. The application also provides a crawler-based data capture device, computer equipment and a storage medium. In addition, the application relates to blockchain technology, in that the page content may be stored in a blockchain. The application can improve the adaptability of the crawler, reduce workload, and improve data capture efficiency.

Description

Crawler-based data capturing method and device, computer equipment and storage medium
Technical Field
The application relates to the technical field of data acquisition of big data, in particular to a crawler-based data capturing method and device, computer equipment and a storage medium.
Background
In recent years, the internet has gradually moved toward big data, and in a big-data environment the acquisition of data is crucial. Data is generally captured with crawler technology. A crawler is a program that finds web pages through their link addresses and automatically acquires their content according to certain rules.
At present, traditional crawlers parse data by locating page elements. That is, capturing data with a crawler requires writing a script, different websites require different scripts, and if a web page changes, the capture script must be rewritten. As a result, the workload of data capture is heavy, the work is time-consuming and labor-intensive, and efficiency is low.
Disclosure of Invention
An object of the embodiments of the application is to provide a crawler-based data capture method and device, a computer device and a storage medium, so as to solve the problems in the related art that data capture with a crawler involves a heavy workload, is time-consuming and labor-intensive, and has low capture efficiency.
In order to solve the above technical problem, an embodiment of the present application provides a crawler-based data capture method, which adopts the following technical scheme:
acquiring a target webpage, and parsing the target webpage to obtain all tags of the target webpage;
obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking the characters meeting a first preset condition as first characters, and taking the characters meeting a second preset condition as second characters;
taking the first character as a starting point and the second character as an end point, and reading the target character between the starting point and the end point;
and judging whether the target character meets the extraction condition, and extracting the target character as page content when the target character meets the extraction condition.
Further, the step of parsing the target webpage to obtain the tags of the target webpage includes:
extracting a webpage structure of the target webpage by using a webpage parser;
and acquiring the tags in the webpage structure.
Further, the step of extracting the webpage structure of the target webpage by using a webpage parser comprises:
parsing the target webpage through the webpage parser to obtain an HTML document;
and parsing the HTML document to generate a DOM tree structure, and taking the generated DOM tree structure as the webpage structure of the target webpage.
Further, the step of traversing all the tag pairs comprises:
taking the outermost tag in the webpage structure as an initial node;
and starting from the initial node, taking the tags in the DOM tree structure as traversal nodes, and performing depth-first traversal on all the tag pairs.
Further, the step of searching the characters in each tag pair includes:
searching all characters in the tag pair by using a character string searching method.
Further, the step of judging whether the target character meets the extraction condition and extracting the target character as the page content includes:
recording the length of the target character;
and if the length of the target character meets the extraction length, taking the target character as page content, and extracting the page content.
Further, after the step of extracting the target character as page content, the method further includes:
storing the page content in JSON format.
In order to solve the above technical problem, an embodiment of the present application further provides a crawler-based data capture device, which adopts the following technical scheme:
the acquisition module is used for acquiring a target webpage and parsing the target webpage to obtain all tags of the target webpage;
the traversal module is used for obtaining tag pairs according to the tags, traversing all the tag pairs, searching the characters in each tag pair, taking the characters meeting a first preset condition as first characters, and taking the characters meeting a second preset condition as second characters;
the reading module is used for reading the target character between the starting point and the end point by taking the first character as the starting point and the second character as the end point;
and the extraction module is used for judging whether the target character meets the extraction condition, and extracting the target character as page content when the extraction condition is determined to be met.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
the computer device comprises a memory and a processor, wherein the memory is stored with computer readable instructions, and the processor realizes the steps of the crawler-based data capture method when executing the computer readable instructions.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the crawler-based data crawling method as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
the method obtains a target webpage and parses it to obtain all tags of the target webpage, obtains tag pairs from the tags, traverses all the tag pairs, and searches the characters in each tag pair, taking the characters meeting a first preset condition as first characters and the characters meeting a second preset condition as second characters; it then reads the target character between a starting point and an end point, with the first character as the starting point and the second character as the end point, and extracts the target character as page content when the target character meets the extraction condition. In other words, the application captures webpage data by searching the tags for characters that meet preset conditions rather than by locating specific page elements. This avoids writing a different capture script for each website, improves the adaptability of the crawler, reduces workload, and improves data capture efficiency.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a crawler-based data crawling method according to the present application;
FIG. 3 is a flowchart of one embodiment of step S201 in FIG. 2;
FIG. 4 is a flowchart of one embodiment of step S202 in FIG. 2;
FIG. 5 is a schematic diagram of an embodiment of a crawler-based data crawling apparatus according to the present application;
FIG. 6 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be applied to the following explanations.
1) A crawler (web crawler), also called a web spider or a web robot, is a program or script that automatically captures network information according to certain rules.
2) HTML (HyperText Markup Language) is a descriptive markup language for describing how content in hypertext is displayed.
3) Tags, also called markup tags, are an HTML term; each tag specifies a particular meaning.
4) A web page is a page composed of various tags.
In order to solve the problems of huge workload, time and labor waste and low capturing efficiency of crawlers in the related art, the application provides a data capturing method based on crawlers, which can be applied to a system architecture 100 shown in fig. 1, where the system architecture 100 may include terminal devices 101, 102, and 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (MPEG-4 Part 14 media players), laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the crawler-based data crawling method provided in the embodiment of the present application is generally executed by a server, and accordingly, the crawler-based data crawling apparatus is generally disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continuing reference to FIG. 2, a flowchart of one embodiment of a crawler-based data crawling method according to the present application is shown, comprising the steps of:
step S201, acquiring a target webpage, and analyzing the target webpage to obtain all tags of the target webpage.
In this embodiment, the crawler fetches the web page at the provided entry URL; this web page is the target web page to be crawled. When the crawler captures the target web page, it can identify the tags of the target web page and extract the content of the web page from within the tags.
In some optional implementation manners of this embodiment, the step of analyzing the target webpage to obtain the tag of the target webpage specifically includes:
in step S301, a web page structure of the target web page is extracted using a web page parser.
Specifically, the target webpage is analyzed through the webpage analyzer to obtain an HTML document, the HTML document is analyzed to generate a DOM tree structure, and the generated DOM tree structure is used as the webpage structure of the target webpage.
In a web page, the tags are arranged in a tree structure. The web page parser is also called an HTML (HyperText Markup Language) parser; jsoup, for example, is a Java HTML parser that can directly parse a URL or HTML text content.
In this embodiment, the target web page is parsed by jsoup or another web page parser, so that the web page content containing the tags can be obtained. For example, the following is an HTML document:
<html><head><title>Example</title></head>
<body>
<p class="title"><b>Example</b></p>
<p class="description">There are three examples:
<a href="http://example.com/1" class="example" id="link1">Example1</a>,
<a href="http://example.com/2" class="example" id="link2">Example2</a> and
<a href="http://example.com/3" class="example" id="link3">Example3</a></p>
</body>
</html>
It should be noted that tags appear in pairs; that is, the tags in this embodiment are specifically tag pairs. For example, <html> and </html> are one tag pair, and <head> and </head> are one tag pair. <html> defines the HTML document, and the <html> and </html> tags mark the start point and the end point of the document; <head> defines information about the document; <title> defines the title of the document; <body> defines the body of the document; <p> defines a paragraph; <b> defines bold text; <a> defines a hyperlink, and the most important attribute of the <a> tag is the href attribute, which indicates the target of the link.
The HTML document is converted into a DOM tree structure by the HTML parser. The DOM tree structure is the parse tree output by the HTML parser. It comprises element nodes, text nodes, attribute nodes and comment nodes, is an object representation of the HTML document, and serves as the external interface through which JavaScript and other callers access HTML elements. The DOM tree structure has multiple branches, each branch has multiple layers, and the layer structure expresses the relationships between element nodes.
As a specific example, when an HTML document is parsed into a DOM tree structure using a web page parser, nodes in the web page that contain no textual information (i.e., functional code) may be removed, for example HTML tags that carry no page text such as <style>, </style>, <script>, etc.
In this embodiment, the target webpage is parsed to generate a DOM tree structure, and the tags in the DOM tree structure are used as traversal nodes, so that the tags can be more conveniently traversed.
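The parsing step can be illustrated with a minimal sketch. It swaps in Python's standard-library `html.parser` for jsoup, and the names `Node`, `TreeBuilder`, `parse_tags`, `SKIPPED` and `VOID` are hypothetical helpers introduced only for this illustration. The sketch builds a simple tag tree (not a full DOM) and drops `<script>` and `<style>` nodes, which carry no page text.

```python
from html.parser import HTMLParser

SKIPPED = {"script", "style"}  # assumed set of tags that carry no page text
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class Node:
    """One tag element with its attributes, child tags and direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = []


class TreeBuilder(HTMLParser):
    """Builds a simplified tag tree while ignoring script/style content."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Climb to the matching open element; ignore stray closing tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.text.append(data.strip())


def parse_tags(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = parse_tags(
    "<html><head><title>Example</title><style>p{}</style></head>"
    '<body><p class="title"><b>Example</b></p></body></html>'
)
print([child.tag for child in root.children[0].children])
```

A production crawler would more likely rely on a full HTML parser; this sketch only shows how tags become the nodes of a tree that can then be traversed.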
Step S302, acquiring the tags in the webpage structure.
It should be noted that the element nodes of the DOM tree structure are the tag elements of the web page, the text nodes are contained in the element nodes, and the attribute nodes describe the elements in more detail; in the above example, href is an attribute of the <a> tag. It should be understood that not all element nodes contain attributes.
In this embodiment, the tag of the target webpage can be obtained from the generated DOM tree structure, so that the performance of analyzing the webpage is improved, and meanwhile, the webpage information can be completely and accurately obtained.
In the embodiment, the webpage structure is extracted through the webpage parser, and the tag is acquired from the webpage structure, so that the tag of the target webpage can be acquired more quickly and more conveniently.
Step S202, tag pairs are obtained according to the tags, all the tag pairs are traversed, the characters in each tag pair are searched, characters meeting a first preset condition are taken as first characters, and characters meeting a second preset condition are taken as second characters.
The webpage is a page composed of various tags, the tags displayed on the webpage all conform to tag specifications, and the levels of the tags can be distinguished according to tag elements. The following examples are given:
<dd>
<ul class="topic_list">
<li><em>&sdot;</em><a href="/topics/50245210" target="_blank">problem with respect to ADO and digital symbols</a></li>
</ul>
</dd>
In the above example, three nested tag elements are involved, namely <dd>, <ul> and <li>, so the tags are divided into three layers, one for each tag element.
In this embodiment, the tags are present in the form of tag pairs, and each tag is used as a node to traverse, so that the page content of the webpage corresponding to the tag can be obtained.
Specifically, the steps of traversing all tag pairs are as follows:
Step S401, taking the outermost tag in the webpage structure as an initial node.
In this embodiment, tags are used as traversal nodes to perform depth-first traversal on all tag pairs. Specifically, the outermost tag in the web page structure is taken as the first layer, the tags nested in the first layer are taken as the second layer, and so on, forming a tree structure. The tag of the first layer is taken as the initial node, access starts from the initial node, and depth-first traversal proceeds in turn from each unvisited adjacent node of the initial node until all nodes have been visited.
Taking the above example, the tag element <dd> is the outermost tag and is taken as the first layer; the tag element <ul> is nested in <dd> and is taken as the second layer; and the tag element <li> is nested in <ul> and is taken as the third layer. Specifically, if <dd> is used as the initial node, the adjacent node of <dd> is <ul> and the adjacent node of <ul> is <li>, so the order of the depth-first traversal is: <dd> → <ul> → <li>.
Step S402, starting from the initial node, performing depth-first traversal on all the tag pairs by taking the tags in the DOM tree structure as traversal nodes.
The element nodes contained in the generated DOM tree structure are taken as the tag elements of the web page, and depth-first traversal is performed in turn on the tag pairs corresponding to the tag elements, with the tag elements in the DOM tree structure as nodes. For example, if the tag element in the DOM tree structure is ul, the corresponding tag pair is <ul> and </ul>.
The depth-first traversal method, starting from an initial node v, is specifically as follows:
a. visit the initial node v and mark it as visited;
b. find the first adjacent node w of the initial node v;
c. if the node w exists, continue with step d; otherwise, the traversal from v ends;
d. if the node w has not been visited, recursively perform depth-first traversal on the node w (i.e., perform steps a, b and c with the node w as a new initial node v);
e. find the next adjacent node of the initial node v after w, and go to step c.
For example, assume that the web page structure is as follows: tag A is the first layer; the tags nested in tag A are tag B and tag C, which form the second layer; the tags nested in tag B are tag D and tag E, and the tag nested in tag C is tag F, so tag D, tag E and tag F form the third layer. Then, with tag A as the initial node, tag A has the adjacent nodes tag B and tag C, tag B has the adjacent nodes tag D and tag E, and tag C has the adjacent node tag F, so the order of the depth-first traversal is: tag A → tag B → tag D → tag E → tag C → tag F.
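The traversal order in this example can be checked with a short sketch. The dictionary `tree` and the function `depth_first` are hypothetical names used only for illustration; the tree is given as a mapping from each tag to the tags nested directly inside it.

```python
def depth_first(tree, start):
    """Return the nodes of `tree` in depth-first order, starting at `start`."""
    visited = []

    def visit(node):
        visited.append(node)              # step a: visit and mark as visited
        for child in tree.get(node, []):  # steps b and e: adjacent nodes in turn
            if child not in visited:      # step d: recurse on unvisited nodes
                visit(child)

    visit(start)
    return visited


# Tag A contains B and C; B contains D and E; C contains F.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
print(depth_first(tree, "A"))  # ['A', 'B', 'D', 'E', 'C', 'F']
```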
The tag definitions of a web page follow the tag specification, and the tag format is of the form <tag>page content</tag>. In this embodiment, each tag pair is traversed and the page content is extracted. During traversal, within the layer corresponding to each tag pair, the characters are searched; characters meeting the first preset condition are taken as first characters, and characters meeting the second preset condition are taken as second characters.
In this embodiment, the first preset condition is specifically that the character is a greater-than sign (">") appearing in the layer corresponding to the currently accessed tag pair, and the second preset condition is specifically that the character is the less-than sign ("<") that appears next after that ">" in the layer corresponding to the currently accessed tag pair. It should be understood that ">" and "<" are tag delimiter characters.
It should be noted that a tag pair may contain several ">" characters and several "<" characters. The ">" character that appears first in the tag pair is taken as a first character, and the "<" character that appears next after it is taken as the corresponding second character; the ">" character that appears second is taken as a first character, and the "<" character that appears next after it is taken as the corresponding second character; and so on. In this embodiment, the first character and the second character are ordered: the first character comes before the second character, and the content between them is page content that may be extracted.
In this embodiment, all characters in each tag pair may be searched, and the first characters and second characters that are found are marked respectively.
Specifically, all the characters in the tag pair may be searched by using a character string search method. String lookup methods include, but are not limited to, the following:
1. int indexOf(String str): returns the index of the first occurrence of the specified substring in this string.
2. int indexOf(String str, int startIndex): returns the index of the first occurrence of the specified substring in this string, searching forward from the specified index.
3. int lastIndexOf(String str): returns the index of the rightmost occurrence of the specified substring in this string.
4. int lastIndexOf(String str, int startIndex): returns the index of the last occurrence of the specified substring in this string, searching backward from the specified index.
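For readers working in Python, the four lookups above have close counterparts in `str.find` and `str.rfind`. The wrappers below are hypothetical helpers named after the Java methods purely for comparison. Like the Java methods, they return -1 when the substring is not found.

```python
def index_of(s, sub, start_index=0):
    """Forward search from start_index, like Java's indexOf."""
    return s.find(sub, start_index)


def last_index_of(s, sub, start_index=None):
    """Backward search, like Java's lastIndexOf.

    With start_index given, only matches that begin at or before
    start_index are considered.
    """
    if start_index is None:
        return s.rfind(sub)
    return s.rfind(sub, 0, start_index + len(sub))


s = "<li><em>x</em></li>"
print(index_of(s, ">"))          # 3: end of <li>
print(index_of(s, ">", 4))       # 7: end of <em>
print(last_index_of(s, "<"))     # 14: start of </li>
print(last_index_of(s, "<", 9))  # 9: start of </em>
```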
For each tag pair, the last character of the opening tag is the first occurrence of the first character (">"), and the first character of the closing tag is the last occurrence of the second character ("<"). Within each tag pair, all first and second characters between that first occurrence of the first character and that last occurrence of the second character are searched.
When a web page is revised, its page content usually changes little while its tags change. If the crawler acquires data by parsing the page row by row and column by column according to specific web page tags, the capture script has to be rewritten.
This embodiment determines the first character and the second character in each tag pair through depth-first traversal, which ensures that the page content is extracted completely and without omission.
In step S203, the target character between the start point and the end point is read with the first character as the start point and the second character as the end point.
In this embodiment, in one tag pair, there may be a plurality of ">" characters and a plurality of "<" characters, with the first occurring ">" character as a starting point, the "<" character appearing next to the first occurring ">" character as an end point, the second occurring ">" character as a starting point, the "<" character appearing next to the second occurring ">" character as an end point, the target character between the starting point and the end point is read, and so on.
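The reading step can be sketched as a simple scan over the text of one tag pair. The function name `read_target_characters` and the sample string are assumptions made for this illustration. Each ">" is treated as a starting point and the next "<" after it as the end point.

```python
def read_target_characters(tag_pair):
    """Collect the text between each '>' and the next '<' that follows it."""
    targets = []
    start = tag_pair.find(">")                 # first character (starting point)
    while start != -1:
        end = tag_pair.find("<", start + 1)    # second character (end point)
        if end == -1:
            break                              # no closing '<' after this '>'
        targets.append(tag_pair[start + 1:end])
        start = tag_pair.find(">", end + 1)    # next starting point
    return targets


sample = ('<li><em>&sdot;</em><a href="/topics/50245210" target="_blank">'
          'problem with respect to ADO and digital symbols</a></li>')
for text in read_target_characters(sample):
    print(repr(text))
```

Adjacent tags produce empty strings between ">" and "<", which is one reason a later extraction condition is needed. This naive scan also assumes that attribute values and page text contain no literal ">" or "<" characters.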
And step S204, judging whether the target character meets the extraction condition, and extracting the target character as page content when the extraction condition is determined to be met.
Specifically, the length of a target character between a first character and a second character is recorded, and if the length of the target character meets the extraction length, the target character is used as page content, and the page content is extracted. The extracted page content can be stored, and the user can conveniently check the page content.
In this embodiment, the tag format is <tag>page content</tag>, and the page content between the first character ">" and the second character "<" is what is to be extracted.
By way of example, one tag pair is as follows:
<li><em>&sdot;</em><a href="/topics/50245210" target="_blank">problem with respect to ADO and digital symbols</a></li>
It is necessary to extract the page content "problem with respect to ADO and digital symbols". Because several first characters and second characters exist in this tag pair, the character length between each first character and its following second character is recorded, and whether that span is the page content to be extracted is determined according to its length.
And the extraction length is set according to actual conditions, and when the target character between the first character and the second character is determined to meet the preset extraction length, the target character is taken as page content for extraction.
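A minimal sketch of the length check follows. The source does not say how the extraction length is compared, so the sketch assumes it is a minimum length; `extract_page_content`, `min_length` and the threshold value 10 are hypothetical.

```python
def extract_page_content(targets, min_length=10):
    """Keep only candidate strings whose stripped length reaches min_length.

    min_length is an assumed threshold; the source only says the extraction
    length is set according to actual conditions.
    """
    contents = []
    for target in targets:
        text = target.strip()
        if len(text) >= min_length:   # the assumed extraction condition
            contents.append(text)
    return contents


candidates = ["", "&sdot;", "", "problem with respect to ADO and digital symbols", ""]
print(extract_page_content(candidates))
```

With this threshold, the empty strings and the six-character entity `&sdot;` are discarded and only the title text is kept.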
It is emphasized that, in order to further ensure the privacy and security of the page contents, the page contents may also be stored in a node of a block chain.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Determining by character length whether the text between the first character and the second character is page content to be extracted prevents useless content from being extracted, which improves both the accuracy and the efficiency of the crawler's data capture.
In some optional implementations of this embodiment, after step S204, the following steps may also be performed:
the page content is stored in json format.
Specifically, the extracted page content is stored in json format or in a database, and the stored page content can be used directly for data analysis. Storage options include, but are not limited to, json-format text files, Excel, SQLite, and MySQL (a relational database management system).
JSON, short for JavaScript Object Notation, is a lightweight data-interchange format that can represent any supported type, such as a string, number, object, or array. In this embodiment, a path of a json file for storing page content is set, and the extracted page content is written into the json file through the path and stored.
It should be noted that storing in json format lets the stored data be inspected directly when the file is opened; with one record stored per line, this suits cases where the amount of crawled data is small and makes subsequent reading and analysis convenient. If the crawled data can easily be arranged as a table, it can be stored in Excel, where it is easier to inspect and where some simple operations can also be performed. SQLite requires no installation and is a zero-configuration database; when the amount of crawled data is large and persistent storage is needed but no other database is installed, SQLite can be chosen. MySQL is remotely accessible, meaning that data can be stored on a remote server host.
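Two of the storage options described above, a json file with one record per line and a SQLite database, can be sketched with the Python standard library. The function names, the `content` field, and the `page_content` table are illustrative assumptions and do not come from the application.

```python
import json
import sqlite3
from pathlib import Path


def save_as_json_lines(contents, path):
    """Write one JSON object per line so the file can be inspected row by row."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for text in contents:
            fh.write(json.dumps({"content": text}, ensure_ascii=False) + "\n")


def save_to_sqlite(contents, db_path):
    """Persist the extracted page content in a zero-configuration SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS page_content ("
                "id INTEGER PRIMARY KEY, content TEXT NOT NULL)"
            )
            conn.executemany(
                "INSERT INTO page_content (content) VALUES (?)",
                [(text,) for text in contents],
            )
    finally:
        conn.close()
```

A call such as `save_as_json_lines(["problem about ADO and numeric symbols"], "page_content.json")` produces a file that can be read back line by line with `json.loads`.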
By traversing all tag pairs of the target webpage, searching for the first character and the second character in each tag pair, reading the target character that lies between the first character as the starting point and the second character as the end point, and extracting the target character as page content once it is determined to meet the extraction condition, the present application captures webpage data simply by locating the corresponding characters in the tags. This avoids having to write a different crawling script for each website, enhances the adaptability of the crawler, reduces workload, and improves data capture efficiency.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by computer readable instructions directing the relevant hardware. The instructions can be stored in a computer readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time or in sequence, but may be performed at different times, in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 5, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a crawler-based data crawling apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 5, the crawler-based data crawling apparatus 500 according to the present embodiment includes: an acquisition module 501, a traversal module 502, a reading module, and an extraction module 503. Wherein:
the acquisition module 501 is configured to acquire a target webpage and parse the target webpage to obtain all tags of the target webpage;
the traversal module 502 is configured to obtain tag pairs according to the tags, traverse all the tag pairs, search for characters in each tag pair, take a character meeting a first preset condition as a first character, and take a character meeting a second preset condition as a second character;
the reading module is configured to read a target character between the starting point and the end point by taking the first character as the starting point and the second character as the end point;
the extraction module 503 is configured to determine whether the target character meets an extraction condition, and extract the target character as page content when it is determined that the extraction condition is met.
It is emphasized that, in order to further ensure the privacy and security of the page contents, the page contents may also be stored in a node of a block chain.
The crawler-based data crawling apparatus traverses all tag pairs of a target webpage, searches for the first character and the second character in each tag pair, reads the target character that lies between the first character as the starting point and the second character as the end point, and extracts the target character as page content once it is determined to meet the extraction condition. Webpage data is thus captured simply by locating the corresponding characters in the tags, which avoids having to write a different crawling script for each website, enhances the adaptability of the crawler, reduces workload, and improves data capture efficiency.
In this embodiment, the acquisition module 501 includes a parsing submodule and an obtaining submodule, where the parsing submodule is configured to extract the webpage structure of the target webpage by using a webpage parser, and the obtaining submodule is configured to obtain the tags in the webpage structure.
In the embodiment, the webpage structure is extracted through the webpage parser, and the tag is acquired from the webpage structure, so that the tag of the target webpage can be acquired more quickly and more conveniently.
In some optional implementations of this embodiment, the parsing submodule is further configured to:
analyzing the target webpage through the webpage analyzer to obtain an HTML document;
and analyzing the HTML document to generate a DOM tree structure, and taking the generated DOM tree structure as the webpage structure of the target webpage.
According to the embodiment, the DOM tree structure is generated by analyzing the target webpage, so that the performance of analyzing the webpage can be improved, and meanwhile, the webpage information can be completely and accurately acquired.
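Turning an HTML document into a tree of tags can be sketched with Python's standard `html.parser` module. This is a simplified stand-in for a real DOM tree, not the parser used by the application; the `Node` and `TreeBuilder` classes and the `build_tree` function are hypothetical names, and attributes are ignored.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a simplified tag tree while the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:          # void elements cannot have children
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].text += data


def build_tree(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For example, `build_tree("<ul><li>one</li><li>two</li></ul>").children[0]` is the `ul` node with two `li` children.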
In this embodiment, the traversing module 502 is further configured to:
taking the label at the outermost layer in the webpage structure as an initial node;
and starting from the initial node, taking the tags in the DOM tree structure as traversal nodes, and performing depth-first traversal on all the tag pairs.
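The depth-first traversal starting from the outermost tag can be sketched as below. The tree representation, a `(tag, children)` tuple, and the function name `depth_first_tags` are assumptions chosen to keep the example short; any DOM-like structure with ordered children would work the same way.

```python
from typing import Iterator


def depth_first_tags(root: tuple) -> Iterator[str]:
    """Yield tag names in depth-first (document) order, starting at the root."""
    stack = [root]
    while stack:
        tag, children = stack.pop()
        yield tag
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(children))


page = ("html", [
    ("head", [("title", [])]),
    ("body", [("ul", [("li", []), ("li", [])]), ("p", [])]),
])

print(list(depth_first_tags(page)))
# → ['html', 'head', 'title', 'body', 'ul', 'li', 'li', 'p']
```

An explicit stack is used instead of recursion so that deeply nested pages do not hit Python's recursion limit.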
In some optional implementations of this embodiment, the traversing module 502 is further configured to:
and searching all characters in the label pair by using a character string searching method.
The embodiment determines the first character and the second character in each tag pair through depth-first traversal, which ensures that page content is extracted completely, without omission.
In this embodiment, the extracting module 503 includes a recording submodule and an extracting submodule, and the recording submodule is used to record the length of the target character; and the extraction submodule is used for determining that the length of the target character meets the extraction length, and extracting the page content by taking the target character as the page content.
Determining by character length whether the text between the first character and the second character is page content to be extracted prevents useless content from being extracted, which improves both the accuracy and the efficiency of the crawler's data capture.
In some optional implementations of this embodiment, the crawler-based data crawling apparatus 500 further includes a storage module, and the storage module is configured to store the page content in a json format.
The extracted page content is stored in the database, and the stored page content can be directly used for data analysis.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 6, fig. 6 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62, and a network interface 63 communicatively connected to each other via a system bus. It is noted that only a computer device 6 having components 61-63 is shown, but it is understood that not all of the shown components are required, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 61 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing an operating system installed in the computer device 6 and various types of application software, such as computer readable instructions of a crawler-based data crawling method. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute computer readable instructions stored in the memory 61 or process data, for example, execute computer readable instructions of the crawler-based data crawling method.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
In this embodiment, when the processor executes the computer readable instructions stored in the memory, the steps of the crawler-based data crawling method according to the above embodiments are implemented: all tag pairs of the target webpage are traversed, the first character and the second character are searched for in each tag pair, the target character between the first character as the starting point and the second character as the end point is read, and the target character is extracted as page content once it is determined to meet the extraction condition. Webpage data is thus captured simply by locating the corresponding characters in the tags, which avoids having to write a different crawling script for each website, enhances the adaptability of the crawler, reduces workload, and improves data capture efficiency.
The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions that are executable by at least one processor, so that the at least one processor performs the steps of the above crawler-based data crawling method: all tag pairs of a target webpage are traversed, the first character and the second character are searched for in each tag pair, the target character between the first character as the starting point and the second character as the end point is read, and the target character is extracted as page content once it is determined to meet the extraction condition. Webpage data is thus captured simply by locating the corresponding characters in the tags, which avoids having to write a different crawling script for each website, enhances the adaptability of the crawler, reduces workload, and improves data capture efficiency.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are only some, not all, of the embodiments of the present application, and that the appended drawings show preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their features with equivalents. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are within the protection scope of the present application.

Claims (10)

1. A crawler-based data capture method is characterized by comprising the following steps:
acquiring a target webpage, and analyzing the target webpage to obtain all tags of the target webpage;
obtaining label pairs according to the labels, traversing all the label pairs, searching characters in each label pair, taking the characters meeting a first preset condition as first characters, and taking the characters meeting a second preset condition as second characters;
reading all target characters between the starting point and the end point by taking the first character as the starting point and the second character as the end point;
and judging whether the target character meets the extraction condition, and extracting the target character as page content when the target character meets the extraction condition.
2. The crawler-based data crawling method according to claim 1, wherein the step of parsing the target webpage to obtain the tag of the target webpage comprises:
extracting a webpage structure of the target webpage by using a webpage parser;
and acquiring the label in the webpage structure.
3. The crawler-based data crawling method according to claim 2, wherein the step of extracting the web page structure of the target web page using a web page parser comprises:
analyzing the target webpage through the webpage analyzer to obtain an HTML document;
and analyzing the HTML document to generate a DOM tree structure, and taking the generated DOM tree structure as the webpage structure of the target webpage.
4. The crawler-based data crawling method according to claim 3, wherein the step of traversing all the tag pairs comprises:
taking the label at the outermost layer in the webpage structure as an initial node;
and starting from the initial node, taking the tags in the DOM tree structure as traversal nodes, and performing depth-first traversal on all the tag pairs.
5. The crawler-based data crawling method according to claim 1, wherein the step of searching for the characters in each of the tag pairs comprises:
and searching all characters in the label pair by using a character string searching method.
6. The crawler-based data crawling method according to claim 1, wherein the step of judging whether the target character meets an extraction condition, and extracting the target character as page content when it is determined that the extraction condition is met comprises:
recording the length of the target character;
and if the length of the target character meets the extraction length, taking the target character as page content, and extracting the page content.
7. The crawler-based data crawling method according to claim 1, further comprising, after the step of extracting the target characters as page content:
and storing the page content in a json format.
8. A crawler-based data crawling apparatus, characterized by comprising:
the acquisition module is used for acquiring a target webpage and analyzing the target webpage to obtain all tags of the target webpage;
the traversal module is used for obtaining label pairs according to the labels, traversing all the label pairs, searching characters in each label pair, taking the characters meeting a first preset condition as first characters, and taking the characters meeting a second preset condition as second characters;
the reading module is used for reading a target character between the starting point and the end point by taking the first character as the starting point and the second character as the end point;
and the extraction module is used for judging whether the target characters meet extraction conditions or not, and extracting the target characters as page content when the extraction conditions are determined to be met.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor that, when executing the computer readable instructions, implements the steps of the crawler-based data crawling method according to any one of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the crawler-based data crawling method according to any one of claims 1 to 7.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110213211.3A CN112925968A (en) 2021-02-25 2021-02-25 Crawler-based data capturing method and device, computer equipment and storage medium
PCT/CN2021/124394 WO2022179128A1 (en) 2021-02-25 2021-10-18 Crawler-based data crawling method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110213211.3A CN112925968A (en) 2021-02-25 2021-02-25 Crawler-based data capturing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112925968A true CN112925968A (en) 2021-06-08

Family

ID=76171932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110213211.3A Pending CN112925968A (en) 2021-02-25 2021-02-25 Crawler-based data capturing method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112925968A (en)
WO (1) WO2022179128A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314520A (en) * 2011-10-24 2012-01-11 莫雅静 Webpage text extraction method and device based on statistical backtracking positioning
US20150143230A1 (en) * 2012-08-03 2015-05-21 Tencent Technology (Shenzhen) Company Limited Method and device for displaying webpage contents in browser
CN104866512A (en) * 2014-02-26 2015-08-26 腾讯科技(深圳)有限公司 Method, device and system for extracting webpage content
CN107861974A (en) * 2017-09-19 2018-03-30 北京金堤科技有限公司 A kind of adaptive network crawler system and its data capture method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
US20170192882A1 (en) * 2016-01-06 2017-07-06 Hcl Technologies Limited Method and system for automatically generating a plurality of test cases for an it enabled application
CN108804458B (en) * 2017-05-02 2021-08-17 创新先进技术有限公司 Crawler webpage collecting method and device
CN110472126A (en) * 2018-05-10 2019-11-19 中国移动通信集团浙江有限公司 A kind of acquisition methods of page data, device and equipment
CN110764994A (en) * 2019-09-04 2020-02-07 深圳壹账通智能科技有限公司 Page element packaging method and device, electronic equipment and storage medium
CN110874428A (en) * 2019-11-11 2020-03-10 汉口北进出口服务有限公司 Structured data extraction device and method for e-commerce page and readable storage medium
CN111125598A (en) * 2019-12-20 2020-05-08 深圳壹账通智能科技有限公司 Intelligent data query method, device, equipment and storage medium
CN111737623A (en) * 2020-06-19 2020-10-02 深圳市小满科技有限公司 Webpage information extraction method and related equipment
CN111797336A (en) * 2020-07-07 2020-10-20 北京明略昭辉科技有限公司 Webpage parsing method and device, electronic equipment and medium
CN112925968A (en) * 2021-02-25 2021-06-08 深圳壹账通智能科技有限公司 Crawler-based data capturing method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
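洪鸿辉; 丁世涛; 黄傲; 郭致远: "Web page body text extraction method based on text and symbol density" (基于文本及符号密度的网页正文提取方法), 电子设计工程 (Electronic Design Engineering), no. 08, pages 139-143 *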

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022179128A1 (en) * 2021-02-25 2022-09-01 深圳壹账通智能科技有限公司 Crawler-based data crawling method and apparatus, computer device, and storage medium
CN116881595A (en) * 2023-09-06 2023-10-13 江西顶易科技发展有限公司 Customizable webpage data crawling method
CN116881595B (en) * 2023-09-06 2023-12-15 江西顶易科技发展有限公司 Customizable webpage data crawling method

Also Published As

Publication number Publication date
WO2022179128A1 (en) 2022-09-01

Similar Documents

Publication Publication Date Title
US10380197B2 (en) Network searching method and network searching system
CN112015430A (en) JavaScript code translation method and device, computer equipment and storage medium
CN102779169A (en) Extracting method and device for webpage content based on HTML (Hypertext Markup Language) label
WO2022179128A1 (en) Crawler-based data crawling method and apparatus, computer device, and storage medium
CN113377373A (en) Page loading method and device based on analysis engine, computer equipment and medium
CN110825941A (en) Content management system identification method, device and storage medium
CN112650905A (en) Anti-crawler method and device based on label, computer equipment and storage medium
US10755091B2 (en) Method and apparatus for retrieving image-text block from web page
US10042827B2 (en) System and method for recognizing non-body text in webpage
CN112395485A (en) Policy big data mining method and device, computer equipment and storage medium
CN116644213A (en) XML file reading method, device, equipment and storage medium
CN103020179A (en) Method, device and equipment for extracting webpage contents
CN114443928A (en) Web text data crawler method and system
Yu et al. Web content information extraction based on DOM tree and statistical information
CN111797297B (en) Page data processing method and device, computer equipment and storage medium
CN113849718A (en) Internet tobacco science and technology information automatic acquisition device, method and storage medium
CN112380337A (en) Highlight method and device based on rich text
CN104778232A (en) Searching result optimizing method and device based on long query
CN112667208A (en) Translation error recognition method and device, computer equipment and readable storage medium
Wang et al. A novel web page text information extraction method
CN112579937A (en) Character highlight display method and device
Alnavar et al. Document Parsing Tool for Language Translation and Web Crawling using Django REST Framework
CN110825976B (en) Website page detection method and device, electronic equipment and medium
CN113221035A (en) Method, apparatus, device, medium, and program product for determining an abnormal web page
CN110837614A (en) Method and system for efficiently generating webpage information extraction rule

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40049923; Country of ref document: HK)