CN117435795A - Data acquisition method, apparatus, device, storage medium, and program product - Google Patents


Info

Publication number
CN117435795A
Authority
CN
China
Prior art keywords
crawler
theme
data
preset
configuration file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311609483.0A
Other languages
Chinese (zh)
Inventor
廖浚深
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202311609483.0A
Publication of CN117435795A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The disclosure provides a data acquisition method, relates to the technical field of big data, and can be applied to the technical field of finance. The method comprises the following steps: responding to a data request instruction, and acquiring a crawler configuration file; consuming address information in an address queue to be crawled to obtain a webpage resource document, wherein the address information is obtained by analyzing according to the crawler configuration file; determining target data in the webpage resource document according to the extraction rule of the crawler configuration file; performing theme judgment on the target data according to a preset theme, wherein the preset theme is preset in the crawler configuration file by a user; and performing persistence processing on the target resource data conforming to the theme judgment. The present disclosure also provides a data acquisition apparatus, device, storage medium, and program product.

Description

Data acquisition method, apparatus, device, storage medium, and program product
Technical Field
The present disclosure relates to the field of big data technology, and in particular, to the field of data acquisition technology, and more particularly, to a data acquisition method, apparatus, device, storage medium, and program product.
Background
As the volume of data continues to grow, aggregating and processing such massive amounts of data with common crawler programs has become inefficient. Traditional web crawler technology cannot customize crawling strategies and crawler algorithms for specific topics and information scenarios, so in most cases the obtained webpage information has a low degree of correlation with the topic.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a data acquisition method, apparatus, device, storage medium, and program product that improve data acquisition efficiency.
According to a first aspect of the present disclosure, there is provided a data acquisition method, the method comprising:
responding to a data request instruction, and acquiring a crawler configuration file;
consuming address information in an address queue to be crawled to obtain a webpage resource document, wherein the address information is obtained by analyzing according to the crawler configuration file;
determining target data in the webpage resource document according to the extraction rule of the crawler configuration file;
performing theme judgment on the target data according to a preset theme, wherein the preset theme is preset in the crawler configuration file by a user; and
performing persistence processing on the target resource data conforming to the theme judgment.
According to an embodiment of the present disclosure, after obtaining the web resource document, the method further includes:
analyzing the webpage resource document to determine new address information; and
adding the new address information to the address queue to be crawled.
According to an embodiment of the present disclosure, the performing, according to a preset theme, theme determination on the target data includes:
performing text preprocessing on the target data;
calculating the correlation between the preprocessed target data and a theme sample of a preset theme;
if the correlation degree is greater than or equal to a preset threshold value, determining that the target data is related to the preset theme; and
if the correlation degree is smaller than the preset threshold value, determining that the target data is irrelevant to the preset theme.
According to an embodiment of the present disclosure, the extraction rule includes regular expressions, XPath, and CSS selectors, and the determining target data in the webpage resource document according to the extraction rule of the crawler configuration file includes:
extracting the information conforming to the extraction rule from the webpage resource document.
According to an embodiment of the disclosure, the crawler configuration file includes a crawler task table, a crawler configuration table, and a crawler field table.
According to the embodiment of the disclosure, the crawler task table stores basic information of each crawler task, the crawler configuration table stores crawler configurations corresponding to the crawler tasks, the crawler field table stores field information of crawlers, and the field information comprises extraction rules and preset theme samples.
A second aspect of the present disclosure provides a data acquisition apparatus, the apparatus comprising:
the first acquisition module is used for responding to the data request instruction and acquiring a crawler configuration file;
the second acquisition module is used for consuming the address information in the address queue to be crawled to acquire a webpage resource document, and the address information is obtained by analyzing according to the crawler configuration file;
the determining module is used for determining target data in the webpage resource document according to the extraction rule of the crawler configuration file;
the theme judging module is used for performing theme judgment on the target data according to a preset theme, wherein the preset theme is preset in the crawler configuration file by a user; and
the storage module is used for performing persistence processing on the target resource data conforming to the theme judgment.
According to an embodiment of the present disclosure, the apparatus further comprises a parsing module and a queue updating module, wherein:
the parsing module is used for analyzing the webpage resource document to determine new address information; and
the queue updating module is used for adding the new address information to the address queue to be crawled.
According to an embodiment of the present disclosure, the theme judging module includes a text preprocessing sub-module, a correlation calculation sub-module, a first determining sub-module, and a second determining sub-module, wherein:
the text preprocessing sub-module is used for performing text preprocessing on the target data;
the correlation calculation sub-module is used for calculating the correlation between the preprocessed target data and a theme sample of a preset theme;
the first determining sub-module is used for determining that the target data is related to the preset theme if the correlation degree is greater than or equal to a preset threshold value; and
the second determining sub-module is used for determining that the target data is irrelevant to the preset theme if the correlation degree is smaller than the preset threshold value.
According to an embodiment of the disclosure, the determining module is specifically configured to extract information in the web resource document that accords with the extraction rule.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the data acquisition method described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described data acquisition method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described data acquisition method.
According to the data acquisition method provided by the embodiment of the disclosure, a crawler configuration file is acquired in response to a data request instruction; address information in an address queue to be crawled is consumed to obtain a webpage resource document, wherein the address information is obtained by analyzing according to the crawler configuration file; target data in the webpage resource document is determined according to the extraction rule of the crawler configuration file; theme judgment is performed on the target data according to a preset theme, wherein the preset theme is preset in the crawler configuration file by a user; and persistence processing is performed on the target resource data conforming to the theme judgment. Compared with the prior art, the data acquisition method provided by the embodiment of the disclosure uses focused web crawler technology to automatically acquire and process mass data for a specific scenario, which reduces manual intervention and improves data acquisition efficiency.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1a schematically illustrates a system architecture diagram of a data acquisition device according to an embodiment of the present disclosure;
FIG. 1b schematically illustrates a system architecture diagram of a data acquisition module according to an embodiment of the present disclosure;
FIG. 1c schematically illustrates an overall system architecture diagram of a data acquisition device according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates an application scenario diagram of a data acquisition method, apparatus, device, storage medium and program product according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a method of data acquisition provided in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a method of topic determination provided in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of data acquisition provided in accordance with another embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of a data acquisition device according to an embodiment of the present disclosure; and
FIG. 7 schematically illustrates a block diagram of an electronic device adapted to implement a data acquisition method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B, and C, etc." are used, the expressions should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B, and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
Web crawling is a cyclic process: the crawler repeatedly acquires new URLs and webpage information. Its workflow can be summarized in the following steps: first, an initial link is acquired from a seed source and added to a queue to be crawled; second, links are taken from the queue according to a preset rule and a request is sent to the server to obtain the HTML source code; third, the acquired webpage is parsed, and the required data is extracted and stored in a database; meanwhile, newly found hyperlinks are de-duplicated and stored in the queue to be crawled; the crawler then continues to take the next link from the queue and extract data until a termination condition is met. Using crawler technology generally requires some knowledge of crawler programming, which carries a learning cost. In addition, a crawling strategy and crawler algorithm cannot be customized for every topic and information scenario, so in most cases the obtained webpage information has a low degree of correlation with the topic and needs to be filtered and screened.
Based on the technical problems described above, an embodiment of the present disclosure provides a data acquisition method, including: responding to a data request instruction, and acquiring a crawler configuration file; consuming address information in an address queue to be crawled to obtain a webpage resource document, wherein the address information is obtained by analyzing according to the crawler configuration file; determining target data in the webpage resource document according to the extraction rule of the crawler configuration file; performing theme judgment on the target data according to a preset theme, wherein the preset theme is preset in the crawler configuration file by a user; and performing persistence processing on the target resource data conforming to the theme judgment.
Fig. 1a schematically shows a system architecture diagram of a data acquisition device according to an embodiment of the present disclosure. Fig. 1b schematically illustrates a system architecture diagram of a data acquisition module according to an embodiment of the present disclosure. Fig. 1c schematically shows an overall system architecture diagram of a data acquisition device according to an embodiment of the present disclosure. As shown in fig. 1a, the data acquisition device of the embodiment of the present disclosure includes a data acquisition module and a topic content judging module. The data acquisition module is used for URL storage, page acquisition, and page analysis; the topic content judging module is used for text extraction and topic content judgment. As shown in fig. 1b and 1c, the data acquisition module is based on the WebMagic framework, which provides an API and four core components, namely Downloader, PageProcessor, Scheduler, and Pipeline, to facilitate customized crawler development.
Downloader: responsible for downloading pages from the Internet and passing the downloaded page information to the PageProcessor through a Page object for subsequent page processing.
PageProcessor: mainly responsible for parsing the downloaded page information, extracting useful information that conforms to the rules, finding new links, and placing the new links into the URL queue to be processed. The rules include regular expressions, XPath, CSS selectors, and the like, and the PageProcessor needs to be customized according to individual needs and the structure of the website.
Scheduler: mainly responsible for managing the URLs to be grabbed that are obtained from the PageProcessor and for de-duplicating them. The PageProcessor passes the acquired URL addresses to the Scheduler, and the Scheduler passes the URL addresses that have not yet been grabbed to the Downloader for page downloading. When no URL addresses that have not been grabbed remain in the Scheduler, the grabbing task ends.
Pipeline: mainly responsible for managing the crawler results. Different Pipelines can be used to output the crawler results, for example to the console, to a file, or to a file in JSON format.
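The cooperation of the four components can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not WebMagic code: the class names merely mirror the component roles described above, and FAKE_WEB, the example.test URLs, and ListPipeline are hypothetical stand-ins so that the loop runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.test/a": ("page A", ["http://example.test/b", "http://example.test/c"]),
    "http://example.test/b": ("page B", ["http://example.test/a"]),
    "http://example.test/c": ("page C", []),
}


class Scheduler:
    """Holds URLs waiting to be crawled and drops URLs already seen."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def poll(self):
        return self._queue.popleft() if self._queue else None


class Downloader:
    """Fetches a page; here it reads FAKE_WEB instead of the network."""

    def download(self, url):
        return FAKE_WEB.get(url)


class PageProcessor:
    """Extracts the wanted fields and the new links from a downloaded page."""

    def process(self, url, page):
        text, links = page
        return {"url": url, "text": text}, links


class ListPipeline:
    """Collects results in memory; another pipeline could write to a file."""

    def __init__(self):
        self.items = []

    def process(self, item):
        self.items.append(item)


def run(seed, scheduler, downloader, processor, pipeline):
    scheduler.push(seed)
    while True:
        url = scheduler.poll()
        if url is None:  # nothing left to grab: the task ends
            break
        page = downloader.download(url)
        if page is None:
            continue
        item, links = processor.process(url, page)
        pipeline.process(item)
        for link in links:
            scheduler.push(link)  # duplicates are filtered by the scheduler


pipeline = ListPipeline()
run("http://example.test/a", Scheduler(), Downloader(), PageProcessor(), pipeline)
print([item["url"] for item in pipeline.items])
```

Page A links back from page B, yet each URL is processed only once because de-duplication happens when a link is pushed into the scheduler.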
Fig. 2 schematically illustrates an application scenario diagram of a data acquisition method, apparatus, device, storage medium and program product according to an embodiment of the present disclosure.
As shown in fig. 2, the application scenario 200 according to this embodiment may include a data acquisition scenario. The network 204 is the medium used to provide communication links between the terminal devices 201, 202, 203 and the server 205. The network 204 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 205 via the network 204 using the terminal devices 201, 202, 203 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 201, 202, 203.
The terminal devices 201, 202, 203 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 205 may be a data acquisition server, which may perform the data acquisition method provided by the embodiments of the present disclosure, and acquire a crawler configuration file in response to a data request instruction initiated by a user through the terminal devices 201, 202, 203; consuming address information in an address queue to be crawled to obtain a webpage resource document, wherein the address information is obtained by analyzing according to the crawler configuration file; determining target data in the webpage resource document according to the extraction rule of the crawler configuration file; performing theme judgment on the target data according to a preset theme, wherein the preset theme is preset in the crawler configuration file by a user; and performing persistence processing on the target resource data conforming to the theme judgment.
It should be noted that the data acquisition method provided in the embodiments of the present disclosure may be generally performed by the server 205. Accordingly, the data acquisition device provided by the embodiments of the present disclosure may be generally disposed in the server 205. The data acquisition method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 205 and is capable of communicating with the terminal devices 201, 202, 203 and/or the server 205. Accordingly, the data acquisition apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 205 and capable of communicating with the terminal devices 201, 202, 203 and/or the server 205.
It should be understood that the number of terminal devices, networks and servers in fig. 2 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the data acquisition method and apparatus of the embodiments of the present disclosure may be used in the technical field of application programs, in the financial field, or in any field other than the financial field; the application field of the data acquisition method and apparatus is not limited.
The data acquisition method according to the embodiments of the present disclosure will be described in detail below with reference to fig. 3 to 5 based on the system architecture described in fig. 1a to 1c and the application scenario described in fig. 2.
Fig. 3 schematically illustrates a flowchart of a data acquisition method provided according to an embodiment of the present disclosure. As shown in fig. 3, the data acquisition method of this embodiment includes operations S210 to S250, which may be performed by a server or other computing device.
In operation S210, a crawler configuration file is acquired in response to the data request instruction.
In operation S220, address information in the address queue to be crawled is consumed to obtain a webpage resource document, where the address information is obtained by analyzing according to the crawler configuration file.
In operation S230, target data in the webpage resource document is determined according to the extraction rule of the crawler configuration file.
In operation S240, theme judgment is performed on the target data according to a preset theme, where the preset theme is preset in the crawler configuration file by the user.
In operation S250, persistence processing is performed on the target resource data conforming to the theme judgment.
In one example, the implementation of the topic-type configurable crawler in the embodiment of the disclosure may be divided into five modules, namely a data request module, a data crawling module, a URL queue management module, a data processing module, and a topic judgment module. The first four modules correspond to the foregoing Downloader, PageProcessor, Scheduler, and Pipeline components, respectively.
In one example, the Downloader acts as both a consumer and a producer: it consumes requests and produces Page results. It contains an HttpRequest request object encapsulation class, an HttpResponse response object encapsulation class, and a RequestExecutor request executor. The implementation refers to the download method of the HttpClientDownloader class, in which a request is sent and page information is acquired according to the HTTP request passed in. The RequestExecutor request executor not only implements the download method but can also assemble parameters according to preset configuration file information and pass them to the method to complete the download of the webpage. To crawl dynamic pages rendered by JavaScript, the webmagic-selenium package is introduced and modified as follows. First, a Selenium configuration file is added, and in the WebDriverPool class the constant DEFAULT_CONFIG_FILE is changed to the path of that configuration file and the way the configuration file is read is modified. Second, the download method of the Selenium downloader class is modified so that the operations simulated by Selenium match the requirements. In the embodiment of the present disclosure, the simulated operation is simply scrolling the browser down so that the JavaScript-rendered page finishes loading.
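As an illustration of assembling request parameters from preset configuration and then downloading with retries, the following Python sketch may help. It is not the RequestExecutor described above: build_request, download, the configuration keys, and flaky_opener are all hypothetical names, and the opener is injected so that the example runs without network access.

```python
import time
import urllib.request


def build_request(url, config):
    """Combine per-task configuration into a Request object (nothing is sent)."""
    headers = dict(config.get("headers", {}))
    if config.get("user_agent"):
        headers["User-Agent"] = config["user_agent"]
    cookies = config.get("cookies", {})
    if cookies:
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return urllib.request.Request(url, headers=headers, method="GET")


def download(request, opener, retries=2, sleep_seconds=0.0):
    """Call opener(request), retrying on OSError up to `retries` extra times."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return opener(request)
        except OSError as error:  # urllib.error.URLError is an OSError subclass
            last_error = error
            if attempt < retries:
                time.sleep(sleep_seconds)
    raise last_error


# Hypothetical configuration values, as they might come from a crawler config.
config = {
    "user_agent": "demo-crawler/0.1",
    "headers": {"Accept": "text/html"},
    "cookies": {"session": "abc", "lang": "en"},
}
request = build_request("http://example.test/list?page=1", config)

attempts = []


def flaky_opener(req):
    """Stand-in for a real HTTP call: fails once, then returns a page body."""
    attempts.append(req.full_url)
    if len(attempts) < 2:
        raise OSError("temporary failure")
    return b"<html>ok</html>"


body = download(request, flaky_opener, retries=2, sleep_seconds=0.0)
# urllib stores header names via str.capitalize(), hence "User-agent" here.
print(request.get_header("User-agent"), request.get_header("Cookie"), body)
```

In real use, the opener could be a small wrapper around urllib.request.urlopen; passing it in as a parameter keeps the retry logic testable.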
In one example, the data crawling module obtains the required resources from the downloaded HTML resource document and stores them in custom variables; it is a rewrite of the PageProcessor component of the WebMagic framework. The PageProcessor is the module responsible for page processing. Its main task is to parse the page downloaded by the Downloader module, extract the required content from it, and find new URLs to add to the queue to be crawled. Its input is of the Page type and its output is of the Request type. An abstract processor is created according to a SpiderConfig object, where the SpiderConfig object is generated from information filled in by the user or from preset configuration file information. The abstract processor implements the PageProcessor interface and assigns the fields configured in the SpiderConfig, such as the domain name, User-Agent, sleep time, and retry count, to the Site object. After the crawler robot starts running, the processing method of the Pipeline is customized in the abstract processor, and the crawled content is collected in a map.
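A rule-driven extraction step of this kind can be sketched as follows. This is an assumption-laden illustration in Python, not the PageProcessor implementation: FIELD_RULES, extract, and process are hypothetical names, the page is a small well-formed XHTML fragment, and the XPath support is the limited subset offered by xml.etree.ElementTree. CSS selectors are omitted because the Python standard library has no CSS selector engine.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree requires valid XML).
PAGE = """<html><body>
<h1 class="title">Deposit rates</h1>
<div id="content"><p>Rate is 1.5%</p><p>Updated 2023-11-01</p></div>
<a href="/news/1">one</a><a href="/news/2">two</a>
</body></html>"""

# Hypothetical per-field rules, as they might be read from a crawler field table.
FIELD_RULES = {
    "title": {"type": "xpath", "expr": ".//h1[@class='title']"},
    "date": {"type": "regex", "expr": r"\d{4}-\d{2}-\d{2}"},
}


def extract(document, rule):
    """Apply one extraction rule and return the list of matched strings."""
    kind, expr = rule["type"], rule["expr"]
    if kind == "regex":
        return re.findall(expr, document)
    if kind == "xpath":
        root = ET.fromstring(document)
        return ["".join(node.itertext()).strip() for node in root.findall(expr)]
    raise ValueError(f"unsupported rule type: {kind}")


def process(document, field_rules):
    """Return (extracted fields, newly discovered links) for one page."""
    fields = {name: extract(document, rule) for name, rule in field_rules.items()}
    root = ET.fromstring(document)
    links = [a.get("href") for a in root.iter("a") if a.get("href")]
    return fields, links


fields, links = process(PAGE, FIELD_RULES)
print(fields)
print(links)
```

Real-world HTML is often not well-formed XML, so a production crawler would normally rely on a tolerant HTML parser; the strict parser is used here only to keep the sketch dependency-free.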
In one example, the URL queue management module is a rewrite of the Scheduler component of the WebMagic framework. The Scheduler component is responsible for coordinating the PageProcessor component and the Downloader component. The WebMagic framework has several commonly used schedulers built in; a custom implementation needs to inherit the abstract base class DuplicateRemovedScheduler, which provides some template methods. The URL queue management module implements a custom scheduler, CountDownScheduler, which inherits from the abstract base class DuplicateRemovedScheduler and implements the MonitorableScheduler interface so that the number of crawled URLs can be monitored. A blocking queue stores the acquired URLs, and duplicate URLs are removed, which realizes URL management.
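The idea of a de-duplicating scheduler backed by a blocking queue, with counters for monitoring, can be sketched in Python as below. This is not the CountDownScheduler itself: CountingScheduler and its method names are hypothetical, and the sketch only assumes the behavior described above (store URLs, drop repeats, report counts).

```python
import queue
import threading


class CountingScheduler:
    """Blocking URL queue with duplicate removal and simple monitoring."""

    def __init__(self):
        self._queue = queue.Queue()   # thread-safe blocking queue
        self._seen = set()            # every URL ever accepted
        self._lock = threading.Lock()

    def push(self, url):
        """Enqueue url unless it was seen before; return True if accepted."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
        self._queue.put(url)
        return True

    def poll(self, timeout=0.1):
        """Wait up to `timeout` seconds for a URL; return None if none arrives."""
        try:
            return self._queue.get(timeout=timeout)
        except queue.Empty:
            return None

    def left_requests_count(self):
        """Number of URLs still waiting to be crawled."""
        return self._queue.qsize()

    def total_requests_count(self):
        """Number of distinct URLs accepted so far."""
        with self._lock:
            return len(self._seen)


scheduler = CountingScheduler()
accepted = [scheduler.push(u) for u in ("http://example.test/a",
                                        "http://example.test/b",
                                        "http://example.test/a")]
first = scheduler.poll()
print(accepted, first, scheduler.left_requests_count(), scheduler.total_requests_count())
```

Keeping the seen-set separate from the queue means a URL stays de-duplicated even after it has been polled and crawled.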
In one example, the data processing module is mainly responsible for the subsequent processing of the crawled data. The module is a rewrite of the Pipeline component of the WebMagic framework, and the custom Pipeline uniformly stores the crawled data in MongoDB.
According to the data acquisition method provided by the embodiment of the disclosure, focused web crawler technology is used to automatically acquire and process mass data for a specific scenario, which reduces manual intervention, effectively improves data crawling efficiency, reduces the amount and proportion of dirty data that is crawled, saves the labor cost of manually maintaining crawler queries, and reduces operational risk.
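A custom pipeline that persists records whose fields differ from page to page might look like the sketch below. To stay runnable without a database, it writes JSON lines to an in-memory text sink as a stand-in for a document store; JsonLinesPipeline and the sample records are hypothetical, and no MongoDB driver call is shown or implied.

```python
import io
import json


class JsonLinesPipeline:
    """Persists each crawled item as one JSON document per line."""

    def __init__(self, sink):
        self._sink = sink  # any object with a write(str) method

    def process(self, item):
        # Items are plain dicts, so each record may carry different fields.
        self._sink.write(json.dumps(item, ensure_ascii=False, sort_keys=True) + "\n")


sink = io.StringIO()
pipeline = JsonLinesPipeline(sink)
pipeline.process({"url": "http://example.test/news/1", "title": "Deposit rates"})
pipeline.process({"url": "http://example.test/rate", "rate": 1.5, "tags": ["rate"]})

# Read the stored documents back to confirm that nothing was lost.
records = [json.loads(line) for line in sink.getvalue().splitlines()]
print(records)
```

Swapping the sink for an open file, or the class for one that inserts into a document database, would not change the crawler code that calls process.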
Fig. 4 schematically illustrates a flow chart of a method of topic determination provided in accordance with an embodiment of the present disclosure. As shown in fig. 4, operation S240 includes operations S241 to S244.
In operation S241, text preprocessing is performed on the target data.
In operation S242, a correlation between the preprocessed target data and a topic sample of a preset topic is calculated.
In operation S243, if it is determined that the correlation is greater than or equal to a preset threshold, it is determined that the target data is related to the preset theme.
In operation S244, if it is determined that the correlation is less than the preset threshold, it is determined that the target data is irrelevant to the preset theme.
In one example, in the data crawling module, web page content is obtained, text preprocessing is performed on the content, the text preprocessing includes removing HTML tags, extracting plain text, performing word segmentation and removing stop words, and some keywords or phrases provided by a user in information filled in a graphical page or provided configuration files are used as temporary preset theme samples. Before the correlation degree calculation is carried out by using a TF-IDF algorithm, vectorization is needed to be carried out on the temporary theme sample and the webpage content, and then the word frequency TF and the inverse document frequency IDF are obtained. The vectorization and TF-IDF calculation process can be simplified by means of some libraries. One common library in the Java language is Apache Lucene, which provides rich text processing and searching functions, including the implementation of TF-IDF. Let W denote the degree of correlation between the web page to be processed and the preset theme configuration, and P denote the lower limit value of the degree of correlation. If W > =p, it can be reasonably determined that there is a certain correlation between the data content of the current web page and the set theme, and the degree of correlation will be different according to the value of W. Conversely, when W < P, this means that the web page data is not sufficiently correlated with the preset theme, and cannot be regarded as having an actual correlation. Therefore, the setting of the P value has a direct influence on the crawling result of the automatic web crawler robot. If the P value is set too large, the result will appear as too little web page data acquired by the crawler robot, limiting its effectiveness. Conversely, if the P value is set too small, the web page data acquired by the crawler robot will be too much, so that unnecessary junk data is introduced, and the working efficiency of the robot is reduced. 
Therefore, proper setting of the P value is a key to ensuring the work effectiveness and the quality of the results of the crawler robot.
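The comparison of W against P can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the use of cosine similarity as the correlation W, the smoothed IDF formula, the two-document corpus, and the default threshold `p=0.2` are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight positive.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(page_tokens, theme_tokens):
    """W: correlation between the page and the theme sample."""
    page_vec, theme_vec = tf_idf_vectors([page_tokens, theme_tokens])
    return cosine(page_vec, theme_vec)


def is_relevant(page_tokens, theme_tokens, p=0.2):
    """Apply the rule W >= P; p is a hypothetical threshold value."""
    return relevance(page_tokens, theme_tokens) >= p
```

In this sketch, raising `p` makes `is_relevant` reject more pages and lowering it admits more, which is the trade-off described above. A production system would normally compute the IDF over a larger corpus than the two documents used here.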
Fig. 5 schematically illustrates a flow chart of a data acquisition method provided in accordance with another embodiment of the present disclosure. As shown in fig. 5, operations S310 to S370 are included.
In operation S310, a crawler profile is acquired in response to the data request instruction.
According to an embodiment of the present disclosure, a crawler configuration file includes a crawler task table, a crawler configuration table, and a crawler field table. The crawler task table stores basic information of each crawler task, the crawler configuration table stores crawler configuration corresponding to the crawler tasks, the crawler field table stores field information of crawlers, and the field information comprises extraction rules and preset theme samples.
In one example, considering the need for configurability and persistence, the basic data of the crawler robot is stored in a MySQL database, and the core data tables are the crawler task table spider_transmission, the crawler configuration table spider_config, and the crawler field table spider_field. The crawler task table stores basic information of each crawler task, such as the task state, entry address, number of pages to crawl, cookie string, header string, and associated crawler configuration. The crawler configuration table stores the crawler configuration corresponding to a crawler task, that is, basic crawler information such as the regular expression of target web pages, the sleep time, the number of retries, and whether to use a proxy IP. The crawler field table stores field information of the crawler, such as extraction rules (XPath/CSS) and keywords (used for theme determination as a temporary preset theme sample). Because the crawler is configurable and the crawled themes are diverse, the crawled data is stored in MongoDB. MongoDB supports dynamic fields, which makes it convenient to store crawled data in different formats.
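The relationship between the three tables can be sketched with plain Python data classes. The class and attribute names below are hypothetical and chosen only to mirror the kinds of information listed above; the disclosure does not specify column names, and the database access itself is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class SpiderConfig:
    """One row of the crawler configuration table (hypothetical columns)."""
    config_id: int
    target_url_regex: str
    sleep_seconds: float = 1.0
    retry_times: int = 3
    use_proxy: bool = False


@dataclass
class SpiderField:
    """One row of the crawler field table (hypothetical columns)."""
    name: str
    rule_type: str  # "regex", "xpath" or "css"
    rule: str
    keywords: list = field(default_factory=list)


@dataclass
class SpiderTask:
    """One row of the crawler task table plus its associated rows."""
    task_id: int
    status: str
    entry_url: str
    max_pages: int
    config: SpiderConfig
    fields: list = field(default_factory=list)
    cookies: str = ""
    headers: str = ""

    def theme_sample(self):
        """Collect the keywords of all fields, in order, without duplicates."""
        seen = []
        for spider_field in self.fields:
            for keyword in spider_field.keywords:
                if keyword not in seen:
                    seen.append(keyword)
        return seen
```

Keeping the keywords on the field rows, as in this sketch, lets one task carry both its extraction rules and its temporary preset theme sample.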
In operation S320, the address information in the address queue to be crawled is consumed to obtain a web page resource document, where the address information is obtained according to the crawler configuration file.
In operation S330, target data in the web resource document is determined according to the extraction rule of the crawler configuration file.
According to an embodiment of the present disclosure, the extraction rule includes regular expressions, XPath, and CSS, and determining the target data in the web page resource document according to the extraction rule of the crawler configuration file includes: extracting, from the web page resource document, the information that conforms to the extraction rule.
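Rule-based extraction can be sketched as follows, using only the Python standard library. The function name `extract` and the sample page are hypothetical. The sketch assumes a well-formed document, because `xml.etree.ElementTree` supports only a limited XPath subset and cannot parse malformed HTML; CSS selectors need a third-party library and are deliberately left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
PAGE = (
    "<html><body>"
    "<h1>Loan rates</h1>"
    "<p class='price'>3.5%</p>"
    "<p>Contact: 010-12345678</p>"
    "</body></html>"
)


def extract(document, rule_type, rule):
    """Return every string in `document` that matches one extraction rule."""
    if rule_type == "regex":
        # With no capture groups, findall returns the full matches.
        return re.findall(rule, document)
    if rule_type == "xpath":
        root = ET.fromstring(document)
        return [(node.text or "").strip() for node in root.findall(rule)]
    raise ValueError(f"unsupported rule type: {rule_type}")
```

For example, `extract(PAGE, "xpath", ".//p[@class='price']")` selects the price paragraph, and `extract(PAGE, "regex", r"\d{3}-\d{8}")` picks out the telephone number.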
In operation S340, the web resource document is parsed to determine new address information.
In operation S350, the new address information is added to the address queue to be crawled.
In operation S360, theme determination is performed on the target data according to a preset theme, where the preset theme is preset in the crawler configuration file by the user.
In operation S370, the target data that passes the theme determination is subjected to persistence processing.
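Operations S320 to S370 can be tied together in a short Python sketch. It assumes that the crawler configuration file has already been acquired (S310) and is represented by the callables passed in; the function `crawl`, its parameters, and the in-memory `SITE` used in place of real HTTP requests are all hypothetical.

```python
from collections import deque


def crawl(entry_url, fetch, extract_target, find_links, is_on_theme, store,
          max_pages=100):
    """Breadth-first crawl loop; returns the list used as the store."""
    queue = deque([entry_url])  # address queue to be crawled
    seen = {entry_url}
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()  # S320: consume one address
        document = fetch(url)
        pages += 1
        if document is None:
            continue
        target = extract_target(document)  # S330: apply the extraction rule
        for link in find_links(document):  # S340: determine new addresses
            if link not in seen:  # S350: enqueue addresses not yet seen
                seen.add(link)
                queue.append(link)
        if target and is_on_theme(target):  # S360: theme determination
            store.append({"url": url, "data": target})  # S370: persist
    return store


# Hypothetical in-memory site: url -> (page text, outgoing links).
SITE = {
    "/": ("bank loan news", ["/loans", "/sports"]),
    "/loans": ("loan interest rates", ["/"]),
    "/sports": ("football results", []),
}

results = crawl(
    "/",
    fetch=SITE.get,
    extract_target=lambda doc: doc[0],
    find_links=lambda doc: doc[1],
    is_on_theme=lambda text: "loan" in text,
    store=[],
)
```

Here a plain list stands in for the persistence layer, and the `seen` set is an added safeguard so that pages linking back to each other are fetched only once.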
Based on the data acquisition method, the disclosure further provides a data acquisition device. The device will be described in detail below in connection with fig. 6.
Fig. 6 schematically shows a block diagram of a data acquisition device according to an embodiment of the present disclosure. As shown in fig. 6, the data acquisition device 600 of this embodiment includes a first acquisition module 610, a second acquisition module 620, a determination module 630, a theme determination module 640, and a storage module 650.
The first acquisition module 610 is configured to acquire a crawler configuration file in response to a data request instruction. In an embodiment, the first acquisition module 610 may be configured to perform the operation S210 described above, which is not described again here.
The second acquisition module 620 is configured to consume the address information in the address queue to be crawled to obtain a web page resource document, where the address information is obtained according to the crawler configuration file. In an embodiment, the second acquisition module 620 may be configured to perform the operation S220 described above, which is not described again here.
The determination module 630 is configured to determine target data in the web page resource document according to the extraction rule of the crawler configuration file. In an embodiment, the determination module 630 may be configured to perform the operation S230 described above, which is not described again here.
The theme determination module 640 is configured to perform theme determination on the target data according to a preset theme, where the preset theme is preset in the crawler configuration file by the user. In an embodiment, the theme determination module 640 may be configured to perform the operation S240 described above, which is not described again here.
The storage module 650 is configured to perform persistence processing on the target data that passes the theme determination. In an embodiment, the storage module 650 may be configured to perform the operation S250 described above, which is not described again here.
According to an embodiment of the present disclosure, the device further includes a parsing module and a queue updating module.
The parsing module is configured to parse the web page resource document to determine new address information. In an embodiment, the parsing module may be configured to perform the operation S340 described above, which is not described again here.
The queue updating module is configured to add the new address information to the address queue to be crawled. In an embodiment, the queue updating module may be configured to perform the operation S350 described above, which is not described again here.
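A minimal Python sketch of these two modules follows, using the standard library only. The class `LinkParser` and the functions `parse_new_addresses` and `update_queue` are hypothetical names, and treating every `<a href>` as a candidate address, with fragments stripped, is an assumption rather than a rule stated in the disclosure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect absolute link targets from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop the "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


def parse_new_addresses(document, base_url):
    """Parsing module: determine new address information."""
    parser = LinkParser(base_url)
    parser.feed(document)
    return parser.links


def update_queue(queue, seen, new_addresses):
    """Queue updating module: enqueue unseen addresses, return how many."""
    added = 0
    for address in new_addresses:
        if address not in seen:
            seen.add(address)
            queue.append(address)
            added += 1
    return added
```

The `seen` set is an added convenience that prevents the same address from entering the queue twice; whether and how the disclosed device deduplicates addresses is not stated.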
According to an embodiment of the present disclosure, the theme determination module includes a text preprocessing sub-module, a correlation calculation sub-module, a first determination sub-module, and a second determination sub-module.
A text preprocessing sub-module, configured to perform text preprocessing on the target data; in an embodiment, the text preprocessing sub-module may be used to perform the operation S241 described above, which is not described herein.
The correlation calculation sub-module is used for calculating the correlation between the preprocessed target data and a theme sample of a preset theme; in an embodiment, the correlation calculation sub-module may be used to perform the operation S242 described above, which is not described herein.
The first determining submodule is used for determining that the target data is related to the preset theme if the correlation degree is greater than or equal to a preset threshold value; in an embodiment, the first determining sub-module may be used to perform the operation S243 described above, which is not described herein.
And the second determining submodule is used for determining that the target data is irrelevant to the preset theme if the correlation degree is smaller than a preset threshold value. In an embodiment, the second determining sub-module may be used to perform the operation S244 described above, which is not described herein.
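The text preprocessing step performed before the correlation calculation can be sketched in Python as follows. The regular-expression tag stripping, the small English stop-word list, and the lower-case alphanumeric tokenizer are simplifying assumptions; Chinese text would need a dedicated word segmenter, which is not shown. The name `preprocess` is hypothetical.

```python
import re
from html import unescape

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}
)


def preprocess(html_text, stop_words=STOP_WORDS):
    """Strip markup, tokenize, and remove stop words."""
    # Drop script and style bodies first, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1>", " ", html_text)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; before tokenizing.
    text = unescape(text)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]
```

The resulting token list is the kind of input that a vectorization and TF-IDF step would consume when the correlation with the theme sample is calculated.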
According to an embodiment of the disclosure, the determining module is specifically configured to extract information in the web resource document that accords with the extraction rule.
According to an embodiment of the present disclosure, any number of the first acquisition module 610, the second acquisition module 620, the determination module 630, the theme determination module 640, and the storage module 650 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of these modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the first acquisition module 610, the second acquisition module 620, the determination module 630, the theme determination module 640, and the storage module 650 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of, or a suitable combination of, the three implementations of software, hardware, and firmware. Alternatively, at least one of the first acquisition module 610, the second acquisition module 620, the determination module 630, the theme determination module 640, and the storage module 650 may be at least partially implemented as a computer program module that, when executed, performs the corresponding functions.
Fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement a data acquisition method according to an embodiment of the disclosure.
As shown in fig. 7, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. The processor 901 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 901 may also include on-board memory for caching purposes. The processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. The processor 901 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the program may be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, which is also connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output portion 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk or the like; and a communication portion 909 including a network interface card such as a LAN card, a modem, or the like. The communication portion 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read therefrom is installed into the storage portion 908 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium described above carries one or more programs, which when executed, implement a data acquisition method according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 902 and/or RAM 903 and/or one or more memories other than ROM 902 and RAM 903 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program that contains program code for performing the methods shown in the flowcharts. When the computer program product is run on a computer system, the program code causes the computer system to carry out the data acquisition method provided by the embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed, and downloaded and installed in the form of a signal on a network medium, via communication portion 909, and/or installed from removable medium 911. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, C, and similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or incorporated in a variety of ways, even if such combinations or incorporations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or incorporated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or incorporations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A method of data acquisition, the method comprising:
responding to a data request instruction, and acquiring a crawler configuration file;
consuming address information in an address queue to be crawled to obtain a webpage resource document, wherein the address information is obtained by analyzing according to the crawler configuration file;
determining target data in the webpage resource document according to the extraction rule of the crawler configuration file;
performing theme judgment on the target data according to a preset theme, wherein the preset theme is preset in the crawler configuration file by a user; and
carrying out persistence processing on the target data conforming to the theme judgment.
2. The method of claim 1, wherein after obtaining the web page resource document, the method further comprises:
analyzing the webpage resource document to determine new address information; and
adding the new address information into the address queue to be crawled.
3. The method according to claim 1, wherein the performing theme determination on the target data according to a preset theme includes:
performing text preprocessing on the target data;
calculating the correlation between the preprocessed target data and a theme sample of a preset theme;
if the correlation degree is larger than or equal to a preset threshold value, determining that the target data is related to the preset theme; and
if the correlation degree is smaller than the preset threshold value, determining that the target data is irrelevant to the preset theme.
4. The method of claim 1, wherein the extraction rule includes regular expressions, XPath, and CSS, and wherein determining the target data in the webpage resource document according to the extraction rule of the crawler configuration file includes:
extracting the information conforming to the extraction rule from the webpage resource document.
5. The method of any of claims 1-4, wherein the crawler configuration file includes a crawler task table, a crawler configuration table, and a crawler field table.
6. The method of claim 5, wherein the crawler task table stores basic information of each crawler task, the crawler configuration table stores crawler configurations corresponding to the crawler tasks, the crawler field table stores field information of crawlers, and the field information comprises extraction rules and preset theme samples.
7. A data acquisition device, the device comprising:
the first acquisition module is used for responding to the data request instruction and acquiring a crawler configuration file;
the second acquisition module is used for consuming the address information in the address queue to be crawled to acquire a webpage resource document, and the address information is obtained by analyzing according to the crawler configuration file;
the determining module is used for determining target data in the webpage resource document according to the extraction rule of the crawler configuration file;
the theme judging module is used for judging the theme of the target data according to a preset theme, wherein the preset theme is preset in the crawler configuration file by a user; and
the storage module is used for carrying out persistence processing on the target data conforming to the theme judgment.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202311609483.0A 2023-11-29 2023-11-29 Data acquisition method, apparatus, device, storage medium, and program product Pending CN117435795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311609483.0A CN117435795A (en) 2023-11-29 2023-11-29 Data acquisition method, apparatus, device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311609483.0A CN117435795A (en) 2023-11-29 2023-11-29 Data acquisition method, apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN117435795A true CN117435795A (en) 2024-01-23

Family

ID=89548149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311609483.0A Pending CN117435795A (en) 2023-11-29 2023-11-29 Data acquisition method, apparatus, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN117435795A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination