CN116150513A - Data processing method, device, electronic equipment and computer readable storage medium - Google Patents

Data processing method, device, electronic equipment and computer readable storage medium

Info

Publication number
CN116150513A
Authority
CN
China
Prior art keywords
target
task
grabbing
url
analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211666824.3A
Other languages
Chinese (zh)
Inventor
吴少云
唐振君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Taoyoutianxia Technology Co ltd
Original Assignee
Beijing Taoyoutianxia Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Taoyoutianxia Technology Co ltd filed Critical Beijing Taoyoutianxia Technology Co ltd
Priority to CN202211666824.3A priority Critical patent/CN116150513A/en
Publication of CN116150513A publication Critical patent/CN116150513A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/427Parsing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the application provide a data processing method, a data processing device, electronic equipment and a computer readable storage medium, and relate to the technical field of data processing. The method is applied to a grabbing service and comprises: obtaining a target grabbing task from a preset grabbing task queue, wherein the target grabbing task comprises a target uniform resource locator (url) of a webpage to be grabbed and a target site identifier of the site corresponding to the target url; grabbing target page data at the target url according to the target grabbing task; and creating an analysis task according to the target page data and placing the analysis task into a target analysis task queue corresponding to the target site identifier, so that the analysis service corresponding to the target site identifier executes the target analysis task in the corresponding target analysis task queue and obtains an analysis result. With the embodiments of the application, only the code that analyzes the target page data needs to be written separately for each site, which reduces the amount of duplicated code, lowers the occupancy of system resources, improves generality and flexibility, and saves software development time.

Description

Data processing method, device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a computer readable storage medium.
Background
With the continuous development of internet technology, software development often requires building multiple sites, and building a site usually relies on specific data from multiple web pages related to the site's application scenario. The current approach first establishes a url set that stores the urls of web pages related to the site's application scenario, and typically configures an independent grabbing service and analysis service for each site: each grabbing service grabs urls from the url set of its site, and the analysis service obtains the data the site requires according to those urls. Although this ensures that each site obtains the data it needs, every grabbing program and analysis service is written separately for its application scenario. As a result, the system contains a large amount of duplicated code, system resources are over-occupied, generality and flexibility are poor, and software development time increases.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, electronic equipment and a computer readable storage medium, which can solve the problems that a large number of repeated codes exist in a system, system resources are excessively occupied, code generality and flexibility are poor, and the time for developing software is influenced. The technical scheme is as follows:
according to an aspect of the embodiments of the present application, there is provided a data processing method applied to a crawling service, the method including:
obtaining a target grabbing task from a preset grabbing task queue, wherein the target grabbing task comprises a target uniform resource locator url of a webpage to be grabbed and a target site identifier of a site corresponding to the target url;
capturing target page data in the target url according to a target capturing task;
and creating an analysis task according to the target page data, putting the analysis task into a target analysis task queue corresponding to the target site identifier, so that analysis service corresponding to the target site identifier executes the target analysis task in the corresponding target analysis task queue, and obtaining an analysis result.
As an optional embodiment of the data processing method, the obtaining the target grabbing task from the preset grabbing task queue includes:
determining at least one site, and for each site, obtaining a site identification of the site and url of at least one webpage to be crawled in the site;
based on each url and the site identifier corresponding to the url, a corresponding grabbing task is created and sequentially put into the task queue.
As an optional embodiment of the data processing method, the capturing, according to a target capturing task, target page data in the target url includes:
obtaining a target proxy ip from a proxy ip pool;
accessing a webpage to be grabbed corresponding to the target url by using the target proxy ip;
and obtaining the target page data of the webpage to be grabbed.
As an optional embodiment, accessing the web page to be crawled corresponding to the target url by using the target proxy ip further includes:
initializing a browser and replacing a local ip of the browser by using the target proxy ip;
the accessing the web page to be crawled corresponding to the target url by using the target proxy ip includes:
and after initializing the browser, waiting for a preset sleep period, and sending a target access instruction to the browser so that the browser accesses the webpage to be grabbed according to the target access instruction, wherein the target access instruction is used for indicating the browser to access the webpage to be grabbed corresponding to the target url.
As an optional embodiment of a data processing method, the obtaining target page data of the web page to be crawled includes:
capturing page elements according to typesetting sequence of each page element in the webpage to be captured;
and when the grabbing of the preset target page element is determined, determining that the grabbing is completed, and taking the page element grabbed from the webpage to be grabbed as the target page data.
As an optional embodiment of the data processing method, the capturing page elements according to the typesetting sequence of each page element in the web page to be captured further includes:
if the target page elements are determined not to be captured after the page elements are captured according to the typesetting sequence, capturing the page elements according to the typesetting sequence again after the preset time length until the target page elements are determined to be captured.
As an optional embodiment of the data processing method, the capturing the target page data in the target url according to the target capturing task further includes:
and ending the target grabbing process that executes the target grabbing task when the target url is determined to be a dead link.
According to another aspect of an embodiment of the present application, there is provided a data processing apparatus, the apparatus including:
the acquisition module is used for acquiring a target grabbing task from a preset grabbing task queue, wherein the target grabbing task comprises a target uniform resource locator url of a webpage to be grabbed and a target site identifier of a site corresponding to the target url;
the grabbing module is used for grabbing target page data in the target url according to a target grabbing task;
the creation module is used for creating an analysis task according to the target page data and placing the analysis task into a target analysis task queue corresponding to the target site identifier, so that the analysis service corresponding to the target site identifier executes the target analysis task in the corresponding target analysis task queue to obtain an analysis result.
According to another aspect of the embodiments of the present application, there is provided an electronic device including a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the steps of the data processing method described above.
According to a further aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data processing method described above.
The technical solutions provided by the embodiments of the present application bring the following beneficial effects: because the target grabbing task comprises a target url and a corresponding target site identifier, the target analysis task can be placed directly into the target analysis task queue corresponding to the target site identifier after the target page data is grabbed, so that the analysis service can analyze the target page data in a targeted manner and obtain an analysis result. The process of grabbing the target url and obtaining the target page data is independent of the application scenario, so no scenario-specific grabbing program needs to be written; only the code that analyzes the target page data needs to be written separately. This reduces the amount of duplicated code, lowers the occupancy of system resources, improves generality and flexibility, and saves software development time.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of a system architecture for implementing a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic attribute diagram of a grabbing task according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device for implementing a data processing method according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present application. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates that at least one of the items defined by the term, e.g., "a and/or B" may be implemented as "a", or as "B", or as "a and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The data processing method, apparatus, electronic device, computer readable storage medium and computer program product provided in the present application aim to solve the above technical problems in the prior art.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
FIG. 1 is a schematic diagram of a system architecture for implementing a data processing method according to an embodiment of the present application, in which different developers obtain, through their respective terminals, the analysis data each of them requires from the server, so as to develop their respective sites. For example, one developer obtains an analysis result aa from the storage service of the server 13 through the terminal 11 and develops site A according to the analysis result aa, while another developer obtains an analysis result bb from the storage service of the server 13 through the terminal 12 and develops site B according to the analysis result bb. The analysis result aa is obtained by analysis service A in the server 13 executing the analysis tasks in analysis task queue 1, and the analysis result bb is obtained by analysis service B in the server 13 executing the analysis tasks in analysis task queue 2. Whether an analysis task belongs to analysis task queue 1 or analysis task queue 2 is decided by an exchange in the server; that is, the exchange routes each analysis task, which the grabbing service produces by executing a grabbing task from the grabbing task queue, to the corresponding analysis task queue.
The server 13 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server or server cluster that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms. The terminals 11 and 12 may be tablet computers, notebook computers, desktop computers, and the like, but are not limited thereto.
Only two terminals are shown in fig. 1, but the specific number of terminals present in different embodiments is not limiting and one or more terminals may access the server 13. The terminals 11 and 12 may be connected to the server 13 through a wireless network or a wired network.
The embodiment of the application provides a data processing method applied to a grabbing service, as shown in fig. 2, the method includes:
S100, obtaining a target grabbing task from a preset grabbing task queue, wherein the target grabbing task comprises a target uniform resource locator url of a webpage to be grabbed and a target site identifier of a site corresponding to the target url.
The grabbing task queue is a queue that stores at least one pending grabbing task. It is a first-in, first-out (FIFO) linear list and, in practice, is usually implemented with a linked list or an array. Grabbing tasks can generally be inserted only at the rear of the queue, which makes it convenient for the grabbing processes to execute them in order.
New grabbing tasks are continuously added at one end of the grabbing task queue, which increases the number of pending tasks, while grabbing processes continuously take tasks from the other end, which decreases it. Processing grabbing tasks through such an asynchronous queue makes it convenient for multiple grabbing processes to execute their respective grabbing tasks in parallel, decouples the framework from the scenario-specific services, and improves the performance and scalability of the grabbing service. The number of grabbing processes may be adjusted according to the number of grabbing tasks, the rate at which new grabbing tasks arrive, and the speed at which a grabbing process executes one grabbing task, which is not limited in this embodiment.
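The parallel consumption of a FIFO grabbing task queue described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it uses threads and Python's in-process `queue.Queue` as stand-ins for the grabbing processes and the message queue, and the names `run_workers`, `handle`, and `worker_count` are hypothetical.

```python
import queue
import threading


def run_workers(tasks, handle, worker_count=3):
    """Drain a FIFO queue of grabbing tasks with several workers in parallel."""
    crawl_queue = queue.Queue()              # FIFO: tasks are appended at the tail
    for task in tasks:
        crawl_queue.put(task)

    results = []
    lock = threading.Lock()                  # protects the shared results list

    def worker():
        while True:
            try:
                task = crawl_queue.get_nowait()   # take the oldest pending task
            except queue.Empty:
                return                            # queue drained: worker exits
            outcome = handle(task)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Changing `worker_count` corresponds to adjusting the number of grabbing processes to the task backlog; the order of `results` depends on scheduling and is not guaranteed.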
It will be understood that a site refers to the collection of all resources of a website, i.e., the folder that holds all pages of the website, and a url is the specific web address of a page. For example, if the application scenario of the site to be developed is a website that provides real-time news to clients and the site identifier is A, the web pages to be grabbed include, but are not limited to, Sina, Baidu News, and Today's Hot Spot, where the url of Sina is http://www.sina.com.cn, the url of Baidu News is https://news.bailu.com/society, and the url of Today's Hot Spot is http://www.todayhot.cn/. If the application scenario of the site to be developed is to provide videos to clients and the site identifier is B, the web pages to be grabbed include, but are not limited to, Tencent Video, Bilibili, and iQIYI, where the url of Tencent Video is https://v.qq.com, the url of Bilibili is https://www.bilibili.com/, and the url of iQIYI is https://www.iqiyi.com/. In that case, if the target url is https://www.bilibili.com/, the target site identifier is B.
S101, capturing target page data in the target url according to the target capturing task.
It will be appreciated that a web page to be grabbed contains data such as text, images, animations, videos, music, hyperlinks, forms, and various controls. Illustratively, the application scenario of site 1 is to provide real-time news to clients, which requires data such as text, images, and video, while the application scenario of site 2 is to provide videos to clients, which requires data such as text, animation, and video. The target page data therefore includes, but is not limited to, text, images, animation, and video, so that the data each site needs is grabbed regardless of which target site the web page to be grabbed at the target url corresponds to.
S102, creating an analysis task according to the target page data, placing the analysis task into a target analysis task queue corresponding to the target site identification, so that analysis service corresponding to the target site identification executes the target analysis task in the corresponding target analysis task queue, and obtaining an analysis result.
Because different sites have different application scenarios, the analysis results they require differ, and so does the code written for their analysis services. The analysis task queues therefore correspond one-to-one with the site identifiers, and each analysis task is placed into the queue of its site. This guarantees the correspondence among analysis service, site identifier, and analysis task queue, so that each site obtains analysis results that fit its application scenario during development. The analysis result consists of the required fields extracted from the target page data.
It can be understood that in this embodiment, since the target crawling task includes the target url and the corresponding target site identifier, the target parsing task can be directly put into the target parsing task queue corresponding to the target site identifier after crawling the target page data, so that the parsing service can perform targeted parsing on the target page data, thereby obtaining the parsing result. The process of capturing the target url and obtaining the target page data is irrelevant to the scene, so that related programs do not need to be independently written, and only related codes for realizing analysis of the target page data need to be independently written, thereby reducing the number of repeated codes, reducing the occupancy rate of system resources, improving the universality and the flexibility and saving the time for developing software.
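Steps S100 to S102 can be sketched end to end as follows. The sketch assumes in-memory queues and an injected `fetch` function in place of a real message broker and browser; `CrawlTask`, `ParseTask`, `crawl_once`, and `run_parsers` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class CrawlTask:
    url: str    # target url of the page to grab
    site: str   # target site identifier


@dataclass
class ParseTask:
    site: str
    url: str
    page_data: str


def crawl_once(crawl_queue, parse_queues, fetch):
    """Run S100-S102 for one task; return the ParseTask, or None if the queue is empty."""
    if not crawl_queue:
        return None
    task = crawl_queue.popleft()                 # S100: take the oldest grabbing task
    page_data = fetch(task.url)                  # S101: scenario-independent grabbing
    parse_task = ParseTask(task.site, task.url, page_data)
    parse_queues[task.site].append(parse_task)   # S102: route by site identifier
    return parse_task


def run_parsers(parse_queues, parsers):
    """Each site's analysis function drains only that site's queue."""
    results = defaultdict(list)
    for site, pending in parse_queues.items():
        while pending:
            results[site].append(parsers[site](pending.popleft()))
    return dict(results)
```

Only the per-site functions passed in `parsers` are scenario-specific; `crawl_once` is shared by every site, which is the source of the reduction in duplicated code.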
On the basis of the foregoing embodiments, as an optional embodiment, the obtaining the target grabbing task from the preset grabbing task queue includes:
determining at least one site, and for each site, obtaining a site identification of the site and url of at least one webpage to be grabbed in the site;
based on each url and the site identifier corresponding to the url, a corresponding grabbing task is created and sequentially put into a task queue.
Further, FIG. 3 exemplarily shows the attributes of a grabbing task when the grabbing service is implemented with RabbitMQ. exchange denotes the exchange and is of type string; the analysis task that a grabbing process creates after executing the current grabbing task must be placed into the corresponding analysis task queue through the exchange, which means that the grabbing task queue and the analysis task queues are each bound to the exchange. url denotes the web address of the page to be grabbed by the current grabbing task and is of type string. routing_key, also called the routing rule, instructs the exchange to send the created analysis task to the corresponding analysis task queue and is of type string. site denotes the site identifier corresponding to the grabbing task and is of type string. For example, if there are three sites whose site identifiers are 1, 2, and 3, the corresponding routing keys are, in order, routing_key=site 1, routing_key=site 2, and routing_key=site 3.
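The four task attributes and the routing behaviour can be illustrated with a small in-memory model. This is a sketch under assumptions, not RabbitMQ client code: `DirectExchange` merely imitates exact-match routing on the routing key, the exchange name `parse_exchange` is invented, and the routing keys are written without a space (`site1`, `site2`, `site3`).

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CrawlTaskMessage:
    exchange: str      # exchange that will route the resulting analysis task
    url: str           # web address of the page to grab
    routing_key: str   # rule telling the exchange which analysis queue to use
    site: str          # site identifier


def make_task(url, site, exchange="parse_exchange"):
    # "parse_exchange" is an assumed name; the key follows the site1/site2/site3 pattern
    return CrawlTaskMessage(exchange, url, f"site{site}", str(site))


class DirectExchange:
    """In-memory stand-in for an exchange that matches routing keys exactly."""

    def __init__(self):
        self.bindings = {}   # routing_key -> list acting as an analysis task queue

    def bind(self, routing_key):
        return self.bindings.setdefault(routing_key, [])

    def publish(self, routing_key, body):
        if routing_key not in self.bindings:
            return False     # no queue is bound to this key, so nothing is delivered
        self.bindings[routing_key].append(body)
        return True
```

A task built with `make_task("https://www.bilibili.com/", 2)` carries the key `site2`, so publishing it delivers the message only to the queue bound for site 2. All four fields are strings, so the message serializes directly with `json.dumps(asdict(task))`.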
On the basis of the foregoing embodiments, as an optional embodiment, capturing target page data in the target url according to the target capturing task includes:
obtaining a target proxy ip from a proxy ip pool;
accessing a webpage to be grabbed corresponding to a target url by using a target proxy ip;
and obtaining target page data of the webpage to be grabbed.
It can be understood that the proxy ip pool stores a plurality of proxy ips, and the privacy and security of the server configured with the crawling service in the access process and the data transmission process can be ensured by acquiring the target proxy ip before accessing the web page to be crawled each time.
Specifically, a proxy ip is the address of a proxy server. The server configured with the grabbing service connects to the proxy server through the proxy ip and accesses the web page to be grabbed through the proxy server, which returns the content of the web page to the server configured with the grabbing service; this improves the security of accessing the web page to be grabbed. Proxy ips may be purchased or collected from websites that list free proxies. It will be appreciated that the target proxy ip may be taken from the proxy ip pool either at random or in a defined order, which is not particularly limited in this embodiment.
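Both selection strategies mentioned above, random and ordered, can be sketched with a small pool class. The class name `ProxyPool`, its methods, and the sample addresses (taken from the documentation-only range 192.0.2.0/24) are illustrative assumptions.

```python
import itertools
import random


class ProxyPool:
    """Holds proxy addresses and hands one out per page access."""

    def __init__(self, proxies, seed=None):
        if not proxies:
            raise ValueError("proxy pool is empty")
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)   # ordered, wraps around
        self._rng = random.Random(seed)                # seed only for repeatable runs

    def next_in_order(self):
        """Return proxies in a fixed rotating order."""
        return next(self._cycle)

    def pick_random(self):
        """Return a proxy chosen uniformly at random."""
        return self._rng.choice(self._proxies)
```

For instance, a pool built from `["192.0.2.10:8080", "192.0.2.11:8080"]` alternates between the two addresses on successive `next_in_order()` calls.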
Based on the above embodiments, as an optional embodiment, the accessing the web page to be crawled corresponding to the target url by using the target proxy ip further includes:
initializing a browser, and replacing a local ip of the browser by using a target proxy ip;
accessing a webpage to be crawled corresponding to a target url by using a target proxy ip, wherein the method comprises the following steps:
after initializing the browser, waiting for a preset sleep period, and sending a target access instruction to the browser so that the browser accesses the webpage to be grabbed according to the target access instruction, wherein the target access instruction is used for indicating the browser to access the webpage to be grabbed corresponding to the target url.
It can be understood that the browser is initialized before the web page to be grabbed is accessed, that is, all browser settings are restored to their defaults. This prevents abnormal access to the web page caused by a developer having changed the browser settings by mistake, and thereby improves grabbing efficiency. Illustratively, the grabbing service employs the Selenium framework, and the access and grabbing steps are carried out through a browser driven by Selenium.
It should be explained that when a browser is used to access the web page to be grabbed, the website hosting that page receives the local ip of the server on which the browser runs. If the same web page is accessed frequently, that ip is easily blocked and the web page can no longer be browsed normally. Replacing the local ip with the proxy ip reduces the likelihood of such access restrictions and helps keep access to the web page smooth. In addition, initializing the browser makes it easier to replace the local ip with the proxy ip.
Because multiple grabbing processes execute their target grabbing tasks in parallel, a sleep period is preset to lower the overall grabbing frequency and avoid putting excessive load on the website hosting the web page to be grabbed at any one time. It will be appreciated that the specific duration of the sleep period may be adjusted according to the actual situation, which is not particularly limited in this embodiment.
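The order of operations (initialize the browser behind the proxy, wait for the preset sleep period, then issue the access instruction) can be sketched as below. The function `visit_with_proxy` and its `make_browser` parameter are hypothetical; with Selenium, such a factory might, for example, create `webdriver.ChromeOptions()`, call `add_argument("--proxy-server=http://<ip>:<port>")` on it, and pass it to `webdriver.Chrome(options=...)`, but the text does not specify these details.

```python
import time


def visit_with_proxy(url, proxy, make_browser, sleep_seconds=2.0, sleep=time.sleep):
    """Open a fresh browser behind `proxy`, pause, then load `url`."""
    browser = make_browser(proxy)   # freshly initialized browser routed through the proxy
    sleep(sleep_seconds)            # preset sleep period that throttles parallel workers
    browser.get(url)                # target access instruction
    return browser
```

Passing `sleep` as a parameter keeps the pause configurable and lets the ordering be checked without real waiting.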
On the basis of the above embodiments, as an optional embodiment, obtaining target page data of a web page to be crawled includes:
grabbing page elements according to typesetting sequence of each page element in the webpage to be grabbed;
when the grabbing of the preset target page element is determined, the grabbing is determined to be completed, and the page element grabbed from the webpage to be grabbed is used as target page data.
It should be explained that page elements include, but are not limited to, text, pictures, audio, animation, and video. The typesetting sequence is the order in which the page elements are presented, which is also the order in which a user would browse the web page to be grabbed; grabbing page elements in this order ensures that none are missed. In addition, grabbing is judged complete once the preset target page element has been grabbed, which improves grabbing efficiency while ensuring that the target page data is not missed.
For example, if the application scenario of the site is a website that provides real-time news to clients and the web page to be grabbed is Sina, the preset target page element is the "news headline"; that is, once the "news headline" page element has been grabbed, grabbing is complete, and the site can be developed according to the analysis result obtained by analyzing the target page data.
Based on the above embodiments, as an optional embodiment, the capturing of the page elements according to the typesetting sequence of each page element in the web page to be captured further includes:
if the target page elements are determined not to be captured after the page elements are captured according to the typesetting sequence, capturing the page elements according to the typesetting sequence again after the preset time length until the target page elements are determined to be captured.
It can be understood that some page elements appear only after the page has fully loaded, and some appear only after interaction with the user accessing the web page to be grabbed. Grabbing the page elements again after the preset time length therefore helps ensure the completeness of the target page data and the accuracy of grabbing. The preset time length can be adjusted according to the actual situation and is not specifically limited in this embodiment.
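The grab-in-order-and-retry behaviour can be sketched as a polling loop. Two points here are assumptions rather than statements from the text: grabbing stops as soon as the target element is seen, and the number of retries is capped by `max_rounds` (the text simply repeats until the target is grabbed). The names `grab_until_target`, `read_elements`, and `is_target` are hypothetical.

```python
import time


def grab_until_target(read_elements, is_target, wait_seconds=1.0, max_rounds=5, sleep=time.sleep):
    """Grab elements in layout order; retry after a pause until the target shows up."""
    for round_no in range(max_rounds):
        grabbed = []
        for element in read_elements():      # elements arrive in typesetting order
            grabbed.append(element)
            if is_target(element):
                return grabbed               # target grabbed: grabbing is complete
        if round_no < max_rounds - 1:
            sleep(wait_seconds)              # page may still be loading; try again
    return None                              # assumed cap reached without the target
```

With a news page, `is_target` could test for the headline element; a return value of `None` signals that the page never produced it within the assumed retry budget.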
Based on the above embodiments, as an optional embodiment, before capturing the target page data in the target url according to the target capturing task, the method further includes:
when the target url is judged to be a dead link, ending the target grabbing process that executes the target grabbing task.
It should be noted that a dead link means that the address of the server hosting the webpage to be crawled has changed, so that the webpage can no longer be accessed through the url. The target crawling process is therefore ended directly, which prevents a process that cannot obtain the target page data from remaining occupied and reducing crawling efficiency.
Dead links include protocol dead links and content dead links. A protocol dead link means that the link itself has failed, namely the url is https://. 404, and the webpage to be grabbed cannot be accessed. A content dead link means that the webpage to be crawled can still be accessed, but its content no longer exists or has been deleted, that is, text such as "content deleted" appears in the webpage.
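The two kinds of dead link can be told apart with a small classification helper. The sketch below assumes that a protocol dead link shows up as an HTTP 404 status code and that a content dead link can be recognized by a marker phrase in the page text. Both are assumptions made for illustration, and `DeadLink`, `DELETED_MARKERS`, `classify_dead_link`, and `should_end_grab_process` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class DeadLink(Enum):
    PROTOCOL = "protocol dead link"   # the link itself fails
    CONTENT = "content dead link"     # the page loads, but its content is gone


# Assumed marker phrases; a real deployment would tune these per site.
DELETED_MARKERS = ("content deleted",)


def classify_dead_link(status_code: int, body_text: str) -> Optional[DeadLink]:
    """Return the kind of dead link, or None if the page looks alive."""
    if status_code == 404:
        return DeadLink.PROTOCOL
    lowered = body_text.lower()
    if any(marker in lowered for marker in DELETED_MARKERS):
        return DeadLink.CONTENT
    return None


def should_end_grab_process(status_code: int, body_text: str) -> bool:
    """A grab process is ended early when the target url is judged to be a dead link."""
    return classify_dead_link(status_code, body_text) is not None
```

For example, `classify_dead_link(200, "Sorry, content deleted")` yields `DeadLink.CONTENT`, and the corresponding grab process would be ended rather than left occupied.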
An embodiment of the present application provides a data processing apparatus. As shown in fig. 4, the data processing apparatus may include an acquisition module 601, a grabbing module 602, and a creation module 603. The acquisition module 601 is configured to acquire a target grabbing task from a preset grabbing task queue, where the target grabbing task includes a target uniform resource locator (url) of a webpage to be grabbed and a target site identifier of the site corresponding to the target url. The grabbing module 602 is configured to grab target page data in the target url according to the target grabbing task. The creation module 603 is configured to create an analysis task according to the target page data and place the analysis task into the target analysis task queue corresponding to the target site identifier, so that the analysis service corresponding to the target site identifier executes the target analysis task in that queue and obtains an analysis result.
It can be understood that, in this embodiment, because the target crawling task includes both the target url and the corresponding target site identifier, the crawling module 602 can send the crawling task to the creating module 603 after crawling the target page data. The creating module 603 then creates the corresponding target parsing task according to the target page data and places it directly in the target parsing task queue corresponding to the target site identifier, so that the parsing service can parse the target page data in a targeted manner and obtain the parsing result. Because crawling the target url and obtaining the target page data do not depend on the application scenario, no separate crawling program needs to be written for each scenario; only the code that parses the target page data needs to be written separately. This reduces duplicated code and system resource usage, improves generality and flexibility, and shortens software development time.
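The division of labor between the modules can be sketched with in-memory queues: a shared crawl step that is the same for every site, and one parse queue per site identifier. This is an illustrative sketch, not the apparatus itself. The names `GrabTask`, `ParseTask`, `fetch_page`, `crawl_once`, `parse_once`, and `PARSERS`, as well as the site identifier "news-site", are hypothetical, and `fetch_page` is a stand-in that performs no network access.

```python
import queue
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class GrabTask:
    url: str
    site_id: str


@dataclass
class ParseTask:
    site_id: str
    page_data: str


grab_queue: "queue.Queue[GrabTask]" = queue.Queue()
# One parse queue per site identifier, created on first use.
parse_queues: Dict[str, "queue.Queue[ParseTask]"] = defaultdict(queue.Queue)


def fetch_page(url: str) -> str:
    """Stand-in for the scenario-independent crawl step (no real network access)."""
    return f"<html>page at {url}</html>"


def crawl_once() -> None:
    """Take one grab task, fetch the page, and route a parse task by site identifier."""
    task = grab_queue.get()
    page_data = fetch_page(task.url)
    parse_queues[task.site_id].put(ParseTask(task.site_id, page_data))


# Only the parsers are site-specific; each site identifier maps to its own parser.
PARSERS: Dict[str, Callable[[str], dict]] = {
    "news-site": lambda html: {"type": "news", "length": len(html)},
}


def parse_once(site_id: str) -> dict:
    """Run the site-specific parser on the next task in that site's queue."""
    task = parse_queues[site_id].get()
    return PARSERS[site_id](task.page_data)


grab_queue.put(GrabTask("https://example.com/a", "news-site"))
crawl_once()
analysis_result = parse_once("news-site")
```

Adding a new scenario in this sketch means registering one more entry in `PARSERS`; `crawl_once` stays unchanged, which mirrors the reduction in duplicated code described above.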
On the basis of the above embodiments, as an optional embodiment, the obtaining module is configured to obtain the target grabbing task from a preset grabbing task queue, and is further configured to:
determine at least one site and, for each site, obtain a site identification of the site and the url of at least one webpage to be grabbed in the site;
create a corresponding grabbing task based on each url and the site identifier corresponding to the url, and put the grabbing tasks into the task queue in sequence.
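Building the grabbing task queue from a set of sites can be sketched as follows. The mapping from site identifier to urls, the `GrabTask` tuple, and the `build_task_queue` function are hypothetical illustrations; the application does not prescribe a particular data structure.

```python
from collections import deque
from typing import Deque, Dict, List, NamedTuple


class GrabTask(NamedTuple):
    url: str
    site_id: str


def build_task_queue(sites: Dict[str, List[str]]) -> Deque[GrabTask]:
    """Create one grab task per (url, site identifier) pair, in order."""
    tasks: Deque[GrabTask] = deque()
    for site_id, urls in sites.items():
        for url in urls:
            tasks.append(GrabTask(url=url, site_id=site_id))
    return tasks


# Hypothetical sites and the pages to be grabbed in each.
sites = {
    "site-a": ["https://example.com/1", "https://example.com/2"],
    "site-b": ["https://example.org/x"],
}
task_queue = build_task_queue(sites)
first_task = task_queue.popleft()   # tasks come out in the order they were put in
```

Because every task carries its site identifier, whatever consumes the queue later can route the grabbed page data without looking the site up again.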
Based on the above embodiments, as an optional embodiment, the capturing module is configured to capture, according to a target capturing task, target page data in a target url, where the specific steps include:
obtaining a target proxy ip from a proxy ip pool;
accessing a webpage to be grabbed corresponding to a target url by using a target proxy ip;
and obtaining target page data of the webpage to be grabbed.
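The proxy steps above can be sketched with the Python standard library. The sketch assumes round-robin selection from the pool, which is only one possible policy, and uses addresses from a documentation-reserved range; `PROXY_POOL`, `proxy_cycle`, `build_opener_for`, and `fetch_via_proxy` are hypothetical names. No request is sent unless `fetch_via_proxy` is called.

```python
import itertools
import urllib.request
from typing import Iterator, List

# Hypothetical proxy pool; these addresses are placeholders, not working proxies.
PROXY_POOL: List[str] = ["203.0.113.10:8080", "203.0.113.11:8080"]


def proxy_cycle(pool: List[str]) -> Iterator[str]:
    """Hand out proxies from the pool in round-robin order (an assumed policy)."""
    return itertools.cycle(pool)


def build_opener_for(proxy: str) -> urllib.request.OpenerDirector:
    """Build a URL opener that routes http and https traffic through the given proxy."""
    handler = urllib.request.ProxyHandler({
        "http": f"http://{proxy}",
        "https": f"http://{proxy}",
    })
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url: str, proxy: str, timeout: float = 10.0) -> bytes:
    """Access the page at url through the proxy and return the raw page data."""
    opener = build_opener_for(proxy)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


proxies = proxy_cycle(PROXY_POOL)
target_proxy = next(proxies)   # the proxy used for the next grab task
```

Rotating through the pool spreads requests across several outgoing addresses; a pool could also pick proxies at random or by health score.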
Based on the above embodiments, as an optional embodiment, before accessing the web page to be crawled corresponding to the target url by using the target proxy ip, the crawling module is further configured for:
initializing a browser, and replacing the local ip of the browser with the target proxy ip;
accessing the web page to be crawled corresponding to the target url by using the target proxy ip includes:
after initializing the browser, waiting for a preset sleep period and then sending a target access instruction to the browser, so that the browser accesses the webpage to be grabbed according to the target access instruction, wherein the target access instruction instructs the browser to access the webpage to be grabbed corresponding to the target url.
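The ordering of these steps (initialize the browser with the proxy, wait, then send the access instruction) can be sketched with a stand-in browser object. `StubBrowser` is a hypothetical placeholder for a real automated browser and does not load any page; `init_browser` and `visit_with_delay` are likewise illustrative names.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StubBrowser:
    """Hypothetical stand-in for an automated browser; it only records what it is told."""
    proxy: Optional[str] = None
    visited: List[str] = field(default_factory=list)

    def get(self, url: str) -> None:
        # A real browser would load the page here, through the configured proxy.
        self.visited.append(url)


def init_browser(proxy: str) -> StubBrowser:
    """Initialize the browser so that it uses the target proxy ip instead of the local ip."""
    return StubBrowser(proxy=proxy)


def visit_with_delay(url: str,
                     proxy: str,
                     sleep_seconds: float,
                     sleep: Callable[[float], None] = time.sleep) -> StubBrowser:
    """Initialize the browser, wait the preset sleep period, then send the access instruction."""
    browser = init_browser(proxy)
    sleep(sleep_seconds)    # give the browser time to finish starting up
    browser.get(url)        # the target access instruction
    return browser


browser = visit_with_delay("https://example.com/news", "203.0.113.10:8080", sleep_seconds=0.0)
```

The sleep between initialization and the first access instruction is the part this sketch is meant to show; the right duration depends on the browser and the machine it runs on.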
On the basis of the foregoing embodiments, as an optional embodiment, the crawling module is configured to obtain target page data of a web page to be crawled, and specifically includes:
grabbing page elements according to typesetting sequence of each page element in the webpage to be grabbed;
when the grabbing of the preset target page element is determined, the grabbing is determined to be completed, and the page element grabbed from the webpage to be grabbed is used as target page data.
On the basis of the foregoing embodiments, as an optional embodiment, after crawling the page elements according to the typesetting sequence of each page element in the web page to be crawled, the crawling module is further configured for:
if it is determined that the target page element has not been captured after the page elements are captured according to the typesetting sequence, capturing the page elements according to the typesetting sequence again after a preset duration, until it is determined that the target page element has been captured.
On the basis of the foregoing embodiments, as an optional embodiment, before capturing the target page data in the target url according to the target capturing task, the capturing module is further configured for:
when the target url is judged to be a dead link, ending the target grabbing process that executes the target grabbing task.
The apparatus of the embodiments of the present application can perform the method provided by the embodiments of the present application, and its implementation principle is similar. The actions performed by each module of the apparatus correspond to the steps of the method in the embodiments of the present application. For a detailed functional description of each module, refer to the description of the corresponding method above, which is not repeated here.
An embodiment of the application provides an electronic device including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the data processing method. Compared with the related art, the following can be achieved: because the target grabbing task includes the target url and the corresponding target site identifier, the target analysis task can be placed directly into the target analysis task queue corresponding to the target site identifier after the target page data is grabbed, so that the analysis service can analyze the target page data in a targeted manner and obtain an analysis result. Because grabbing the target url and obtaining the target page data do not depend on the application scenario, no separate grabbing program needs to be written for each scenario; only the code that analyzes the target page data needs to be written separately. This reduces duplicated code and system resource usage, improves generality and flexibility, and shortens software development time.
In an alternative embodiment, an electronic device is provided. As shown in fig. 5, the electronic device 4000 includes a processor 4001 and a memory 4003, where the processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that transfers information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 5, but this does not mean that there is only one bus or one type of bus.
Memory 4003 may be, but is not limited to, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 is used to store a computer program for executing the embodiments of the present application, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, may implement the steps and corresponding content of the foregoing method embodiments.
The embodiments of the present application also provide a computer program product, which includes a computer program, where the computer program can implement the steps of the foregoing method embodiments and corresponding content when executed by a processor.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the present application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although the flowcharts of the embodiments of the present application indicate the respective operation steps by arrows, the order of implementation of these steps is not limited to the order indicated by the arrows. In some implementations of embodiments of the present application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages may be flexibly configured according to the requirement, which is not limited in the embodiment of the present application.
The foregoing is merely an optional implementation of some implementation scenarios of the application. It should be noted that, for those skilled in the art, other similar implementations based on the technical ideas of the application, adopted without departing from those ideas, also fall within the protection scope of the embodiments of the application.

Claims (10)

1. A data processing method for use in a crawling service, the method comprising:
obtaining a target grabbing task from a preset grabbing task queue, wherein the target grabbing task comprises a target uniform resource locator url of a webpage to be grabbed and a target site identifier of a site corresponding to the target url;
capturing target page data in the target url according to the target capturing task;
and creating an analysis task according to the target page data, putting the analysis task into a target analysis task queue corresponding to the target site identifier, so that analysis service corresponding to the target site identifier executes the target analysis task in the corresponding target analysis task queue, and obtaining an analysis result.
2. The method for processing data according to claim 1, wherein the obtaining the target grabbing task from the preset grabbing task queue comprises:
determining at least one site, and for each site, obtaining a site identification of the site and url of at least one webpage to be crawled in the site;
based on each url and the site identifier corresponding to the url, a corresponding grabbing task is created and sequentially put into the task queue.
3. The data processing method according to claim 1 or 2, wherein the capturing the target page data in the target url according to the target capturing task includes:
obtaining a target proxy ip from a proxy ip pool;
accessing a webpage to be grabbed corresponding to the target url by using the target proxy ip;
and obtaining the target page data of the webpage to be grabbed.
4. A data processing method according to claim 3, wherein accessing the web page to be crawled corresponding to the target url using the target proxy ip further comprises:
initializing a browser and replacing a local ip of the browser by using the target proxy ip;
the accessing the web page to be crawled corresponding to the target url by using the target proxy ip includes:
and after initializing the browser, waiting for a preset sleep period, and sending a target access instruction to the browser so that the browser accesses the webpage to be grabbed according to the target access instruction, wherein the target access instruction is used for indicating the browser to access the webpage to be grabbed corresponding to the target url.
5. A data processing method according to claim 3, wherein the obtaining target page data of the web page to be crawled includes:
capturing page elements according to typesetting sequence of each page element in the webpage to be captured;
and when the grabbing of the preset target page element is determined, determining that the grabbing is completed, and taking the page element grabbed from the webpage to be grabbed as the target page data.
6. The method for processing data according to claim 5, wherein the crawling of the page elements according to the typesetting order of the page elements in the web page to be crawled further comprises:
if the target page elements are determined not to be captured after the page elements are captured according to the typesetting sequence, capturing the page elements according to the typesetting sequence again after the preset time length until the target page elements are determined to be captured.
7. The data processing method according to claim 1 or 2, wherein the capturing the target page data in the target url according to the target capturing task further includes:
and ending the target grabbing process for executing the target grabbing task when the target url is judged to be a dead link.
8. A data processing apparatus, comprising:
the acquisition module is used for acquiring a target grabbing task from a preset grabbing task queue, wherein the target grabbing task comprises a target uniform resource locator url of a webpage to be grabbed and a target site identifier of a site corresponding to the target url;
the grabbing module is used for grabbing target page data in the target url according to a target grabbing task;
the creation module is used for creating an analysis task according to the target page data and placing the analysis task into a target analysis task queue corresponding to the target site identifier, so that the analysis service corresponding to the target site identifier executes the target analysis task in the corresponding target analysis task queue and obtains an analysis result.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the data processing method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the data processing method of any of claims 1-7.
CN202211666824.3A 2022-12-23 2022-12-23 Data processing method, device, electronic equipment and computer readable storage medium Pending CN116150513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211666824.3A CN116150513A (en) 2022-12-23 2022-12-23 Data processing method, device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211666824.3A CN116150513A (en) 2022-12-23 2022-12-23 Data processing method, device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116150513A true CN116150513A (en) 2023-05-23

Family

ID=86357475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211666824.3A Pending CN116150513A (en) 2022-12-23 2022-12-23 Data processing method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116150513A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116880901A (en) * 2023-09-05 2023-10-13 国网思极网安科技(北京)有限公司 Application page analysis method, device, electronic equipment and computer readable medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116880901A (en) * 2023-09-05 2023-10-13 国网思极网安科技(北京)有限公司 Application page analysis method, device, electronic equipment and computer readable medium
CN116880901B (en) * 2023-09-05 2023-11-24 国网思极网安科技(北京)有限公司 Application page analysis method, device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
EP2724251B1 (en) Methods for making ajax web applications bookmarkable and crawlable and devices thereof
WO2016173200A1 (en) Malicious website detection method and system
US9342620B2 (en) Loading of web resources
US20090327460A1 (en) Application Request Routing and Load Balancing
CN109361754A (en) A kind of document transmission method and device based on browser
CN106656920B (en) Processing method, device, storage medium and the processor of HTTP service
CN109829121B (en) Method and device for reporting click behavior data
CN107105336A (en) Data processing method and data processing equipment
CN109213824B (en) Data capture system, method and device
CN116150513A (en) Data processing method, device, electronic equipment and computer readable storage medium
CN109862074B (en) Data acquisition method and device, readable medium and electronic equipment
CN105119764B (en) Method and apparatus for traffic monitoring
US9998559B2 (en) Preemptive caching of data
CN112492055A (en) Method, device and equipment for redirecting transmission protocol and readable storage medium
CN106919595B (en) Cookie mapping method and device and electronic equipment
CN111953718B (en) Page debugging method and device
AU2018390863B2 (en) Computer system and method for extracting dynamic content from websites
CN107508705B (en) Resource tree construction method of HTTP element and computing equipment
CN110838969A (en) Picture transmission method, device, equipment and medium
CN112306791B (en) Performance monitoring method and device
WO2011157183A2 (en) Investigation method and system for web application hosting
JP5860389B2 (en) Web browsing history acquisition system and method, proxy server, and Web browsing history acquisition program
CN111124365A (en) RPA demand collection method and device
US11252244B1 (en) System and method for web-session recording
CN110955851B (en) Network request processing method, system and computing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination