CN110765334A - Data capture method, system, medium and electronic device - Google Patents

Data capture method, system, medium and electronic device

Publication number
CN110765334A
Authority
CN
China
Prior art keywords
data
configuration file
data source
url address
task configuration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910854052.8A
Other languages
Chinese (zh)
Inventor
马福龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201910854052.8A priority Critical patent/CN110765334A/en
Publication of CN110765334A publication Critical patent/CN110765334A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention provides a data capture method, a data capture system, a medium, and an electronic device. The method comprises: acquiring a task configuration file from a user terminal, wherein the task configuration file is generated from user-defined settings and comprises at least one data source URL address, a request method for accessing the data source URL address, a request header, a request body, and a filtering condition; and capturing target data at the data source URL address according to the task configuration file, and storing the target data in a database. The method makes the crawler general-purpose and thereby reduces the user's capture workload.

Description

Data capture method, system, medium and electronic device
Technical Field
The invention relates to the technical field of web crawlers, in particular to a data capturing method, a system, a medium and electronic equipment.
Background
The internet holds massive amounts of data and information, and converting that data and information into a form that can be analyzed and processed is difficult. Web crawlers emerged to address this problem.
A web crawler is a program that automatically extracts web pages; it downloads pages from the World Wide Web for a search engine and is an important component of the search engine. At present, many new projects, particularly content products and user-generated content products, have a strong demand for data capture, and many product lines maintain their own capture task code. However, capture task code cannot be shared between product lines, so users must rewrite the entire capture task code each time, which increases their capture workload.
Therefore, through long-term research and development, the inventor has studied the generalization of crawlers extensively and proposes a data capture method to solve at least one of the above technical problems.
Disclosure of Invention
The present invention is directed to a data capture method, system, medium, and electronic device, which can solve at least one of the above-mentioned technical problems. The specific scheme is as follows:
According to a specific implementation of the present invention, in a first aspect, the present invention provides a data capture method, comprising: acquiring a task configuration file from a user terminal, wherein the task configuration file is generated from user-defined settings; the task configuration file comprises at least one data source URL address, a request method for accessing the data source URL address, a request header, a request body, and a filtering condition; and capturing target data at the data source URL address according to the task configuration file, and storing the target data in a database.
According to a second aspect, the present invention provides a data capture system, comprising: an acquisition module, configured to acquire a task configuration file from a user terminal, wherein the task configuration file is generated from user-defined settings; the task configuration file comprises a data source URL address, a request method for accessing the data source URL address, a request header, a request body, and a filtering condition; a capture module, configured to capture target data at the data source URL address according to the task configuration file; and a storage module, configured to store the target data in a database.
According to a third aspect, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the data capture method described in any of the above.
According to a fourth aspect, the present invention provides an electronic device, comprising: one or more processors; and a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data capture method described in any of the above.
Compared with the prior art, the scheme of the embodiment of the invention at least has the following beneficial effects:
According to different capture requirements, a user only needs to enter user-defined task configuration data for capture and storage to be performed, which makes the crawler general-purpose; furthermore, the entire capture task code does not need to be rewritten, which reduces the user's capture workload; in addition, because the filtering condition is configured directly at the client, the server can capture useful data quickly and conveniently.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a flow chart illustrating an implementation of a data capture method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for fetching target data from the URL address of the data source according to the task configuration file according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating an implementation of a data capture method according to another embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a data capture system according to an embodiment of the present invention;
FIG. 5 shows a schematic diagram of the connection structure of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.
It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
It should be understood that although the terms first, second, third, etc. may be used to describe … … in embodiments of the present invention, these … … should not be limited to these terms. These terms are used only to distinguish … …. For example, the first … … can also be referred to as the second … … and similarly the second … … can also be referred to as the first … … without departing from the scope of embodiments of the present invention.
The words "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in the article or device in which the element is included.
Alternative embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Example 1
Fig. 1 is a flowchart illustrating a data capture method according to an embodiment of the present invention. Specifically, the data capture method comprises the following steps:
S11, acquiring a task configuration file from the user terminal, wherein the task configuration file is generated from user-defined settings; the task configuration file comprises at least one data source URL (Uniform Resource Locator) address, a request method for accessing the data source URL address, a request header, a request body, and a filtering condition.
Specifically, the user terminal includes electronic devices such as a computer, a tablet computer, and a smart phone. In an embodiment of the present invention, the user terminal is provided with a pre-built visualization tool, which may be a graphical configuration tool based on an open source browser engine (e.g., WebKit).
As an example, a user may open an operation interface through the user terminal by using the visualization tool, where the operation interface is provided with a data source URL address input box, a request method input box for accessing the data source URL address, a request header input box, and the like; the user inputs configuration data (DSL, domain specific language) in the input box and confirms, and then generates a task configuration file; finally, the user terminal may send the task configuration file generated after the user performs the setting to the server. Here, the server may be a crawler server, which analyzes and processes the task configuration file, which will be explained in detail later.
The configuration data in the task configuration file is not specifically limited; the user defines it according to actual needs, and it is used subsequently to capture web page data. The data source URL address is a uniform resource locator that identifies a unique resource on the internet; for example, a picture, a file, or a video can each be uniquely identified by a URL. The data source URL address may be an address within a website or an address within an application (APP).
The request method is either POST or GET. With GET, the request body is empty; the request parameters are appended to the URL as a query string and are directly visible in the URL address. With POST, the user-defined configuration data is submitted as form data; it is contained in the request body and is not visible in the URL address. For example, the task configuration file is as follows:
url:https://weibo.com/ttarticle/p/show?id=2309403952864992042896;
method:get;
headers:{};
data:{}.
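As a non-limiting illustration, the four fields shown above, together with the filtering condition, could be held in memory as follows. This is a minimal sketch only: the `TaskConfig` class, the `load_task_config` helper, the `filter` key, and the JSON encoding are assumptions made for illustration and are not specified by the present disclosure.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class TaskConfig:
    """Hypothetical in-memory form of a task configuration file."""
    urls: list[str]                                            # at least one data source URL address
    method: str = "get"                                        # request method: "get" or "post"
    headers: dict[str, str] = field(default_factory=dict)      # request header
    data: dict[str, str] = field(default_factory=dict)         # request body
    filters: dict[str, object] = field(default_factory=dict)   # filtering condition

    def __post_init__(self) -> None:
        self.method = self.method.lower()
        if not self.urls:
            raise ValueError("at least one data source URL address is required")
        if self.method not in ("get", "post"):
            raise ValueError(f"unsupported request method: {self.method}")


def load_task_config(text: str) -> TaskConfig:
    """Parse a JSON-encoded task configuration (the JSON encoding is an assumption)."""
    raw = json.loads(text)
    url = raw["url"]
    urls = [url] if isinstance(url, str) else list(url)
    return TaskConfig(
        urls=urls,
        method=raw.get("method", "get"),
        headers=raw.get("headers", {}),
        data=raw.get("data", {}),
        filters=raw.get("filter", {}),
    )


# The example task configuration from the description, re-encoded as JSON.
example = load_task_config(json.dumps({
    "url": "https://weibo.com/ttarticle/p/show?id=2309403952864992042896",
    "method": "get",
    "headers": {},
    "data": {},
}))
```

Validating the request method and the presence of at least one URL at load time lets a malformed configuration be rejected before any request is sent.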
Generally, the request header (headers) specifies additional information for the server to use; the more important fields include Cookie, Referer, and User-Agent. Cookies are added to the request header and sent to the server each time the browser requests a page of the site; the server identifies the cookies and determines that the current state is a logged-in state, so the returned result is the web page content that is visible after login. The Referer identifies the page from which the request was sent; the server can use this information for processing such as source statistics and anti-hotlinking. The User-Agent lets the server identify information such as the operating system and its version and the browser and its version used by the client. When building a crawler, setting a User-Agent in the request header lets the crawler present itself as a browser; if no User-Agent is configured, the crawler is easily identified.
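As a non-limiting illustration, the three header fields discussed above could be attached to a request as in the following sketch, which uses the Python standard library. The header values and the `build_request` helper are placeholders assumed for illustration; no request is actually sent.

```python
from urllib.request import Request

# Hypothetical header values; a real task would take them from the
# request header section of the task configuration file.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://weibo.com/",
    "Cookie": "session=PLACEHOLDER",
}


def build_request(url: str, request_headers: dict) -> Request:
    """Attach the configured request headers to a request object (nothing is sent)."""
    return Request(url, headers=request_headers)


req = build_request(
    "https://weibo.com/ttarticle/p/show?id=2309403952864992042896", headers
)
```

Note that `urllib.request.Request` stores header names in capitalized form, so the User-Agent value is read back with `req.get_header("User-agent")`.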
The content generally carried by the request body is the form data in the post request, and the request body of the get request is empty.
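The difference between the two request methods can be sketched as follows. This is a minimal illustration using the Python standard library; the `make_request` helper and the `example.com` address are assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def make_request(url, method, data, headers=None):
    """Build a GET or POST request from task configuration fields (hypothetical helper)."""
    headers = dict(headers or {})
    if method.lower() == "get":
        # GET: parameters travel in the query string; the request body stays empty.
        if data:
            separator = "&" if "?" in url else "?"
            url = url + separator + urlencode(data)
        return Request(url, headers=headers, method="GET")
    # POST: parameters travel as form data in the request body, not in the URL.
    body = urlencode(data).encode("utf-8")
    headers.setdefault("Content-Type", "application/x-www-form-urlencoded")
    return Request(url, data=body, headers=headers, method="POST")


get_req = make_request("https://example.com/search", "get", {"q": "crawler"})
post_req = make_request("https://example.com/search", "post", {"q": "crawler"})
```

Here `get_req.full_url` carries the parameter and `get_req.data` is `None`, whereas `post_req.full_url` is unchanged and `post_req.data` holds the encoded form.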
The task configuration file comprises a filtering condition, and the response data from the data source URL address is filtered by the filtering condition so that only valid target data is captured. Specifically, the user enters a custom filtering condition in the operation interface while entering configuration data at the client, and the custom filtering condition is written into the task configuration file. During crawling, a URL may contain a large amount of useless information, such as advertisement links; crawling it increases the workload and reduces efficiency. For this reason, the user can configure a filtering condition directly during custom configuration, so that subsequent crawling captures only the data useful to the user. The filtering condition is determined by the user's own requirements and is used to filter out data that does not meet the condition during capture. For example, the filtering condition may filter out posts with fewer than 100 comments; it is not specifically limited here.
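As a non-limiting illustration of the comment-count example above, a filtering condition could be applied to parsed items as in the following sketch. The `passes_filter` helper, the `comments` field name, and the sample posts are assumptions made for illustration.

```python
def passes_filter(item: dict, min_comments: int = 100) -> bool:
    """Keep an item only if it has at least `min_comments` comments."""
    return item.get("comments", 0) >= min_comments


# Hypothetical parsed posts; the field names are illustrative only.
posts = [
    {"title": "A", "comments": 250},
    {"title": "B", "comments": 12},
    {"title": "C", "comments": 100},
]

# Posts with fewer than 100 comments are filtered out.
kept = [post for post in posts if passes_filter(post)]
```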
S12, capturing target data at the data source URL address according to the task configuration file, and storing the target data in a database.
In step S11, the user terminal automatically uploads the user-defined task configuration file to the server, and in step S12, the server analyzes and processes the task configuration file. Specifically, referring to fig. 2, the capturing target data in the URL address of the data source according to the task configuration file includes:
S121, simulating a browser to send an access request to the at least one data source URL address. The request is sent to a web server, including the web servers provided by internet service providers, such as the web portals Tencent, Sina, and Phoenix Net. Specifically, the server presents itself as a browser according to the User-Agent in the request header of the task configuration file, and accesses the data source URL address according to information such as the request method. The specific method by which the crawler crawls data is not limited here, and existing crawler technologies can be used. As an example, when the task configuration file comprises a plurality of data source URL addresses, capturing target data at the data source URL addresses according to the task configuration file comprises:
scheduling the plurality of data source URL addresses by means of a message queue, and capturing the web page data of the plurality of data source URL addresses concurrently. Specifically, the plurality of data source URL addresses are placed in a message queue, at least one process or thread is created for the message queue, and the process or thread schedules the data source URL addresses to capture web page data.
Here, the process or thread captures web page data as follows: starting from the URL of the current web page, it obtains the other URL addresses on that page; while capturing pages, it keeps extracting new URL addresses from the current page and placing them in the queue for further capture, and repeats these steps until a stop condition set by the system is met or a web page without access permission is reached.
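The queue-based scheduling and link extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions: an in-process `queue.Queue` stands in for the message queue, the `crawl` function and `LinkParser` class are hypothetical names, the page limit `max_pages` stands in for the condition set by the system, and page download is delegated to a caller-supplied `fetch` function that returns `None` for a page without access permission.

```python
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href values of <a> tags from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed_urls, fetch, max_pages=10, workers=2):
    """Crawl from seed_urls; fetch(url) returns HTML text, or None if access is denied."""
    tasks = queue.Queue()
    seen = set()
    pages = {}
    lock = threading.Lock()

    def enqueue(url):
        # Caller holds the lock. The cap on len(seen) is the stop condition.
        if url not in seen and len(seen) < max_pages:
            seen.add(url)
            tasks.put(url)

    def worker():
        while True:
            url = tasks.get()
            if url is None:              # sentinel: no more work
                tasks.task_done()
                return
            try:
                html = fetch(url)
                if html is not None:
                    parser = LinkParser(url)
                    parser.feed(html)
                    with lock:
                        pages[url] = html
                        for link in parser.links:
                            enqueue(link)  # newly found URLs go back into the queue
            except Exception:
                pass                     # treat a failed fetch like an inaccessible page
            finally:
                tasks.task_done()

    with lock:
        for url in seed_urls:
            enqueue(url)
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    tasks.join()                         # wait until every queued URL is processed
    for _ in threads:
        tasks.put(None)
    for thread in threads:
        thread.join()
    return pages
```

Because `fetch` is injected, the sketch can be exercised without network access, for example by passing the `get` method of a dictionary that maps URLs to HTML strings.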
S122, acquiring the response data returned by the data source URL address. Specifically, when the server accesses the URL address, the corresponding site returns the response data of the page to the server. The response data is all the data at the data source URL address, and mainly includes HTML code, JSON data, and binary data such as pictures and videos.
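As a non-limiting illustration, the three kinds of response data mentioned above could be distinguished by the Content-Type of the response, as in the following sketch. The `decode_response` helper is hypothetical, and UTF-8 is assumed as the text encoding.

```python
import json


def decode_response(content_type: str, body: bytes):
    """Return ('json', obj), ('html', text) or ('binary', bytes) based on Content-Type."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return "json", json.loads(body.decode("utf-8"))
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html", body.decode("utf-8", errors="replace")
    # Pictures, videos and any other media type are kept as raw bytes.
    return "binary", body
```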
S123, capturing the target data in the response data according to the task configuration file. Specifically, the server captures target data according to the user-defined configuration data. Because the task configuration file contains a filtering condition, the captured target data is valid target data. The target data is the data that the user wants to capture, and depends mainly on the configuration data in the task configuration file.
After the server captures the corresponding target data, the target data is stored in a database, so that a user can check the captured data at any time. The database can be MySQL database or SQL Server database.
Further, the data capture method comprises: S13, in response to a user viewing instruction, sending the target data to the user terminal.
Specifically, since the target data are all stored in the database of the server, when the user views the data of the captured task, the user terminal pulls the corresponding target data from the database. As an example, the user may open another operation interface by using the visualization tool through the user terminal, where the operation interface is used to view the capture result; a user inputs a URL address of a data source to be checked through the operation interface, and can check one URL address and also can check a plurality of URL addresses at the same time; and after the user input is completed and confirmed, the user terminal displays the webpage data on an interface.
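Storing the target data and later retrieving it by data source URL address can be sketched as follows. This is a minimal illustration only: an in-memory SQLite database from the Python standard library is used here as a stand-in for the MySQL or SQL Server database named above, and the table name, column names, and helper functions are assumptions.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) target_data table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_data ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "source_url TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )
    return conn


def save_items(conn, source_url, items):
    """Store captured items for one data source URL address in a single transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO target_data (source_url, payload) VALUES (?, ?)",
            [(source_url, json.dumps(item, ensure_ascii=False)) for item in items],
        )


def load_items(conn, source_url):
    """Return the stored items for one data source URL address, in insertion order."""
    rows = conn.execute(
        "SELECT payload FROM target_data WHERE source_url = ? ORDER BY id",
        (source_url,),
    )
    return [json.loads(row[0]) for row in rows]
```

The `load_items` lookup corresponds to a viewing request in which the user enters the data source URL address whose captured results are to be displayed.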
In summary, with the data capture method provided by embodiment 1 of the invention, different capture tasks can be performed simply by generating a task configuration file, which makes the crawler general-purpose and reduces the user's capture workload; in addition, because the filtering condition is set directly at the client, useless data can be filtered out simply and quickly and the target data captured.
Example 2
Fig. 3 is a flowchart illustrating a data capture method according to another embodiment of the present invention, where an execution subject of the embodiment is a user terminal, which is different from the foregoing embodiment. The user terminal comprises electronic equipment such as a computer, a tablet computer and a smart phone, and can also be a certain client in the electronic equipment, such as a crawler client. Specifically, the data capture method comprises the following steps:
S31, receiving task configuration data entered by a user and generating a task configuration file; the task configuration file comprises at least one data source URL (Uniform Resource Locator) address, a request method for accessing the data source URL address, a request header, a request body, and a filtering condition.
In this step, the task configuration data is customized by the user, and the task configuration data may be different for different users, without any limitation.
S32, uploading the task configuration file to a server, so that the server captures and stores the web page data at the at least one data source URL address according to the task configuration file. Specifically, after the user completes the configuration at the user terminal, a task configuration file is generated and is automatically uploaded to the server.
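Steps S31 and S32 on the user terminal side can be sketched as follows. This is a minimal illustration only: the JSON encoding, the `build_upload_request` helper, the `filter` key, and the server address `https://crawler.example.com/tasks` are assumptions, and the request is built but not sent.

```python
import json
from urllib.request import Request


def build_upload_request(server_url: str, task: dict) -> Request:
    """Serialize the task configuration and wrap it in a POST request (not sent here)."""
    body = json.dumps(task, ensure_ascii=False).encode("utf-8")
    return Request(
        server_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Task configuration data as entered by the user (values are illustrative).
task = {
    "url": ["https://weibo.com/ttarticle/p/show?id=2309403952864992042896"],
    "method": "get",
    "headers": {},
    "data": {},
    "filter": {"min_comments": 100},
}

# Hypothetical crawler server endpoint.
upload = build_upload_request("https://crawler.example.com/tasks", task)
```

In an actual deployment the request would then be sent, for example with `urllib.request.urlopen(upload)`.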
Note that the basic content of this embodiment is the same as that of embodiment 1; for the explanation of this embodiment, refer to the description of embodiment 1, which is not repeated here.
With the data capture method provided by embodiment 2 of the invention, different capture tasks can be performed simply by generating a task configuration file, which makes the crawler general-purpose and reduces the user's capture workload; in addition, because the filtering condition is set directly at the client, useless data can be filtered out quickly and conveniently and the target data captured.
Example 3
Referring to FIG. 4, an embodiment of the invention provides a data capture system 400, where the system 400 includes: an acquisition module 410, a capture module 420, and a storage module 430.
The acquisition module 410 is configured to acquire a task configuration file from a user terminal, where the task configuration file is generated from user-defined settings; the task configuration file comprises a data source URL address, a request method for accessing the data source URL address, a request header, a request body, and a filtering condition.
Specifically, the user terminal includes electronic devices such as a computer, a tablet computer, and a smart phone. In an embodiment of the present invention, the user terminal is provided with a pre-established visualization tool, and the visualization tool may be a graphical configuration tool based on an open source browser engine (e.g., webkit).
As an example, a user may open an operation interface through the user terminal by using the visualization tool, where the operation interface is provided with a data source URL address input box, a request method input box for accessing the data source URL address, a request header input box, and the like; the user inputs configuration data (DSL, domain specific language) in the input box and confirms, and then generates a task configuration file; finally, the user terminal may send the task configuration file generated after the user performs the setting to the server. Here, the server may be a crawler server, which analyzes and processes the task configuration file, which will be explained in detail later.
The configuration data in the task configuration file is not specifically limited; the user defines it according to actual needs, and it is used subsequently to capture web page data. The data source URL address is a uniform resource locator that identifies a unique resource on the internet; for example, a picture, a file, or a video can each be uniquely identified by a URL. The data source URL address may be an address within a website or an address within an application (APP).
The request method is either POST or GET. With GET, the request body is empty; the request parameters are appended to the URL as a query string and are directly visible in the URL address. With POST, the user-defined configuration data is submitted as form data; it is contained in the request body and is not visible in the URL address. For example, the task configuration file is as follows:
url:https://weibo.com/ttarticle/p/show?id=2309403952864992042896;
method:get;
headers:{};
data:{}.
Generally, the request header (headers) specifies additional information for the server to use; the more important fields include Cookie, Referer, and User-Agent. Cookies are added to the request header and sent to the server each time the browser requests a page of the site; the server identifies the cookies and determines that the current state is a logged-in state, so the returned result is the web page content that is visible after login. The Referer identifies the page from which the request was sent; the server can use this information for processing such as source statistics and anti-hotlinking. The User-Agent lets the server identify information such as the operating system and its version and the browser and its version used by the client. When building a crawler, setting a User-Agent in the request header lets the crawler present itself as a browser; if no User-Agent is configured, the crawler is easily identified.
The content generally carried by the request body is the form data in the post request, and the request body of the get request is empty.
The task configuration file comprises a filtering condition, and the capture module 420 filters the response data from the data source URL address according to the filtering condition to capture valid target data. Specifically, the user enters a custom filtering condition in the operation interface while entering configuration data at the client, and the custom filtering condition is written into the task configuration file. During crawling, a URL may contain a large amount of useless information, such as advertisement links; crawling it increases the workload and reduces efficiency. For this reason, the user can configure a filtering condition directly during custom configuration, so that subsequent crawling captures only the data useful to the user. The filtering condition is determined by the user's own requirements and is used to filter out data that does not meet the condition during capture. For example, the filtering condition may filter out posts with fewer than 100 comments; it is not specifically limited here.
The capture module 420 is configured to capture target data at the data source URL address according to the task configuration file. Specifically, after the acquisition module 410 automatically acquires the user-defined task configuration file from the user terminal, the capture module 420 starts to crawl data.
Referring to FIG. 2, the capture module 420 may simulate a browser to send an access request to the at least one data source URL address. The request is sent to a web server, including the web servers provided by internet service providers, such as the web portals Tencent, Sina, and Phoenix Net. Specifically, the capture module 420 presents itself as a browser according to the User-Agent in the request header of the task configuration file, and accesses the data source URL address according to information such as the request method. The specific method by which the crawler crawls data is not limited here, and existing crawler technologies can be used.
As an example, when the task configuration file comprises a plurality of data source URL addresses, the capture module 420 may schedule the plurality of data source URL addresses by means of a message queue and capture the web page data of each of them. Specifically, the capture module 420 places the plurality of data source URL addresses in a message queue, creates at least one process or thread for the message queue, and schedules the data source URL addresses through the process or thread to capture web page data.
Here, the process or thread captures web page data as follows: starting from the URL of the current web page, it obtains the other URL addresses on that page; while capturing pages, it keeps extracting new URL addresses from the current page and placing them in the queue for further capture, and repeats these steps until a stop condition set by the system is met or a web page without access permission is reached.
The capture module 420 may also acquire the response data returned by the data source URL address. Specifically, when the capture module 420 accesses the URL address, the corresponding site returns the response data of the page to the server. The response data is all the data at the data source URL address, and mainly includes HTML code, JSON data, and binary data such as pictures and videos.
The capture module 420 may also capture the target data in the response data according to the task configuration file. Specifically, the capture module 420 captures target data according to the configuration data. Because the task configuration file contains a filtering condition, the captured target data is valid target data. The target data is the data that the user wants to capture, and depends mainly on the configuration data in the task configuration file.
After the capture module 420 captures the corresponding target data, the target data is stored in a database so that the user can view the captured data at any time. The database may be a MySQL database or a SQL Server database.
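As a non-limiting illustration, the cooperation of the acquisition, capture, and storage modules could be sketched as follows. The `DataCaptureSystem` class and its callable fields are hypothetical; each module is modeled as a plain function so that any concrete implementation can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DataCaptureSystem:
    """Hypothetical composition of the three modules of the system."""
    acquire: Callable[[], dict]            # acquisition module: returns a task configuration
    capture: Callable[[dict], List[dict]]  # capture module: returns the target data
    store: Callable[[List[dict]], None]    # storage module: persists the target data

    def run(self) -> int:
        """Run one capture task and return the number of stored items."""
        task = self.acquire()
        items = self.capture(task)
        self.store(items)
        return len(items)


# Stub modules for illustration; real modules would talk to a terminal, sites and a database.
stored: List[dict] = []
system = DataCaptureSystem(
    acquire=lambda: {"url": ["https://example.com/"], "method": "get"},
    capture=lambda task: [{"source": url} for url in task["url"]],
    store=stored.extend,
)
count = system.run()
```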
Further, the data capture system comprises: a sending module 430, configured to send the target data to the user terminal in response to a user viewing instruction.
Specifically, since the target data are all stored in the database of the server, when the user views the data of the captured task, the sending module 430 retrieves the corresponding target data from the database and sends the target data to the user terminal. As an example, the user may open another operation interface by using the visualization tool through the user terminal, where the operation interface is used to view the capture result; a user inputs a URL address of a data source to be checked through the operation interface, and can check one URL address and also can check a plurality of URL addresses at the same time; and after the user input is completed and confirmed, the user terminal displays the webpage data on an interface.
Finally, with the data capture system provided by the embodiment of the invention, different capture tasks can be performed simply by generating a task configuration file, which makes the crawler general-purpose and further reduces the user's capture workload. In addition, by setting the filtering condition directly at the client, useless data can be filtered out quickly and conveniently, and the target data can be captured.
Example 4
The embodiment of the disclosure provides a non-volatile computer storage medium that stores computer-executable instructions, and the computer-executable instructions can execute the data capture method in any of the method embodiments above.
Example 5
The embodiment provides an electronic device for the data capture method, the electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to:
acquire a task configuration file of a user terminal, wherein the task configuration file is user-defined; the task configuration file comprises at least one data source URL address, a request method for accessing the data source URL address, a request header, a request body, and a filtering condition;
and capture target data at the data source URL address according to the task configuration file, and store the target data in a database.
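As a non-limiting illustration, a task configuration file carrying the five items listed above could be expressed in JSON and parsed as in the sketch below. The disclosure does not fix a file format; JSON, the key names (`urls`, `method`, `headers`, `body`, `filter`), and the names `TaskConfig` and `parse_task_config` are assumptions.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TaskConfig:
    urls: list                                    # at least one data source URL address
    method: str = "GET"                           # request method
    headers: dict = field(default_factory=dict)   # request header
    body: str = ""                                # request body
    filter: dict = field(default_factory=dict)    # filtering condition


def parse_task_config(text):
    """Parse and validate a JSON task configuration (hypothetical format)."""
    raw = json.loads(text)
    urls = raw.get("urls") or []
    if not urls:
        raise ValueError("task configuration needs at least one data source URL")
    return TaskConfig(
        urls=list(urls),
        method=str(raw.get("method", "GET")).upper(),
        headers=dict(raw.get("headers", {})),
        body=raw.get("body", ""),
        filter=dict(raw.get("filter", {})),
    )
```

Under these assumptions, a configuration such as `{"urls": ["https://example.com/list"], "method": "post", "body": "page=1"}` parses into a `TaskConfig` whose method is normalized to `POST`, while a configuration without any URL address is rejected.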
Example 6
Referring now to FIG. 5, shown is a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, the electronic device may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic apparatus are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.

Claims (10)

1. A method for data capture, comprising:
acquiring a task configuration file of a user terminal, wherein the task configuration file is user-defined; the task configuration file comprises at least one data source URL address, a request method for accessing the data source URL address, a request header, a request body, and a filtering condition;
and capturing target data at the data source URL address according to the task configuration file, and storing the target data in a database.
2. The method of claim 1, wherein the capturing target data at the data source URL address according to the task configuration file comprises:
simulating a browser to send an access request to the at least one data source URL address;
acquiring response data returned by the data source URL address;
and capturing target data in the response data according to the task configuration file.
3. The method according to claim 2, wherein the acquiring response data returned by the data source URL address comprises:
acquiring HTML code, JSON data, and binary data returned in response by the data source URL address.
4. The method of claim 1, wherein the capturing target data at the data source URL address according to the task configuration file comprises:
filtering the data at the data source URL address according to the filtering condition, and capturing the target data.
5. The method of claim 1, wherein, when the task configuration file includes a plurality of data source URL addresses, the capturing target data at the data source URL address according to the task configuration file comprises:
scheduling the plurality of data source URL addresses by means of a message queue, and simultaneously capturing target data at the plurality of data source URL addresses.
6. The method of claim 1, further comprising:
sending the target data to the user terminal in response to a user viewing instruction.
7. A data capture system, comprising:
an acquisition module, configured to acquire a task configuration file of a user terminal, wherein the task configuration file is user-defined; the task configuration file comprises a data source URL address, a request method for accessing the data source URL address, a request header, a request body, and a filtering condition;
a capture module, configured to capture target data at the data source URL address according to the task configuration file;
and a storage module, configured to store the target data in a database.
8. The system of claim 7, further comprising:
a sending module, configured to send the target data to the user terminal in response to a user viewing instruction.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1 to 6.
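As a non-limiting illustration of the scheduling recited in claim 5, several data source URL addresses can be placed in a queue and drained by a pool of workers so that they are captured concurrently. The claims do not name a message queue product; this sketch substitutes Python's in-process `queue.Queue` and worker threads, and the function name `crawl_all` and the injected `fetch` callable are assumptions.

```python
import queue
import threading


def crawl_all(urls, fetch, workers=4):
    """Fetch every URL using a shared queue and a pool of worker threads.

    `fetch` is any callable taking a URL and returning its captured data.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            try:
                data = fetch(url)
            except Exception as exc:
                data = exc  # record the failure and keep the other URLs going
            with lock:
                results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Passing `fetch` in as a parameter keeps the scheduling logic independent of how a single URL address is requested, so the same loop works with any request method, header, or body taken from the task configuration file.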
CN201910854052.8A 2019-09-10 2019-09-10 Data capture method, system, medium and electronic device Pending CN110765334A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910854052.8A CN110765334A (en) 2019-09-10 2019-09-10 Data capture method, system, medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910854052.8A CN110765334A (en) 2019-09-10 2019-09-10 Data capture method, system, medium and electronic device

Publications (1)

Publication Number Publication Date
CN110765334A true CN110765334A (en) 2020-02-07

Family

ID=69329513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910854052.8A Pending CN110765334A (en) 2019-09-10 2019-09-10 Data capture method, system, medium and electronic device

Country Status (1)

Country Link
CN (1) CN110765334A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414523A (en) * 2020-03-11 2020-07-14 中国建设银行股份有限公司 Data acquisition method and device
CN112287201A (en) * 2020-12-31 2021-01-29 北京精准沟通传媒科技股份有限公司 Method, device, medium and electronic equipment for removing duplicate of crawler request
CN112434205A (en) * 2020-11-30 2021-03-02 北京秒针人工智能科技有限公司 Data integration capturing method and system based on data site and computer equipment
CN113392301A (en) * 2021-06-08 2021-09-14 北京精准沟通传媒科技股份有限公司 Method, device, medium and electronic equipment for crawling data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750804A (en) * 2015-03-24 2015-07-01 南京途牛科技有限公司 Plug-in type configurable vertical network spider implementation method
CN105045838A (en) * 2015-07-01 2015-11-11 华东师范大学 Network crawler system based on distributed storage system
CN107729508A (en) * 2017-10-23 2018-02-23 北京京东金融科技控股有限公司 Information crawler method and apparatus
CN109213824A (en) * 2017-06-29 2019-01-15 北京京东尚科信息技术有限公司 Data grabber system, method and apparatus
CN109862021A (en) * 2019-02-26 2019-06-07 武汉思普崚技术有限公司 Threaten the acquisition methods and device of information

Similar Documents

Publication Publication Date Title
CN110765334A (en) Data capture method, system, medium and electronic device
CN107390994B (en) Interface presentation method and device
CN110493318B (en) HTTP request information processing method, device, medium and equipment
US9483624B2 (en) Method and apparatus for configuring privacy settings for publishing electronic images
CN111459364B (en) Icon updating method and device and electronic equipment
CN112312222A (en) Video sending method and device and electronic equipment
CN111598006A (en) Method and device for labeling objects
CN113343312A (en) Page tamper-proofing method and system based on front-end point burying technology
CN113505302A (en) Method, device and system for supporting dynamic acquisition of buried point data and electronic equipment
CN111311358B (en) Information processing method and device and electronic equipment
CN112351221B (en) Image special effect processing method, device, electronic equipment and computer readable storage medium
CN112558933A (en) Component rendering method and device, readable medium and electronic equipment
CN116596748A (en) Image stylization processing method, apparatus, device, storage medium, and program product
CN114528433B (en) Template selection method and device, electronic equipment and storage medium
CN114327453B (en) Page display method, device, equipment and storage medium
US20160125092A1 (en) Web component display by cross device portal
CN111222067B (en) Information generation method and device
CN110730251B (en) Method, device, medium and electronic equipment for analyzing domain name
CN114187169A (en) Method, device and equipment for generating video special effect package and storage medium
CN113553489B (en) Method, device, equipment, medium and program product for capturing content
CN111294657A (en) Information processing method and device
US20150149596A1 (en) Sending mobile applications to mobile devices from personal computers
CN111026983B (en) Method, device, medium and electronic equipment for realizing hyperlink
CN111371745B (en) Method and apparatus for determining SSRF vulnerability
CN113722634A (en) Data processing method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200207)