Disclosure of Invention
The application provides a data acquisition method and device and a computer readable storage medium, which aim to improve the flexibility of a data acquisition process and the data acquisition efficiency.
In a first aspect, the present application provides a data acquisition method, including:
the scheduling module acquires scheduling information in a scheduling template, and the scheduling template is written and stored in a dynamic template language;
the scheduling module generates a network request according to the scheduling information;
the downloading module downloads the webpage source codes according to the network request;
and the analysis module processes the webpage source code by using an analysis template to obtain target data, and the analysis template corresponds to the scheduling template.
In another possible design, the scheduling module obtains scheduling information in a scheduling template, including:
and the scheduling module calls a first analyzer and acquires the scheduling information obtained by the first analyzer analyzing the scheduling template.
In another possible design, the scheduling information further includes at least one of the following: portal information, frequency control, request mode, preprocessing method, subordinate data fusion mode, and additional data processing method.
In another possible design, the parsing module processes the web page source code using a parsing template to obtain target data, including:
and the analysis module calls a second analyzer and acquires the target data obtained by the second analyzer through processing the webpage source codes according to the analysis template.
In another possible design, the method further includes:
the analysis module sends the target data to the scheduling module;
and the scheduling module outputs or stores the target data according to the scheduling template.
In a second aspect, the present application provides a data acquisition device comprising a scheduling module, a downloading module and a parsing module, wherein:
the scheduling module is used for acquiring scheduling information in a scheduling template, and the scheduling template is written and stored in a dynamic template language;
the scheduling module is further used for generating a network request according to the scheduling information;
the downloading module is used for downloading the webpage source codes according to the network request;
the parsing module is used for processing the webpage source codes by utilizing a parsing template to obtain target data, and the parsing template corresponds to the scheduling template.
In another possible design, the scheduling module is specifically configured to:
and calling a first analyzer, and acquiring the scheduling information obtained by analyzing the scheduling template by the first analyzer.
In another possible design, the scheduling information further includes at least one of the following: portal information, frequency control, request mode, preprocessing method, subordinate data fusion mode, and additional data processing method.
In another possible design, the parsing module is specifically configured to:
and the analysis module calls a second analyzer and acquires the target data obtained by the second analyzer through processing the webpage source codes according to the analysis template.
In another possible design, the parsing module is further configured to send the target data to the scheduling module;
the scheduling module is further used for outputting or storing the target data according to the scheduling template.
In a third aspect, the present application provides a data acquisition device comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium having a computer program stored thereon,
the computer program being executable by a processor to implement the method according to any of the first aspects.
In the data acquisition scheme provided by the application, the scheduling module controls the scheduling of the whole data acquisition flow according to the scheduling information, and the parsing module parses the webpage source code downloaded by the downloading module using the parsing template that matches the scheduling information. Because the scheduling template is written and stored in a dynamic template language, the template is more flexible and its coverage and accuracy are improved, which greatly reduces the labor cost of large-scale Internet data acquisition and improves data acquisition efficiency.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The specific application scene of the application is an internet data acquisition process. The existing data acquisition method can only be applied to data acquisition of specific websites or websites of one type, is difficult to apply and expand aiming at non-specific websites or websites of different types, and has low flexibility.
The application provides a data acquisition method, a data acquisition device and a computer readable storage medium, which aim to solve the technical problems in the prior art and provide the following solution ideas: the template structure is organized by using dynamic template languages such as extensible markup language (Extensible Markup Language, XML), and the corresponding analysis plug-ins are configured to realize analysis, so that a template collection mode with high efficiency and high universality is realized.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example 1
The embodiment of the application provides a data acquisition method. Referring to fig. 1 and fig. 2, fig. 1 illustrates the data acquisition method, and fig. 2 illustrates a device that executes the method (hereinafter referred to as the data acquisition device). As shown in fig. 2, the data acquisition device includes a scheduling module 21, a downloading module 22 and a parsing module 23, which carry out data acquisition according to the method shown in fig. 1.
Specifically, as shown in fig. 1, the method comprises the following steps:
S102, the scheduling module 21 acquires scheduling information in a scheduling template written and stored in a dynamic template language.
S104, the scheduling module 21 generates a network request according to the scheduling information.
S106, the downloading module 22 downloads the webpage source codes according to the network request.
S108, the analysis module 23 processes the webpage source codes by using the analysis template to obtain target data, wherein the analysis template corresponds to the scheduling template.
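Steps S102 to S108 can be sketched as a small pipeline. The following is only a minimal illustration under stated assumptions: the templates are represented as plain dictionaries rather than documents in a dynamic template language, the parsing template is reduced to one regular expression per field, and all names (`NetworkRequest`, `run_once`, `fake_fetch`, the `entry` and `parse_template` keys) are hypothetical and do not come from the application.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class NetworkRequest:
    """Hypothetical container for the request assembled in S104."""
    url: str
    method: str = "GET"


def acquire_scheduling_info(scheduling_template: Dict[str, str]) -> Dict[str, str]:
    # S102: the template is assumed to be an already-loaded dictionary.
    return dict(scheduling_template)


def generate_request(info: Dict[str, str]) -> NetworkRequest:
    # S104: build the network request from the scheduling information.
    return NetworkRequest(url=info["entry"], method=info.get("method", "GET"))


def download(request: NetworkRequest, fetch: Callable[[NetworkRequest], str]) -> str:
    # S106: the actual transport is injected so the sketch needs no network.
    return fetch(request)


def parse(source: str, parsing_template: Dict[str, str]) -> Dict[str, Optional[str]]:
    # S108: one regular expression per target field (an assumed template shape).
    result = {}
    for field_name, pattern in parsing_template.items():
        match = re.search(pattern, source, re.S)
        result[field_name] = match.group(1).strip() if match else None
    return result


def run_once(scheduling_template, parsing_templates, fetch):
    info = acquire_scheduling_info(scheduling_template)
    request = generate_request(info)
    source = download(request, fetch)
    # The parsing template is the one that corresponds to the scheduling template.
    return parse(source, parsing_templates[info["parse_template"]])


def fake_fetch(request: NetworkRequest) -> str:
    # Stand-in for a real download; returns a fixed page.
    return "<html><head><title> Demo page </title></head></html>"


target = run_once(
    {"entry": "https://example.com/list", "parse_template": "demo"},
    {"demo": {"title": r"<title>(.*?)</title>"}},
    fake_fetch,
)
```

Injecting `fetch` keeps the downloading step replaceable, which mirrors the separation between the scheduling, downloading and parsing modules.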
First, a scheduling template and an analysis template shown in fig. 1 will be described.
In the embodiment of the application, the scheduling template is written and stored in a dynamic template language, and the analysis template can be written and stored in a dynamic template language, or can be written and stored in other modes, and the embodiment of the application is not particularly limited.
Taking a scheduling template as an example, before executing the method, the method may further include the following steps:
the scheduling templates are written and stored in a dynamic template language.
In particular, this can be done by extracting data from websites of the same type and then generating and storing a template in the dynamic template language. In this process, the acquired content can be extended without programming, which improves acquisition development efficiency and the accuracy of the acquired data. The parsing template can be written and stored in the dynamic template language in the same way, which will not be described again.
The dynamic template language according to the embodiment of the present application may include, but is not limited to: extensible markup language (Extensible Markup Language, XML). The method can be used for defining the operation methods of a plurality of links such as task scheduling, source code analysis, data transmission, data formatting and the like in the data acquisition process.
In the embodiment of the application, the scheduling template carries scheduling information, which describes how a specific channel is scheduled. The scheduling information may include, but is not limited to, at least one of the following: entry information, frequency control, request mode, preprocessing method, subordinate data fusion mode, additional data processing method, and identification information of the corresponding parsing template. The identification information of a template uniquely identifies that template, and may take forms including, but not limited to, a name.
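As an illustration of how scheduling information might be carried in an XML scheduling template, the sketch below reads a few of the fields listed above with Python's standard `xml.etree.ElementTree`. The element and attribute names (`schedule`, `entry`, `frequency`, `request`, `parse_template`) are assumptions made for this example; the application does not define a concrete schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical scheduling template; the schema is assumed for illustration only.
SCHEDULING_TEMPLATE = """
<schedule name="news-channel">
  <entry>https://example.com/news</entry>
  <frequency seconds="60"/>
  <request method="GET"/>
  <parse_template name="news-list"/>
</schedule>
"""


def read_scheduling_info(xml_text: str) -> dict:
    """Extract scheduling information from an XML scheduling template."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.get("name"),                                  # template identification
        "entry": root.findtext("entry"),                           # entry information
        "interval_seconds": int(root.find("frequency").get("seconds")),  # frequency control
        "method": root.find("request").get("method"),              # request mode
        "parse_template": root.find("parse_template").get("name"),  # corresponding parsing template
    }


info = read_scheduling_info(SCHEDULING_TEMPLATE)
```

Because the template is data rather than code, adding a new channel under this assumed schema would only require writing another XML document.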
The parsing template is used for carrying parsing information of a specific format webpage, namely, describing a process of extracting effective data on the specific format webpage. The embodiment of the application is not limited to the analysis method indicated by the analysis template, and can be preset according to the requirement.
In a preferred implementation scenario, if the parsing template is written and stored in a dynamic template language, the parsing function of the parsing template can be extended along multiple dimensions. For example, template parsing may support several different parsing modes, such as XPath, CSS selectors and regular expressions, and may be extended to support other parsing forms such as JSONPath.
In another preferred implementation scenario, the parsing template may be capable of extracting data on the web page in a plurality of ways as described above, and formatting the extracted data. The format processing method may include, but is not limited to: at least one of splicing, cutting, combining and formatting.
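The combination of pluggable parsing modes and post-extraction formatting can be sketched as a dispatch table. This is a minimal example under stated assumptions: only regular expressions and the limited XPath subset of the standard `xml.etree.ElementTree` are shown (CSS selectors and JSONPath would need third-party libraries), the input must be well-formed markup for the XPath mode, and the rule keys `mode`, `expr`, `cut`, `prefix` and `join` are hypothetical names chosen for this illustration.

```python
import re
import xml.etree.ElementTree as ET


def extract_regex(source: str, expr: str) -> list:
    return re.findall(expr, source)


def extract_xpath(source: str, expr: str) -> list:
    # ElementTree supports only a limited XPath subset and needs well-formed markup.
    root = ET.fromstring(source)
    return [(el.text or "").strip() for el in root.findall(expr)]


# Parsing modes are looked up by name, so further modes can be registered later.
EXTRACTORS = {"regex": extract_regex, "xpath": extract_xpath}


def register_extractor(mode: str, func) -> None:
    EXTRACTORS[mode] = func


def apply_rule(source: str, rule: dict):
    """Extract values with the rule's mode, then apply optional formatting."""
    values = EXTRACTORS[rule["mode"]](source, rule["expr"])
    if "cut" in rule:                      # cutting: keep a slice of each value
        start, end = rule["cut"]
        values = [v[start:end] for v in values]
    if "prefix" in rule:                   # splicing: prepend a fixed string
        values = [rule["prefix"] + v for v in values]
    if "join" in rule:                     # combining: merge values into one string
        return rule["join"].join(values)
    return values


PAGE = "<ul><li>2024-01-05 Alpha</li><li>2024-02-10 Beta</li></ul>"

dates = apply_rule(PAGE, {"mode": "xpath", "expr": "./li", "cut": (0, 10)})
names = apply_rule(PAGE, {"mode": "regex", "expr": r"\d (\w+)<", "join": ", "})
```

A new parsing form could be added with `register_extractor` without changing `apply_rule`, which is one way to read the extensibility described above.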
In the embodiment of the present application, the parsing template used when executing the parsing flow corresponds to the scheduling template. In specific implementation, the correspondence between the analysis template and the scheduling template may be preset.
In this case, the correspondence may be established through the storage locations of the two templates. The storage locations of any two scheduling templates are different, the storage locations of any two parsing templates are also different, and a scheduling template and a parsing template that correspond to each other may be stored in the same location or in different locations. Preferably, the corresponding scheduling template and parsing template are stored in the same storage location.
Alternatively, the correspondence may be preset through identification information. For example, the identification information of the parsing template may be written into the scheduling information of the scheduling template, so that, when the method shown in fig. 1 is carried out, the parsing template corresponding to the scheduling template can be determined by directly reading the identification information of the parsing template carried in the scheduling information.
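The identification-based correspondence can be pictured as a lookup in a registry keyed by template name. The sketch below is an assumption-laden illustration: the registry `PARSING_TEMPLATES`, the `parse_template` key and the function name are hypothetical, and the parsing templates are reduced to dictionaries of regular expressions.

```python
# Hypothetical registry of parsing templates, keyed by identification information.
PARSING_TEMPLATES = {
    "news-list": {"title": r"<h1>(.*?)</h1>"},
    "forum-thread": {"title": r"<h2>(.*?)</h2>"},
}


def resolve_parsing_template(scheduling_info: dict, registry: dict) -> dict:
    """Return the parsing template named in the scheduling information."""
    template_id = scheduling_info["parse_template"]
    try:
        return registry[template_id]
    except KeyError:
        # Fail loudly when the scheduling template names an unknown parsing template.
        raise LookupError(f"no parsing template named {template_id!r}") from None


template = resolve_parsing_template({"parse_template": "news-list"}, PARSING_TEMPLATES)
```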
Alternatively, in a preferred implementation scenario, a data acquisition template may be preset, and the scheduling template and the parsing template are two parts of the data acquisition template. At this time, the data acquisition template may further include an analysis result template and a site information template.
The parsing result template specifies how the parsing results returned by the parsing template to the scheduling template are to be used, including format rules for the finally persisted data and for intermediate information. It may define, for example, the result data type and the form in which results are persisted.
The site information template contains information about the site to which the channel belongs and identifies the level of the template, so that the data acquisition device can quickly locate and read the template.
Based on the foregoing architecture, in the embodiment of the present application, when the data acquisition task is specifically executed, the scheduling module 21 obtains the scheduling information carried in the scheduling template, and assembles a network request according to the instruction of the scheduling template, where the network request is used to request downloading of the web page source code. That is, the generated network request carries information about the source code of the web page that needs to be downloaded by the download module 22.
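Assembling a network request from scheduling information might look like the following sketch, which uses the standard `urllib.request.Request` class and only constructs the request without sending it. The keys `entry`, `method`, `params` and `headers` are assumed names for fields of the scheduling information and are not defined by the application.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(info: dict) -> Request:
    """Assemble (but do not send) a request from scheduling information."""
    url = info["entry"]
    params = info.get("params")
    method = info.get("method", "GET").upper()
    data = None
    if params and method == "GET":
        # GET parameters go into the query string.
        url = f"{url}?{urlencode(params)}"
    elif params:
        # Other request modes carry the parameters in the request body.
        data = urlencode(params).encode("utf-8")
    return Request(url, data=data, headers=info.get("headers", {}), method=method)


req = build_request({
    "entry": "https://example.com/search",
    "method": "get",
    "params": {"q": "xml", "page": 2},
})
```

The resulting object could then be handed to a downloading component, for example one built on `urllib.request.urlopen`.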
In the embodiment of the application, to support invoking the templates, a template parser may be provided in the data acquisition device or in a superior device in which the data acquisition device is located.
In a possible design, referring to fig. 3, which shows the data flow of another data acquisition method, during data acquisition the scheduling module 21 invokes the scheduling template and the parsing result template by calling a first parser 300 provided in the superior device in which the data acquisition device 200 is located, and the parsing module 23 invokes the parsing template by calling the second parser 400.
Further, the arrow in fig. 3 indicates the data transmission direction. Specifically, in the system shown in fig. 3, in the process of executing data collection, the scheduling module 21 sends scheduling information to the downloading module 22, after the downloading module 22 downloads the web page source code, the web page source code and the scheduling information are sent to the analysis module 23, and after analysis is completed, the analysis module sends the target data to the scheduling module 21.
In addition, the embodiment of the application does not particularly limit how data is exchanged among the modules of the data acquisition device. If the scheduling module 21, the downloading module 22 and the parsing module 23 are different processing units in one processor, or are at least two processors in one data acquisition device, data may be transmitted through interfaces. Alternatively, the scheduling module 21, the downloading module 22 and the parsing module 23 may be integrated in the same processor or processing unit.
It should be noted that fig. 3 is a possible implementation manner, and is not meant to limit the present application.
In the embodiment of the application, the first parser can be used as a scheduling template parser for parsing the scheduling template to obtain scheduling information from the scheduling template.
In this implementation scenario, the step S102 may be implemented by the following means when it is specifically implemented:
and the scheduling module calls a first analyzer and acquires the scheduling information obtained by the first analyzer analyzing the scheduling template.
The second parser serves as a parse template parser, and is specifically configured to parse a parse template. In specific implementation, the second parser parses the web page source code according to the parsing template to obtain a parsing result, that is, target data. In a preferred design, the second parser is specifically configured to parse the parsing template, and may further be configured to perform post-formatting processing on the parsed candidate data, to finally obtain formatted target data.
For ease of understanding, the embodiment of the present application provides a possible implementation of the second parser when it parses according to the parsing template: the second parser constructs the web page source code into a document object model (Document Object Model, DOM) tree according to the parsing template, then traverses the DOM tree depth-first, parses each node in the DOM tree to obtain parsed candidate data, and converts the candidate data into the standard format according to the parsing result template to obtain the target data.
In the embodiment of the application, the second parser can be set by adopting the idea of a state machine. That is, each instruction received by the parser changes the state of the parser, so that after traversing all nodes of the DOM tree, the data contained in the final state will be returned as a final result.
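The depth-first traversal combined with a state-machine style parser can be sketched as follows. This is an illustrative simplification, not the application's parser: the tree is built with the standard `xml.etree.ElementTree`, so the source must be well-formed markup, and the rule format mapping tag names to actions (`start_item` or a field name) is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


class ExtractionState:
    """State that accumulates candidate data as nodes are visited."""

    def __init__(self):
        self.current_item = None
        self.items = []

    def handle(self, node, rules: dict) -> None:
        # Each visited node acts as an instruction that may change the state.
        action = rules.get(node.tag)
        if action == "start_item":
            self.current_item = {}
            self.items.append(self.current_item)
        elif action is not None and self.current_item is not None:
            self.current_item[action] = (node.text or "").strip()


def depth_first(node):
    """Yield the node, then each subtree in document order (pre-order DFS)."""
    yield node
    for child in node:
        yield from depth_first(child)


def parse_source(source: str, rules: dict) -> list:
    root = ET.fromstring(source)
    state = ExtractionState()
    for node in depth_first(root):
        state.handle(node, rules)
    # The data held in the final state is returned as the result.
    return state.items


SOURCE = (
    "<div>"
    "<article><h2>First</h2><span>10</span></article>"
    "<article><h2>Second</h2><span>20</span></article>"
    "</div>"
)
RULES = {"article": "start_item", "h2": "title", "span": "views"}

items = parse_source(SOURCE, RULES)
```

Keeping all accumulated data in one state object means the traversal itself stays generic, and only the rules change from one page format to another.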
In this implementation scenario, the step S108 may be implemented by the following means:
and the analysis module calls a second analyzer and acquires the target data obtained by the second analyzer through processing the webpage source codes according to the analysis template.
To facilitate data processing, in the embodiment of the present application, the target data obtained by the data acquisition method may be output in a preset standard format. As previously described, the format may be defined in the parsing template, so that the second parser performs the format conversion described above when parsing data according to the parsing template.
In addition, in the embodiment of the present application, after obtaining the target data, the parsing module may further execute the following flow:
the analysis module sends the target data to the scheduling module;
and the scheduling module outputs or stores the target data according to the scheduling template.
In the embodiment of the application, the scheduling module realizes the global scheduling of the data acquisition process, so that the scheduling module can output or store the target data according to the indication of the scheduling template after receiving the target data sent by the analysis module. In addition, the scheduling module can also execute the next round of data acquisition after receiving the target data.
In a preferred implementation scenario, the target data parsed by the parsing module and sent to the scheduling module is data in a standard format, and then the scheduling module may further verify the format of the target data after receiving the target data.
At this time, as shown in fig. 3, the first parser may further be used as a parse result template parser, where the first parser may specifically use the parse result template to check the target data sent by the parse module, so as to finally realize control of the scheduling flow.
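Checking target data against a parsing result template could be as simple as the sketch below. It assumes, purely for illustration, that the parsing result template is reduced to a mapping from field names to expected Python types; the application does not specify the template at this level of detail, and `RESULT_TEMPLATE` and `validate` are hypothetical names.

```python
# Hypothetical parsing result template: field name -> expected type.
RESULT_TEMPLATE = {"title": str, "views": int}


def validate(record: dict, result_template: dict) -> list:
    """Return a list of format problems; an empty list means the record passes."""
    errors = []
    for field_name, expected in result_template.items():
        if field_name not in record:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected):
            actual = type(record[field_name]).__name__
            errors.append(f"{field_name}: expected {expected.__name__}, got {actual}")
    return errors


ok_errors = validate({"title": "First", "views": 10}, RESULT_TEMPLATE)
bad_errors = validate({"title": "First", "views": "10"}, RESULT_TEMPLATE)
```

A scheduling component could use the returned list to decide whether to output or store a record or to discard it.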
The technical scheme provided by the embodiment of the application at least has the following technical effects:
in the data acquisition scheme provided by the embodiment of the application, the scheduling module controls the scheduling of the whole data acquisition flow according to the scheduling information, and the parsing module parses the webpage source code downloaded by the downloading module using the parsing template that matches the scheduling information. Because the scheduling template is written and stored in a dynamic template language, the template is more flexible and its coverage and accuracy are improved, which greatly reduces the labor cost of large-scale Internet data acquisition and improves data acquisition efficiency.
Example two
Based on the data acquisition method provided in the first embodiment, the embodiment of the present application further provides an apparatus embodiment for implementing each step and method in the foregoing method embodiment.
An embodiment of the present application provides a data acquisition device. Referring to fig. 2 or fig. 3, the data acquisition device 200 includes a scheduling module 21, a downloading module 22 and a parsing module 23, wherein:
the scheduling module 21 is configured to obtain scheduling information in a scheduling template, where the scheduling template is written and stored in a dynamic template language;
the scheduling module 21 is further configured to generate a network request according to the scheduling information;
the downloading module 22 is configured to download the web page source code according to the network request;
the parsing module 23 is configured to process the web page source code by using a parsing template, so as to obtain target data, where the parsing template corresponds to the scheduling template.
In one possible implementation scenario, the scheduling module 21 is specifically configured to:
and calling a first analyzer, and acquiring the scheduling information obtained by analyzing the scheduling template by the first analyzer.
In the embodiment of the present application, the scheduling information further includes at least one of the following information: portal information, frequency control, request mode, preprocessing method, subordinate data fusion mode, and additional data processing method.
In another possible implementation scenario, the parsing module 23 is specifically configured to:
call a second parser, and obtain the target data obtained by the second parser processing the web page source code according to the parsing template.
In addition, in the embodiment of the present application, the parsing module 23 is further configured to send the target data to the scheduling module 21;
the scheduling module 21 is further configured to output or store the target data according to the scheduling template.
Moreover, referring to fig. 4, an embodiment of the present application provides a data acquisition device, the data acquisition device 400 includes:
a memory 410;
a processor 420; and
a computer program;
wherein the computer program is stored in the memory 410 and configured to be executed by the processor 420 to implement the method as described in the above embodiments.
In addition, as shown in fig. 4, a transceiver 430 is further provided in the data acquisition device 400, for performing data transmission or communication with other devices, which will not be described herein.
As shown in fig. 4, the memory 410, the processor 420 and the transceiver 430 are connected by a bus.
Furthermore, an embodiment of the present application provides a readable storage medium having stored thereon a computer program,
the computer program is executed by a processor to implement the method as described in embodiment one.
Since each module in this embodiment is capable of executing the method shown in embodiment one, a part of this embodiment which is not described in detail can be referred to the description related to embodiment one.
The technical scheme provided by the embodiment of the application at least has the following technical effects:
in the data acquisition scheme provided by the embodiment of the application, the scheduling module controls the scheduling of the whole data acquisition flow according to the scheduling information, and the parsing module parses the webpage source code downloaded by the downloading module using the parsing template that matches the scheduling information. Because the scheduling template is written and stored in a dynamic template language, the template is more flexible and its coverage and accuracy are improved, which greatly reduces the labor cost of large-scale Internet data acquisition and improves data acquisition efficiency.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.