CN111125589A - Data acquisition method and device and computer readable storage medium - Google Patents

Publication number
CN111125589A
Authority
CN
China
Prior art keywords
scheduling
template
module
analysis
data acquisition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811283037.4A
Other languages
Chinese (zh)
Other versions
CN111125589B (en)
Inventor
李宇涵
曹六一
张丹
于晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN201811283037.4A
Publication of CN111125589A
Application granted
Publication of CN111125589B
Active
Anticipated expiration

Abstract

The invention provides a data acquisition method and device and a computer-readable storage medium. In the method, a scheduling module obtains scheduling information from a scheduling template that is written and stored in a dynamic template language, and generates a network request according to the scheduling information; a downloading module downloads webpage source code according to the network request; and a parsing module processes the webpage source code by using a parsing template that corresponds to the scheduling template to obtain target data. The method improves the flexibility and the efficiency of the data acquisition process.

Description

Data acquisition method and device and computer readable storage medium
Technical Field
The present invention relates to data processing technologies, and in particular, to a data acquisition method and apparatus, and a computer-readable storage medium.
Background
In recent years, with the rapid development of the internet, more and more people acquire and publish information through the network, and both the volume and the value of data on the internet are increasing day by day. Because the internet now occupies an important position among the channels through which individuals obtain information, the analysis of internet big data has important application value for various industries. A necessary precondition for analyzing internet big data is collecting the relevant data.
Current systems generally acquire data in one of two ways: customized collection for a particular website, or template collection tailored to a certain class of websites. However, customized collection for a specific website requires a long development period and has low extensibility and poor flexibility; template collection suited to one class of websites cannot be applied to websites of other types, so its extensibility is limited and its flexibility is poor.
Existing data acquisition approaches are thus limited in applicability to varying degrees, so how to realize data acquisition in different application scenarios has become a technical problem to be solved urgently in the field.
Disclosure of Invention
The invention provides a data acquisition method and device and a computer readable storage medium, aiming to improve the flexibility of a data acquisition process and improve the data acquisition efficiency.
In a first aspect, the present invention provides a data acquisition method, including:
a scheduling module obtains scheduling information in a scheduling template, wherein the scheduling template is written and stored in a dynamic template language;
the scheduling module generates a network request according to the scheduling information;
a downloading module downloads webpage source code according to the network request;
and a parsing module processes the webpage source code by using a parsing template to obtain target data, wherein the parsing template corresponds to the scheduling template.
In one possible design, the obtaining, by the scheduling module, of scheduling information in a scheduling template includes:
the scheduling module calls a first parser and acquires the scheduling information obtained by the first parser parsing the scheduling template.
In another possible design, the scheduling information includes at least one of the following: entry information, frequency control, a request mode, a preprocessing method, a subordinate data fusion mode and an additional data processing method.
In another possible design, the processing, by the parsing module, of the webpage source code by using a parsing template to obtain target data includes:
the parsing module calls a second parser and acquires the target data obtained by the second parser processing the webpage source code according to the parsing template.
In another possible design, the method further includes:
the parsing module sends the target data to the scheduling module;
and the scheduling module outputs or stores the target data according to the scheduling template.
In a second aspect, the present invention provides a data acquisition apparatus comprising a scheduling module, a downloading module and a parsing module, wherein:
the scheduling module is used for acquiring scheduling information in a scheduling template, and the scheduling template is written and stored in a dynamic template language;
the scheduling module is further used for generating a network request according to the scheduling information;
the downloading module is used for downloading webpage source code according to the network request;
the parsing module is used for processing the webpage source code by using a parsing template to obtain target data, where the parsing template corresponds to the scheduling template.
In one possible design, the scheduling module is specifically configured to:
call a first parser, and acquire the scheduling information obtained by the first parser parsing the scheduling template.
In another possible design, the scheduling information includes at least one of the following: entry information, frequency control, a request mode, a preprocessing method, a subordinate data fusion mode and an additional data processing method.
In another possible design, the parsing module is specifically configured to:
call a second parser, and acquire the target data obtained by the second parser processing the webpage source code according to the parsing template.
In another possible design, the parsing module is further configured to send the target data to the scheduling module;
the scheduling module is further configured to output or store the target data according to the scheduling template.
In a third aspect, the present invention provides a data acquisition apparatus comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method according to any design of the first aspect.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program,
wherein the computer program is executed by a processor to implement the method according to any design of the first aspect.
In the data acquisition scheme provided by the invention, the scheduling module controls the scheduling of the whole data acquisition process according to the scheduling information, and the parsing module parses the webpage source code downloaded by the downloading module by using the parsing template that matches the scheduling information. Because the scheduling template is written and stored in a dynamic template language, the flexibility of the template is effectively improved, the coverage and accuracy of the scheduling template are improved, the labor cost of large-scale internet data acquisition is greatly reduced, and the data acquisition efficiency is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic flow chart of a data acquisition method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a data acquisition device according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of another data acquisition device according to an embodiment of the present invention;
Fig. 4 is a schematic physical structure diagram of a data acquisition device according to an embodiment of the present invention.
Specific embodiments of the present disclosure are shown in the foregoing drawings and are described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The specific application scenario of the invention is internet data acquisition. Existing data acquisition methods can only be applied to one specific website or one class of websites, and are difficult to apply or extend to other websites or to websites of different types, so their flexibility is low.
The data acquisition method, the data acquisition device and the computer-readable storage medium provided by the invention aim to solve the above technical problems in the prior art, based on the following idea: the template structure is organized by using a dynamic template language, such as Extensible Markup Language (XML), and corresponding parsing plug-ins are configured to parse the templates, thereby realizing an efficient and highly general template collection mode.
The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Example one
The embodiment of the invention provides a data acquisition method. Specifically, referring to fig. 1 and fig. 2, fig. 1 shows the data acquisition method, and fig. 2 shows an execution device of the data acquisition method (hereinafter referred to as the data acquisition device). As shown in fig. 2, the data acquisition device includes a scheduling module 21, a downloading module 22 and a parsing module 23, which may perform data acquisition according to the method shown in fig. 1.
Specifically, as shown in fig. 1, the method includes the following steps:
S102, the scheduling module 21 obtains scheduling information in a scheduling template, where the scheduling template is written and stored in a dynamic template language.
S104, the scheduling module 21 generates a network request according to the scheduling information.
S106, the downloading module 22 downloads the webpage source code according to the network request.
S108, the parsing module 23 processes the webpage source code by using the parsing template to obtain target data, where the parsing template corresponds to the scheduling template.
First, the scheduling template and the parsing template referred to in fig. 1 will be explained.
In the embodiment of the present invention, the scheduling template is written and stored in a dynamic template language, and the parsing template may be written and stored in a dynamic template language, or may be written and stored in other manners, which is not particularly limited in the embodiment of the present invention.
Taking the scheduling template as an example, before executing the method, the method may further include the following steps:
the scheduling template is written and stored in a dynamic template language.
In specific implementation, this can be realized by extracting data from websites of the same type, and generating and storing the template in a dynamic template language. In this process, the collected content can be extended without writing a program, which improves collection development efficiency and data collection accuracy. The parsing template may be written and stored in the dynamic template language in the same manner, which is not described again.
The dynamic template language according to the embodiment of the present invention may include, but is not limited to, Extensible Markup Language (XML). It can be used to define the operations of multiple stages of the data acquisition process, such as task scheduling, source code parsing, data transmission and data formatting.
In the embodiment of the invention, the scheduling template carries scheduling information, which describes how a specific channel is scheduled. The scheduling information may include, but is not limited to, at least one of the following: entry information, frequency control, a request mode, a preprocessing method, a subordinate data fusion mode, an additional data processing method, and identification information of the corresponding parsing template. The identification information of a template is used to uniquely identify the template, and its representation may include, but is not limited to, a name.
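As a minimal sketch of how such a scheduling template might look and be read, the following Python example assumes a hypothetical XML layout. The element and attribute names (`schedule`, `entry`, `frequency`, `request`, `parse-template`) and the example URL are illustrative assumptions; the embodiment does not define a concrete schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical scheduling template: every element and attribute name here is
# an assumption made for illustration, not a schema defined by the embodiment.
SCHEDULING_TEMPLATE = """
<schedule id="news-channel" parse-template="news-list">
  <entry url="https://example.com/news/list?page=1"/>
  <frequency interval-seconds="600"/>
  <request method="GET"/>
</schedule>
"""


def parse_scheduling_template(xml_text):
    """Read the template and return the scheduling information as a dict."""
    root = ET.fromstring(xml_text)
    return {
        "template_id": root.get("id"),
        # Identification information of the corresponding parsing template.
        "parse_template_id": root.get("parse-template"),
        "entry_url": root.find("entry").get("url"),
        "interval_seconds": int(root.find("frequency").get("interval-seconds")),
        "method": root.find("request").get("method"),
    }


scheduling_info = parse_scheduling_template(SCHEDULING_TEMPLATE)
```

Because the template is plain XML, a new channel can be added in this sketch by writing another template rather than by changing the program.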
The parsing template is used for carrying the parsing information of a class of web pages with a specific format, that is, for describing how valid data is extracted from a class of web pages with a specific format. The embodiment of the invention places no limitation on the parsing method indicated by the parsing template, and the parsing method can be preset as required.
In a preferred implementation scenario, if the parsing template is written and stored in a dynamic template language, its parsing functions can be extended in multiple ways. For example, template parsing may support several different parsing methods such as XPath, CSS selectors and regular expressions, and may be extended to support other forms such as JSONPath.
In another preferred implementation scenario, the parsing template can extract data on the web page in the above-mentioned manners, and can format the extracted data. The format processing mode may include, but is not limited to: at least one of splicing, cutting, combining and formatting.
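The combination of several extraction methods with a formatting step can be sketched as follows. This is an illustrative example only: the `field` element, its `type`, `rule` and `format` attributes, and the sample page are assumptions, and the sketch relies on the limited XPath subset of Python's `xml.etree.ElementTree`, which requires well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical parsing template; the attribute names are assumptions.
PARSING_TEMPLATE = r"""
<parse id="news-list">
  <field name="title" type="xpath" rule=".//h1"/>
  <field name="date" type="regex" rule="(\d{4})-(\d{2})-(\d{2})" format="{0}/{1}/{2}"/>
</parse>
"""

PAGE = "<html><body><h1> Hello </h1><p>Published 2018-10-31</p></body></html>"


def extract(template_xml, page_source):
    """Apply each field rule of the template to the page source."""
    template = ET.fromstring(template_xml)
    dom = ET.fromstring(page_source)  # assumes well-formed markup
    result = {}
    for field in template.findall("field"):
        kind, rule = field.get("type"), field.get("rule")
        if kind == "xpath":
            node = dom.find(rule)
            value = node.text.strip() if node is not None and node.text else None
        elif kind == "regex":
            match = re.search(rule, page_source)
            if match is None:
                value = None
            else:
                # Optional formatting step: recombine the captured groups.
                fmt = field.get("format")
                value = fmt.format(*match.groups()) if fmt else match.group(0)
        else:
            raise ValueError(f"unsupported rule type: {kind}")
        result[field.get("name")] = value
    return result


target = extract(PARSING_TEMPLATE, PAGE)
```

Supporting a further method such as CSS selectors or JSONPath would, in this sketch, amount to adding another branch keyed on the `type` attribute.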
It should be further noted that, in the embodiment of the present invention, the parsing template used when the parsing process is executed corresponds to the scheduling template. In a specific implementation, the correspondence between the parsing template and the scheduling template may be preset.
For example, the correspondence may be established through the storage locations of the two templates. In this case, any two scheduling templates have different storage locations, any two parsing templates also have different storage locations, and a scheduling template and the parsing template that corresponds to it may have the same or different storage locations. Preferably, a scheduling template and its corresponding parsing template are stored in the same storage location.
Or, the correspondence may be preset through the identification information of the two templates. For example, the identification information of the parsing template may be written into the scheduling information of the scheduling template, so that when the method steps shown in fig. 1 are implemented, the parsing template corresponding to the scheduling template can be determined by directly reading the identification information of the parsing template carried in the scheduling information.
Alternatively, in a preferred implementation scenario, a data acquisition template may be preset, and the scheduling template and the parsing template are two parts of the data acquisition template. At this time, the data acquisition template may further include an analysis result template and a site information template.
The parsing result template specifies how every parsing result returned by the parsing template to the scheduling template is used, including format specifications for the final persisted data and for intermediate information. It can define a variety of information, such as the type of the result data and the form in which the result is persisted.
The site information template contains information about the site to which the channel belongs and is used to identify the template hierarchy, so that the data acquisition device can locate and read templates quickly and conveniently.
Based on the foregoing architecture, in the embodiment of the present invention, when a data acquisition task is executed specifically, the scheduling module 21 acquires scheduling information carried in a scheduling template, and assembles a network request according to an instruction of the scheduling template, where the network request is used to request downloading of a web page source code. That is, the generated network request carries the information related to the source code of the web page that the downloading module 22 needs to download.
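Assembling a network request from scheduling information can be sketched as follows, using `urllib.request.Request` from the Python standard library. The dictionary keys (`entry_url`, `method`, `user_agent`) and the default user-agent string are hypothetical names chosen for this example; no request is actually sent.

```python
from urllib.request import Request


def build_request(scheduling_info):
    """Assemble, without sending, a request for the page named in the scheduling information."""
    return Request(
        url=scheduling_info["entry_url"],
        method=scheduling_info.get("method", "GET"),
        headers={
            # Hypothetical default; a real deployment would choose its own value.
            "User-Agent": scheduling_info.get("user_agent", "example-collector/0.1"),
        },
    )


request = build_request(
    {"entry_url": "https://example.com/news/list", "method": "GET"}
)
```

The resulting object could then be handed to whatever component plays the role of the downloading module, for instance one that calls `urllib.request.urlopen(request)`.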
In the embodiment of the present invention, to handle template calling, a template parser may further be provided in the data acquisition device or in the upper-level device in which the data acquisition device is located.
In a possible design, please refer to fig. 3, which shows the data flow of another data acquisition method. In the data acquisition process, the scheduling module 21 calls the scheduling template and the parsing result template through the first parser 300, which is provided in the upper-level device where the data acquisition device 200 is located, and the parsing module 23 calls the parsing template through the second parser 400.
Further, the arrows in fig. 3 indicate the data transmission direction. Specifically, in the system shown in fig. 3, during data acquisition the scheduling module 21 sends the scheduling information to the downloading module 22; after downloading the webpage source code, the downloading module 22 sends the webpage source code and the scheduling information to the parsing module 23; and after completing parsing and obtaining the target data, the parsing module 23 sends the target data to the scheduling module 21.
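The data flow just described can be sketched as a small pipeline in which the downloading and parsing steps are passed in as functions. All names here (`run_collection`, `fake_download`, `fake_parse` and the dictionary keys) are hypothetical, and the stand-in downloader returns a fixed string instead of contacting a network.

```python
def run_collection(scheduling_info, download, parse):
    """Scheduling side of one collection round; returns the target data."""
    # The scheduling module assembles the request from the scheduling information.
    request = {
        "url": scheduling_info["entry_url"],
        "method": scheduling_info["method"],
    }
    # The downloading module fetches the page source for the request.
    page_source = download(request)
    # The parsing module receives the page source together with the scheduling information.
    target_data = parse(page_source, scheduling_info)
    # The target data comes back to the scheduling side for output or storage.
    return target_data


def fake_download(request):
    """Stand-in downloader that avoids any network access."""
    return "<html><title>Demo</title></html>"


def fake_parse(page_source, scheduling_info):
    """Stand-in parser that extracts the text between the title tags."""
    start = page_source.index("<title>") + len("<title>")
    end = page_source.index("</title>")
    return {
        "template": scheduling_info["parse_template_id"],
        "title": page_source[start:end],
    }


result = run_collection(
    {"entry_url": "https://example.com/", "method": "GET", "parse_template_id": "demo"},
    fake_download,
    fake_parse,
)
```

Passing the modules in as functions keeps the sketch independent of how the modules are actually deployed.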
In addition, the embodiment of the present invention does not particularly limit the data interaction manner between the modules in the data acquisition device. If the scheduling module 21, the downloading module 22 and the parsing module 23 are different processing units in one processor, or are located in at least two processors of one data acquisition device, data can be transmitted between them through interfaces. Alternatively, the scheduling module 21, the downloading module 22 and the parsing module 23 may also be integrated in the same processor or processing unit.
It is understood that fig. 3 is a possible implementation and is not intended to limit the present application.
In the embodiment of the present invention, the first parser may be used as a scheduling template parser for parsing a scheduling template to obtain scheduling information from the scheduling template.
In this implementation scenario, the foregoing step S102 may be implemented by:
the scheduling module calls the first parser and acquires the scheduling information obtained by the first parser parsing the scheduling template.
The second parser serves as a parsing template parser and is specifically used for parsing the parsing template. In specific implementation, the second parser parses the webpage source code according to the parsing template to obtain a parsing result, that is, the target data. In a preferred design, the second parser may also be used for formatting the parsed candidate data as a post-processing step, so as to finally obtain formatted target data.
For ease of understanding, the embodiment of the present invention provides one possible implementation of the second parser: according to the parsing template, the second parser builds the webpage source code into a Document Object Model (DOM) tree, traverses the DOM tree depth-first, and parses each node in the DOM tree to obtain candidate data; it then converts the candidate data into a standard format according to the parsing result template to obtain the target data.
In the embodiment of the invention, the second parser can be designed as a state machine. That is, each instruction received by the parser changes the state of the parser, so that after all nodes of the DOM tree have been traversed, the data contained in the final state is returned as the final result.
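A depth-first traversal driven by a parser that keeps its result in an evolving state can be sketched as follows. The class name, the `wanted_tags` parameter and the shape of the state are assumptions made for illustration, and the tree is built with `xml.etree.ElementTree`, so the sample page must be well-formed.

```python
import xml.etree.ElementTree as ET


class StateMachineParser:
    """Illustrative parser whose state changes with every node it is fed."""

    def __init__(self, wanted_tags):
        self.wanted_tags = set(wanted_tags)
        self.state = {"visited": 0, "data": []}

    def feed(self, node):
        """Handle one node; each call moves the parser to a new state."""
        self.state["visited"] += 1
        if node.tag in self.wanted_tags and node.text:
            self.state["data"].append((node.tag, node.text.strip()))

    def run(self, page_source):
        """Traverse the tree depth-first and return the data of the final state."""
        root = ET.fromstring(page_source)
        stack = [root]
        while stack:
            node = stack.pop()
            self.feed(node)
            # Push children in reverse so they are visited in document order.
            stack.extend(reversed(list(node)))
        return self.state["data"]


PAGE = "<html><body><h1>Title</h1><div><p>First</p></div><p>Second</p></body></html>"
parser = StateMachineParser({"h1", "p"})
candidates = parser.run(PAGE)
```

In this sketch the candidate data is simply a list of tag and text pairs; converting it into a standard format would be a separate step.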
In this implementation scenario, the foregoing step S108 may be implemented as follows:
the parsing module calls the second parser and acquires the target data obtained by the second parser processing the webpage source code according to the parsing template.
In order to facilitate data processing, in the embodiment of the present invention, the target data obtained by the data acquisition method may be output in a standard format preset by a user. As mentioned above, this may be defined in the parsing template, so that the second parser performs the format conversion process when parsing the data according to the parsing template.
In addition, in the embodiment of the present invention, after the parsing module obtains the target data, the following process may be further performed:
the parsing module sends the target data to the scheduling module;
and the scheduling module outputs or stores the target data according to the scheduling template.
In the embodiment of the invention, the scheduling module realizes the global scheduling of the data acquisition process, so that the scheduling module can output or store the target data according to the indication of the scheduling template after receiving the target data sent by the analysis module. In addition, the scheduling module can also execute the next round of data acquisition after receiving the target data.
In a preferred implementation scenario, the target data that the parsing module obtains and sends to the scheduling module is in a standard format, so the scheduling module may check the format of the target data after receiving it.
In this case, as shown in fig. 3, the first parser may also serve as a parsing result template parser. Specifically, the first parser may use the parsing result template to verify the target data sent by the parsing module, thereby finally realizing control over the scheduling process.
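Verifying target data against a parsing result template can be sketched as follows. The XML layout of the result template (`field` elements with `name`, `type` and `required` attributes, and a `sink` attribute on the root) is an assumption made for this example, not a format given by the embodiment.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsing result template; all names are assumptions.
RESULT_TEMPLATE = """
<result id="news-list" sink="file">
  <field name="title" type="str" required="true"/>
  <field name="views" type="int" required="false"/>
</result>
"""

TYPES = {"str": str, "int": int}


def validate(target_data, template_xml):
    """Return a list of format problems; an empty list means the data conforms."""
    errors = []
    for field in ET.fromstring(template_xml).findall("field"):
        name = field.get("name")
        if name not in target_data:
            if field.get("required") == "true":
                errors.append(f"missing field: {name}")
            continue
        if not isinstance(target_data[name], TYPES[field.get("type")]):
            errors.append(f"wrong type for field: {name}")
    return errors


ok_errors = validate({"title": "Hello", "views": 3}, RESULT_TEMPLATE)
bad_errors = validate({"views": "3"}, RESULT_TEMPLATE)
```

A scheduling component could use an empty error list as the condition for storing the data and starting the next round of acquisition.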
The technical scheme provided by the embodiment of the invention at least has the following technical effects:
In the data acquisition scheme provided by the embodiment of the invention, the scheduling module controls the scheduling of the whole data acquisition process according to the scheduling information, and the parsing module parses the webpage source code downloaded by the downloading module by using the parsing template that matches the scheduling information. Because the scheduling template is written and stored in a dynamic template language, the flexibility of the template is effectively improved, the coverage and accuracy of the scheduling template are improved, the labor cost of large-scale internet data acquisition is greatly reduced, and the data acquisition efficiency is improved.
Example two
Based on the data acquisition method provided in the first embodiment, the embodiment of the present invention further provides an embodiment of an apparatus for implementing each step and method in the above method embodiment.
Referring to fig. 2 or 3, the data acquisition apparatus 200 according to an embodiment of the present invention includes a scheduling module 21, a downloading module 22 and a parsing module 23, wherein:
the scheduling module 21 is configured to obtain scheduling information in a scheduling template, where the scheduling template is written and stored in a dynamic template language;
the scheduling module 21 is further configured to generate a network request according to the scheduling information;
the downloading module 22 is configured to download the webpage source code according to the network request;
the parsing module 23 is configured to process the webpage source code by using a parsing template to obtain target data, where the parsing template corresponds to the scheduling template.
In a possible implementation scenario, the scheduling module 21 is specifically configured to:
call a first parser, and acquire the scheduling information obtained by the first parser parsing the scheduling template.
In this embodiment of the present invention, the scheduling information includes at least one of the following: entry information, frequency control, a request mode, a preprocessing method, a subordinate data fusion mode and an additional data processing method.
In another possible implementation scenario, the parsing module 23 is specifically configured to:
call a second parser, and acquire the target data obtained by the second parser processing the webpage source code according to the parsing template.
In addition, in this embodiment of the present invention, the parsing module 23 is further configured to send the target data to the scheduling module 21;
the scheduling module 21 is further configured to output or store the target data according to the scheduling template.
Also, an embodiment of the present invention provides a data acquisition apparatus, please refer to fig. 4, where the data acquisition apparatus 400 includes:
a memory 410;
a processor 420; and
a computer program;
wherein the computer program is stored in the memory 410 and configured to be executed by the processor 420 to implement the methods as described in the above embodiments.
In addition, as shown in fig. 4, a transceiver 430 is further disposed in the data acquisition apparatus 400 for data transmission or communication with other devices, which is not described herein again.
As shown in fig. 4, the memory 410, the processor 420 and the transceiver 430 are connected by a bus.
Furthermore, an embodiment of the present invention provides a readable storage medium, on which a computer program is stored,
wherein the computer program is executed by a processor to implement the method according to the first embodiment.
Since each module in this embodiment can execute the method shown in the first embodiment, for any part of this embodiment that is not described in detail, reference may be made to the related description of the first embodiment.
The technical scheme provided by the embodiment of the invention at least has the following technical effects:
In the data acquisition scheme provided by the embodiment of the invention, the scheduling module controls the scheduling of the whole data acquisition process according to the scheduling information, and the parsing module parses the webpage source code downloaded by the downloading module by using the parsing template that matches the scheduling information. Because the scheduling template is written and stored in a dynamic template language, the flexibility of the template is effectively improved, the coverage and accuracy of the scheduling template are improved, the labor cost of large-scale internet data acquisition is greatly reduced, and the data acquisition efficiency is improved.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method of data acquisition, comprising:
the method comprises the steps that a scheduling module obtains scheduling information in a scheduling template, and the scheduling template is compiled and stored in a dynamic template language;
the scheduling module generates a network request according to the scheduling information;
the downloading module downloads the webpage source code according to the network request;
and the analysis module processes the webpage source code by using an analysis template to obtain target data, wherein the analysis template corresponds to the scheduling template.
2. The method of claim 1, wherein the scheduling module obtaining the scheduling information from the scheduling template comprises:
the scheduling module calling a first analyzer and acquiring the scheduling information obtained by the first analyzer analyzing the scheduling template.
3. The method according to claim 1 or 2, wherein the scheduling information further comprises at least one of the following information: entry information, frequency control, a request mode, a preprocessing method, a subordinate data fusion mode and an additional data processing method.
4. The method of claim 1, wherein the analysis module processing the webpage source code by using the analysis template to obtain the target data comprises:
the analysis module calling a second analyzer and acquiring the target data obtained by the second analyzer processing the webpage source code according to the analysis template.
5. The method of claim 1, further comprising:
the analysis module sending the target data to the scheduling module; and
the scheduling module outputting or storing the target data according to the scheduling information.
6. A data acquisition device, comprising: a scheduling module, a downloading module and an analysis module; wherein
the scheduling module is configured to obtain scheduling information from a scheduling template, wherein the scheduling template is compiled and stored in a dynamic template language;
the scheduling module is further configured to generate a network request according to the scheduling information;
the downloading module is configured to download webpage source code according to the network request; and
the analysis module is configured to process the webpage source code by using an analysis template to obtain target data, wherein the analysis template corresponds to the scheduling template.
7. The apparatus of claim 6, wherein the scheduling module is specifically configured to:
call a first analyzer, and acquire the scheduling information obtained by the first analyzer analyzing the scheduling template.
8. The apparatus of claim 6 or 7, wherein the scheduling information further comprises at least one of the following information: entry information, frequency control, a request mode, a preprocessing method, a subordinate data fusion mode and an additional data processing method.
9. The apparatus of claim 6, wherein the analysis module is specifically configured to:
call a second analyzer, and acquire the target data obtained by the second analyzer processing the webpage source code according to the analysis template.
10. The apparatus of claim 6, wherein
the analysis module is further configured to send the target data to the scheduling module; and
the scheduling module is further configured to output or store the target data according to the scheduling information.
11. A data acquisition device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of claims 1 to 5.
12. A computer-readable storage medium, having stored thereon a computer program,
wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 5.
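For illustration, the optional items of scheduling information listed in claims 3 and 8 (entry information, frequency control, request mode, preprocessing method, subordinate data fusion mode, additional data processing method) could be carried in a simple record, with frequency control enforced as a minimum interval between requests. The following is a hedged sketch only: the field names, their types, the `FrequencyController` class and the interpretation of frequency control as a minimum interval are all assumptions, not details taken from the claims.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SchedulingInfo:
    """Hypothetical container for the items named in claims 3 and 8."""
    entry: str                                          # entry information (start URL)
    min_interval_s: float = 0.0                         # frequency control (assumed: seconds between requests)
    request_mode: str = "GET"                           # request mode
    preprocess: Optional[Callable[[str], str]] = None   # preprocessing method
    fusion_mode: str = "none"                           # subordinate data fusion mode (assumed label)
    postprocess: Optional[Callable[[dict], dict]] = None  # additional data processing method


class FrequencyController:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval_s: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None  # time at which the last request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return the delay applied."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval_s - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


info = SchedulingInfo(entry="https://example.com/list", min_interval_s=2.0)
controller = FrequencyController(info.min_interval_s)
first_delay = controller.wait()  # the first request is never delayed
```

A scheduling module built along these lines would call `controller.wait()` before generating each network request.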
CN201811283037.4A 2018-10-31 2018-10-31 Data acquisition method and device and computer readable storage medium Active CN111125589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811283037.4A CN111125589B (en) 2018-10-31 2018-10-31 Data acquisition method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811283037.4A CN111125589B (en) 2018-10-31 2018-10-31 Data acquisition method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111125589A true CN111125589A (en) 2020-05-08
CN111125589B CN111125589B (en) 2023-09-05

Family

ID=70484996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811283037.4A Active CN111125589B (en) 2018-10-31 2018-10-31 Data acquisition method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111125589B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957816A (en) * 2009-07-13 2011-01-26 上海谐宇网络科技有限公司 Webpage metadata automatic extraction method and system based on multi-page comparison
CN102184184A (en) * 2011-04-07 2011-09-14 安徽博约信息科技有限责任公司 Method for acquiring webpage dynamic information
CN102651002A (en) * 2011-02-28 2012-08-29 腾讯科技(深圳)有限公司 Webpage information extracting method and system
CN103853770A (en) * 2012-12-03 2014-06-11 北大方正集团有限公司 Method and system for abstracting information of posts from forum website
CN104050281A (en) * 2014-06-26 2014-09-17 北京思特奇信息技术股份有限公司 Webpage information extraction method and device based on http protocol
US9049117B1 (en) * 2009-10-21 2015-06-02 Narus, Inc. System and method for collecting and processing information of an internet user via IP-web correlation
CN105045838A (en) * 2015-07-01 2015-11-11 华东师范大学 Network crawler system based on distributed storage system
US20160335243A1 (en) * 2013-11-26 2016-11-17 Uc Mobile Co., Ltd. Webpage template generating method and server
CN107092632A (en) * 2017-02-09 2017-08-25 北京小度信息科技有限公司 Data processing method and device
CN107317724A (en) * 2017-06-06 2017-11-03 中证信用增进股份有限公司 Data collecting system and method based on cloud computing technology
CN107404493A (en) * 2017-08-21 2017-11-28 广州快充网络有限公司 New-energy automobile vehicle data packet parsing component and analytic method
CN107463634A (en) * 2017-07-17 2017-12-12 广州特道信息科技有限公司 web page text extracting method and device
CN107729564A (en) * 2017-11-13 2018-02-23 北京众荟信息技术股份有限公司 A kind of distributed focused web crawler web page crawl method and system
CN107798035A (en) * 2017-04-10 2018-03-13 平安科技(深圳)有限公司 A kind of data processing method and terminal
CN108304498A (en) * 2018-01-12 2018-07-20 深圳壹账通智能科技有限公司 Webpage data acquiring method, device, computer equipment and storage medium
CN108334634A (en) * 2018-02-27 2018-07-27 北京中关村科金技术有限公司 A kind of method, apparatus, equipment and the storage medium of extraction data information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAN PENG ET AL: "Web information extraction and its application", 《2011 IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS》 *
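王松 (WANG SONG): "Research and Implementation of an Intelligent Crawler System in Vertical Search Engines", 《Master's Theses Electronic Journal》 *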

Also Published As

Publication number Publication date
CN111125589B (en) 2023-09-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230627

Address after: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New District, Zhuhai City, Guangdong Province, 519030

Applicant after: New Founder Holdings Development Co., Ltd.

Applicant after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng Building, 9th floor

Applicant before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Applicant before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

GR01 Patent grant