CN112667872A - Real-time acquisition method for COVID-19 epidemic data - Google Patents

Real-time acquisition method for COVID-19 epidemic data

Info

Publication number
CN112667872A
Authority
CN
China
Prior art keywords
data
information source
acquired
fields
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011290564.5A
Other languages
Chinese (zh)
Other versions
CN112667872B (en)
Inventor
刘春阳
解伟凡
张翔宇
钟习
解峥
杜慧
王鹏
俞晓明
刘悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS, National Computer Network and Information Security Management Center
Priority to CN202011290564.5A
Publication of CN112667872A
Application granted
Publication of CN112667872B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a real-time acquisition method for COVID-19 epidemic data, comprising the following steps: step one, establishing a configuration file in which basic information of the web pages that report epidemic data in real time on a plurality of information source websites is preset, the basic information comprising the names of a plurality of fields, the storage paths of the fields, and the adoption counts of the fields (the number of times each field's value from a given website has been adopted); step two, collecting web page data, in which the current values of the field to be collected are collected from the plurality of information source websites through the storage paths of that field in the configuration file; step three, data alignment processing, in which the data alignment result of the field to be collected is taken as the collected data of that field; and step four, updating the configuration file, in which the adoption count of the field to be collected is incremented by 1 for each information source website whose current value of that field equals the collected data. The method selects, from the real-time data of a plurality of information source websites, the data with the highest credibility as the collected data, thereby improving the accuracy of real-time epidemic data.

Description

Real-time acquisition method for COVID-19 epidemic data
Technical Field
The invention relates to the technical field of information acquisition. More specifically, the invention relates to a real-time acquisition method for COVID-19 epidemic data.
Background
Since the outbreak of COVID-19, the public has paid close attention to the epidemic, and many websites provide real-time COVID-19 epidemic data, including fields such as newly infected, newly cured, cumulative deaths and cumulative infections for the regions of each country. Examples are the global epidemic real-time query module of the 21st Century Business Herald and the epidemic real-time tracking module of Tencent News. These websites essentially update epidemic data in real time.
However, the epidemic modules of most current websites draw on different information sources. For example, the epidemic module of Tencent News takes the World Health Organization and Johns Hopkins University websites as its information sources, while the epidemic module of Xinhuanet takes the National Health Commission as its information source. These websites generally use only a few information sources, or even a single one, which makes it difficult to reflect changes in the epidemic objectively and accurately in real time. The main problems are as follows: (1) in the global epidemic real-time query module of the 21st Century Business Herald, the newly infected field for some regions is often not updated and is shown as an underscore instead; (2) Xinhuanet updates its data infrequently, so only data from several days earlier can be viewed; (3) the data provided by different information source websites differ, for example, each website may return a different figure for the cumulative number of infections in a given country as of a given date; (4) because a single website has limited capacity to collect and integrate information, figures such as the number of new cases in some countries or regions may not yet be fully updated when a collection task starts, so some epidemic field data are missing on the website.
Epidemic data differ considerably between websites, and a single website can hardly reflect changes in the epidemic accurately in real time. A collection method that gathers epidemic data more efficiently and more accurately is therefore urgently needed.
Disclosure of Invention
An object of the present invention is to solve at least the above problems and to provide at least the advantages described later.
Another object of the invention is to provide a real-time acquisition method for COVID-19 epidemic data, which collects epidemic data from a plurality of information source websites in real time and, through data alignment processing, takes the data with higher credibility as the collected data.
To achieve these objects and other advantages in accordance with the purpose of the invention, there is provided a real-time acquisition method for COVID-19 epidemic data, comprising the following steps:
step one, establishing a configuration file, and presetting in the configuration file basic information of the web pages that report epidemic data in real time on a plurality of information source websites, wherein the basic information comprises the names of a plurality of fields, the storage paths of the fields and the adoption counts of the fields (the number of times each field's value has been adopted), and the fields are quantifiable indicators in the epidemic data;
step two, collecting web page data: collecting the current values of the field to be collected from the plurality of information source websites through the storage paths of the field to be collected in the configuration file;
step three, data alignment processing: aligning the values of the field to be collected and taking the alignment result as the collected data of that field, comprising the following steps:
S1, judging whether the current values of the field collected from the plurality of information source websites are all the same; if so, that value is the alignment result; if not, proceeding to S2;
S2, counting the number of occurrences of each distinct value of the field collected from the plurality of information source websites, and judging whether the most frequent value is unique; if so, taking the most frequent value as the alignment result; if not, proceeding to S3;
S3, for the information source websites corresponding to the most frequent values, reading from the configuration file the adoption count of the field for each of those websites, and taking the value collected from the website with the highest adoption count as the alignment result;
step four, updating the configuration file: for each information source website whose current value of the field to be collected equals the collected data of that field, adding 1 to the adoption count of the field and updating the adoption count of the corresponding field for the corresponding information source website in the configuration file, wherein the adoption count of every field for every information source website is preset to 0 at initialization.
Preferably, the real-time acquisition method for COVID-19 epidemic data further comprises: when the highest adoption count in step S3 is not unique, judging whether the values of the field collected from the websites sharing the highest adoption count are the same; if so, taking that value as the alignment result; if not, taking the largest of those values as the alignment result.
Preferably, in the real-time acquisition method for COVID-19 epidemic data, the configuration file further includes a plurality of web page links to the plurality of information source websites, used to obtain the web page source code of the corresponding websites; when web page data is collected in step two, a selector takes the storage path of the field to be collected for an information source website in the configuration file as its parameter and obtains the current value of that field from the web page source code of the corresponding website.
Preferably, for any information source website that provides a query interface, the query interface is recorded in the configuration file when the configuration file is established in step one, and when web page data is collected in step two, the current value of the field to be collected is obtained from that website by calling the query interface.
Preferably, when web page data is collected in step two and the current value of the field to be collected cannot be obtained from an information source website through the storage path in the configuration file, an open-source automated testing tool is called to render the page of that website so that the storage path of the field becomes available; the selector then takes the storage path of the field in the configuration file as its parameter and obtains the current value of the field from the rendered web page source code of the corresponding website.
Preferably, in the real-time acquisition method for COVID-19 epidemic data, the current values of the field to be collected that are collected from the plurality of information source websites are converted into a uniform numerical format before the data alignment processing.
Preferably, in the real-time acquisition method for COVID-19 epidemic data, the plurality of information source websites include at least two of Tencent News, 21st Century Finance, Xinhuanet, Sogou and Sina.
Preferably, the real-time acquisition method for COVID-19 epidemic data further comprises: step five, storing the collected data of the field to be collected obtained in step three into a database, recording the collection time, and placing the SQL statements that operate on the database in the configuration file.
Preferably, in the real-time acquisition method for COVID-19 epidemic data, the configuration file is read as follows: the first type of information is read from the configuration file only once, at collection initialization, and comprises the web page links of the different information source websites, the storage paths and query interfaces of the fields, and the SQL statements that operate on the database; the second type of information, namely the adoption count of each field for each information source website, is re-read from the configuration file at every collection.
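Step five and the role of the SQL statements as first-type (read-once) information can be sketched as follows. This is only an illustrative sketch: the table name, column names, the `STATIC_CONFIG` layout and the choice of SQLite are assumptions, since the patent names neither a database nor a schema.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical first-type information: read from the configuration file once,
# at collection initialization. Table and column names are illustrative only.
STATIC_CONFIG = {
    "sql": {
        "create": (
            "CREATE TABLE IF NOT EXISTS epidemic_data "
            "(field TEXT, value INTEGER, collected_at TEXT)"
        ),
        "insert": (
            "INSERT INTO epidemic_data (field, value, collected_at) "
            "VALUES (?, ?, ?)"
        ),
    }
}


def store_collected(conn, sql, field, value):
    """Step five: store the aligned value together with its collection time."""
    conn.execute(sql["create"])
    collected_at = datetime.now(timezone.utc).isoformat()
    conn.execute(sql["insert"], (field, value, collected_at))
    conn.commit()
    return collected_at


# In-memory database used purely for demonstration.
conn = sqlite3.connect(":memory:")
store_collected(conn, STATIC_CONFIG["sql"], "new_deaths_us", 4210)
rows = conn.execute("SELECT field, value FROM epidemic_data").fetchall()
```

Keeping the SQL text in the configuration rather than in the program means the storage schema can change without touching the collection code, which appears to be the intent of placing the statements in the configuration file.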
The invention provides at least the following beneficial effects: (1) epidemic data are collected in real time from multiple information source websites through the configuration file, which overcomes the problem of epidemic data being missing or not updated in time on a single information source website; (2) through data alignment processing, the data with the highest credibility among the real-time data of a plurality of information source websites is taken as the collected data, which improves the accuracy of real-time epidemic data.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
Fig. 1 is a schematic flow chart of the data alignment processing performed on a field to be collected in the technical solution of the present invention.
Detailed Description
The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.
It is to be noted that the experimental methods described in the following embodiments are all conventional methods unless otherwise specified, and the reagents and materials are commercially available unless otherwise specified.
As shown in Fig. 1, the invention provides a real-time acquisition method for COVID-19 epidemic data, comprising the following steps:
step one, establishing a configuration file, and presetting in the configuration file basic information of the web pages that report epidemic data in real time on a plurality of information source websites, wherein the basic information comprises the names of a plurality of fields, the storage paths of the fields and the adoption counts of the fields, the fields are quantifiable indicators in the epidemic data, and the names of the fields are, for example, "newly infected", "cumulative deaths", "asymptomatic cases", "cumulative confirmed cases" and "cumulative cured";
step two, collecting web page data: collecting the current values of the field to be collected from the plurality of information source websites through the storage paths of the field to be collected in the configuration file (the data of each field on an information source website is updated continuously over time, so the value obtained at the time of collection is naturally the latest one, hence "current value");
step three, data alignment processing: aligning the values of the field to be collected and taking the alignment result as the collected data of that field, comprising the following steps:
S1, judging whether the current values of the field (here and below, "the field" means the field to be collected) collected from the plurality of information source websites are all the same; if so, that value is the alignment result; if not, proceeding to S2;
S2, counting the number of occurrences of each distinct value of the field collected from the plurality of information source websites, and judging whether the most frequent value is unique; if so, taking the most frequent value as the alignment result; if not, proceeding to S3;
S3, for the information source websites corresponding to the most frequent values, reading from the configuration file the adoption count of the field for each of those websites, and taking the value collected from the website with the highest adoption count as the alignment result (note that, as collected data accumulates, the information source websites corresponding to the most frequent values are unlikely to have the same adoption count for the field, so in step S3 of this technical solution the value collected from the website with the highest adoption count is taken directly as the alignment result);
step four, updating the configuration file: for each information source website whose current value of the field to be collected equals the collected data of that field, adding 1 to the adoption count of the field and updating the adoption count of the corresponding field for the corresponding information source website in the configuration file, wherein the adoption count of every field for every information source website is preset to 0 at initialization.
In this technical solution, the basic information of the web pages that report epidemic data on the chosen information source websites is preset in the configuration file, so that when a field to be collected, such as the number of newly infected people, is collected, its values are obtained from the preset information source websites directly through the configuration file. In each collection, the current values (numbers of people) of the field to be collected on the plurality of information source websites are obtained by the name of that field. That is, basic information for a number of different fields is preset in the configuration file, but the fields are collected one at a time; after each field is collected, its data is aligned and the adoption counts of that field for the corresponding information source websites are updated in the configuration file.
The plurality of information source websites are preferably Tencent News, 21st Century Finance and Xinhuanet. Because each field has a storage path in the DOM (document object model tree) of each information source website, the storage path of each field is preset in the configuration file, and the value of the field can be extracted from the DOM. Each information source website supplies values for a plurality of fields, and data alignment processing selects, for each field, the value with the highest credibility among the values from the information source websites as the collected data. The method adds a custom "adoption count per field" for each information source website: each time an alignment result is obtained through step three, the adoption count of the field is incremented by 1 in the configuration file for every website whose value was adopted. During data alignment these adoption counts are used in the selection, so that, as data accumulates, the accuracy of real-time COVID-19 epidemic data collection improves.
At initialization no value from any website has yet been adopted, so the adoption count of every field for every information source website is preset to 0. The adoption counts grow as the number of collections increases, which improves the accuracy of data collection.
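A minimal sketch of the configuration content and of the step-four update is given below. The dictionary layout, the source keys, the field key `new_deaths_us` and the selector strings are all hypothetical; the patent specifies only that names, storage paths and adoption counts are stored, not their format.

```python
# Hypothetical configuration layout: one entry per information source website,
# holding a storage path and an adoption count (preset to 0) for each field.
config = {
    "tencent_news": {"new_deaths_us": {"path": "#us-new-deaths", "adopted": 0}},
    "century21_finance": {"new_deaths_us": {"path": ".us .deaths-new", "adopted": 0}},
    "xinhuanet": {"new_deaths_us": {"path": "#deaths > span", "adopted": 0}},
}


def update_adoption_counts(config, field, collected, result):
    """Step four: add 1 to the adoption count of `field` for every source
    whose collected value equals the alignment result."""
    for source, value in collected.items():
        if value == result:
            config[source][field]["adopted"] += 1


# Values collected in one round, and the alignment result chosen for them.
collected = {"tencent_news": 4211, "century21_finance": 4210, "xinhuanet": 4210}
update_adoption_counts(config, "new_deaths_us", collected, 4210)
adopted = {s: cfg["new_deaths_us"]["adopted"] for s, cfg in config.items()}
```

Only the two sources that reported the adopted value gain a count, so over many rounds the counts come to reflect how often each source agreed with the aligned result.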
The actual data collection process is illustrated with the following examples:
Example 1: assume that the field to be collected is the number of new deaths in the United States, and that the values collected in sequence from Tencent News, 21st Century Finance and Xinhuanet fall into one of the following cases: ① 4210, 4210, 4210; ② 4211, 4210, 4210; ③ 4211, 4212, 4210.
Analysis of the alignment process:
In case ①, the values of new deaths in the United States collected from the three websites are all 4210. Because the values are the same, 4210 is taken directly as the alignment result, that is, as the collected data for new deaths in the United States, and the adoption counts of this field for all three websites are incremented by 1 in the configuration file.
In case ②, two of the three websites (21st Century Finance and Xinhuanet) report 4210 and one (Tencent News) reports 4211. Because 4210 occurs twice and 4211 once, 4210 is taken directly as the alignment result, and the adoption counts of this field for 21st Century Finance and Xinhuanet, the websites that reported 4210, are incremented by 1 in the configuration file.
In case ③, the values collected from the three websites all differ, each occurring once. If the adoption counts of this field for Tencent News, 21st Century Finance and Xinhuanet in the configuration file are currently 521, 438 and 678, the value collected from Xinhuanet, which has the highest adoption count (678), is taken as the alignment result. If case ③ occurs at the first collection, when all adoption counts are still the preset value 0, the largest of 4211, 4212 and 4210, namely 4212, is taken as the alignment result (this situation is covered by the technical solution described further below).
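The three cases of Example 1 can be traced with a short sketch of steps S1 to S3. The function name `align` and the source keys are illustrative, and this version does not handle a tie between adoption counts, which is treated separately in the text.

```python
from collections import Counter


def align(values, adopted):
    """values: source -> collected value; adopted: source -> adoption count."""
    counts = Counter(values.values())
    # S1: every source reports the same value.
    if len(counts) == 1:
        return next(iter(counts))
    # S2: a unique most frequent value wins.
    top = max(counts.values())
    candidates = [v for v, n in counts.items() if n == top]
    if len(candidates) == 1:
        return candidates[0]
    # S3: among sources reporting a tied value, trust the most adopted source.
    # (Ties between adoption counts are not handled in this sketch.)
    tied_sources = [s for s, v in values.items() if v in candidates]
    best_source = max(tied_sources, key=lambda s: adopted[s])
    return values[best_source]


adopted = {"tencent": 521, "century21": 438, "xinhuanet": 678}
case1 = align({"tencent": 4210, "century21": 4210, "xinhuanet": 4210}, adopted)
case2 = align({"tencent": 4211, "century21": 4210, "xinhuanet": 4210}, adopted)
case3 = align({"tencent": 4211, "century21": 4212, "xinhuanet": 4210}, adopted)
```

With the adoption counts 521, 438 and 678 from the example, all three cases resolve to 4210; in case ③ this is because Xinhuanet, the most adopted source, reported it.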
In this technical solution, the path of a field may be obtained as follows:
1) Open the web page in the Chrome browser and select the field with the mouse.
2) Right-click and choose to inspect the element, so that the field can be viewed from the DOM (document object model tree) perspective in the Chrome browser.
3) Select the field in the DOM (document object model tree) with the mouse.
4) Right-click, choose to copy the selector, and paste the content, which is the path of the field in the web page DOM (document object model tree).
In another technical solution, as shown in Fig. 1, the real-time acquisition method for COVID-19 epidemic data further comprises: when the highest adoption count in step S3 is not unique, judging whether the values of the field collected from the websites sharing the highest adoption count are the same; if so, taking that value as the alignment result; if not, taking the largest of those values as the alignment result.
Example 2: assume that the field to be collected is the number of new infections in the United States, and that the values collected in sequence from Tencent News, 21st Century Finance, Xinhuanet, Sogou and Sina are 328, 328, 326, 326 and 327. Assume further that, before the current collection, the adoption counts of this field for these websites are, in the same order, one of the following three cases: ① 53, 52, 50, 51, 48; ② 52, 52, 51, 50, 48; ③ 52, 51, 50, 52;
Analysis of the alignment process: the values collected from the five websites are not all the same, so the method proceeds to S2. The value 328 occurs twice, 326 twice and 327 once; the most frequent values are 328 and 326, each occurring twice, so the most frequent value is not unique and the method proceeds to S3. For the four information source websites whose collected values are 328 or 326, the three cases are handled as follows:
In case ①, the adoption counts of the four websites are 53, 52, 50 and 51 in sequence. The highest count, 53, is unique, so the value 328 collected from the website with adoption count 53 is taken directly as the alignment result, that is, the collected data for new infections in the United States is 328. The websites whose collected value is 328 are Tencent News and 21st Century Finance, so the adoption counts of this field for Tencent News and 21st Century Finance are incremented by 1 in the configuration file.
In case ②, the adoption counts of the four websites are 52, 52, 51 and 50 in sequence. The highest count, 52, occurs twice and is therefore not unique; the websites with adoption count 52 are Tencent News and 21st Century Finance, and both reported 328, so 328 is taken as the alignment result. The websites whose collected value is 328 are Tencent News and 21st Century Finance, so the adoption counts of this field for Tencent News and 21st Century Finance are incremented by 1 in the configuration file.
In case ③, the adoption counts of the four websites are 52, 51, 50 and 52 in sequence. The highest count, 52, occurs twice and is therefore not unique; the websites with adoption count 52 are Tencent News and Sogou. Tencent News reported 328 and Sogou reported 326, which differ, so the larger value, 328, is taken as the alignment result. The websites whose collected value is 328 are Tencent News and 21st Century Finance, so the adoption counts of this field for Tencent News and 21st Century Finance are incremented by 1 in the configuration file.
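The tie-aware version of S3 used in Example 2 can be sketched as below. The function name `resolve_s3` and the source keys are illustrative; the input is assumed to contain only the websites whose values tied for most frequent in S2.

```python
def resolve_s3(tied, adopted):
    """tied: source -> value, restricted to sources whose value tied in S2;
    adopted: source -> adoption count of the field."""
    best = max(adopted[s] for s in tied)
    top_sources = [s for s in tied if adopted[s] == best]
    # If the top sources agree, max() returns their common value; if they
    # disagree, it returns the largest value, as the text prescribes.
    return max(tied[s] for s in top_sources)


# The four websites that reported 328 or 326 in Example 2.
tied = {"tencent": 328, "century21": 328, "xinhuanet": 326, "sogou": 326}
case1 = resolve_s3(tied, dict(zip(tied, (53, 52, 50, 51))))
case2 = resolve_s3(tied, dict(zip(tied, (52, 52, 51, 50))))
case3 = resolve_s3(tied, dict(zip(tied, (52, 51, 50, 52))))
```

All three cases yield 328, matching the analysis above. A single `max` over the values of the top-ranked sources covers both the "same value" and "different values" branches of the rule.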
In another technical solution, in the real-time acquisition method for COVID-19 epidemic data, the configuration file further includes a plurality of web page links to the plurality of information source websites, used to obtain the web page source code of the corresponding websites; when web page data is collected in step two, a selector takes the storage path of the field to be collected for an information source website in the configuration file as its parameter and obtains the current value of that field from the web page source code of the corresponding website.
The collection program uses the web page link to issue a page request and obtain the web page source code of the information source website, from which data can be extracted directly. A CSS Selector (cascading style sheet selector) can be used for the extraction, with the path of each field in the configuration file passed as the selector's parameter.
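The extraction step can be illustrated with a standard-library sketch. It is deliberately not a full CSS selector engine: as a stand-in it supports only `#id` paths and assumes well-formed markup with no unclosed void tags inside the target element. The names `select_text` and `IdTextExtractor` and the sample markup are hypothetical; a real collector would more likely use a dedicated selector library.

```python
from html.parser import HTMLParser


class IdTextExtractor(HTMLParser):
    """Collects the text inside the element whose id matches element_id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # > 0 while the parser is inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def select_text(page_source, path):
    """Return the text at `path` in the page source, or None if not found."""
    if not path.startswith("#"):
        raise ValueError("this sketch only supports '#id' paths")
    parser = IdTextExtractor(path[1:])
    parser.feed(page_source)
    parser.close()
    text = "".join(parser.chunks).strip()
    return text or None


page = '<div><span id="us-new-deaths">4210</span><span id="other">7</span></div>'
value = select_text(page, "#us-new-deaths")
missing = select_text(page, "#not-there")
```

Returning `None` when the path is absent gives the caller a clear signal that the stored path no longer matches the page source.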
In another technical solution, for any information source website that provides a query interface, the query interface is recorded in the configuration file when the configuration file is established in step one, and when web page data is collected in step two, the current value of the field to be collected is obtained from that website by calling the query interface.
In this technical solution, some information source websites provide a query interface, so the value of a field need not be extracted from the web page source code through the field's storage path: the query interface can be called directly to obtain the value, provided that the interface is recorded in the configuration file. Other information source websites provide no query interface, and their data can only be extracted from the web page source code using the storage path of the field. For websites that do provide a query interface, the interface is used in preference to path extraction because it is more convenient.
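The preference for a query interface over path extraction might be organized as in the following sketch. The configuration keys (`query_interface`, `url`, `paths`), the example URLs and the assumption that the interface returns JSON keyed by field name are all hypothetical; the network call and the page extraction are passed in as functions so the dispatch logic stays self-contained.

```python
import json


def collect_value(source_cfg, field, call_api, extract_from_page):
    """Use the query interface when the configuration lists one; otherwise
    fall back to extracting the field from the page via its storage path."""
    api_url = source_cfg.get("query_interface")
    if api_url:
        # Assumed response format: a JSON object keyed by field name.
        payload = json.loads(call_api(api_url))
        return payload[field]
    return extract_from_page(source_cfg["url"], source_cfg["paths"][field])


# Stand-in functions and configurations, for demonstration only.
with_api = {
    "url": "https://example.invalid/a",
    "query_interface": "https://example.invalid/a/api",
    "paths": {"new_infections_us": "#new"},
}
without_api = {"url": "https://example.invalid/b", "paths": {"new_infections_us": "#new"}}

fake_api = lambda url: '{"new_infections_us": 328}'
fake_extract = lambda url, path: 326

via_api = collect_value(with_api, "new_infections_us", fake_api, fake_extract)
via_page = collect_value(without_api, "new_infections_us", fake_api, fake_extract)
```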
In another technical scheme, when web page data is acquired in the second step, when the current value of the field to be acquired cannot be acquired from any information source website through the storage path of the field to be acquired in the configuration file, an open-source automatic test tool is called to render the field to be acquired in the information source website, the storage path of the field to be acquired is acquired, and then the current value of the field to be acquired in the information source website is acquired from the web page source code of the corresponding information source website by using the storage path of the field to be acquired in the configuration file in the information source website as a parameter through a selector.
In this technical scheme, a rendering script may be embedded in the web page source code, and that script is not executed when the acquisition program requests the source code through the web page link. A user browsing normally sees the page after the browser has executed the rendering script, and the field paths written in the configuration file are the paths of the fields in the page after the script has run. Such scripts affect the fields in two ways: 1) they create the paths of certain fields and set their values, which means that the source code the acquisition program requests through the web page link does not yet contain those fields; 2) they modify the paths of certain fields, which means that the paths of those fields in the requested source code do not match the paths in the configuration file, so extracting directly with the configured paths fails with a path-not-found error. In either case, an open-source automated testing tool (e.g., Selenium) or a lightweight headless browser (e.g., PhantomJS) may be invoked to execute the rendering script automatically and obtain the DOM (Document Object Model) tree. Taking the Selenium library as an example, its get function takes the web page link as input and loads the page with its scripts executed, after which the rendered DOM tree is available. Once the DOM tree has been obtained, a CSS Selector (Cascading Style Sheets selector) is used to acquire the required data from the page.
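The fallback described above can be sketched as follows. This is an assumption-laden illustration: `collect_value` and `render_with_selenium` are hypothetical names, the static fetcher and the selector are passed in as plain functions, and the Selenium part assumes the third-party `selenium` package plus a local Chrome installation.

```python
def render_with_selenium(url):
    """Load the page in headless Chrome so that its scripts run, then return
    the rendered DOM serialized as HTML.

    Assumption: the third-party selenium package and Chrome are installed.
    """
    from selenium import webdriver  # lazy import: optional dependency

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)            # navigates and executes the page scripts
        return driver.page_source  # HTML of the DOM after rendering
    finally:
        driver.quit()


def collect_value(url, path, fetch_static, select, render=render_with_selenium):
    """Try the raw source code first; render the page only if the path is missing.

    fetch_static(url) returns unrendered source code; select(html, path)
    returns the field value or None when the path cannot be found.
    """
    value = select(fetch_static(url), path)
    if value is None:
        value = select(render(url), path)
    return value
```

Rendering is kept as the second attempt because starting a browser is far slower than a plain page request.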
In another technical scheme, in the real-time acquisition method of new coronary pneumonia epidemic situation data, the current values of the fields to be acquired from the plurality of information source websites are converted into a uniform numerical format before the data alignment processing. This step is also called data cleaning. It prevents the data alignment processing from failing because the values of a field are written in different formats on different information source websites: for example, some websites use Chinese numerals and others use Arabic numerals, so all values must be converted into one numerical format before they can be compared. The preferred format is Arabic numerals.
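A minimal sketch of such a cleaning step is given below. The function names are hypothetical and the set of accepted formats is an assumption: digits with thousands separators, digits followed by the unit 万 or 亿, and simple Chinese numerals. Nested large units such as 一万亿 are deliberately not handled.

```python
import re

CN_DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
CN_SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
CN_BIG_UNITS = {"万": 10_000, "亿": 100_000_000}


def chinese_to_int(text):
    """Convert a simple Chinese numeral such as 八万六千 to an integer."""
    total = 0    # sections already multiplied by 万 or 亿
    section = 0  # value accumulated below the next big unit
    digit = 0    # digit waiting for its unit
    for ch in text:
        if ch in CN_DIGITS:
            digit = CN_DIGITS[ch]
        elif ch in CN_SMALL_UNITS:
            section += (digit or 1) * CN_SMALL_UNITS[ch]  # bare 十 means 10
            digit = 0
        elif ch in CN_BIG_UNITS:
            total += (section + digit) * CN_BIG_UNITS[ch]
            section, digit = 0, 0
        else:
            raise ValueError(f"unsupported character in numeral: {ch!r}")
    return total + section + digit


def normalize_value(raw):
    """Return the value as an integer, whatever notation the website used."""
    text = raw.strip().replace(",", "").replace("，", "")
    if not text:
        raise ValueError("empty value")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([万亿]?)", text)
    if match:
        number, unit = match.groups()
        return round(float(number) * CN_BIG_UNITS.get(unit, 1))
    return chinese_to_int(text)
```

With every value reduced to an integer, the comparison in the alignment step becomes a plain equality test.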
In another technical scheme, in the real-time acquisition method of new coronary pneumonia epidemic situation data, the plurality of information source websites include at least two of Tencent News, 21st Century Finance and Economics, Xinhuanet, Sogou, and Sina. Preferably, in another technical scheme, the information source websites are Tencent News, 21st Century Finance and Economics, and Xinhuanet.
It should be noted that any one of the information source websites may itself take its epidemic data from another website, referred to here as its first source. For example, the epidemic module of Tencent News takes the websites of the World Health Organization and Johns Hopkins University as its sources, and the epidemic module of Xinhuanet takes the National Health Commission as its source; that is, the first-source websites include those of the World Health Organization, Johns Hopkins University, and the National Health Commission.
In this technical scheme, the first-source websites used by the plurality of information source websites are required not to repeat. Under this scheme, the data of a single first-source website cannot gain extra weight by reaching the alignment through several information source websites, so accuracy is higher.
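The non-repetition requirement can be checked mechanically when the set of information source websites is chosen. The sketch below is illustrative only; the function name and the shape of the input mapping are assumptions, since the text does not say where first sources are recorded.

```python
def find_shared_first_sources(first_sources):
    """Return the first sources used by more than one information source website.

    first_sources maps a website name to the list of its first sources.
    An empty result means the non-repetition requirement is met.
    """
    seen = {}    # first source -> first website that uses it
    shared = {}  # first source -> all websites that use it
    for site, origins in first_sources.items():
        for origin in origins:
            if origin in seen:
                shared.setdefault(origin, [seen[origin]]).append(site)
            else:
                seen[origin] = site
    return shared
```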
In another technical scheme, the real-time acquisition method of new coronary pneumonia epidemic situation data further includes: step five, storing the acquired data of the fields to be acquired, obtained in step three, into a database, recording the acquisition time, and placing the SQL statements that operate the database in the configuration file.
In this technical scheme, the acquired data, i.e., the alignment result, can be written to the open-source MySQL database for persistence: the acquisition program calls MySQL Connector (the MySQL database connection tool) to establish a database connection, writes the aligned data to MySQL, and then closes the connection. The SQL statements executed in this step may be read from the configuration file in step two. The historical data stored by the acquisition method can thus be read back, with the data format defined as: name of the country or region, acquisition time, name of field 1, data of field 1, name of field 2, data of field 2, …, name of the last field, data of the last field.
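The persistence step can be sketched as below. To keep the example runnable without a database server, Python's built-in `sqlite3` module is used as a stand-in for MySQL and MySQL Connector; the table layout (one row per field), the `CONFIG_SQL` dictionary standing in for the SQL statements of the configuration file, and the function name are all assumptions.

```python
import sqlite3
from datetime import datetime, timezone

# Stand-in for the SQL statements kept in the configuration file (hypothetical).
CONFIG_SQL = {
    "create": ("CREATE TABLE IF NOT EXISTS epidemic "
               "(region TEXT, collected_at TEXT, field TEXT, value INTEGER)"),
    "insert": ("INSERT INTO epidemic (region, collected_at, field, value) "
               "VALUES (?, ?, ?, ?)"),
}


def persist(conn, region, aligned, sql=CONFIG_SQL, now=None):
    """Write one alignment result and its acquisition time.

    aligned maps each field name to its aligned value.
    Returns the recorded acquisition time as an ISO 8601 string.
    """
    collected_at = (now or datetime.now(timezone.utc)).isoformat()
    conn.execute(sql["create"])
    conn.executemany(
        sql["insert"],
        [(region, collected_at, name, value) for name, value in aligned.items()],
    )
    conn.commit()
    return collected_at
```

A caller would open the connection, call `persist`, and close the connection afterwards, mirroring the connect, write, close sequence described above.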
In another technical scheme, in the real-time acquisition method of new coronary pneumonia epidemic situation data, when the configuration file is read, first-type information is read from the configuration file only once, at acquisition initialization; the first-type information comprises the web page links of the different information source websites, the storage paths of the fields, the query interfaces, and the SQL statements that operate the database. Second-type information is re-read every time the configuration file is read; the second-type information is the number of times each field has been adopted in the plurality of information source websites.
Because the links of the information source websites, the paths of the fields in the DOM (Document Object Model) tree, the query interfaces provided by the websites, and the SQL statements for database persistence rarely change, this content is read from the configuration file only once, in the initialization stage of the acquisition program, and is kept in memory allocated by the acquisition program; it is not read from the configuration file again during each acquisition.
Because the number of times each field of each information source website has been adopted is updated continually as acquisition proceeds, this part of the information is re-read whenever data alignment is performed in step three.
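The two reading policies can be sketched as a small configuration wrapper. The JSON layout, the `adopted` key, and the class name `CollectorConfig` are assumptions for illustration; the text does not fix the file format.

```python
import json


class CollectorConfig:
    """Cache the stable configuration once; re-read the adoption counts on demand."""

    def __init__(self, path):
        self.path = path
        data = self._load()
        # First-type information: links, storage paths, query interfaces, SQL.
        # Read once at initialization and kept in memory.
        self.sources = {
            name: {key: value for key, value in entry.items() if key != "adopted"}
            for name, entry in data["sources"].items()
        }
        self.sql = data.get("sql", {})

    def _load(self):
        with open(self.path, encoding="utf-8") as handle:
            return json.load(handle)

    def adopted_counts(self):
        """Second-type information: re-read from the file on every call."""
        data = self._load()
        return {
            name: dict(entry.get("adopted", {}))
            for name, entry in data["sources"].items()
        }
```

Calling `adopted_counts()` at the start of each alignment picks up the counts written back by the previous acquisition, while `sources` and `sql` stay as read at start-up.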
While embodiments of the invention have been described above, the invention is not limited to the applications set forth in the description and the embodiments; it is fully applicable in the various fields to which it pertains, and further modifications may readily be made by those skilled in the art. The invention is therefore not limited to the details shown and described herein, provided there is no departure from the general concept defined by the appended claims and their equivalents.

Claims (9)

1. A real-time acquisition method of new coronary pneumonia epidemic situation data, characterized by comprising the following steps:
step one, establishing a configuration file, and presetting in the configuration file basic information of the web pages that reflect epidemic situation data in real time on a plurality of information source websites, wherein the basic information comprises the names of a plurality of fields, the storage paths of the fields, and the number of times each field has been adopted, the fields being quantifiable indicators in the epidemic situation data;
step two, acquiring web page data: acquiring the current values of the fields to be acquired from the plurality of information source websites through the storage paths of those fields in the configuration file;
step three, data alignment processing: for each field to be acquired, taking the data alignment result of the field as the acquired data of the field, comprising:
S1, judging whether the current values of the field acquired from the plurality of information source websites are all the same; if so, that value is the alignment result; if not, proceeding to S2;
S2, counting the number of occurrences of each distinct value of the field acquired from the plurality of information source websites, and judging whether the value with the largest number of occurrences is unique; if so, taking that value as the alignment result; if not, proceeding to S3;
S3, for the information source websites corresponding to the values with the largest number of occurrences, reading from the configuration file the number of times the field has been adopted in each of those information source websites, and taking the value of the field acquired from the information source website with the largest number of times adopted as the alignment result;
step four, updating the configuration file: for each information source website whose current value of the field to be acquired is the same as the acquired data of that field, adding 1 to the number of times the field has been adopted in that information source website, and updating the corresponding count in the configuration file, wherein the number of times each field has been adopted in each information source website is preset to 0 at initialization.
2. The method for real-time acquisition of new coronary pneumonia epidemic situation data according to claim 1, further comprising: when the largest number of times adopted is not unique in S3, judging whether the values of the field acquired from the information source websites having that largest number are the same; if so, taking that value as the alignment result; if not, taking the largest of those values as the alignment result.
3. The method for real-time acquisition of new coronary pneumonia epidemic situation data according to claim 1, wherein the configuration file further includes a plurality of web page links to the plurality of information source websites, used to obtain the web page source code of the corresponding information source website, and when web page data is acquired in step two, a selector acquires the current value of the field to be acquired from the web page source code of the corresponding information source website, taking the storage path of that field in the configuration file as its parameter.
4. The method for real-time acquisition of new coronary pneumonia epidemic situation data according to claim 3, wherein for any information source website that provides a query interface, the query interface is recorded in the configuration file when the configuration file is established in step one, and when web page data is acquired in step two, the current value of the field to be acquired from the corresponding information source website is obtained by calling the query interface.
5. The method for real-time acquisition of new coronary pneumonia epidemic situation data according to claim 3, wherein when web page data is acquired in step two and the current value of the field to be acquired cannot be obtained from any one of the information source websites through the storage path of that field in the configuration file, an open-source automated testing tool is called to render the field to be acquired in the information source website and obtain its storage path, and the selector then acquires the current value of the field from the web page source code of the corresponding information source website, taking the storage path of the field in the configuration file as its parameter.
6. The method for real-time acquisition of new coronary pneumonia epidemic situation data according to claim 1, wherein the current values of the fields to be acquired from the plurality of information source websites are converted into a uniform numerical format before the data alignment processing.
7. The method for real-time acquisition of new coronary pneumonia epidemic situation data according to claim 1, wherein the plurality of information source websites comprise at least two websites selected from Tencent News, 21st Century Finance and Economics, Xinhuanet, Sogou, and Sina.
8. The method for real-time acquisition of new coronary pneumonia epidemic situation data according to claim 4, further comprising: step five, storing the acquired data of the fields to be acquired, obtained in step three, into a database, recording the acquisition time, and placing the SQL statements that operate the database in the configuration file.
9. The method for real-time acquisition of new coronary pneumonia epidemic situation data according to claim 8, wherein when the configuration file is read, first-type information is read from the configuration file only once, at acquisition initialization, the first-type information comprising the web page links of the different information source websites, the storage paths of the fields, the query interfaces, and the SQL statements that operate the database; and second-type information is re-read every time the configuration file is read, the second-type information being the number of times each field has been adopted in the plurality of information source websites.
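For illustration only, the data alignment of claims 1 and 2 (S1 to S3, the tie-break on the largest value, and the count update of step four) can be sketched in Python as follows. The function names and the in-memory dictionaries are assumptions; in the method itself the adoption counts live in the configuration file.

```python
from collections import Counter


def align(values, adopted):
    """Return the alignment result for one field.

    values:  {website: value acquired from that website}
    adopted: {website: number of times this field was adopted from it}
    """
    counts = Counter(values.values())
    top = max(counts.values())
    candidates = {value for value, count in counts.items() if count == top}
    # S1 and S2: all values agree, or one value occurs most often.
    if len(candidates) == 1:
        return next(iter(candidates))
    # S3: among websites holding a tied value, prefer the most adopted one.
    tied_sites = [site for site, value in values.items() if value in candidates]
    best = max(adopted.get(site, 0) for site in tied_sites)
    best_values = {values[site] for site in tied_sites
                   if adopted.get(site, 0) == best}
    # Claim 2: if the adoption counts tie as well, take the largest value
    # (with a single remaining value, max returns that value).
    return max(best_values)


def update_adopted(values, adopted, result):
    """Step four: add 1 for every website whose value equals the result."""
    for site, value in values.items():
        adopted.setdefault(site, 0)  # counts start at 0
        if value == result:
            adopted[site] += 1
    return adopted
```

For example, with values 100 and 90 from two websites whose adoption counts are 1 and 3, S3 selects 90; if both counts were 2, the tie-break of claim 2 selects 100.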
CN202011290564.5A 2020-11-17 2020-11-17 Real-time acquisition method of new coronary pneumonia epidemic situation data Active CN112667872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011290564.5A CN112667872B (en) 2020-11-17 2020-11-17 Real-time acquisition method of new coronary pneumonia epidemic situation data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011290564.5A CN112667872B (en) 2020-11-17 2020-11-17 Real-time acquisition method of new coronary pneumonia epidemic situation data

Publications (2)

Publication Number Publication Date
CN112667872A true CN112667872A (en) 2021-04-16
CN112667872B CN112667872B (en) 2023-04-07

Family

ID=75403596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011290564.5A Active CN112667872B (en) 2020-11-17 2020-11-17 Real-time acquisition method of new coronary pneumonia epidemic situation data

Country Status (1)

Country Link
CN (1) CN112667872B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140289187A1 (en) * 2013-03-21 2014-09-25 Salesforce.Com, Inc. System and method for evaluating claims to update a record from conflicting data sources
CN104462547A (en) * 2014-12-25 2015-03-25 深圳联友科技有限公司 Configurable webpage data acquisition method and system
CN107958086A (en) * 2017-12-18 2018-04-24 北京睿力科技有限公司 The multi-source heterogeneous database data for solving data semantic Heterogeneity integrates method
CN109472005A (en) * 2018-11-08 2019-03-15 北京锐安科技有限公司 Data reliability appraisal procedure, device, equipment and storage medium
CN111523006A (en) * 2020-04-14 2020-08-11 上海安洵信息技术有限公司 Network public opinion tracking method for epidemic situation area
CN111680082A (en) * 2020-04-30 2020-09-18 四川弘智远大科技有限公司 Government financial data acquisition system and data acquisition method based on data integration
CN111708773A (en) * 2020-08-13 2020-09-25 江苏宝和数据股份有限公司 Multi-source scientific and creative resource data fusion method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113918142A (en) * 2021-11-24 2022-01-11 企查查科技有限公司 Data acquisition task code generation method and device and computer equipment
CN113918142B (en) * 2021-11-24 2024-03-15 企查查科技股份有限公司 Data acquisition task code generation method, device and computer equipment

Also Published As

Publication number Publication date
CN112667872B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
JP4146347B2 (en) Access log analysis apparatus and access log analysis method
US8185530B2 (en) Method and system for web document clustering
US6199081B1 (en) Automatic tagging of documents and exclusion by content
US20110173178A1 (en) Method and system for obtaining script related information for website crawling
CN104516982A (en) Method and system for extracting Web information based on Nutch
JP2018198046A (en) Apparatus and method for generation of financial event database
CN112667872B (en) Real-time acquisition method of new coronary pneumonia epidemic situation data
CN111125485A (en) Website URL crawling method based on Scapy
CN104281629A (en) Method and device for extracting picture from webpage and client equipment
Sharma et al. A novel architecture for deep web crawler
KR100557874B1 (en) Method of scientific information analysis and media that can record computer program thereof
EP3422177A1 (en) Systems and methods for code parsing and lineage detection
Bross et al. Mapping the blogosphere with rss-feeds
CN113094568A (en) Data extraction method based on data crawler technology
CN104778232A (en) Searching result optimizing method and device based on long query
KR20080030196A (en) The way of internet web page tagging and tag search system
Yang et al. Data collection system for link analysis
CN114489673A (en) Method and device for deleting invalid codes in application program
US11256670B2 (en) Multi-database system
CN113254725A (en) Data management and retrieval enhancement method for graph database
Di Lucca et al. Towards a better comprehensibility of web applications: Lessons learned from reverse engineering experiments
KR100871470B1 (en) search system for constructing indexed data and method thereof
Xie et al. Design and Implementation of Web Information Extraction System Based on Crawler
CN113741766B (en) Visual acquisition tool for webpage codes
Kinnander A comparison between MongoDB & CouchDB on search performance: A comparative analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant