CN112948660A - Cluster electric bus monitoring website battery data continuous crawling and analyzing method - Google Patents
- Publication number
- CN112948660A (application number CN202110339920.6A)
- Authority
- CN
- China
- Prior art keywords
- battery
- cluster
- data
- crawler
- crawling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/248—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website. First, a Docker cluster spanning several physical hosts is constructed with the Swarm container management tool. Next, the Scrapy crawler framework for Python continuously crawls the non-standardized battery monitoring data held on the designated electric bus monitoring website. Finally, the resulting battery database is analyzed and mined to generate a battery voltage-difference box-and-whisker plot, trend charts of battery pack total voltage and total current, and related statistical charts. The method relies on two main technologies, distributed Docker container clustering and web crawling: the Scrapy framework continuously collects, extracts and cleans unstructured battery information from the designated large-scale electric vehicle monitoring website, producing a new structured battery information database and statistical charts that display the data visually, which gives technical analysts an intuitive reference on battery operating state.
Description
Technical Field
The invention relates to the technical field of remote monitoring and analysis of power batteries in the field of electric automobiles, in particular to a method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website.
Background
Web crawlers, also known as web spiders, collect customized web page information from the internet according to programmed logic. The key technology of a web crawler has two parts: first, downloading internet web pages (including their links) using a depth-first or breadth-first strategy; second, parsing the pages to extract and clean useful data and storing it in a database. Scrapy is a high-performance, practical crawler framework written in Python; it has a clear structure and provides a range of middleware interfaces that let users customize it to their needs.
Docker is an open-source container engine that can quickly and easily create portable, lightweight containers for a specific application. It allows a system image and the related libraries to be packaged into a single image, without installing an operating system and the application's dependency packages, which greatly reduces environment deployment time. Compared with a VM cluster, a Docker cluster uses host resources more reasonably and efficiently. Swarm, the cluster management tool officially provided by Docker, offers a unified entry point from which a distributed crawler deployment can be started quickly. It abstracts multiple Docker hosts as a single whole, can elastically scale the Docker hosts to achieve load balancing, and can scale out horizontally through container orchestration.
ECharts is an open-source, pure-JavaScript charting library from Baidu that runs smoothly in a variety of browsers on PCs and mobile devices. ECharts provides rich APIs and documentation; by suitably configuring and combining the JSON data passed from the back end, diverse data displays can be designed. It is easy to configure, offers many chart types and keeps data transfer light, which makes it well suited to the visual analysis of information data.
Disclosure of Invention
The invention provides a method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website.
To achieve this purpose, the invention adopts the following technical solution: a method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website, comprising the following steps.
First, a Docker cluster of several physical hosts is constructed using the Swarm container management tool.
Then, the Scrapy crawler framework for Python continuously crawls and collects the non-standardized battery monitoring information stored on the designated electric bus monitoring website. This involves URL deduplication, data collection, extraction and cleaning, and the monitoring information is reorganized into an SQL database.
Finally, the battery database is analyzed and mined to generate a battery voltage-difference box-and-whisker plot, trend charts of battery pack total voltage and total current, and related statistical charts.
Further, the method specifically comprises the following steps:
S1: Design a load-balanced crawler data analysis system with a unified distributed entry point by combining open-source Docker containers with the Swarm cluster management tool.
S2: Set up the system on 5 blade servers, each configured with a dual-core Intel Xeon processor at 2.5 GHz, 16 GB of memory and a 100 GB hard disk, all running 64-bit Ubuntu 16.04 LTS as the host operating system.
S3: For the system network structure, build the Swarm container host cluster from four servers. The cluster elects a Master node in a distributed manner through the Raft protocol, and agents run on the remaining three Slave nodes to accept unified scheduling and management by the Master.
S4: Deploy a key-value Redis database on the Master node to store the deduplicated URLs during distributed crawling; another physical server stores the battery information database that the crawler produces from the unstructured data.
S5: Configure the Master and Slave nodes using the Swarm image, use the service discovery function by manually specifying the node IP addresses, start the Master and Slave nodes of the cluster in turn, and view the cluster information on the Master node.
S6: The system uses container technology with Swarm cluster management. A Dockerfile is used to build a custom crawler image that packages the project files and the base environment. After the image is uploaded to the cluster's image registry, a crawler container is created and run on the Master node, and the image is rapidly deployed to the three Slave nodes through a Swarm service, creating and starting 100 crawler containers to crawl in parallel.
S7: For the structured battery data table produced by crawling, the system converts the battery data in the SQL table to JSON through a back-end interface and uses the APIs and documentation provided by ECharts to design diverse data displays for battery analysis, including a voltage-difference box-and-whisker plot, trend charts of total voltage and total current, and a chart of voltage difference against current.
Further, S7 also includes S71: the system adopts a centralized scheduling strategy with a master control node and crawler nodes, overcoming the Scrapy framework's lack of built-in support for distributed crawling. The master control node assigns tasks to the crawler nodes according to task priority and load balancing; the crawler nodes fetch the corresponding data from the web pages at the given URLs; and distributed crawling with Scrapy is achieved through a global URL queue shared by the master and slave nodes.
Further, S7 also includes S72: a Bloom filter, which tests whether an element is already in a set, is used together with the Redis database to complete URL deduplication.
Further, S7 also includes S73: the system is configured to use XPath to parse the CSS style structure of the pages and obtain the battery data.
Further, S7 also includes S74: custom regular-expression functions extract the target information, cleaning the data field by field.
Further, S7 also includes S75: proxy server IP addresses are configured for the system to cope with anti-crawling measures. After URL deduplication, web page collection, extraction and cleaning, the useful battery data is stored on the system's dedicated physical database server.
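The extraction and cleaning described in S73 and S74 can be illustrated with a minimal sketch. The page fragment, the class names and the field labels below are hypothetical, since the text does not give the layout of the monitoring pages. The standard library's ElementTree, which supports only a limited XPath subset, stands in here for the XPath selectors a Scrapy spider would normally use (for example `response.xpath`).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment; real monitoring pages will be laid out differently.
PAGE = """
<table class="battery">
  <tr><td class="label">Total voltage</td><td class="value"> 537.6 V </td></tr>
  <tr><td class="label">Total current</td><td class="value">-12.4A</td></tr>
  <tr><td class="label">Voltage difference</td><td class="value">35 mV</td></tr>
</table>
"""

# Matches an optionally signed integer or decimal number.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def clean_number(raw):
    """Return the first number found in a raw cell string as a float, or None."""
    match = NUMBER.search(raw or "")
    return float(match.group()) if match else None


def extract_battery_fields(page):
    """Locate label/value cells with XPath-style queries and clean each value."""
    root = ET.fromstring(page.strip())
    record = {}
    for row in root.findall(".//tr"):
        label = row.find("./td[@class='label']")
        value = row.find("./td[@class='value']")
        if label is None or value is None:
            continue  # skip rows that do not follow the assumed layout
        record[label.text.strip()] = clean_number(value.text)
    return record


if __name__ == "__main__":
    print(extract_battery_fields(PAGE))
```

Separating the locating step (XPath) from the cleaning step (regular expression) means one cleaning function per field type can be reused across sites whose page layouts differ.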
Further, the Bloom filter of S72 can quickly determine whether an element exists in a set. The algorithm is as follows:
(1) First, prepare k hash functions, each of which hashes a URL to one integer.
(2) At initialization, allocate an array of n bits, with every bit set to 0.
(3) When a URL is added to the set, compute its k hash values with the k hash functions and set the corresponding bits in the array to 1.
(4) To test whether a URL is in the set, compute its k hash values with the k hash functions and look up the corresponding bits in the array; the URL is considered to be in the set only if all of those bits are 1.
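The four steps above can be sketched in a few lines of Python. This is an illustrative in-memory version, not the patented implementation: the text does not name the hash functions, so the k hash values are derived here by salting SHA-256 with the function index, which is an assumption, as are the class and method names.

```python
import hashlib


class BloomFilter:
    """In-memory Bloom filter over an n-bit array with k hash functions."""

    def __init__(self, n_bits, k):
        self.n_bits = n_bits
        self.k = k
        # Step (2): an n-bit array, every bit initialised to 0.
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, url):
        # Step (1): k hash functions, each mapping the URL to one integer.
        # Assumption: function i is SHA-256 of the URL salted with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, url):
        # Step (3): set the k corresponding bits to 1.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # Step (4): report membership only if all k bits are 1.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


if __name__ == "__main__":
    seen = BloomFilter(n_bits=1024, k=3)
    url = "http://example.com/bus/42"  # hypothetical URL
    print(url in seen)  # an empty filter never reports membership
    seen.add(url)
    print(url in seen)
```

A Bloom filter never misses a URL that was added, but it can occasionally report a URL as present when it was not, so a crawler using it may skip a small fraction of pages it has not actually visited.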
The technical solution addresses the following problem: because the battery monitoring data standards and information of the many large-scale online electric vehicle monitoring platforms are not unified, technical analysts find it hard to locate suitable, accurate information in the large volume of redundant battery monitoring information. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website is designed and implemented using web crawler technology on a distributed Docker container cluster architecture. First, a Docker cluster of several physical hosts is constructed using the Swarm container management tool. Then, the Scrapy crawler framework for Python performs continuous distributed crawling of the non-standardized battery monitoring information of the designated electric bus monitoring website, which involves URL deduplication, data collection, extraction and cleaning, and generates an SQL database of the monitoring information. Finally, the battery database is analyzed and mined to generate a battery voltage-difference box-and-whisker plot, trend charts of battery pack total voltage and total current, and other statistical charts, giving technical analysts an intuitive reference on battery operating state.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a trend chart of battery voltage difference against current generated by the present invention;
FIG. 3 is a chart of the total battery voltage distribution generated by the present invention;
FIG. 4 is a frequency plot of the distribution of total current against voltage difference for a battery pack generated by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
With the rapid development of internet technology, the monitoring and analysis of power batteries in new energy vehicles has become centralized on the internet, and technical analysts now obtain their measured data through network data collection. Vehicle manufacturers store the data collected by the monitoring modules installed on electric vehicles on their respective websites, and new energy practitioners and government regulators can export battery data from these websites for analysis to judge the running performance of the batteries. However, the monitoring websites of different vehicle manufacturers are mostly inconsistent in the information they hold, which is large in volume, complicated and irregularly changing, leaving regulators and technical analysts unable to obtain the information they need. To address this problem, the embodiment of the invention implements a system that collects and analyzes battery information from electric vehicle remote monitoring websites based on web crawler technology. The system relies on two main technologies, distributed Docker container clustering and web crawling, and uses the Scrapy crawler framework for Python to continuously collect, extract and clean unstructured battery information from the designated large-scale electric vehicle monitoring website, producing a new structured battery information database and statistical charts that display the data visually.
In the method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website, a Docker cluster of several physical hosts is first constructed using the Swarm container management tool. Then, the Scrapy crawler framework for Python performs continuous distributed crawling of the non-standardized battery monitoring information of the designated electric bus monitoring website, which involves URL deduplication, data collection, extraction and cleaning, and generates an SQL database of the monitoring information. Finally, the battery database is analyzed and mined to generate a battery voltage-difference box-and-whisker plot, trend charts of battery pack total voltage and total current, and other statistical charts, giving technical analysts an intuitive reference on battery operating state.
As shown in fig. 1, the method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website described in this embodiment includes the following specific steps:
S1: Combine open-source Docker containers with the Swarm cluster management tool to construct a load-balanced crawler data analysis system with a unified distributed entry point that can be started quickly.
S2: Set up the system on 5 blade servers, each configured with a dual-core Intel Xeon processor at 2.5 GHz, 16 GB of memory and a 100 GB hard disk, all running 64-bit Ubuntu 16.04 LTS as the host operating system.
S3: Design the system network structure by building the Swarm (Version 2) container host cluster from four servers. The cluster elects a Master node in a distributed manner through the Raft protocol, and agents run on the remaining three Slave nodes to accept unified management by the Master.
S4: Install a key-value Redis database on the Master node to store the deduplicated URLs during distributed crawling, and use another physical server to store the battery information database that the crawler produces from the unstructured data.
S5: Configure the Master and Slave nodes using the Swarm image, use the service discovery function by manually specifying the node IP addresses, start the Master and Slave nodes of the cluster in turn, and view the cluster information on the Master node.
S6: The system uses container technology with Swarm cluster management. A Dockerfile is used to build a custom crawler image that packages the project files and the base environment, and the image is uploaded to the cluster's image registry. A crawler container is created and run on the Master node, and the image is rapidly deployed to the three Slave nodes through a Swarm service, creating and starting 100 crawler containers to crawl in parallel.
S7: The system adopts a centralized scheduling strategy with a master control node and crawler nodes, overcoming the Scrapy framework's lack of built-in support for distributed crawling. The master control node assigns tasks to the crawler nodes according to task priority and load balancing; the crawler nodes fetch the corresponding data from the web pages at the given URLs; and distributed crawling with Scrapy is achieved through a global URL queue shared by the master and slave nodes.
S8: At the same time, a Bloom filter is combined with the Redis database to complete URL deduplication.
S9: The system is configured to use XPath to parse the CSS style structure of the pages and obtain the battery data.
S10: Custom regular-expression functions extract the target information, cleaning the data field by field.
S11: Proxy server IP addresses are configured for the system to cope with anti-crawling measures. After URL deduplication, web page collection, extraction and cleaning, the useful battery data is stored on the system's dedicated physical database server.
S12: For the structured battery data table produced by crawling, the system converts the battery data in the SQL table to JSON through a back-end interface and uses the APIs and documentation provided by ECharts to design diverse data displays, including a battery voltage-difference box-and-whisker plot, trend charts of total voltage and total current, and a chart of voltage difference against current.
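The SQL-to-JSON conversion in S12 might look like the following sketch. The table name, column names, JSON keys and sample rows are all hypothetical, since the text does not specify the database schema or the back-end interface, and an in-memory SQLite database stands in for the system's database server. The five-number summary is ordered as minimum, first quartile, median, third quartile, maximum, which is the item layout an ECharts boxplot series accepts.

```python
import json
import sqlite3
import statistics


def boxplot_summary(values):
    """Return [min, Q1, median, Q3, max] for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return [data[0], q1, median, q3, data[-1]]


def build_chart_payload(conn):
    """Read the battery table and serialise chart-ready series as JSON."""
    rows = conn.execute(
        "SELECT ts, total_voltage, total_current, voltage_diff "
        "FROM battery ORDER BY ts"
    ).fetchall()
    return json.dumps({
        "timestamps": [r[0] for r in rows],
        "totalVoltage": [r[1] for r in rows],    # trend chart series
        "totalCurrent": [r[2] for r in rows],    # trend chart series
        "voltageDiffBox": boxplot_summary([r[3] for r in rows]),
    })


def demo_connection():
    """Build an in-memory table with made-up sample rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE battery ("
        "ts TEXT, total_voltage REAL, total_current REAL, voltage_diff REAL)"
    )
    conn.executemany(
        "INSERT INTO battery VALUES (?, ?, ?, ?)",
        [
            ("2021-03-30 08:00", 536.0, -10.0, 10.0),
            ("2021-03-30 08:01", 537.0, -12.0, 20.0),
            ("2021-03-30 08:02", 538.0, -11.0, 30.0),
            ("2021-03-30 08:03", 537.5, -13.0, 40.0),
            ("2021-03-30 08:04", 536.5, -9.0, 50.0),
        ],
    )
    return conn


if __name__ == "__main__":
    print(build_chart_payload(demo_connection()))
```

Computing the box-plot summary on the server keeps the JSON payload small, since the browser receives five numbers rather than every voltage-difference sample.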
The Bloom filter of S8 can quickly determine whether an element exists in a set and greatly reduces the storage space needed for the URL queue. The algorithm is as follows:
(1) First, prepare k hash functions, each of which hashes a URL to one integer.
(2) At initialization, allocate an array of n bits, with every bit set to 0.
(3) When a URL is added to the set, compute its k hash values with the k hash functions and set the corresponding bits in the array to 1.
(4) To test whether a URL is in the set, compute its k hash values with the k hash functions and look up the corresponding bits in the array; the URL is considered to be in the set only if all of those bits are 1.
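The space saving comes at the cost of occasional false positives, because step (4) can find all k bits set by other URLs. As a standard approximation, not stated in the original text, with an n-bit array, k hash functions and m URLs already added (m is a symbol introduced here), the false-positive probability p is roughly:

```latex
p \approx \left(1 - e^{-km/n}\right)^{k},
\qquad
k_{\text{opt}} = \frac{n}{m}\,\ln 2
```

The second expression gives the number of hash functions that minimizes p for a given n and m; in practice k is rounded to a nearby integer.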
The system uses the Redis BitMap as the underlying bit array of the Bloom filter, storing the hash values of the URLs of crawled web pages to complete URL deduplication. Redis is an in-memory NoSQL database (storing key-value data) that supports fast lookups and can quickly set the bit corresponding to a hash value to 1. FIGS. 2-4 show charts generated by the present invention.
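A sketch of how the Bloom filter bits could be mapped onto a Redis BitMap follows. It assumes a client object with `setbit(name, offset, value)` and `getbit(name, offset)` methods, as the redis-py client provides; the key name, the sizes and the salted SHA-256 hashing are illustrative assumptions. So that the example runs without a Redis server, a small in-memory stand-in with the same two methods is included.

```python
import hashlib


class InMemoryBits:
    """Offline stand-in that mimics the setbit/getbit calls of a Redis client."""

    def __init__(self):
        self._bits = {}

    def setbit(self, name, offset, value):
        # Like Redis SETBIT, return the bit's previous value.
        previous = self._bits.get((name, offset), 0)
        self._bits[(name, offset)] = value
        return previous

    def getbit(self, name, offset):
        return self._bits.get((name, offset), 0)


class RedisUrlDeduplicator:
    """Bloom-filter URL deduplication backed by a bitmap stored under one key."""

    def __init__(self, client, key="crawler:seen_urls", n_bits=1 << 20, k=4):
        self.client = client
        self.key = key          # hypothetical key name
        self.n_bits = n_bits
        self.k = k

    def _offsets(self, url):
        # Assumption: hash function i is SHA-256 of the URL salted with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def seen_before(self, url):
        """Mark the URL as seen; return True if it was (probably) seen earlier."""
        # SETBIT returns the old bit, so one pass both tests and records the URL.
        previous = [self.client.setbit(self.key, off, 1) for off in self._offsets(url)]
        return all(previous)


if __name__ == "__main__":
    dedup = RedisUrlDeduplicator(InMemoryBits())
    url = "http://example.com/bus/42"  # hypothetical URL
    print(dedup.seen_before(url))
    print(dedup.seen_before(url))
```

With a running Redis server, the stand-in could be replaced by a redis-py client such as `redis.Redis(host=..., port=6379)`, so that every crawler container shares the same bitmap on the Master node.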
In summary, because the battery monitoring data standards and information of the many large-scale online electric vehicle monitoring platforms are not unified, technical analysts find it hard to locate suitable, accurate information in the large volume of redundant battery monitoring information. The embodiment of the invention designs and implements continuous crawling, analysis and display of battery data using web crawler technology on a distributed Docker container cluster architecture. First, a Docker cluster of several physical hosts is constructed using the Swarm container management tool. Then, the Scrapy crawler framework for Python performs continuous distributed crawling of the non-standardized battery monitoring information of the designated electric bus monitoring website, which involves URL deduplication, data collection, extraction and cleaning, and generates an SQL database of the monitoring information. Finally, the battery database is analyzed and mined to generate a battery voltage-difference box-and-whisker plot, trend charts of battery pack total voltage and total current, and other statistical charts, giving technical analysts an intuitive reference on battery operating state.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (8)
1. A method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website, characterized by comprising the following steps:
first, constructing a Docker cluster of several physical hosts using the Swarm container management tool;
then, using the Scrapy crawler framework for Python to continuously crawl and collect the non-standardized battery monitoring information stored on the designated electric bus monitoring website, completing URL deduplication, data collection, extraction and cleaning, and reorganizing the monitoring information into an SQL database;
and finally, analyzing and mining the battery database to generate a battery voltage-difference box-and-whisker plot, trend charts of battery pack total voltage and total current, and related statistical charts.
2. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website according to claim 1, characterized by comprising the following steps:
S1: designing a load-balanced crawler data analysis system with a unified distributed entry point by combining open-source Docker containers with the Swarm cluster management tool;
S2: setting up the system on 5 blade servers, each configured with a dual-core Intel Xeon processor at 2.5 GHz, 16 GB of memory and a 100 GB hard disk, all running 64-bit Ubuntu 16.04 LTS as the host operating system;
S3: for the system network structure, building the Swarm container host cluster from four servers, wherein the cluster elects a Master node in a distributed manner through the Raft protocol, and agents run on the remaining three Slave nodes to accept unified scheduling and management by the Master;
S4: deploying a key-value Redis database on the Master node to store the deduplicated URLs during distributed crawling, with another physical server storing the battery information database that the crawler produces from the unstructured data;
S5: configuring the Master and Slave nodes using the Swarm image, using the service discovery function by manually specifying the node IP addresses, starting the Master and Slave nodes of the cluster in turn, and viewing the cluster information on the Master node;
S6: using container technology with Swarm cluster management, wherein a Dockerfile is used to build a custom crawler image that packages the project files and the base environment; after the image is uploaded to the cluster's image registry, a crawler container is created and run on the Master node, and the image is rapidly deployed to the three Slave nodes through a Swarm service, creating and starting 100 crawler containers to crawl in parallel;
S7: for the structured battery data table produced by crawling, converting the battery data in the SQL table to JSON through a back-end interface and using the APIs and documentation provided by ECharts to design diverse data displays for battery analysis, including a voltage-difference box-and-whisker plot, trend charts of total voltage and total current, and a chart of voltage difference against current.
3. The method for continuously crawling and analyzing battery data of cluster electric bus monitoring website as claimed in claim 2, wherein said S7 further comprises S71: the system adopts a centralized scheduling strategy of a master control node and a crawler node, and overcomes the defect that a Scapy framework does not support distribution; the main control node distributes tasks to the crawler nodes according to task priority and a load balancing principle, the crawler nodes are responsible for capturing corresponding data from the URL webpage, and distributed crawling of script is achieved through a global URL queue shared by the main node and the slave node.
4. The cluster electric bus monitoring website battery data continuous crawling and analyzing method according to claim 2, characterized by comprising the following steps: the S7 further includes S72: and simultaneously, a bloom filter is adopted to judge whether the important elements are in a collection sequence and a Redis database is adopted to finish URL address deduplication.
5. The cluster electric bus monitoring website battery data continuous crawling and analyzing method according to claim 2, characterized by comprising the following steps: the S7 further includes S73: the setting system analyzes and acquires the CSS style structure of the battery data by adopting an xpath method.
6. The cluster electric bus monitoring website battery data continuous crawling and analyzing method according to claim 2, characterized by comprising the following steps: the S7 further includes S74: the corresponding target information is extracted through custom regular-expression functions, so that data cleaning is carried out for the different fields.
7. The cluster electric bus monitoring website battery data continuous crawling and analyzing method according to claim 2, characterized by comprising the following steps: the S7 further includes S75: a proxy server IP address is configured for the system to cope with anti-crawling techniques; after a series of processing steps such as URL deduplication, web page collection, extraction, and cleaning, the useful battery data are stored in a physical database server dedicated to the system.
8. The method for continuously crawling and analyzing battery data of cluster electric bus monitoring website according to claim 4, wherein the Bloom filter of S72 can quickly determine whether an element exists in a set, and the algorithm is as follows:
(1) first, k hash functions are prepared, each of which hashes a URL to one integer;
(2) at initialization, an array of n bits is allocated, and every bit is initialized to 0;
(3) when a URL is added to the set, k hash values are computed using the k hash functions, and the corresponding bits in the array are set to 1;
(4) when judging whether a URL is in the set, k hash values are computed using the k hash functions and the corresponding bits in the array are queried; if all of those bits are 1, the URL is considered to be in the set.
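Steps (1) to (4) of the Bloom filter can be illustrated with a minimal Python sketch. It is not the implementation of the claimed system: the class name `BloomFilter`, the helper `should_crawl`, the salted SHA-256 construction of the k hash functions, and the example URLs are all assumptions made for illustration, and the bit array is kept in local memory rather than in Redis.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Array of n bits probed by k hash functions, following steps (1)-(4)."""

    def __init__(self, n_bits: int, k: int) -> None:
        if n_bits <= 0 or k <= 0:
            raise ValueError("n_bits and k must be positive")
        self.n_bits = n_bits
        self.k = k
        # Step (2): n bits, all initialized to 0 (packed eight per byte).
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, url: str) -> Iterator[int]:
        # Step (1): k hash functions, each mapping the URL to one integer.
        # Salting SHA-256 with the function index is an assumption of this
        # sketch, not something specified in the claims.
        data = url.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, url: str) -> None:
        # Step (3): set the k corresponding bits to 1.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # Step (4): the URL is considered present only if all k bits are 1.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


def should_crawl(seen: BloomFilter, url: str) -> bool:
    """Hypothetical helper: return True the first time a URL is offered."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

A Bloom filter never reports an added URL as absent, but it can occasionally report an unseen URL as present, so in a deduplication role a small fraction of new URLs may be skipped; larger n and a suitable k reduce that rate.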
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110339920.6A CN112948660A (en) | 2021-03-30 | 2021-03-30 | Cluster electric bus monitoring website battery data continuous crawling and analyzing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110339920.6A CN112948660A (en) | 2021-03-30 | 2021-03-30 | Cluster electric bus monitoring website battery data continuous crawling and analyzing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112948660A true CN112948660A (en) | 2021-06-11 |
Family
ID=76230477
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110339920.6A Pending CN112948660A (en) | 2021-03-30 | 2021-03-30 | Cluster electric bus monitoring website battery data continuous crawling and analyzing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112948660A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113486106A (en) * | 2021-07-30 | 2021-10-08 | 西安西热电站信息技术有限公司 | Python method for acquiring SIS or supervisory system data and analyzing big data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107045455A (en) * | 2017-06-19 | 2017-08-15 | 华中科技大学 | A kind of Docker Swarm cluster resource method for optimizing scheduling based on load estimation |
CN107071009A (en) * | 2017-03-28 | 2017-08-18 | 江苏飞搏软件股份有限公司 | A kind of distributed big data crawler system of load balancing |
CN109948079A (en) * | 2019-03-11 | 2019-06-28 | 湖南衍金征信数据服务有限公司 | A kind of method that distributed capture discloses page data |
- 2021-03-30 CN CN202110339920.6A patent/CN112948660A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107071009A (en) * | 2017-03-28 | 2017-08-18 | 江苏飞搏软件股份有限公司 | A kind of distributed big data crawler system of load balancing |
CN107045455A (en) * | 2017-06-19 | 2017-08-15 | 华中科技大学 | A kind of Docker Swarm cluster resource method for optimizing scheduling based on load estimation |
CN109948079A (en) * | 2019-03-11 | 2019-06-28 | 湖南衍金征信数据服务有限公司 | A kind of method that distributed capture discloses page data |
Non-Patent Citations (4)
Title |
---|
方奇洲等: "基于Docker容器的分布式爬虫的设计与实现", 《电子设计工程》, vol. 28, no. 08, 20 April 2020 (2020-04-20), pages 62 - 65 * |
杨力: "布隆算法在网络爬虫中的应用", 《电子世界》, 28 February 2019 (2019-02-28), pages 156 * |
欧阳桂秀: "使用 Docker Swarm 搭建集群", 《福建电脑》, vol. 35, no. 11, 30 November 2019 (2019-11-30), pages 66 * |
赵天瑞: "基于微服务架构的铅酸蓄电池在线监测系统的设计与实现", 《万方学术期刊数据库》, 29 March 2021 (2021-03-29), pages 3 - 20 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11816316B2 (en) | Event identification based on cells associated with aggregated metrics | |
US11086289B2 (en) | Control interface for metric definition specification for assets driven by search-derived asset tree hierarchy | |
CN109272155B (en) | Enterprise behavior analysis system based on big data | |
US10243818B2 (en) | User interface that provides a proactive monitoring tree with state distribution ring | |
US10579627B2 (en) | Database operation using metadata of data sources | |
Li et al. | A spatiotemporal indexing approach for efficient processing of big array-based climate data with MapReduce | |
Gil et al. | Review of the complexity of managing big data of the internet of things | |
CN110633186A (en) | Log monitoring system for electric power metering micro-service architecture and implementation method | |
US10970298B1 (en) | Control interface for disparate search frequency dispatch for dynamic elements of an asset monitoring and reporting system | |
US10268755B2 (en) | Systems and methods for providing dynamic indexer discovery | |
Ghallab et al. | Detection outliers on internet of things using big data technology | |
US11687219B2 (en) | Statistics chart row mode drill down | |
US11663172B2 (en) | Cascading payload replication | |
CN104486116A (en) | Multidimensional query method and multidimensional query system of flow data | |
CN106294805A (en) | Data processing method and device | |
CN106599190A (en) | Dynamic Skyline query method based on cloud computing | |
CN115827907A (en) | Cross-cloud multi-source data cube discovery and integration method based on distributed memory | |
CN106682206A (en) | Method and system for big data processing | |
CN106599189A (en) | Dynamic Skyline inquiry device based on cloud computing | |
CN112948660A (en) | Cluster electric bus monitoring website battery data continuous crawling and analyzing method | |
Honest et al. | A survey of big data analytics | |
CN113127526A (en) | Distributed data storage and retrieval system based on Kubernetes | |
CN112783977A (en) | Mass data search implementation method based on big data | |
CN116703526A (en) | Article recommendation method, device, equipment and storage medium | |
Yin et al. | Content‐Based Image Retrial Based on Hadoop |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210611 |