CN112948660A

CN112948660A - Cluster electric bus monitoring website battery data continuous crawling and analyzing method

Info

Publication number: CN112948660A
Application number: CN202110339920.6A
Authority: CN
Inventors: 单毅; 胡攀攀
Original assignee: Hefei Guoxuan High Tech Power Energy Co Ltd
Current assignee: Hefei Gotion High Tech Power Energy Co Ltd
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2021-06-11

Abstract

The invention relates to a method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website. The method comprises: first, constructing a Docker cluster of a plurality of physical hosts by using the Swarm container management tool; then, continuously crawling the non-standardized battery monitoring data of the designated electric bus monitoring website by using the Scrapy crawler library for Python; and finally, analyzing and mining the resulting battery database to generate a battery voltage-difference box-and-whisker plot, a trend graph of battery pack total voltage and total current, and related visualized statistical charts. The method mainly relies on two technologies, distributed Docker container clustering and web crawling, and uses the Python Scrapy crawler framework to continuously acquire, extract and clean the unstructured battery information of a designated large-scale electric vehicle monitoring website, so that a new structured battery information database is generated, statistical charts that display the data visually are produced, and technical analysts are given an intuitive reference for the running state of the batteries.

Description

Cluster electric bus monitoring website battery data continuous crawling and analyzing method

Technical Field

The invention relates to the technical field of remote monitoring and analysis of power batteries in the field of electric automobiles, in particular to a method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website.

Background

Web crawlers, also known as web spiders, collect customized web page information from the internet according to programmed logic. The key technology of a web crawler has two parts: first, downloading internet web pages (including links) through a depth-first or breadth-first strategy; second, extracting and cleaning the useful data through web page parsing and storing it in a database. Scrapy is a high-performance crawler framework written in Python; it has a clear structure, provides various middleware interfaces, and can meet a variety of customized user requirements.

Docker, an open-source container engine, can quickly and easily create a portable, lightweight container for a specific application. It allows the system image and the related libraries to be packaged into one image, without separately installing an operating system and the application's dependency packages, thereby greatly reducing environment deployment time. Compared with a VM cluster, a Docker cluster uses host machine resources more reasonably and efficiently. Swarm, the officially provided Docker cluster management tool, offers a unified entry point and can start a distributed crawler deployment quickly. It abstracts a plurality of Docker hosts into a single whole, can elastically scale the Docker hosts to achieve load balancing, and can even achieve horizontal scaling through container orchestration.

ECharts is a pure-JavaScript charting library open-sourced by Baidu that runs smoothly in various browsers on both PCs and mobile terminals. ECharts provides rich APIs and documentation, and diversified data displays can be designed by suitably configuring and combining the JSON data sent from the back end. It is convenient to configure, offers rich chart types and light data transmission, and is well suited to the visual analysis of information data.

Disclosure of Invention

The invention provides a method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website.

To achieve this purpose, the invention adopts the following technical scheme: a method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website, comprising the following steps:

first, constructing a Docker cluster of a plurality of physical hosts by using the Swarm container management tool;

then, continuously crawling and collecting the non-standardized battery monitoring information stored in the designated electric bus monitoring website by using the Scrapy crawler library for Python, which involves URL address deduplication, data acquisition, extraction and cleaning, and generating an SQL database of the monitoring information;

and finally, analyzing and mining the battery database to generate a battery voltage-difference box-and-whisker plot, a trend graph of battery pack total voltage and total current, and related visualized statistical charts.

Further, the method specifically includes the following steps:

S1: an open-source Docker container is combined with the Swarm cluster management tool to design a load-balanced crawler data analysis system with a distributed unified entry point;

S2: five blade servers are used to set up the system servers; each server is configured with a dual-core Intel Xeon processor with a clock frequency of 2.5 GHz, 16 GB of memory and a 100 GB hard disk, and every host runs 64-bit Ubuntu 16.04 LTS;

S3: in the system network design, four servers are used to build the Swarm container host cluster; the cluster elects a Master node in a distributed manner through the Raft protocol, and agents run on the remaining three Slave nodes to accept unified scheduling and management by the Master;

S4: a key-value Redis database is deployed on the Master node to store the deduplicated URL addresses of the distributed crawling process, and the remaining physical server stores the structured battery information database produced by the crawler;

S5: the Master and Slave nodes are configured from the Swarm image, the service discovery function is used with manually specified node IP addresses, the Master and Slave nodes in the cluster are started in sequence, and the cluster information is viewed on the Master node;

S6: the system adopts container technology under Swarm cluster management and uses a Dockerfile to build a custom crawler image that packages the project files and the base environment. After the image is uploaded to the cluster's image registry, a crawler container is created and run on the Master node, the image is rapidly deployed to the three Slave nodes through a Swarm service, and 100 crawler containers are created and started to carry out parallel crawling;

S7: for the structured battery data tables collected by crawling, the system converts the battery data in the SQL tables into JSON through a back-end interface and, using the APIs and documentation provided by ECharts, designs diversified data displays for battery analysis, including a voltage-difference box-and-whisker plot, a trend graph of total voltage and total current, and a graph of the correspondence between voltage difference and current.
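The conversion in S7 from SQL rows to chart-ready JSON can be sketched roughly as follows. This is only an illustration under stated assumptions: the table `cell_voltage`, its columns and all function names are hypothetical (the source gives no schema), SQLite stands in for the SQL database, and the voltage difference is taken as the highest minus the lowest cell voltage in each sample. The five-number layout `[min, Q1, median, Q3, max]` is the item format of an ECharts `boxplot` series.

```python
import json
import sqlite3
import statistics


def demo_connection():
    """Build an in-memory table of per-cell voltages (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cell_voltage (sample_time TEXT, cell_no INTEGER, voltage REAL)"
    )
    samples = {"t1": 3.31, "t2": 3.32, "t3": 3.33, "t4": 3.34, "t5": 3.35}
    for sample_time, highest in samples.items():
        # Two cells per sample; the lowest cell is fixed at 3.30 V.
        conn.executemany(
            "INSERT INTO cell_voltage VALUES (?, ?, ?)",
            [(sample_time, 1, 3.30), (sample_time, 2, highest)],
        )
    return conn


def voltage_differences(conn):
    """Return one voltage difference (max cell - min cell) per sample time."""
    rows = conn.execute(
        "SELECT sample_time, MAX(voltage) - MIN(voltage) "
        "FROM cell_voltage GROUP BY sample_time ORDER BY sample_time"
    ).fetchall()
    # Round to millivolts to hide floating-point noise from the subtraction.
    return [round(diff, 3) for _, diff in rows]


def five_number_summary(values):
    """Return [min, Q1, median, Q3, max] for a list of at least two values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [min(values), q1, median, q3, max(values)]


def boxplot_option_json(values):
    """Serialize a minimal ECharts option containing one boxplot item."""
    option = {
        "xAxis": {"type": "category", "data": ["pack 1"]},
        "yAxis": {"type": "value", "name": "voltage difference (V)"},
        "series": [{"type": "boxplot", "data": [five_number_summary(values)]}],
    }
    return json.dumps(option)
```

A back-end interface could return the string from `boxplot_option_json(voltage_differences(conn))` to the browser, where it would be passed to the chart's `setOption` call.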

Further, the S7 further includes S71: the system adopts a centralized scheduling strategy with a master control node and crawler nodes, which overcomes the limitation that the Scrapy framework does not support distributed operation by itself; the master control node distributes tasks to the crawler nodes according to task priority and a load-balancing principle, the crawler nodes are responsible for fetching the corresponding data from the web pages at the given URLs, and distributed crawling with Scrapy is achieved through a global URL queue shared by the master node and the slave nodes.
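A minimal sketch of the scheduling idea in S71 is given below. It assumes that a lower number means a higher priority and that "load balancing" means choosing the node with the fewest open tasks; neither detail is stated in the source. The class and method names are hypothetical, and an in-process heap stands in for the shared global URL queue.

```python
import heapq


class MasterScheduler:
    """Hands out URLs by priority to the least-loaded crawler node."""

    def __init__(self, node_names):
        self._queue = []      # global URL queue, kept as a min-heap
        self._counter = 0     # tie-breaker: equal priorities stay first-in, first-out
        self.load = {name: 0 for name in node_names}

    def add_url(self, url, priority=0):
        # Assumption: a lower number means a higher priority.
        heapq.heappush(self._queue, (priority, self._counter, url))
        self._counter += 1

    def dispatch(self):
        """Assign the most urgent URL to the node with the fewest open tasks."""
        if not self._queue:
            return None
        _, _, url = heapq.heappop(self._queue)
        node = min(self.load, key=lambda name: (self.load[name], name))
        self.load[node] += 1
        return node, url

    def task_done(self, node):
        """Called when a crawler node reports that it finished a task."""
        self.load[node] -= 1
```

Each crawler node would call back with `task_done` after fetching its page, so that the master's view of the load stays current.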

Further, the S7 further includes S72: at the same time, a Bloom filter is used to test whether an element is already in the set, and is combined with the Redis database to complete URL address deduplication.

Further, the S7 further includes S73: the system parses the CSS style structure of the page with XPath to locate and obtain the battery data.
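The XPath step of S73 might look roughly like the sketch below. The page fragment, the class names and the function name are invented for illustration, since the source does not describe the monitoring website's markup. The standard-library `xml.etree.ElementTree` is used so that the example runs on its own; it supports only a subset of XPath and requires well-formed markup, whereas a Scrapy spider would typically evaluate such expressions with `response.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment of a monitoring page.
SAMPLE_PAGE = """
<html><body>
  <table id="battery">
    <tr class="cell"><td class="no">1</td><td class="voltage">3.652 V</td></tr>
    <tr class="cell"><td class="no">2</td><td class="voltage">3.648 V</td></tr>
  </table>
  <span class="total-current">-12.5 A</span>
</body></html>
"""


def extract_raw_fields(page_text):
    """Locate battery fields by element structure and class attribute."""
    root = ET.fromstring(page_text.strip())
    # Every voltage cell inside a row marked class="cell".
    voltage_cells = root.findall(".//tr[@class='cell']/td[@class='voltage']")
    current = root.find(".//span[@class='total-current']")
    return {
        "cell_voltages": [td.text for td in voltage_cells],
        "total_current": current.text if current is not None else None,
    }
```

The values returned here are still raw strings with units attached; turning them into numbers is left to the cleaning step.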

Further, the S7 further includes S74: the corresponding target information is extracted through custom regular-expression functions to clean the data of the different fields.
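As one possible reading of S74, the sketch below keeps one regular-expression cleaner per field. The field names, the input formats and the function names are assumptions made for the example; the source does not list the actual fields or patterns.

```python
import re

_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")
_TIMESTAMP = re.compile(
    r"(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\s+(\d{1,2}):(\d{2}):(\d{2})"
)


def clean_number(raw):
    """Return the first number found in raw as a float, or None."""
    match = _NUMBER.search(raw.replace(",", ""))
    return float(match.group()) if match else None


def clean_timestamp(raw):
    """Normalize a date-time string to 'YYYY-MM-DD HH:MM:SS', or None."""
    match = _TIMESTAMP.search(raw)
    if not match:
        return None
    year, month, day, hour, minute, second = (int(p) for p in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d} {hour:02d}:{minute:02d}:{second:02d}"


# One cleaner per field; fields without a cleaner are dropped.
CLEANERS = {
    "total_voltage": clean_number,
    "total_current": clean_number,
    "sample_time": clean_timestamp,
}


def clean_record(raw_record):
    """Apply the field-specific cleaner to every known field."""
    return {
        field: CLEANERS[field](value)
        for field, value in raw_record.items()
        if field in CLEANERS
    }
```

For example, `clean_record({"total_current": " -12.5A "})` yields `{"total_current": -12.5}`.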

Further, the S7 further includes S75: proxy server IP addresses are configured for the system to cope with anti-crawling measures; after a series of processing steps such as URL address deduplication, web page collection, extraction and cleaning, the useful battery data is stored in the system's dedicated physical database server.
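The source does not say how the proxy addresses of S75 are applied. One common arrangement, shown here only as an assumption, is a Scrapy downloader middleware that sets `request.meta["proxy"]`, which Scrapy's built-in HTTP proxy support then honors. The class name is hypothetical and the proxy addresses are placeholders from a documentation-only IP range. The class does not import Scrapy, so the sketch runs on its own.

```python
import itertools


class RotatingProxyMiddleware:
    """Downloader-middleware-style class that rotates through proxy addresses."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy address is required")
        self._cycle = itertools.cycle(proxies)

    def process_request(self, request, spider):
        # Route this request through the next proxy in the rotation.
        request.meta["proxy"] = next(self._cycle)
        return None  # None lets the request continue through the download chain


# Placeholder addresses (192.0.2.0/24 is reserved for documentation).
EXAMPLE_PROXIES = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]
```

In a Scrapy project such a class would be enabled through the `DOWNLOADER_MIDDLEWARES` setting.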

Further, the Bloom filter of S72 can quickly determine whether an element exists in a set. The algorithm is as follows:

(1) first, k hash functions are prepared, each of which hashes a URL into one integer;

(2) at initialization, an array of n bits is allocated, and every bit is set to 0;

(3) when a URL is added to the set, k hash values are computed with the k hash functions, and the corresponding bits in the array are set to 1;

(4) to test whether a URL is in the set, k hash values are computed with the k hash functions and the corresponding bits in the array are checked; the URL is considered to be in the set only if all of these bits are 1.
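Steps (1) to (4) can be sketched in a few lines of Python. The source does not name the hash functions, so this example derives k positions by hashing the URL with SHA-256 under k different seed prefixes; that choice, and the class and method names, are assumptions.

```python
import hashlib


class BloomFilter:
    """n-bit array with k hash functions, following steps (1) to (4)."""

    def __init__(self, n_bits, k_hashes):
        self.n_bits = n_bits
        self.k_hashes = k_hashes
        # Step (2): an array of n bits, all initialized to 0.
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, url):
        # Step (1): k hash functions, each mapping the URL to one integer.
        for seed in range(self.k_hashes):
            digest = hashlib.sha256(f"{seed}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, url):
        # Step (3): set the k corresponding bits to 1.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # Step (4): the URL counts as present only if all k bits are 1.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )
```

A Bloom filter never reports an added URL as absent, but it can occasionally report an unseen URL as present, so a small fraction of pages may be skipped; larger n and a suitable k reduce that rate.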

According to the above technical scheme: because the battery monitoring data standards and information of many large-scale online electric vehicle monitoring platforms are not unified, it is difficult for technical analysts to find appropriate and accurate information in the abundant and redundant battery monitoring data. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website is designed and implemented using web crawler technology under a distributed Docker container cluster architecture. First, a Docker cluster of a plurality of physical hosts is constructed by using the Swarm container management tool; then, the Scrapy crawler library for Python performs continuous distributed crawling of the non-standardized battery monitoring information of the designated electric bus monitoring website, which involves URL address deduplication, data acquisition, extraction and cleaning, and generates an SQL database of the monitoring information; finally, the battery database is analyzed and mined to generate a battery voltage-difference box-and-whisker plot, a trend graph of battery pack total voltage and total current, and other visualized statistical charts, giving technical analysts an intuitive reference for the battery operating state.

Drawings

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is a trend graph of battery voltage difference versus current generated by the present invention;

FIG. 3 is a distribution graph of total battery voltage generated by the present invention;

FIG. 4 is a frequency distribution plot of battery pack total current versus voltage difference generated by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.

With the rapid development of internet technology, the monitoring and analysis of power batteries in new energy vehicles has become centralized on the internet, and technical analysts increasingly obtain their measured data through network data acquisition. Vehicle manufacturers store the data collected by the monitoring modules installed on electric vehicles on their corresponding websites, and new energy practitioners and government supervision departments can export battery data from these websites for analysis to judge the running performance of the batteries. However, the monitoring websites of different vehicle manufacturers are mostly inconsistent with one another, and their information is voluminous, complicated and irregularly changing, so that supervision departments and technical analysts have difficulty obtaining the information they need. To address this problem, the embodiment of the invention implements a system for aggregating and analyzing the battery information of electric vehicle remote monitoring websites based on web crawler technology. The system mainly relies on two technologies, distributed Docker container clustering and web crawling, and uses the Python Scrapy crawler framework to continuously acquire, extract and clean the unstructured battery information of a designated large-scale electric vehicle monitoring website, so that a new structured battery information database is generated and the data is rendered as visualized statistical charts.

In the method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website, a Docker cluster of a plurality of physical hosts is first constructed by using the Swarm container management tool; then, the Scrapy crawler library for Python performs continuous distributed crawling of the non-standardized battery monitoring information of the designated electric bus monitoring website, which involves URL address deduplication, data acquisition, extraction and cleaning, and generates an SQL database of the monitoring information; finally, the battery database is analyzed and mined to generate a battery voltage-difference box-and-whisker plot, a trend graph of battery pack total voltage and total current, and other visualized statistical charts, giving technical analysts an intuitive reference for the battery operating state.

As shown in fig. 1, the method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website described in this embodiment includes the following specific steps:

S1: an open-source Docker container is combined with the Swarm cluster management tool to construct a load-balanced crawler data analysis system with a distributed unified entry point that can be started quickly.

S2: five blade servers are used to set up the system servers; each server is configured with a dual-core Intel Xeon processor with a clock frequency of 2.5 GHz, 16 GB of memory and a 100 GB hard disk, and every host runs 64-bit Ubuntu 16.04 LTS.

S3: in the system network design, four servers are used to build the Swarm (Version 2) container host cluster; the cluster elects a Master node in a distributed manner through the Raft protocol, and agents run on the remaining three Slave nodes to accept unified management by the Master.

S4: a key-value Redis database is installed on the Master node to store the deduplicated URL addresses of the distributed crawling process, and another physical server stores the structured battery information database produced by the crawler.

S5: the Master and Slave nodes are configured from the Swarm image, the service discovery function is used with manually specified node IP addresses, the Master and Slave nodes in the cluster are started in sequence, and the cluster information is viewed on the Master node.

S6: the system adopts container technology under Swarm cluster management and uses a Dockerfile to build a custom crawler image that packages the project files and the base environment. The image is uploaded to the cluster's image registry, a crawler container is created and run on the Master node, the image is rapidly deployed to the three Slave nodes through a Swarm service, and 100 crawler containers are created and started to carry out parallel crawling.

S7: the system adopts a centralized scheduling strategy with a master control node and crawler nodes, which overcomes the limitation that the Scrapy framework does not support distributed operation by itself. The master control node distributes tasks to the crawler nodes according to task priority and a load-balancing principle, the crawler nodes are responsible for fetching the corresponding data from the web pages at the given URLs, and distributed crawling with Scrapy is achieved through a global URL queue shared by the master node and the slave nodes.

S8: at the same time, a Bloom filter is combined with the Redis database to complete URL address deduplication.

S9: the system parses the CSS style structure of the page with XPath to locate and obtain the battery data.

S10: the corresponding target information is extracted through custom regular-expression functions to clean the data of the different fields.

S11: proxy server IP addresses are configured for the system to cope with anti-crawling measures. After a series of processing steps such as URL address deduplication, web page collection, extraction and cleaning, the useful battery data is stored in the system's dedicated physical database server.

S12: for the structured battery data tables produced by crawling, the system converts the battery data in the SQL tables into JSON through a back-end interface and, using the APIs and documentation provided by ECharts, designs diversified data displays, including a battery voltage-difference box-and-whisker plot, a trend graph of total voltage and total current, and a graph of the correspondence between voltage difference and current.

The Bloom filter of S8 can quickly determine whether an element exists in a set, and it greatly reduces the storage space needed for the URL queue. The algorithm is as follows:

(1) first, k hash functions are needed, each of which hashes a URL into one integer;

The system uses the Redis BitMap as the underlying storage of the Bloom filter, storing the hash values of the URL addresses of crawled web pages to complete URL address deduplication. Redis is an in-memory NoSQL database (supporting key-value data) that offers fast lookup and can quickly set the bit at a given position to 1. FIGS. 2 to 4 show the related charts generated by the present invention.
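How the Bloom filter bits might be mapped onto a Redis bitmap is sketched below, with several assumptions: the key name, the bit count, the SHA-256-based hashing and all class names are hypothetical. The filter only needs a client object with a `setbit(name, offset, value)` method that returns the previous bit, which is how the Redis `SETBIT` command and the redis-py client behave. A small in-memory stand-in is included so that the sketch runs without a Redis server.

```python
import hashlib


class InMemoryBitmap:
    """Stand-in for a Redis client; mimics SETBIT on named bitmaps."""

    def __init__(self):
        self._bits = {}

    def setbit(self, name, offset, value):
        # Like Redis SETBIT: store the new bit and return the previous one.
        previous = self._bits.get((name, offset), 0)
        self._bits[(name, offset)] = value
        return previous


class RedisBloomFilter:
    """URL deduplication filter whose bit array lives in a Redis bitmap."""

    def __init__(self, client, key="crawler:seen_urls", n_bits=1 << 24, k_hashes=4):
        self.client = client
        self.key = key
        self.n_bits = n_bits
        self.k_hashes = k_hashes

    def _positions(self, url):
        for seed in range(self.k_hashes):
            digest = hashlib.sha256(f"{seed}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def seen_before(self, url):
        """Mark url as seen; return True if every bit was already set."""
        already_seen = True
        for pos in self._positions(url):
            if self.client.setbit(self.key, pos, 1) == 0:
                already_seen = False
        return already_seen
```

Because the bitmap is stored under a single Redis key, every crawler container that connects to the same Redis instance shares one view of which URLs have already been visited.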

In summary, because the battery monitoring data standards and information of many large-scale online electric vehicle monitoring platforms are not unified, it is difficult for technical analysts to find appropriate and accurate information in the abundant and redundant battery monitoring data. The embodiment of the invention designs and implements continuous crawling, analysis and display of battery data using web crawler technology under a distributed Docker container cluster architecture. First, a Docker cluster of a plurality of physical hosts is constructed by using the Swarm container management tool; then, the Scrapy crawler library for Python performs continuous distributed crawling of the non-standardized battery monitoring information of the designated electric bus monitoring website, which involves URL address deduplication, data acquisition, extraction and cleaning, and generates an SQL database of the monitoring information; finally, the battery database is analyzed and mined to generate a battery voltage-difference box-and-whisker plot, a trend graph of battery pack total voltage and total current, and other visualized statistical charts, giving technical analysts an intuitive reference for the battery operating state.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website, characterized by comprising the following steps:

then, continuously crawling and collecting the non-standardized battery monitoring information stored in the designated electric bus monitoring website by using the Scrapy crawler library for Python, completing URL address deduplication, data collection, extraction and cleaning, and generating an SQL database of the monitoring information;

2. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website according to claim 1, characterized by comprising the following steps:

3. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website according to claim 2, wherein the S7 further includes S71: the system adopts a centralized scheduling strategy with a master control node and crawler nodes, which overcomes the limitation that the Scrapy framework does not support distributed operation by itself; the master control node distributes tasks to the crawler nodes according to task priority and a load-balancing principle, the crawler nodes are responsible for fetching the corresponding data from the web pages at the given URLs, and distributed crawling with Scrapy is achieved through a global URL queue shared by the master node and the slave nodes.

4. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website according to claim 2, wherein the S7 further includes S72: at the same time, a Bloom filter is used to test whether an element is already in the set, and is combined with the Redis database to complete URL address deduplication.

5. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website according to claim 2, wherein the S7 further includes S73: the system parses the CSS style structure of the page with XPath to locate and obtain the battery data.

6. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website according to claim 2, wherein the S7 further includes S74: the corresponding target information is extracted through custom regular-expression functions to clean the data of the different fields.

7. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website according to claim 2, wherein the S7 further includes S75: proxy server IP addresses are configured for the system to cope with anti-crawling measures; after a series of processing steps such as URL address deduplication, web page collection, extraction and cleaning, the useful battery data is stored in the system's dedicated physical database server.

8. The method for continuously crawling and analyzing battery data of a cluster electric bus monitoring website according to claim 4, wherein the Bloom filter of S72 can quickly determine whether an element exists in a set, and the algorithm is as follows: