CN114860830A

CN114860830A - System for building an operation and maintenance data middle platform based on big data technology

Info

Publication number: CN114860830A
Application number: CN202210489194.0A
Authority: CN
Inventors: 林茂军; 陆健华; 陈棣; 马永祥; 黄洁敏; 金杨; 杨小云; 崔立群; 胡红青; 钱苏尧
Original assignee: Bank of Shanghai Co., Ltd.
Current assignee: Bank of Shanghai Co., Ltd.
Priority date: 2022-04-26
Filing date: 2022-04-26
Publication date: 2022-08-05

Abstract

The invention relates to the technical field of big data and discloses a system for building an operation and maintenance data middle platform based on big data technology. The system comprises a data acquisition module, a data access module, a data processing module, a data storage and analysis module and a data service module, which are connected in sequence. The system flexibly supports the data requirements of various operation and maintenance scenarios.

Description

System for building an operation and maintenance data middle platform based on big data technology

Technical Field

The invention relates to the technical field of big data, and in particular to a system for building an operation and maintenance data middle platform based on big data technology.

Background

With the deep integration of IT and business, information and information technology have become some of an enterprise's most important assets, and the operation and maintenance management of information systems has become an important link in the product value chain: providing a service is itself a process of creating value, and information systems contribute significantly to enterprise profit. An operation and maintenance data middle platform therefore needs to be built. Focused on the operation and maintenance data domain, it uses an intelligent operation and maintenance platform to collect, process, integrate and store the bank-wide operation and maintenance data in a standardized and centralized way, forming standardized operation and maintenance data assets and achieving data reuse, cost reduction and value mining. However, existing operation and maintenance data middle platform systems suffer from defects such as unstable operation, few analysis modes, shallow analysis and poor applicability.

Disclosure of Invention

(I) Technical problem to be solved

In view of the defects of the prior art, the invention provides a system for building an operation and maintenance data middle platform based on big data technology.

(II) Technical solution

To achieve the above purpose, the invention provides the following technical solution: a system for building an operation and maintenance data middle platform based on big data technology comprises a data acquisition module, a data access module, a data processing module, a data storage and analysis module and a data service module, which are connected in sequence;

the data acquisition module uses Flume to ingest operation and maintenance logs, database data and Kafka data, wherein the operation and maintenance logs comprise application log data, transaction log data, network log data, system log data and network packet capture data;

the data access module caches the various operation and maintenance data by using a cluster of Kafka, a distributed publish-subscribe messaging system;

the data processing module is connected with the real-time computing cluster, and uses the stream processing engine Spark Streaming to perform aggregation computation on the various operation and maintenance data;

the data storage and analysis module mainly comprises a real-time database cluster Druid and an offline database cluster Kudu, and provides data warehouse and multidimensional data analysis functions;

the data service module provides all data services of the platform and realizes value output of operation and maintenance data.

Preferably, the data acquisition module further comprises a tag module; the data acquisition module uses Flume to collect data and transmit it to Kafka in real time, and attaches tags during collection, such as the log collection time, the directory where the log is located, the application identifier, the IP address and the host name, so as to provide data support for subsequent analysis and processing.

Preferably, the data processing module comprises a query module, a filtering module, an alarm module, a statistics module and a pre-aggregation module;

data is processed in micro-batches, with a processing interval as short as 1 second; after ETL by the program, the data is imported into the real-time analysis database Druid, which pre-aggregates the data at minute granularity through the pre-aggregation module, and the statistics module computes the total transaction volume, the successful transaction volume and the transaction processing time;

the alarm module uses the query module to query the Druid database for the data to be alerted on and sends it back to Kafka; the data is then pushed in real time to the open-source monitoring platform Zabbix by a Spark Streaming program; Zabbix filters the fully pushed data through the filtering module according to agreed dimension indicators, and once the corresponding thresholds are set, alarms can be raised on indicators of different dimensions such as application, host and transaction type;

and the alarm data in Kafka can also be supplied to an AI alarm platform, which provides intelligent alarming for indicators without static thresholds.

Preferably, the data storage and analysis module further comprises a detail database cluster; the Kafka cluster is connected with a log retrieval cluster Solr through a log retrieval module, and the log retrieval module comprises a combined query module and a segmentation module.

Preferably, the log retrieval module, based on the open-source log retrieval engine Solr, ingests application logs from the production environment in real time and provides a log retrieval function, with combined queries performed through the combined query module along the dimensions of application system, IP, log path, time and keyword;

the data acquisition module uses Flume to collect the application logs of the source systems in real time and transmit them to Kafka; on the server side, Flume ingests the log data into Solr; during ingestion, the segmentation module segments the log content according to word segmentation rules in order to build the index; and a user can retrieve the logs matching given rules from the log data in Solr through a visual page.

Preferably, the operation and maintenance data middle platform divides the data warehouse into five layers: 1. data application layer/application layer; 2. data subject layer/label layer; 3. service data layer/aggregation layer; 4. detailed data layer/model layer; 5. original data layer/source layer.

Preferably, the source layer continuously ingests data on application transactions, host performance and operation and maintenance processes as required.

Preferably, after data is ingested into the source layer, it is first standardized and then enters the model layer through dimensional modeling; in actual implementation, operation and maintenance data is classified by topic into ten major topics, including personnel organization, IT assets, protocols, performance capacity, alarms, operation, process, operations, logs and knowledge;

the dimensional modeling adopts a constellation schema: topic data is divided and organized into fact tables and dimension tables, and a fact table can be associated with a dimension table through a unique identifier.

Preferably, the label layer first counts the daily behavior of each topic object, and the DWT label layer then counts the accumulated behavior of each topic object, including the transaction volume of the current month and the number of changes in the current year;

the label layer models objects, abstracting data on applications, personnel, departments, equipment and machine rooms across different demand scenarios.

Preferably, the application layer takes the various processed data and certain business-oriented personalized indicators and, after further processing, organizes them according to the needs of the consuming business scenarios, so as to flexibly support the scenario requirements of the final business applications.

(III) Advantageous effects

Compared with the prior art, the invention provides a system for building an operation and maintenance data middle platform based on a big data technology, which has the following beneficial effects:

1. In the system for building an operation and maintenance data middle platform based on big data technology, stream data processing is performed on the big data platform, the transaction volume, technical success rate and transaction processing time of application systems are calculated in real time, and transaction-level monitoring is achieved in combination with the open-source monitoring platform. This greatly improves the ability to discover and locate faults, empowers front-line operation and maintenance personnel, and ensures stable and efficient operation of the application systems.

2. Based on the distributed messaging system and the log retrieval engine in the big data platform, the system retrieves production logs in real time and allows faults to be located and troubleshot from the office environment, promoting the shift of operation and maintenance personnel from a "machine room operation and maintenance" working mode to an "office operation and maintenance" working mode and promoting efficient development. Based on the distributed databases in the big data platform, the system implements the operation and maintenance data topic design and the layered data warehouse, and flexibly supports the data requirements of various operation and maintenance scenarios.

3. Based on the massive operation and maintenance data gathered by the big data platform, the system builds a profile of each application system and displays its running condition from all angles. At the operating system level, the CPU utilization, memory utilization, disk space utilization and the like of the application system are displayed in real time; at the transaction level, analysis is possible along multiple dimensions such as host, channel and transaction type, and the displayed indicators include transaction volume, technical success rate, service acceptance rate, average processing time and the like.

4. Based on the massive operation and maintenance data gathered by the big data platform, the system further mines the value of the data and displays more than 90 indicators in real time in the operation assistant, achieving a breakthrough from operation and maintenance into business operations and further improving users' ability to manage business operations and services.

5. The system provides self-service operation and maintenance data mining and analysis capabilities, realizing the value of operation and maintenance data. At present, the multidimensional data analysis dashboard covers 7 boards: transaction, host performance, job, alarm, event, change and problem. The condition of business systems can be analyzed from multiple angles and dimensions using the data of the different boards.

6. The system is based on a five-layer data warehouse architecture. The first layer is the ODS (operational data store) layer, which stores the original data. The second layer is the DWD layer, which mainly cleans the data of the ODS layer and performs dimension degeneration and similar processing. The third layer is the DWS layer, which performs light aggregation by day on the basis of the DWD layer, at a granularity where one row of information represents one day of behavior. The fourth layer is the DWT layer, which aggregates by label on the basis of the DWS layer, at a granularity where one row of information represents accumulated behavior. The fifth layer is the ADS layer, which provides data services for various application scenarios.

7. The system implements a node agent management and control system based on the automated operation and maintenance tool Ansible, through which software agents can be installed and restarted, configuration files managed and logs viewed from an interface.

8. Backed by the massive data of the operation and maintenance data middle platform, AI monitoring improves the anomaly detection and anomaly localization capability for application system monitoring indicators, provides immediate early warning of faults, and assists operation and maintenance personnel in locating faults.

Drawings

FIG. 1 is a schematic flow chart of the system of the present invention;

FIG. 2 is a schematic diagram of a data warehouse architecture according to the present invention;

FIG. 3 is a schematic view of a source layer structure according to the present invention;

FIG. 4 is a schematic diagram of a model layer structure according to the present invention;

FIG. 5 is a schematic view of a label layer structure according to the present invention;

FIG. 6 is a schematic diagram of an application layer structure according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

Referring to figs. 1-6, a system for building an operation and maintenance data middle platform based on big data technology comprises a data acquisition module, a data access module, a data processing module, a data storage and analysis module and a data service module, which are connected in sequence. JSON logs standardized by the source systems are collected: the data acquisition module uses Flume to ingest operation and maintenance logs, database data and Kafka data, and the operation and maintenance logs comprise application log data, transaction log data, network log data, system log data and network packet capture data. The data acquisition module further comprises a tag module; it uses Flume to collect data and transmit it to Kafka in real time, and attaches tags during collection, such as the log collection time, the directory where the log is located, the application identifier, the IP address and the host name, to provide data support for subsequent analysis and processing. The data access module caches the various operation and maintenance data by using a cluster of Kafka, a distributed publish-subscribe messaging system. The data processing module is connected with the real-time computing cluster and uses the stream processing engine Spark Streaming to perform aggregation computation on the various operation and maintenance data; it comprises a query module, a filtering module, an alarm module, a statistics module and a pre-aggregation module.
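The tagging step performed during collection can be illustrated with a minimal Python sketch. It is not Flume configuration and is not taken from the invention: the function name `tag_log_event` and the field names (`collect_time`, `log_dir`, `app_id`, `ip`, `hostname`, `message`) are hypothetical, chosen only to mirror the tags listed in the text.

```python
import json
import posixpath
from datetime import datetime, timezone


def tag_log_event(raw_line, log_path, app_id, ip, hostname, now=None):
    """Wrap one raw log line with the collection tags named in the text.

    All field names are illustrative assumptions, not part of the source.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "collect_time": now.isoformat(),          # log collection time
        "log_dir": posixpath.dirname(log_path),   # directory of the log file
        "app_id": app_id,                         # application identifier
        "ip": ip,
        "hostname": hostname,
        "message": raw_line.rstrip("\n"),
    }


event = tag_log_event(
    '{"txn": "transfer", "ok": true}\n',
    "/var/log/app/trade.log",
    app_id="core-banking",
    ip="10.0.0.8",
    hostname="app01",
    now=datetime(2022, 4, 26, 8, 0, 0, tzinfo=timezone.utc),
)
# Serialized form, for example the value that would be published to a message topic.
payload = json.dumps(event, sort_keys=True)
```

Attaching the tags at collection time means that downstream aggregation and retrieval can group or filter by application, host or path without re-reading the source system.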

Data is processed in micro-batches, with a processing interval as short as 1 second. After ETL by the program, the data is imported into the real-time analysis database Druid, which pre-aggregates it at minute granularity through the pre-aggregation module, and the statistics module computes the total transaction volume, the successful transaction volume and the transaction processing time. The alarm module uses the query module to query the Druid database for the data to be alerted on and sends it back to Kafka; the data is then pushed in real time to the open-source monitoring platform Zabbix by a Spark Streaming program. Zabbix filters the fully pushed data through the filtering module according to agreed dimension indicators, and once the corresponding thresholds are set, alarms can be raised on indicators of different dimensions such as application, host and transaction type. The alarm data in Kafka can also be supplied to an AI alarm platform, which provides intelligent alarming for indicators without static thresholds. The data storage and analysis module mainly comprises the real-time database cluster Druid and the offline database cluster Kudu, and further comprises a detail database cluster, providing data warehouse and multidimensional data analysis functions. The data service module provides all data services of the platform and realizes value output of operation and maintenance data.
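As a minimal illustration of minute-granularity pre-aggregation followed by static-threshold alarming, the following Python sketch rolls a micro-batch of transaction events up by application and minute and flags buckets whose success rate falls below a threshold. It is plain Python rather than Spark Streaming, Druid or Zabbix code; the event fields (`ts`, `app`, `ok`, `elapsed_ms`), the function names and the 0.9 threshold are assumptions made for the example.

```python
from collections import defaultdict


def rollup_by_minute(events):
    """Pre-aggregate transaction events to minute granularity.

    Each event is a dict with hypothetical keys: 'ts' (epoch seconds),
    'app', 'ok' (bool) and 'elapsed_ms'.
    """
    buckets = defaultdict(lambda: {"total": 0, "success": 0, "elapsed_ms": 0})
    for e in events:
        minute = e["ts"] - e["ts"] % 60  # truncate to the start of the minute
        b = buckets[(e["app"], minute)]
        b["total"] += 1
        b["success"] += 1 if e["ok"] else 0
        b["elapsed_ms"] += e["elapsed_ms"]
    return dict(buckets)


def find_alerts(rollup, min_success_rate):
    """Return (app, minute, success_rate) for buckets below a static threshold."""
    alerts = []
    for (app, minute), b in sorted(rollup.items()):
        rate = b["success"] / b["total"]
        if rate < min_success_rate:
            alerts.append((app, minute, rate))
    return alerts


micro_batch = [
    {"ts": 120, "app": "pay", "ok": True, "elapsed_ms": 30},
    {"ts": 125, "app": "pay", "ok": False, "elapsed_ms": 90},
    {"ts": 185, "app": "pay", "ok": True, "elapsed_ms": 40},
]
rollup = rollup_by_minute(micro_batch)
alerts = find_alerts(rollup, min_success_rate=0.9)
```

Storing counts and summed processing time per bucket, rather than rates and averages, keeps the buckets additive, so they can be merged again at coarser granularity.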

The Kafka cluster is connected with a log retrieval cluster Solr through a log retrieval module, which comprises a combined query module and a segmentation module. Based on the open-source log retrieval engine Solr, the log retrieval module ingests application logs from the production environment in real time and provides a log retrieval function, with combined queries performed through the combined query module along the dimensions of application system, IP, log path, time and keyword. The data acquisition module uses Flume to collect the application logs of the source systems in real time and transmit them to Kafka; on the server side, Flume ingests the log data into Solr. During ingestion, the segmentation module segments the log content according to word segmentation rules in order to build the index, and a user can retrieve the logs matching given rules from the log data in Solr through a visual page.
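The idea of segmenting log content to build an index and then running a combined query can be sketched with a toy in-memory inverted index. This is not the Solr API; the class `LogIndex`, its methods, the record fields and the segmentation rule (lower-cased runs of letters and digits) are all assumptions made for illustration.

```python
import re
from collections import defaultdict


class LogIndex:
    """Toy inverted index over tagged log records (not the Solr API)."""

    def __init__(self):
        self.docs = []
        self.postings = defaultdict(set)  # token -> set of document ids

    def add(self, record):
        doc_id = len(self.docs)
        self.docs.append(record)
        # Hypothetical segmentation rule: lower-cased runs of letters and digits.
        for token in re.findall(r"[a-z0-9]+", record["message"].lower()):
            self.postings[token].add(doc_id)

    def search(self, keyword, start=None, end=None, **fields):
        """Combined query: keyword AND exact field matches AND time range."""
        hits = []
        for doc_id in sorted(self.postings.get(keyword.lower(), ())):
            doc = self.docs[doc_id]
            if any(doc.get(k) != v for k, v in fields.items()):
                continue
            if start is not None and doc["time"] < start:
                continue
            if end is not None and doc["time"] > end:
                continue
            hits.append(doc)
        return hits


index = LogIndex()
index.add({"app": "pay", "ip": "10.0.0.8", "path": "/var/log/pay/app.log",
           "time": "2022-04-26T08:00:01", "message": "Timeout calling ledger service"})
index.add({"app": "pay", "ip": "10.0.0.9", "path": "/var/log/pay/app.log",
           "time": "2022-04-26T08:05:00", "message": "timeout on retry"})
index.add({"app": "loan", "ip": "10.0.1.3", "path": "/var/log/loan/app.log",
           "time": "2022-04-26T08:06:00", "message": "request completed"})

hits = index.search("timeout", app="pay", ip="10.0.0.8")
```

The timestamps are ISO 8601 strings in a single fixed format, so plain string comparison orders them correctly for the time-range filter.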

The platform gathers massive operation and maintenance data, stores it in the real-time analysis database, and supports both real-time and historical data queries. The front-end page displays, in real time, multi-dimensional information on bank-wide key business system transactions, servers and management, helping front-line operation and maintenance personnel keep track of how applications are running.

Using business data from instrumentation points in application transaction logs, dimension indicators such as the business processing status at every level of the bank, branch queuing information and the running status of branch self-service machines are generated during data cleaning. These cover the business of the whole bank, so that real-time and historical transaction status at every level can be obtained and branch customer flow early warning can be implemented, greatly improving the ability of management at all levels of the bank to control business operations and service conditions.

For multidimensional data analysis, data on 7 major boards is covered. The transaction board can use charts such as time-series line charts and time-series bar charts to show the trends of business transaction volume and average processing time, the thirty-day technical success rate, year-on-year and period-on-period comparisons, TOP20 rankings and the like. The host performance board is divided into three parts (application servers, database servers and middleware servers); it can analyze trends in server performance data such as CPU utilization, memory utilization and IO rate, as well as server-specific indicators such as tablespace utilization and disk space utilization in the database server module and average MQ queue depth in the middleware server module. The job board can analyze the number of successes, number of failures, average execution duration, fluctuation of start time and the like of each batch job within a given time period. The alarm board displays the current number of alarms and the daily trend in the number of alarms, together with analysis charts including classification by alarm indicator object, classification by IP address and alarm statistics details. The event board displays indicators such as the total number of events and the count of events whose handling timed out, together with statistics classified by event category and average handling time. The change board displays indicators such as the number of changes in the month and the monthly on-time completion rate, together with the distribution and details of change reasons, change types, initiating departments and on-time change success rate. The problem board displays the time taken to resolve problems over the year and the total number of problems for the year, together with statistics by problem priority for the year and problem details.

The operation and maintenance data middle platform divides the data warehouse into five layers based on the following principles:

simplifying complex problems: a complex task is decomposed into several layers, and each layer handles only simple tasks, which makes problems easier to locate;

reducing repeated development: with standardized data layering, intermediate-layer data reduces repeated computation and increases the reusability of computation results;

isolating original data: in view of data anomalies and data sensitivity, the real data is decoupled from the statistical data.

The data warehouse is divided into the following five layers:

1. Data application layer/application layer: various processed data and certain business-oriented personalized indicators are further processed and then organized according to the needs of the consuming business scenarios, so as to flexibly support the scenario requirements of the final business applications. This layer is similar to a traditional data mart, but lighter and more flexible, and is used to solve specific business problems. At present, applications such as health inspection, regulatory reporting, electronic monthly reports and palmtop applications have been built, as shown in fig. 6.

2. Data subject layer/label layer: the label layer first counts the daily behavior of each topic object, and the DWT label layer then counts the accumulated behavior of each topic object, including the transaction volume of the current month and the number of changes in the current year. The label layer models objects, abstracting data on applications, personnel, departments, equipment and machine rooms across different demand scenarios. After dimension degeneration in the model layer, the various pieces of information about one object are still scattered across different data domains at different granularities. Taking application data as an example, transaction information such as an application system's transaction volume and technical success rate lies in the performance capacity topic domain, the number of emergency events lies in the process topic domain, and information such as batch duration and alarm counts lies in the operation topic domain, so it is difficult to obtain a comprehensive view of a single application.

By associating the fact tables of the model layer from the perspective of different objects, statistical indicators with qualifiers and calculation methods can be obtained through SQL statements. For example, the measures of the change fact table are the change ID, change type, completion status and the like; associating the change fact table from the department dimension yields indicators such as the number of successful changes in the month for each department; associating it from the personnel dimension yields the number of changes in the month for a given person; and associating the event fact table yields indicators such as the number of events handled in the month by a given person. The label layer thus organizes the data of a given object across business boards and data domains at a unified granularity, and can meet the need to obtain and analyze comprehensive data on that object, as shown in fig. 5.

3. Service data layer/aggregation layer.

4. Detailed data layer/model layer: after data is ingested into the source layer, it is first standardized and then enters the model layer through dimensional modeling. In actual implementation, operation and maintenance data is classified by topic into ten major topics, including personnel organization, IT assets, protocols, performance capacity, alarms, operation, process, operations, logs and knowledge. The dimensional modeling adopts a constellation schema: topic data is divided and organized into fact tables and dimension tables, and a fact table can be associated with a dimension table through a unique identifier (such as an IP address), as shown in fig. 4.

5. Original data layer/source layer: this layer continuously ingests data on application transactions, host performance and operation and maintenance processes as required, as shown in fig. 3.
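The relationship between the layers can be illustrated with a small Python sketch: a change fact table is associated with a personnel dimension table through a unique identifier, rolled up by day per department, and then accumulated by month. The text describes obtaining such indicators through SQL statements; plain Python is used here only to keep the example self-contained. The table contents, column names and function names are hypothetical sample data, not taken from the invention.

```python
from collections import defaultdict

# Dimension table: person -> department (hypothetical sample rows).
dim_person = {
    "p1": {"name": "A", "dept": "core-systems"},
    "p2": {"name": "B", "dept": "core-systems"},
    "p3": {"name": "C", "dept": "channels"},
}

# Change fact table (model layer): one row per change, keyed to dim_person.
fact_change = [
    {"change_id": 1, "person_id": "p1", "date": "2022-04-01", "success": True},
    {"change_id": 2, "person_id": "p2", "date": "2022-04-01", "success": False},
    {"change_id": 3, "person_id": "p2", "date": "2022-04-15", "success": True},
    {"change_id": 4, "person_id": "p3", "date": "2022-05-02", "success": True},
]


def daily_success_by_dept(facts, persons):
    """Daily rollup: one row per (department, day), successful changes only."""
    daily = defaultdict(int)
    for row in facts:
        dept = persons[row["person_id"]]["dept"]  # fact-to-dimension association
        if row["success"]:
            daily[(dept, row["date"])] += 1
    return dict(daily)


def monthly_success_by_dept(daily):
    """Cumulative rollup: accumulate daily rows per (department, month)."""
    monthly = defaultdict(int)
    for (dept, date), count in daily.items():
        monthly[(dept, date[:7])] += count  # 'YYYY-MM' prefix of the date
    return dict(monthly)


daily = daily_success_by_dept(fact_change, dim_person)
monthly = monthly_success_by_dept(daily)
```

Building the monthly figures from the daily rows, rather than from the fact table directly, mirrors the stated goal of reusing intermediate-layer results instead of repeating the calculation.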

For the node agent management and control system, the functions of system configuration, configuration management, state monitoring, version library management and version release control are implemented. The system configuration module manages servers in groups and enables batch management of agents. The configuration management module has a complete policy for managing historical service versions and supports unified management of configuration files. The state monitoring module monitors the state of each agent and promptly reports whether the agent is alive. Version library management and version release control enable one-click release of files such as configuration files and scripts.

For AI monitoring, single-KPI anomaly detection and machine indicator anomaly localization are mainly implemented. An anomaly detection model that combines the features of multiple detectors is built with machine learning algorithms, achieving anomaly detection with high precision and recall on a single monitoring indicator curve, reducing the false-negative and false-positive rates of the original monitoring strategy, safeguarding the stability, efficiency and safety of production operations, and reducing unnecessary manual effort. For anomaly localization, the monitoring data of a given service and of the clusters, modules and servers it depends on is gathered; its behavior before and after the service failure time point is compared with that at the same time in historical periods; and the results are clustered, ranked and displayed together, so that anomalies can be located automatically and quickly from the data when a service fails.
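The source does not specify which detectors or learning algorithm are combined. Purely as an illustration of combining several detectors on a single indicator curve, the sketch below uses two common statistical detectors (a z-score test and a median-absolute-deviation test) and a simple vote in place of a learned model. The detector choice, the factor `k`, the voting rule and all function names are assumptions.

```python
import statistics


def zscore_flags(history, value, k=3.0):
    """Flag value if it lies more than k standard deviations from the mean."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    return std > 0 and abs(value - mean) > k * std


def mad_flags(history, value, k=3.0):
    """Flag value if it lies more than k scaled MADs from the median."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    # 1.4826 scales the MAD to be comparable with a standard deviation
    # for normally distributed data.
    return mad > 0 and abs(value - med) > k * 1.4826 * mad


def is_anomaly(history, value, min_votes=2):
    """Combine detectors by simple voting (a stand-in for a learned model)."""
    votes = sum(1 for detect in (zscore_flags, mad_flags) if detect(history, value))
    return votes >= min_votes


# Hypothetical recent values of one monitoring indicator.
history = [100, 102, 98, 101, 99, 100, 103, 97]
```

With `min_votes=2`, a point is reported only when both detectors agree, which trades some recall for fewer false alarms; lowering `min_votes` to 1 does the opposite.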

Claims

1. A system for building an operation and maintenance data middle platform based on big data technology, comprising a data acquisition module, a data access module, a data processing module, a data storage and analysis module and a data service module, characterized in that: the data acquisition module, the data access module, the data processing module, the data storage and analysis module and the data service module are connected in sequence;

the data service module provides all data services of the platform and realizes value output of operation and maintenance data;

the data acquisition module further comprises a tag module; the data acquisition module uses Flume to collect data and transmit it to Kafka in real time, and attaches tags during collection, such as the log collection time, the directory where the log is located, the application identifier, the IP address and the host name, to provide data support for subsequent analysis and processing;

the data processing module comprises a query module, a filtering module, an alarm module, a statistics module and a pre-aggregation module;

data is processed in micro-batches, with a processing interval as short as 1 second; after ETL by the program, the data is imported into the real-time analysis database Druid, which pre-aggregates the data at minute granularity through the pre-aggregation module, and the statistics module computes the total transaction volume, the successful transaction volume and the transaction processing time;

and the alarm data in Kafka is also supplied to an AI alarm platform, which provides intelligent alarming for indicators without static thresholds.

2. The system for building an operation and maintenance data middle platform based on big data technology according to claim 1, wherein: the data storage and analysis module further comprises a detail database cluster, the Kafka cluster is connected with a log retrieval cluster Solr through a log retrieval module, and the log retrieval module comprises a combined query module and a segmentation module.

3. The system for building an operation and maintenance data middle platform based on big data technology according to claim 2, wherein: the log retrieval module, based on the open-source log retrieval engine Solr, ingests application logs from the production environment in real time and provides a log retrieval function, with combined queries performed through the combined query module along the dimensions of application system, IP, log path, time and keyword.

4. The system for building an operation and maintenance data middle platform based on big data technology according to claim 1, wherein: the operation and maintenance data middle platform divides the data warehouse into five layers: 1. data application layer/application layer; 2. data subject layer/label layer; 3. service data layer/aggregation layer; 4. detailed data layer/model layer; 5. original data layer/source layer.

5. The system for building an operation and maintenance data middle platform based on big data technology according to claim 4, wherein: the source layer continuously ingests data on application transactions, host performance and operation and maintenance processes as required.

6. The system for building an operation and maintenance data middle platform based on big data technology according to claim 4, wherein: after data is ingested into the source layer, it is first standardized and then enters the model layer through dimensional modeling; in actual implementation, operation and maintenance data is classified by topic into ten major topics, including personnel organization, IT assets, protocols, performance capacity, alarms, operation, process, operations, logs and knowledge.

7. The system for building an operation and maintenance data middle platform based on big data technology according to claim 4, wherein: the label layer first counts the daily behavior of each topic object, and the DWT label layer then counts the accumulated behavior of each topic object, including the transaction volume of the current month and the number of changes in the current year;

and the label layer models objects, abstracting data on applications, personnel, departments, equipment and machine rooms across different demand scenarios.

8. The system for building an operation and maintenance data middle platform based on big data technology according to claim 4, wherein: the application layer takes the various processed data and certain business-oriented personalized indicators and, after further processing, organizes them according to the needs of the consuming business scenarios, so as to flexibly support the scenario requirements of the final business applications.