CN113268530A - Mass heterogeneous data acquisition method and system, computer equipment and storage medium - Google Patents
- Publication number
- CN113268530A (application CN202010096216.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- target
- tool
- message middleware
- acquisition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/54—Indexing scheme relating to G06F9/54
- G06F2209/548—Queue
Abstract
The application relates to a method and a system for acquiring massive heterogeneous data, computer equipment and a storage medium. The method comprises the following steps: a host management tool issues a data acquisition tool and an acquisition configuration file; the data acquisition tool tracks a target file group in the mode specified by the acquisition configuration file, collects target data in the target file group, and sends the target data to a message middleware; a computing engine parses the target data according to a preset parsing rule to obtain structured data and puts the structured data into the message middleware. In this method, the data acquisition tool, the message middleware and the computing engine, which are managed and controlled by the host management tool, cooperate with one another, so that the system resource occupation of the host is reduced and large-scale heterogeneous data is preprocessed.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, a system, a computer device, and a storage medium for acquiring massive heterogeneous data.
Background
With the rise of cloud computing and microservice architectures, server-side business data has grown rapidly. Business data reaches the performance ceiling of a stand-alone database early on, which gave rise to techniques such as database and table sharding. Meanwhile, non-business data is produced faster than business data, and traditional databases cannot meet the needs of non-relational data such as log data and operation and maintenance metrics, so non-relational databases and distributed storage have appeared in succession. The explosive growth of the various heterogeneous log text data generated on the server side therefore poses a new challenge for data collection schemes.
However, in the related art, data collection schemes are based on the ELK solution, which consists of three software products open-sourced by Elastic: Elasticsearch, Logstash and Kibana. Elasticsearch is a real-time search and analysis engine for massive data based on the Lucene engine; it supports full-text retrieval, structured search and analysis of text data, and offers real-time analysis, distributed real-time document storage, full-text indexing, high availability, easy scaling and friendly data interaction. Logstash is a server-side data processing pipeline that can collect data from multiple sources simultaneously and transform the data before sending it to a store such as Elasticsearch. Kibana is a tool for visualizing the data in Elasticsearch using graphs and charts. The ELK solution uses Logstash to track target files on the server, collects and parses the file content into structured documents, stores the documents in Elasticsearch, and finally displays charts through Kibana. However, with the rise of cloud computing and microservice architectures, the explosive growth in the volume of heterogeneous server-side log text data leaves the ELK solution unable to cope. First, when the data volume on a single machine is large, Logstash's data collection occupies more system resources and increases data processing delay. Second, Logstash's data parsing requires a configuration file to be prepared in advance; in a cluster environment, hot-updating the configuration is a heavy operation and maintenance burden, and with large data volumes the collection and parsing pressure is high, so throughput becomes a bottleneck.
Third, because Logstash connects directly to Elasticsearch when storing data, connecting a large number of Logstash instances to the Elasticsearch cluster over Transmission Control Protocol (TCP) connections increases the pressure on the cluster.
In the related art, no effective solution has yet been proposed for the problems that massive heterogeneous data collection occupies substantial resources and that collection and parsing throughput is limited.
Disclosure of Invention
Therefore, it is necessary to provide a method, an apparatus, a computer device, and a storage medium for acquiring massive heterogeneous data in order to solve the above technical problems.
According to one aspect of the present invention, a method for acquiring massive heterogeneous data is provided, the method comprising:
the host management tool issues a data acquisition tool and an acquisition configuration file;
the data acquisition tool tracks a target file group according to a mode designated by the acquisition configuration file, collects target data in the target file group and sends the target data to the message middleware;
and the calculation engine analyzes the target data according to a preset analysis rule to obtain structured data, and the structured data is put into the message middleware.
In one embodiment, the collecting target data in the target file group and sending the target data to the message middleware comprises: the message middleware receives the target data collected by the data collection tool in sequence.
In one embodiment, the parsing, by the computing engine, the target data according to a preset parsing rule to obtain structured data, and the placing the structured data in the message middleware includes:
the computing engine acquires a preset analysis rule corresponding to the source according to the source of the target data, analyzes the target data according to the preset analysis rule to obtain structured data, and puts the structured data into a message middleware.
In one embodiment, the placing the structured data into the message middleware comprises:
and storing the structured data into a second topic of the message middleware, wherein the message middleware is a Kafka message queue, and the target data is stored in a first topic of the message middleware.
In one embodiment, after the host management tool issues the data collection tool and the collection configuration file, the method further includes:
the host management tool detects the state of the data acquisition tool and the acquisition configuration file, and controls the start and stop of the data acquisition tool and the update of the acquisition configuration file, wherein the update comprises hot update.
In one embodiment, the computing engine parses the target data according to a preset parsing rule to obtain structured data, and after the structured data is placed in the message middleware, the method includes:
and storing the formatted data into a distributed storage, and displaying the formatted data based on the distributed storage by visual chart software.
According to another aspect of the present invention, there is also provided a mass heterogeneous data collection system, the system comprising a host management tool, a data collection tool, message middleware, and a computation engine,
the host management tool is used for issuing a data acquisition tool and an acquisition configuration file;
the data acquisition tool is used for tracking a target file group according to a mode designated by the acquisition configuration file, collecting target data in the target file group and sending the target data to the message middleware;
the computing engine is used for analyzing the target data according to a preset analysis rule to obtain structured data;
the message middleware is configured to receive the target data and the structured data.
In one embodiment, the host management tool is further configured to detect the status of the data collection tool and the collection configuration file, and control the start and stop of the data collection tool and the update of the collection configuration file.
According to another aspect of the present invention, there is also provided a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above massive heterogeneous data acquisition method when executing the computer program.
According to another aspect of the present invention, there is also provided a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above massive heterogeneous data acquisition method.
According to the above massive heterogeneous data acquisition method, system, computer equipment and storage medium, the host management tool issues the data acquisition tool and the acquisition configuration file; the data acquisition tool tracks the target file group in the mode specified by the acquisition configuration file, collects the target data in the target file group, and sends the target data to the message middleware; the computing engine parses the target data according to a preset parsing rule to obtain structured data and puts the structured data into the message middleware. The data acquisition tool, the message middleware and the computing engine, which are managed and controlled by the host management tool, cooperate with one another, so that the system resource occupation of the host is reduced, the large-scale heterogeneous data is preprocessed, and the structured data is obtained.
Drawings
FIG. 1 is a diagram illustrating an application scenario of a mass heterogeneous data acquisition method according to an embodiment of the present invention;
FIG. 2 is a first flowchart of a mass heterogeneous data acquisition method according to an embodiment of the present invention;
FIG. 3 is a second flowchart of a mass heterogeneous data acquisition method according to an embodiment of the present invention;
FIG. 4 is a third flowchart of a mass heterogeneous data acquisition method according to an embodiment of the present invention;
FIG. 5 is a fourth flowchart of a method for mass heterogeneous data acquisition according to an embodiment of the present invention;
FIG. 6 is a first schematic diagram of a mass heterogeneous data acquisition system according to another embodiment of the present invention;
FIG. 7 is a second schematic diagram of a mass heterogeneous data acquisition system according to another embodiment of the present invention;
FIG. 8 is a third schematic diagram of a mass heterogeneous data acquisition system according to another embodiment of the present invention;
FIG. 9 is a flowchart of a method for acquiring massive heterogeneous data according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is an application scenario diagram of a mass heterogeneous data acquisition method according to an embodiment of the present invention, where 102 is a cloud server used for performing centralized management and operation and maintenance, such as computation, storage, data analysis, and identity verification, on a cloud virtual machine on a terminal host, 104 and 106 are hosts, a host management tool installed on the cloud server 102 issues a data acquisition tool and an acquisition configuration file to the hosts 104 and 106, and the data acquisition tool tracks a target file group on the hosts 104 and 106 according to a mode specified by the acquisition configuration file, collects target data in the target file group, and sends the target data to a message middleware; the computing engine installed on the cloud server 102 acquires the target data through the message middleware, analyzes the target data according to a preset analysis rule to obtain structured data, and places the structured data into the message middleware to facilitate further analysis and processing of the data. The server 102 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
According to an aspect of the present invention, fig. 2 is a first flowchart of a mass heterogeneous data acquisition method according to an embodiment of the present invention, and as shown in fig. 2, a mass heterogeneous data acquisition method is provided, which includes the following steps:
step S210: the host management tool issues a data acquisition tool and an acquisition configuration file;
in step S210, the host management tool may be SaltStack or another centralized management platform such as Ansible. Taking SaltStack as an example, deploying a SaltStack environment makes it possible to execute commands in batches on a large number of servers and, according to different service characteristics, to perform centralized configuration management, file distribution, server data collection, operating system base management, software package management, and the like; SaltStack comprises a master control end (Master) and a controlled end (Minion). The data acquisition tool may be a lightweight text data collection tool such as Filebeat or Fluentd. Taking Filebeat as an example: Logstash depends on Java, and when the data volume is large the Logstash process consumes excessive system resources and seriously affects the performance of the service system, whereas Filebeat is written in Go, is lighter than Logstash, and occupies few system resources. Filebeat involves two components, the prospector and the harvester, which read files and send event data to a specified output: when Filebeat starts, it starts one or more prospectors to check the files in the specified paths, then starts harvesters to collect the data, aggregates the collected data, and sends the aggregated data to the output configured for Filebeat. The data acquisition tool also needs a corresponding acquisition configuration file, which generally specifies the target path for data acquisition (that is, the target file group), the target file format, and the data output (that is, the flow direction of the collected target data).
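As a minimal sketch of what such an acquisition configuration file might carry, the following Python models the three items named above (target file group, target file format, data output) and serializes them for distribution. The class name `AcquisitionConfig`, its field names, the JSON encoding and the sample paths are all assumptions made for illustration; the source does not specify a file format, and this is not the configuration syntax of Filebeat or any other tool.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Dict, List


@dataclass
class AcquisitionConfig:
    """Hypothetical model of an acquisition configuration file."""
    target_paths: List[str]   # glob patterns naming the target file group
    file_format: str          # format of the target files, e.g. "plain"
    output: str               # where collected data flows, e.g. a middleware topic
    fields: Dict[str, str] = field(default_factory=dict)  # extra tags, e.g. the data source


def render_config(cfg: AcquisitionConfig) -> str:
    """Serialize the configuration so a management tool could distribute it."""
    return json.dumps(asdict(cfg), indent=2, sort_keys=True)


cfg = AcquisitionConfig(
    target_paths=["/var/log/app/*.log"],
    file_format="plain",
    output="raw-logs",
    fields={"source": "app"},
)
rendered = render_config(cfg)

# A host receiving the file can rebuild the same configuration object.
restored = AcquisitionConfig(**json.loads(rendered))
```

Keeping the configuration as plain data means the same file can be re-issued unchanged to many hosts, which is the role the host management tool plays in this step.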
step S220: the data acquisition tool tracks the target file group according to the mode designated by the acquisition configuration file, collects target data in the target file group and sends the target data to the message middleware;
in step S220, the data acquisition tool tracks the target file group according to the target file group, target file format, data output and other settings specified in the acquisition configuration file, and collects the target data in the target file group. The acquisition configuration file also specifies the flow direction of the target data; in this embodiment, the target data is sent to a message middleware such as Kafka or RocketMQ. Taking the message middleware Kafka as an example, Kafka is a distributed message queue characterized by high performance, persistence, multi-replica backup and strong horizontal scalability; a producer (Producer) writes messages into the queue, and a consumer (Consumer) takes messages from the queue to execute service logic.
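The tracking behaviour in this step can be sketched as follows, under stated assumptions: the tracker remembers a byte offset per file and, on each pass, forwards only the complete lines appended since the previous pass. The `FileTracker` class is hypothetical and much simpler than a real collector, and a `deque` stands in for the message middleware so that the example runs without a broker.

```python
import os
import tempfile
from collections import deque


class FileTracker:
    """Remembers how far each target file has been read (hypothetical helper)."""

    def __init__(self):
        self.offsets = {}  # path -> byte offset of the next unread line

    def collect(self, path):
        """Return the complete lines appended to `path` since the previous call."""
        lines = []
        with open(path, "rb") as f:
            f.seek(self.offsets.get(path, 0))
            while True:
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a partially written line; retry later
                lines.append(line.decode("utf-8").rstrip("\n"))
                self.offsets[path] = f.tell()
        return lines


queue = deque()  # in-memory stand-in for the message middleware
tracker = FileTracker()

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")
    with open(log_path, "wb") as f:
        f.write(b"first\nsecond\n")
    queue.extend(tracker.collect(log_path))   # picks up the two existing lines
    with open(log_path, "ab") as f:
        f.write(b"third\n")
    queue.extend(tracker.collect(log_path))   # picks up only the appended line

collected = list(queue)
```

Because the offset advances only past complete lines, a line that is still being written is left for the next pass rather than being forwarded in two pieces.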
step S230: the computing engine parses the target data according to a preset parsing rule to obtain structured data, and the structured data is put into the message middleware.
In step S230, the heterogeneous target data in the message middleware flows into the computing engine, which may be a processing engine such as Flink, Spark or Storm. Taking Flink as an example, Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams, and it supports high-throughput, low-latency, high-performance stream processing. The computing engine then parses the target data according to a preset parsing rule, that is, performs data conversion on the target data to obtain structured data. The Map operator can clean and convert a data set. The FlatMap operator is mainly used in scenarios where one input element produces one or more output elements, for example splitting each line of text into a sequence of words. The Filter operator screens the result set by a condition, outputting the data that satisfies the condition and discarding the rest. The KeyBy operator converts the input data stream into a keyed data stream according to a specified key, which is equivalent to partitioning the data set so that data with the same key falls into the same partition. Through this processing, the heterogeneous data is converted in format and its key information is extracted. Finally, the processed data is put back into the message middleware under a topic different from the one it came from; the message flow direction of the message middleware can be specified according to the requirements of the scenario, so that the structured data can be put to use.
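The four operators described in this step can be illustrated on an ordinary in-memory list. The sketch below uses only the Python standard library and is not the Flink API; the sample lines and the choice of the first letter as the key are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain

raw_lines = [
    "  INFO order created  ",
    "ERROR payment failed",
    "",
    "INFO order shipped",
]

# Map: one element in, one element out (here: trimming whitespace).
cleaned = map(str.strip, raw_lines)

# Filter: keep only the elements that satisfy a condition (here: non-empty lines).
non_empty = filter(None, cleaned)

# FlatMap: one element in, several elements out (here: a line becomes its words).
words = chain.from_iterable(line.split() for line in non_empty)

# KeyBy: elements sharing a key end up together, analogous to partitioning by key.
by_key = defaultdict(list)
for word in words:
    by_key[word[0].lower()].append(word)

keyed = dict(by_key)
```

In a streaming engine the same four stages run continuously over an unbounded stream and the keyed groups live on different workers; the list-based version only shows what each operator does to the data.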
In the above mass heterogeneous data acquisition method, the host management tool issues the data acquisition tool and the acquisition configuration file; the data acquisition tool tracks the target file group in the mode specified by the acquisition configuration file, collects the target data in the target file group, and sends the target data to the message middleware; the computing engine parses the target data according to a preset parsing rule to obtain structured data and puts the structured data into the message middleware. The data acquisition tool, the message middleware and the computing engine, which are managed and controlled by the host management tool, cooperate with one another, so that the system resource occupation of the host is reduced and the large-scale heterogeneous data is preprocessed.
In an embodiment, fig. 3 is a second flowchart of a mass heterogeneous data acquisition method according to an embodiment of the present invention. As shown in fig. 3, after the host management tool issues the data acquisition tool and the acquisition configuration file, the method further includes: S310: the host management tool detects the state of the data acquisition tool and the acquisition configuration file, and controls the start and stop of the data acquisition tool and the update of the acquisition configuration file. In this embodiment, after issuing the data acquisition tool and the acquisition configuration file, the host management tool can monitor the working state of the data acquisition tool in real time, control its start and stop in real time as acquisition requirements change, and hot-update the acquisition configuration file. For example, SaltStack allows management commands to be executed on remote systems in parallel and allows systems to be targeted not only by host name but also by system attributes, giving finer-grained control over data acquisition; when the acquisition requirements change, nothing needs to be configured in advance, and a new acquisition configuration file only needs to be issued to the virtual machines that need the change.
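One way to picture start/stop control together with a hot update is a handle that applies a new configuration only when its content differs from the current one, without stopping the collector. The sketch below is a simplification under assumptions: `ManagedCollector` and its methods are hypothetical names, the configuration is treated as opaque text, and none of this is the SaltStack API.

```python
import hashlib


def _digest(text):
    """Content fingerprint used to detect whether a configuration changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ManagedCollector:
    """Hypothetical handle a host management tool keeps for one collector."""

    def __init__(self, config_text):
        self.running = False
        self.reloads = 0
        self.config_text = config_text
        self.digest = _digest(config_text)

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def push_config(self, config_text):
        """Apply a new configuration while the collector keeps running.

        Returns True when the content changed and a reload was performed."""
        new_digest = _digest(config_text)
        if new_digest == self.digest:
            return False  # identical content, so no reload is needed
        self.config_text = config_text
        self.digest = new_digest
        self.reloads += 1
        return True


collector = ManagedCollector("paths: /var/log/app/*.log")
collector.start()
unchanged = collector.push_config("paths: /var/log/app/*.log")
changed = collector.push_config("paths: /var/log/app/*.log, /var/log/db/*.log")
```

Comparing fingerprints lets the management side re-issue the same file to every host and have only the hosts whose configuration actually changed perform a reload.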
In one embodiment, collecting the target data in the target file group and sending to the message middleware comprises: and the message middleware receives the target data collected by the data acquisition tool in sequence. Because in the data collection process, the data collection tool collects a large amount of target data, in order to enable the data collection process to be more ordered, the preset message middleware receives the target data according to the collection sequence of the data collection tool, on one hand, transmission conflict or blockage caused by inflow of a large amount of data is avoided, and on the other hand, the target data is enabled to be more complete and convenient for subsequent analysis.
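The source states only that the middleware receives the target data in collection order. One common way a partitioned queue keeps such order is to route every message that shares a key to the same append-only partition. The sketch below assumes the key is the path of the source file; the `OrderedTopic` class, the CRC32 hash and the partition count are illustrative choices and do not reproduce any middleware's actual partitioner.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class OrderedTopic:
    """In-memory stand-in for a partitioned topic; each partition is append-only."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def send(self, key, value):
        """Append the message to the partition chosen by its key."""
        p = partition_for(key, self.num_partitions)
        self.partitions[p].append((key, value))
        return p

    def read(self, key):
        """Return the values for `key` in the order they were sent."""
        p = partition_for(key, self.num_partitions)
        return [v for k, v in self.partitions[p] if k == key]


topic = OrderedTopic(num_partitions=3)
for i in range(3):
    # Lines from two files are interleaved as they are collected.
    topic.send("/var/log/app.log", f"app line {i}")
    topic.send("/var/log/db.log", f"db line {i}")

app_lines = topic.read("/var/log/app.log")
db_lines = topic.read("/var/log/db.log")
```

Even though lines from the two files arrive interleaved, each file's lines come back in the order in which they were collected.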
In an embodiment, fig. 4 is a third flowchart of a mass heterogeneous data acquisition method according to an embodiment of the present invention. As shown in fig. 4, the step in which the computing engine parses the target data according to a preset parsing rule to obtain structured data and places the structured data in the message middleware includes: step S410: the computing engine obtains the preset parsing rule corresponding to the source of the target data, parses the target data according to that rule to obtain structured data, and puts the structured data into the message middleware. In this embodiment, the target data is parsed according to its source, and different parsing rules can be set for different data sources. With large data volumes, the target data is first matched to a parsing rule by data source and then converted in format using that rule, which further improves parsing efficiency and increases the throughput of data acquisition and parsing.
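Source-based rule selection can be sketched as a lookup table from source name to parsing rule. In the example below each rule is a regular expression with named groups; the source names `nginx` and `app`, the patterns themselves and the `parse` helper are assumptions made for illustration and are not rules taken from the source.

```python
import re

# Hypothetical rules: one regular expression with named groups per data source.
PARSE_RULES = {
    "nginx": re.compile(
        r'(?P<ip>\S+) "(?P<method>[A-Z]+) (?P<path>\S+)" (?P<status>\d{3})'
    ),
    "app": re.compile(r"(?P<level>[A-Z]+) (?P<message>.+)"),
}


def parse(source, line):
    """Look up the rule for `source` and turn `line` into a structured record.

    Returns None when no rule exists for the source or the line does not match."""
    rule = PARSE_RULES.get(source)
    if rule is None:
        return None
    match = rule.fullmatch(line)
    if match is None:
        return None
    record = match.groupdict()
    record["source"] = source
    return record


nginx_record = parse("nginx", '10.0.0.1 "GET /health" 200')
app_record = parse("app", "ERROR payment failed")
unknown = parse("db", "anything")
```

Selecting the rule by source first means each line is tested against a single pattern rather than against every known pattern in turn.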
In one embodiment, placing the structured data into the message middleware comprises: storing the structured data into a second topic of the message middleware, wherein the message middleware is a Kafka message queue and the target data is stored in a first topic of the message middleware. In this embodiment, a Kafka message queue is used as the message middleware. The data acquisition tool tracks the target file group in the mode specified by the acquisition configuration file, collects the target data in the target file group, sends the target data to the Kafka message queue, and stores it in a topic, namely the first topic. A topic can be understood as a category of messages; for example, if the first topic is defined by data source as data from a specified program, then data from that program is the target data and is stored in the preset topic of the Kafka message queue. The computing engine loads the various data parsing expression rules, receives the target data from the message queue, classifies it by source, applies the corresponding parsing rule to parse and format it, and transfers the resulting structured data into another topic, namely the second topic of the Kafka message queue, which stores the formatted form of the target data. The first topic can be read when the target data is needed later, and the second topic when the formatted data is needed, which facilitates further analysis and processing of the data.
In this embodiment, the Kafka message queue is used as the message middleware, and its topics are used to classify and store both the received target data and the parsed, formatted structured data. For example, data from different sources can be stored in corresponding topics, and when data from a certain source is needed later, the data in that topic can be retrieved quickly, so that data collection and retrieval are more accurate and faster.
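The two-topic flow can be sketched with a small in-memory broker: raw target data is appended to a first topic, the engine reads it and appends structured records to a second topic, and the raw log remains available afterwards. The `Broker` class, the topic names `raw-logs` and `structured-logs`, the consumer group names and the `to_structured` helper are all hypothetical; this is a stand-in for illustration and not the Kafka client API.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a topic-based message queue (not the Kafka API)."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only message log
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, group, topic):
        """Return the messages this group has not yet seen; the log itself is kept."""
        log = self.topics[topic]
        start = self.offsets[(group, topic)]
        self.offsets[(group, topic)] = len(log)
        return log[start:]


def to_structured(line):
    """Toy parsing rule: the first word is the level, the rest is the message."""
    level, _, message = line.partition(" ")
    return {"level": level, "message": message}


broker = Broker()
broker.produce("raw-logs", "INFO order created")     # first topic: target data
broker.produce("raw-logs", "ERROR payment failed")

for line in broker.consume("engine", "raw-logs"):
    broker.produce("structured-logs", to_structured(line))  # second topic

structured = broker.consume("storage", "structured-logs")
raw_still_available = list(broker.topics["raw-logs"])
engine_backlog = broker.consume("engine", "raw-logs")
```

Because consuming only advances a per-group offset, the first topic can still be read by any later consumer that needs the unparsed target data.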
In an embodiment, fig. 5 is a fourth flowchart of a mass heterogeneous data acquisition method according to an embodiment of the present invention. As shown in fig. 5, after the computing engine parses the target data according to a preset parsing rule to obtain structured data and places the structured data in the message middleware, the method includes: S510: storing the formatted data into a distributed storage, and displaying the formatted data from the distributed storage by using visual chart software. In this embodiment, the structured data in the message middleware is further processed by storing it in a distributed storage such as Elasticsearch. This gives better data storage performance and makes the stored data convenient to query or export; the formatted data in the distributed storage can also be accessed more efficiently and safely by subsequent software, for example for visual display, so that the result of massive heterogeneous data acquisition is more reliable.
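As one illustration of loading structured records into a store such as Elasticsearch, the sketch below builds a newline-delimited JSON body in the shape accepted by Elasticsearch's bulk endpoint, where each document line is preceded by an action line. The index name `structured-logs` and the sample records are assumptions, and actually sending the body to a cluster is deliberately left out.

```python
import json


def build_bulk_body(index, records):
    """Build a newline-delimited JSON body: an action line, then the document."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines) + "\n"  # the body ends with a trailing newline


records = [
    {"level": "INFO", "message": "order created"},
    {"level": "ERROR", "message": "payment failed"},
]
body = build_bulk_body("structured-logs", records)
body_lines = body.splitlines()
```

Batching many records into a single request of this kind is one way to reduce the number of connections and round trips to the storage cluster.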
In an embodiment, visual chart software such as Grafana may display the acquired data as reports using the above distributed storage as its data source. Grafana is a visualization dashboard tool that displays charts and layouts, includes full-featured metric dashboards and a graph editor, and supports Graphite, Elasticsearch, InfluxDB, OpenTSDB and the like as data sources. Other visual chart software such as Kibana also supports dashboards, supports in-depth data analysis, and presents data in a variety of charts, tables and visualizations. With the method of this embodiment, the acquired data can be further displayed, so that the result of massive heterogeneous data acquisition is more complete and intuitive.
It should be understood that, although the steps in the flowcharts of fig. 2 to 5 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 to 5 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
According to another aspect of the present invention, fig. 6 is a schematic diagram of a mass heterogeneous data collection system according to another embodiment of the present invention. As shown in fig. 6, a mass heterogeneous data collection system 60 is provided, which includes a host management tool 62, a data collection tool 64, a message middleware 66, and a computation engine 68. The host management tool 62 is configured to issue the data collection tool and a collection configuration file; the data collection tool 64 is configured to track a target file group according to a mode specified by the collection configuration file, collect target data in the target file group, and send the target data to the message middleware 66; the computation engine 68 is configured to analyze the target data according to a preset analysis rule to obtain structured data; and the message middleware 66 is configured to receive the target data and the structured data.
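The idea of tracking a target file group according to a mode specified by a configuration file can be sketched as follows. This is only an illustration: the `CollectionRule` type, the glob patterns and the topic names are hypothetical, and a real collection tool would also remember read offsets and follow file rotation, which this sketch omits.

```python
import fnmatch
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CollectionRule:
    pattern: str  # glob-style pattern over file paths (assumed format)
    topic: str    # message-middleware topic that matching files report to


def match_topic(path: str, rules: List[CollectionRule]) -> Optional[str]:
    """Return the topic of the first rule whose pattern matches `path`."""
    for rule in rules:
        # fnmatchcase is case-sensitive on every platform.
        if fnmatch.fnmatchcase(path, rule.pattern):
            return rule.topic
    return None


if __name__ == "__main__":
    rules = [
        CollectionRule("/var/log/nginx/*.log", "nginx-raw"),
        CollectionRule("/data/app/*/service.log", "app-raw"),
    ]
    for p in ["/var/log/nginx/access.log", "/data/app/a1/service.log", "/tmp/x.txt"]:
        print(p, "->", match_topic(p, rules))
```

Keeping the patterns in data rather than in code is what lets a management tool change the tracked file group by issuing a new configuration file.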
In one embodiment, the host management tool 62 is further configured to detect the status of the data collection tool and the collection profile, and control the start and stop of the data collection tool and the update of the collection profile.
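One way to picture this management role is as a comparison between the observed state of a collection tool and the desired state. The sketch below is an assumption-laden illustration rather than the behavior of any specific host management tool: `CollectorState`, `plan_actions` and the action names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CollectorState:
    running: bool
    config_version: Optional[str]  # None when no configuration file is present


def plan_actions(observed: CollectorState, want_running: bool, want_version: str) -> List[str]:
    """Return the ordered actions needed to move a host to the desired state."""
    actions: List[str] = []
    stale = observed.config_version != want_version
    if stale:
        # Issue the new collection configuration file first.
        actions.append("push_config")
    if want_running and not observed.running:
        actions.append("start")
    elif want_running and stale:
        # Already running: reload so the new configuration takes effect.
        actions.append("reload")
    elif not want_running and observed.running:
        actions.append("stop")
    return actions


if __name__ == "__main__":
    print(plan_actions(CollectorState(False, None), True, "v2"))
    print(plan_actions(CollectorState(True, "v1"), True, "v2"))
    print(plan_actions(CollectorState(True, "v2"), True, "v2"))
```

Computing a plan from state, instead of issuing commands unconditionally, keeps repeated checks harmless when a host is already in the desired state.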
In an embodiment, fig. 7 is a second schematic diagram of a mass heterogeneous data acquisition system according to another embodiment of the present invention, and as shown in fig. 7, the mass heterogeneous data acquisition system 60 further includes a distributed storage 72, where the distributed storage 72 is used for storing the structured data.
In an embodiment, fig. 8 is a schematic diagram of a third embodiment of the mass heterogeneous data collection system according to the present invention, and as shown in fig. 8, the mass heterogeneous data collection system 60 further includes a visual charting module 82, where the visual charting module 82 is configured to display the formatted data.
In the mass heterogeneous data acquisition system, the host management tool issues a data acquisition tool and an acquisition configuration file; the data acquisition tool tracks a target file group according to a mode specified by the acquisition configuration file, collects target data in the target file group, and sends the target data to the message middleware; the computing engine analyzes the target data according to a preset analysis rule to obtain structured data and puts the structured data into the message middleware. Through the cooperation of the data acquisition tool, the message middleware and the computing engine, which are managed and controlled by the host management tool, the system resource occupation on the host is reduced and large-scale heterogeneous data is preprocessed.
For specific limitations of the mass heterogeneous data acquisition system, reference may be made to the above limitations on the mass heterogeneous data acquisition method, and details are not repeated here. All or some of the modules in the mass heterogeneous data acquisition system may be implemented in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In a specific embodiment, fig. 9 is a flowchart of a method for acquiring massive heterogeneous data according to an embodiment of the present invention, and as shown in fig. 9, the method includes:
S910: using the lightweight text data acquisition tool Filebeat together with the host management tool SaltStack, tracking and collecting a target file group according to a mode specified by a configuration file, such as regular-expression matching on directories and file names; this step includes deployment, configuration, monitoring and hot update of the data acquisition tool;
S910a: using SaltStack to issue the acquisition tool Filebeat and the acquisition configuration file;
S910b: starting Filebeat to collect data and report it to the corresponding topic queue of the message queue Kafka according to the rules of the configuration file;
S910c: using SaltStack to detect the working state of Filebeat and whether the configuration has taken effect, and to control the starting and stopping of Filebeat and changes to the configuration file;
S920: using the message queue component Kafka to receive, in sequence, the data reported by the acquisition tool Filebeat;
S930: writing a stream computing program based on the stream computing engine Flink, parsing the data from different sources in real time according to the rules specified for each source, and sending the data into the message queue Kafka again, so that it can conveniently be fed to other subsequent processing and other storage terminals;
S930a: the Flink stream computing task loads the various data parsing expression rules and receives the collected data from the message queue;
S930b: according to the classification of the data source, applying the corresponding parsing rule to parse the data into formatted data;
S930c: transferring the parsed and formatted structured data to another topic of the message queue Kafka, so that the data can conveniently be further analyzed and processed;
S930d: storing the parsed structured data into one or more distributed storages, such as Elasticsearch;
S940: using the open-source visual chart software Grafana to configure report displays based on the Elasticsearch data source.
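The per-source parsing described in the parsing sub-steps above can be illustrated with a small in-memory Python sketch. It is not Flink code and does not talk to Kafka: topics are modeled as plain lists, and the source names, regular expressions, the `parsed` topic name and the `unparsed` fallback topic are all assumptions introduced for this example.

```python
import json
import re
from collections import defaultdict

# Hypothetical parsing rules, keyed by data source.
PARSE_RULES = {
    "nginx_access": re.compile(
        r"(?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
    ),
    "app_log": re.compile(r"(?P<level>[A-Z]+) (?P<message>.*)"),
}


def parse_record(source, line):
    """Apply the rule registered for `source`; return a dict, or None on failure."""
    rule = PARSE_RULES.get(source)
    if rule is None:
        return None
    match = rule.match(line)
    if match is None:
        return None
    return {"source": source, **match.groupdict()}


def run_once(raw_messages, out_topic="parsed"):
    """Parse (source, line) pairs from a raw topic and route them to output topics."""
    topics = defaultdict(list)
    for source, line in raw_messages:
        record = parse_record(source, line)
        if record is not None:
            # Structured data goes to a second topic for later processing.
            topics[out_topic].append(json.dumps(record, sort_keys=True))
        else:
            # Assumed fallback: keep lines that no rule could parse.
            topics["unparsed"].append(line)
    return dict(topics)


if __name__ == "__main__":
    raw = [
        ("nginx_access", "10.0.0.1 GET /index.html 200"),
        ("app_log", "ERROR disk full"),
        ("unknown", "???"),
    ]
    for topic, messages in run_once(raw).items():
        print(topic, messages)
```

Because the rules live in a lookup table keyed by source, supporting a new data source in this sketch means adding one entry rather than changing the routing logic.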
With the above method for acquiring massive heterogeneous data, the efficiency of the acquisition system is not dragged down when the data scale grows linearly, and the resource consumption of the acquisition service on the host does not grow linearly with the data scale; access to new types of data and control of file tracking are more flexible; subsequent processing of the data, such as data analysis and statistical preprocessing, is easier; and the real-time performance from collection to persistence is better and is not affected by the data scale, since only the middleware and computing engine resources need to grow linearly. In addition, because the method uses a real-time computing engine to parse the data, the real-time performance of the parsing process is more stable and new data sources are better supported; and because a message queue is used as the data transmission channel, data is transmitted more stably among services and fault tolerance is better.
This scheme provides a unified data acquisition scheme on the cloud platform, covering the stages of data collection and access, transmission, formatting and parsing, classified routing, storage, query and report display. It is not specific to any particular type, scenario or format of data, and applies generally to all cases of file-type data collection on the cloud platform. The scheme describes full-flow processing for the data to be collected: once such data is produced, it enters the lifecycle of the scheme. The main content of the scheme describes how to collect data at large scale, how to manage and control the collection, and how to support parsing of newly and dynamically added collected data.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the mass heterogeneous data collection method is implemented.
With the computer device, the host management tool issues a data acquisition tool and an acquisition configuration file; the data acquisition tool tracks the target file group according to a mode specified by the acquisition configuration file, collects target data in the target file group, and sends the target data to the message middleware; the computing engine analyzes the target data according to a preset analysis rule to obtain structured data and puts the structured data into the message middleware. Through the cooperation of the data acquisition tool, the message middleware and the computing engine, which are managed and controlled by the host management tool, the system resource occupation on the host is reduced and large-scale heterogeneous data is preprocessed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the above-described massive heterogeneous data acquisition method.
With the computer-readable storage medium, the host management tool issues a data acquisition tool and an acquisition configuration file; the data acquisition tool tracks a target file group according to a mode specified by the acquisition configuration file, collects target data in the target file group, and sends the target data to the message middleware; the computing engine analyzes the target data according to a preset analysis rule to obtain structured data and puts the structured data into the message middleware. Through the cooperation of the data acquisition tool, the message middleware and the computing engine, which are managed and controlled by the host management tool, the system resource occupation on the host is reduced and large-scale heterogeneous data is preprocessed.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combined technical features, the combinations should be considered within the scope of the present specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A mass heterogeneous data acquisition method is characterized by comprising the following steps:
a host management tool issues a data acquisition tool and an acquisition configuration file, and the data acquisition tool tracks a target file group according to a mode designated by the acquisition configuration file, collects target data in the target file group and sends the target data to a message middleware; and
a computing engine analyzes the target data according to a preset analysis rule to obtain structured data, and puts the structured data into the message middleware.
2. The method of claim 1, wherein collecting the target data in the target set of files and sending the target data to a message middleware comprises:
the message middleware receives the target data collected by the data collection tool in sequence.
3. The method of claim 1, wherein the computing engine parses the target data according to a preset parsing rule to obtain structured data, and placing the structured data in the message middleware comprises:
the computing engine acquires a preset analysis rule corresponding to the source according to the source of the target data, analyzes the target data according to the preset analysis rule to obtain structured data, and puts the structured data into a message middleware.
4. The method of claim 1, wherein placing the structured data into the message middleware comprises:
storing the structured data into a second topic of the message middleware, wherein the message middleware is a Kafka message queue, and the target data is stored in a first topic of the message middleware.
5. The method of claim 1, wherein after the host management tool issues the data collection tool and the collection configuration file, the method further comprises:
the host management tool detects the state of the data acquisition tool and the acquisition configuration file, and controls the start and stop of the data acquisition tool and the update of the acquisition configuration file, wherein the update comprises hot update.
6. The method according to claim 1, wherein the computing engine parses the target data according to a preset parsing rule to obtain structured data, and after the structured data is placed in the message middleware, the method comprises:
storing the formatted data into a distributed storage, and displaying, by visual chart software, the formatted data based on the distributed storage.
7. A massive heterogeneous data acquisition system is characterized by comprising a host management tool, a data acquisition tool, message middleware and a calculation engine,
the host management tool is used for issuing the data acquisition tool and an acquisition configuration file;
the data acquisition tool is used for tracking a target file group according to a mode designated by the acquisition configuration file, collecting target data in the target file group and sending the target data to the message middleware;
the computing engine is used for analyzing the target data according to a preset analysis rule to obtain structured data;
the message middleware is configured to receive the target data and the structured data.
8. The mass heterogeneous data collection system according to claim 7, wherein the host management tool is further configured to detect a state of the data collection tool and the collection configuration file, and control start and stop of the data collection tool and update of the collection configuration file.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010096216.8A CN113268530A (en) | 2020-02-17 | 2020-02-17 | Mass heterogeneous data acquisition method and system, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113268530A (en) | 2021-08-17 |
Family
ID=77227497
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010096216.8A Pending CN113268530A (en) | 2020-02-17 | 2020-02-17 | Mass heterogeneous data acquisition method and system, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113268530A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180165307A1 (en) * | 2016-12-09 | 2018-06-14 | International Business Machines Corporation | Executing Queries Referencing Data Stored in a Unified Data Layer |
CN109284430A (en) * | 2018-09-07 | 2019-01-29 | 杭州艾塔科技有限公司 | Visualization subject web page content based on distributed structure/architecture crawls system and method |
CN109542733A (en) * | 2018-12-05 | 2019-03-29 | 焦点科技股份有限公司 | A kind of highly reliable real-time logs collection and visual m odeling technique method |
CN109542011A (en) * | 2018-12-05 | 2019-03-29 | 国网江西省电力有限公司信息通信分公司 | A kind of standardized acquisition system of multi-source heterogeneous monitoring data |
CN109977158A (en) * | 2019-02-28 | 2019-07-05 | 武汉烽火众智智慧之星科技有限公司 | Public security big data analysis processing system and method |
CN110276002A (en) * | 2019-06-26 | 2019-09-24 | 浙江大搜车软件技术有限公司 | Search for application data processing method, device, computer equipment and storage medium |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113918238A (en) * | 2021-09-27 | 2022-01-11 | 中盈优创资讯科技有限公司 | Flink-based heterogeneous data source synchronization method and device |
CN114070879A (en) * | 2021-11-26 | 2022-02-18 | 安天科技集团股份有限公司 | Data acquisition unit control method, device and related equipment |
CN114070879B (en) * | 2021-11-26 | 2024-01-26 | 安天科技集团股份有限公司 | Data collector control method and device and related equipment |
CN114257646A (en) * | 2021-12-20 | 2022-03-29 | 浙江时空道宇科技有限公司 | Telemetering data processing method, device, equipment and storage medium |
CN114257646B (en) * | 2021-12-20 | 2023-11-14 | 浙江时空道宇科技有限公司 | Telemetry data processing method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108039959B (en) | Data situation perception method, system and related device | |
CN113268530A (en) | Mass heterogeneous data acquisition method and system, computer equipment and storage medium | |
EP4099170B1 (en) | Method and apparatus of auditing log, electronic device, and medium | |
CN105677615B (en) | A kind of distributed machines learning method based on weka interface | |
CN113360554B (en) | Method and equipment for extracting, converting and loading ETL (extract transform load) data | |
US9992269B1 (en) | Distributed complex event processing | |
CN110134738B (en) | Distributed storage system resource estimation method and device | |
CN112527848B (en) | Report data query method, device and system based on multiple data sources and storage medium | |
CN111400361A (en) | Data real-time storage method and device, computer equipment and storage medium | |
CN111209310A (en) | Service data processing method and device based on stream computing and computer equipment | |
CN110851234A (en) | Log processing method and device based on docker container | |
CN111324606A (en) | Data fragmentation method and device | |
CN112988741A (en) | Real-time service data merging method and device and electronic equipment | |
CN110928851A (en) | Method, device and equipment for processing log information and storage medium | |
CN113282611A (en) | Method and device for synchronizing stream data, computer equipment and storage medium | |
CN107871055B (en) | Data analysis method and device | |
CN111159135A (en) | Data processing method and device, electronic equipment and storage medium | |
CN106557483B (en) | Data processing method, data query method, data processing equipment and data query equipment | |
CN113918532A (en) | Portrait label aggregation method, electronic device and storage medium | |
CN113568813A (en) | Mass network performance data acquisition method, device and system | |
CN112631754A (en) | Data processing method, data processing device, storage medium and electronic device | |
CN112182025A (en) | Log analysis method, device, equipment and computer readable storage medium | |
CN116483831A (en) | Recommendation index generation method for distributed database | |
US8856152B2 (en) | Apparatus and method for visualizing data | |
CN116401025A (en) | Data processing system and data processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210817 |