CN108681569B - Automatic data analysis system and method thereof - Google Patents

Automatic data analysis system and method thereof

Info

Publication number
CN108681569B
CN108681569B
Authority
CN
China
Prior art keywords
data
processing
unit
receivable
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810419994.9A
Other languages
Chinese (zh)
Other versions
CN108681569A (en)
Inventor
李昂
马继伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Asia Factor Shenzhen Co ltd
Original Assignee
Asia Factor Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Asia Factor Shenzhen Co ltd filed Critical Asia Factor Shenzhen Co ltd
Priority to CN201810419994.9A priority Critical patent/CN108681569B/en
Publication of CN108681569A publication Critical patent/CN108681569A/en
Application granted granted Critical
Publication of CN108681569B publication Critical patent/CN108681569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44521Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G06F9/44526Plug-ins; Add-ons

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an automatic data analysis system and a method thereof. The system comprises: a data acquisition module for connecting an external application interface and receiving data information; a data storage module for storing, processing and communicating data information; and a data analysis module for analyzing and screening the data information to obtain final data. The method comprises the steps of receiving data, completely recording and classifying the data, analyzing and screening the verified data, and uploading the analyzed and processed data to a server to generate final data. The invention greatly improves the efficiency of data analysis, reduces the error rate to a minimum, cleans redundant data, fully reveals the business trend, closely approaches the true value, and offers good confidentiality.

Description

Automatic data analysis system and method thereof
Technical Field
The invention relates to the field of data analysis, in particular to an automatic data analysis system and a method thereof.
Background
Existing data acquisition technology has low efficiency: big data is not applied properly, and huge volumes of data are left incompletely classified, so the data cannot be used effectively; even when data is extracted, the analysis and encryption of each block are not handled properly.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an automatic data analysis system and a method thereof.
In order to achieve the purpose, the invention adopts the following technical scheme:
an automatic data analysis system comprising:
the data acquisition module is used for connecting an external application interface and receiving data information;
the data storage module is used for storing, processing and communicating data information;
and the data analysis module is used for analyzing and screening the data information to obtain final data.
The further technical scheme is as follows: the data acquisition module is used for accessing a financial application interface of a partner, and is responsible for maintenance of agent end configuration, receiving and transferring of monitoring data, acquisition of network equipment data and monitoring of port health state.
The further technical scheme is as follows: the data acquisition module comprises:
the agent unit is used for acquiring data of host hardware, the operating system, middleware and the service system through built-in metrics and custom plug-ins, and for asynchronously sending the data to the repeater unit over a long-lived TCP connection;
the net-collect unit is used for collecting various performance indexes of the network equipment, including the number of bytes received and sent, the number of data packets, and the number of errors on each interface; monitoring data is sent to the repeater unit over a long-lived TCP connection, and configuration and interface information is sent to the cfc unit;
the cfc unit, deployed in a data center, directly connected to MySQL, and responsible for maintaining the metric information and plug-ins synchronized by agent or net-collect;
the cfc-proxy unit, deployed in a branch or remote machine room, serving as the communication bridge between agent/net-collect and cfc;
and the repeater unit, which can be deployed anywhere, is responsible for receiving time-series data and forwarding it to a specified back end, and supports repeater->repeater, repeater->openTSDB and repeater->Redis.
The further technical scheme is as follows: the data storage module includes:
the storage unit is used for storing the receivables data, the structured metadata and the file data blobs;
the processing unit is used for performing combinational business logic, sequential business logic and decentralized processing;
and the communication unit is used for exchanging information among the file data blob store, the metadata store and the query queue.
The further technical scheme is as follows: the data analysis module includes:
the hybrid processing analysis unit is used for processing the workloads of the batch processing analysis unit and the stream processing analysis unit;
the batch processing analysis unit is used for completely processing the data in the memory, reading the data into the memory only at the beginning, interacting with the storage layer when the final result is persistently stored, and storing the processing results of all intermediate states in the memory;
and the stream processing analysis unit is used for buffering the data, and then cutting the buffered data into small fixed data sets for batch processing.
The further technical scheme is as follows: the hybrid processing analysis unit further comprises the following: the same or related components and APIs are used to process both types of data, thereby simplifying the differing processing requirements; implemented with Spark and Flink, the unit combines the two different processing modes and makes assumptions about the relationship between fixed and unfixed data sets, providing not only the methods needed to process the data but also integrated items, libraries and tools that can be used for graph analysis, machine learning and interactive queries.
The further technical scheme is as follows: the batch processing analysis unit further comprises the following: data is processed using the RDD (Resilient Distributed Dataset) model, which resides only in memory and has an immutable structure; new RDDs can be generated by operations performed on existing RDDs, and each RDD can trace back through its lineage to a parent RDD and ultimately to the data on disk.
The further technical scheme is as follows: the stream processing analysis unit further comprises the following: data is processed using the stream processing model of Flink, which treats each incoming item as part of a true data stream; the DataStream API provided by Flink can be used to process unbounded data streams.
An automatic data analysis method comprises the following steps:
step one, connecting an external application interface and receiving data information;
step two, storing, transmitting and checking the data information;
step three, analyzing and screening the verified data;
and step four, uploading the analyzed and processed data to a server to generate final data.
The further technical scheme is as follows: in the first step, the data acquisition module is used for connecting an API (application program interface) of a partner to acquire accounts receivable data; the second step further includes the following: storing the receivables data, the structured metadata and the file data blobs, performing combinational business logic, sequential business logic and decentralized processing, and exchanging information among the file data blob store, the metadata store and the query queue; the third step further includes the following: processing the data entirely in memory, wherein the data is read into memory only at the beginning, interaction with the storage layer occurs only when the final result is persistently stored, and the processing results of all intermediate states are kept in memory; and buffering the data, then cutting the buffered data into small fixed data sets for batch processing.
Compared with the prior art, the invention has the following beneficial effects: data is received, completely recorded and classified; the verified data is analyzed and screened; and the analyzed and processed data is uploaded to a server to generate final data. The efficiency of data analysis is greatly improved, the error rate is reduced to a minimum, redundant data is cleaned, the business trend is fully revealed, the true value is closely approached, and confidentiality is good.
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Drawings
FIG. 1 is a block diagram of the automatic data analysis system;
FIG. 2 is a flow chart of the automatic data analysis method.
10 data acquisition module 11 agent unit
12 net-collect unit 13 cfc unit
14 cfc-proxy unit 15 repeater unit
20 data storage module 21 storage unit
22 processing unit 23 communication unit
30 data analysis module 31 hybrid processing analysis unit
32 batch processing analysis unit 33 stream processing analysis unit
Detailed Description
In order to more fully understand the technical content of the present invention, the technical solution of the present invention will be further described and illustrated with reference to the following specific embodiments, but not limited thereto.
As shown in FIG. 1 and FIG. 2, the present invention discloses an automatic data analysis system, including:
the data acquisition module 10 is used for connecting an external application interface and receiving data information;
the data storage module 20 is used for storing, processing and communicating data information;
and the data analysis module 30 is used for analyzing and screening the data information to obtain final data.
Specifically, as shown in FIG. 1, the data acquisition module 10 is used for accessing a financial application interface of a partner, and is responsible for maintaining agent-side configuration, receiving and forwarding monitoring data, acquiring network device data, and monitoring port health status. In a single-server deployment, the whole monitoring service becomes unavailable whenever that server needs maintenance, and expansion is difficult; the modular deployment described below avoids this.
Wherein, data acquisition module 10 includes:
the agent unit 11 is used for acquiring data of host hardware, the operating system, middleware and the service system through built-in metrics and custom plug-ins, and for asynchronously sending the data to the repeater unit 15 over a long-lived TCP connection;
the net-collect unit 12, configured to collect various performance indexes of the network equipment, including the number of bytes received and sent, the number of data packets, and the number of errors on each interface; monitoring data is sent to the repeater unit 15 over a long-lived TCP connection, and configuration and interface information is sent to the cfc unit 13;
the cfc unit 13, deployed in a data center, directly connected to MySQL, and responsible for maintaining the metric information and plug-ins synchronized by agent or net-collect;
the cfc-proxy unit 14, deployed in a branch or remote machine room, serving as the communication bridge between agent/net-collect and cfc;
and the repeater unit 15, which can be deployed anywhere, is responsible for receiving time-series data and forwarding it to a specified back end, and supports repeater->repeater, repeater->openTSDB and repeater->Redis (an illustrative sketch of this forwarding follows the list).
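As an illustration only (the component names come from the patent, but the logic and data shapes below are assumptions), a minimal Scala sketch of a repeater that receives time-series points and fans them out to its configured back ends:

```scala
// Illustrative sketch; not the patent's implementation.
sealed trait Backend
case object ToRepeater extends Backend // relay to another repeater
case object ToOpenTSDB extends Backend // persist to the time-series store
case object ToRedis    extends Backend // buffer, e.g. for the alarm service

// One monitoring data point (field names assumed).
case class TsPoint(endpoint: String, metric: String, ts: Long, value: Double)

class Repeater(backends: Seq[Backend]) {
  // Receive a point and forward it to every configured back end.
  def receive(p: TsPoint): Unit = backends.foreach(forward(p, _))

  private def forward(p: TsPoint, b: Backend): Unit = b match {
    case ToRepeater => println(s"relay -> repeater: $p")
    case ToOpenTSDB => println(s"write -> openTSDB: $p")
    case ToRedis    => println(s"push  -> Redis: $p")
  }
}

// Usage: a repeater that both persists data and feeds the alarm queue.
val r = new Repeater(Seq(ToOpenTSDB, ToRedis))
r.receive(TsPoint("web-01", "cpu.idle", 1525400000L, 87.5))
```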
Furthermore, the cfc unit is mainly responsible for configuration and maintenance, the repeater unit for receiving and forwarding monitoring data, and the net-collect unit for collecting network device data; any one of these components can be scaled horizontally, which greatly reduces system risk.
Further, the data collection module 10 interfaces with the financial application interface of the partner. A specific embodiment is as follows: suppose a newly developed application needs its key indexes combed out, such as system throughput, latency, and interface or URL access counts. Since OWL does not support actively pushed data, the data must be exposed through an Http REST API, and OWL's own app-collection plug-in is then used to collect the data periodically. The data structure exposed by the API is as follows:
(The data structure is shown in patent drawings BDA0001650445670000061 and BDA0001650445670000071, which are not reproduced here.)
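Since the drawings are unavailable, the following is a purely hypothetical Scala sketch of what a pull-style metrics payload of this kind could look like; every field name and example value below is an assumption, not the patent's actual structure:

```scala
// Hypothetical sketch only: the real structure is in the unreproduced figures.
case class MetricPoint(
  metric: String,            // e.g. "throughput", "latency", "url.access"
  value: Double,             // current reading
  timestamp: Long,           // Unix time in seconds
  tags: Map[String, String]  // dimensions, e.g. which URL or interface
)

object MetricsEndpoint {
  // What the Http REST API might return for the app-collection plug-in to poll.
  def snapshot(): List[MetricPoint] = {
    val now = System.currentTimeMillis / 1000
    List(
      MetricPoint("throughput", 1532.0, now, Map("url" -> "/api/orders")),
      MetricPoint("latency.ms", 41.7, now, Map("url" -> "/api/orders"))
    )
  }
}
```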
Based on this structure, an alarm system, a statistical analysis system, a reporting system and the like can be built on the upper layer and used freely.
The alarm service is rebuilt in the Go language and split into a controller and an alarm logic processing module. The controller is responsible for generating alarm strategies and processing alarm results; the logic processing module is responsible for pulling strategies from the controller and reading data from openTSDB for comparison, and the generated results are returned to the controller for processing.
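A minimal sketch of that compare loop; the patent implements it in Go, but it is rendered in Scala here for consistency with the other sketches, and the strategy and result shapes are assumptions:

```scala
// Illustrative only: the controller hands out strategies, the logic module
// compares the latest readings against thresholds and reports back.
case class Strategy(metric: String, threshold: Double)
case class AlarmResult(metric: String, value: Double, fired: Boolean)

def check(strategies: Seq[Strategy],
          latest: Map[String, Double]): Seq[AlarmResult] =
  for {
    s <- strategies
    v <- latest.get(s.metric).toSeq  // reading pulled from the TSDB
  } yield AlarmResult(s.metric, v, v > s.threshold)

// Usage: one strategy fires; the results go back to the controller.
val results = check(Seq(Strategy("latency.ms", 100.0)),
                    Map("latency.ms" -> 230.0))
results.foreach(println)  // AlarmResult(latency.ms,230.0,true)
```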
Specifically, as shown in fig. 1, the data storage module 20 includes:
a storage unit 21 for storing the receivables data, the structured metadata and the file data blobs;
a processing unit 22 for performing combinational business logic, sequential business logic and decentralized processing;
and a communication unit 23 for exchanging information among the file data blob store, the metadata store and the query queue.
Wherein, the storage unit 21 further includes the following contents:
The receivables data store: a receivable is a unit of stored value, for example an account receivable, air miles, or a digital art copyright. The main operations on a receivable storage system are to issue and transfer receivables (a receivable often has many variants), while preventing problems such as double-spending. Some receivable networks do not support the network by means of internal incentives; instead, the receivable serves as the incentive in a higher-level network whose lower-level infrastructure stores it.
The structured metadata store, which specifically stores structured metadata such as tables (relational DB), document stores (e.g., JSON files), key-value stores, time series or graphs; the data can then be retrieved quickly by means of a query (e.g., SQL).
Further, traditional distributed (but centralized) databases such as MongoDB and Cassandra typically store hundreds of terabytes, even petabytes, with throughput that can exceed a million operations per second.
BigchainDB is decentralized database software. Specifically, it is built on MongoDB (or RethinkDB) and inherits Mongo's query and scaling capabilities; but it also has the blockchain characteristics of decentralized control, tamper resistance, and asset support. IPDB is a public network instance of BigchainDB with governance functions; also in the blockchain field, IOTA can be considered a time-series database.
File data blob stores: these are systems that store large files (creditor-debtor documents, large data sets), organized in a hierarchy of directories and files.
IPFS and Tahoe-LAFS are decentralized file systems that consolidate decentralized or centralized blob stores. FileCoin, Storj, Sia and Tierion are all used for decentralized blob storage; so is the long-established BitTorrent, although it adopts a tit-for-tat scheme rather than receivables. Swarm, Dat and Swarm-JS basically serve both modes.
The processing unit 22 further includes the following contents:
the "smart contract" system is a label for a system that is handled in a decentralized manner, which in fact contains two very different subsets of attributes: stateless (combinational) business logic and stateful (sequential) business logic; stateless and stateful create fundamental differences in complexity, verifiability, etc., and the third decentralized processing building block is a High Performance Computing (HPC).
Further, stateless (combinational) business logic, which is arbitrary logic that does not retain state internally; in electrical engineering terms, it can be combined into combinational digital logic circuits, which can be represented as truth tables, schematics, or codes with conditional statements (if/then, or, and, combinations of non-statements), which, because they have no state, easily validate large stateless intelligent contracts, thus establishing large validation/security systems; n inputs and one output require 0(2^ N) computations to verify.
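As a toy illustration of this verifiability argument (not from the patent), a 3-input stateless condition in Scala whose entire truth table, 2^3 = 8 cases, can be checked exhaustively:

```scala
// Pure combinational logic: the output depends only on the inputs.
def release(signedByA: Boolean, signedByB: Boolean, expired: Boolean): Boolean =
  signedByA && signedByB && !expired

// With N = 3 boolean inputs there are only 2^N = 8 cases, so the whole
// truth table can be verified; stateful logic has no such small bound.
val bools = Seq(false, true)
for (a <- bools; b <- bools; e <- bools)
  println(f"a=$a%-5s b=$b%-5s expired=$e%-5s -> ${release(a, b, e)}")
```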
The Interledger Protocol (ILP) includes the Crypto-Conditions (CC) protocol, which is purely for specifying combinational circuits. It is becoming an internet standard through the IETF, and ILP is widely adopted in centralized and decentralized payment networks (e.g., by the over 75 banks using Ripple). CC has independent implementations in JavaScript, Python, Java and other languages; systems such as BigchainDB and Ripple use CC and thus support combinational business logic/smart contracts. Since stateful logic is a superset of stateless logic, systems that support stateful logic also support stateless logic (albeit at the cost of additional complexity and verifiability challenges).
Further, stateful (sequential) business logic is any logic that retains state internally; that is, it has memory, or equivalently it is a combinational logic circuit with at least one feedback loop (and a clock). For example, a microprocessor has internal registers that are updated according to the machine-code instructions sent to it. More generally, stateful business logic is a Turing machine that receives a series of inputs and returns a series of outputs; systems exhibiting this property (in a practical sense) are known as Turing-complete systems. Because sequential logic is a superset of combinational logic, these systems also support combinational logic.
Small errors in code can have serious consequences, as the hack of The DAO showed. Formal verification can help solve this problem, as it has helped the chip industry, but it has size limitations: the number of possible mappings is 2^(number of inputs) for combinational circuits, and 2^(number of internal state variables) for sequential logic if the variables are all boolean values. For example, a combinational circuit with 3 inputs has 2^3 = 8 possible states to verify, but a sequential circuit with a 32-bit register requires checking 2^32 (about 4.3 billion) states for full verification; this limits the complexity of sequential circuits. "Correctness by construction" is another way to trust stateful smart contracts, as in Rchain's use of the rho calculus.
Decentralized processing: for many use cases it is enough to process inside the browser or on the mobile client, i.e., running JavaScript or Swift. The client is trusted to do the processing, but this is generally acceptable when it runs on the device in hand; this "fat client" approach is an alternative to the "fat protocol" framework. All that many webapps need is application state: JS + IPDB (using the js-bigchaindb-driver), blob storage and payments, including the JS client version of IPFS (ipfs.js). Golem and iEx.ec position themselves as a decentralized supercomputer combined with related applications, and Nyriad positions itself as storage processing: essentially, processing located beside decentralized storage (Nyriad also has a corresponding solution for this).
TrueBit allows third-party computation but performs post-hoc checks (implicit checks when problems may occur; explicit checks when a problem arises). Large computations are run in a VM or Docker container and the results (the final VM state, or just the computed results) are put into a blob store with restricted access; access to these containers is then sold, for example via tokens granting read rights, when more customers need to verify the results.
The communication unit 23 further includes the following:
adopting IPFS + Ethereum or IPFS + IPDB, Ujo using IPFS | Swarm + IPDB + Ethereum to perform decentralized storage; IPFS or Swarm is used for file systems and blob stores, IPDB (using BigchainDB) is used for metadata stores and query queues; for receivable storage and status service logic.
Adopting Innogy to use IPFS + IPDB + IOTA in a supply chain/IoT application program; IPFS is used for file system and Blob storage, IPDB (using BigchainDB) is used for metadata storage and query queues, and IOTA is used for storing time series data.
The "fat protocol" framework of Joel Monegro is adopted to emphasize each building block as a protocol; it limits the interaction of building blocks through network protocols; but there is another method: a block may simply be an "import" statement or library call.
The reason for using import may be (a) lower latency: network calls require time, which may harm or destroy availability; (b) simplicity: using libraries (even embedded codes) is generally simpler than connecting over a network, paying for accounts receivable, etc.; (c) and (3) more mature: protocol stacks have been developed.
Specifically, as shown in fig. 1, the data analysis module 30 includes:
a hybrid processing analysis unit 31 for processing the workloads of the batch processing analysis unit 32 and the stream processing analysis unit 33;
the batch processing analysis unit 32 is used for completely processing data in the memory, reading the data into the memory only at the beginning, interacting with the storage layer when the final result is persistently stored, and storing the processing results of all intermediate states in the memory;
the stream processing analysis unit 33 buffers the data, and then cuts the buffered data into small fixed data sets for batch processing.
The hybrid processing analysis unit 31 further includes the following: the same or related components and APIs are used to process both types of data, thereby simplifying the differing processing requirements; implemented with Spark and Flink, it combines the two different processing modes and makes assumptions about the relationship between fixed and unfixed data sets, providing not only the methods needed to process the data but also integrated items, libraries and tools that can be used for graph analysis, machine learning and interactive queries.
Further, Apache Spark is a next-generation batch processing framework that also includes stream processing capability. Developed on many of the same principles as Hadoop's MapReduce engine, Spark focuses mainly on accelerating batch workloads through complete in-memory computation and a processing-optimization mechanism.
Further, Spark can be deployed as a stand-alone cluster (in cooperation with a corresponding storage layer), or it can be integrated with Hadoop as a replacement for the MapReduce engine.
The batch processing analysis unit 32 further includes the following: data is processed using the RDD model, which resides only in memory and has an immutable structure; new RDDs can be generated by operations performed on existing RDDs, and each RDD can trace back through its lineage to a parent RDD and ultimately to the data on disk.
Further, unlike MapReduce, Spark processes data in memory: the data is read into memory only at the beginning, interaction with the storage layer occurs only when the final result is persistently stored, and the processing results of all intermediate states are kept in memory. Although in-memory processing contributes greatly to performance, Spark is also much faster on disk-related tasks, because the complete task set can be analyzed in advance to achieve holistic optimization. To this end, Spark creates a Directed Acyclic Graph (DAG) representing all the operations to be performed, the data to be operated on, and the relationships between operations and data, so that the processor can coordinate tasks more intelligently.
Further, to realize batch computation in memory, Spark processes data using a model named the Resilient Distributed Dataset (RDD). An RDD represents a data set that resides only in memory and has an immutable structure; new RDDs can be generated by operations performed on existing RDDs, and each RDD can trace back through its lineage to a parent RDD and ultimately to the data on disk. Spark achieves fault tolerance through RDDs without writing the result of every operation back to disk.
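A minimal Scala sketch of this behavior (illustrative only; the file paths and local-mode settings are assumptions): data is read from disk once, each transformation yields a new immutable RDD whose lineage traces back to the file, and storage is touched again only when the result is persisted.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("rdd-lineage")
  .master("local[*]")        // assumption: local mode for the sketch
  .getOrCreate()
val sc = spark.sparkContext

val lines  = sc.textFile("data/receivables.csv") // parent RDD: data on disk
val fields = lines.map(_.split(","))             // new RDD, lineage -> lines
val bad    = fields.filter(_.length < 3)         // new RDD, lineage -> fields

println(bad.toDebugString)      // prints the lineage chain back to the file
bad.saveAsTextFile("out/bad")   // only here does Spark touch storage again
spark.stop()
```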
Wherein, the stream processing analysis unit 33 further includes the following: data is processed using Flink's stream processing model, which treats each incoming item as part of a true data stream; the DataStream API provided by Flink can be used to process unbounded data streams.
Further, stream processing capability is provided by Spark Streaming. Spark itself is designed mainly for batch workloads; to make up for the gap between the engine design and the characteristics of stream workloads, Spark implements a concept called micro-batching. In terms of concrete strategy, this technique treats the data stream as a series of very small "batches" that can be handled through the native semantics of the batch engine.
Spark Streaming buffers the stream in sub-second increments, and these buffers are then batch-processed as small fixed data sets. This approach works very well in practice, but it still falls short of true stream processing frameworks in performance. The main reason to use Spark rather than Hadoop MapReduce is speed: with the help of the in-memory computing strategy and advanced DAG scheduling, Spark can process the same data set faster. Another important advantage of Spark is diversity: the product can be deployed as an independent cluster or integrated with an existing Hadoop cluster, it can run both batch and stream processing, and a single cluster can handle different types of tasks.
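A minimal sketch of the micro-batch idea (illustrative only; the host, port and one-second interval are assumptions): the stream is buffered in fixed increments and each buffer is processed as a small fixed batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1)) // batch interval = buffer size

val lines  = ssc.socketTextStream("localhost", 9999) // incoming text stream
val counts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // runs once per micro-batch, with batch semantics

counts.print()          // one small result set per one-second buffer
ssc.start()
ssc.awaitTermination()
```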
Besides the engine's own capabilities, an ecosystem of libraries has been built around Spark, providing better support for tasks such as machine learning and interactive queries. Spark tasks are widely held to be easier to write than MapReduce ones, so Spark can greatly improve productivity.
Because a batch-oriented method is applied to stream processing, data entering the system must be buffered. The buffering mechanism lets the technique handle a very large volume of incoming data and improves overall throughput, but waiting for the buffer to empty also increases latency, which means Spark Streaming may not be suitable for workloads with strict latency requirements. Since memory is generally more expensive than disk space, Spark costs more than disk-based systems; however, faster processing means tasks finish sooner, which can offset the increased cost in environments where resources are paid for by the hour.
Another consequence of Spark's in-memory computation design is that insufficient resources may be encountered when it is deployed in a shared cluster. Compared with Hadoop MapReduce, Spark consumes more resources and may affect other tasks that need to use the cluster at the same time; by its nature, Spark is less suited to coexisting with the other components of a Hadoop stack.
Spark is the best choice for diversified workload processing tasks. Its batch processing capability provides an unparalleled speed advantage at the cost of higher memory usage, and for workloads that value throughput over latency, Spark Streaming is well suited as a streaming solution.
Apache Flink is a stream processing framework that can also handle batch tasks. The technique treats batch data as a data stream with finite boundaries, so batch tasks are processed as a subset of stream processing. This stream-first approach, also known as the Kappa architecture, is the opposite of the better-known Lambda architecture (which uses batch processing as the main method and streams as a supplement providing early, unrefined results); in the Kappa architecture, everything feasible is treated as a stream once the streaming engines have matured, which simplifies the model.
Further, the data analysis module 30 further includes a stream processing model (not shown in the figure) and a batch processing model (not shown in the figure):
a stream processing model: Flink's stream processing model treats each item as a true data stream when handling incoming data; the DataStream API provided by Flink can be used to process unbounded data streams, and the basic components Flink works with include the following (a minimal sketch follows the list):
stream refers to a borderless set of data moving constantly through the system;
operator refers to a function that performs operations on a data stream to produce other data streams;
source refers to the entry point of the data stream into the system;
sink refers to the location where the data stream arrives after leaving the Flink system; the sink may be a database or a connector to another system;
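The minimal sketch referred to above, mapping the four components onto Flink's Scala DataStream API (the host and port are placeholders, not from the patent):

```scala
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// source: entry point of the data stream into the system
val source: DataStream[String] = env.socketTextStream("localhost", 9999)

// operators: functions producing new streams from the incoming stream
val counts = source
  .flatMap(_.split(" "))
  .map((_, 1))
  .keyBy(_._1)
  .sum(1)

// sink: where the stream goes after leaving Flink (here, stdout)
counts.print()

env.execute("stream-components")
```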
In order to recover after a problem is encountered during computation, stream processing tasks create snapshots at preset points in time. For state storage, Flink can work with a variety of state backend systems, depending on the complexity and persistence level required. Furthermore, Flink's stream processing capability understands the concept of "event time", the time at which an event actually occurred, and it can also handle sessions, which means that execution order and grouping can be ensured in some interesting ways.
Batch model: Flink's batch model is largely just an extension of the stream processing model. Instead of reading from a persistent stream, the model reads a bounded data set from persistent storage in the form of a stream, and Flink uses exactly the same runtime for both processing models.
Flink can apply certain optimizations to batch workloads. For example, since batch operations are backed by persistent storage, Flink can dispense with snapshots for batch workloads; the data can still be restored, while regular processing operations run faster. Another optimization is to decompose the batch task so that different stages and components are invoked only when needed, which lets Flink coexist better with other users of the cluster; analyzing the task in advance lets Flink view all the operations to be performed, the size of the data set, and the operation steps to be carried out downstream, thereby achieving further optimization.
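For contrast with the stream sketch above, a minimal sketch of the batch model reading a bounded data set from persistent storage (Flink's DataSet API; the file path and record layout are assumptions):

```scala
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// A bounded data set read from persistent storage, processed by the same
// runtime that handles unbounded streams.
val records: DataSet[String] = env.readTextFile("data/receivables.csv")

val totals = records
  .map(_.split(","))
  .map(f => (f(0), f(2).toDouble)) // assumed layout: debtor, date, amount
  .groupBy(0)
  .sum(1)

totals.print() // for DataSet jobs, print() also triggers execution
```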
Flink currently occupies a unique position in the processing framework field. Although Spark can also perform batch and stream processing, Spark's micro-batch architecture makes it unsuitable for many stream processing cases, whereas Flink's stream-first approach provides low latency, high throughput, and true item-by-item processing.
Many of Flink's components are self-managed. Although this is rare, for performance reasons the technology manages its own memory instead of relying on the native Java garbage-collection mechanism. Unlike Spark, Flink does not need manual optimization and tuning after the characteristics of the data to be processed change, and it likewise handles operations such as data partitioning and automatic caching by itself.
Flink optimizes work in several ways. Part of its analysis is similar to the way an SQL query planner optimizes a relational database, determining the most efficient implementation for a specific task. The technology also supports multi-stage parallel execution while gathering the data of blocking tasks together, and for iterative tasks, for performance reasons, Flink tries to execute the corresponding computation on the nodes where the data is stored; in addition, "incremental iterations" can be performed, iterating only over the portions of the data that have changed.
On the user-tool side, Flink provides a web-based scheduling view through which tasks can easily be managed and system status viewed. Users can also view the optimization plan for a submitted task to learn how the task will actually be carried out in the cluster. For analysis-class tasks, Flink provides SQL-like queries, graph processing and a machine learning library, and in-memory computation is also supported.
Flink works well with other components. If used with a Hadoop stack, it integrates well into the overall environment, occupying only the necessary resources at any time; it integrates easily with YARN, HDFS and Kafka, and with the help of compatibility packages it can also run tasks written for other processing frameworks, such as Hadoop and Storm.
Flink provides low-latency stream processing while also supporting traditional batch tasks, and is perhaps best suited to organizations with very high stream processing requirements and a small number of batch tasks. It is compatible with native Storm and Hadoop programs and can run on YARN-managed clusters, so it can be evaluated easily.
Specifically, as shown in fig. 2, the present invention also discloses an automatic data analysis method, which includes the following steps:
step one, connecting an external application interface and receiving data information;
step two, storing, transmitting and checking the data information;
step three, analyzing and screening the verified data;
and step four, uploading the analyzed and processed data to a server to generate final data.
In the first step, the method connects an API (application program interface) of a partner to acquire accounts receivable data; the second step further includes the following: storing the receivables data, the structured metadata and the file data blobs, performing combinational business logic, sequential business logic and decentralized processing, and exchanging information among the file data blob store, the metadata store and the query queue; the third step further includes the following: processing the data entirely in memory, wherein the data is read into memory only at the beginning, interaction with the storage layer occurs only when the final result is persistently stored, and the processing results of all intermediate states are kept in memory; and buffering the data, then cutting the buffered data into small fixed data sets for batch processing.
In summary, the invention receives data over the network, records and classifies it completely, analyzes and screens the verified data, and uploads the analyzed and processed data to the server to generate final data. The efficiency of data analysis is greatly improved, the error rate is reduced to a minimum, redundant data is cleaned, the business trend is fully revealed, the true value is closely approached, and confidentiality is good.
The technical contents of the present invention are further illustrated by the examples only for the convenience of the reader, but the embodiments of the present invention are not limited thereto, and any technical extension or re-creation based on the present invention is protected by the present invention. The protection scope of the invention is subject to the claims.

Claims (9)

1. An automatic data analysis system, comprising:
the data acquisition module is used for connecting an external application interface and receiving data information;
the data storage module is used for storing, processing and communicating data information;
the data analysis module is used for analyzing and screening the data information to obtain final data;
the data storage module includes:
the storage unit is used for storing the receivables data, the structured metadata and the file data blobs;
the processing unit is used for performing combinational business logic, sequential business logic and decentralized processing;
the communication unit is used for exchanging information among the file data blob store, the metadata store and the query queue;
wherein the storage unit further includes the following:
the receivables data store, wherein a receivable is a unit of stored value, namely an account receivable, air miles, or a digital art copyright; the main operations on the receivable storage system are to issue and transfer receivables while preventing problems such as double-spending; some receivable networks do not support the network by means of internal incentives, and instead the receivable serves as the incentive in a higher-level network whose lower-level infrastructure stores it;
the structured metadata store, which specifically stores structured metadata, including tables, document stores, key-value stores, time series or graphs, the data then being retrieved quickly by means of a query.
2. The system according to claim 1, wherein the data collection module is configured to access a financial application interface of a partner, and is responsible for maintenance of agent-side configuration, reception and forwarding of monitoring data, collection of network device data, and monitoring of port health status.
3. The system of claim 2, wherein the data collection module comprises:
the agent unit is used for acquiring data of host hardware, the operating system, middleware and the service system through built-in metrics and custom plug-ins, and for asynchronously sending the data to the repeater unit over a long-lived TCP connection;
the net-collect unit is used for collecting various performance indexes of the network equipment, including the number of bytes received and sent, the number of data packets, and the number of errors on each interface; monitoring data is sent to the repeater unit over a long-lived TCP connection, and configuration and interface information is sent to the cfc unit;
the cfc unit, deployed in a data center, directly connected to MySQL, and responsible for maintaining the metric information and plug-ins synchronized by agent or net-collect;
the cfc-proxy unit, deployed in a branch or remote machine room, serving as the communication bridge between agent/net-collect and cfc;
and the repeater unit, which can be deployed anywhere, is responsible for receiving time-series data and forwarding it to a specified back end, and supports repeater->repeater, repeater->openTSDB and repeater->Redis.
4. The system of claim 1, wherein the data analysis module comprises:
the hybrid processing analysis unit is used for processing the workloads of the batch processing analysis unit and the stream processing analysis unit;
the batch processing analysis unit is used for completely processing the data in the memory, reading the data into the memory only at the beginning, interacting with the storage layer when the final result is persistently stored, and storing the processing results of all intermediate states in the memory;
and the stream processing analysis unit is used for buffering the data, and then cutting the buffered data into small fixed data sets for batch processing.
5. The automatic data analysis system of claim 4, wherein the hybrid processing analysis unit further comprises the following: the same or related components and APIs are used to process both types of data, thereby simplifying the differing processing requirements; implemented with Spark and Flink, the unit combines the two different processing modes and makes assumptions about the relationship between fixed and unfixed data sets, providing not only the methods needed to process the data but also integrated items, libraries and tools that can be used for graph analysis, machine learning and interactive queries.
6. The system of claim 4, wherein the batch processing analysis unit further comprises the following: data is processed using the RDD (Resilient Distributed Dataset) model, which resides only in memory and has an immutable structure; new RDDs can be generated by operations performed on existing RDDs, and each RDD can trace back through its lineage to a parent RDD and ultimately to the data on disk.
7. The system of claim 4, wherein the stream processing analysis unit further comprises the following: data is processed using the stream processing model of Flink, which treats each incoming item as part of a true data stream; the DataStream API provided by Flink can be used to process unbounded data streams.
8. An automatic data analysis method is characterized by comprising the following steps:
step one, connecting an external application interface and receiving data information;
step two, storing, transmitting and checking the data information;
step three, analyzing and screening the verified data;
and step four, uploading the analyzed and processed data to a server to generate final data;
wherein the method further includes the following:
the receivables data store, wherein a receivable is a unit of stored value, namely an account receivable, air miles, or a digital art copyright; the main operations on the receivable storage system are to issue and transfer receivables while preventing problems such as double-spending; some receivable networks do not support the network by means of internal incentives, and instead the receivable serves as the incentive in a higher-level network whose lower-level infrastructure stores it;
the structured metadata store, which specifically stores structured metadata, including tables, document stores, key-value stores, time series or graphs, the data then being retrieved quickly by means of a query.
9. The method according to claim 8, wherein in the first step, the accounts receivable data is acquired by connecting to an API interface of a partner; the second step further includes the following: storing the receivables data, the structured metadata and the file data blobs, performing combinational business logic, sequential business logic and decentralized processing, and exchanging information among the file data blob store, the metadata store and the query queue; the third step further includes the following: processing the data entirely in memory, wherein the data is read into memory only at the beginning, interaction with the storage layer occurs only when the final result is persistently stored, and the processing results of all intermediate states are kept in memory; and buffering the data, then cutting the buffered data into small fixed data sets for batch processing.
CN201810419994.9A 2018-05-04 2018-05-04 Automatic data analysis system and method thereof Active CN108681569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810419994.9A CN108681569B (en) 2018-05-04 2018-05-04 Automatic data analysis system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810419994.9A CN108681569B (en) 2018-05-04 2018-05-04 Automatic data analysis system and method thereof

Publications (2)

Publication Number Publication Date
CN108681569A CN108681569A (en) 2018-10-19
CN108681569B true CN108681569B (en) 2021-11-02

Family

ID=63802879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810419994.9A Active CN108681569B (en) 2018-05-04 2018-05-04 Automatic data analysis system and method thereof

Country Status (1)

Country Link
CN (1) CN108681569B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW202016743A (en) 2018-10-25 2020-05-01 財團法人資訊工業策進會 Data processing apparatus and data processing method for internet of things system
CN109710680A (en) * 2018-12-29 2019-05-03 杭州趣链科技有限公司 A kind of block chain data processing engine and operating method
CN110209646A (en) * 2019-05-14 2019-09-06 汇通达网络股份有限公司 A kind of data platform system calculated based on real-time streaming
CN110659304B (en) * 2019-09-09 2023-06-16 杭州中科先进技术研究院有限公司 Multi-path data stream connection system based on data inclination
CN110908991A (en) * 2019-11-27 2020-03-24 苏州骐越信息科技有限公司 Enterprise data deleting and selecting system
CN111128308B (en) * 2019-12-26 2023-03-24 上海市精神卫生中心(上海市心理咨询培训中心) New mutation information knowledge platform for neuropsychiatric diseases
CN111127196A (en) * 2019-12-31 2020-05-08 中信百信银行股份有限公司 Credit wind control characteristic variable management method and system
CN111414159B (en) * 2020-03-16 2023-07-25 北京艾鸥科技有限公司 Block chain virtual machine device, virtual machine creation method and transaction method
CN112068933B (en) * 2020-09-02 2021-08-10 成都鱼泡科技有限公司 Real-time distributed data monitoring method
CN112184448A (en) * 2020-09-30 2021-01-05 上海旺链信息科技有限公司 Block chain-based self-organizing trusted incentive processing method, system and storage medium
CN115106682A (en) * 2021-03-19 2022-09-27 润智科技有限公司 Full-automatic welding machine data interaction method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0833285A2 (en) * 1996-09-27 1998-04-01 Xerox Corporation Method and product for generating electronic tokens
US7596530B1 (en) * 2008-09-23 2009-09-29 Marcelo Glasberg Method for internet payments for content
CN106296184A (en) * 2015-06-05 2017-01-04 地气股份有限公司 Electronic money management method and electronic-monetary system
CN106777243A (en) * 2016-12-27 2017-05-31 浪潮软件集团有限公司 Dynamic modeling of streaming data analysis
CN107846690A (en) * 2017-10-19 2018-03-27 湖北工业大学 Collaboration communication dynamic bargain motivational techniques under a kind of independent asymmetrical information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105163349B (en) * 2015-08-03 2016-05-04 东南大学 A kind of multiple gateway Wireless Mesh network implementation method based on IEEE802.11s
CN105701161A (en) * 2015-12-31 2016-06-22 深圳先进技术研究院 Real-time big data user label system
US10275278B2 (en) * 2016-09-14 2019-04-30 Salesforce.Com, Inc. Stream processing task deployment using precompiled libraries
CN106549829B (en) * 2016-10-28 2019-11-12 北方工业大学 Big data computing platform monitoring system and method
CN107395669B (en) * 2017-06-01 2020-04-07 华南理工大学 Data acquisition method and system based on streaming real-time distributed big data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0833285A2 (en) * 1996-09-27 1998-04-01 Xerox Corporation Method and product for generating electronic tokens
US7596530B1 (en) * 2008-09-23 2009-09-29 Marcelo Glasberg Method for internet payments for content
CN106296184A (en) * 2015-06-05 2017-01-04 地气股份有限公司 Electronic money management method and electronic-monetary system
CN106777243A (en) * 2016-12-27 2017-05-31 浪潮软件集团有限公司 Dynamic modeling of streaming data analysis
CN107846690A (en) * 2017-10-19 2018-03-27 湖北工业大学 Collaboration communication dynamic bargain motivational techniques under a kind of independent asymmetrical information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Computer aided engineering and project performance: managing a double-edged sword; T.J. Allen et al.; IEEE International Conference on Engineering Management, Gaining the Competitive Advantage; 2002-08-06; 1-8 *
Dynamic management of the credit policy of construction enterprise A based on accounting knowledge flow; Cheng Ping et al.; Commercial Accounting; 2014-10-10 (No. 19); 6-9 *

Also Published As

Publication number Publication date
CN108681569A (en) 2018-10-19

Similar Documents

Publication Publication Date Title
CN108681569B (en) Automatic data analysis system and method thereof
US10447772B2 (en) Managed function execution for processing data streams in real time
US10769148B1 (en) Relocating data sharing operations for query processing
Narkhede et al. Kafka: the definitive guide: real-time data and stream processing at scale
US10713247B2 (en) Executing queries for structured data and not-structured data
US10528599B1 (en) Tiered data processing for distributed data
Grover et al. Data Ingestion in AsterixDB.
US20140122559A1 (en) Runtime grouping of tuples in a streaming application
US11727004B2 (en) Context dependent execution time prediction for redirecting queries
US9426219B1 (en) Efficient multi-part upload for a data warehouse
Kotenko et al. Aggregation of elastic stack instruments for collecting, storing and processing of security information and events
CN107103064B (en) Data statistical method and device
US9992269B1 (en) Distributed complex event processing
CN108228322B (en) Distributed link tracking and analyzing method, server and global scheduler
Ahuja et al. State of big data analysis in the cloud
Roth et al. Pgx. d/async: A scalable distributed graph pattern matching engine
Senger et al. BSP cost and scalability analysis for MapReduce operations
Theeten et al. Chive: Bandwidth optimized continuous querying in distributed clouds
CN109829094A (en) Distributed reptile system
Panda et al. High-performance big data computing
Chen et al. Towards low-latency big data infrastructure at sangfor
Zhou et al. Cloudview: describe and maintain resource view in cloud
US20230195510A1 (en) Parallel execution of stateful black box operators
US11782983B1 (en) Expanded character encoding to enhance regular expression filter capabilities
Rachuri et al. Optimizing Near-Data Processing for Spark

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant