CN108681569B - Automatic data analysis system and method thereof - Google Patents

Automatic data analysis system and method thereof

Info

Publication number
CN108681569B
CN108681569B
Authority
CN
China
Prior art keywords
data
processing
unit
receivable
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810419994.9A
Other languages
Chinese (zh)
Other versions
CN108681569A (en)
Inventor
李昂
马继伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Asia Factor Shenzhen Co ltd
Original Assignee
Asia Factor Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Asia Factor Shenzhen Co ltd filed Critical Asia Factor Shenzhen Co ltd
Priority to CN201810419994.9A priority Critical patent/CN108681569B/en
Publication of CN108681569A publication Critical patent/CN108681569A/en
Application granted granted Critical
Publication of CN108681569B publication Critical patent/CN108681569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44521Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G06F9/44526Plug-ins; Add-ons

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an automatic data analysis system and a method thereof. The system comprises: a data acquisition module for connecting an external application interface and receiving data information; a data storage module for storing, processing and communicating data information; and a data analysis module for analyzing and screening the data information to obtain final data. The method comprises the steps of receiving data, completely recording and classifying the data, analyzing and screening the verified data, and uploading the analyzed and processed data to a server to generate final data. The invention greatly improves the efficiency of data analysis, reduces the error rate to a minimum, cleans redundant data, fully reveals the business trend, closely approaches the true value, and offers good confidentiality.

Description

Automatic data analysis system and method thereof
Technical Field
The invention relates to the field of data analysis, in particular to an automatic data analysis system and a method thereof.
Background
Existing data acquisition technology has low efficiency: big data is not applied properly, and huge volumes of data are left incompletely classified, so the data cannot be used effectively; even when data is extracted, the analysis and encryption of each block are not handled properly.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an automatic data analysis system and a method thereof.
In order to achieve the purpose, the invention adopts the following technical scheme:
an automatic data analysis system comprising:
the data acquisition module is used for connecting an external application interface and receiving data information;
the data storage module is used for storing, processing and communicating data information;
and the data analysis module is used for analyzing and screening the data information to obtain final data.
The further technical scheme is as follows: the data acquisition module is used for accessing a financial application interface of a partner, and is responsible for maintenance of agent end configuration, receiving and transferring of monitoring data, acquisition of network equipment data and monitoring of port health state.
The further technical scheme is as follows: the data acquisition module comprises:
the agent unit is used for acquiring data of host hardware, the operating system, middleware and the service system through built-in metrics and custom plug-ins, and for asynchronously sending the data to the repeater unit over a long-lived TCP connection;
the net-collect unit is used for collecting various performance indexes of the network equipment, including the number of bytes received and sent, the number of data packets, and the number of errors on each interface; monitoring data is sent to the repeater unit over a long-lived TCP connection, and configuration and interface information is sent to the cfc unit;
the cfc unit, deployed in a data center, directly connected to MySQL, and responsible for maintaining the metric information and plug-ins synchronized by agent or net-collect;
the cfc-proxy unit, deployed in a branch or remote machine room, serving as the communication bridge between agent/net-collect and cfc;
and the repeater unit, which can be deployed anywhere, is responsible for receiving time-series data and forwarding it to a specified back end, and supports repeater->repeater, repeater->openTSDB and repeater->Redis.
The further technical scheme is as follows: the data storage module includes:
the storage unit is used for storing the receivables data, the structured metadata and the file data blobs;
the processing unit is used for performing combinational business logic, sequential business logic and decentralized processing;
and the communication unit is used for exchanging information among the file data blob store, the metadata store and the query queue.
The further technical scheme is as follows: the data analysis module includes:
the hybrid processing analysis unit is used for processing the workloads of the batch processing analysis unit and the stream processing analysis unit;
the batch processing analysis unit is used for completely processing the data in the memory, reading the data into the memory only at the beginning, interacting with the storage layer when the final result is persistently stored, and storing the processing results of all intermediate states in the memory;
and the stream processing analysis unit is used for buffering the data, and then cutting the buffered data into small fixed data sets for batch processing.
The further technical scheme is as follows: the hybrid processing analysis unit further comprises the following: the same or related components and APIs are used to process both types of data, thereby simplifying the differing processing requirements; implemented with Spark and Flink, the unit combines the two different processing modes and makes assumptions about the relationship between fixed and unfixed data sets, providing not only the methods needed to process the data but also integrated items, libraries and tools that can be used for graph analysis, machine learning and interactive queries.
The further technical scheme is as follows: the batch processing analysis unit further comprises the following: data is processed using the RDD (Resilient Distributed Dataset) model, which resides only in memory and has an immutable structure; new RDDs can be generated by operations performed on existing RDDs, and each RDD can trace back through its lineage to a parent RDD and ultimately to the data on disk.
The further technical scheme is as follows: the stream processing analysis unit further comprises the following: data is processed using the stream processing model of Flink, which treats each incoming item as part of a true data stream; the DataStream API provided by Flink can be used to process unbounded data streams.
An automatic data analysis method comprises the following steps:
step one, connecting an external application interface and receiving data information;
step two, storing, transmitting and checking the data information;
step three, analyzing and screening the verified data;
and step four, uploading the analyzed and processed data to a server to generate final data.
The further technical scheme is as follows: in the first step, the data acquisition module is used for connecting an API (application program interface) of a partner to acquire accounts receivable data; the second step further includes the following: storing the receivables data, the structured metadata and the file data blobs, performing combinational business logic, sequential business logic and decentralized processing, and exchanging information among the file data blob store, the metadata store and the query queue; the third step further includes the following: processing the data entirely in memory, wherein the data is read into memory only at the beginning, interaction with the storage layer occurs only when the final result is persistently stored, and the processing results of all intermediate states are kept in memory; and buffering the data, then cutting the buffered data into small fixed data sets for batch processing.
Compared with the prior art, the invention has the following beneficial effects: data is received, completely recorded and classified; the verified data is analyzed and screened; and the analyzed and processed data is uploaded to a server to generate final data. The efficiency of data analysis is greatly improved, the error rate is reduced to a minimum, redundant data is cleaned, the business trend is fully revealed, the true value is closely approached, and confidentiality is good.
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Drawings
FIG. 1 is a block diagram of the automatic data analysis system;
FIG. 2 is a flow chart of the automatic data analysis method.
10 data acquisition module 11 agent unit
12 net-collect unit 13 cfc unit
14 cfc-proxy unit 15 repeater unit
20 data storage module 21 storage unit
22 processing unit 23 communication unit
30 data analysis module 31 hybrid processing analysis unit
32 batch processing analysis unit 33 stream processing analysis unit
Detailed Description
In order to more fully understand the technical content of the present invention, the technical solution of the present invention will be further described and illustrated with reference to the following specific embodiments, but not limited thereto.
As shown in FIG. 1 and FIG. 2, the present invention discloses an automatic data analysis system, including:
the data acquisition module 10 is used for connecting an external application interface and receiving data information;
the data storage module 20 is used for storing, processing and communicating data information;
and the data analysis module 30 is used for analyzing and screening the data information to obtain final data.
Specifically, as shown in FIG. 1, the data acquisition module 10 is used for accessing a financial application interface of a partner, and is responsible for maintaining agent-side configuration, receiving and forwarding monitoring data, acquiring network device data, and monitoring port health status. In a single-server deployment, the whole monitoring service becomes unavailable whenever that server needs maintenance, and expansion is difficult; the modular deployment described below avoids this.
Wherein, data acquisition module 10 includes:
the agent unit 11 is used for acquiring data of host hardware, the operating system, middleware and the service system through built-in metrics and custom plug-ins, and for asynchronously sending the data to the repeater unit 15 over a long-lived TCP connection;
the net-collect unit 12, configured to collect various performance indexes of the network equipment, including the number of bytes received and sent, the number of data packets, and the number of errors on each interface; monitoring data is sent to the repeater unit 15 over a long-lived TCP connection, and configuration and interface information is sent to the cfc unit 13;
the cfc unit 13, deployed in a data center, directly connected to MySQL, and responsible for maintaining the metric information and plug-ins synchronized by agent or net-collect;
the cfc-proxy unit 14, deployed in a branch or remote machine room, serving as the communication bridge between agent/net-collect and cfc;
and the repeater unit 15, which can be deployed anywhere, is responsible for receiving time-series data and forwarding it to a specified back end, and supports repeater->repeater, repeater->openTSDB and repeater->Redis (an illustrative sketch of this forwarding follows the list).
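As an illustration only (the component names come from the patent, but the logic and data shapes below are assumptions), a minimal Scala sketch of a repeater that receives time-series points and fans them out to its configured back ends:

```scala
// Illustrative sketch; not the patent's implementation.
sealed trait Backend
case object ToRepeater extends Backend // relay to another repeater
case object ToOpenTSDB extends Backend // persist to the time-series store
case object ToRedis    extends Backend // buffer, e.g. for the alarm service

// One monitoring data point (field names assumed).
case class TsPoint(endpoint: String, metric: String, ts: Long, value: Double)

class Repeater(backends: Seq[Backend]) {
  // Receive a point and forward it to every configured back end.
  def receive(p: TsPoint): Unit = backends.foreach(forward(p, _))

  private def forward(p: TsPoint, b: Backend): Unit = b match {
    case ToRepeater => println(s"relay -> repeater: $p")
    case ToOpenTSDB => println(s"write -> openTSDB: $p")
    case ToRedis    => println(s"push  -> Redis: $p")
  }
}

// Usage: a repeater that both persists data and feeds the alarm queue.
val r = new Repeater(Seq(ToOpenTSDB, ToRedis))
r.receive(TsPoint("web-01", "cpu.idle", 1525400000L, 87.5))
```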
Furthermore, the cfc unit is mainly responsible for configuration and maintenance, the repeater unit for receiving and forwarding monitoring data, and the net-collect unit for collecting network device data; any one of these components can be scaled horizontally, which greatly reduces system risk.
Further, the data collection module 10 interfaces with the financial application interface of the partner. A specific embodiment is as follows: suppose a newly developed application needs its key indexes combed out, such as system throughput, latency, and interface or URL access counts. Since OWL does not support actively pushed data, the data must be exposed through an Http REST API, and OWL's own app-collection plug-in is then used to collect the data periodically. The data structure exposed by the API is as follows:
(The data structure is shown in patent drawings BDA0001650445670000061 and BDA0001650445670000071, which are not reproduced here.)
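Since the drawings are unavailable, the following is a purely hypothetical Scala sketch of what a pull-style metrics payload of this kind could look like; every field name and example value below is an assumption, not the patent's actual structure:

```scala
// Hypothetical sketch only: the real structure is in the unreproduced figures.
case class MetricPoint(
  metric: String,            // e.g. "throughput", "latency", "url.access"
  value: Double,             // current reading
  timestamp: Long,           // Unix time in seconds
  tags: Map[String, String]  // dimensions, e.g. which URL or interface
)

object MetricsEndpoint {
  // What the Http REST API might return for the app-collection plug-in to poll.
  def snapshot(): List[MetricPoint] = {
    val now = System.currentTimeMillis / 1000
    List(
      MetricPoint("throughput", 1532.0, now, Map("url" -> "/api/orders")),
      MetricPoint("latency.ms", 41.7, now, Map("url" -> "/api/orders"))
    )
  }
}
```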
Based on this structure, an alarm system, a statistical analysis system, a reporting system and the like can be built on the upper layer and used freely.
The alarm service is rebuilt in the Go language and split into a controller and an alarm logic processing module. The controller is responsible for generating alarm strategies and processing alarm results; the logic processing module is responsible for pulling strategies from the controller and reading data from openTSDB for comparison, and the generated results are returned to the controller for processing.
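A minimal sketch of that compare loop; the patent implements it in Go, but it is rendered in Scala here for consistency with the other sketches, and the strategy and result shapes are assumptions:

```scala
// Illustrative only: the controller hands out strategies, the logic module
// compares the latest readings against thresholds and reports back.
case class Strategy(metric: String, threshold: Double)
case class AlarmResult(metric: String, value: Double, fired: Boolean)

def check(strategies: Seq[Strategy],
          latest: Map[String, Double]): Seq[AlarmResult] =
  for {
    s <- strategies
    v <- latest.get(s.metric).toSeq  // reading pulled from the TSDB
  } yield AlarmResult(s.metric, v, v > s.threshold)

// Usage: one strategy fires; the results go back to the controller.
val results = check(Seq(Strategy("latency.ms", 100.0)),
                    Map("latency.ms" -> 230.0))
results.foreach(println)  // AlarmResult(latency.ms,230.0,true)
```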
Specifically, as shown in fig. 1, the data storage module 20 includes:
a storage unit 21 for storing the receivables data, the structured metadata and the file data blobs;
a processing unit 22 for performing combinational business logic, sequential business logic and decentralized processing;
and a communication unit 23 for exchanging information among the file data blob store, the metadata store and the query queue.
Wherein, the storage unit 21 further includes the following contents:
The receivables data store: a receivable is a unit of stored value, for example an account receivable, air miles, or a digital art copyright. The main operations on a receivable storage system are to issue and transfer receivables (a receivable often has many variants), while preventing problems such as double-spending. Some receivable networks do not support the network by means of internal incentives; instead, the receivable serves as the incentive in a higher-level network whose lower-level infrastructure stores it.
The structured metadata store, which specifically stores structured metadata such as tables (relational DB), document stores (e.g., JSON files), key-value stores, time series or graphs; the data can then be retrieved quickly by means of a query (e.g., SQL).
Further, traditional distributed (but centralized) databases such as MongoDB and Cassandra typically store hundreds of terabytes, even petabytes, with throughput that can exceed a million operations per second.
BigchainDB is decentralized database software. Specifically, it is built on MongoDB (or RethinkDB) and inherits Mongo's query and scaling capabilities; but it also has the blockchain characteristics of decentralized control, tamper resistance, and asset support. IPDB is a public network instance of BigchainDB with governance functions; also in the blockchain field, IOTA can be considered a time-series database.
File data blob stores: these are systems that store large files (creditor-debtor documents, large data sets), organized in a hierarchy of directories and files.
IPFS and Tahoe-LAFS are decentralized file systems that consolidate decentralized or centralized blob stores. FileCoin, Storj, Sia and Tierion are all used for decentralized blob storage; so is the long-established BitTorrent, although it adopts a tit-for-tat scheme rather than receivables. Swarm, Dat and Swarm-JS basically serve both modes.
The processing unit 22 further includes the following contents:
the "smart contract" system is a label for a system that is handled in a decentralized manner, which in fact contains two very different subsets of attributes: stateless (combinational) business logic and stateful (sequential) business logic; stateless and stateful create fundamental differences in complexity, verifiability, etc., and the third decentralized processing building block is a High Performance Computing (HPC).
Further, stateless (combinational) business logic, which is arbitrary logic that does not retain state internally; in electrical engineering terms, it can be combined into combinational digital logic circuits, which can be represented as truth tables, schematics, or codes with conditional statements (if/then, or, and, combinations of non-statements), which, because they have no state, easily validate large stateless intelligent contracts, thus establishing large validation/security systems; n inputs and one output require 0(2^ N) computations to verify.
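As a toy illustration of this verifiability argument (not from the patent), a 3-input stateless condition in Scala whose entire truth table, 2^3 = 8 cases, can be checked exhaustively:

```scala
// Pure combinational logic: the output depends only on the inputs.
def release(signedByA: Boolean, signedByB: Boolean, expired: Boolean): Boolean =
  signedByA && signedByB && !expired

// With N = 3 boolean inputs there are only 2^N = 8 cases, so the whole
// truth table can be verified; stateful logic has no such small bound.
val bools = Seq(false, true)
for (a <- bools; b <- bools; e <- bools)
  println(f"a=$a%-5s b=$b%-5s expired=$e%-5s -> ${release(a, b, e)}")
```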
The Interledger Protocol (ILP) includes the Crypto-Conditions (CC) protocol, which is purely for specifying combinational circuits. It is becoming an internet standard through the IETF, and ILP is widely adopted in centralized and decentralized payment networks (e.g., by the over 75 banks using Ripple). CC has independent implementations in JavaScript, Python, Java and other languages; systems such as BigchainDB and Ripple use CC and thus support combinational business logic/smart contracts. Since stateful logic is a superset of stateless logic, systems that support stateful logic also support stateless logic (albeit at the cost of additional complexity and verifiability challenges).
Further, stateful (sequential) business logic is any logic that retains state internally; that is, it has memory, or equivalently it is a combinational logic circuit with at least one feedback loop (and a clock). For example, a microprocessor has internal registers that are updated according to the machine-code instructions sent to it. More generally, stateful business logic is a Turing machine that receives a series of inputs and returns a series of outputs; systems exhibiting this property (in a practical sense) are known as Turing-complete systems. Because sequential logic is a superset of combinational logic, these systems also support combinational logic.
Small errors in code can have serious consequences, as the hack of The DAO showed. Formal verification can help solve this problem, as it has helped the chip industry, but it has size limitations: the number of possible mappings is 2^(number of inputs) for combinational circuits, and 2^(number of internal state variables) for sequential logic if the variables are all boolean values. For example, a combinational circuit with 3 inputs has 2^3 = 8 possible states to verify, but a sequential circuit with a 32-bit register requires checking 2^32 (about 4.3 billion) states for full verification; this limits the complexity of sequential circuits. "Correctness by construction" is another way to trust stateful smart contracts, as in Rchain's use of the rho calculus.
Decentralized processing: for many use cases it is enough to process inside the browser or on the mobile client, i.e., running JavaScript or Swift. The client is trusted to do the processing, but this is generally acceptable when it runs on the device in hand; this "fat client" approach is an alternative to the "fat protocol" framework. All that many webapps need is application state: JS + IPDB (using the js-bigchaindb-driver), blob storage and payments, including the JS client version of IPFS (ipfs.js). Golem and iEx.ec position themselves as a decentralized supercomputer combined with related applications, and Nyriad positions itself as storage processing: essentially, processing located beside decentralized storage (Nyriad also has a corresponding solution for this).
TrueBit allows third-party computation but performs post-hoc checks (implicit checks when problems may occur; explicit checks when a problem arises). Large computations are run in a VM or Docker container and the results (the final VM state, or just the computed results) are put into a blob store with restricted access; access to these containers is then sold, for example via tokens granting read rights, when more customers need to verify the results.
The communication unit 23 further includes the following:
adopting IPFS + Ethereum or IPFS + IPDB, Ujo using IPFS | Swarm + IPDB + Ethereum to perform decentralized storage; IPFS or Swarm is used for file systems and blob stores, IPDB (using BigchainDB) is used for metadata stores and query queues; for receivable storage and status service logic.
Adopting Innogy to use IPFS + IPDB + IOTA in a supply chain/IoT application program; IPFS is used for file system and Blob storage, IPDB (using BigchainDB) is used for metadata storage and query queues, and IOTA is used for storing time series data.
The "fat protocol" framework of Joel Monegro is adopted to emphasize each building block as a protocol; it limits the interaction of building blocks through network protocols; but there is another method: a block may simply be an "import" statement or library call.
The reason for using import may be (a) lower latency: network calls require time, which may harm or destroy availability; (b) simplicity: using libraries (even embedded codes) is generally simpler than connecting over a network, paying for accounts receivable, etc.; (c) and (3) more mature: protocol stacks have been developed.
Specifically, as shown in fig. 1, the data analysis module 30 includes:
a hybrid processing analysis unit 31 for processing the workloads of the batch processing analysis unit 32 and the stream processing analysis unit 33;
the batch processing analysis unit 32 is used for completely processing data in the memory, reading the data into the memory only at the beginning, interacting with the storage layer when the final result is persistently stored, and storing the processing results of all intermediate states in the memory;
the stream processing analysis unit 33 buffers the data, and then cuts the buffered data into small fixed data sets for batch processing.
The hybrid processing analysis unit 31 further includes the following: the same or related components and APIs are used to process both types of data, thereby simplifying the differing processing requirements; implemented with Spark and Flink, it combines the two different processing modes and makes assumptions about the relationship between fixed and unfixed data sets, providing not only the methods needed to process the data but also integrated items, libraries and tools that can be used for graph analysis, machine learning and interactive queries.
Further, Apache Spark is a next-generation batch processing framework that also includes stream processing capability. Developed on many of the same principles as Hadoop's MapReduce engine, Spark focuses mainly on accelerating batch workloads through complete in-memory computation and a processing-optimization mechanism.
Further, Spark can be deployed as a stand-alone cluster (in cooperation with a corresponding storage layer), or it can be integrated with Hadoop as a replacement for the MapReduce engine.
The batch processing analysis unit 32 further includes the following: data is processed using the RDD model, which resides only in memory and has an immutable structure; new RDDs can be generated by operations performed on existing RDDs, and each RDD can trace back through its lineage to a parent RDD and ultimately to the data on disk.
Further, unlike MapReduce, Spark processes data in memory: the data is read into memory only at the beginning, interaction with the storage layer occurs only when the final result is persistently stored, and the processing results of all intermediate states are kept in memory. Although in-memory processing contributes greatly to performance, Spark is also much faster on disk-related tasks, because the complete task set can be analyzed in advance to achieve holistic optimization. To this end, Spark creates a Directed Acyclic Graph (DAG) representing all the operations to be performed, the data to be operated on, and the relationships between operations and data, so that the processor can coordinate tasks more intelligently.
Further, to realize batch computation in memory, Spark processes data using a model named the Resilient Distributed Dataset (RDD). An RDD represents a data set that resides only in memory and has an immutable structure; new RDDs can be generated by operations performed on existing RDDs, and each RDD can trace back through its lineage to a parent RDD and ultimately to the data on disk. Spark achieves fault tolerance through RDDs without writing the result of every operation back to disk.
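A minimal Scala sketch of this behavior (illustrative only; the file paths and local-mode settings are assumptions): data is read from disk once, each transformation yields a new immutable RDD whose lineage traces back to the file, and storage is touched again only when the result is persisted.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("rdd-lineage")
  .master("local[*]")        // assumption: local mode for the sketch
  .getOrCreate()
val sc = spark.sparkContext

val lines  = sc.textFile("data/receivables.csv") // parent RDD: data on disk
val fields = lines.map(_.split(","))             // new RDD, lineage -> lines
val bad    = fields.filter(_.length < 3)         // new RDD, lineage -> fields

println(bad.toDebugString)      // prints the lineage chain back to the file
bad.saveAsTextFile("out/bad")   // only here does Spark touch storage again
spark.stop()
```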
Wherein, the stream processing analysis unit 33 further includes the following: data is processed using Flink's stream processing model, which treats each incoming item as part of a true data stream; the DataStream API provided by Flink can be used to process unbounded data streams.
Further, stream processing capability is provided by Spark Streaming. Spark itself is designed mainly for batch workloads; to make up for the gap between the engine design and the characteristics of stream workloads, Spark implements a concept called micro-batching. In terms of concrete strategy, this technique treats the data stream as a series of very small "batches" that can be handled through the native semantics of the batch engine.
Spark Streaming buffers the stream in sub-second increments, and these buffers are then batch-processed as small fixed data sets. This approach works very well in practice, but it still falls short of true stream processing frameworks in performance. The main reason to use Spark rather than Hadoop MapReduce is speed: with the help of the in-memory computing strategy and advanced DAG scheduling, Spark can process the same data set faster. Another important advantage of Spark is diversity: the product can be deployed as an independent cluster or integrated with an existing Hadoop cluster, it can run both batch and stream processing, and a single cluster can handle different types of tasks.
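A minimal sketch of the micro-batch idea (illustrative only; the host, port and one-second interval are assumptions): the stream is buffered in fixed increments and each buffer is processed as a small fixed batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1)) // batch interval = buffer size

val lines  = ssc.socketTextStream("localhost", 9999) // incoming text stream
val counts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // runs once per micro-batch, with batch semantics

counts.print()          // one small result set per one-second buffer
ssc.start()
ssc.awaitTermination()
```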
Besides the engine's own capabilities, an ecosystem of libraries has been built around Spark, providing better support for tasks such as machine learning and interactive queries. Spark tasks are widely held to be easier to write than MapReduce ones, so Spark can greatly improve productivity.
Because a batch-oriented method is applied to stream processing, data entering the system must be buffered. The buffering mechanism lets the technique handle a very large volume of incoming data and improves overall throughput, but waiting for the buffer to empty also increases latency, which means Spark Streaming may not be suitable for workloads with strict latency requirements. Since memory is generally more expensive than disk space, Spark costs more than disk-based systems; however, faster processing means tasks finish sooner, which can offset the increased cost in environments where resources are paid for by the hour.
Another consequence of Spark's in-memory computation design is that insufficient resources may be encountered when it is deployed in a shared cluster. Compared with Hadoop MapReduce, Spark consumes more resources and may affect other tasks that need to use the cluster at the same time; by its nature, Spark is less suited to coexisting with the other components of a Hadoop stack.
Spark is the best choice for diversified workload processing tasks. Its batch processing capability provides an unparalleled speed advantage at the cost of higher memory usage, and for workloads that value throughput over latency, Spark Streaming is well suited as a streaming solution.
Apache Flink is a stream processing framework that can also handle batch tasks. The technique treats batch data as a data stream with finite boundaries, so batch tasks are processed as a subset of stream processing. This stream-first approach, also known as the Kappa architecture, is the opposite of the better-known Lambda architecture (which uses batch processing as the main method and streams as a supplement providing early, unrefined results); in the Kappa architecture, everything feasible is treated as a stream once the streaming engines have matured, which simplifies the model.
Further, the data analysis module 30 further includes a stream processing model (not shown in the figure) and a batch processing model (not shown in the figure):
a stream processing model: Flink's stream processing model treats each item as a true data stream when handling incoming data; the DataStream API provided by Flink can be used to process unbounded data streams, and the basic components Flink works with include the following (a minimal sketch follows the list):
stream refers to a borderless set of data moving constantly through the system;
operator refers to a function that performs operations on a data stream to produce other data streams;
source refers to the entry point of the data stream into the system;
sink refers to the location where the data stream arrives after leaving the Flink system; the sink may be a database or a connector to another system;
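The minimal sketch referred to above, mapping the four components onto Flink's Scala DataStream API (the host and port are placeholders, not from the patent):

```scala
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// source: entry point of the data stream into the system
val source: DataStream[String] = env.socketTextStream("localhost", 9999)

// operators: functions producing new streams from the incoming stream
val counts = source
  .flatMap(_.split(" "))
  .map((_, 1))
  .keyBy(_._1)
  .sum(1)

// sink: where the stream goes after leaving Flink (here, stdout)
counts.print()

env.execute("stream-components")
```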
In order to recover after a problem is encountered during computation, stream processing tasks create snapshots at preset points in time. For state storage, Flink can work with a variety of state backend systems, depending on the complexity and persistence level required. Furthermore, Flink's stream processing capability understands the concept of "event time", the time at which an event actually occurred, and it can also handle sessions, which means that execution order and grouping can be ensured in some interesting ways.
Batch model: Flink's batch model is largely just an extension of the stream processing model. Instead of reading from a persistent stream, the model reads a bounded data set from persistent storage in the form of a stream, and Flink uses exactly the same runtime for both processing models.
Flink can apply certain optimizations to batch workloads. For example, since batch operations are backed by persistent storage, Flink can dispense with snapshots for batch workloads; the data can still be restored, while regular processing operations run faster. Another optimization is to decompose the batch task so that different stages and components are invoked only when needed, which lets Flink coexist better with other users of the cluster; analyzing the task in advance lets Flink view all the operations to be performed, the size of the data set, and the operation steps to be carried out downstream, thereby achieving further optimization.
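For contrast with the stream sketch above, a minimal sketch of the batch model reading a bounded data set from persistent storage (Flink's DataSet API; the file path and record layout are assumptions):

```scala
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// A bounded data set read from persistent storage, processed by the same
// runtime that handles unbounded streams.
val records: DataSet[String] = env.readTextFile("data/receivables.csv")

val totals = records
  .map(_.split(","))
  .map(f => (f(0), f(2).toDouble)) // assumed layout: debtor, date, amount
  .groupBy(0)
  .sum(1)

totals.print() // for DataSet jobs, print() also triggers execution
```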
Flink currently occupies a unique position in the processing framework field. Although Spark can also perform batch and stream processing, Spark's micro-batch architecture makes it unsuitable for many stream processing cases, whereas Flink's stream-first approach provides low latency, high throughput, and true item-by-item processing.
Many of Flink's components are self-managed. Although this is rare, for performance reasons the technology manages its own memory instead of relying on the native Java garbage-collection mechanism. Unlike Spark, Flink does not need manual optimization and tuning after the characteristics of the data to be processed change, and it likewise handles operations such as data partitioning and automatic caching by itself.
Flink optimizes work in several ways. Part of its analysis is similar to the way an SQL query planner optimizes a relational database, determining the most efficient implementation for a specific task. The technology also supports multi-stage parallel execution while gathering the data of blocking tasks together, and for iterative tasks, for performance reasons, Flink tries to execute the corresponding computation on the nodes where the data is stored; in addition, "incremental iterations" can be performed, iterating only over the portions of the data that have changed.
On the user-tool side, Flink provides a web-based scheduling view through which tasks can easily be managed and system status viewed. Users can also view the optimization plan for a submitted task to learn how the task will actually be carried out in the cluster. For analysis-class tasks, Flink provides SQL-like queries, graph processing and a machine learning library, and in-memory computation is also supported.
Flink works well with other components. If used with a Hadoop stack, it integrates well into the overall environment, occupying only the necessary resources at any time; it integrates easily with YARN, HDFS and Kafka, and with the help of compatibility packages it can also run tasks written for other processing frameworks, such as Hadoop and Storm.
Flink provides low-latency stream processing while also supporting traditional batch tasks, and is perhaps best suited to organizations with very high stream processing requirements and a small number of batch tasks. It is compatible with native Storm and Hadoop programs and can run on YARN-managed clusters, so it can be evaluated easily.
Specifically, as shown in fig. 2, the present invention also discloses an automatic data analysis method, which includes the following steps:
step one, connecting an external application interface and receiving data information;
step two, storing, transmitting and checking the data information;
step three, analyzing and screening the verified data;
and step four, uploading the analyzed and processed data to a server to generate final data.
In the first step, the method connects an API (application program interface) of a partner to acquire accounts receivable data; the second step further includes the following: storing the receivables data, the structured metadata and the file data blobs, performing combinational business logic, sequential business logic and decentralized processing, and exchanging information among the file data blob store, the metadata store and the query queue; the third step further includes the following: processing the data entirely in memory, wherein the data is read into memory only at the beginning, interaction with the storage layer occurs only when the final result is persistently stored, and the processing results of all intermediate states are kept in memory; and buffering the data, then cutting the buffered data into small fixed data sets for batch processing.
In summary, the invention receives data over the network, records and classifies it completely, analyzes and screens the verified data, and uploads the analyzed and processed data to the server to generate final data. The efficiency of data analysis is greatly improved, the error rate is reduced to a minimum, redundant data is cleaned, the business trend is fully revealed, the true value is closely approached, and confidentiality is good.
The technical contents of the present invention are further illustrated by the examples only for the convenience of the reader, but the embodiments of the present invention are not limited thereto, and any technical extension or re-creation based on the present invention is protected by the present invention. The protection scope of the invention is subject to the claims.

Claims (9)

1. An automatic data analysis system, comprising:
the data acquisition module is used for connecting an external application interface and receiving data information;
the data storage module is used for storing, processing and communicating data information;
the data analysis module is used for analyzing and screening the data information to obtain final data;
the data storage module includes:
the storage unit is used for storing the receivables data, the structured metadata and the file data blobs;
the processing unit is used for performing combinational business logic, sequential business logic and decentralized processing;
the communication unit is used for exchanging information among the file data blob store, the metadata store and the query queue;
wherein the storage unit further includes the following:
the receivables data store, wherein a receivable is a unit of stored value, namely an account receivable, air miles, or a digital art copyright; the main operations on the receivable storage system are to issue and transfer receivables while preventing problems such as double-spending; some receivable networks do not support the network by means of internal incentives, and instead the receivable serves as the incentive in a higher-level network whose lower-level infrastructure stores it;
the structured metadata store, which specifically stores structured metadata, including tables, document stores, key-value stores, time series or graphs, the data then being retrieved quickly by means of a query.
2. The system according to claim 1, wherein the data collection module is configured to access a financial application interface of a partner, and is responsible for maintenance of agent-side configuration, reception and forwarding of monitoring data, collection of network device data, and monitoring of port health status.
3. The system of claim 2, wherein the data collection module comprises:
the agent unit is used for acquiring data of host hardware, the operating system, middleware and the service system through built-in metrics and custom plug-ins, and for asynchronously sending the data to the repeater unit over a long-lived TCP connection;
the net-collect unit is used for collecting various performance indexes of the network equipment, including the number of bytes received and sent, the number of data packets, and the number of errors on each interface; monitoring data is sent to the repeater unit over a long-lived TCP connection, and configuration and interface information is sent to the cfc unit;
the cfc unit, deployed in a data center, directly connected to MySQL, and responsible for maintaining the metric information and plug-ins synchronized by agent or net-collect;
the cfc-proxy unit, deployed in a branch or remote machine room, serving as the communication bridge between agent/net-collect and cfc;
and the repeater unit, which can be deployed anywhere, is responsible for receiving time-series data and forwarding it to a specified back end, and supports repeater->repeater, repeater->openTSDB and repeater->Redis.
4. The system of claim 1, wherein the data analysis module comprises:
the hybrid processing analysis unit is used for processing the workloads of the batch processing analysis unit and the stream processing analysis unit;
the batch processing analysis unit is used for completely processing the data in the memory, reading the data into the memory only at the beginning, interacting with the storage layer when the final result is persistently stored, and storing the processing results of all intermediate states in the memory;
and the stream processing analysis unit is used for buffering the data, and then cutting the buffered data into small fixed data sets for batch processing.
5. The automatic data analysis system of claim 4, wherein the hybrid processing analysis unit further comprises the following: the same or related components and APIs are used to process both types of data, thereby simplifying the differing processing requirements; implemented with Spark and Flink, the unit combines the two different processing modes and makes assumptions about the relationship between fixed and unfixed data sets, providing not only the methods needed to process the data but also integrated items, libraries and tools that can be used for graph analysis, machine learning and interactive queries.
6. The system of claim 4, wherein the batch processing analysis unit further comprises the following: data is processed using the RDD (Resilient Distributed Dataset) model, which resides only in memory and has an immutable structure; new RDDs can be generated by operations performed on existing RDDs, and each RDD can trace back through its lineage to a parent RDD and ultimately to the data on disk.
7. The system of claim 4, wherein the stream processing analysis unit further comprises the following: data is processed using the stream processing model of Flink, which treats each incoming item as part of a true data stream; the DataStream API provided by Flink can be used to process unbounded data streams.
8. An automatic data analysis method is characterized by comprising the following steps:
step one, connecting an external application interface and receiving data information;
step two, storing, transmitting and checking the data information;
step three, analyzing and screening the verified data;
and step four, uploading the analyzed and processed data to a server to generate final data;
wherein the method further includes the following:
the receivables data store, wherein a receivable is a unit of stored value, namely an account receivable, air miles, or a digital art copyright; the main operations on the receivable storage system are to issue and transfer receivables while preventing problems such as double-spending; some receivable networks do not support the network by means of internal incentives, and instead the receivable serves as the incentive in a higher-level network whose lower-level infrastructure stores it;
the structured metadata store, which specifically stores structured metadata, including tables, document stores, key-value stores, time series or graphs, the data then being retrieved quickly by means of a query.
9. The method according to claim 8, wherein in the first step, the accounts receivable data is acquired by connecting to an API interface of a partner; the second step further includes the following: storing the receivables data, the structured metadata and the file data blobs, performing combinational business logic, sequential business logic and decentralized processing, and exchanging information among the file data blob store, the metadata store and the query queue; the third step further includes the following: processing the data entirely in memory, wherein the data is read into memory only at the beginning, interaction with the storage layer occurs only when the final result is persistently stored, and the processing results of all intermediate states are kept in memory; and buffering the data, then cutting the buffered data into small fixed data sets for batch processing.
CN201810419994.9A 2018-05-04 2018-05-04 Automatic data analysis system and method thereof Active CN108681569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810419994.9A CN108681569B (en) 2018-05-04 2018-05-04 Automatic data analysis system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810419994.9A CN108681569B (en) 2018-05-04 2018-05-04 Automatic data analysis system and method thereof

Publications (2)

Publication Number Publication Date
CN108681569A CN108681569A (en) 2018-10-19
CN108681569B true CN108681569B (en) 2021-11-02

Family

ID=63802879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810419994.9A Active CN108681569B (en) 2018-05-04 2018-05-04 Automatic data analysis system and method thereof

Country Status (1)

Country Link
CN (1) CN108681569B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW202016743A (en) 2018-10-25 2020-05-01 財團法人資訊工業策進會 Data processing apparatus and data processing method for internet of things system
CN109710680A (en) * 2018-12-29 2019-05-03 杭州趣链科技有限公司 A kind of block chain data processing engine and operating method
CN110209646A (en) * 2019-05-14 2019-09-06 汇通达网络股份有限公司 A kind of data platform system calculated based on real-time streaming
CN110659304B (en) * 2019-09-09 2023-06-16 杭州中科先进技术研究院有限公司 Multi-path data stream connection system based on data inclination
CN110908991A (en) * 2019-11-27 2020-03-24 苏州骐越信息科技有限公司 Enterprise data deleting and selecting system
CN111128308B (en) * 2019-12-26 2023-03-24 上海市精神卫生中心(上海市心理咨询培训中心) New mutation information knowledge platform for neuropsychiatric diseases
CN111127196A (en) * 2019-12-31 2020-05-08 中信百信银行股份有限公司 Credit wind control characteristic variable management method and system
CN111414159B (en) * 2020-03-16 2023-07-25 北京艾鸥科技有限公司 Block chain virtual machine device, virtual machine creation method and transaction method
CN112068933B (en) * 2020-09-02 2021-08-10 成都鱼泡科技有限公司 Real-time distributed data monitoring method
CN112184448A (en) * 2020-09-30 2021-01-05 上海旺链信息科技有限公司 Block chain-based self-organizing trusted incentive processing method, system and storage medium
CN115106682A (en) * 2021-03-19 2022-09-27 润智科技有限公司 Full-automatic welding machine data interaction method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0833285A2 (en) * 1996-09-27 1998-04-01 Xerox Corporation Method and product for generating electronic tokens
US7596530B1 (en) * 2008-09-23 2009-09-29 Marcelo Glasberg Method for internet payments for content
CN106296184A (en) * 2015-06-05 2017-01-04 地气股份有限公司 Electronic money management method and electronic-monetary system
CN106777243A (en) * 2016-12-27 2017-05-31 浪潮软件集团有限公司 Dynamic modeling of streaming data analysis
CN107846690A (en) * 2017-10-19 2018-03-27 湖北工业大学 Collaboration communication dynamic bargain motivational techniques under a kind of independent asymmetrical information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105163349B (en) * 2015-08-03 2016-05-04 东南大学 A kind of multiple gateway Wireless Mesh network implementation method based on IEEE802.11s
CN105701161A (en) * 2015-12-31 2016-06-22 深圳先进技术研究院 Real-time big data user label system
US10275278B2 (en) * 2016-09-14 2019-04-30 Salesforce.Com, Inc. Stream processing task deployment using precompiled libraries
CN106549829B (en) * 2016-10-28 2019-11-12 北方工业大学 Big data computing platform monitoring system and method
CN107395669B (en) * 2017-06-01 2020-04-07 华南理工大学 Data acquisition method and system based on streaming real-time distributed big data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0833285A2 (en) * 1996-09-27 1998-04-01 Xerox Corporation Method and product for generating electronic tokens
US7596530B1 (en) * 2008-09-23 2009-09-29 Marcelo Glasberg Method for internet payments for content
CN106296184A (en) * 2015-06-05 2017-01-04 地气股份有限公司 Electronic money management method and electronic-monetary system
CN106777243A (en) * 2016-12-27 2017-05-31 浪潮软件集团有限公司 Dynamic modeling of streaming data analysis
CN107846690A (en) * 2017-10-19 2018-03-27 湖北工业大学 Collaboration communication dynamic bargain motivational techniques under a kind of independent asymmetrical information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Computer aided engineering and project performance: managing a double-edged sword; T.J. Allen et al.; IEEE International Conference on Engineering Management, Gaining the Competitive Advantage; 2002-08-06; 1-8 *
Dynamic management of the credit policy of construction enterprise A based on accounting knowledge flow; Cheng Ping et al.; Commercial Accounting; 2014-10-10 (No. 19); 6-9 *

Also Published As

Publication number Publication date
CN108681569A (en) 2018-10-19

Similar Documents

Publication Publication Date Title
CN108681569B (en) Automatic data analysis system and method thereof
US10447772B2 (en) Managed function execution for processing data streams in real time
US10769148B1 (en) Relocating data sharing operations for query processing
Narkhede et al. Kafka: the definitive guide: real-time data and stream processing at scale
US10713247B2 (en) Executing queries for structured data and not-structured data
US10528599B1 (en) Tiered data processing for distributed data
Grover et al. Data Ingestion in AsterixDB.
US20140122559A1 (en) Runtime grouping of tuples in a streaming application
US11727004B2 (en) Context dependent execution time prediction for redirecting queries
US9426219B1 (en) Efficient multi-part upload for a data warehouse
Kotenko et al. Aggregation of elastic stack instruments for collecting, storing and processing of security information and events
CN107103064B (en) Data statistical method and device
US9992269B1 (en) Distributed complex event processing
CN108228322B (en) Distributed link tracking and analyzing method, server and global scheduler
Ahuja et al. State of big data analysis in the cloud
Roth et al. Pgx. d/async: A scalable distributed graph pattern matching engine
Senger et al. BSP cost and scalability analysis for MapReduce operations
Theeten et al. Chive: Bandwidth optimized continuous querying in distributed clouds
CN109829094A (en) Distributed reptile system
Panda et al. High-performance big data computing
Chen et al. Towards low-latency big data infrastructure at sangfor
Zhou et al. Cloudview: describe and maintain resource view in cloud
US20230195510A1 (en) Parallel execution of stateful black box operators
US11782983B1 (en) Expanded character encoding to enhance regular expression filter capabilities
Rachuri et al. Optimizing Near-Data Processing for Spark

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant