CN108076111B

CN108076111B - System and method for distributing data in big data platform

Info

Publication number: CN108076111B
Application number: CN201611029700.9A
Authority: CN
Inventors: 周伟; 俞力; 赵贵阳; 周春楠
Original assignee: Yiyang Safety Technology Co ltd
Current assignee: Yiyang Safety Technology Co ltd
Priority date: 2016-11-15
Filing date: 2016-11-15
Publication date: 2021-07-09
Anticipated expiration: 2036-11-15
Also published as: CN108076111A

Abstract

The system and method for distributing data in a big data platform build a big data distribution unit on asynchronous I/O for high-speed distribution of big data, and separate the server threads from the client threads to improve the throughput of data distribution. A multi-dimensional structure storage unit keeps the data fully structured, and a big data management center and a data bus module ensure that data distribution is accurate and correct, so that all parts can run at high speed without waiting on one another for resources and can make full use of those resources. The system as a whole is also flexible.

Description

System and method for distributing data in big data platform

Technical Field

The invention relates to the field of data security, in particular to a method and a system for distributing data in a big data platform.

Background

In the prior art, big data platforms based on the Hadoop architecture offer high scalability, high reliability and high fault tolerance. At present, in-memory databases (Memory DB), non-relational databases (NoSQL) and caching (Cache) are widely used for large volumes of data queries and data flows, and good progress has been made. However, in actual big data service processing, such as Wireless Application Protocol (WAP) Internet logs, large-user e-mail systems, blog log analysis, and user information tracking and analysis, current big data platforms are deficient in how they handle data I/O. For unstructured, semi-structured and large-volume services in particular, I/O processing speed is a serious problem, which shows mainly in the following:

1. with large data volumes, especially when large volumes of data are written continuously, I/O performance is slow, and the I/O speed-up ratio does not scale linearly with the number of server nodes;

2. the processing of unstructured and semi-structured data such as logs, blogs, video and social relationship information is not optimized for the storage type and characteristics of big data, so the processing speed is slow;

3. the use of multi-service synchronous writing leads to long synchronization times when the state of the network, the storage devices and so on is uncertain, and to more time being spent on ensuring data consistency.

Therefore, improving high-speed data distribution and I/O operations is the most important goal in improving the performance of big data platforms.

Disclosure of Invention

The purpose of the invention is realized by the following technical scheme.

According to an embodiment of the present invention, a system for distributing data in a big data platform is provided. The system specifically includes a big data distribution unit, a data bus module, a management center and a big data adaptation module, wherein:

the big data distribution unit is used for receiving data sent by a plurality of clients and storing the data in its own cache; acquiring a data distribution rule from the data bus module and performing distribution processing on the cached data; and then distributing the data to a target server;

the data bus module is a distributed publish-subscribe messaging system and is used for persistently storing data distribution rules so that the big data distribution units can read them concurrently;

the management center formulates a data distribution rule according to the resource information and sends the data distribution rule to the data bus module for storage;

and the big data adaptation module is used for collecting resource information of the big data distribution unit and the target server and sending the resource information to the management center.

Preferably, the big data distribution unit specifically includes:

the load balancing module is used for receiving data sent by a plurality of clients and caching the data in the respective caches of the asynchronous servers by using a load balancing algorithm;

the asynchronous server is used for caching the data balanced by the load balancing module; it is also used for acquiring data distribution rules from the data bus module, performing structure reconstruction on the data in its own cache according to a distribution rule, generating new multi-dimensional structure data containing distribution resource information, and storing the new multi-dimensional structure data in a multi-dimensional structure storage unit;

the multidimensional structure storage unit is used for storing multidimensional structure data generated by the reconstruction data of the plurality of asynchronous servers;

and the asynchronous client is used for acquiring the data in the multi-dimensional structure storage unit, finding the resource information of the distribution target server and distributing the data in the multi-dimensional structure storage unit to the target server.

In particular, the asynchronous server and the asynchronous client interact using an asynchronous I/O mode.

Preferably, the asynchronous server further specifically includes:

the configuration module is used for configuring the service provided by the asynchronous server and sending the configuration information to the management center;

the data acquisition module is used for receiving the data balanced by the load balancing module and storing the data in a cache of the data acquisition module;

the distribution rule acquisition module is used for acquiring a data distribution rule from the data bus module;

the structure reconstruction module is used for applying the distribution rule to the data in its own cache, forming a distribution packet comprising a source address and port, a destination address and port, a connection protocol and a data part, and loading the distribution packet into a contiguous double-ended data queue, that is, a queue that supports insertion and deletion at both the head end and the tail end;

and the asynchronous client response module is used for responding to the request of the asynchronous client.

Preferably, the asynchronous client further specifically includes:

and the data distribution module is used for acquiring the data in the multi-dimensional structure storage unit through the asynchronous server, finding out the resource information of the distribution target server and distributing the data in the multi-dimensional structure storage unit to the target server.

The detection module asynchronously waits for the operation completion signal of the target server and detects the signal; if the detection signal shows that the data distribution is successful, deleting the distributed data part from the multidimensional structure storage unit; and if the detection signal shows that the data distribution fails, calling the data distribution module to retransmit the data.

According to another embodiment of the present invention, there is also provided a method performed by the above system for distributing data in a big data platform, the method including the steps of:

the big data distribution unit configures the provided service and sends configuration information to the management center;

the big data adaptation module collects configuration information of the big data distribution unit and resource information of the target server and sends the configuration information and the resource information to the management center;

the management center formulates a data distribution rule according to the configuration information and the resource information, and sends the data distribution rule to the data bus module for storage; the data bus module is a distributed publish-subscribe messaging system and is used for persistently storing data distribution rules so that the big data distribution units can read the rules concurrently;

the big data distribution unit receives data sent by a plurality of clients and stores the data in a cache;

the big data distribution unit acquires a data distribution rule from the data bus module and distributes and processes data in the cache;

the big data distribution unit distributes data to the target server.

Preferably, the big data distribution unit receives data sent by the plurality of clients through the plurality of load balancing modules, and caches the data in the cache of each asynchronous server by using a load balancing algorithm.

Preferably, the big data distribution unit acquires the data distribution rule from the data bus module through the asynchronous server; performing structure reconstruction on the data in the cache according to a distribution rule to generate new multi-dimensional structure data containing distribution resource information, and storing the new multi-dimensional structure data in a multi-dimensional structure storage unit; the big data distribution unit obtains the data in the multi-dimensional structure storage unit through the asynchronous client, finds the resource information of the distribution target server, and distributes the data in the multi-dimensional structure storage unit to the target server.

The asynchronous server performing structure reconstruction on the data in its own cache to generate new multi-dimensional structure data specifically includes:

the asynchronous server applies the distribution rule to the data in its own cache, forms a distribution packet comprising a source address and port, a destination address and port, a connection protocol and a data part, and loads the packet into a contiguous double-ended data queue, that is, a queue that supports insertion and deletion at both the head end and the tail end.

The asynchronous server and the asynchronous client interact using an asynchronous I/O mode.

Preferably, after the asynchronous client distributes the data to the target server, the method further includes:

the asynchronous client asynchronously waits for an operation completion signal of the target server and detects the signal;

if the detection signal shows that the data distribution is successful, deleting the distributed data part from the multi-dimensional structure storage unit;

and if the detection signal shows that the data distribution fails, repeatedly distributing the data to the target server.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a schematic diagram of a system for distributing data in a big data platform according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a big data distribution unit according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an asynchronous server architecture according to an embodiment of the present invention;

FIG. 4 shows a flow diagram of a method for distributing data in a big data platform, according to another embodiment of the invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

According to an embodiment of the present invention, a system for distributing data in a big data platform is provided. As shown in FIG. 1, the system specifically includes a big data distribution unit M101, a data bus module M102, a management center M103 and a big data adaptation module M104, wherein:

the big data distribution unit is used for receiving mass data sent by a plurality of clients and storing the mass data in its own cache; acquiring a data distribution rule from the data bus module and performing distribution processing on the cached data; and then distributing the data to a target server;

the data bus module is a distributed publish-subscribe messaging system and is used for persistently storing data distribution rules so that the big data distribution units can read them efficiently and concurrently at any time;

the big data adaptation module is used for collecting resource information of the big data distribution unit and the target server and sending the resource information to the management center; the resource information at least comprises a connection protocol, an IP address and a port.

Preferably, as shown in fig. 2, the big data distribution unit specifically includes:

the load balancing module is used for receiving mass data sent by a plurality of clients and caching the data in the respective caches of the asynchronous servers by using a load balancing algorithm;

Preferably, as shown in fig. 3, the asynchronous server further specifically includes:

the structure reconstruction module is used for applying the distribution rule to the data in its own cache, forming a distribution packet comprising a source address and port, a destination address and port, a connection protocol and a data part, and loading the distribution packet into a contiguous double-ended data queue, that is, a queue that supports efficient insertion and deletion at both the head end and the tail end;

Preferably, the asynchronous client further specifically includes:

According to another embodiment of the present invention, there is also provided a method for distributing data in a big data platform, performed by the above system, as shown in fig. 4, the method including the steps of:

the big data adaptation module collects configuration information of the big data distribution unit and resource information of the target server and sends the configuration information and the resource information to the management center; the resource information at least comprises a connection protocol, an IP address and a port;

the management center formulates a data distribution rule according to the configuration information and the resource information, and sends the data distribution rule to the data bus module for storage; the data bus module is a distributed publish-subscribe messaging system and is used for persistently storing data distribution rules so that the big data distribution units can read them efficiently and concurrently at any time, which ensures the consistency and correctness of the distribution rules.

For example, the management center first "formats" the resource information of the target server. The formatted data has the form connection protocol://user name:password@host name@ip address:port, for example ssh://zhangsan:123456@localhost@127.0.0.1:22. The management center then arranges the formatted resource information into a distribution rule according to the agreed convention. The specific form is connection protocol://user name:password@gateway name@gateway ip address:source port->connection protocol://user name:password@destination host name@destination ip address:destination port, for example: {[ssh://zhangsan:123456@host1@192.168.0.1:22->ssh://lisi:654321@host2@192.168.0.2:8022][…][…]…}.
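The rule format described above can be illustrated with a minimal parsing sketch. It assumes that every endpoint follows the stated pattern with a colon before the port; the names Endpoint, parse_endpoint and parse_rule are hypothetical and are not part of the described system.

```python
import re
from typing import NamedTuple, Tuple


class Endpoint(NamedTuple):
    """One side of a distribution rule (hypothetical helper type)."""
    protocol: str
    user: str
    password: str
    host: str
    ip: str
    port: int


# Assumed pattern: protocol://user:password@host@ip:port
_ENDPOINT = re.compile(
    r"(?P<protocol>\w+)://(?P<user>[^:@]+):(?P<password>[^@]+)"
    r"@(?P<host>[^@]+)@(?P<ip>[\d.]+):(?P<port>\d+)"
)


def parse_endpoint(text: str) -> Endpoint:
    """Parse a single formatted resource string into its fields."""
    match = _ENDPOINT.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"malformed endpoint: {text!r}")
    fields = match.groupdict()
    return Endpoint(
        fields["protocol"], fields["user"], fields["password"],
        fields["host"], fields["ip"], int(fields["port"]),
    )


def parse_rule(rule: str) -> Tuple[Endpoint, Endpoint]:
    """Split a 'source->destination' rule and parse both sides."""
    source, arrow, destination = rule.partition("->")
    if not arrow:
        raise ValueError(f"rule has no '->' separator: {rule!r}")
    return parse_endpoint(source), parse_endpoint(destination)


source, destination = parse_rule(
    "ssh://zhangsan:123456@host1@192.168.0.1:22"
    "->ssh://lisi:654321@host2@192.168.0.2:8022"
)
```

Parsing the rule once into named fields keeps the later packet-building step from having to re-split strings for every piece of cached data.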

The big data distribution unit receives mass data sent by a plurality of clients and stores the mass data in a cache;

the big data distribution unit distributes data to the target server.

Preferably, the big data distribution unit receives mass data sent by a plurality of clients through a plurality of load balancing modules, and caches the data in the cache of each asynchronous server by using a load balancing algorithm;

the asynchronous server applies the distribution rule to the data in its own cache, forms a distribution packet comprising a source address and port, a destination address and port, a connection protocol and a data part, and loads the packet into a contiguous double-ended data queue, that is, a queue that supports efficient insertion and deletion at both the head end and the tail end. The asynchronous server and the asynchronous client interact using an asynchronous I/O mode.

The following describes in detail a specific implementation of the core part of the present application, i.e. the asynchronous I/O part. The specific implementation manner of the asynchronous I/O part is an asynchronous processing process, and specifically includes:

The load balancing modules receive the data sent by the clients through a Linux Virtual Server (LVS) cluster and send the data to the asynchronous servers by using a load balancing technology. The load balancing technologies include DNS load balancing, HTTP load balancing, IP load balancing, link-layer load balancing and hybrid load balancing.
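The description does not fix a particular load balancing algorithm. As one possible illustration, the sketch below spreads incoming messages over the caches of several asynchronous servers in round-robin order; the function name and the use of plain lists as caches are assumptions made for the example.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def balance_round_robin(
    messages: Iterable[bytes], server_names: List[str]
) -> Dict[str, List[bytes]]:
    """Assign each message to the next server's cache in turn."""
    if not server_names:
        raise ValueError("at least one server is required")
    # One list per server stands in for that server's own cache.
    caches: Dict[str, List[bytes]] = {name: [] for name in server_names}
    targets = cycle(server_names)
    for message in messages:
        caches[next(targets)].append(message)
    return caches


caches = balance_round_robin([b"a", b"b", b"c"], ["server-1", "server-2"])
```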

The asynchronous I/O uses the asyncio asynchronous module provided by the Python language. The asynchronous server creates an asynchronous service by overriding the service API with get_event_loop from the asyncio module, listens on its own port, and receives the data sent by the LVS in a loop by calling the run_until_complete method.

After receiving the data, the asynchronous server calls a background method that overrides the API and writes the data into its own cache.

The asynchronous server then feeds back a reception-complete signal to the LVS.
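The receive, cache and acknowledge steps above can be sketched with the standard asyncio streams API. This is a minimal sketch under stated assumptions, not the implementation of the described system: it uses asyncio.run and asyncio.start_server rather than calling get_event_loop and run_until_complete directly, a plain list stands in for the server's cache, the b"ACK" reply is an invented completion signal, and a local test client plays the role of the LVS.

```python
import asyncio
from typing import List, Tuple


async def handle_connection(
    reader: asyncio.StreamReader,
    writer: asyncio.StreamWriter,
    cache: List[bytes],
) -> None:
    """Read one payload, store it in the cache, then acknowledge it."""
    data = await reader.read()      # read until the sender signals EOF
    cache.append(data)              # write the data into the server's cache
    writer.write(b"ACK")            # hypothetical reception-complete signal
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def demo(payload: bytes) -> Tuple[List[bytes], bytes]:
    """Start a server on a free local port and send it one payload."""
    cache: List[bytes] = []

    async def on_connect(reader, writer):
        await handle_connection(reader, writer, cache)

    server = await asyncio.start_server(on_connect, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(payload)
        writer.write_eof()          # tell the server the payload is complete
        reply = await reader.read()
        writer.close()
        await writer.wait_closed()
    return cache, reply


cache, reply = asyncio.run(demo(b"log line"))
```

Because the handler is a coroutine, the event loop can accept further connections while one connection is still waiting on the network.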

The asynchronous server reconstructs the data in its own cache into a data structure that is convenient to call in real time. The deque data structure provided by the Python language is used here; it is an advanced structure that performs well and has desirable properties such as deadlock prevention.

The management center collects the information of the target server through the big data adaptation module and converts it into a set of distribution rules that the asynchronous server can interpret. The rules can use JSON format, dict format or XML format, and are serialized with the Protocol Buffers (protobuf) technology provided by Google before being stored on a bus built on the Kafka framework.

The asynchronous server uses Kafka to read and parse the distribution rules on the bus and saves the rules into its cache.
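The publish and read cycle can be illustrated without a running Kafka cluster. The sketch below is an in-memory stand-in, not Kafka itself: the InMemoryRuleBus class is hypothetical, JSON from the standard library replaces protobuf serialization, and an integer offset mimics the way a consumer tracks its position in a persistent log.

```python
import json
import threading
from typing import Dict, List, Tuple


class InMemoryRuleBus:
    """Hypothetical append-only rule log standing in for a Kafka topic."""

    def __init__(self) -> None:
        self._log: List[bytes] = []
        self._lock = threading.Lock()   # allows concurrent readers and writers

    def publish(self, rule: Dict[str, str]) -> int:
        """Serialize a rule, append it, and return its offset."""
        record = json.dumps(rule, sort_keys=True).encode("utf-8")
        with self._lock:
            self._log.append(record)
            return len(self._log) - 1

    def read_from(self, offset: int) -> Tuple[List[Dict[str, str]], int]:
        """Return all rules at or after offset, plus the next offset to poll."""
        with self._lock:
            records = self._log[offset:]
            next_offset = len(self._log)
        return [json.loads(record) for record in records], next_offset


bus = InMemoryRuleBus()
bus.publish({
    "source": "ssh://zhangsan:123456@host1@192.168.0.1:22",
    "destination": "ssh://lisi:654321@host2@192.168.0.2:8022",
})
rules, next_offset = bus.read_from(0)           # first poll sees the rule
no_new_rules, same_offset = bus.read_from(next_offset)  # nothing new yet
```

Records are never removed on read, so several distribution units can poll the same log independently, each with its own offset.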

The asynchronous server applies the distribution rule to the data in the deque structure; further dimensions and depth can be generated in the deque according to the distribution rule.
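A distribution packet with the fields named in this description (source address and port, destination address and port, connection protocol, data part) can be sketched with collections.deque as follows. The DistributionPacket class and the rebuild function are hypothetical names, and the addresses are taken from the rule example given earlier.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Tuple


@dataclass(frozen=True)
class DistributionPacket:
    """Cached data wrapped with the routing fields from a distribution rule."""
    source: Tuple[str, int]
    destination: Tuple[str, int]
    protocol: str
    data: bytes


def rebuild(
    cache: Iterable[bytes],
    source: Tuple[str, int],
    destination: Tuple[str, int],
    protocol: str,
) -> Deque[DistributionPacket]:
    """Turn raw cached chunks into a double-ended queue of packets."""
    queue: Deque[DistributionPacket] = deque()
    for chunk in cache:
        queue.append(DistributionPacket(source, destination, protocol, chunk))
    return queue


queue = rebuild(
    [b"a", b"b"], ("192.168.0.1", 22), ("192.168.0.2", 8022), "ssh"
)
first = queue.popleft()     # take the oldest packet from the head for sending
queue.appendleft(first)     # put it back at the head, e.g. after a failed send
```

Insertion and removal at either end of a deque take roughly constant time, which is what makes re-queueing a packet at the head inexpensive.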

The asynchronous server uses an asyncio-based asynchronous client to distribute data at high speed according to the rules; optional clients include aiohttp, paramiko and the like.

The asynchronous server uses the async and await keywords to make its functions asynchronous: it first obtains the response asynchronously and then reads the content of the response asynchronously. Requests are initiated using ClientSession as the primary interface. ClientSession allows cookies and related object information to be kept across multiple requests. The session needs to be closed after use, and closing the session is another asynchronous operation, so the async with statement is used each time.

The asynchronous server establishes a client session, uses it to initiate a request, and starts other multiple asynchronous operations. After the asynchronous distribution program runs normally, the asynchronous server side adds other data in the cache into the event loop.
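The session pattern described above can be sketched using only the standard library. StubSession below is a hypothetical stand-in for a real client session such as aiohttp's ClientSession; it implements the asynchronous context manager protocol so that async with closes it, and its post method only records what would have been sent. The target string is an assumption for the example.

```python
import asyncio
from typing import List, Tuple


class StubSession:
    """Hypothetical stand-in for an asynchronous client session."""

    def __init__(self) -> None:
        self.closed = False
        self.sent: List[Tuple[str, bytes]] = []

    async def __aenter__(self) -> "StubSession":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self.closed = True          # closing is itself an awaited operation

    async def post(self, target: str, data: bytes) -> str:
        await asyncio.sleep(0)      # yield to the event loop like real I/O
        self.sent.append((target, data))
        return "ok"


async def distribute(
    session: StubSession, target: str, chunks: List[bytes]
) -> List[str]:
    """Start one request per chunk and wait for all of them together."""
    return await asyncio.gather(
        *(session.post(target, chunk) for chunk in chunks)
    )


async def main() -> Tuple[StubSession, List[str]]:
    async with StubSession() as session:
        results = await distribute(session, "host2:8022", [b"a", b"b", b"c"])
    return session, results


session, results = asyncio.run(main())
```

One session is shared by all requests, and asyncio.gather lets the requests overlap instead of running one after another.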

After the asynchronous server finishes distributing the data, it asynchronously waits for the detection signal sent by the target server and stores the distributed data directly in its own cache; when the completion signal sent by the target server is asynchronously received, it releases the corresponding part of its own cache.

When the asynchronous server receives a signal that the target server failed to receive the data, it reconstructs that part of the data from its own cache, pushes it into the high-speed data structure again and resends it to the target server, repeating the two preceding steps.
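The confirm-then-release behaviour can be sketched as follows. The description repeats the resend until it succeeds; the sketch adds a max_attempts bound as an assumption so that the loop always terminates. The send callback and the flaky_send test double are hypothetical, with a boolean return value standing in for the target server's completion or failure signal.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Deque, Dict, List


async def distribute_with_retry(
    pending: Deque[bytes],
    send: Callable[[bytes], Awaitable[bool]],
    max_attempts: int = 3,
) -> List[bytes]:
    """Send each packet; release it from the queue only once confirmed."""
    undelivered: List[bytes] = []
    while pending:
        packet = pending[0]                 # stays cached until confirmed
        for _ in range(max_attempts):
            if await send(packet):          # True models a completion signal
                break
        else:
            undelivered.append(packet)      # gave up after max_attempts
        pending.popleft()                   # release this part of the cache
    return undelivered


attempts: Dict[bytes, int] = {}


async def flaky_send(packet: bytes) -> bool:
    """Test double: packet b'b' fails on its first attempt only."""
    attempts[packet] = attempts.get(packet, 0) + 1
    return packet != b"b" or attempts[packet] >= 2


pending = deque([b"a", b"b"])
undelivered = asyncio.run(distribute_with_retry(pending, flaky_send))
```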

In the above example, LVS denotes a Linux Virtual Server cluster, which includes: a load balancer, which is responsible for collecting data requests from clients and sending the data to a group of servers for caching; a server pool, which executes the data requests of the clients; and shared storage, which provides data storage for the server pool. This system provides the data source for the asynchronous server.

In the above example, Kafka is used to ensure the consistency of the distribution rules. Kafka is a distributed publish-subscribe messaging system that provides high throughput for both publishing and subscribing; it supports multiple subscribers and automatically rebalances consumers on failure; it persists messages to disk and is therefore suitable both for batch consumption (for example, ETL in data warehousing) and for real-time applications. Kafka is therefore a good technical carrier for providing distribution rules to the asynchronous servers; when the distribution policy is updated, the asynchronous server can poll the Kafka bus regularly to keep the distribution policy consistent.

The data distribution of the invention is based on a big data platform and adopts asynchronous I/O as a technical basis to carry out high-speed distribution, so that the usability of the whole system is greatly increased, and the efficiency is obviously improved under the condition of frequent reading and writing of unstructured data. The implementation method can effectively improve the I/O efficiency of the large data platform on the premise of keeping the consistency and the integrity of the data.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A system for distributing data in a big data platform, specifically comprising: a big data distribution unit, a data bus module, a management center and a big data adaptation module; wherein:

the big data adaptation module is used for collecting configuration information of the big data distribution unit and resource information of the target server and sending the configuration information and the resource information to the management center;

the management center is used for formulating a data distribution rule according to the configuration information and the resource information;

the big data distribution unit specifically comprises:

the asynchronous server further specifically comprises:

the asynchronous client response module is used for responding to the request of the asynchronous client;

2. The system of claim 1, wherein the asynchronous server and the asynchronous client interact using an asynchronous I/O mode.

3. The system of claim 1, wherein the asynchronous client further comprises:

the data distribution module is used for acquiring data in the multi-dimensional structure storage unit through the asynchronous server, finding out resource information of a distribution target server and distributing the data in the multi-dimensional structure storage unit to the target server;

4. A method of distributing data in a big data platform, the method comprising the steps of:

the big data distribution unit distributes data to the target server;

the big data distribution unit receives data sent by a plurality of clients through a plurality of load balancing modules and caches the data in the cache of each asynchronous server by using a load balancing algorithm;

the big data distribution unit acquires a data distribution rule from the data bus module through the asynchronous server; performing structure reconstruction on the data in the cache according to a distribution rule to generate new multi-dimensional structure data containing distribution resource information, and storing the new multi-dimensional structure data in a multi-dimensional structure storage unit; the big data distribution unit acquires data in the multi-dimensional structure storage unit through the asynchronous client, finds resource information of a distribution target server, and distributes the data in the multi-dimensional structure storage unit to the target server;

the asynchronous server performs structure reconstruction on data in a cache of the asynchronous server to generate new multi-dimensional structure data, and the method specifically comprises the following steps:

the asynchronous server applies the distribution rule to the data in its own cache, forms a distribution packet comprising a source address and port, a destination address and port, a connection protocol and a data part, and loads the distribution packet into a contiguous double-ended data queue, wherein the queue supports insertion and deletion at both the head end and the tail end; and the asynchronous server and the asynchronous client interact using an asynchronous I/O mode.

5. The method of claim 4, after the asynchronous client distributing data to the target server, further comprising: