CN110716986B

CN110716986B - Big data analysis system and application method thereof

Info

Publication number: CN110716986B
Application number: CN201910985540.2A
Authority: CN
Inventors: 袁思静
Original assignee: Shenzhen Lansi Network Technology Co ltd
Current assignee: Shenzhen Lansi Network Technology Co.,Ltd.
Priority date: 2019-10-17
Filing date: 2019-10-17
Publication date: 2021-05-04
Anticipated expiration: 2039-10-17
Also published as: CN110716986A

Abstract

The embodiment of the invention discloses a big data analysis system, which comprises a data acquisition module, a capacity monitoring unit, a database allocation unit, a data import module, a sorting integration unit and a data statistical analysis module, wherein the application method specifically comprises the following steps: correspondingly distributing a plurality of databases to each client, and dividing the databases into a main database and a standby database; monitoring the rate and the residual capacity of the main database and the plurality of standby databases in real time during data acquisition; the use of the main database and the plurality of standby databases is replaced in a circulating mode, and the use volume of each database is limited; the data of the main database and the data of the standby databases are transmitted to the distributed storage libraries, and the data of the main database and the data of the standby databases are arranged according to the time sequence; summarizing and analyzing the data in sequence; according to the scheme, the data are transferred to the standby database in time for data acquisition, and when the data in the front-end database are imported and transmitted to the distributed database, the data can be transmitted in a sectional mode, so that the transmission efficiency is improved.

Description

Big data analysis system and application method thereof

Technical Field

The embodiment of the invention relates to the technical field of big data processing, in particular to a big data analysis system and an application method thereof.

Background

With the rapid development of informatization, big data has emerged. Big data refers to data sets whose scale greatly exceeds the capabilities of traditional database software tools in acquisition, storage, management and analysis, and which have four characteristics: massive data scale, rapid data circulation, diverse data types and low value density. Data diversity arises mainly for two reasons: first, there are many data sources, such as search engines, social networks, call records and sensors; second, there are many data formats, including structured, semi-structured and unstructured data.

The strategic significance of big data technology lies not in possessing huge amounts of data, but in performing specialized analysis and processing of the meaningful data. The whole process of data processing can be summarized in four steps: data acquisition, data import and preprocessing, data statistical analysis, and data mining.

The main feature and challenge of data acquisition is high concurrency: thousands of users may be accessing and operating at the same time, so a large number of databases need to be deployed at the client for support. The databases used for data acquisition have the following disadvantages:

(1) when the amount of data stored in a database reaches a certain limit, the database may stall during collection and storage, causing the database to crash and data to be lost, so that the client data cannot be collected in real time and the accuracy of the analysis result is affected;

(2) when a plurality of databases are used to collect data of the same client, their data cannot be quickly and synchronously transmitted to the distributed repository and arranged and summarized there in sequence.

Disclosure of Invention

Therefore, the embodiment of the invention provides a big data analysis system and an application method thereof. When the memory usage of one database reaches a set value, data acquisition is transferred in time to a standby database; and when the data of the front-end databases is imported and transmitted to the distributed repository, it can be transmitted in segments, which improves transmission efficiency. This solves the problems in the prior art that a database may stall during collection and storage, causing the database to crash and data to be lost, and that a plurality of databases cannot be quickly and synchronously transmitted to the distributed repository and arranged and summarized in sequence.

In order to achieve the above object, an embodiment of the present invention provides the following: a big data analytics system, comprising:

the data acquisition module consists of a plurality of front-end databases for receiving client data, and the front-end database distributed by each client is divided into a main database and a plurality of standby databases;

the capacity monitoring unit is used for monitoring the residual memory capacity and the data acquisition efficiency of each main database and each standby database in real time;

the database allocation unit is used for switching the client's connection to a standby database for data acquisition when the volume of data stored in the main database exceeds a set value;

the data import module is used for receiving data in the main database and the standby database by adopting a plurality of distributed repositories;

the sequencing and integrating unit is used for sequencing the data of the main database and the standby database according to the time sequence;

and the data statistical analysis module is used for summarizing and analyzing the mass data in the distributed storage library in sequence.

As a preferred aspect of the present invention, when the total number of the spare databases is greater than two, the database allocation unit sorts the transmission order of the plurality of spare databases, and the database allocation unit sequentially and circularly calls the main database and the spare databases in a front-back order to realize data transmission.

As a preferred scheme of the present invention, by using the sequential work of the main database and the standby database, the capacity volumes of the main database and the standby database are maintained in a stable range, so as to avoid data transmission jam and realize stable transmission, and the specific working steps are as follows:

the data of the client is preferentially transmitted to the data import module through the main database;

the capacity monitoring unit monitors the data acquisition rate and the residual memory capacity of the main database in real time, when the transmission rate is smaller than a set value or the residual memory capacity is smaller than the set value, the client is connected with the first standby database, and the data of the client is transmitted to the data import module through the first standby database;

the capacity monitoring unit monitors the data acquisition rate and the residual memory capacity of the current standby database in real time, when the transmission rate is smaller than a set value or the residual memory capacity is smaller than the set value, the client is connected with a second standby database, and the data of the client is transmitted to the data import module through the second standby database;

repeating the steps until the data of the client side is transmitted to the data import module through the nth standby database;

and after all the standby databases are used in sequence, the client reuses the main database and transmits the data to the data import module.

As a preferred scheme of the present invention, the usage sequence of the primary database and the backup database is specifically as follows: sequential recycling of the primary database and the plurality of backup databases.
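The sequential recycling described above can be sketched in Python as follows. This is only an illustrative sketch: the class name `DatabaseRotation`, the method `next_database` and the database labels are hypothetical and do not appear in the patent, which does not prescribe an implementation.

```python
from itertools import cycle


class DatabaseRotation:
    """Hypothetical helper: visits the main database, then each standby
    database in order, then returns to the main database."""

    def __init__(self, main, standbys):
        self.order = [main, *standbys]
        self._cycle = cycle(self.order)
        # Collection always starts on the main database.
        self.current = next(self._cycle)

    def next_database(self):
        """Switch to the next database in the fixed order and return it."""
        self.current = next(self._cycle)
        return self.current


rotation = DatabaseRotation("main", ["standby-1", "standby-2"])
visited = [rotation.current] + [rotation.next_database() for _ in range(4)]
print(visited)
```

With two standby databases, the visiting order is main, standby-1, standby-2, main, standby-1, and so on.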

In addition, the invention also provides an application method of the big data analysis system, which is characterized by comprising the following steps:

step 100, correspondingly distributing a plurality of databases to each client, and dividing the databases into a main database and a standby database;

step 200, monitoring the rate of the main database and the standby databases in real time during data acquisition, and monitoring the residual capacity of each database;

step 300, circularly replacing the main database and the plurality of standby databases, and limiting the use volume of each database;

step 400, transmitting data of the main database and the plurality of standby databases to a distributed repository, wherein the data of the main database and the data of the standby databases are arranged according to a time sequence;

and step 500, summarizing and analyzing the data in sequence.

As a preferred aspect of the present invention, in step 300, the primary database and the plurality of backup databases are cyclically replaced, and the data transfer and storage of the client is implemented by the following steps:

step 301, when the remaining memory capacity of the main database and the plurality of standby databases reaches a set value, setting a transfer node of each main database and each standby database;

step 302, determining a mapping relation between transfer nodes of a main database and a plurality of standby databases;

step 303, transferring data from a database to a database corresponding to the mapping according to the mapping relation, and continuously acquiring the data;

and step 304, importing the data in the transfer node of each database into the distributed storage library according to a first-in first-out sequence, and sequentially connecting the two database data with the mapping relation according to the mapping relation of the transfer node.
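As an illustration of the mapping relation in step 302, the following Python sketch builds a ring in which the transfer node of each database points to the database that continues collection. The function name `build_transfer_mapping` and the database labels are assumptions made for this example; the patent does not specify how the mapping is represented.

```python
def build_transfer_mapping(main, standbys):
    """Map each database to the one that takes over collection after it.

    The last standby database maps back to the main database, which
    closes the cycle (hypothetical representation, for illustration).
    """
    order = [main, *standbys]
    return {src: order[(i + 1) % len(order)] for i, src in enumerate(order)}


mapping = build_transfer_mapping("main", ["standby-1", "standby-2"])
print(mapping)
```

A plain dictionary is enough here because each transfer node maps to exactly one successor database.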

In a preferred embodiment of the present invention, in step 400, the data of the primary database and the plurality of backup databases are transferred to the distributed storage bases of the data import module according to a first-in first-out principle.

As a preferred scheme of the present invention, the specific steps of sequentially transferring the data of the primary database and the plurality of standby databases to the data import module are as follows:

step 401, the distributed repository receives and arranges the data of the main database in sequence until the distributed repository accesses the transfer node of the main database, and the distributed repository transfers and receives the data of the next standby database according to the mapping relationship;

step 402, splicing the data of the next standby database to the data tail end of the main database in sequence, and arranging the data according to the receiving mode of the step 401;

step 403, the distributed repository switches, according to the mapping relationship, to receiving the data of the last standby database, and sequentially splices the data of the last standby database to the data tail end of the preceding standby database;

and step 404, according to the mapping relation of the transfer node of the last standby database, the distributed database receives the data of the main database again, and the data of the main database is spliced to the data tail end of the last standby database in sequence.

As a preferable scheme of the invention, when the data acquisition efficiency of the main database or of a standby database is less than a set value, a transfer node is likewise set so that collection switches to a replacement database.

The embodiment of the invention has the following advantages:

(1) according to the invention, a plurality of databases are arranged on each client in a matching manner, and when the memory of one database reaches a set value, the database is transferred to a standby database in time for data acquisition, so that the data acquisition loss is effectively avoided, the effectiveness and stability of data analysis are improved, and stable and accurate data are provided for data analysis;

(2) the data of the client is dispersed in different databases, and each database stores data in different time periods, so that the storage position of the data in each database cannot be artificially predicted, and the condition that the data of the client is stolen can be effectively avoided;

(3) each database, once called, continues collecting until its stored data reaches the set value, and only then is collection transferred to another database; each database therefore holds a large, contiguous segment of data collected in time order, so that when the data is imported and transmitted to the distributed repositories it can be transmitted in segments, which improves transmission efficiency and avoids errors in the arrangement of the transmitted data.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

FIG. 1 is a block diagram of a big data analysis system according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a big data analysis application method in the embodiment of the present invention.

In the figure:

1-a data acquisition module; 2-a capacity monitoring unit; 3-a database allocation unit; 4-a data import module; 5-a sort integration unit; 6-data statistical analysis module.

Detailed Description

The present invention is described in terms of particular embodiments, other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure, and it is to be understood that the described embodiments are merely exemplary of the invention and that it is not intended to limit the invention to the particular embodiments disclosed. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

As shown in FIG. 1, the present invention provides a big data analysis system. The timeliness and stability of data acquisition determine the accuracy and validity of data analysis, that is, the stability of data acquisition is positively correlated with the quality of data analysis. When the volume of data stored in a database exceeds a set proportion of its total storage capacity, the database is prone to stalling. To avoid data loss caused by such stalls, the system allocates a plurality of databases to each client, which has the following advantages:

(1) when the memory of one database reaches a set value, the memory is transferred to a standby database in time for data acquisition, so that data acquisition loss is effectively avoided;

(2) the effectiveness and stability of data analysis are improved, and stable and accurate data are provided for the data analysis;

(3) the data of the client is dispersed in different databases, each database stores data in different time periods, and the storage position of the data in each database cannot be predicted artificially, so that the condition that the data of the client is stolen can be effectively avoided.

The system specifically comprises a data acquisition module 1, a capacity monitoring unit 2, a database distribution unit 3, a data import module 4, a sequencing integration unit 5 and a data statistical analysis module 6:

the data acquisition module 1 is configured to receive client data, each client is correspondingly allocated with a front-end database, and each front-end database is divided into a main database and a plurality of standby databases, that is, each client is allocated with one main database and a plurality of standby databases.

When the total number of the standby databases is more than two, the database allocation unit 3 sorts the transmission sequence of the standby databases, and the database allocation unit 3 sequentially and circularly calls the main database and the standby databases in sequence from front to back to realize data transmission.

The capacity monitoring unit 2 is used for monitoring the remaining memory capacity and data acquisition efficiency of each main database and each standby database in real time.

And a database allocation unit 3 for replacing the client to connect to a standby database for data transmission when the capacity of the main database exceeds a set value.

For many databases, once the collected and stored data occupies, for example, 90% of the capacity, further writes easily cause data acquisition to stall or even cause the database to crash. To avoid this problem, the capacity monitoring unit 2 monitors the remaining memory capacity and the data acquisition efficiency in real time.

Once the remaining capacity of the database approaches 10%, the database allocation unit 3 automatically switches to another database to collect the data of the client, which effectively solves the problem of data acquisition stalling.

The data import module 4 is used for receiving the data in the main database and the standby databases by means of a plurality of distributed repositories. The data collected at the front end for each client is merged and sent to the distributed repositories, so that the distributed repositories can conveniently gather all the data, which facilitates later data analysis and data mining.
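The switching condition checked by the capacity monitoring unit 2 and acted on by the database allocation unit 3 can be sketched in Python as follows. The names `DatabaseStatus` and `should_switch`, the field names, and the rate threshold of 100 records per second are hypothetical values chosen for illustration; only the 10% remaining-capacity figure comes from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class DatabaseStatus:
    capacity: int          # total storage capacity (arbitrary units)
    used: int              # volume of data currently stored
    rate_per_s: float      # current data acquisition rate

    @property
    def remaining_fraction(self) -> float:
        return (self.capacity - self.used) / self.capacity


def should_switch(status, min_remaining=0.10, min_rate=100.0):
    """Return True when either monitored quantity falls to its set value.

    min_remaining follows the 10% example in the text; min_rate is an
    assumed placeholder, since no concrete rate is given.
    """
    return (status.remaining_fraction <= min_remaining
            or status.rate_per_s < min_rate)


print(should_switch(DatabaseStatus(capacity=1000, used=950, rate_per_s=500.0)))
print(should_switch(DatabaseStatus(capacity=1000, used=500, rate_per_s=500.0)))
```

Either condition alone triggers the switch, matching the "transmission rate is less than a set value or the residual memory capacity is less than the set value" wording of the working steps.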

Summarizing the above, sequential work of the main database and the standby database is utilized to ensure that the capacity and volume of the main database and the standby database are kept in a stable range, data transmission is prevented from being blocked, and stable transmission is realized, and the specific working steps are as follows:

the capacity monitoring unit 2 monitors the data acquisition rate and the residual memory capacity of the main database in real time, when the transmission rate is less than a set value or the residual memory capacity is less than the set value, the client is connected with a first standby database, and the data of the client is transmitted to the data import module through the first standby database;

the capacity monitoring unit 2 monitors the data acquisition rate and the residual memory capacity of the current standby database in real time, when the transmission rate is less than a set value or the residual memory capacity is less than the set value, the client is connected with a second standby database, and the data of the client is transmitted to the data import module through the second standby database;

after all the standby databases are used in sequence, the client reuses the primary database and transmits the primary database to the data import module 4.

And the sequencing and integrating unit 5 is used for sequencing the data of the main database and the data of the standby database according to the time sequence.

And the data statistical analysis module 6 is used for summarizing and analyzing the mass data in the distributed storage library in sequence.

When the main database and the standby database are used for data acquisition of a client, the main database and the standby database are alternately and circularly used, so that data in one database is intermittently acquired on a time axis, and the using sequence of the main database and the standby database is specifically as follows: sequential recycling of the primary database and the plurality of backup databases.

Therefore, when the data of the main database and the standby database are imported into the distributed storage library, the data of the main database and the standby database need to be rearranged and stored according to the time sequence, so that the orderliness and the authenticity of the data are ensured.
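One possible way to picture the time-ordered rearrangement performed by the sequencing and integrating unit 5 is a merge of per-database record lists by timestamp. The sketch below assumes, purely for illustration, that each record is a `(timestamp, payload)` tuple and that each database already holds its own records in time order; it uses Python's standard `heapq.merge`, which is a substitute technique and not a method named in the patent.

```python
import heapq

# Records held by each database: intermittent on the time axis,
# but time-ordered within each database (assumed for this sketch).
main = [(1, "a"), (2, "b"), (7, "g")]
standby_1 = [(3, "c"), (4, "d")]
standby_2 = [(5, "e"), (6, "f")]

# Merge the already-sorted inputs into a single time-ordered sequence.
merged = list(heapq.merge(main, standby_1, standby_2, key=lambda rec: rec[0]))
print(merged)
```

Because the main database is revisited in later cycles, its records (timestamps 1, 2 and 7 here) end up interleaved with those of the standby databases in the merged result.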

Example 2

In order to refine the working method of the big data analysis system, the invention also provides an application method of the big data analysis system. The method specifically addresses how data is arranged between the main database and the plurality of standby databases: it improves the efficiency of importing data into the distributed repository, achieves efficient transmission, meets the real-time transmission and computational analysis requirements of some services, and ensures that the collected data is arranged and stored in time order.

The method specifically comprises the following steps:

step 100, correspondingly distributing a plurality of databases to each client, and dividing the databases into a main database and a standby database.

Step 200, monitoring the rate of the main database and the plurality of standby databases in real time during data acquisition, and monitoring the residual capacity of each database.

Step 300, cyclically replacing the use of the primary database and the plurality of backup databases, limiting the volume of use of each database.

The main database and the plurality of standby databases are labeled so that the cycle order can be determined conveniently. When the amount of data stored in the main database or in a standby database reaches a set value, collection of the client's data must be moved to another database. The specific steps for cyclically replacing the main database and the standby databases are as follows:

firstly, when the residual memory capacity of the main database or of any standby database reaches a set value, setting a transfer node in that database;

secondly, determining the mapping relation between the transfer nodes of the main database and the plurality of standby databases;

thirdly, transferring the data from one database to a database corresponding to the mapping according to the mapping relation to continuously acquire the data;

and fourthly, importing the data in the transfer node of each database into the distributed storage library according to a first-in first-out sequence, and sequentially connecting the data of the two databases with mapping relations according to the mapping relations of the transfer nodes.

When the data acquisition efficiency of the main database or of a standby database is lower than a set value, a transfer node is likewise set so that collection switches to a replacement database.

The operation method of replacing the use sequence of the primary database and the standby database is specifically illustrated by using an example method.

For example: when the residual memory capacity of the main database reaches the set value, a transfer node A is set in the main database, the transfer node A of the main database determines the mapping relation with the first standby database, and the data of the client is then transferred to the first standby database for collection and storage;

when the residual memory capacity of the first standby database reaches the set value, a transfer node B is set in the first standby database, the transfer node B of the first standby database determines the mapping relation with the second standby database, and the data of the client is then transferred to the second standby database for collection and storage;

when the residual memory capacity of the second standby database reaches the set value, a transfer node C is set in the second standby database, the transfer node C of the second standby database determines the mapping relation with the next standby database, and the data of the client is then transferred to the next standby database for collection and storage, and so on;

when the residual memory capacity of the last standby database reaches the set value, a transfer node n is set in the last standby database, the transfer node n of the last standby database determines the mapping relation with the main database, the data of the client is transferred back to the main database for collection and storage, and one calling cycle between the main database and the standby databases is completed.
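The calling cycle in this example can be simulated with a short Python sketch. All names here (`FrontEndDatabase`, `TransferNode`, `collect`) are hypothetical. As a simplifying assumption, each database accepts a fixed number of records per visit (`limit`), which stands in for the residual-capacity set value; a real system would measure capacity rather than count records.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransferNode:
    label: str    # e.g. "A", "B", "C"
    target: str   # database that continues collection (the mapping relation)


@dataclass
class FrontEndDatabase:
    name: str
    limit: int                                # records accepted per visit (assumed)
    log: list = field(default_factory=list)   # records and transfer nodes, in arrival order


def collect(records, databases, labels):
    """Write records in ring order, leaving a transfer node at each switch."""
    idx, count, label_iter = 0, 0, iter(labels)
    for rec in records:
        db = databases[idx]
        if count == db.limit:
            nxt = (idx + 1) % len(databases)
            # Mark where collection left this database and where it continues.
            db.log.append(TransferNode(next(label_iter), databases[nxt].name))
            idx, count = nxt, 0
            db = databases[idx]
        db.log.append(rec)
        count += 1
    return databases


dbs = [FrontEndDatabase("main", 2),
       FrontEndDatabase("standby-1", 2),
       FrontEndDatabase("standby-2", 2)]
collect(range(1, 8), dbs, ["A", "B", "C"])
for db in dbs:
    print(db.name, db.log)
```

After seven records, the main database holds records 1 and 2, then transfer node A, then record 7, which shows how the data in one database becomes intermittent on the time axis once the cycle returns to it.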

As a main feature of this embodiment, cyclically calling multiple databases in this way solves the problem of a single database stalling or even crashing during data acquisition. Cyclic calling ensures that the amount of data stored in each database never reaches the level at which data acquisition stalls, thereby avoiding data loss.

In addition, each database, once called, continues collecting until its stored data reaches the set value, and only then is collection transferred to another database. Each database therefore holds a large, contiguous segment of data collected in time order, so that when the data is imported and transmitted to the distributed repositories it can be transmitted in segments, which improves transmission efficiency and avoids errors in the arrangement of the transmitted data.

And step 400, transmitting the data of the main database and the plurality of standby databases to the distributed storage library, wherein the data of the main database and the data of the standby databases are arranged according to the time sequence.

In this step, the data of the main database and the plurality of standby databases are transferred to the distributed repositories of the data import module according to the first-in first-out principle.

The specific steps of sequentially transferring the data of the main database and the plurality of standby databases to the data import module are as follows:

(1) The distributed repository receives and arranges the data of the main database in sequence until it reaches the transfer node of the main database, and then switches to receiving the data of the next standby database according to the mapping relation.

That is, the main database synchronously transfers data to the distributed repository while collecting the client data, and when the data amount in the main database reaches a set value, the next database is replaced to collect the client data.

At this moment, the distributed repository continues to receive the data in the main database until the distributed repository accesses the transfer node of the main database, and the distributed repository transfers and receives the data in the mapped standby database according to the mapping relation determined by the transfer node.

(2) The data of the next standby database is spliced in sequence to the tail end of the data of the main database, and is arranged in the receiving manner of step (1).

(3) The distributed repository switches, according to the mapping relation, to receiving the data of the last standby database, and splices the data of the last standby database in sequence to the tail end of the data of the preceding standby database.

Steps (2) and (3) both explain the order in which data is laid out in the distributed repository. In this embodiment, the data held in any one database is not continuous in time; the data is instead stored in segments across the main database and the standby databases. Each transfer node therefore acts as a time junction at which the data of two databases is spliced together.

(4) And according to the mapping relation of the transfer node of the last standby database, the distributed database receives the data of the main database again, and sequentially splices the data of the main database to the data tail end of the last standby database.

The following will illustrate the data arrangement between the primary database and the plurality of backup databases:

1. the distributed storage library receives and sequentially arranges the data of the main database until the distributed storage library accesses the transfer node A, and the distributed storage library transfers and receives the data of the first standby database;

2. splicing the data of the first standby database to the data tail end of the main database in sequence until the distributed repository accesses the transfer node B, and transferring and receiving the data of the second standby database by the distributed repository;

3. splicing the data of the second standby database to the data tail end of the first standby database in sequence until the distributed storage library accesses to the transfer node C, transferring and receiving the data of the next standby database by the distributed storage library, splicing the next standby database to the tail end of the second standby database in sequence, and so on until the distributed storage library transfers and receives the data of the last standby database;

4. and the distributed database receives and sequentially arranges the data of the last standby database until the distributed database accesses the transfer node n, receives the data of the main database by transfer, and sequentially splices the data of the main database to the last standby database.
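The splicing order in items 1 to 4 above can be sketched in Python as a first-in first-out drain that follows transfer nodes from one database to the next. The function name `import_in_order` and the `("TRANSFER", label, target)` tuple used to represent a transfer node are assumptions made for this example only.

```python
from collections import deque


def import_in_order(logs, start):
    """Drain each database first-in first-out, following transfer nodes.

    `logs` maps a database name to its stored items in arrival order;
    a transfer node is represented as ("TRANSFER", label, target).
    """
    queues = {name: deque(items) for name, items in logs.items()}
    result, current = [], start
    while queues[current]:
        item = queues[current].popleft()
        if isinstance(item, tuple) and item[0] == "TRANSFER":
            current = item[2]      # mapping relation: continue with the mapped database
        else:
            result.append(item)    # splice onto the tail of the data already received
    return result


logs = {
    "main":      [1, 2, ("TRANSFER", "A", "standby-1"), 7],
    "standby-1": [3, 4, ("TRANSFER", "B", "standby-2")],
    "standby-2": [5, 6, ("TRANSFER", "n", "main")],
}
ordered = import_in_order(logs, "main")
print(ordered)
```

Because each transfer node records where collection continued, the repository recovers the time order by following the nodes, without comparing the timestamps of individual records.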

In this embodiment, in order to avoid confusing the transfer nodes when the databases are used cyclically, the transfer nodes may be numbered sequentially. For example, the first transfer node of the main database is named A1, the second transfer node is named A2, and so on, with the i-th transfer node named Ai. When the databases are used in the second round and the distributed repository switches from the last standby database to receiving the data of the main database, it receives the data in the main database that follows on in acquisition time from the data of the last standby database, until the distributed repository reaches the transfer node A2, and so on. The same applies to the plurality of standby databases.

Step 500, summarizing and analyzing the data in sequence.
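A sequential numbering scheme of this kind could be generated as in the Python sketch below. The generator name `transfer_node_labels` is hypothetical, and the use of the letters B and C for the standby databases is an assumption that extends the A1, A2, ..., Ai naming given for the main database.

```python
from itertools import count


def transfer_node_labels(db_letters):
    """Yield A1, B1, C1, then A2, B2, C2, ... so that a label is never
    reused when the databases are visited again in a later round."""
    for round_no in count(1):
        for letter in db_letters:
            yield f"{letter}{round_no}"


labels = transfer_node_labels(["A", "B", "C"])
first_five = [next(labels) for _ in range(5)]
print(first_five)
```

The round number in each label tells the distributed repository which segment of a revisited database it is currently draining.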

As a main feature point of the present invention, in the embodiment, when the collected data is transmitted to the distributed storage libraries in a centralized manner, the data in the multiple databases are arranged in the distributed storage libraries according to a time sequence in a segmented transmission manner, so that the arrangement efficiency is high, the processing speed is high, and the problem of data confusion can be effectively avoided.

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A big data analytics system, comprising:

the data acquisition module (1) consists of a plurality of front-end databases for receiving client data, and the front-end databases distributed by each client are divided into a main database and a plurality of standby databases;

the capacity monitoring unit (2) is used for monitoring the residual memory capacity and the data acquisition efficiency of each main database and each standby database in real time;

the database allocation unit (3) is used for replacing the client to be connected to a standby database for data acquisition when the capacity of the main database exceeds a set value, the database allocation unit (3) sorts the transmission sequence of the standby databases when the total number of the standby databases is more than two, and the database allocation unit (3) sequentially and circularly calls the main database and the standby databases in sequence from front to back to realize data transmission;

the use sequence of the main database and the standby database is specifically as follows: the main database and a plurality of standby databases are sequentially recycled;

a data import module (4) which adopts a plurality of distributed storage banks to receive data in the primary database and the standby database;

the sequencing and integrating unit (5) is used for sequencing the data of the main database and the data of the standby database according to time sequence;

and the data statistical analysis module (6) collects and analyzes the mass data in the distributed storage library in sequence.

2. The big data analysis system according to claim 1, wherein the capacity volume of the main database and the spare database is maintained in a stable range by sequential operations of the main database and the spare database, thereby avoiding data transmission stagnation and realizing stable transmission, and the specific operation steps are as follows:

the priority of the main database is different from that of the standby database, the priority of the main database is higher than that of the standby database, and data of a client side is transmitted to a data import module through the main database with the high priority;

the capacity monitoring unit (2) monitors the data acquisition rate and the residual memory capacity of the main database in real time, when the transmission rate is less than a set value or the residual memory capacity is less than the set value, the client is connected with a first standby database, and the data of the client is transmitted to the data import module through the first standby database;

the capacity monitoring unit (2) monitors the data acquisition rate and the residual memory capacity of the current standby database in real time, when the transmission rate is less than a set value or the residual memory capacity is less than the set value, the client is connected with a second standby database, and the data of the client is transmitted to the data import module through the second standby database;

after all the standby databases are used in sequence, the client reuses the main database and transmits the data to the data import module (4).

3. An application method of a big data analysis system is characterized by comprising the following steps:

and 500, summarizing and analyzing the data according to a sequence.

4. The method for applying the big data analysis system according to claim 3, wherein in step 300, the primary database and the plurality of backup databases are cyclically replaced, and the data transfer and storage of the client is implemented by the following steps:

5. The method of claim 3, wherein in step 400, the data in the primary database and the plurality of backup databases are transferred to the distributed storage of the data import module according to a first-in-first-out principle.

6. The application method of the big data analysis system according to claim 5, wherein the specific steps of sequentially transferring the data of the primary database and the plurality of standby databases to the data import module are as follows:

7. The application method of the big data analysis system according to any one of claims 3 to 6, wherein when the data collection efficiency of the main database and the plurality of standby databases is less than a set value, the transfer node is also set to realize the collection of the replacement database.