CN109947743A

CN109947743A - An optimized NoSQL big data storage method and system

Info

Publication number: CN109947743A
Application number: CN201910151451.8A
Authority: CN
Inventors: 王进; 吴文兵; 张经宇; 王磊; 何施茗
Original assignee: Changsha University of Science and Technology
Current assignee: Changsha University of Science and Technology
Priority date: 2019-02-28
Filing date: 2019-02-28
Publication date: 2019-06-28

Abstract

The invention discloses an optimized NoSQL big data storage method and system. The system includes a data collection server, which receives raw heterogeneous data and pre-processes it by classifying and optimizing the received data; a primary server, which distributes the pre-processed data for distributed storage; and multiple dependent servers, which store the data on instruction from the primary server. Pre-processing the received data improves the quality of the stored data and helps ensure its accuracy, consistency, and integrity. The data are classified: structured data are stored in a relational database and mapped to a non-relational database through a unique identifier, and the non-relational database stores the unstructured data, which improves scalability and makes the storage scheme more flexible. The NoSQL-based distributed storage method meets the demand for highly concurrent reads and writes.

Description

An optimized NoSQL big data storage method and system

Technical field

The present invention relates to the technical field of big data storage, and in particular to an optimized NoSQL big data storage method and system.

Background art

With the continuing development and application of technologies such as cloud computing and the Internet of Things, massive amounts of data are constantly generated in production and operations, commercial activity, social life, and other fields. We live in the information age: some countries have raised big data to the level of national strategy, and enterprises treat big data as an important means of improving their competitiveness. Big data now affects how people work and live; the era of big data has arrived.

Big data will bring a disruptive revolution. It will drive broad advances in social production and profound change in sectors such as government, finance, healthcare, and education. Big data usually refers to data sets so large that they are difficult to collect, process, and analyze, and also to data retained for long periods in infrastructure. One main characteristic of big data is timeliness: for example, a financial application that lets traders quickly mine relevant information from a huge volume of varied data allows them to trade ahead of competitors. Another characteristic is inconsistency: because collection channels and methods differ, the same data may end up in different storage structures. In addition, most data are unstructured, such as pictures, documents, and video, which makes storage difficult.

The three basic data storage architectures are DAS, NAS, and SAN. Direct-attached storage (DAS) suits small-scale storage with simple requirements. Disk arrays, optical disc towers, tape drives, and similar devices connect directly to a server through a SCSI interface, optical fiber, or the like; data on these external storage devices can be shared only within that single server and cannot be shared externally. Network-attached storage (NAS) is a network file storage scheme that improves on DAS. In a NAS storage network, the NAS device has its own data operation and management system, exposes an IP address, and is backed by an embedded storage system; servers and clients on the local area network can access the NAS server directly. NAS offers low-cost, cost-effective, highly available storage that is simple to install. Unlike the former two, a storage area network (SAN) connects storage devices through Fibre Channel switches (FC protocol), forming a storage network over optical fiber. In a SAN, the storage function is separated out, and data are stored centrally.

In summary, big data technology urgently needs a high-performance, high-throughput, large-capacity data storage method.

Summary of the invention

To solve the above technical problem, the object of the present invention is to provide a high-performance, high-throughput, large-capacity NoSQL big data storage method and system.

The technical solution adopted by the present invention is as follows:

An optimized NoSQL big data storage system, comprising:

A data collection server, configured to receive raw heterogeneous data and pre-process it, the pre-processing classifying and optimizing the received raw heterogeneous data;

A primary server, configured to distribute the pre-processed data for distributed storage;

Multiple dependent servers, configured to store the data on instruction from the primary server, the dependent servers storing data according to the following rule: structured data are stored in a relational database and are mapped to a non-relational database through a unique identifier; the non-relational database stores unstructured data, and the unstructured data are stored as NoSQL key-value pairs.

Further, the NoSQL big data storage system also includes a standby server, which takes over the functions of the primary server when the primary server fails.

An optimized NoSQL big data storage method, comprising the following steps:

(1) Data pre-processing: classify and optimize the received raw heterogeneous data;

(2) Data representation: define the format standard for data storage, the data storage using a relational database and a non-relational database according to the following rule: structured data are stored in the relational database and are mapped to the non-relational database through a unique identifier; the non-relational database stores unstructured data, and the unstructured data are stored as NoSQL key-value pairs;

(3) Distributed data storage: store the pre-processed data cooperatively on multiple computers according to distributed principles.

Further, the data pre-processing of step (1) includes: (11) dividing the received raw heterogeneous data into lightweight data and multimedia data; (12) extracting feature information from the multimedia data as lightweight data that describes the multimedia data; (13) cleaning the feature information extracted from the multimedia data; (14) deleting the redundant data that remain after data cleaning.

Further, the data pre-processing of step (1) includes: (15) user-defined data processing for a specific application.

Further, the feature information of the multimedia data in step (12) includes an interest value, a digest value, and an original value. The interest value indicates the application field of the data, the digest value briefly describes the multimedia data, and the original value stores the location of the original multimedia data in the disk array.

Further, the data cleaning of step (13) includes filling in incomplete data, smoothing noisy data, and correcting inconsistent data.

The lightweight data include numbers and character strings, and the multimedia data include pictures, audio, and video.

Further, the distributed data storage of step (3) includes the following steps: (31) communicate with the dependent server cluster to obtain the number of active CPUs, and read the number of server nodes and the data block layout from the configuration file; (32) establish database connections, determine the number of read/write threads from the number of active CPUs, and initialize the read/write connections; (33) under the task control of the main thread, create a mutex to coordinate the read/write threads; (34) the main thread allocates the data blocks and is responsible for storing them in the corresponding partitions; (35) release the mutex, and after each child thread completes its task, merge its information into the main thread in normalized form; (36) the main thread collates the task information and then stores it in the dependent servers in a unified manner.

Further, in step (36), when the collated task information is stored in the dependent servers, a copy is generated at the same time and stored in the standby server.

Beneficial effects of the present invention: pre-processing the received data improves the quality of the stored data and helps ensure its accuracy, consistency, and integrity. The data are classified: structured data are stored in a relational database and mapped to a non-relational database through a unique identifier, and the non-relational database stores the unstructured data, which improves scalability and makes the storage scheme more flexible. The NoSQL-based distributed storage method meets the demand for highly concurrent reads and writes. A standby server is provided, which strengthens the disaster resilience of the data storage.

Brief description of the drawings

Specific embodiments of the present invention are further described below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of the storage model of the NoSQL big data storage system of the present invention;

Fig. 2 is a schematic diagram of the data pre-processing of the present invention;

Fig. 3 is a schematic diagram of the data representation;

Fig. 4 is a simplified example of the data representation;

Fig. 5 shows how unstructured data are stored in a computer;

Fig. 6 is a flow chart of the distributed data storage task.

Detailed description of the embodiments

As shown in Fig. 1, an optimized NoSQL big data storage system of the present invention comprises:

A data collection server, mainly responsible for pre-processing the large volume of heterogeneous data it receives; the pre-processing classifies and optimizes the received raw heterogeneous data, which is then passed to the primary server for distributed storage;

A primary server, configured to distribute the pre-processed data for distributed storage; according to the storage state of the cluster, it stores the data in blocks across multiple dependent servers;

Multiple dependent servers, configured to store the data on instruction from the primary server. They are built from multiple disk arrays (RAID), and the dependent servers together form a data service cluster. The dependent servers store data according to the following rule: structured data are stored in a relational database and are mapped to a non-relational database through a unique identifier; the non-relational database stores unstructured data, and the unstructured data are stored as NoSQL key-value pairs.

As a further improvement of this technical solution, the NoSQL big data storage system also includes a standby server. When data are stored, the standby server also receives a copy sent by the primary server as a backup, and it takes over the functions of the primary server when the primary server fails.

The invention also includes a technical solution based on the same inventive concept as the above storage system: an optimized NoSQL big data storage method, comprising the following steps:

(1) Data pre-processing: classify and optimize the received raw heterogeneous data. As shown in Fig. 2, the data pre-processing of step (1) includes:

(11) Divide the received raw heterogeneous data into lightweight data and multimedia data. Lightweight data, such as numbers and character strings, are easy for a computer to interpret. Multimedia data, such as pictures, audio, and video, are not: they exist in the computer in binary form, which carries no specific meaning by itself, so specific information must be extracted from them as lightweight data that describes the multimedia data and that the computer can interpret;

(12) Extract feature information from the multimedia data as lightweight data that describes the multimedia data. This information is expressed as follows:

1) Interest value. Indicates the application field the data relate to, similar to the keywords of an article. The interest value is produced by a corresponding algorithm, which users can define themselves. Examples are the traffic flow in a video over a period of time, the number of people in a picture, and the language used by the speaker in an audio clip;

2) Digest value. Mainly used to describe the multimedia data concisely; it carries no specific meaning of its own. The digest value is produced by a specific algorithm, such as MD5. When two pictures are identical, their digest values are identical. When two received data items produce the same digest value, they are treated as the same data, and only one of them needs to be stored. This effectively reduces data redundancy and shortens retrieval time;

3) Original value. Stores the location of the original multimedia data in the disk array, so that users and administrators can access the original data directly.
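The three feature fields can be pictured as one small record per multimedia object. The following is a minimal Python sketch, not the patent's implementation: the class name MultimediaFeatures, the function extract_features, the interest tag, and the file paths are all hypothetical, and MD5 is used only because the text names it as an example digest algorithm.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class MultimediaFeatures:
    """Lightweight description of one multimedia object (hypothetical layout)."""
    interest: str  # application-field tag, assumed to come from a user-defined algorithm
    digest: str    # MD5 hex digest of the raw bytes
    original: str  # location of the raw object in the disk array


def extract_features(raw: bytes, interest: str, location: str) -> MultimediaFeatures:
    """Build the feature record for one multimedia object."""
    digest = hashlib.md5(raw).hexdigest()
    return MultimediaFeatures(interest=interest, digest=digest, original=location)


# Two objects with identical bytes stored at different (made-up) locations.
a = extract_features(b"demo-image-bytes", "traffic", "/raid0/img/0001.png")
b = extract_features(b"demo-image-bytes", "traffic", "/raid1/img/0977.png")
same_content = a.digest == b.digest  # identical bytes always yield identical digests
```

Because the digest depends only on the bytes, the two records above compare equal on digest even though their original values differ, which is the property the redundancy check relies on.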

After the lightweight data are extracted, the data need to be cleaned.

(13) Clean the feature information extracted from the multimedia data. Data cleaning fills in incomplete data, smooths noisy data, and corrects inconsistent data. Missing values can be filled with a value such as "unknown" or "N". Noisy data can be smoothed by binning with bin boundaries. After data cleaning, data redundancy is reduced.
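Two of the cleaning operations named above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the bins are equal-frequency bins over the sorted values (one common reading of smoothing by bin boundaries), and the result is returned in sorted order rather than in the input order.

```python
def fill_missing(record: dict, fields: list, placeholder: str = "unknown") -> dict:
    """Return a copy of record in which absent or None fields hold the placeholder."""
    cleaned = dict(record)
    for name in fields:
        if cleaned.get(name) is None:
            cleaned[name] = placeholder
    return cleaned


def smooth_by_bin_boundaries(values: list, bin_size: int) -> list:
    """Sort the values, cut them into bins of bin_size, and snap each value
    to the nearer of its bin's minimum and maximum (ties go to the minimum)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        low, high = bin_values[0], bin_values[-1]
        for v in bin_values:
            smoothed.append(low if v - low <= high - v else high)
    return smoothed


filled = fill_missing({"language": None, "people": 3}, ["language", "people", "city"])
smoothed = smooth_by_bin_boundaries([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
```

With bins of three, the value 8 in the first bin (4, 8, 15) is closer to 4 than to 15, so it is replaced by 4, while the bin edges themselves are left unchanged.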

(14) Delete the redundant data that remain after data cleaning. Redundant data are detected through correlation analysis of the data and through the generated digest values, and the surplus data are then deleted.
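The digest-based part of this step amounts to keeping one record per digest value. A minimal Python sketch follows; the record layout (a dict with a "digest" key) and the function name drop_redundant are assumptions made for illustration, and the correlation-analysis part of the step is not covered.

```python
def drop_redundant(records: list) -> list:
    """Keep the first record seen for each digest value and drop later repeats."""
    seen = set()
    kept = []
    for record in records:
        if record["digest"] not in seen:
            seen.add(record["digest"])
            kept.append(record)
    return kept


# Hypothetical feature records; the digest strings are shortened placeholders.
incoming = [
    {"digest": "aa11", "original": "/raid0/img/0001.png"},
    {"digest": "bb22", "original": "/raid0/img/0002.png"},
    {"digest": "aa11", "original": "/raid1/img/0977.png"},
]
unique = drop_redundant(incoming)
```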

(15) For a specific application, users can define their own data processing method. Users can convert this relatively high-quality data into a user-defined format according to their needs, which they can implement by programming. This greatly improves the consistency and usability of the data.

(2) Data representation: define the format standard for data storage. As shown in Fig. 3, data storage uses a relational database and a non-relational database. Structured data are attributes abstracted from the data that describe it uniformly, such as the unique identifier, type, and time, and they are strongly consistent. Storage follows this rule: structured data are stored in the relational database and are mapped to the non-relational database through a unique identifier; the non-relational database stores unstructured data as NoSQL key-value pairs, which allows useful information to be stored flexibly and scales well.

Fig. 4 shows a simple example. On the left is the RDBMS (relational database), which stores the data ID and category. On the right is the NoSQL store (non-relational database), which stores unstructured data. If the record with ID 1 belongs to the category "person", it corresponds to the key-value pair with ID:1 in NoSQL, so the RDBMS data map cleanly into NoSQL. V1 denotes the key of the first lightweight data item, and Vm1 denotes the key of the first multimedia data item.
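The mapping in Fig. 4 can be imitated with Python's standard library. This is a sketch under assumptions, not the patented system: an in-memory SQLite table plays the RDBMS, a plain dict plays the key-value store, and the table name, column names, stored values, and the load function are all hypothetical. Only the keys ID, V1, and Vm1 come from the text.

```python
import sqlite3

# Relational side: one row per record, holding the unique ID and its category.
rdbms = sqlite3.connect(":memory:")
rdbms.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, category TEXT NOT NULL)")
rdbms.execute("INSERT INTO record (id, category) VALUES (?, ?)", (1, "person"))
rdbms.commit()

# Non-relational side: a dict stands in for a key-value store, keyed by the same ID.
nosql = {
    1: {
        "V1": "Alice",  # first lightweight value (made-up example data)
        "Vm1": {        # first multimedia value, held as its feature record
            "interest": "face",
            "digest": "0" * 32,  # placeholder, not a real digest
            "original": "/raid0/img/0001.png",
        },
    }
}


def load(record_id: int) -> dict:
    """Join the relational row with its key-value document through the shared ID."""
    row = rdbms.execute(
        "SELECT id, category FROM record WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    return {"id": row[0], "category": row[1], **nosql.get(row[0], {})}


person = load(1)
```

The shared ID is the only coupling between the two stores, so the key-value document can gain or lose fields without any change to the relational table.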

Fig. 5 shows the concrete storage format of the data in the NoSQL database, which is a JSON format. Unlike an RDBMS, this unstructured format does not require the storage structure of the data to be specified in advance, which suits the wide variety of data types found in big data.
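The point about not fixing the structure in advance can be seen by serializing two documents with different fields into the same store. The Python sketch below is illustrative only; the field values and the V2 key are invented, and Fig. 5 itself is not reproduced here.

```python
import json

# Two documents with different shapes; neither needs a schema declared up front.
doc_person = {"ID": 1, "V1": "Alice", "Vm1": {"interest": "face", "original": "/raid0/img/0001.png"}}
doc_sensor = {"ID": 2, "V1": 21.5, "V2": "celsius"}

store = {}
for doc in (doc_person, doc_sensor):
    store[doc["ID"]] = json.dumps(doc)  # value is the JSON text, key is the unique ID

restored = json.loads(store[2])
```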

(3) Distributed data storage: the primary server is mainly responsible for distributed storage of the data. The pre-processed data are stored cooperatively on multiple computers according to distributed principles.

As shown in Fig. 6, the distributed data storage of step (3) includes the following steps: (31) communicate with the dependent server cluster to obtain the number of active CPUs, and read the number of server nodes and the data block layout from the configuration file; (32) establish database connections, determine the number of read/write threads from the number of active CPUs, and initialize the read/write connections; (33) under the task control of the main thread, create a mutex to coordinate the read/write threads; (34) the main thread allocates the data blocks and is responsible for storing them in the corresponding partitions; (35) release the mutex, and after each child thread completes its task, merge its information into the main thread in normalized form; (36) the main thread collates the task information and then stores it in the dependent servers in a unified manner.
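The control flow of steps (31) to (36) can be approximated on a single machine with Python threads. This is a rough sketch under several assumptions that are not in the source: os.cpu_count() stands in for querying the cluster for active CPUs, the CONFIG dict stands in for the configuration file, in-memory lists stand in for the dependent servers, blocks are placed round-robin, and the thread count is capped at 4 for the demo.

```python
import os
import threading

CONFIG = {"node_count": 3, "block_size": 4}  # hypothetical stand-in for the configuration file


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the payload into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def store_distributed(data: bytes):
    # (31)-(32): size the worker pool from the CPU count (local stand-in for the cluster query).
    workers = max(1, min(os.cpu_count() or 1, 4))
    partitions = {n: [] for n in range(CONFIG["node_count"])}  # stand-ins for dependent servers
    report = []
    lock = threading.Lock()  # (33): mutex that coordinates the writer threads

    # (34): the main thread numbers the blocks and assigns them to workers.
    blocks = list(enumerate(split_into_blocks(data, CONFIG["block_size"])))

    def worker(assigned):
        for index, block in assigned:
            node = index % CONFIG["node_count"]  # round-robin placement (assumption)
            with lock:  # acquired and released around each shared write
                partitions[node].append((index, block))
                report.append((index, node))

    threads = [threading.Thread(target=worker, args=(blocks[w::workers],)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # (35): wait until every child thread has finished

    report.sort()  # (36): the main thread collates the task information
    return partitions, report


partitions, report = store_distributed(b"abcdefghijklmnopqr")
```

Each stored block carries its index, so the original payload can be rebuilt by sorting the blocks from all partitions, regardless of the order in which the threads wrote them.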

The above describes only preferred embodiments of the present invention. The present invention is not limited to the above embodiments; any technical solution that achieves the object of the invention by substantially the same means falls within the scope of protection of the present invention.

Claims

1. An optimized NoSQL big data storage system, characterized by comprising:

A data collection server, configured to receive raw heterogeneous data and pre-process it, the pre-processing classifying and optimizing the received raw heterogeneous data;

A primary server, configured to distribute the pre-processed data for distributed storage;

Multiple dependent servers, configured to store the data on instruction from the primary server, the dependent servers storing data according to the following rule: structured data are stored in a relational database and are mapped to a non-relational database through a unique identifier; the non-relational database stores unstructured data, and the unstructured data are stored as NoSQL key-value pairs.

2. The optimized NoSQL big data storage system according to claim 1, characterized by further comprising a standby server, the standby server taking over the functions of the primary server when the primary server fails.

3. An optimized NoSQL big data storage method, characterized by comprising the following steps:

(1) Data pre-processing: classify and optimize the received raw heterogeneous data;

(2) Data representation: define the format standard for data storage, the data storage using a relational database and a non-relational database according to the following rule: structured data are stored in the relational database and are mapped to the non-relational database through a unique identifier; the non-relational database stores unstructured data, and the unstructured data are stored as NoSQL key-value pairs;

(3) Distributed data storage: store the pre-processed data cooperatively on multiple computers according to distributed principles.

4. The optimized NoSQL big data storage method according to claim 3, characterized in that the data pre-processing of step (1) includes: (11) dividing the received raw heterogeneous data into lightweight data and multimedia data; (12) extracting feature information from the multimedia data as lightweight data that describes the multimedia data; (13) cleaning the feature information extracted from the multimedia data; (14) deleting the redundant data that remain after data cleaning.

5. The optimized NoSQL big data storage method according to claim 4, characterized in that the data pre-processing of step (1) includes: (15) user-defined data processing for a specific application.

6. The optimized NoSQL big data storage method according to claim 4, characterized in that the feature information of the multimedia data in step (12) includes an interest value, a digest value, and an original value, the interest value indicating the application field of the data, the digest value briefly describing the multimedia data, and the original value storing the location of the original multimedia data in the disk array.

7. The optimized NoSQL big data storage method according to claim 4, characterized in that the data cleaning of step (13) includes filling in incomplete data, smoothing noisy data, and correcting inconsistent data.

8. The optimized NoSQL big data storage method according to claim 3, characterized in that the lightweight data include numbers and character strings, and the multimedia data include pictures, audio, and video.

9. The optimized NoSQL big data storage method according to claim 3, characterized in that the distributed data storage of step (3) includes the following steps: (31) communicate with the dependent server cluster to obtain the number of active CPUs, and read the number of server nodes and the data block layout from the configuration file; (32) establish database connections, determine the number of read/write threads from the number of active CPUs, and initialize the read/write connections; (33) under the task control of the main thread, create a mutex to coordinate the read/write threads; (34) the main thread allocates the data blocks and is responsible for storing them in the corresponding partitions; (35) release the mutex, and after each child thread completes its task, merge its information into the main thread in normalized form; (36) the main thread collates the task information and then stores it in the dependent servers in a unified manner.

10. The optimized NoSQL big data storage method according to claim 9, characterized in that in step (36), when the collated task information is stored in the dependent servers, a copy is generated at the same time and stored in the standby server.