CN106599178A - Big data processing method capable of realizing quick search and supporting distributed storage - Google Patents

Big data processing method capable of realizing quick search and supporting distributed storage

Publication number
CN106599178A
Authority
CN
China
Prior art keywords
data
storage
cryptographic hash
values
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611142025.0A
Other languages
Chinese (zh)
Other versions
CN106599178B (en)
Inventor
郑锐韬
李勇波
张恒
孙傲冰
季统凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
G Cloud Technology Co Ltd
Original Assignee
G Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by G Cloud Technology Co Ltd
Priority to CN201611142025.0A
Publication of CN106599178A
Application granted
Publication of CN106599178B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices

Abstract

The invention relates to the technical field of big data storage, and in particular to a big data processing method capable of realizing quick search and supporting distributed storage. The method analyzes the characteristics of exact search over large data volumes. By performing MD5 and consistent hash calculations on the exact data and adding an MD5 field and a hash field, it supports precise locating during exact search: weakly related data is filtered out and the search runs in a relatively small space, which improves the efficiency of exact search over large data volumes. Through the storage definition of the hash field, data can be distributed across multiple files or multiple servers according to different hash values, which increases storage space utilization, balances the data storage load, and reduces the pressure on the storage server. In specific scenarios that require exact data retrieval, the method improves storage efficiency for large data volumes and provides a quick and accurate retrieval method, greatly improving big data search efficiency.

Description

Big data processing method capable of realizing quick search and supporting distributed storage
Technical field
The present invention relates to the technical field of big data storage, and in particular to a big data processing method capable of realizing quick search and supporting distributed storage.
Background
With the development of computer-based e-commerce, applications generate more and more data, and both the data volume and the concurrency of applications keep growing. In scenarios such as exact commodity search, mobile phone positioning, and network connection checks, information about specified data must be obtained quickly, within a unit of time, from a large amount of data. With general big data storage methods, quickly finding and locating specific data in a large amount of data requires either traversing the big data or being guided by a related index. However, maintaining a massive index as data is added, changed, and deleted is itself heavy work, and it greatly affects the efficiency of storing and reading data. Such methods cannot satisfy large-volume, high-concurrency requests well, which causes a bottleneck in application operation.
Summary of the invention
The technical problem solved by the present invention is to provide a big data processing method capable of realizing quick search and supporting distributed storage, which finds data quickly and accurately on a large-volume storage space and supports distributed storage.
The technical solution by which the present invention solves the above technical problem is as follows:
The method comprises the following steps:
Step 1: Extract a feature from each data item to be stored, using a certain algorithm, to obtain a unique feature that identifies the specific data item and is used in the subsequent value calculations; and form a method that can quickly extract data features, for use when storing and reading data;
Step 2: Calculate an MD5 value from the feature extracted from each data item, then use a hash algorithm to calculate a hash value from 1 to N, where N is chosen according to the specific data volume and the division of the distributed storage;
Step 3: Design the storage structure of the data: in addition to the space that stores the data, there is space for the MD5 value and space for the hash value. The hash value is used to directly hit the data having the same hash value, and the MD5 value is used to determine the exact data among the data with the same hash value;
Step 4: When reading data, extract the feature of the data and calculate the MD5 value and the hash value; filter out most of the data by the hash value, and determine the exact data by the MD5 value from the remaining small range of data.
An MD5 value is calculated for the extracted feature value; after the MD5 of the feature value has been calculated, a hash calculation is performed on the MD5 value to obtain the hash value, so that large amounts of data are stored in a distributed manner according to the calculated hash value;
When storing and reading, the MD5 value and the hash value are calculated by the same unified method.
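The chain from feature to MD5 value to hash value can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, and the modulo mapping onto 1..N is an assumption, since the text only says that a hash algorithm yields a value from 1 to N.

```python
import hashlib


def md5_of_feature(feature: str) -> str:
    """Return the MD5 value of an extracted feature as 32 hex characters."""
    return hashlib.md5(feature.encode("utf-8")).hexdigest()


def hash_of_md5(md5_hex: str, n: int) -> int:
    """Map an MD5 value onto a hash value in the range 1..n.

    Assumption: a plain modulo over the 128-bit digest. The source does
    not name the hash algorithm used for this step.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return int(md5_hex, 16) % n + 1


def index_fields(feature: str, n: int) -> tuple[str, int]:
    """Compute both index fields the same way for storing and for reading."""
    md5_value = md5_of_feature(feature)
    return md5_value, hash_of_md5(md5_value, n)
```

Because the same `index_fields` function is called on both the write path and the read path, a record and a later lookup for it always agree on the MD5 value and the hash value.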
A middleware that technically supports partitioning or a distributed architecture is selected as the storage space. When the storage space is set up, partition files or a distributed server architecture are set up according to the hash value, so as to ensure separate reads and a balanced load while big data is stored and read;
When data is stored to the designed storage space, the data, the MD5 value, and the hash value are saved together, and the storage space stores the data to a specific storage file or storage server according to the designed storage logic.
The partition files or distributed server architecture are set up by hash value using a consistent hashing algorithm.
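For readers unfamiliar with consistent hashing, the following Python sketch shows one common form of it: a hash ring with virtual nodes. The class name, the server names, and the use of MD5 for ring positions are all assumptions made for illustration; the source only states that a consistent hashing algorithm is used.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    """Place a key on the ring using its 128-bit MD5 digest."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring mapping keys to storage files or servers."""

    def __init__(self, nodes=(), vnodes: int = 64):
        self.vnodes = vnodes   # virtual nodes per real node
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _position(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def locate(self, key: str) -> str:
        """Return the node owning the first ring point after the key."""
        if not self._points:
            raise LookupError("ring has no nodes")
        i = bisect.bisect_right(self._points, _position(key))
        return self._owner[self._points[i % len(self._points)]]
```

The property that makes this attractive for distributed storage is that adding a server only moves keys onto the new server; keys never shuffle between the existing ones.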
When reading data, the calculated hash value is used, on the partitioned or distributed-server storage space, to determine the file or server that holds the same hash value, and the read is performed there;
The data with the same hash value is read out and then compared by MD5 value to obtain the data with the identical MD5 value, so that the required data is found quickly.
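The two-stage read (select one partition by hash value, then compare MD5 values inside it) can be sketched with in-memory lists standing in for partition files. The partition count `N`, the function names, and the modulo mapping are assumptions for illustration only.

```python
import hashlib

N = 8  # assumed number of partitions


def fields(feature: str) -> tuple[str, int]:
    """Return (MD5 value, hash value in 1..N) for a feature."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return digest, int(digest, 16) % N + 1


def build_partitions(items):
    """Store (feature, data) pairs as (md5, hash, data) rows per partition."""
    partitions = {bucket: [] for bucket in range(1, N + 1)}
    for feature, data in items:
        digest, bucket = fields(feature)
        partitions[bucket].append((digest, bucket, data))
    return partitions


def read(partitions, feature: str):
    """Return the matching data and the number of rows that were scanned."""
    digest, bucket = fields(feature)
    candidates = partitions[bucket]          # stage 1: filter by hash value
    matches = [data for md5_value, _, data in candidates
               if md5_value == digest]       # stage 2: exact MD5 comparison
    return matches, len(candidates)
```

The second return value makes the saving visible: only the rows of one partition are compared, not the whole data set.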
The beneficial effects of the present invention are:
The method analyzes the characteristics of exact search over large data volumes. It performs MD5 and consistent hash calculations on the exact data and adds an MD5 field and a hash field to support precise locating during exact search, so that weakly related data is filtered out and the search runs in a relatively small space, which improves the efficiency of exact search over large data volumes. At the same time, through the storage definition of the hash field, data can be stored across multiple files or multiple servers according to different hash values, which improves storage space utilization for large data volumes, balances the data storage load, and reduces the pressure on the storage server.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawing:
Figure 1 is a flowchart of the computer software functional units of the present invention.
Detailed description of the embodiments
As shown in Figure 1, the method of the present invention is implemented in the following steps:
Step 1: On the data storage middleware, set up the storage space for the data, the MD5 storage space, and the hash value storage space, and design the table partitions or distributed server storage of the storage space according to the hash value, using the consistent hashing method;
Step 2: Define a specific data feature extraction method, and extract a feature from each data item to be added using that method;
Step 3: Calculate an MD5 value from the feature extracted from each data item, then use a hash algorithm to calculate a hash value from 1 to N;
Step 4: Save the data, the MD5 value, and the hash value to the storage space; the storage middleware automatically saves the data into separate files or separate servers by hash value, according to the designed ranges;
Step 5: When reading data, first extract the feature of the data to be read using the same method, and calculate the MD5 value and the hash value. Read the data with the same hash value from the storage middleware: the middleware locates the file or server where the data is stored by the hash value, so that only a very small amount of data is read. Then compare to find the data with the identical MD5 value and return the specified data information.
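The five steps can be exercised end to end with a small Python sketch. SQLite from the standard library is used here only as a stand-in so the example runs without a server; the text itself names MySQL or MongoDB as middleware. One table per hash value imitates table partitioning, and all table, column, and function names are hypothetical.

```python
import hashlib
import sqlite3

N = 4  # assumed number of partitions


def fields(feature: str) -> tuple[str, int]:
    """Return (MD5 value, hash value in 1..N) for a feature."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return digest, int(digest, 16) % N + 1


def open_store() -> sqlite3.Connection:
    """Step 1: create one table per hash value with data, MD5 and hash columns."""
    conn = sqlite3.connect(":memory:")
    for bucket in range(1, N + 1):
        conn.execute(
            f"CREATE TABLE records_{bucket} ("
            "md5 TEXT NOT NULL, hash_value INTEGER NOT NULL, data TEXT NOT NULL)"
        )
        conn.execute(f"CREATE INDEX idx_records_{bucket}_md5 ON records_{bucket} (md5)")
    return conn


def save(conn: sqlite3.Connection, feature: str, data: str) -> None:
    """Steps 2 to 4: compute both fields and write to the matching partition."""
    digest, bucket = fields(feature)
    conn.execute(
        f"INSERT INTO records_{bucket} (md5, hash_value, data) VALUES (?, ?, ?)",
        (digest, bucket, data),
    )


def load(conn: sqlite3.Connection, feature: str) -> list[str]:
    """Step 5: go straight to one partition, then match on the MD5 value."""
    digest, bucket = fields(feature)
    rows = conn.execute(
        f"SELECT data FROM records_{bucket} WHERE md5 = ?", (digest,)
    ).fetchall()
    return [row[0] for row in rows]
```

A call such as `load(conn, "device-1|2016-12-12T08:00:00")` touches only one of the `N` tables, which is the filtering effect the steps describe.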
The specific steps for designing the consistent hash table of the storage middleware are:
Step one: select an available storage middleware, using a common middleware such as MySQL or MongoDB;
Step two: design the storage space on the storage middleware, with space for the data, the MD5 value, and the hash value, for storing the data;
Step three: design the storage partitions of the data by hash value according to the range of hash values, for example taking every 1,000,000 data items as one storage space, so that a balanced data storage space can be designed.
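As a worked example of the sizing rule, the number of partitions N can be derived from the expected record count and the example figure of 1,000,000 items per storage space. The helper names below are hypothetical, and the modulo mapping used to check balance is an assumption.

```python
import hashlib
import math
from collections import Counter

RECORDS_PER_PARTITION = 1_000_000  # example figure from the text


def partition_count(expected_records: int,
                    per_partition: int = RECORDS_PER_PARTITION) -> int:
    """Return N so that each partition holds about per_partition records."""
    if expected_records < 0 or per_partition < 1:
        raise ValueError("invalid sizing arguments")
    return max(1, math.ceil(expected_records / per_partition))


def partition_sizes(features, n: int) -> Counter:
    """Count how many features land in each hash value 1..n."""
    sizes = Counter()
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
        sizes[int(digest, 16) % n + 1] += 1
    return sizes
```

For 2,500,000 expected records, `partition_count` gives N = 3. Running `partition_sizes` over a sample of real features is a quick way to check that the partitions fill evenly before committing to a layout.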
The specific steps for extracting the feature of the data are:
Step one: if the data itself has a definite feature, that feature can be used directly as the data feature, for example a network address;
Step two: if the generation time of the data can serve as the data feature, use the time as the data feature;
Step three: if the device of the data can serve as the data feature, use the unique identifier of the device as the data feature, for example a mobile phone number;
Step four: if no single item can serve as a unique identifier, a combined feature can be used as the identifier, for example device + time.
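The extraction rules can be sketched as a small Python function. The record layout (a dict with `url`, `created_at`, and `device_id` keys) and the `unique_keys` argument, through which the caller declares which fields are known to be unique in the data set, are assumptions introduced for illustration.

```python
def extract_feature(record: dict, unique_keys: frozenset = frozenset()) -> str:
    """Pick a unique feature for a record, trying the rules in order.

    unique_keys names the fields the caller knows to be unique on their
    own; which fields qualify depends on the data set.
    """
    # Rules one to three: inherent feature, generation time, device identifier.
    for key in ("url", "created_at", "device_id"):
        if key in unique_keys and record.get(key) is not None:
            return f"{key}={record[key]}"
    # Rule four: no single field is unique, so combine device and time.
    device, created = record.get("device_id"), record.get("created_at")
    if device is not None and created is not None:
        return f"device_id={device}|created_at={created}"
    raise ValueError("record has no usable unique feature")
```

Prefixing each value with its field name keeps features from different rules from colliding, for example a device identifier that happens to look like a timestamp.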
The key point of this quick search method for specific data within big data is that a definite feature can be extracted from each data item. Multiple features may be extracted from one data item, but the feature that is used must be unique, so that the method can locate and search quickly and find the required data.
By using a specific data storage middleware to set up the partitioning or distributed-server logic by hash value, data is stored by category, which reduces the load on large files or servers and thereby improves the efficiency of storing and reading big data.

Claims (7)

1. A big data processing method capable of realizing quick search and supporting distributed storage, characterized in that the method comprises the following steps:
Step 1: Extract a feature from each data item to be stored, using a certain algorithm, to obtain a unique feature that identifies the specific data item and is used in the subsequent value calculations; and form a method that can quickly extract data features, for use when storing and reading data;
Step 2: Calculate an MD5 value from the feature extracted from each data item, then use a hash algorithm to calculate a hash value from 1 to N, where N is chosen according to the specific data volume and the division of the distributed storage;
Step 3: Design the storage structure of the data: in addition to the space that stores the data, there is space for the MD5 value and space for the hash value. The hash value is used to directly hit the data having the same hash value, and the MD5 value is used to determine the exact data among the data with the same hash value;
Step 4: When reading data, extract the feature of the data and calculate the MD5 value and the hash value; filter out most of the data by the hash value, and determine the exact data by the MD5 value from the remaining small range of data.
2. The method according to claim 1, characterized in that:
An MD5 value is calculated for the extracted feature value; after the MD5 of the feature value has been calculated, a hash calculation is performed on the MD5 value to obtain the hash value, so that large amounts of data are stored in a distributed manner according to the calculated hash value;
When storing and reading, the MD5 value and the hash value are calculated by the same unified method.
3. The method according to claim 1, characterized in that:
A middleware that technically supports partitioning or a distributed architecture is selected as the storage space. When the storage space is set up, partition files or a distributed server architecture are set up according to the hash value, so as to ensure separate reads and a balanced load while big data is stored and read;
When data is stored to the designed storage space, the data, the MD5 value, and the hash value are saved together, and the storage space stores the data to a specific storage file or storage server according to the designed storage logic.
4. The method according to claim 2, characterized in that:
A middleware that technically supports partitioning or a distributed architecture is selected as the storage space. When the storage space is set up, partition files or a distributed server architecture are set up according to the hash value, so as to ensure separate reads and a balanced load while big data is stored and read;
When data is stored to the designed storage space, the data, the MD5 value, and the hash value are saved together, and the storage space stores the data to a specific storage file or storage server according to the designed storage logic.
5. The method according to claim 3 or 4, characterized in that: the partition files or distributed server architecture are set up by hash value using a consistent hashing algorithm.
6. The method according to any one of claims 1 to 4, characterized in that:
When reading data, the calculated hash value is used, on the partitioned or distributed-server storage space, to determine the file or server that holds the same hash value, and the read is performed there;
The data with the same hash value is read out and then compared by MD5 value to obtain the data with the identical MD5 value, so that the required data is found quickly.
7. The method according to claim 5, characterized in that:
When reading data, the calculated hash value is used, on the partitioned or distributed-server storage space, to determine the file or server that holds the same hash value, and the read is performed there;
The data with the same hash value is read out and then compared by MD5 value to obtain the data with the identical MD5 value, so that the required data is found quickly.
CN201611142025.0A 2016-12-12 2016-12-12 Big data processing method capable of realizing quick search and supporting distributed storage Active CN106599178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611142025.0A CN106599178B (en) 2016-12-12 2016-12-12 Big data processing method capable of realizing quick search and supporting distributed storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611142025.0A CN106599178B (en) 2016-12-12 2016-12-12 Big data processing method capable of realizing quick search and supporting distributed storage

Publications (2)

Publication Number Publication Date
CN106599178A (en) 2017-04-26
CN106599178B (en) 2019-08-30

Family

ID=58597641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611142025.0A Active CN106599178B (en) 2016-12-12 2016-12-12 Big data processing method capable of realizing quick search and supporting distributed storage

Country Status (1)

Country Link
CN (1) CN106599178B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120166403A1 (en) * 2010-12-24 2012-06-28 Kim Mi-Jeom Distributed storage system having content-based deduplication function and object storing method
CN104239572A (en) * 2014-09-30 2014-12-24 普元信息技术股份有限公司 System and method for achieving metadata analysis based on distributed cache

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PEILUN LI ET AL: "Optimizing Hash-based Distributed Storage Using Client Choices", APSYS '16: Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems *
HUANG QIULAN ET AL: "Research on hash algorithms for distributed storage systems", Computer Engineering and Applications *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153838A (en) * 2017-12-15 2018-06-12 济南中维世纪科技有限公司 A kind of MySQL database middleware preprocess method
CN108153838B (en) * 2017-12-15 2022-03-11 山东中维世纪科技股份有限公司 MySQL database middleware preprocessing method
CN111258966A (en) * 2020-01-14 2020-06-09 软通动力信息技术有限公司 Data deduplication method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN106599178B (en) 2019-08-30

Similar Documents

Publication Publication Date Title
US10402427B2 (en) System and method for analyzing result of clustering massive data
CN106528787B (en) query method and device based on multidimensional analysis of mass data
CN106407207B (en) Real-time newly-added data updating method and device
CN103150397B (en) A kind of data directory creation method, data retrieval method and system
CN108205577B (en) Array construction method, array query method, device and electronic equipment
CN106033416A (en) A string processing method and device
US10685042B2 (en) Identifying join relationships based on transactional access patterns
US20150227535A1 (en) Caseless file lookup in a distributed file system
CN107402950A (en) Divide the document handling method and device of table based on point storehouse
CN106126486A (en) Temporal information coded method, encoded radio search method, coding/decoding method and device
US10838875B2 (en) System and method for managing memory for large keys and values
CN107357794B (en) Method and device for optimizing data storage structure of key value database
CN105488176A (en) Data processing method and device
CN106599178A (en) Big data processing method capable of realizing quick search and supporting distributed storage
CN108304404B (en) Data frequency estimation method based on improved Sketch structure
CN110969000B (en) Data merging processing method and device
US10095630B2 (en) Sequential access to page metadata stored in a multi-level page table
CN110222046B (en) List data processing method, device, server and storage medium
CN109189864B (en) Method, device and equipment for determining data synchronization delay
CN103902693A (en) Method of read-optimized memory database T-tree index structure
CN113849524B (en) Data processing method and device
CN116361287A (en) Path analysis method, device and system
US8533167B1 (en) Compressed set representation for sets as measures in OLAP cubes
CN115185998A (en) Target field searching method and device, server and computer readable storage medium
US11709993B2 (en) Efficient concurrent invocation of sheet defined functions including dynamic arrays

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 523808 19th Floor, Cloud Computing Center, Chinese Academy of Sciences, No. 1 Kehui Road, Songshan Lake Hi-tech Industrial Development Zone, Dongguan City, Guangdong Province

Applicant after: G-Cloud Technology Co., Ltd.

Address before: 523808 Guangdong province Dongguan City Songshan Lake Science and Technology Industrial Park Building No. 14 Keyuan pine

Applicant before: G-Cloud Technology Co., Ltd.

GR01 Patent grant