CN107633001A - Hash partition optimization method and device - Google Patents

Info

Publication number
CN107633001A
Authority
CN
China
Prior art keywords
hash
hash partition
result
partition
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710656815.9A
Other languages
Chinese (zh)
Inventor
刘悦
周鸣
周一鸣
梁巍
张鑫伟
张蕊
王余涛
朱贵伟
张召才
李金洋
张亚超
张攀
严欢
毛彦淇
及莉
吴之尧
徐映霞
卢波
张扬眉
刘春保
徐冰
刘韬
宋博
龚燃
王帅
李博
付郁
王霄
李侃
何慧东
苑艺
赵琪
袁菁
李帅
肖武平
张晓鹤
宋晶晶
赵爽
郭晓曦
李铁骊
王雪瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute Of Space Science And Technology Information
Original Assignee
Beijing Institute Of Space Science And Technology Information
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute Of Space Science And Technology Information filed Critical Beijing Institute Of Space Science And Technology Information
Priority to CN201710656815.9A priority Critical patent/CN107633001A/en
Publication of CN107633001A publication Critical patent/CN107633001A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hash partition optimization method and device. The method includes: obtaining a data set, wherein the data set includes one or more pieces of data; performing a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result; and performing a second hash partition on the first hash partition result to obtain a second hash partition result. The invention solves the technical problem in the prior art of the short-board effect in hash partitioning caused by data skew.

Description

Hash partition optimization method and device
Technical field
The present invention relates to the field of computer data processing, and in particular to a hash partition optimization method and device.
Background art
Hash partitioning is commonly used as a strategy for data placement and dynamic query processing. It achieves a high level of parallelism and shortens response time when processing data units. In query processing methods, including hash join algorithms and aggregation operations, hash partitioning can be used to obtain intermediate results efficiently. The main goal of hash partitioning can be summarized as replacing one large parent task with several smaller subtasks. The advantage is that the time needed to process the parent task is shortened by using cache and memory more efficiently. In database query processing, hash partitioning is a very popular operation; in join processing and clustering processing, it can improve processing performance; in sort processing, it is an important step. Liu et al. designed a hash partitioning strategy based on distributed query processing that can effectively shorten query time. Shin et al. proposed an optimized hash partitioning method for solid-state drives; regardless of main memory size or input/output block support, the method achieves better results than the traditional hash partitioning method.
Hash partitioning, also known as hashed partitioning, is a partitioning method that distributes data evenly by specifying a partition number. When hash partitioning is performed on input/output devices and the data reaches a certain scale, the partition sizes become approximately equal, which improves the efficiency of the whole query processing. When partitions do not need to be added or deleted, hash partitioning can effectively improve query efficiency. However, when partitions need to be added or deleted, the traditional hash partitioning method runs into problems. Suppose there are originally 7 conventional hash partitions and one of them now has to be merged or deleted: the modulo operation changes from mod 7 to mod 6, and the data in the original 7 partitions must be recomputed and repartitioned.
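The cost of changing the modulus can be illustrated with a minimal sketch. It assumes integer keys and a plain modulo hash; the function name `moved_fraction` and the key range 0 to 41 are illustrative and are not taken from the patent.

```python
def moved_fraction(keys, old_parts, new_parts):
    """Fraction of keys whose partition index changes when the modulus changes."""
    keys = list(keys)
    moved = sum(1 for k in keys if k % old_parts != k % new_parts)
    return moved / len(keys)


# Going from 7 partitions (mod 7) to 6 partitions (mod 6) over keys 0..41.
fraction = moved_fraction(range(42), 7, 6)
```

In this range only the keys 0 to 5 keep their partition index, so 36 of the 42 keys would have to be reassigned.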
Some data is highly asymmetric. Take aerospace data as an example: although China, Russia, Europe, India, and the United States are all spacefaring nations, differences in national strength and technological level mean that the number of spacecraft of all kinds, the number of manufacturers, and so on in the United States far exceed those of the other countries. This data skew often affects the efficiency of methods such as hash partitioning and prevents the resources of multi-core parallel processors from being used efficiently.
No effective solution has yet been proposed for the above problem in the prior art of the short-board effect in hash partitioning caused by data skew.
Summary of the invention
Embodiments of the present invention provide a hash partition optimization method and device, to at least solve the technical problem in the prior art of the short-board effect in hash partitioning caused by data skew.
According to one aspect of the embodiments of the present invention, a hash partition optimization method is provided, including: obtaining a data set, wherein the data set includes one or more pieces of data; performing a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result; and performing a second hash partition on the first hash partition result to obtain a second hash partition result.
Further, the data set is stored in the form of key-value pairs, and each key-value pair includes at least the number corresponding to that key-value pair.
Further, performing the first hash partition on the data set using the data skew optimization algorithm to obtain the first hash partition result includes: performing the first hash partition on the data set using mapping threads to obtain a first intermediate hash partition result, wherein a mapping thread is used to perform a hash calculation on the number corresponding to each key-value pair to obtain a hash calculation result and to assign key-value pairs with the same hash calculation result to the same partition, and there are one or more mapping threads; and optimizing the first intermediate hash partition result using the data skew optimization algorithm to obtain the first hash partition result.
Further, optimizing the first intermediate hash partition result using the data skew optimization algorithm to obtain the first hash partition result includes: calculating the average partition size of the first intermediate hash partition result; and splitting, according to the average partition size, each partition in the first intermediate hash partition result whose size is greater than the average partition size.
Further, there are multiple mapping threads and each mapping thread has an independent storage space used for writing key-value pairs, and performing the first hash partition on the data set using the mapping threads includes: supervising the usage of each independent storage space, and, when the usage exceeds a preset threshold, allocating storage space to the independent storage space whose usage exceeds the preset threshold.
According to another aspect of the embodiments of the present invention, a hash partition optimization device is further provided, including: an acquisition module, configured to obtain a data set, wherein the data set includes one or more pieces of data; a first partition module, configured to perform a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result; and a second partition module, configured to perform a second hash partition on the first hash partition result to obtain a second hash partition result.
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute the above hash partition optimization method.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is used to run a program, wherein the program, when run, executes the above hash partition optimization method.
According to another aspect of the embodiments of the present invention, a terminal is further provided, including: an acquisition module, configured to obtain a data set, wherein the data set includes one or more pieces of data; a first partition module, configured to perform a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result; a second partition module, configured to perform a second hash partition on the first hash partition result to obtain a second hash partition result; and a processor that runs a program, wherein the program, when run, executes the above hash partition optimization method on the data output from the acquisition module, the first partition module, and the second partition module.
According to another aspect of the embodiments of the present invention, a terminal is further provided, including: an acquisition module, configured to obtain a data set, wherein the data set includes one or more pieces of data; a first partition module, configured to perform a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result; a second partition module, configured to perform a second hash partition on the first hash partition result to obtain a second hash partition result; and a storage medium for storing a program, wherein the program, when run, executes the above hash partition optimization method on the data output from the acquisition module, the first partition module, and the second partition module.
In the embodiments of the present invention, a data set including one or more pieces of data is obtained; a first hash partition is performed on the data set using a data skew optimization algorithm to obtain a first hash partition result; and a second hash partition is performed on the first hash partition result to obtain a second hash partition result. This achieves the purpose of hash partitioning data efficiently, and thereby achieves the technical effects of reducing the impact of skewed data on hash partitioning, evening out the workload handled by each partition thread, alleviating the short-board effect caused by data skew, shortening the partition return time, and improving partitioning efficiency. This in turn solves the technical problem in the prior art of the short-board effect in hash partitioning caused by data skew.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of a hash partition optimization method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an optional hash partition optimization method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an optional hash partition optimization method according to an embodiment of the present invention; and
Fig. 4 is a schematic diagram of a hash partition optimization device according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, provided there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment of a hash partition optimization method is provided. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, such as one running a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.
Fig. 1 shows a hash partition optimization method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain a data set, wherein the data set includes one or more pieces of data.
Specifically, before the data set is obtained, the method further includes a step of obtaining data. After the data is obtained, it may be divided into blocks, and each block is one data set. The data may include, but is not limited to, aerospace information data. The data set may be obtained by reading it from a file or memory in which it is stored; for example, the data set may be stored in a txt text file.
Optionally, the data set is stored in the form of key-value pairs, and each key-value pair includes at least the number corresponding to that key-value pair.
Specifically, a key-value pair may be represented as (Key, Value), where Key is the number corresponding to the key-value pair and Value is the value stored for the key-value pair. When the data set is stored in a txt text file, each key-value pair (Key, Value) occupies one line, and the input key-value pairs can be read from the txt file line by line. Each key-value pair may be 16 bytes in size, of which the number Key may occupy 8 bytes and the corresponding value Value 8 bytes.
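The 16-byte key-value layout can be sketched as follows. The patent does not specify the text format of a line or the byte order, so the comma-separated "Key,Value" line format, the little-endian unsigned encoding, and the names `PAIR`, `parse_line`, and `pack_pair` are all assumptions made for illustration.

```python
import struct

# Assumed layout: 8-byte unsigned Key followed by 8-byte unsigned Value, little-endian.
PAIR = struct.Struct("<QQ")


def parse_line(line):
    """Parse one text line of the assumed form 'Key,Value' into a (key, value) tuple."""
    key_text, value_text = line.strip().split(",")
    return int(key_text), int(value_text)


def pack_pair(key, value):
    """Encode one key-value pair into 16 bytes."""
    return PAIR.pack(key, value)


# Two lines as they might appear in the txt file (hypothetical sample data).
pairs = [parse_line(line) for line in ["1,100\n", "2,200\n"]]
encoded = [pack_pair(key, value) for key, value in pairs]
```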
Step S104: perform a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result.
Optionally, performing the first hash partition on the data set using the data skew optimization algorithm in step S104 to obtain the first hash partition result includes: step S202, performing the first hash partition on the data set using mapping threads to obtain a first intermediate hash partition result, wherein a mapping thread is used to perform a hash calculation on the number corresponding to each key-value pair to obtain a hash calculation result and to assign key-value pairs with the same hash calculation result to the same partition, and there are one or more mapping threads; and step S204, optimizing the first intermediate hash partition result using the data skew optimization algorithm to obtain the first hash partition result.
Specifically, the mapping hash function of a mapping thread may be f_m(Key) = Key mod 2^HashValue, where HashValue is a predefined positive-integer hash parameter with range [1, +∞). When there are multiple mapping threads, each mapping thread applies the mapping hash function f_m(Key) to the Key of each key-value pair (Key, Value) in the data set and assigns key-value pairs with the same result to the same partition. The first intermediate hash partition result may contain t hash partitions in total, whose sizes may be denoted R_1, R_2, ..., R_j, ..., R_t, where t ≥ 2 and R_1 ≤ R_2 ≤ ... ≤ R_j ≤ ... ≤ R_t.
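A minimal single-threaded sketch of the mapping step follows, assuming integer keys and reading the mapping hash function as Key mod 2^HashValue. The function names `map_hash` and `first_pass` and the sample data are illustrative, not part of the patent.

```python
from collections import defaultdict


def map_hash(key, hash_value):
    """Mapping hash function: Key mod 2^HashValue."""
    return key % (1 << hash_value)


def first_pass(pairs, hash_value):
    """Group (key, value) pairs by their mapping hash result."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[map_hash(key, hash_value)].append((key, value))
    return dict(partitions)


# Eight sample pairs partitioned with HashValue = 2, giving at most 4 partitions.
pairs = [(k, k * 10) for k in range(8)]
parts = first_pass(pairs, hash_value=2)
```

With HashValue = 2 the keys 1 and 5 share the hash result 1 and therefore land in the same partition.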
Optionally, optimizing the first intermediate hash partition result using the data skew optimization algorithm in step S204 to obtain the first hash partition result includes: step S302, calculating the average partition size of the first intermediate hash partition result; and step S304, splitting, according to the average partition size, each partition in the first intermediate hash partition result whose size is greater than the average partition size.
Specifically, the first intermediate hash partition result contains t hash partitions in total, and the first hash partition result may contain a hash partitions of uniform size in total. The average partition size of the first intermediate hash partition result may be calculated as follows:
First, calculate the cumulative sum of the partitions ranked in the first k by partition size among the t hash partitions:
Second, calculate the average partition size of the t hash partitions from the above cumulative sum:
If a partition among the t hash partitions satisfies R_j ≤ R_m, the partition is left unprocessed and may be placed in the queue waiting for the second hash partition. If R_j > R_m, the partition is split according to the average partition size R_m, and the resulting partitions are placed in the queue waiting for the second hash partition.
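The splitting rule can be sketched as follows. The text here does not preserve the formula for the average partition size R_m, so this sketch substitutes the plain arithmetic mean of all partition sizes, rounded up; that choice, the function name `split_oversized`, and the sample partitions are assumptions and may differ from the patent's own calculation.

```python
import math


def split_oversized(partitions):
    """Split every partition larger than the average size into average-sized chunks.

    ASSUMPTION: the average size R_m is the arithmetic mean of all partition
    sizes, rounded up. The source's own formula for R_m is not reproduced here.
    """
    average = math.ceil(sum(len(p) for p in partitions) / len(partitions))
    queue = []  # partitions waiting for the second hash partition
    for part in partitions:
        if len(part) <= average:
            queue.append(part)  # small enough: queued unchanged
        else:
            for start in range(0, len(part), average):
                queue.append(part[start:start + average])
    return average, queue


# One heavily loaded partition (10 items) next to three light ones.
skewed = [list(range(10)), [1], [2, 3], [4, 5, 6]]
average, queue = split_oversized(skewed)
```

After the split no queued partition is larger than the average, which is the property that keeps the per-thread workload of the second pass roughly even.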
Through the above step S204 and steps S302 to S304, the uniformity of the first hash partition result can be improved, making it more widely applicable.
Optionally, there are multiple mapping threads and each mapping thread has an independent storage space used for writing key-value pairs, and performing the first hash partition on the data set using the mapping threads in step S202 includes: step S402, supervising the usage of each independent storage space, and, when the usage exceeds a preset threshold, allocating storage space to the independent storage space whose usage exceeds the preset threshold.
Specifically, after the data set is read, it may also be stored using a hash storage structure. The hash storage structure in this application may consist of one contiguous array, each element of which represents a hash bucket, and each hash bucket stores the key-value pairs of one partition. Each hash bucket consists of a free pointer, a contiguous storage space, and a next pointer: the free pointer points to the next free position in the contiguous storage space, the contiguous storage space stores key-value pairs, and the next pointer points to a new hash bucket.
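The bucket layout can be sketched as a small class. Python has no raw pointers, so the free pointer is modeled as an index and the next pointer as an object reference; the class name `HashBucket`, the capacity of 2, and the use of key mod 4 to pick a bucket are illustrative assumptions.

```python
class HashBucket:
    """One bucket: fixed-capacity contiguous slots, a free index, and a next link."""

    def __init__(self, capacity):
        self.slots = [None] * capacity  # contiguous storage for key-value pairs
        self.free = 0                   # index of the next free slot ("free pointer")
        self.next = None                # overflow bucket ("next pointer")

    def insert(self, pair):
        bucket = self
        while bucket.next is not None:  # walk to the last bucket in the chain
            bucket = bucket.next
        if bucket.free == len(bucket.slots):
            # The last bucket is full: chain a new bucket of the same capacity.
            bucket.next = HashBucket(len(bucket.slots))
            bucket = bucket.next
        bucket.slots[bucket.free] = pair
        bucket.free += 1

    def items(self):
        bucket = self
        while bucket is not None:
            yield from bucket.slots[:bucket.free]
            bucket = bucket.next


# The storage structure: one array whose elements are buckets, one per partition.
table = [HashBucket(capacity=2) for _ in range(4)]
for key in range(10):
    table[key % 4].insert((key, key * 10))
```

Bucket 0 receives three pairs and therefore chains one overflow bucket, while bucket 3 receives exactly two pairs and needs none.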
Specifically, provided that each mapping thread has an independent storage space, the following supervision strategy may be used to ensure that the mapping threads execute in parallel and to avoid write conflicts: each mapping thread, in parallel, writes key-value pairs only to its own independent storage space, and the independent storage space of each mapping thread contains the corresponding partition areas; finally, the independent storage spaces of all mapping threads are merged to obtain the first intermediate hash partition result. During this process, the workload of each mapping thread, or the usage of each independent storage space, may be supervised; when the usage exceeds a preset threshold, storage space may be allocated to the independent storage space whose usage exceeds the preset threshold, until all threads have finished.
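The write-conflict-free structure can be sketched with a thread pool in which each worker fills only its own local dictionary and the results are merged at the end. This is a structural illustration only: the names `map_worker` and `parallel_first_pass` are hypothetical, the threshold supervision is not modeled because Python lists grow automatically, and Python threads do not give the multi-core speedup the patent targets.

```python
from concurrent.futures import ThreadPoolExecutor


def map_worker(chunk, hash_value):
    """Partition one chunk into storage that only this worker writes to."""
    local = {}
    for key, value in chunk:
        local.setdefault(key % (1 << hash_value), []).append((key, value))
    return local


def parallel_first_pass(pairs, hash_value, workers):
    """Run mapping workers in parallel, then merge their private results."""
    chunks = [pairs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        locals_ = list(pool.map(lambda c: map_worker(c, hash_value), chunks))
    merged = {}
    for local in locals_:
        for index, items in local.items():
            merged.setdefault(index, []).extend(items)
    return merged


pairs = [(k, k) for k in range(16)]
merged = parallel_first_pass(pairs, hash_value=2, workers=4)
```

Because no two workers share a dictionary, no locking is needed during the partitioning phase; the only sequential part is the final merge.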
Step S106: perform a second hash partition on the first hash partition result to obtain a second hash partition result.
Specifically, the a uniformly sized hash partitions in the first hash partition result may be partitioned by reduce threads, of which there may be one or more. The reduce hash function of a reduce thread may be f_r(Key) = Key mod 2^(HashValue+1). The a partitions are handed to the reduce threads for partitioning: a reduce thread applies the reduce hash function to the Key of each key-value pair (Key, Value) in each partition and assigns key-value pairs with the same result to the same partition. Each partition can thus produce b partition results, where b ≥ 2, giving a*b second-pass hash partition results in total, so a*b partition results are finally output.
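The second pass can be sketched as follows. The formula in the text is ambiguous after extraction; this sketch reads it as Key mod 2^(HashValue+1), which is an interpretation rather than a confirmed detail. The function names `reduce_hash` and `second_pass` and the sample partitions are illustrative.

```python
def reduce_hash(key, hash_value):
    """Second-pass hash function, read here as Key mod 2^(HashValue + 1)."""
    return key % (1 << (hash_value + 1))


def second_pass(first_result, hash_value):
    """Re-partition each first-pass partition by the second-pass hash result."""
    final = []
    for part in first_result:
        sub = {}
        for key, value in part:
            sub.setdefault(reduce_hash(key, hash_value), []).append((key, value))
        final.extend(sub[index] for index in sorted(sub))
    return final


# Two first-pass partitions produced with HashValue = 1 (even keys, odd keys).
first_result = [[(0, "a"), (2, "b"), (4, "c")], [(1, "d"), (3, "e")]]
final = second_pass(first_result, hash_value=1)
```

Here a = 2 first-pass partitions each yield b = 2 sub-partitions, so a*b = 4 partitions are output.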
In the embodiments of the present invention, a data set including one or more pieces of data is obtained; a first hash partition is performed on the data set using a data skew optimization algorithm to obtain a first hash partition result; and a second hash partition is performed on the first hash partition result to obtain a second hash partition result. This achieves the purpose of hash partitioning data efficiently, and thereby achieves the technical effects of reducing the impact of skewed data on hash partitioning, evening out the workload handled by each partition thread, alleviating the short-board effect caused by data skew, shortening the partition return time, and improving partitioning efficiency. This in turn solves the technical problem in the prior art of the short-board effect in hash partitioning caused by data skew.
In a specific embodiment, as shown in Fig. 2, after the data set is obtained, the first hash partition may be performed on it by the mapping threads to obtain t intermediate partitions; data skew optimization is applied to the t intermediate partitions to obtain a partitions; and the hash partition by the reduce threads is applied to the a partitions, finally giving a*b partitions.
In a specific embodiment, the hash partition optimization method of the present invention may be applied in the aerospace field. After aerospace information data is obtained, it is divided evenly and processed in parallel on the basis of multi-CPU, multi-core parallel computing, using mapping threads, reduce threads, and the data skew optimization algorithm. This enables multi-step hash partitioning, improves cache efficiency, and raises the overall performance of multi-CPU, multi-core processors.
The application of the hash partition optimization method of the present invention in the aerospace field can be further illustrated by the following simulation experiment:
1. Simulation conditions: the simulation is carried out in the C++ programming language on a Linux system.
2. Simulation content: in this experiment, the input aerospace information data set is 32M in size, with 32768 key-value pairs in total. The aerospace information data is skewed, with a Zipf factor of 1.25. It is stored using a traditional hash storage structure, the number of mapping threads is 16, and several values of the hash function parameter HashValue are taken. The partitioning efficiency with the data skew optimization method is compared with that without it; the result is shown in Fig. 3.
3. Simulation result: as Fig. 3 shows, when skewed aerospace information data is processed, performance with the data skew optimization method proposed by the present invention is significantly better than without it. This is because the proposed data skew optimization method evens out partition sizes as far as possible, which keeps the workload of each parallel thread roughly the same, shortens waiting time, and improves hash partitioning efficiency.
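The kind of skew used in the experiment can be reproduced in miniature. The sketch below is not the patent's C++ simulation: it draws synthetic keys with a Zipf-like weight of 1 / rank^1.25 and counts how they fall into 16 partitions. The number of distinct keys (64), the partition count, the seed, and the function name `zipf_keys` are all assumptions chosen for illustration.

```python
import random


def zipf_keys(count, distinct, exponent, seed):
    """Draw keys 1..distinct with probability proportional to 1 / rank**exponent."""
    rng = random.Random(seed)
    ranks = list(range(1, distinct + 1))
    weights = [1.0 / rank ** exponent for rank in ranks]
    return rng.choices(ranks, weights=weights, k=count)


# 32768 synthetic keys, hashed into 16 partitions by key mod 16.
keys = zipf_keys(count=32768, distinct=64, exponent=1.25, seed=0)
sizes = [0] * 16
for key in keys:
    sizes[key % 16] += 1
average = sum(sizes) / len(sizes)
```

The partition that holds the most frequent key ends up several times larger than the average, which is the imbalance the splitting step is meant to remove before the second pass.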
Embodiment 2
According to an embodiment of the present invention, a product embodiment of a hash partition optimization device is provided. Fig. 4 shows a hash partition optimization device according to an embodiment of the present invention. As shown in Fig. 4, the device includes an acquisition module, a first partition module, and a second partition module, wherein the acquisition module is configured to obtain a data set, the data set including one or more pieces of data; the first partition module is configured to perform a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result; and the second partition module is configured to perform a second hash partition on the first hash partition result to obtain a second hash partition result.
In the embodiments of the present invention, the acquisition module obtains a data set including one or more pieces of data; the first partition module performs a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result; and the second partition module performs a second hash partition on the first hash partition result to obtain a second hash partition result. This achieves the purpose of hash partitioning data efficiently, and thereby achieves the technical effects of reducing the impact of skewed data on hash partitioning, evening out the workload handled by each partition thread, alleviating the short-board effect caused by data skew, shortening the partition return time, and improving partitioning efficiency. This in turn solves the technical problem in the prior art of the short-board effect in hash partitioning caused by data skew.
It should be noted here that the acquisition module, the first partition module, and the second partition module correspond to steps S102 to S106 in Embodiment 1. The examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of Embodiment 1. It should be noted that, as part of the device, these modules may run in a computer system, such as one running a set of computer-executable instructions.
Optionally, the data set is stored in the form of key-value pairs, and each key-value pair includes at least the number corresponding to that key-value pair.
Optionally, the first partition module includes a third partition module and an optimization module, wherein the third partition module is configured to perform the first hash partition on the data set using mapping threads to obtain a first intermediate hash partition result, a mapping thread being used to perform a hash calculation on the number corresponding to each key-value pair to obtain a hash calculation result and to assign key-value pairs with the same hash calculation result to the same partition, with one or more mapping threads; and the optimization module is configured to optimize the first intermediate hash partition result using the data skew optimization algorithm to obtain the first hash partition result.
It should be noted here that the third partition module and the optimization module correspond to steps S202 to S204 in Embodiment 1. The examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of Embodiment 1. It should be noted that, as part of the device, these modules may run in a computer system, such as one running a set of computer-executable instructions.
Optionally, the optimization module includes a calculation module and a splitting module, wherein the calculation module is configured to calculate the average partition size of the first intermediate hash partition result; and the splitting module is configured to split, according to the average partition size, each partition in the first intermediate hash partition result whose size is greater than the average partition size.
It should be noted here that the calculation module and the splitting module correspond to steps S302 to S304 in Embodiment 1. The examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of Embodiment 1. It should be noted that, as part of the device, these modules may run in a computer system, such as one running a set of computer-executable instructions.
Optionally, there are multiple mapping threads and each mapping thread has an independent storage space used for writing key-value pairs. The third partition module includes a supervision module, configured to supervise the usage of each independent storage space and, when the usage exceeds a preset threshold, to allocate storage space to the independent storage space whose usage exceeds the preset threshold.
It should be noted here that the supervision module corresponds to step S402 in Embodiment 1. The examples and application scenarios realized by this module are the same as those of the corresponding step, but are not limited to the disclosure of Embodiment 1. It should be noted that, as part of the device, this module may run in a computer system, such as one running a set of computer-executable instructions.
Embodiment 3
According to embodiments of the present invention, there is provided a kind of product embodiments of storage medium, the storage medium include storage Program, wherein, equipment performs above-mentioned hash partition optimization method where controlling storage medium when program is run.
Embodiment 4
According to embodiments of the present invention, there is provided a kind of product embodiments of processor, the processor are used for operation program, its In, program performs above-mentioned hash partition optimization method when running.
Embodiment 5
According to embodiments of the present invention, there is provided a kind of product embodiments of terminal, the terminal include acquisition module, first point Area's module, the second division module and processor, wherein, acquisition module, for obtaining data set, wherein, data set includes one Or multiple data;First division module, for carrying out first time hash partition to data set using data skew optimized algorithm, obtain To the first hash partition result;Second division module, for carrying out second of hash partition to the first hash partition result, obtain Second hash partition result;Processor, processor operation program, wherein, for from acquisition module, the first subregion when program is run Module and the data of the second division module output perform above-mentioned hash partition optimization method.
Embodiment 6
According to an embodiment of the present invention, a product embodiment of a terminal is provided. The terminal includes an acquisition module, a first partition module, a second partition module, and a storage medium, wherein: the acquisition module is configured to obtain a data set, the data set including one or more pieces of data; the first partition module is configured to perform a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result; the second partition module is configured to perform a second hash partition on the first hash partition result to obtain a second hash partition result; and the storage medium is configured to store a program, wherein the program, when run, performs the above hash partition optimization method on the data output from the acquisition module, the first partition module, and the second partition module.
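As an illustration only, the cooperation of the acquisition module, the first partition module, and the second partition module might be sketched as follows. The function names, the use of CRC32 as the hash function, the use of a second seed for the second pass, and the pluggable `rebalance` step standing in for the data skew optimization algorithm are all assumptions of this sketch and are not specified in the text above.

```python
import zlib
from collections import defaultdict


def stable_hash(number: int, seed: int = 0) -> int:
    """Deterministic hash of a key-value pair's number (CRC32 is a stand-in choice)."""
    return zlib.crc32(f"{seed}:{number}".encode())


def acquire(records):
    """Acquisition module: wrap raw records as (number, value) key-value pairs."""
    return list(enumerate(records))


def first_partition(pairs, n_parts, rebalance=lambda parts: parts):
    """First partition module: hash each pair's number, then apply a skew step.

    `rebalance` is a placeholder for the data skew optimization algorithm;
    by default it leaves the intermediate result unchanged.
    """
    parts = defaultdict(list)
    for number, value in pairs:
        parts[stable_hash(number) % n_parts].append((number, value))
    return rebalance([parts[i] for i in range(n_parts)])


def second_partition(first_result, n_sub):
    """Second partition module: re-hash each first-pass partition (assumed scheme)."""
    result = []
    for part in first_result:
        subs = [[] for _ in range(n_sub)]
        for number, value in part:
            subs[stable_hash(number, seed=1) % n_sub].append((number, value))
        result.append(subs)
    return result
```

A caller would chain the three functions, for example `second_partition(first_partition(acquire(records), 3), 2)`; every key-value pair appears exactly once in the final result.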
The above embodiments of the present invention are for description only and do not indicate the relative merits of any embodiment.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, refer to the related description of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. A hash partition optimization method, characterized by comprising:
    obtaining a data set, wherein the data set includes one or more pieces of data;
    performing a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result;
    performing a second hash partition on the first hash partition result to obtain a second hash partition result.
  2. The method according to claim 1, characterized in that the data set is stored in the form of key-value pairs, and each key-value pair includes at least a number corresponding to that key-value pair.
  3. The method according to claim 2, characterized in that performing the first hash partition on the data set using the data skew optimization algorithm to obtain the first hash partition result comprises:
    performing a first hash partition on the data set using a mapping thread to obtain a first intermediate hash partition result, wherein the mapping thread is configured to perform a hash calculation on the number corresponding to each key-value pair to obtain a hash calculation result and to assign key-value pairs with the same hash calculation result to the same partition, and there are one or more mapping threads;
    optimizing the first intermediate hash partition result using the data skew optimization algorithm to obtain the first hash partition result.
  4. The method according to claim 3, characterized in that optimizing the first intermediate hash partition result using the data skew optimization algorithm to obtain the first hash partition result comprises:
    calculating the average partition size of the first intermediate hash partition result;
    splitting, according to the average partition size, each partition in the first intermediate hash partition result whose partition size is greater than the average partition size.
  5. The method according to claim 3 or 4, characterized in that there are multiple mapping threads and each mapping thread has an independent memory space, the independent memory space is used for writing the key-value pairs, and performing the first hash partition on the data set using the mapping threads comprises:
    monitoring the usage of each independent memory space and, when the usage exceeds a preset threshold, allocating memory space to the independent memory space whose usage exceeds the preset threshold.
  6. A hash partition optimization device, characterized by comprising:
    an acquisition module, configured to obtain a data set, wherein the data set includes one or more pieces of data;
    a first partition module, configured to perform a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result;
    a second partition module, configured to perform a second hash partition on the first hash partition result to obtain a second hash partition result.
  7. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the hash partition optimization method according to any one of claims 1 to 5.
  8. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the hash partition optimization method according to any one of claims 1 to 5.
  9. A terminal, characterized by comprising:
    an acquisition module, configured to obtain a data set, wherein the data set includes one or more pieces of data;
    a first partition module, configured to perform a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result;
    a second partition module, configured to perform a second hash partition on the first hash partition result to obtain a second hash partition result;
    a processor, which runs a program, wherein the program, when run, performs the hash partition optimization method according to any one of claims 1 to 5 on the data output from the acquisition module, the first partition module, and the second partition module.
  10. A terminal, characterized by comprising:
    an acquisition module, configured to obtain a data set, wherein the data set includes one or more pieces of data;
    a first partition module, configured to perform a first hash partition on the data set using a data skew optimization algorithm to obtain a first hash partition result;
    a second partition module, configured to perform a second hash partition on the first hash partition result to obtain a second hash partition result;
    a storage medium, configured to store a program, wherein the program, when run, performs the hash partition optimization method according to any one of claims 1 to 5 on the data output from the acquisition module, the first partition module, and the second partition module.
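
As an illustration only, the skew optimization recited in claim 4 (compute the average partition size, then split every partition larger than the average according to that size) might be sketched as follows. The function name `split_oversized`, the rounding of the average up to a whole number, and the choice to cut oversized partitions into consecutive chunks are assumptions of this sketch; the claims do not fix these details.

```python
import math


def split_oversized(partitions):
    """Split every partition larger than the average size into average-sized chunks.

    `partitions` is a list of lists (the first intermediate hash partition result).
    The average is rounded up and clamped to at least 1; both are assumptions.
    """
    if not partitions:
        return []
    average = math.ceil(sum(len(p) for p in partitions) / len(partitions))
    average = max(average, 1)
    result = []
    for part in partitions:
        if len(part) > average:
            # Cut the oversized partition into consecutive chunks of `average` items.
            result.extend(part[i:i + average] for i in range(0, len(part), average))
        else:
            result.append(part)
    return result
```

For example, with intermediate partitions of sizes 7, 1, and 1 the average is 3, so `split_oversized([[1, 2, 3, 4, 5, 6, 7], [8], [9]])` returns `[[1, 2, 3], [4, 5, 6], [7], [8], [9]]`, and no partition exceeds the average size.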
CN201710656815.9A 2017-08-03 2017-08-03 Hash partition optimization method and device Pending CN107633001A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710656815.9A CN107633001A (en) 2017-08-03 2017-08-03 Hash partition optimization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710656815.9A CN107633001A (en) 2017-08-03 2017-08-03 Hash partition optimization method and device

Publications (1)

Publication Number Publication Date
CN107633001A true CN107633001A (en) 2018-01-26

Family

ID=61099515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710656815.9A Pending CN107633001A (en) 2017-08-03 2017-08-03 Hash partition optimization method and device

Country Status (1)

Country Link
CN (1) CN107633001A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133661A (en) * 2014-07-30 2014-11-05 西安电子科技大学 Multi-core parallel hash partitioning optimizing method based on column storage
US20150234846A1 (en) * 2014-02-17 2015-08-20 Netapp, Inc. Partitioning file system namespace
CN105183880A (en) * 2015-09-22 2015-12-23 浪潮集团有限公司 Hash join method and device
CN106156159A (en) * 2015-04-16 2016-11-23 阿里巴巴集团控股有限公司 A kind of table connection processing method, device and cloud computing system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
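Yuan Tong et al.: "Hash partitioning optimization based on MapReduce on multi-core processors" (多核处理器中基于MapReduce的哈希划分优化), Journal of Xi'an Jiaotong University (《西安交通大学学报》) *
Zhao Yulan: "Optimization algorithm for skewed two-table joins based on MapReduce" (基于MapReduce的两表数据倾斜连接的优化算法), Journal of Jilin University (Science Edition) (《吉林大学学报(理学版)》) *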

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492657A (en) * 2018-09-18 2019-03-19 平安科技(深圳)有限公司 Handwriting samples digitizing solution, device, computer equipment and storage medium
CN111694693A (en) * 2019-03-12 2020-09-22 上海晶赞融宣科技有限公司 Data stream storage method and device and computer storage medium
CN110532425A (en) * 2019-08-19 2019-12-03 深圳市网心科技有限公司 Video data placement formula storage method, device, computer equipment and storage medium
CN110532425B (en) * 2019-08-19 2022-04-01 深圳市网心科技有限公司 Video data distributed storage method and device, computer equipment and storage medium
CN112286917A (en) * 2020-10-22 2021-01-29 北京锐安科技有限公司 Data processing method and device, electronic equipment and storage medium
CN113516506A (en) * 2021-06-10 2021-10-19 深圳市云网万店科技有限公司 Data processing method and device and electronic equipment
CN113516506B (en) * 2021-06-10 2024-04-26 深圳市云网万店科技有限公司 Data processing method and device and electronic equipment
CN116467354A (en) * 2023-06-15 2023-07-21 本原数据(北京)信息技术有限公司 Database query method and device, computer equipment and storage medium
CN116467354B (en) * 2023-06-15 2023-09-12 本原数据(北京)信息技术有限公司 Database query method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180126
