CN106354831A - Method and device for loading segmented data blocks - Google Patents

Method and device for loading segmented data blocks

Info

Publication number
CN106354831A
Authority
CN
China
Prior art keywords
data
newline
url
offset address
set space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610777791.8A
Other languages
Chinese (zh)
Inventor
武新
崔维力
李国节
赵董兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Original Assignee
TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Priority to CN201610777791.8A
Publication of CN106354831A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/06 Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and a device for loading segmented data blocks. The method comprises: determining whether the offset address of a received data block is equal to 0 and, if so, reading the data specified in the URL (uniform resource locator); if the offset address is greater than 0, searching for a line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address; if a line separator is found, discarding the data in front of it; and if no line separator is found, discarding all data within the preset space behind the offset address. With the method and the device, each loading node can determine its data content from its own data block, so data can be loaded in parallel; load among the loading nodes can be balanced, and the overall loading speed can be increased.

Description

Method and device for loading segmented data blocks
Technical field
The invention belongs to the technical field of distributed databases, and in particular relates to a method and a device for loading segmented data blocks.
Background
A distributed database system typically uses smaller computer systems. Each computer can be placed at a separate location, may hold a complete or partial copy of the DBMS, and has its own local database. Many computers at different locations are interconnected by a network and together form a complete, global large database that is logically centralized and physically distributed.
In a large-scale distributed analytical database cluster system, large batches of data often need to be loaded from external data sources. Faced with a large number of database cluster nodes and massive external data, using as many cluster nodes as possible to load data in parallel is an effective way to balance the load among loading nodes and to increase the overall loading speed. How to segment continuous data efficiently, and how to load the data after segmentation, are key factors in increasing the overall loading speed.
Summary of the invention
In view of this, embodiments of the present invention provide a method and a device for loading segmented data blocks, so that loading nodes can load data quickly.
In one aspect, an embodiment of the present invention provides a method for loading segmented data blocks, comprising:
determining whether the offset address of a received data block is equal to 0 and, if it is equal to 0, reading the data specified in the URL;
if it is greater than 0, searching for a line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address;
if a line separator is found, discarding the data before the line separator;
otherwise, discarding all data within the preset space behind the offset address.
Further, the data specified in the URL comprises:
the data within the specified offset position and space.
Further, the method also comprises:
after reading the data specified in the URL, continuing to read the data within the preset space;
searching for a line separator in the range from the last line separator in the data specified in the URL to the end of the preset space;
if a line separator is found, discarding the data after the line separator;
otherwise, retaining all data.
Further, the method also comprises:
scanning the cached data within a preset range and determining the line separators within the preset range.
Further, the preset space is 4 MB.
In another aspect, an embodiment of the present invention also provides a device for loading segmented data blocks, comprising:
a judging unit, configured to determine whether the offset address of a received data block is equal to 0 and, if it is equal to 0, to read the data specified in the URL;
a first searching unit, configured to, if the offset address is greater than 0, search for a line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address;
a first discarding unit, configured to discard the data before the line separator if a line separator is found;
a second discarding unit, configured to discard all data within the preset space behind the offset address.
Further, the data specified in the URL comprises:
the data within the specified offset position and space.
Further, the device also comprises:
a reading unit, configured to continue reading the data within the preset space after the data specified in the URL has been read;
a second searching unit, configured to search for a line separator in the range from the last line separator in the data specified in the URL to the end of the preset space;
a third discarding unit, configured to discard the data after the line separator if a line separator is found;
a retaining unit, configured to retain all data.
Further, the device also comprises:
a scanning unit, configured to scan the cached data within a preset range and determine the line separators within the preset range.
Further, the preset space is 4 MB.
The embodiments of the present invention obtain the specified data in the URL from the offset address of the received data block, so that each loading node can determine its data content from its own data block and data can be loaded in parallel. This balances the load among the loading nodes and increases the overall loading speed.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the method for loading segmented data blocks provided by embodiment one of the present invention;
Fig. 2 is a schematic flowchart of the method for loading segmented data blocks provided by embodiment two of the present invention;
Fig. 3 is a schematic flowchart of the method for loading segmented data blocks provided by embodiment three of the present invention;
Fig. 4 is a schematic structural diagram of the device for loading segmented data blocks provided by embodiment four of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment one
Fig. 1 is a schematic flowchart of the method for loading segmented data blocks provided by embodiment one of the present invention. This embodiment is applicable to the case in which the loading nodes of a distributed database system load data in parallel. The method can be executed by a device for loading segmented data blocks; the device can be implemented in software and/or hardware and can be integrated into a loading node of the distributed database system.
Referring to Fig. 1, the method for loading segmented data blocks comprises:
S110, determine whether the offset address of the received data block is equal to 0; if it is equal to 0, read the data specified in the URL.
In the embodiments of the present invention, a single computer node in the database cluster that executes a file loading task is a loading node. When loading a data file of gigabyte size or larger, the database cluster can, according to a load-balancing policy, perform coarse-grained logical segmentation of the large data file and distribute the resulting file fragment information to different cluster nodes. The file fragment information that the database cluster passes to a loading node contains two parameters: the file fragment offset address and the file fragment length. These parameters are expressed as a URL suffix of the form #offset=value&length=value.
File fragment offset address: called the offset address or offset for short, it is the distance from the start of the file to the first byte of the file fragment, in bytes.
File fragment length: called the fragment length or length for short, it is the size of the file fragment, in bytes.
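The URL suffix described above can be read with standard URL parsing. The following is a minimal Python sketch; the function name `parse_fragment_info` and the example URL are hypothetical and are not taken from the patent text.

```python
from typing import Tuple
from urllib.parse import urlsplit, parse_qs


def parse_fragment_info(url: str) -> Tuple[int, int]:
    """Return (offset, length) in bytes from a URL suffix such as
    '#offset=1024&length=4096'. Hypothetical helper for illustration."""
    fragment = urlsplit(url).fragment               # the text after '#'
    params = parse_qs(fragment, strict_parsing=True)
    offset = int(params["offset"][0])
    length = int(params["length"][0])
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    return offset, length


# Example URL is an assumption; only the '#offset=...&length=...' shape comes from the text.
print(parse_fragment_info("file:///data/big.csv#offset=1024&length=4096"))
```

A loading node would then treat `offset == 0` as the first data block and any larger value as a block whose start has to be relocated.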
Data is loaded row by row: each data row is added to a database table as one record. Because the cluster segments the file coarsely, using only file size as the measure, the file position denoted by offset does not necessarily fall on the first byte of a data row, and the file position denoted by offset+length does not necessarily fall just after the last byte of a data row. To ensure that the data loaded by all loading nodes together covers all the file data, and that the data loaded by different nodes does not overlap, each loading node must relocate the boundaries of its file fragment with a stable algorithm. The file fragment information after this processing is called the effective file fragment information; its parameters are the effective offset address offset' and the effective fragment length length'. For a node, if the offset address is 0, the data block to be loaded is the first data block, and the node reads the data specified in the URL directly.
S120, if it is greater than 0, search for a line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address.
During data loading, the maximum supported length of a single data row is specified as 4 MB; correct loading cannot be guaranteed for rows that exceed this length. This ensures that row data no longer than 4 MB (including the line separator) is processed correctly. If the offset address of the data block is greater than 0, the data block is not the first data block, and the node needs to search for the first line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address, to ensure that the loaded data is complete. The preset space is 4 MB, and this 4 MB space can be regarded as the head block.
S130, if a line separator is found, discard the data before the line separator.
If a line separator is found, the boundary of the corresponding data row is determined. The data before the line separator is loaded by another node, so it is discarded.
S140, if no line separator is found, discard all data within the preset space behind the offset address.
If no line separator is found, the data within the preset space is not data to be loaded by this node, and all data within the preset space must be discarded.
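Steps S110 to S140 can be sketched in Python on an in-memory byte string. This is an illustration under stated assumptions rather than the patented implementation: the name `effective_start`, the `sep` and `preset` parameters, and the choice to start the scan one separator length before the offset (by analogy with the tail-block interval given in embodiment two) are all assumptions.

```python
PRESET_SPACE = 4 * 1024 * 1024  # 4 MB preset space (head block), as stated in the text


def effective_start(data: bytes, offset: int, sep: bytes = b"\n",
                    preset: int = PRESET_SPACE) -> int:
    """Return the position of the first byte this node should load."""
    if offset == 0:
        # S110: first data block, read directly from the start.
        return 0
    # S120: scan from one separator length before the offset (assumption)
    # to the end of the preset space behind the offset.
    window_start = max(offset - len(sep), 0)
    window_end = min(offset + preset, len(data))
    pos = data.find(sep, window_start, window_end)
    if pos == -1:
        # S140: no separator found, discard the whole preset space.
        return window_end
    # S130: discard everything up to and including the separator.
    return pos + len(sep)


rows = b"aaa\nbbb\nccc\n"
print(effective_start(rows, 5))  # offset falls inside "bbb", so loading starts at "ccc"
```

Starting the scan slightly before the offset means that a fragment which already begins exactly at a row start keeps that row instead of skipping it.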
The embodiments of the present invention obtain the specified data in the URL from the offset address of the received data block, so that each loading node can determine its data content from its own data block. This achieves parallel data loading while guaranteeing the accuracy and uniqueness of the loaded data, balances the load among the loading nodes, and increases the overall loading speed.
Embodiment two
Fig. 2 is a schematic flowchart of the method for loading segmented data blocks provided by embodiment two of the present invention. This embodiment builds on the embodiment above. Further, the method also comprises: after reading the data specified in the URL, continuing to read the data within the preset space; searching for a line separator in the range from the last line separator in the data specified in the URL to the end of the preset space; and, if a line separator is found, discarding the data after the line separator.
Referring to Fig. 2, the method for loading segmented data blocks comprises:
S210, determine whether the offset address of the received data block is equal to 0; if it is equal to 0, read the data specified in the URL.
S220, if it is greater than 0, search for a line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address.
S230, if a line separator is found, discard the data before the line separator.
S240, if no line separator is found, discard all data within the preset space behind the offset address.
S250, after reading the data specified in the URL, continue to read the data within the preset space.
For example, after the data of the length specified in the URL has been read, an extra 4 MB of data is read; this 4 MB of data can be regarded as the tail block.
S260, search for a line separator in the range from the last line separator in the data specified in the URL to the end of the preset space.
The first line separator is searched for, from front to back, in the interval from the position one line-separator length (sep_len) before the tail block to the end position of the tail block. That is, the scan interval can be regarded as [offset1+length1-sep_len, offset1+length1+tail].
S270, if a line separator is found, discard the data after the line separator; otherwise retain all data.
This embodiment adds the following steps: after reading the data specified in the URL, continuing to read the data within the preset space; searching for a line separator in the range from the last line separator in the data specified in the URL to the end of the preset space; and, if a line separator is found, discarding the data after the line separator. The data to be loaded can thus be determined effectively, which ensures the integrity of the loaded data.
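The tail-block handling of S250 to S270 can be sketched the same way. The function name `effective_end` and its parameters are hypothetical; the scan interval follows the [offset1+length1-sep_len, offset1+length1+tail] range given above, applied to an in-memory byte string for simplicity.

```python
TAIL_SPACE = 4 * 1024 * 1024  # 4 MB tail block, as stated in the text


def effective_end(data: bytes, offset: int, length: int, sep: bytes = b"\n",
                  tail: int = TAIL_SPACE) -> int:
    """Return the position just past the last byte this node should load."""
    end = offset + length
    # S260: scan from one separator length before the tail block
    # to the end of the tail block (clipped to the end of the data).
    window_start = max(end - len(sep), 0)
    window_end = min(end + tail, len(data))
    pos = data.find(sep, window_start, window_end)
    if pos == -1:
        # S270, "otherwise" branch: retain all data that was read.
        return window_end
    # S270: discard the data after the separator.
    return pos + len(sep)


rows = b"aaa\nbbb\nccc\n"
print(effective_end(rows, 0, 5))  # fragment ends inside "bbb", so the node loads through "bbb\n"
```

Because the node that owns the next fragment scans the same interval around the same cut position, both nodes settle on the same row boundary, which is what keeps the loaded ranges free of gaps and overlaps.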
Embodiment three
Fig. 3 is a schematic flowchart of the method for loading segmented data blocks provided by embodiment three of the present invention. This embodiment builds on the embodiments above. Further, the method also comprises: scanning the cached data within a preset range and determining the line separators within the preset range.
Referring to Fig. 3, the method for loading segmented data blocks comprises:
S310, determine whether the offset address of the received data block is equal to 0; if it is equal to 0, read the data specified in the URL.
S320, if it is greater than 0, search for a line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address.
S330, if a line separator is found, discard the data before the line separator.
S340, if no line separator is found, discard all data within the preset space behind the offset address.
S350, after reading the data specified in the URL, continue to read the data within the preset space.
S360, search for a line separator in the range from the last line separator in the data specified in the URL to the end of the preset space.
S370, if a line separator is found, discard the data after the line separator; otherwise retain all data.
S380, scan the cached data within the preset range and determine the line separators within the preset range.
When the line separator is longer than 1 byte and its two characters \r and \n lie on either side of the boundary of a 4 MB head or tail block, \r and \n are cut apart in the middle. This leads to the following: node 1 produces erroneous data because of the incomplete line separator at the end of its data, and node 2 produces erroneous data because of the superfluous incomplete line separator at the start of its data. In this embodiment, setting a lookup box of suitable size (represented by a red dashed box) enables cached scanning and improves the efficiency of finding line separators. The minimum size of the lookup box is (sep_len + 4 MB), which can hold a full data row at once and prevents the line separator from being cut in the middle, so that node 1 and node 2 can both segment and load data correctly.
By adding the step of scanning the cached data within the preset range and determining the line separators within it, this embodiment avoids erroneous data between nodes caused by incomplete line separators, so that data can be segmented and loaded correctly.
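The effect of a lookup box that starts sep_len bytes before the cut can be checked with a small Python sketch. The function `node_rows`, the sample data, and the fragment sizes are hypothetical; the first cut is placed deliberately between \r and \n.

```python
def node_rows(data: bytes, offset: int, length: int, sep: bytes = b"\r\n",
              preset: int = 4 * 1024 * 1024) -> list:
    """Rows that one loading node would load for the fragment (offset, length)."""
    n = len(sep)

    def boundary(cut: int) -> int:
        # Lookup box of size sep_len + preset: [cut - sep_len, cut + preset).
        if cut == 0:
            return 0
        lo = max(cut - n, 0)
        hi = min(cut + preset, len(data))
        pos = data.find(sep, lo, hi)
        return hi if pos == -1 else pos + n

    start = boundary(offset)            # relocated head
    end = boundary(offset + length)     # relocated tail
    rows = data[start:end].split(sep)
    if rows and rows[-1] == b"":
        rows.pop()                      # drop the empty piece after the final separator
    return rows


data = b"alpha\r\nbeta\r\ngamma\r\ndelta\r\n"
fragments = [(0, 6), (6, 8), (14, 13)]  # the cut at byte 6 splits "\r\n"
for off, length in fragments:
    print(off, length, node_rows(data, off, length))
```

In this run each row is loaded by exactly one node even though the first cut separates \r from \n, because both neighbouring nodes begin their scan early enough to see the whole separator.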
Embodiment four
Fig. 4 is a schematic structural diagram of the device for loading segmented data blocks provided by embodiment four of the present invention. As shown in Fig. 4, the device comprises:
a judging unit 410, configured to determine whether the offset address of a received data block is equal to 0 and, if it is equal to 0, to read the data specified in the URL;
a first searching unit 420, configured to, if the offset address is greater than 0, search for a line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address;
a first discarding unit 430, configured to discard the data before the line separator if a line separator is found;
a second discarding unit 440, configured to discard all data within the preset space behind the offset address.
Further, the data specified in the URL comprises:
the data within the specified offset position and space.
Further, the device also comprises:
a reading unit 450, configured to continue reading the data within the preset space after the data specified in the URL has been read;
a second searching unit 460, configured to search for a line separator in the range from the last line separator in the data specified in the URL to the end of the preset space;
a third discarding unit 470, configured to discard the data after the line separator if a line separator is found;
a retaining unit 480, configured to retain all data.
Further, the device also comprises:
a scanning unit 490, configured to scan the cached data within a preset range and determine the line separators within the preset range.
Further, the preset space is 4 MB.
The embodiments of the present invention obtain the specified data in the URL from the offset address of the received data block, so that each loading node can determine its data content from its own data block and data can be loaded in parallel. This balances the load among the loading nodes and increases the overall loading speed.
A person of ordinary skill in the art will understand that all or some of the steps of the method embodiments above can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments above. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the embodiments above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for loading segmented data blocks, characterized by comprising:
determining whether the offset address of a received data block is equal to 0 and, if it is equal to 0, reading the data specified in the URL;
if it is greater than 0, searching for a line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address;
if a line separator is found, discarding the data before the line separator;
otherwise, discarding all data within the preset space behind the offset address.
2. The method according to claim 1, characterized in that the data specified in the URL comprises:
the data within the specified offset position and space.
3. The method according to claim 2, characterized in that the method further comprises:
after reading the data specified in the URL, continuing to read the data within the preset space;
searching for a line separator in the range from the last line separator in the data specified in the URL to the end of the preset space;
if a line separator is found, discarding the data after the line separator;
otherwise, retaining all data.
4. The method according to claim 3, characterized in that the method further comprises:
scanning the cached data within a preset range and determining the line separators within the preset range.
5. The method according to claim 1 or 3, characterized in that the preset space is 4 MB.
6. A device for loading segmented data blocks, characterized by comprising:
a judging unit, configured to determine whether the offset address of a received data block is equal to 0 and, if it is equal to 0, to read the data specified in the URL;
a first searching unit, configured to, if the offset address is greater than 0, search for a line separator in the range from the first line separator in front of the offset address to the preset space behind the offset address;
a first discarding unit, configured to discard the data before the line separator if a line separator is found;
a second discarding unit, configured to discard all data within the preset space behind the offset address.
7. The device according to claim 6, characterized in that the data specified in the URL comprises:
the data within the specified offset position and space.
8. The device according to claim 7, characterized in that the device further comprises:
a reading unit, configured to continue reading the data within the preset space after the data specified in the URL has been read;
a second searching unit, configured to search for a line separator in the range from the last line separator in the data specified in the URL to the end of the preset space;
a third discarding unit, configured to discard the data after the line separator if a line separator is found;
a retaining unit, configured to retain all data.
9. The device according to claim 8, characterized in that the device further comprises:
a scanning unit, configured to scan the cached data within a preset range and determine the line separators within the preset range.
10. The device according to claim 6 or 8, characterized in that the preset space is 4 MB.
CN201610777791.8A 2016-08-31 2016-08-31 Method and device for loading segmented data blocks Pending CN106354831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610777791.8A CN106354831A (en) 2016-08-31 2016-08-31 Method and device for loading segmented data blocks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610777791.8A CN106354831A (en) 2016-08-31 2016-08-31 Method and device for loading segmented data blocks

Publications (1)

Publication Number Publication Date
CN106354831A true CN106354831A (en) 2017-01-25

Family

ID=57857134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610777791.8A Pending CN106354831A (en) 2016-08-31 2016-08-31 Method and device for loading segmented data blocks

Country Status (1)

Country Link
CN (1) CN106354831A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292420A (en) * 2022-10-10 2022-11-04 天津南大通用数据技术股份有限公司 Method and device for rapidly loading data in distributed database
CN115292373A (en) * 2022-10-09 2022-11-04 天津南大通用数据技术股份有限公司 Method and device for segmenting data block

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663090A (en) * 2012-04-10 2012-09-12 华为技术有限公司 Method and device for inquiry metadata
CN102841860A (en) * 2012-08-17 2012-12-26 珠海世纪鼎利通信科技股份有限公司 Large data volume information storage and access method
CN103164538A (en) * 2013-04-11 2013-06-19 深圳市华力特电气股份有限公司 Method and device for analyzing data
CN103544285A (en) * 2013-10-28 2014-01-29 华为技术有限公司 Data loading method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292373A (en) * 2022-10-09 2022-11-04 天津南大通用数据技术股份有限公司 Method and device for segmenting data block
CN115292373B (en) * 2022-10-09 2023-01-24 天津南大通用数据技术股份有限公司 Method and device for segmenting data block
CN115292420A (en) * 2022-10-10 2022-11-04 天津南大通用数据技术股份有限公司 Method and device for rapidly loading data in distributed database

Similar Documents

Publication Publication Date Title
EP3678346A1 (en) Blockchain smart contract verification method and apparatus, and storage medium
CN103699585B (en) Methods, devices and systems for file metadata storage and file recovery
TWI499909B (en) Hierarchical immutable content-addressable memory processor
CN105630955B (en) A kind of data acquisition system member management method of high-efficiency dynamic
US9871727B2 (en) Routing lookup method and device and method for constructing B-tree structure
CN105095116A (en) Cache replacing method, cache controller and processor
CN107122130B (en) Data deduplication method and device
CN103729303A (en) Data writing and data reading methods of Flash
CN103106158A (en) Memory system including key-value store
CN105117351A (en) Method and apparatus for writing data into cache
CN105677904B (en) Small documents storage method and device based on distributed file system
CN104238962A (en) Method and device for writing data into cache
CN106407224A (en) Method and device for file compaction in KV (Key-Value)-Store system
CN108241632A (en) A kind of data verification method of data base-oriented Data Migration
CN106354831A (en) Method and device for loading segmented data blocks
CN103914483A (en) File storage method and device and file reading method and device
CN111159140B (en) Data processing method, device, electronic equipment and storage medium
CN103699435B (en) Load balancing method and device
CN101741708A (en) Method, device and system for storing data
CN114297368A (en) Efficient keyword filtering method realized in FPGA (field programmable Gate array) way
CN106254270A (en) A kind of queue management method and device
CN110018794A (en) A kind of rubbish recovering method, device, storage system and readable storage medium storing program for executing
CN104750846A (en) Method and device for finding substring
CN105279166A (en) File management method and system
CN104077555A (en) Method and device for identifying badcase in image search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170125

RJ01 Rejection of invention patent application after publication