CN105487925A - Data scanning method and device - Google Patents

Publication number
CN105487925A
Authority
CN
China
Prior art keywords
data
subset
scan
scan instruction
scanning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510898272.2A
Other languages
Chinese (zh)
Other versions
CN105487925B (en)
Inventor
周后取
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Uniview Technologies Co Ltd
Original Assignee
Zhejiang Uniview Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Uniview Technologies Co Ltd
Priority to CN201510898272.2A
Publication of CN105487925A
Application granted
Publication of CN105487925B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System (AREA)

Abstract

The invention provides a data scanning method and device. The method may comprise: obtaining subset information of a data table corresponding to a current task; determining, according to the subset information, the subsets covered by each scan instruction contained in the current task when querying the data table; and starting scan processes in one-to-one correspondence with the covered subsets, so as to carry out data scanning. The invention can reduce the overall number of Map processes for an analysis task, thereby reducing the overhead and delay of task scheduling and improving the overall analysis performance of the task.

Description

Data scanning method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a data scanning method and device.
Background
In the related art, parallel analysis of large-scale data sets can be implemented with computing frameworks such as Map-Reduce. For example, the HBase (Hadoop Database) database provides a built-in facility for analyzing its data through Map-Reduce, so that a user can initiate a scanning analysis of database tables by submitting a List<Scan> task.
However, when processing a List<Scan> task, the related art has to start a separate Map scan process for each Scan instruction in the task. A List<Scan> task usually contains many Scan instructions, possibly hundreds, so a very large number of Map scan processes must be started at the same time. Because scheduling, starting, and exiting each Map scan process takes a long time, the List<Scan> task as a whole ends up taking a long time.
Summary of the invention
In view of this, the present invention provides a data scanning method and device to solve the above technical problem in the related art.
The present invention provides the following technical solutions:
According to a first aspect of the present invention, a data scanning method is proposed, comprising:
obtaining subset information of the data table corresponding to a current task;
determining, according to the subset information, the subsets covered by each scan instruction contained in the current task when querying the data table;
starting scan processes in one-to-one correspondence with the covered subsets, so as to carry out data scanning.
According to a second aspect of the present invention, a data scanning device is proposed, comprising:
a subset information acquiring unit, configured to obtain subset information of the data table corresponding to a current task;
a subset determination unit, configured to determine, according to the subset information, the subsets covered by each scan instruction contained in the current task when querying the data table;
an execution unit, configured to start scan processes in one-to-one correspondence with the covered subsets, so as to carry out data scanning.
As can be seen from the above technical solutions, the present invention analyzes the subsets that each scan instruction covers when querying data and starts as many scan processes as there are covered subsets, that is, one scan process per covered subset. This reduces the overall number of Map processes for the analysis task, which in turn reduces the overhead and delay of task scheduling and improves the overall analysis performance of the task.
Brief description of the drawings
Fig. 1 is a flowchart of a data scanning method provided in an embodiment of the present invention;
Fig. 2 is a flowchart of another data scanning method provided in an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data scanning device provided in an embodiment of the present invention.
Detailed description
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.
Referring to Fig. 1, Fig. 1 is a flowchart of a data scanning method provided in an embodiment of the present invention. The method is applied to a database and may comprise the following steps:
Step 102: obtain subset information of the data table corresponding to the current task.
In this embodiment, the technical solution of the present invention can be applied to various types of databases; for example, the database may be an HBase database, but the present invention is not limited to this.
For convenience of description, the following takes an HBase database as an example. In an HBase database, as the data recorded in each data table (also called a database table) grows, HBase divides the data by Rowkey into multiple Rowkey ranges (Regions). Each Region serves as a subset of the corresponding data table and records only the data in its own Rowkey range (determined by the Region's start Rowkey and end Rowkey); in other words, the Rowkey ranges of different Regions do not overlap. Therefore, once the data table to be queried has been determined, the corresponding subset information, namely how that table is divided into Regions, can be obtained.
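The Region layout described above can be pictured with a small model. The following is only a sketch under stated assumptions: Rowkeys are modeled as Python strings, each Region is a half-open range (start inclusive, end exclusive), an empty string stands for an unbounded first start or last end, and `SPLIT_KEYS`, `build_regions`, and `region_index` are hypothetical names that do not come from the patent or from the HBase API.

```python
from bisect import bisect_right

# Hypothetical split points of one data table (illustrative values only).
SPLIT_KEYS = ["g", "n", "t"]

def build_regions(split_keys):
    """Return non-overlapping (start, end) Rowkey ranges; "" means unbounded."""
    bounds = [""] + sorted(split_keys) + [""]
    return list(zip(bounds[:-1], bounds[1:]))

def region_index(split_keys, rowkey):
    """Index of the single Region whose range contains the given Rowkey."""
    return bisect_right(sorted(split_keys), rowkey)

regions = build_regions(SPLIT_KEYS)
```

Because consecutive ranges share a boundary and never overlap, every Rowkey falls in exactly one Region, which is the property the subsequent steps rely on.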
Step 104: according to the subset information, determine the subsets covered by each scan instruction (Scan) contained in the current task when querying the data table.
In this embodiment, the start and stop Rowkeys of each scan instruction can be compared with the start and end Rowkeys of each subset. If the Rowkey range of any scan instruction at least partially overlaps the Rowkey range of any subset, that scan instruction is determined to cover that subset. Comparing the Rowkey boundaries in this way makes it possible to determine accurately which subsets of the data table each scan instruction covers.
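The overlap comparison can be sketched as follows. This is an illustration rather than the patented implementation: it assumes Rowkeys are Python strings, that both Scan ranges and subset ranges are half-open (start inclusive, stop exclusive), and that an empty string stands for an unbounded start or stop. The function names `overlaps` and `covered_regions` are hypothetical.

```python
def overlaps(scan, region):
    """True if the two half-open Rowkey ranges share at least one Rowkey.

    Each range is a (start, stop) tuple; "" means unbounded (assumption).
    """
    scan_start, scan_stop = scan
    reg_start, reg_stop = region
    # The Scan must begin before the Region ends ...
    begins_before_region_ends = reg_stop == "" or scan_start < reg_stop
    # ... and the Region must begin before the Scan ends.
    ends_after_region_begins = scan_stop == "" or reg_start < scan_stop
    return begins_before_region_ends and ends_after_region_begins

def covered_regions(scan, regions):
    """Return the Regions that the given Scan covers, in table order."""
    return [region for region in regions if overlaps(scan, region)]
```

With Regions `("", "g")`, `("g", "n")`, `("n", "t")`, `("t", "")`, a Scan over `("e", "p")` covers the first three, while a Scan over `("g", "n")` covers only the second.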
Step 106: start scan processes in one-to-one correspondence with the covered subsets, so as to carry out data scanning. When applied to an HBase database, each scan process may be a map process started by the Map-Reduce computing framework.
In this embodiment, one scan process is started for each covered subset, so the number of scan processes equals the number of subsets currently covered. This differs from the related art, which starts one scan process for each scan instruction, and can effectively reduce the number of scan processes started, thereby avoiding the time consumed in starting and invoking them and helping to improve data scanning efficiency.
As can be seen from the above embodiment, the present invention has the data belonging to the same subset analyzed by the same map process. When the current task contains many scan instructions for the same data table, and especially when multiple scan instructions cover the same subset, this effectively reduces the number of map processes started, which helps to reduce the overhead and delay of task scheduling and improves the overall analysis performance of the task.
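The effect on the number of processes can be illustrated with a toy plan. The sketch below assumes every range has explicit string bounds and is half-open; `plan_map_processes` and the sample ranges are hypothetical and only count processes, they do not launch anything.

```python
def plan_map_processes(scans, regions):
    """Map each covered Region to the Scans that touch it (one process per key)."""
    plan = {}
    for region in regions:
        r_start, r_stop = region
        # Half-open overlap test between each Scan range and this Region.
        hits = [s for s in scans if s[0] < r_stop and r_start < s[1]]
        if hits:
            plan[region] = hits
    return plan

regions = [("a", "g"), ("g", "n"), ("n", "t")]
scans = [("b", "d"), ("c", "f"), ("e", "h"), ("h", "k")]
plan = plan_map_processes(scans, regions)
print(len(scans), "processes per-Scan vs", len(plan), "per-Region")
# prints: 4 processes per-Scan vs 2 per-Region
```

The saving depends on the Scans clustering on a few subsets, as in this sample; uncovered subsets, such as `("n", "t")` here, get no process at all.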
Further, on the basis of the above embodiment, another embodiment may also comprise: when any covered subset contains a data segment that is queried repeatedly by multiple scan instructions, merging the repeated query operations on that data segment.
In other words, if the scan ranges (Rowkey ranges) of two scan instructions at least partially overlap, merging the overlapping part prevents the same data from being scanned separately by both scan instructions. This avoids repeated analysis of data and wasted resources, and helps to improve the speed and efficiency of data scanning.
Fig. 2 is a flowchart of another data scanning method provided in an embodiment of the present invention. As shown in Fig. 2, the method may comprise the following steps:
Step 202: judge whether the number of scan instructions contained in the current task (for example, the Scan instructions contained in a List<Scan>) is less than or equal to a preset value. If so, go to step 204; otherwise, go to step 206.
In this embodiment, the present invention adds steps before the data scanning is finally executed, such as determining the subsets covered by the scan instructions, and these steps consume extra processing time. Therefore, if the current task contains only a small number of scan instructions, the time spent on the data scanning itself may be shorter, but the extra processing time may make the task take longer overall.
Therefore, by judging in advance the number of scan instructions contained in the current task, a task with few scan instructions can be processed directly according to the related art (that is, proceeding to step 204), while a task with many scan instructions is processed according to the technical solution of the present invention (that is, proceeding to step 206).
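The branch in step 202 amounts to a simple threshold check, sketched below. The patent does not specify the preset value, so `SCAN_COUNT_THRESHOLD` is an arbitrary placeholder, and `choose_strategy` and the two strategy labels are hypothetical names.

```python
# Hypothetical preset value; the source text does not give a number.
SCAN_COUNT_THRESHOLD = 8

def choose_strategy(scans, threshold=SCAN_COUNT_THRESHOLD):
    """Pick how to schedule a task based on how many Scans it contains."""
    if len(scans) <= threshold:
        # Few Scans: the subset analysis would cost more than it saves.
        return "per_scan"
    # Many Scans: analyze subset coverage and start one process per subset.
    return "per_region"
```

In practice the threshold would presumably be tuned by measuring where the extra analysis time stops outweighing the scheduling savings.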
Step 204A: start a corresponding scan process (for example, a map process started by the Map-Reduce computing framework) for each scan instruction, so as to execute the query operation corresponding to each scan instruction.
Step 204B: group the current task List<Scan> by table.
In this embodiment, because the Scan instructions in the task may cover data in multiple data tables, the different data tables can be grouped, for example one group per data table, so that each group of data tables is scanned separately.
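Grouping by table can be sketched with a dictionary keyed by table name. This assumes, purely for illustration, that each Scan is represented as a `(table, start, stop)` tuple; the patent does not say how a Scan identifies its table, and `group_by_table` is a hypothetical helper.

```python
from collections import defaultdict

def group_by_table(scans):
    """Group (table, start, stop) Scans into {table: [(start, stop), ...]}."""
    groups = defaultdict(list)
    for table, start, stop in scans:
        groups[table].append((start, stop))
    return dict(groups)
```

Each group can then be compared against the Region layout of its own table only, since Regions of different tables are unrelated.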
Step 206: determine the subsets (Regions) that each Scan instruction covers in each data table.
In this embodiment, the Region information of each data table can be obtained, including the start Rowkey and end Rowkey of each Region. This is combined with the scan start Rowkey and scan stop Rowkey of each Scan instruction for that data table, and the two ranges are compared to obtain the data segment of each Scan instruction on each Region, that is, how each Scan instruction covers each Region.
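Deriving the data segment of each Scan on each Region amounts to intersecting two ranges. The sketch below assumes half-open ranges with explicit string bounds (no unbounded ends) to keep the comparison short; `clip` and `segments_per_region` are hypothetical names.

```python
def clip(scan, region):
    """Return the part of the Scan range inside the Region, or None if disjoint."""
    start = max(scan[0], region[0])
    stop = min(scan[1], region[1])
    return (start, stop) if start < stop else None

def segments_per_region(scans, regions):
    """Map each covered Region to the Scan segments that fall inside it."""
    result = {}
    for region in regions:
        clipped = (clip(scan, region) for scan in scans)
        segments = [seg for seg in clipped if seg is not None]
        if segments:
            result[region] = segments
    return result
```

For example, a Scan over `("e", "h")` on Regions `("a", "g")` and `("g", "n")` yields the segment `("e", "g")` on the first and `("g", "h")` on the second.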
Step 208A: for each covered subset, generate one corresponding scan process.
In this embodiment, for all Regions determined to be covered by Scan instructions, an equal number of scan processes (for example, map processes started by the Map-Reduce computing framework) are started in one-to-one correspondence. Thus, even if the same Region is covered by multiple Scan instructions, only one map process needs to be started for it. This effectively reduces the number of map processes started and helps to improve system response speed.
Step 208B: merge the data segments that are queried repeatedly by the scan instructions.
In this embodiment, when multiple Scan instructions cover the same Region, the ranges they cover on that Region (each delimited by a scan start Rowkey and a scan stop Rowkey) may overlap. The data segment ranges corresponding to the overlapping part can then be merged, so that the overlapping data only needs to be scanned once, which can greatly improve scanning efficiency.
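One common way to merge overlapping ranges is to sort them by start key and sweep once; the patent does not say which method it uses, so the following is a generic interval-merge sketch. It assumes half-open segments with explicit string bounds, and `merge_segments` is a hypothetical name.

```python
def merge_segments(segments):
    """Merge overlapping or touching (start, stop) segments within one Region."""
    merged = []
    for start, stop in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous segment: extend it.
            prev_start, prev_stop = merged[-1]
            merged[-1] = (prev_start, max(prev_stop, stop))
        else:
            merged.append((start, stop))
    return merged
```

For instance, segments `("b", "d")`, `("c", "f")`, and `("h", "k")` merge into `("b", "f")` and `("h", "k")`, so the rows between `"c"` and `"d"` are read once. A real implementation would also need to route each scanned row back to every Scan whose range contains it, which this sketch leaves out.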
Of course, step 208B does not have to be executed; after step 208A alone, the method can proceed directly to step 210.
Step 210: execute the scan.
With the above processing, when there are many Scans on the same data table, step 208A significantly reduces the number of map processes started: each map process analyzes somewhat more data, and the overhead of task scheduling is reduced, which yields the final performance gain. Meanwhile, step 208B merges the data segment ranges queried repeatedly by multiple Scan instructions, which reduces the amount of data each map process queries in its Region, further speeding up scanning and improving system processing efficiency.
Fig. 3 shows a schematic structural diagram of an electronic device according to an exemplary embodiment of this application. Referring to Fig. 3, at the hardware level the electronic device comprises a processor, an internal bus, a network interface, memory, and non-volatile storage, and may of course also comprise hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into memory and runs it, forming the data scanning device at the logical level. Of course, besides a software implementation, this application does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the executing entity of the following processing flow is not limited to logical units and may also be hardware or a logic device.
Referring to Fig. 4, in a software implementation the data scanning device may comprise a subset information acquiring unit, a subset determination unit, and an execution unit, wherein:
the subset information acquiring unit is configured to obtain subset information of the data table corresponding to a current task;
the subset determination unit is configured to determine, according to the subset information, the subsets covered by each scan instruction contained in the current task when querying the data table;
the execution unit is configured to start scan processes in one-to-one correspondence with the covered subsets, so as to carry out data scanning.
Optionally, the subset determination unit is specifically configured to:
compare the start and stop Rowkeys of each scan instruction with the start and end Rowkeys of each subset;
when the Rowkey range of any scan instruction at least partially overlaps the Rowkey range of any subset, determine that the scan instruction covers that subset.
Optionally, the execution unit is specifically configured to:
when any covered subset contains a data segment queried repeatedly by multiple scan instructions, merge the repeated query operations on that data segment.
Optionally, the execution unit is specifically configured to:
when the number of scan instructions is less than or equal to a preset value, start a corresponding scan process for each scan instruction, so as to execute the query operation corresponding to each scan instruction.
Optionally, the data table is an HBase database table, and the scan process is a map process started by the Map-Reduce computing framework.
With the present invention, the data covered by multiple Scans that belongs to the same subset is analyzed in a single Map process, which reduces the overall number of Map processes for the analysis task and thereby reduces the overhead and delay of task scheduling and improves the overall analysis performance of the task. In addition, by identifying the data queried repeatedly by multiple Scans within a subset and filtering out the duplicates, analysis performance can be improved further and disk input/output reduced. In verification, for an analysis service with 700 checkpoint Scans, this approach reduced the analysis time to 1/6 of that before optimization.
As the device embodiment essentially corresponds to the method embodiment, the relevant parts can be found in the description of the method embodiment. The device embodiment described above is only illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement it without creative effort.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A data scanning method, characterized by comprising:
obtaining subset information of the data table corresponding to a current task;
determining, according to the subset information, the subsets covered by each scan instruction contained in the current task when querying the data table;
starting scan processes in one-to-one correspondence with the covered subsets, so as to carry out data scanning.
2. The data scanning method according to claim 1, characterized in that determining, according to the subset information, the subsets covered by each scan instruction contained in the current task when querying the data table comprises:
comparing the start and stop Rowkeys of each scan instruction with the start and end Rowkeys of each subset;
if the Rowkey range of any scan instruction at least partially overlaps the Rowkey range of any subset, determining that the scan instruction covers that subset.
3. The data scanning method according to claim 1, characterized in that carrying out data scanning comprises:
when any covered subset contains a data segment queried repeatedly by multiple scan instructions, merging the repeated query operations on that data segment.
4. The data scanning method according to claim 1, characterized by further comprising:
when the number of scan instructions is less than or equal to a preset value, starting a corresponding scan process for each scan instruction, so as to execute the query operation corresponding to each scan instruction.
5. The data scanning method according to any one of claims 1 to 4, characterized in that the data table is an HBase database table, and the scan process is a scan process started by the Map-Reduce computing framework.
6. A data scanning device, characterized by comprising:
a subset information acquiring unit, configured to obtain subset information of the data table corresponding to a current task;
a subset determination unit, configured to determine, according to the subset information, the subsets covered by each scan instruction contained in the current task when querying the data table;
an execution unit, configured to start scan processes in one-to-one correspondence with the covered subsets, so as to carry out data scanning.
7. The data scanning device according to claim 6, characterized in that the subset determination unit is specifically configured to:
compare the start and stop Rowkeys of each scan instruction with the start and end Rowkeys of each subset;
when the Rowkey range of any scan instruction at least partially overlaps the Rowkey range of any subset, determine that the scan instruction covers that subset.
8. The data scanning device according to claim 6, characterized in that the execution unit is specifically configured to:
when any covered subset contains a data segment queried repeatedly by multiple scan instructions, merge the repeated query operations on that data segment.
9. The data scanning device according to claim 6, characterized in that the execution unit is specifically configured to:
when the number of scan instructions is less than or equal to a preset value, start a corresponding scan process for each scan instruction, so as to execute the query operation corresponding to each scan instruction.
10. The data scanning device according to any one of claims 6 to 9, characterized in that the data table is an HBase database table, and the scan process is a map process started by the Map-Reduce computing framework.
CN201510898272.2A 2015-12-08 2015-12-08 data scanning method and device Active CN105487925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510898272.2A CN105487925B (en) 2015-12-08 2015-12-08 data scanning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510898272.2A CN105487925B (en) 2015-12-08 2015-12-08 data scanning method and device

Publications (2)

Publication Number Publication Date
CN105487925A (en) 2016-04-13
CN105487925B (en) 2019-01-15

Family

ID=55674919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510898272.2A Active CN105487925B (en) 2015-12-08 2015-12-08 data scanning method and device

Country Status (1)

Country Link
CN (1) CN105487925B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956043A (en) * 2016-04-26 2016-09-21 海尔优家智能科技(北京)有限公司 Method and device for allocating Map task for MapReduce running on Hbase database
CN110489478A (en) * 2019-08-27 2019-11-22 恩亿科(北京)数据科技有限公司 A kind of method and device of data scanning
CN111427887A (en) * 2020-03-17 2020-07-17 中国邮政储蓄银行股份有限公司 Method, device and system for rapidly scanning HBase partition table

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102576369A (en) * 2009-08-24 2012-07-11 阿玛得斯两合公司 Continuous full scan data store table and distributed data store featuring predictable answer time for unpredictable workload
CN103902544A (en) * 2012-12-25 2014-07-02 中国移动通信集团公司 Data processing method and system
US20150106397A1 (en) * 2009-08-31 2015-04-16 Hewlett-Packard Development Company, L.P. System and Method for Optimizing Queries

Also Published As

Publication number Publication date
CN105487925B (en) 2019-01-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant