CN104714976B

CN104714976B - Data processing method and equipment

Info

Publication number: CN104714976B
Application number: CN201310692643.2A
Authority: CN
Inventors: 杨旭; 蔡宁; 姜晓燕; 王少萌; 代斌
Original assignee: Alibaba Group Holding Ltd
Current assignee: Zhejiang Tmall Technology Co Ltd
Priority date: 2013-12-17
Filing date: 2013-12-17
Publication date: 2018-08-24
Anticipated expiration: 2033-12-17
Also published as: CN104714976A

Abstract

The present application provides a data processing method and device. The method includes: in response to an initial query request for a data set, obtaining a basic histogram by reading the data in the data set once; and, based on a predetermined target interval or target group width, obtaining from the basic histogram a target histogram corresponding to that target interval or target group width, and presenting the target histogram. With this method, the data needs to be read only once no matter how many times the target histogram is transformed; that is, every transformation of the histogram is performed using the basic histogram alone. This greatly improves the computation speed and data processing capability of the system and allows histograms to be displayed quickly even for big data.

Description

Data processing method and equipment

Technical field

The present application relates to the technical field of data processing, and in particular to a histogram-based data processing method and device.

Background art

Generally, when there are only a few dozen data items to analyze, the analysis result can be obtained by visual inspection. When the number of data items reaches one thousand, ten thousand, ... one hundred million or one billion, however, a histogram can be used to analyze the data. A histogram is a statistical chart that shows the distribution characteristics of data: it represents the data distribution with a series of contiguous, equal-width vertical line segments or bars of unequal height.

For example, Figures 10A to 10E are schematic diagrams of an example of analyzing data with a histogram. Figure 10A is a histogram of certain data to be analyzed, with the interval width (hereinafter also called the "group width") set to 80. As the figure shows, most of the data is concentrated in the two intervals [480, 560) and [0, 80). When the group width is changed from 80 to 20, as shown in Figure 10B, the two intervals [500, 520) and [0, 20) clearly dominate. Next, when attention is restricted to the interval with the most data, [500, 520), and the group width is changed to 2, as shown in Figure 10C, the vast majority of the data turns out to be concentrated in the interval [510, 512). For the same interval [500, 520), when the group width is adjusted to 0.1, as shown in Figure 10D, it can be concluded that the data in this interval all lie near integers. By contrast, when attention is restricted to the interval [0, 20), as shown in Figure 10E, the data distribution in this interval is completely different from that in [500, 520) and has the shape of a logarithmic distribution. This example shows that a histogram helps in understanding the distribution of the data to be analyzed: changing the group width (interval width) reveals more information about the distribution, and focusing on a few regions of interest in the histogram shows the distribution characteristics of the data in each region more intuitively.

When the volume of data to be analyzed is small, each computation performed to obtain a histogram takes very little time, so the user can continuously change the displayed group width (i.e., the group width of the histogram) and switch to each interval of interest without any noticeable pause. When the volume of data to be analyzed is large, however, the computation time becomes longer, the display visibly stalls during the interaction, and the user experience suffers. Moreover, for massive data stored in a distributed system (i.e., big data), each time the user changes the requirement and the displayed group width, the computation performed to obtain the histogram takes several minutes before the new histogram can be shown.

Summary of the application

The main purpose of the present application is to provide a data processing method and device that solve problems in the prior art such as the long computation time of the histogram calculation process and the resulting decline in user experience.

According to one aspect of the present application, a data processing method is provided, characterized by including: in response to an initial query request for a data set, obtaining a basic histogram by reading the data in the data set once; and, based on a predetermined target interval or target group width, obtaining from the basic histogram a target histogram corresponding to the target interval or target group width, and presenting the target histogram.

According to another aspect of the present application, a data processing device is provided, characterized by including: a basic histogram obtaining device configured to, in response to an initial query request for a data set, obtain a basic histogram by reading the data in the data set once; and a target histogram obtaining device configured to, based on a predetermined target interval or target group width, obtain from the basic histogram a target histogram corresponding to the target interval or target group width, and to present the target histogram.

Compared with the prior art, the technical solution of the present application reads the data once to compute, as intermediate data, a histogram with a very small group width (called the "basic histogram"), and then, according to the user's requirements, uses the basic histogram to obtain a target histogram corresponding to those requirements. As a result, over repeated transformations of the target histogram, the number of times the data is read is reduced to one, and each transformation of the histogram is performed using the basic histogram alone. This greatly improves the computation speed and data processing capability of the system and allows histograms to be displayed to the user quickly even for big data.

Description of the drawings

The drawings described here are provided for further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a general flowchart of the data processing method of an embodiment of the present application;

Fig. 2 is a flowchart of obtaining the basic histogram in an embodiment of the present application;

Fig. 3 is a schematic diagram of an example of obtaining the basic histogram by distributed computation in an embodiment of the present application;

Fig. 4 is a flowchart of obtaining a node basic histogram in an embodiment of the present application;

Fig. 5 is a flowchart of merging node basic histograms in an embodiment of the present application;

Fig. 6 is a structural schematic diagram of the data processing device according to the present application;

Fig. 7 is a structural schematic diagram of an example of the node basic histogram obtaining device according to the present application;

Fig. 8 is another structural schematic diagram of the node basic histogram obtaining device according to the present application;

Fig. 9 is a structural schematic diagram of an example of the histogram obtaining device according to the present application;

Figures 10A to 10E are schematic diagrams of an example of analyzing data with a histogram in the prior art.

Detailed description

The main idea of the present application is as follows. For big data, in order to provide the user with a smooth histogram viewing function while reading the data only once, the data is first read once to compute, as intermediate data, a histogram with a very small group width (called the "basic histogram"). Then, according to the user's requirements, the basic histogram is transformed into a target histogram corresponding to those requirements; that is, given the target interval or target group width the user requires, the target histogram can be obtained from the basic histogram alone. This greatly increases the computation speed and data processing capability of the system, and histograms can be displayed quickly even for big data, so that the user can view histograms without delay.

It should be noted that "big data" in the present application may refer to a data set of tens of GB or more, and the data may be of any type, such as web logs, video, pictures or geographic location information. It will be understood that the solution of the present application is particularly suited to big data scenarios with huge data volumes, but it is equally applicable to data processing scenarios of any other data magnitude. To make the purpose, technical solution and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and specific embodiments.

According to an embodiment of the present application, a data processing method is provided. With this data processing method, the processing result is displayed to the user in the form of a histogram.

In the prior art, the two elements that make up a histogram are the group width and the frequency, and the usual algorithm for computing a histogram is: (1) read the data once and compute its maximum and minimum to obtain the range, i.e., the difference between the maximum and the minimum; (2) determine the number of groups of the histogram according to the user's requirements, then divide the range by this number of groups to obtain the width of each group, i.e., the group width; (3) determine the boundary values of each group according to the group width; (4) read the data once more and count the frequency of each group. With this method, whenever the user changes the requirement, for example by changing the interval or group width of the displayed histogram as in Figures 10A to 10E, all of the data must be read twice and the computation redone before the histogram corresponding to the user's requirement can be obtained. Because the data must be read twice in computing the histogram, the data processing time is long. As the data volume keeps growing, the data processing time grows with it, and the user experience degrades considerably.
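The four steps of the conventional method can be sketched as follows. This is a minimal illustration rather than code from the application: `two_pass_histogram` and `read_data` (a callable that returns a fresh iterator over the data each time it is called) are hypothetical names, and the data is assumed to contain at least two distinct values.

```python
def two_pass_histogram(read_data, num_groups):
    """Conventional histogram: each call scans the whole data set twice."""
    # Pass 1: find the minimum and maximum to obtain the range.
    lo, hi = float("inf"), float("-inf")
    for x in read_data():
        lo, hi = min(lo, x), max(hi, x)

    # Group width = range / number of groups (assumes hi > lo).
    width = (hi - lo) / num_groups
    edges = [lo + i * width for i in range(num_groups + 1)]

    # Pass 2: count the frequency of each group.
    freq = [0] * num_groups
    for x in read_data():
        # Assumption: the maximum value is counted in the last group.
        i = min(int((x - lo) / width), num_groups - 1)
        freq[i] += 1
    return edges, freq


# Count how often the data is scanned.
scans = 0

def read_data():
    global scans
    scans += 1
    return iter([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

edges, freq = two_pass_histogram(read_data, 5)
```

Every change of interval or group width repeats both scans, which is the cost the application sets out to avoid.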

In view of the above problems, and in order to greatly increase the data processing capability of the system, the present application starts by reducing the data processing time, that is, by reducing the number of times the data is read. The data processing method of the present application therefore consists mainly of two parts: first, a basic histogram with a very small group width is obtained by an adaptive computation method, so that the basic histogram can be obtained by reading the data only once; second, the basic histogram is transformed into a target histogram corresponding to the user's requirements, so that the data does not have to be read again.
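The second part, deriving a target histogram from the basic histogram without touching the raw data, might look like the sketch below. It rests on assumptions that this passage does not spell out: the basic histogram is held as a mapping from an integer index i to the frequency of the interval [i*w, (i+1)*w), and the target boundaries and target group width are integer multiples of the basic group width w, so that no basic interval straddles a target boundary. `to_target` and its parameters are hypothetical names.

```python
from fractions import Fraction

def to_target(basic_width, basic_freq, target_lo, target_hi, target_width):
    """Re-bin a basic histogram into [target_lo, target_hi) with target_width."""
    n = int((target_hi - target_lo) // target_width)
    target = [0] * n
    for i, count in basic_freq.items():
        left = i * basic_width  # left boundary of basic interval i
        if target_lo <= left < target_hi:
            target[int((left - target_lo) // target_width)] += count
    return target

basic_width = Fraction(1, 10)
# Basic intervals [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0) with frequencies 1..10.
basic_freq = {i: i + 1 for i in range(10)}

# Whole range with a wider group width, then a zoom into [0.5, 1).
coarse = to_target(basic_width, basic_freq, 0, 1, Fraction(1, 2))
zoomed = to_target(basic_width, basic_freq, Fraction(1, 2), 1, Fraction(1, 10))
```

Both calls work only on the stored frequencies, so changing the view costs time proportional to the number of basic intervals, not to the size of the data set.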

Referring to Fig. 1, Fig. 1 is a general flowchart of the data processing method of an embodiment of the present application. In the figure, step S101 is the step of processing the data adaptively, and step S102 is the step of transforming the basic histogram. Each is described in detail below.

(Adaptive data processing)

Specifically, in step S101, in response to an initial query request for a data set, a basic histogram is obtained by reading the data in the data set once.

When a user wants to view the distribution of certain data, the user can initiate a query request for the corresponding data set. In the present application, a user's first query against a given data set is called the initial query request. The initial query request may be a data query request that the user initiates by entering a query keyword. More specifically, the initial query request may also be a histogram display request that the user initiates for a given data set.

When the user's initial query request is received, a data processing device such as a computer reads the data, processes it adaptively, and obtains the basic histogram, performing only one data read operation in the process. The data here is the data to be processed. The basic histogram is a histogram with a very small group width; it is in effect the intermediate data from which target histograms are built. In other words, the result obtained from the data processing is intermediate data rather than the actual histogram shown to the user, but for ease of description this intermediate data is called the basic histogram.

For big data, because the data volume is huge, distributed computation is generally used to process the data. That is, the data is divided into multiple parts and assigned to multiple compute nodes, each compute node processes only its part of the data, and the results of the compute nodes are finally merged into the final result. In distributed computation, multiple compute nodes share information with one another; these compute nodes may run on the same computer or on multiple computers connected by a network. The method of obtaining the basic histogram is described below taking distributed computation as an example.

Fig. 2 is a flowchart of obtaining the basic histogram in an embodiment of the present application. Fig. 3 is a schematic diagram of an example of obtaining the basic histogram by distributed computation in an embodiment of the present application. The method of obtaining the basic histogram is described with reference to Fig. 2 and Fig. 3.

In step S201, the data is assigned to multiple compute nodes so that each compute node obtains node data. For example, as shown in Fig. 3, the massive data is divided into three parts, node data 1, node data 2 and node data 3, which are sent to compute node 1, compute node 2 and compute node 3 respectively. Fig. 3 shows three compute nodes and one merge node, but the number of compute nodes is not limited to this and may also be three or more. The merge node may also be one of the compute nodes.

In step S202, each compute node obtains a node basic histogram by reading its node data once. For example, as shown in Fig. 3, compute node 1 reads the received node data 1 and obtains node basic histogram 1 through the processing shown in Fig. 4, described below. Fig. 4 is a flowchart of obtaining a node basic histogram in an embodiment of the present application.

Specifically, in step S401, a part of the node data is read once as initial data, the data range of the initial data is determined, and the initial group width is determined from a predetermined number of groups and the data range of the initial data (the determining step). In other words, compute node 1 first reads a part of node data 1, determines from this read operation the data range M of the part that was read, and then obtains a very small initial group width, M/N, from the predetermined number of groups N, where N is a positive integer. Here, the predetermined number of groups N is a preset suggested number of groups for the node basic histogram. The suggested number need not equal the number of groups of the node basic histogram actually computed, but the two should be of the same order of magnitude. In practice the suggested number of groups N is usually set to 10000, so the resulting node basic histogram will have between 10000 and 100000 groups. Setting the predetermined number of groups N sufficiently large yields a sufficiently small group width, which has the advantage of reducing the number of computations over the data.

In step S402, the initial data is divided into multiple initial intervals according to the initial group width, and the frequency of each initial interval is calculated (the dividing step). In other words, compute node 1 groups the part of the node data that was read according to the initial group width M/N, dividing it into multiple initial intervals, and then calculates the frequency of each initial interval. The frequency of an initial interval is the number of data items that fall in it. Because the initial group width is very small, the frequency of each initial interval can be calculated quickly with a single read of the data, and each frequency is small, for example no more than 2. This greatly reduces the number of times the data is read and improves the data processing speed of the system.
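Steps S401 and S402 can be sketched as follows. This is a minimal sketch, not the application's implementation: `initial_histogram` is a hypothetical name, exact `Fraction` arithmetic is used to avoid floating-point rounding at interval boundaries, the initial data is assumed to contain at least two distinct values, and a value equal to the maximum is counted in the last interval (an assumption; the text does not say how the right end is handled).

```python
from fractions import Fraction

def initial_histogram(initial_data, n_groups):
    """Return (left boundary, initial group width, frequencies)."""
    values = [Fraction(x) for x in initial_data]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_groups          # S401: initial group width M / N
    freq = [0] * n_groups
    for x in values:                      # S402: frequency of each interval
        i = min(int((x - lo) / width), n_groups - 1)
        freq[i] += 1
    return lo, width, freq

lo, width, freq = initial_histogram(["10", "10.05", "10.4", "10.95", "11"], 10)
```

Here the data range M is 1 and N is 10, so the initial group width is 0.1 and the five values fall into the first, fifth and last intervals.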

In step S403, the remaining node data is read as new data, the new data is processed adaptively, and the group width and frequencies of the current data are determined, where the current data comprises the initial data and the new data (the adaptive processing step). In other words, when new data is added to the initial data, the data range and group width are adjusted adaptively. Here, the remaining node data means one or more data items other than the part of the node data read in step S401 (hereinafter "the read data"). That is, the new data contains one or more data items; it may be all of the data other than the read data, or only part of it.

The flow of the adaptive processing step is explained in detail below.

First, after the new data is read, it is judged whether every data item in the new data lies within the data range of the initial data. Whether the data range needs to be adjusted depends mainly on whether the value of each data item in the new data is greater than or equal to the minimum of the initial data and less than or equal to the maximum of the initial data. Whether to adjust the data range is therefore decided from the result of this judgment.

If the result is yes, i.e., every data item in the new data lies within the data range of the initial data, it is then determined which initial interval each data item in the new data belongs to, and the frequency of that initial interval is increased accordingly, which gives the group width and frequencies of the current data. That is, the group width of the current data is the initial group width, and the frequencies of the current data are those obtained after increasing the frequencies of the initial intervals that the new data falls in. In other words, when every data item in the new data lies within the data range of the initial data, the data range does not need to be adjusted; each data item in the new data is simply counted in its corresponding interval.

Conversely, if the result is no, i.e., some or all of the data items in the new data lie outside the data range of the initial data, the multiple initial intervals are adjusted into multiple new intervals, it is then determined which new interval each data item in the new data belongs to, and the frequency of that new interval is increased accordingly, which gives the group width and frequencies of the current data. That is, the group width of the current data is the new group width after adjustment, and the frequencies of the current data are those obtained after increasing the frequencies of the new intervals that the new data falls in. In other words, when some or all of the data items in the new data lie outside the data range of the initial data, the data range and the group width need to be adjusted.

Adjusting the multiple initial intervals into multiple new intervals falls into the following two cases:

In the first case, only the number of intervals is increased and the group width remains unchanged; that is, only the data range is adjusted. In this case, for the data items in the new data that lie outside the data range of the initial data, intervals are added according to the initial group width, and each interval after the addition is taken as a new interval, so that the new data lies within the data range covered by all the new intervals. That is, the data items in the new data that lie outside the data range of the initial data fall within the added intervals.
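The first case, together with the in-range case before it, can be sketched as below. The representation is an assumption made for illustration: interval boundaries are integer multiples of the group width, so a histogram is described by its group width, the integer index `first` of its leftmost interval, and a list of frequencies. `add_value` is a hypothetical helper.

```python
from fractions import Fraction
from math import floor

def add_value(width, first, freq, x):
    """Count x, adding empty intervals on either side if it is out of range.

    Interval i covers [i * width, (i + 1) * width); freq[0] belongs to
    interval `first`. The group width is never changed here.
    """
    i = floor(Fraction(x) / width)
    if i < first:                               # extend to the left
        freq[:0] = [0] * (first - i)
        first = i
    elif i >= first + len(freq):                # extend to the right
        freq.extend([0] * (i - first - len(freq) + 1))
    freq[i - first] += 1
    return first, freq

width = Fraction(1, 10)
first, freq = 100, [2, 2, 2]      # [10, 10.1), [10.1, 10.2), [10.2, 10.3)
first, freq = add_value(width, first, freq, "10.15")   # inside the range
in_range = list(freq)
first, freq = add_value(width, first, freq, "9.85")    # left of the range
```

Extending by whole intervals keeps the intervals contiguous, which is why an empty interval can appear between the new value and the old range.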

In the second case, the number of intervals is increased and the group width is changed; that is, the data range and the group width are adjusted at the same time. As new data keeps arriving, the number of intervals can exceed the number that the compute node can handle. The number of intervals must then be reduced by increasing the group width, so the group width has to be adjusted along with the data range. In the second case, first, when intervals are added according to the initial group width for the data items in the new data that lie outside the data range of the initial data, and the sum of the number of added intervals and the number of initial intervals exceeds a certain multiple of the predetermined number of groups, the initial group width is multiplied by a predetermined factor to give the new group width. Next, the left boundary value of the data range of the current data is adjusted so that the adjusted left boundary value is a multiple of the new group width, and the adjusted data range is taken as the new data range. Then the new data range is divided into multiple new intervals according to the new group width, so that the number of new intervals is greater than or equal to the predetermined number of groups and less than the certain multiple of the predetermined number of groups. For example, if the number of current intervals exceeds 10N, the group width needs to be upgraded, i.e., the initial group width is increased to 10, 100 or 1000 times its value, and so on, and the left boundary value of the data range of the current data is adjusted to a multiple of the new group width. For example, if the left boundary value before adjustment is 92 and the new group width is 10, the left boundary value after adjustment is 90. The new intervals finally obtained thus satisfy the condition that their number is greater than or equal to N and less than 10N. Then the frequency of each new interval is determined. Because the new group width is a predetermined multiple of the initial group width and the left boundary value of the new data range is a multiple of the new group width, each new interval is in effect formed by merging several adjacent intervals of the histogram of the current data. The frequency of each new interval is therefore the sum of the frequencies of those adjacent intervals.
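The second case can be sketched as follows, again under assumptions: interval i covers [i*w, (i+1)*w) with `first` the index of the leftmost interval, the threshold is 10N, and the group width grows by a factor of 10 per step. `coarsen` and `adjust` are hypothetical names.

```python
from fractions import Fraction

def coarsen(width, first, freq, factor=10):
    """Merge every `factor` adjacent intervals into one new interval."""
    new_first = first // factor               # floor: the left boundary becomes
    last = (first + len(freq) - 1) // factor  # a multiple of the new group width
    new_freq = [0] * (last - new_first + 1)
    for k, count in enumerate(freq):
        new_freq[(first + k) // factor - new_first] += count
    return width * factor, new_first, new_freq

def adjust(width, first, freq, n_groups, factor=10):
    """Increase the group width until fewer than 10 * N intervals remain."""
    while len(freq) >= 10 * n_groups:
        width, first, freq = coarsen(width, first, freq, factor)
    return width, first, freq

# 20 intervals of width 1 covering [92, 112), one data item in each; N = 1.
width, first, freq = adjust(Fraction(1), 92, [1] * 20, n_groups=1)
left_boundary = first * width
```

With a left boundary of 92 and a new group width of 10, the adjusted left boundary is 90, as in the text. No data is re-read: the new frequencies are sums of the old ones.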

In addition, the choice of group width is particularly important when new data is processed adaptively. As new data keeps being read, the data range keeps growing and the number of intervals keeps increasing. Once the data range has grown to a certain extent, the number of intervals exceeds the number the compute node can handle, and the number of intervals must be reduced by increasing the group width. So that the new groups can be formed simply by merging existing groups, without recomputation, the new group width after the increase is set to an integer multiple of the original group width; the number of data items in a new interval is then the sum of the numbers of data items in several original intervals, i.e., the frequency of a new interval is the sum of the frequencies of several original intervals. The boundary values of the intervals should also be considered when the group width is adjusted; they are preferably fairly round numbers, such as 0.1, 0.002, 1 or 100. For numeric data, the initial group width is chosen as 10^k, where k is an integer, so that the new group width is always 10^m times the initial group width, where m is a positive integer.

In addition, the amount of new data read each time can be made as large as possible, to reduce the number of interval adjustments and improve the operating efficiency of the system.

In step S404, if node data that has not been read still remains, the process returns to step S403, until all the node data has been read, so as to obtain the group width and frequencies of the whole node data (the loop reading step). That is, as new data is read, the data range and group width are adjusted continually and the frequencies change with them; once all the node data has been read, the group width and frequencies of the whole node data are obtained.

In step S405, the node basic histogram is obtained from the group width and frequencies of the whole node data (the obtaining step). For example, as shown in Fig. 3, once compute node 1 has finished reading all of node data 1, the group width and frequencies for node data 1 are obtained, and with them node basic histogram 1. Similarly, compute node 2 obtains node basic histogram 2 from node data 2, and compute node 3 obtains node basic histogram 3 from node data 3.

The process of adaptively processing new data is illustrated below with an example.

For example, suppose a grouping has been obtained for some part of the data, forming the initial intervals [10, 10.1), [10.1, 10.2), [10.2, 10.3), ..., [99.9, 100), with corresponding frequencies {2, 2, 2, ..., 2}, and the predetermined number of groups N = 10.

Next, a new data item 10.15 is read. Because it belongs to the existing interval [10.1, 10.2), the data range and group width do not need to change; only the corresponding frequency is increased, so the frequencies become {2, 3, 2, ..., 2}.

Then another new data item, 9.85, is read. Because it lies outside the original intervals, the two intervals [9.8, 9.9) and [9.9, 10) are added according to the initial group width 0.1. Of these, [9.9, 10) is an empty interval added because the intervals of a histogram must be contiguous. The new intervals finally obtained are [9.8, 9.9), [9.9, 10), [10, 10.1), [10.1, 10.2), [10.2, 10.3), ..., [99.9, 100), and the corresponding frequencies become {1, 0, 2, 3, 2, ..., 2}.

Then another new data item, -1.0, is read, and intervals are added according to the initial group width 0.1, so that the intervals after the addition are [-0.1, 0), [0, 0.1), ..., [9.8, 9.9), [9.9, 10), ..., [99.9, 100). The total number of intervals now exceeds 10N = 100, so the initial group width 0.1 is upgraded by a factor of 100 to the new group width 10. So that the number of new intervals after the adjustment is greater than or equal to 10 and less than 100, the left boundary value of the data range of the current data is adjusted to a multiple of the new group width 10, and the adjusted data range is taken as the new data range. That is, for the interval [-0.1, 0), the left boundary value before adjustment is -0.1 and the new group width is 10, so the left boundary value after adjustment is -10. Re-dividing the new data range into intervals therefore gives the new intervals [-10, 0), [0, 10), [10, 20), [20, 30), ..., [90, 100). Then, for each new interval, the frequencies of the original intervals it contains are added together, giving the frequencies of the new intervals as {1, 1, 201, 200, ..., 200}.
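The worked example can be replayed in a few lines. This sketch takes the values 10.15, 9.85 and -1.0 and the upgrade factor of 100 directly from the example; the dictionary layout (key i stands for the interval [i*0.1, (i+1)*0.1), absent keys are empty intervals) and the helper names `add` and `coarsen` are assumptions made for illustration.

```python
from fractions import Fraction
from math import floor

WIDTH = Fraction(1, 10)                      # initial group width 0.1

# Initial intervals [10, 10.1), ..., [99.9, 100), each with frequency 2.
# Key i stands for the interval [i * WIDTH, (i + 1) * WIDTH).
freq = {i: 2 for i in range(100, 1000)}

def add(freq, x, width):
    """Count x in its interval; intervals absent from the dict are empty."""
    i = floor(Fraction(x) / width)
    freq[i] = freq.get(i, 0) + 1

def coarsen(freq, factor):
    """Sum the frequencies of every `factor` adjacent intervals."""
    merged = {}
    for i, count in freq.items():
        merged[i // factor] = merged.get(i // factor, 0) + count
    return merged

add(freq, "10.15", WIDTH)
step1 = [freq.get(i, 0) for i in (100, 101, 102)]
add(freq, "9.85", WIDTH)
step2 = [freq.get(i, 0) for i in range(98, 103)]
add(freq, "-1.0", WIDTH)

new_width = WIDTH * 100                      # upgraded group width 10
coarse = coarsen(freq, 100)
result = [coarse.get(i, 0) for i in range(min(coarse), max(coarse) + 1)]
left_boundary = min(coarse) * new_width
```

Floor division of the interval index by the factor is what moves the left boundary to a multiple of the new group width, including for negative values.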

The above describes adaptive data processing, in which part of the data is read first and the data range and group width are then adjusted as more data is added. The method is not limited to this, however. When the amount of data is relatively small and the data processing capability of the compute node is sufficient, the node basic histogram can also be obtained by reading all of the node data at once. In that case, the whole node data is read once and its data range is determined; the group width of the whole node data is determined from the predetermined number of groups and the data range of the whole node data; the whole node data is divided into multiple intervals according to that group width, and the frequency of each interval is calculated; and the node basic histogram is obtained from the group width of the whole node data and the frequency of each interval. Because the data processing method in this case is essentially the same as that described with reference to Fig. 2 to Fig. 5, differing only in whether all of the data is read at once, its details are omitted here.

Then, as shown in Fig. 2, in step S203 the multiple node basic histograms are combined into one basic histogram. For example, as shown in Fig. 3, after the merge node receives the computation result of compute node 1 (node basic histogram 1), the computation result of compute node 2 (node basic histogram 2) and the computation result of compute node 3 (node basic histogram 3), it merges node basic histogram 1, node basic histogram 2 and node basic histogram 3 into one basic histogram. Fig. 5 is a flowchart of merging node basic histograms in an embodiment of the present application.

Specifically, in step S501, the data range of the overall data is obtained by comparing the data ranges of the multiple node basic histograms. That is, the left and right boundary values of the node basic histograms are compared to obtain the minimum of the left boundaries and the maximum of the right boundaries, which determine the data range of the overall data.

In step S502, the group width of the overall data is determined from the data range of the overall data and the suggested number of groups. This step is processed in the same way as step S401 above. That is, to merge the node basic histograms, the overall data must be grouped, i.e., a unified standard for dividing the data must be established, so the group width of the overall data is determined first.

In step S503, the data range of the overall data is divided into multiple overall intervals according to the group width of the overall data. This step is processed in essentially the same way as step S402 above. That is, the determined group width gives the multiple overall intervals that serve as the unified standard for dividing the data.

In step S504, each node basic histogram is divided into multiple partial intervals according to the multiple overall intervals, and the frequency of each partial interval is determined. That is, each node basic histogram is re-divided according to the unified standard.

It, will be corresponding with whole section in the basic histogram of multiple nodes for each whole section in step S505 The frequency of partial section summarize for the frequency in each whole section.That is, by the basic histogram of each node according to every Summarized in a entirety section.

In step S506, according to the group of conceptual data away from the frequency with each whole section, obtain being directed to conceptual data Basic histogram.

In the following, the method of merging node basic histograms is illustrated with an example.

Suppose there are two compute nodes, compute node 1 and compute node 2. The intervals of the node basic histogram of compute node 1 are [9.8,9.9), [9.9,10), [10,10.1), [10.1,10.2), [10.2,10.3), with corresponding frequencies {1,1,1,1,1}. The intervals of the node basic histogram of compute node 2 are [10,20), [20,30), …, [90,100), with corresponding frequencies {200,200,…,200}. The suggested number of bins is 10.

First, the left and right boundary values of compute node 1 and compute node 2 are compared: the left boundary values are 9.8 and 10 respectively, and the right boundary values are 10.3 and 100 respectively. The minimum left boundary value is therefore 9.8 and the maximum right boundary value is 100, so the data range of the overall data is [9.8,100).

Then, according to the data range [9.8,100) of the overall data and the suggested number of bins 10, the bin width for the overall data is determined to be 10.

Next, the overall data are divided into intervals with bin width 10, so the intervals of the overall data are [0,10), [10,20), [20,30), …, [90,100).

Then, each node basic histogram is re-divided. The intervals of the node basic histogram of compute node 1 are [9.8,9.9), [9.9,10), [10,10.1), [10.1,10.2), [10.2,10.3) with frequencies {1,1,1,1,1}, and the intervals of the overall data are [0,10), [10,20), [20,30), …, [90,100). The node histogram of compute node 1 is therefore re-divided into [0,10), [10,20), [20,30), …, [90,100), and its frequencies become {2,3,0,0,…,0}. Similarly, the intervals of the node basic histogram of compute node 2 are [10,20), [20,30), …, [90,100) with frequencies {200,200,…,200}, so the node histogram of compute node 2 is re-divided into [0,10), [10,20), [20,30), …, [90,100), and its frequencies become {0,200,200,…,200}. The aggregated frequencies of the overall data are thus {2,203,200,…,200}.

The interval division and frequencies of the basic histogram of the overall data are therefore obtained: the intervals are [0,10), [10,20), [20,30), …, [90,100), and the corresponding frequencies are {2,203,200,…,200}.
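The merging example above can be reproduced with a short Python sketch. It is an illustration only, not the implementation of the present application. The function name `merge_node_histograms` and the `(edges, counts)` representation are hypothetical. The bin width is passed in by the caller rather than derived, and each node interval is assigned to the shared interval that contains its midpoint, which is a simplifying assumption for node intervals that do not straddle a shared boundary.

```python
import math


def merge_node_histograms(node_histograms, bin_width):
    """Merge per-node histograms onto one shared set of intervals.

    node_histograms: list of (edges, counts) pairs; edges has one more entry
    than counts, and interval i is [edges[i], edges[i + 1]).
    bin_width: width of the shared intervals, supplied by the caller.
    """
    # S501: overall range = smallest left boundary .. largest right boundary.
    lo = min(edges[0] for edges, _ in node_histograms)
    hi = max(edges[-1] for edges, _ in node_histograms)

    # S503: start the shared intervals at a multiple of the bin width
    # (assumption taken from the worked example, where 9.8 is lowered to 0).
    left = math.floor(lo / bin_width) * bin_width
    n_bins = math.ceil((hi - left) / bin_width)
    merged = [0] * n_bins

    # S504/S505: place each node interval in the shared interval containing
    # its midpoint, then add up the frequencies.
    for edges, counts in node_histograms:
        for i, count in enumerate(counts):
            mid = (edges[i] + edges[i + 1]) / 2
            merged[int((mid - left) // bin_width)] += count

    # S506: the merged histogram is described by its left boundary,
    # bin width and frequencies.
    return left, bin_width, merged


node1 = ([9.8, 9.9, 10.0, 10.1, 10.2, 10.3], [1, 1, 1, 1, 1])
node2 = ([10 * k for k in range(1, 11)], [200] * 9)
left, width, counts = merge_node_histograms([node1, node2], bin_width=10)
print(left, width, counts)
```

With the two node histograms of the example, the merged frequencies are {2,203,200,…,200} over [0,10), [10,20), …, [90,100).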

(basic histogram conversion process)

In order to display a histogram according to the user's needs, the basic histogram obtained through the adaptive processing must be transformed into a target histogram corresponding to those needs.

Therefore, as shown in Fig. 1, in step S102, based on a predetermined target interval or target bin width, a target histogram corresponding to the target interval or target bin width is obtained from the basic histogram, and the target histogram is presented. That is, according to the target interval or target bin width, several adjacent intervals of the basic histogram are merged into one new interval and their frequencies are summed. This yields the target histogram corresponding to the user's needs, which is then displayed to the user for analyzing the data distribution.

In one embodiment of the present application, the target interval or target bin width may be a system default. That is, when an initial query request for a data set is received, the target histogram may be generated and presented according to a default target interval or target bin width (in this case, the system's default display interval or display bin width).

In another embodiment of the present application, the target interval or target bin width may be specified by the initial query request. That is, the user specifies, in the initial query request for the data set, the target interval or target bin width of the target histogram to be displayed. It should be made clear that the target interval or target bin width may not be exactly the same as the actual display interval or display bin width of the target histogram finally displayed, but is as close to it as possible. This is described in more detail later.

In actual computation, the suggested number of bins is usually chosen to be 10000, so the computed basic histogram has tens of thousands of intervals, from which the user cannot directly obtain useful information. The histogram therefore needs to be displayed according to the user's needs, which may include the following: the user may select a data range of interest, i.e. a display interval of interest, the default data range being all the data; the user may choose the default display instead of a specific display interval, in which case about a dozen intervals are generally shown and each display interval contains, as far as possible, the same number of basic-histogram intervals; the histogram may be displayed with the largest number of bins that the current drawing precision allows, where the drawing precision refers to the number of intervals in the displayed histogram; the histogram may be displayed with more bins or with a larger bin width, i.e. with a zoom-in or zoom-out effect; the histogram may be displayed with a bin width entered by the user; for long-tailed data, some intervals may be hidden so that intervals with small frequencies are easier to inspect; and the frequency contained in each display interval, together with its percentage of the total frequency in the current display range, may be shown.

As described above, the basic histogram can be transformed according to the user's various needs. In short, whatever operation is performed in response to a user's need, it ultimately reduces to the following problem: given a target left boundary value, a target right boundary value and a target bin width, obtain the histogram. When the target histogram is displayed, however, it does not follow the target values entered by the user (the target interval or target bin width) exactly. The reason is that the histogram drawn from the user's target values is not necessarily the best histogram under the same conditions. In the data processing method of the present application, the target values entered by the user are therefore usually adjusted appropriately to obtain the best display bin width, display left boundary value and display right boundary value, and hence the best display histogram. The specific calculation steps are as follows:

1. Calculate the display bin width, which must satisfy two conditions: the display bin width is a positive integer multiple of the bin width of the basic histogram; and the display bin width is equal or close to the target bin width.

2. Calculate the display left boundary, which must satisfy two conditions: the display left boundary value is an integer multiple of the display bin width; and the display left boundary value is equal to or less than the target left boundary value.

3. Calculate the display right boundary, which must satisfy two conditions: the display right boundary value is an integer multiple of the display bin width; and the display right boundary value is equal to or greater than the target right boundary value.

4. Obtain the division into display intervals from the display bin width, the display left boundary value and the display right boundary value.

5. For each display interval, calculate the sum of the frequencies of the basic-histogram intervals that belong to the display interval, and use it as the frequency of the display interval.

6. Output the complete display interval division and frequencies.

Here, the target left boundary value, the target right boundary value and the target bin width may be values specified by the user.
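The six calculation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application. The function name `display_histogram` and its parameters are hypothetical. "Close to the target bin width" is taken here to mean the nearest integer multiple, and the left boundary of the basic histogram is assumed to be a multiple of its bin width so that display boundaries coincide with basic-histogram boundaries.

```python
import math


def display_histogram(base_left, base_width, base_counts,
                      target_left, target_right, target_width):
    # Step 1: display bin width = positive integer multiple k of the basic
    # bin width, chosen here as the multiple nearest to the target bin width.
    k = max(1, round(target_width / base_width))
    display_width = k * base_width

    # Steps 2 and 3: widen the target interval outward to multiples of the
    # display bin width.
    display_left = math.floor(target_left / display_width) * display_width
    display_right = math.ceil(target_right / display_width) * display_width

    # Step 4: number of display intervals and index of the first basic
    # interval that falls inside the display range.
    n_display = round((display_right - display_left) / display_width)
    first = round((display_left - base_left) / base_width)

    # Step 5: each display interval sums k adjacent basic intervals; parts
    # lying outside the basic histogram contribute nothing.
    counts = []
    for j in range(n_display):
        lo = max(first + j * k, 0)
        hi = min(first + (j + 1) * k, len(base_counts))
        counts.append(sum(base_counts[lo:hi]) if lo < hi else 0)

    # Step 6: output the display interval division and frequencies.
    return display_left, display_width, counts


# Basic histogram: 40 intervals of width 3 over [0,120); interval i holds i items.
base = list(range(40))
first_query = display_histogram(0, 3, base, 20, 50, 10)
later_query = display_histogram(0, 3, base, 0, 120, 30)
print(first_query)
print(later_query)
```

In this example a target bin width of 10 becomes a display bin width of 9 (three basic intervals) and the target interval [20,50) is widened to [18,54). The second call reuses the same basic histogram with different target values, so the underlying data need not be read again.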

The above describes the data processing method of displaying a target histogram in response to an initial query request for a data set. When different subsequent query requests are issued in succession for the same data set, it is only necessary to re-merge adjacent intervals of the basic histogram according to the other target interval or target bin width specified by the subsequent query request. That is, in response to a subsequent query request for the same data set, another target histogram corresponding to another target interval or another target bin width specified by the subsequent query request is obtained from the basic histogram. Because the merging of adjacent intervals of the basic histogram for a subsequent query request is the same as for the initial query request, its detailed description is omitted here.

According to an embodiment of the present application, data processing equipment is provided. Fig. 6 is a schematic structural diagram of the data processing equipment according to an embodiment of the present application. As shown in Fig. 6, the data processing equipment 600 may include a basic histogram obtaining device 601, a target histogram obtaining device 602 and a transformation device 603.

Specifically, the basic histogram obtaining device 601 is configured to obtain a basic histogram by reading the data in a data set, in response to an initial query request for the data set.

The target histogram obtaining device 602 is configured to obtain, based on a predetermined target interval or target bin width, a target histogram corresponding to the target interval or target bin width from the basic histogram, and to present the target histogram.

The transformation device 603 is configured to, in response to a subsequent query request for the data set, obtain from the basic histogram another target histogram corresponding to another target interval or another target bin width specified by the subsequent query request, and to present the other target histogram.

Furthermore, the basic histogram obtaining device 601 may include a distribution device 611, a node basic histogram obtaining device 612 and a histogram aggregation device 613.

Specifically, the distribution device 611 is configured to distribute the data to multiple compute nodes so that each compute node obtains node data.

The node basic histogram obtaining device 612 is configured so that each compute node obtains a node basic histogram by reading its node data once.

The histogram aggregation device 613 is configured to aggregate the multiple node basic histograms into one basic histogram.

Furthermore, Fig. 7 is a schematic structural diagram of one example of the node basic histogram obtaining device according to an embodiment of the present application. As shown in Fig. 7, the node basic histogram obtaining device 612 may include a node data range determination device 701, a node bin width determination device 702, a node interval division device 703 and a first node basic histogram construction device 704.

Specifically, the node data range determination device 701 is configured to read all the node data at one time to determine the data range of all the node data.

The node bin width determination device 702 is configured to determine the bin width of all the node data according to a predetermined number of bins and the data range of all the node data.

The node interval division device 703 is configured to divide all the node data into multiple intervals according to the bin width of all the node data and to calculate the frequency of each interval.

The first node basic histogram construction device 704 is configured to obtain the node basic histogram based on the bin width of all the node data and the frequency of each interval.

In addition, Fig. 8 is a schematic structural diagram of another example of the node basic histogram obtaining device according to an embodiment of the present application. As shown in Fig. 8, the node basic histogram obtaining device 612 may include an initial bin width determination device 801, an initial interval division device 802, an adaptive processing device 803, a loop reading device 804 and a second node basic histogram construction device 805.

Specifically, the initial bin width determination device 801 is configured to read a portion of the node data at one time as initial data, determine the data range of the initial data, and determine an initial bin width according to a predetermined number of bins and the data range of the initial data.

The initial interval division device 802 is configured to divide the initial data into multiple initial intervals according to the initial bin width and to calculate the frequency of each initial interval.

The adaptive processing device 803 is configured to read remaining node data as new data, process the new data adaptively, and determine the bin width and frequencies of the current data, where the current data include the initial data and the new data.

The loop reading device 804 is configured to, if node data that have not yet been read remain, continue reading the node data and send them to the adaptive processing device 803 until all the node data have been read, so as to obtain the bin width and frequencies of all the node data.

The second node basic histogram construction device 805 is configured to obtain the node basic histogram based on the bin width and frequencies of all the node data.

Furthermore, the adaptive processing device 803 may include a reading device 811, a judgment device 812, a first determination device 813 and a second determination device 814.

The reading device 811 is configured to read new data, where the new data contain one or more data items.

The judgment device 812 is configured to judge whether every data item in the new data lies within the data range of the initial data.

The first determination device 813 is configured to, if the judgment result of the judgment device 812 is yes, further determine which initial interval each data item in the new data belongs to and increase the frequency of that initial interval accordingly, so as to obtain the bin width and frequencies of the current data.

The second determination device 814 is configured to, if the judgment result of the judgment device 812 is no, adjust the multiple initial intervals into multiple new intervals, further determine which new interval each data item in the new data belongs to, and increase the frequency of that new interval accordingly, so as to obtain the bin width and frequencies of the current data.
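The adaptive handling of new data can be sketched in Python as follows, including the case, detailed in the paragraph that follows, in which the bin width is enlarged once too many intervals have been added. This is a minimal illustration, not the implementation of the present application. The class `AdaptiveHistogram`, the limit `max_bins` (standing in for "a certain multiple of the predetermined number of bins") and the widening multiple `factor` are all hypothetical. Bins are indexed by integer multiples of the bin width, so the left boundary is always a multiple of the current bin width.

```python
import math


class AdaptiveHistogram:
    """Bin index i covers [i * width, (i + 1) * width)."""

    def __init__(self, first, width, counts, max_bins, factor=2):
        self.first = first        # index of the leftmost bin
        self.width = width        # current bin width
        self.counts = list(counts)
        self.max_bins = max_bins  # hypothetical limit on the number of bins
        self.factor = factor      # hypothetical integer multiple, must be >= 2

    def add(self, x):
        idx = math.floor(x / self.width)
        if idx < self.first:
            # Value left of the current range: prepend empty bins.
            self.counts = [0] * (self.first - idx) + self.counts
            self.first = idx
        elif idx >= self.first + len(self.counts):
            # Value right of the current range: append empty bins.
            self.counts += [0] * (idx - self.first - len(self.counts) + 1)
        self.counts[idx - self.first] += 1
        # Too many bins: enlarge the bin width until the limit is met.
        while len(self.counts) > self.max_bins:
            self._widen()

    def _widen(self):
        m = self.factor
        new_first = self.first // m
        new_last = (self.first + len(self.counts) - 1) // m
        merged = [0] * (new_last - new_first + 1)
        # Each new bin is the sum of up to m adjacent old bins.
        for i, c in enumerate(self.counts):
            merged[(self.first + i) // m - new_first] += c
        self.first, self.width, self.counts = new_first, self.width * m, merged


h = AdaptiveHistogram(first=0, width=1.0, counts=[1, 1, 1, 1], max_bins=8)
h.add(2.5)  # inside the current range: only one frequency changes
h.add(9.5)  # outside the range: bins are added, then the bin width doubles
print(h.first, h.width, h.counts)
```

Because the new bin width is an integer multiple of the old one and bin boundaries stay at multiples of the bin width, widening never needs the raw data: the new frequencies are sums of existing ones.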

The second case is to increase the number of intervals and change the bin width, i.e. the data range and the bin width are adjusted at the same time. As new data keep arriving, the number of intervals may exceed the number that a compute node can handle. The number of intervals must then be reduced by increasing the bin width, so the bin width is adjusted together with the data range. In the second case, first, for data in the new data that lie outside the data range of the initial data, intervals are added according to the initial bin width; when the sum of the number of added intervals and the number of initial intervals exceeds a certain multiple of the predetermined number of bins, a predetermined multiple of the initial bin width is adopted as the new bin width. Then, the left boundary value of the current data range is adjusted so that the adjusted left boundary value is a multiple of the new bin width, and the adjusted data range is taken as the new data range. Next, the new data range is divided into multiple new intervals according to the new bin width, such that the number of new intervals is greater than or equal to the predetermined number of bins and less than the certain multiple of the predetermined number of bins. Then, the frequency of each new interval is determined. Because the new bin width is a predetermined multiple of the initial bin width and the left boundary value of the new data range is a multiple of the new bin width, each new interval is in effect formed by merging several adjacent intervals of the histogram constituted by the current data. The frequency of each new interval is thus the sum of the frequencies of those adjacent intervals.

Fig. 9 is a schematic structural diagram of one example of the histogram aggregation device according to an embodiment of the present application. As shown in Fig. 9, the histogram aggregation device 613 may include a comparison device 901, an overall bin width determination device 902, an overall interval division device 903, a partial interval division device 904, a frequency aggregation device 905 and a basic histogram construction device 906.

The comparison device 901 is configured to obtain the data range of the overall data by comparing the data ranges of the multiple node basic histograms.

The overall bin width determination device 902 is configured to determine the bin width of the overall data according to the data range of the overall data and the suggested number of bins.

The overall interval division device 903 is configured to divide the data range of the overall data into multiple overall intervals according to the bin width of the overall data.

The partial interval division device 904 is configured to divide each node basic histogram into multiple partial intervals according to the multiple overall intervals, and to determine the frequency of each partial interval.

The frequency aggregation device 905 is configured to, for each overall interval, sum the frequencies of the partial intervals corresponding to that overall interval in the multiple node basic histograms to give the frequency of the overall interval.

The basic histogram construction device 906 is configured to obtain the basic histogram for the overall data according to the bin width of the overall data and the frequency of each overall interval.

The specific implementation of each module included in the data processing equipment 600 of the present application corresponds to the specific implementation of the steps in the method of the present application. In order not to obscure the present application, the details of each module are not described again here.

The method, equipment and system of the present application can be applied in any equipment capable of histogram-based data processing. Such equipment may include, but is not limited to: desktop computers, mobile terminal devices, laptop computers, tablet computers, personal digital assistants and the like.

In a typical configuration, a computing device includes one or more processors (CPU), an input/output interface, a network interface and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity or device that includes the element.

Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.

The above is merely an embodiment of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims

1. A data processing method, characterized by comprising:

In response to an initial query request for a data set, obtaining a basic histogram by reading the data in the data set once, wherein the basic histogram is intermediate data used to construct a target histogram, and the bin width of the basic histogram is smaller than the bin width of the target histogram; and

Based on a predetermined target interval or target bin width, obtaining from the basic histogram a target histogram corresponding to the target interval or target bin width, and presenting the target histogram;

Wherein obtaining from the basic histogram, based on the predetermined target interval or target bin width, the target histogram corresponding to the target interval or target bin width comprises:

Merging several adjacent intervals of the basic histogram into one new interval according to the target interval or target bin width, and summing the corresponding frequencies, to obtain the corresponding target histogram.

2. The method according to claim 1, characterized in that the step of obtaining a basic histogram by reading the data in the data set once further comprises:

Distributing the data to multiple compute nodes so that each compute node obtains node data;

Each compute node obtaining a node basic histogram by reading its node data once; and

Aggregating the multiple node basic histograms into one basic histogram.

3. The method according to claim 2, characterized in that the step of aggregating the multiple node basic histograms into one basic histogram further comprises:

Obtaining the data range of the data by comparing the data ranges of the multiple node basic histograms;

Determining the bin width of the data according to the data range of the data and a suggested number of bins;

Dividing the data range of the data into multiple overall intervals according to the bin width;

Dividing each node basic histogram into multiple partial intervals according to the multiple overall intervals, and determining the frequency of each partial interval;

For each overall interval, summing the frequencies of the partial intervals corresponding to the overall interval in the multiple node basic histograms to give the frequency of the overall interval; and

Obtaining the basic histogram for the data according to the bin width and the frequency of each overall interval.

4. The method according to claim 2 or 3, characterized in that the step of obtaining a node basic histogram by reading the node data once further comprises:

Reading all the node data at one time, and determining the data range of all the node data;

Determining the bin width of all the node data according to a predetermined number of bins and the data range of all the node data;

Dividing all the node data into multiple intervals according to the bin width of all the node data, and calculating the frequency of each interval; and

Obtaining the node basic histogram based on the bin width of all the node data and the frequency of each interval.

5. The method according to claim 2 or 3, characterized in that the step of obtaining a node basic histogram by reading the node data once further comprises:

A determining step of reading a portion of the node data at one time as initial data, determining the data range of the initial data, and determining an initial bin width according to a predetermined number of bins and the data range of the initial data;

A dividing step of dividing the initial data into multiple initial intervals according to the initial bin width, and calculating the frequency of each initial interval;

An adaptive processing step of reading remaining node data as new data, processing the new data adaptively, and determining the bin width and frequencies of the current data, wherein the current data include the initial data and the new data;

A loop reading step of returning to the adaptive processing step if node data that have not yet been read remain, until all the node data have been read, so as to obtain the bin width and frequencies of all the node data; and

An obtaining step of obtaining the node basic histogram based on the bin width and frequencies of all the node data.

6. The method according to claim 5, characterized in that the adaptive processing step further comprises:

Reading the new data, wherein the new data contain one or more data items;

Judging whether every data item in the new data lies within the data range of the initial data;

If the judgment result is yes, further determining which initial interval each data item in the new data belongs to, and increasing the frequency of that initial interval accordingly, so as to obtain the bin width and frequencies of the current data; and

If the judgment result is no, adjusting the multiple initial intervals into multiple new intervals, further determining which new interval each data item in the new data belongs to, and increasing the frequency of that new interval accordingly, so as to obtain the bin width and frequencies of the current data.

7. The method according to claim 6, characterized in that adjusting the multiple initial intervals into multiple new intervals further comprises:

For data in the new data that lie outside the data range of the initial data, adding one or more intervals according to the initial bin width so as to form multiple new intervals, such that the new data lie within the multiple new intervals.

8. The method according to claim 6, characterized in that the step of adjusting the multiple initial intervals into multiple new intervals further comprises:

For data in the new data that lie outside the data range of the initial data, when intervals are added according to the initial bin width and the sum of the number of added intervals and the number of initial intervals exceeds a certain multiple of the predetermined number of bins, adopting a predetermined multiple of the initial bin width as a new bin width;

Adjusting the left boundary value of the data range of the current data to form a new data range, such that the adjusted left boundary value is a multiple of the new bin width; and

Dividing the new data range into multiple new intervals according to the new bin width, such that the number of new intervals is greater than or equal to the predetermined number of bins and less than the certain multiple of the predetermined number of bins.

9. The method according to claim 1, characterized in that the step of obtaining from the basic histogram, based on the predetermined target interval or target bin width, the target histogram corresponding to the target interval or target bin width further comprises:

Determining a display bin width based on the target bin width;

Determining a display left boundary value and a display right boundary value based on the target interval;

Determining each display interval based on the display bin width, the display left boundary value and the display right boundary value;

For each display interval, using the sum of the frequencies of the basic-histogram intervals that belong to the display interval as the frequency of the display interval; and

Obtaining the target histogram according to the display bin width and the frequencies of the display intervals.

10. The method according to claim 1, characterized by further comprising:

In response to a subsequent query request for the data set, obtaining from the basic histogram, based on another target interval or another target bin width specified by the subsequent query request, another target histogram corresponding to the other target interval or other target bin width, and presenting the other target histogram.

11. Data processing equipment, characterized by comprising:

A basic histogram obtaining device, configured to obtain a basic histogram by reading the data in a data set once in response to an initial query request for the data set, wherein the basic histogram is intermediate data used to construct a target histogram, and the bin width of the basic histogram is smaller than the bin width of the target histogram; and

A target histogram obtaining device, configured to obtain from the basic histogram, based on a predetermined target interval or target bin width, a target histogram corresponding to the target interval or target bin width, and to present the target histogram;

Wherein the target histogram obtaining device is specifically configured to merge several adjacent intervals of the basic histogram into one new interval according to the target interval or target bin width, and to sum the corresponding frequencies, to obtain the corresponding target histogram.

12. The equipment according to claim 11, characterized in that the basic histogram obtaining device further comprises:

A distribution device, configured to distribute the data to multiple compute nodes so that each compute node obtains node data;

A node basic histogram obtaining device, configured so that each compute node obtains a node basic histogram by reading its node data once; and

A histogram aggregation device, configured to aggregate the multiple node basic histograms into one basic histogram.

13. The apparatus according to claim 12, characterized in that the histogram aggregation device further comprises:

A comparison device, configured to obtain the data range of the data by comparing the data ranges of the plurality of node basic histograms;

An overall bin width determining device, configured to determine the bin width of the data according to the data range of the data and a suggested number of bins;

An overall interval dividing device, configured to divide the data range of the data into a plurality of overall intervals according to the bin width;

A partial interval dividing device, configured to divide each node basic histogram into a plurality of partial intervals according to the plurality of overall intervals, and to determine the frequency of each partial interval;

A frequency aggregation device, configured to, for each overall interval, sum the frequencies of the partial intervals of the plurality of node basic histograms that correspond to that overall interval to give the frequency of that overall interval; and

A basic histogram constituting device, configured to obtain the basic histogram of the data according to the bin width and the frequency of each overall interval.
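The aggregation recited above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the tuple layout `NodeHist` and the function `merge_node_histograms` are hypothetical, the overall data range is assumed to be non-empty, and the frequency of a partial interval is assumed to be the share of a node bin proportional to its overlap with the overall interval (that is, values are treated as uniformly spread within each node bin). The claim does not state how the partial frequencies are determined.

```python
from typing import List, Tuple

# (left edge, bin width, frequencies) of one histogram
NodeHist = Tuple[float, float, List[float]]


def merge_node_histograms(nodes: List[NodeHist], suggested_bins: int) -> NodeHist:
    # Compare the node ranges to obtain the overall data range.
    lo = min(left for left, _, _ in nodes)
    hi = max(left + w * len(counts) for left, w, counts in nodes)
    # Overall bin width from the range and the suggested number of bins.
    width = (hi - lo) / suggested_bins
    totals = [0.0] * suggested_bins
    for left, w, counts in nodes:
        for j, c in enumerate(counts):
            a, b = left + j * w, left + (j + 1) * w  # node bin [a, b)
            first = min(int((a - lo) / width), suggested_bins - 1)
            last = min(int((b - lo) / width), suggested_bins - 1)
            # Split the node bin into partial intervals, one per overall interval
            # it overlaps, and add each share to that overall interval.
            for i in range(first, last + 1):
                overlap = min(b, lo + (i + 1) * width) - max(a, lo + i * width)
                if overlap > 0:
                    totals[i] += c * overlap / w  # assumed proportional split
    return lo, width, totals
```

The proportional split keeps the total frequency unchanged, at the cost of fractional frequencies whenever a node bin straddles an overall boundary.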

14. The apparatus according to claim 12 or 13, characterized in that the node basic histogram obtaining device further comprises:

A node data range determining device, configured to read all of the node data once and determine the data range of all of the node data;

A first node bin width determining device, configured to determine the bin width of all of the node data according to a predetermined number of bins and the data range of all of the node data;

A first node interval dividing device, configured to divide all of the node data into a plurality of intervals according to the bin width of all of the node data, and to calculate the frequency of each interval; and

A first node basic histogram constituting device, configured to obtain the node basic histogram based on the bin width of all of the node data and the frequency of each interval.

15. The apparatus according to claim 12 or 13, characterized in that the node basic histogram obtaining device further comprises:

An initial bin width determining device, configured to read a portion of the node data once as initial data, determine the data range of the initial data, and determine an initial bin width according to a predetermined number of bins and the data range of the initial data;

An initial interval dividing device, configured to divide the initial data into a plurality of initial intervals according to the initial bin width, and to calculate the frequency of each initial interval;

An adaptive processing device, configured to read remaining node data as new data, process the new data adaptively, and determine the bin width and frequencies of the current data, wherein the current data comprises the initial data and the new data;

A loop reading device, configured to, if node data that has not yet been read remains, continue reading the node data and send it to the adaptive processing device until all of the node data has been read, so as to obtain the bin width and frequencies of all of the node data; and

A second node basic histogram constituting device, configured to obtain the node basic histogram based on the bin width and frequencies of all of the node data.

16. The apparatus according to claim 15, characterized in that the adaptive processing device further comprises:

A reading device, configured to read the new data, wherein the new data comprises one or more data items;

A judging device, configured to judge whether every data item in the new data lies within the data range of the initial data;

A first decision device, configured to, if the result of the judgment is yes, further determine which initial interval each data item in the new data belongs to and correspondingly increase the frequency of that initial interval, so as to obtain the bin width and frequencies of the current data; and

A second decision device, configured to, if the result of the judgment is no, adjust the plurality of initial intervals into a plurality of new intervals, further determine which new interval each data item in the new data belongs to, and correspondingly increase the frequency of that new interval, so as to obtain the bin width and frequencies of the current data.
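The adaptive processing of claims 15 and 16 can be sketched as below. The claims do not say how the initial intervals are adjusted into new intervals, so the strategy used here is an assumption: adjacent intervals are merged in pairs, which doubles the bin width, and the freed half of the bins extends the range toward the out-of-range value. The class `AdaptiveHistogram` and its methods are hypothetical names, the number of bins is assumed to be even, and all values are assumed to be finite.

```python
from typing import Iterable, List


class AdaptiveHistogram:
    def __init__(self, initial: Iterable[float], num_bins: int):
        values = list(initial)                     # initial portion of the node data
        assert values and num_bins % 2 == 0        # assumption: even number of bins
        self.lo = min(values)
        self.width = (max(values) - self.lo) / num_bins or 1.0
        self.counts: List[int] = [0] * num_bins
        for x in values:
            self._count(x)

    @property
    def hi(self) -> float:
        return self.lo + self.width * len(self.counts)

    def _count(self, x: float) -> None:
        # The clamp keeps the initial maximum in the last initial interval.
        i = min(int((x - self.lo) / self.width), len(self.counts) - 1)
        self.counts[i] += 1

    def add(self, x: float) -> None:
        """Process one new data item, adjusting the intervals if it is out of range."""
        while not (self.lo <= x < self.hi):
            self._grow(left=x < self.lo)
        self._count(x)

    def _grow(self, left: bool) -> None:
        n = len(self.counts)
        # Assumed adjustment: merge adjacent intervals in pairs (width doubles).
        merged = [self.counts[i] + self.counts[i + 1] for i in range(0, n, 2)]
        pad = [0] * (n // 2)
        if left:
            self.lo -= n * self.width              # range extends to the left
            self.counts = pad + merged
        else:
            self.counts = merged + pad             # range extends to the right
        self.width *= 2
```

Because merging only adds existing frequencies together, no previously read data has to be read again when the intervals change.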

17. The apparatus according to claim 11, characterized in that the target histogram obtaining device further comprises:

A first determining device, for determining a display bin width based on the target bin width;

A second determining device, for determining a display left boundary value and a display right boundary value based on the target interval;

A third determining device, for determining each display interval based on the display bin width, the display left boundary value, and the display right boundary value;

A calculating device, for calculating, for each display interval, the sum of the frequencies of the bins of the basic histogram that fall within the display interval as the frequency of the display interval; and

An obtaining device, for obtaining the target histogram according to the display bin width and the frequencies of the display intervals.

18. The apparatus according to claim 11, characterized by further comprising:

A changing device, configured to, in response to a subsequent query request for the data set, obtain from the basic histogram, based on another target interval or another target bin width specified by the subsequent query request, another target histogram corresponding to the other target interval or the other target bin width, and to present the other target histogram.