WO2014176754A1  Histogram construction for string data  Google Patents
 Publication number
 WO2014176754A1 (application PCT/CN2013/075033)
 Authority
 WO
 WIPO (PCT)
 Prior art keywords
 prefixes
 nodes
 prefix
 buckets
 bucket
Classifications

 G06F16/24568: Data stream processing; Continuous queries
 G06F16/2246: Trees, e.g. B+trees
 G06F16/23: Updating
 G06F16/24578: Query processing with adaptation to user needs using ranking
 G06F40/10: Text processing
Description
HISTOGRAM CONSTRUCTION FOR STRING DATA
BACKGROUND
[0001] In modern-day environments, large volumes of data are captured from a variety of information sources and managed in databases for purposes including data analysis and database searching. Because of the large volume of data, database management systems use histograms to capture the data distribution and to summarize and represent the data in a concise form. To generate a histogram, the data is partitioned based on the degree of similarity in its characteristics. The histogram, in an example, represents a frequency distribution of occurrence of data with similar characteristics over the entire data.
BRIEF DESCRIPTION OF DRAWINGS
[0002] The detailed description is provided with reference to the accompanying figures. In the figures, the leftmost digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and components.
[0003] Figure 1(a) illustrates a system environment implementing a histogram construction system, according to an example of the present subject matter.
[0004] Figure 1(b) illustrates a histogram construction system, according to an example of the present subject matter.
[0005] Figure 2 illustrates the histogram construction system, according to an example of the present subject matter.
[0006] Figure 3 illustrates a prefix tree for string data, according to an example of the present subject matter.
[0007] Figures 4(a), 4(b), 4(c), 4(d), and 4(e) illustrate iterative revisions of a prefix tree for strings in an online environment, according to an example of the present subject matter.
[0008] Figure 5 illustrates a prefix tree for strings in an offline environment, according to an example of the present subject matter.
[0009] Figure 6 illustrates a method of generation of a histogram for string data, according to an example of the present subject matter.
[0010] Figure 7 illustrates a method of generation of a histogram for string data in an online environment, according to an example of the present subject matter.
[0011] Figure 8 illustrates a method of generation of a histogram for string data in an offline environment, according to an example of the present subject matter.
[0012] Figure 9 illustrates a system environment for generation of a histogram for string data, according to an example of the present subject matter.
DETAILED DESCRIPTION
[0013] The present subject matter relates to methods and systems for generation of histograms for string data. The string data include multiple sequences of characters in the form of strings. A histogram represents a statistical summary of the string data, which may be generated based on a frequency distribution of strings in the string data.
[0014] A histogram is generated by sampling of data into multiple buckets, where each bucket is filled with the data having similar characteristics. Each bucket generally has a defined bucket boundary or sampling span for filling up the data in that bucket. For example, the data may correspond to age of employees in a company. The age data can be sampled into buckets of different age spans. The buckets may have equal or unequal boundary widths. Each bucket may store frequency of occurrence of data lying within the respective bucket boundary. The frequency distribution stored in the buckets summarizes the data, which is referred to as a histogram synopsis. The histogram synopsis can then be used to generate a histogram for the data over the buckets.
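The bucketing of numerical data described above can be sketched as follows. This is an illustrative sketch only, assuming equal-width buckets; the function name `build_histogram_synopsis`, the bucket width of 10, and the sample ages are hypothetical and are not taken from the present subject matter.

```python
from collections import Counter

def build_histogram_synopsis(values, bucket_width, lowest):
    """Count how many values fall inside each fixed-width bucket boundary."""
    counts = Counter()
    for v in values:
        index = (v - lowest) // bucket_width          # which bucket v falls in
        start = lowest + index * bucket_width         # lower boundary of that bucket
        counts[(start, start + bucket_width - 1)] += 1
    return dict(sorted(counts.items()))

# Made-up employee ages, used purely for illustration.
ages = [23, 27, 31, 34, 35, 42, 48, 51, 29, 38]
synopsis = build_histogram_synopsis(ages, bucket_width=10, lowest=20)
for (low, high), frequency in synopsis.items():
    print(f"{low}-{high}: {frequency}")
```

Each key of `synopsis` is a bucket boundary and each value is the frequency stored in that bucket, which together correspond to the histogram synopsis.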
[0015] Histograms of data find utility in various applications, such as data mining, data analytics, and approximate query answering. Histograms enable the data and its relevant information to be stored in a compact and concise manner, which in turn improves the performance of data mining, data analytics, and approximate query answering procedures performed over the histograms. For data mining and big data analytics, the histograms make it possible to fetch required information, draw inferences, and identify deviations in the data distribution quickly. In approximate query answering, user queries can be executed on the histograms, instead of on the entire data, to obtain approximate but quick answers to the user queries.
[0016] Presently available databases and applications deal with different types of data, including numerical and string-based data. Methods of generating histograms for numerical data are common; however, such numerical-data-specific methods cannot be applied to the generation of histograms for string data. Also, histogram generation methods are applicable to static data, i.e., data that is fixed and known prior to generation of histograms. Such methods cannot be used for generation of histograms for data being streamed online in real-time.
[0017] Further, generating histograms has computation costs associated with it. The computation costs generally include time cost and space cost. The time cost refers to the amount of time taken for generation of a histogram, and the space cost refers to the amount of space, i.e., the memory, utilized by a histogram. The methods of generation of histograms for string data typically take time in a quadratic order of the number of data values being considered for the histogram generation, i.e., O(n^2), where n is the number of data values. With the number of data values being substantially large, the time cost of the histogram is substantially high. The histogram generally takes space in a linear order of the number of data values, i.e., O(n), where n is the number of data values for which the histogram is generated. For a histogram generated over a large number of data values, the space cost is also substantially large.
[0018] Methods and systems for generation of histograms for string data are described herein. With the methods and the systems of the present subject matter, histograms can be generated for string data which is static and predefined, and for string data which is streamed online in real-time. The histograms that are generated based on the methods and the systems of the present subject matter have substantially low time and space costs associated with them.
[0019] In accordance with the present subject matter, for generation of a histogram for string data, the strings in the string data are represented as a prefix tree. A prefix tree is a Trie data structure having nodes that represent prefixes of the strings. A prefix of a string is a sequence of characters which is either the same as the string or a leading substring of the string. The nodes in the prefix tree represent longest prefixes and longest common prefixes of the strings. A longest prefix refers to a sequence of characters which is equal to a string. A longest common prefix refers to a sequence of characters which is a common leading substring of one or more strings. For example, for two strings "host" and "hostname", the prefix tree will have a node representing the longest prefix "host" for the string "host", a node representing the longest prefix "hostname" for the string "hostname", and a node representing the longest common prefix "host" for both strings.
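The notion of a longest common prefix used above can be illustrated with a short sketch. The helper name `longest_common_prefix` is hypothetical and is introduced only for illustration; it compares two strings character by character from the start.

```python
def longest_common_prefix(a, b):
    """Return the longest leading sequence of characters shared by a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return a[:n]

print(longest_common_prefix("host", "hostname"))          # host
print(longest_common_prefix("sourcecode", "sourcename"))  # source
```

Two strings that differ in their first character share only the empty prefix, which in a prefix tree corresponds to meeting at the root node.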
[0020] Based on the prefix tree, deploy weights are assigned to the nodes in the prefix tree. A deploy weight of a node is computed based on lengths of the prefixes represented by subtree nodes rooted at that node and based on frequencies of the strings whose prefixes are represented by the subtree nodes. The deploy weight of a node is indicative of a maximum weight preserved upon filling up at least one prefix, represented by the subtree nodes rooted at that node, in a respective bucket. The subtree nodes rooted at one node include that one node and the child nodes of that one node. The values of deploy weights convey the levels of relevancy of the prefixes at the respective nodes for filling up the buckets. The higher valued deploy weights highlight the prefixes that are more relevant for filling up the buckets.
[0021] Further, based on the deploy weights associated with the prefixes of the strings, a predefined number of prefixes can be determined, from amongst the prefixes represented by the nodes of the prefix tree, for filling up the predefined number of buckets. The predefined number of prefixes are determined through maximization of a total weight preserved by the determined prefixes. The total weight preserved is the weight preserved by the determined prefixes, which can be determined based on the deploy weights of the determined prefixes. The predefined number of prefixes that are determined are referred to as Top-prefixes of the string data. Each bucket is filled with one distinct prefix. Also, the prefixes are determined to cover the prefixes associated with a maximum number of distinct strings. The deploy weights associated with the predefined number of prefixes can then be used to generate a histogram for the string data.
[0022] The methods and the systems of the present subject matter enable capturing the distribution of string data and generating histograms with a reduced number of Top-prefixes of strings. By maximizing the total weight preserved by the Top-prefixes, the histogram, in accordance with the present subject matter, captures as much statistical information as possible of the string data. Further, by considering the prefixes of the strings and maximizing the number of prefixes in the Top-prefixes, the coverage of the histogram is over a large (maximum) number of distinct strings in the string data.
[0023] In an example, the number of Top-prefixes may be less than the total number of distinct strings in the string data considered for generation of a histogram. Such a histogram of the Top-prefixes represents the string data in a substantially compact form, which can be used for data mining, data analytics, approximate query answering, etc. Further, since each of the distinct Top-prefixes is filled in a separate bucket, the number of buckets governs the size of the histogram. The space cost and the time cost of the histogram, in accordance with the subject matter, are based on the number of Top-prefixes or the number of buckets in the histogram. This reduces the space cost and the time cost associated with the histograms.
[0024] Further, the methods and the systems of the present subject matter enable the generation of histograms both in an offline environment and in an online environment. In an offline environment, the data is static, and the complete data set and the frequency distribution of strings are known in advance. The histograms may be generated for this predetermined static data set in the offline environment. In an online environment, the data is streamed and received, for example, one-by-one in real-time. The frequency distribution of the streamed strings is not known in advance. Thus, histograms may be generated and updated for the streamed data in real-time in the online environment.
[0025] The above methods and systems are further described in conjunction with Figures 1 to 9. It should be noted that the description and figures merely illustrate the principles of the present subject matter. It is thus understood that various arrangements can be devised that, although not explicitly described or shown herein, embody the principles of the present subject matter and are included within its spirit and scope. Moreover, all statements herein reciting principles, aspects, and embodiments of the present subject matter, as well as specific examples thereof, are intended to encompass equivalents thereof.
[0026] Figure 1(a) schematically illustrates a system environment 100 implementing a histogram construction system 102, according to an example of the present subject matter. The system environment 100 may be a public environment or a private environment. The histogram construction system 102 may be a machine readable instructions based implementation or a hardware based implementation or a combination thereof. The histogram construction system 102 described herein can be implemented in a computing device, such as a server. The histogram construction system 102 in a computing device enables the computing device to generate histograms for string data, in accordance with the present subject matter.
[0027] As shown in Figure 1(a), the histogram construction system 102 is communicatively coupled with a plurality of data sources 104-1, 104-2, ..., 104-N. The data sources 104-1, 104-2, ..., 104-N, hereinafter may be collectively referred to as data sources 104, and individually referred to as a data source 104. The data sources 104 may host data, including string data, in static form. In an example, the histogram construction system 102 can access the data sources 104 to receive the string data in static form, which also refers to a fixed data set, for the generation of histograms. Such an environment for generation of histograms refers to an offline environment.
[0028] Further, as shown in Figure 1(a), the histogram construction system 102 is communicatively coupled with a plurality of communication devices 106-1, 106-2, ..., 106-N through a communication network 108. The communication devices 106-1, 106-2, ..., 106-N, hereinafter may be collectively referred to as communication devices 106, and individually referred to as a communication device 106. The communication device 106 may include a computer, a laptop, a smart phone, a tablet, and the like. In an example, the histogram construction system 102 can communicate with the communication devices 106 to receive string data streamed online in real-time over the communication network 108, for the generation of histograms. Such an environment for generation of histograms refers to an online environment.
[0029] In an example, the communication device 106 may be communicatively coupled to the histogram construction system 102 over the communication network 108 through one or more communication links. The communication links between the communication devices 106 and the histogram construction system 102 are enabled through a desired form of communication, for example, via dial-up modem connections, cable links, digital subscriber lines (DSL), wireless or satellite links, or any other suitable form of communication.
[0030] The communication network 108 may be a wireless network, a wired network, or a combination thereof. The communication network 108 can also be an individual network or a collection of many such individual networks, interconnected with each other and functioning as a single large network, e.g., the Internet or an intranet. The communication network 108 can be implemented as one of the different types of networks, such as an intranet, a local area network (LAN), a wide area network (WAN), the Internet, and such. The communication network 108 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc., to communicate with each other.
[0031] The communication network 108 may also include individual networks, such as, but not limited to, a Global System for Mobile Communication (GSM) network, a Universal Mobile Telecommunications System (UMTS) network, a Long Term Evolution (LTE) network, a Personal Communications Service (PCS) network, a Time Division Multiple Access (TDMA) network, a Code Division Multiple Access (CDMA) network, a Next Generation Network (NGN), a Public Switched Telephone Network (PSTN), and an Integrated Services Digital Network (ISDN).
[0032] Figure 1(b) illustrates the histogram construction system 102, according to an implementation of the present subject matter. In an implementation, the histogram construction system 102 includes processor(s) 110. The processor(s) 110 may be implemented as microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 110 fetch and execute computer-readable instructions stored in the memory. The functions of the various elements shown in Figure 1(b), including any functional blocks labeled as "processor(s)", may be provided through the use of dedicated hardware as well as hardware capable of executing machine readable instructions.
[0033] As shown in Figure 1(b), the histogram construction system 102 includes a data acquiring module 112, a data structure module 114, a Top-prefix finder 116, and a histogram generator 118. The data acquiring module 112, the data structure module 114, the Top-prefix finder 116, and the histogram generator 118 are coupled to the processor(s) 110.
[0034] In an implementation, for the purpose of generation of histograms, the data acquiring module 112 obtains string data comprising strings. The data acquiring module 112 can obtain static string data offline from the data sources 104, and/or can obtain streamed string data online from the communication devices 106. Based on the obtained strings, the data structure module 114 generates a prefix tree for distributing the strings into nodes that represent prefixes of the strings. Based on the nodes in the prefix tree, the Top-prefix finder 116 assigns deploy weights to the nodes. A deploy weight of a node is indicative of a maximum weight preserved upon filling buckets with one or more prefixes represented by the subtree nodes rooted at that node, each in a separate bucket.
[0035] Based on the deploy weights of the nodes, the Top-prefix finder 116 determines a predefined number of Top-prefixes of the strings for filling up the predefined number of buckets. In an example, the predefined number may be a system defined or a user defined number. This predefined number may be defined based on the number of buckets to be filled for a histogram, and based on the size of the histogram to be constructed. The Top-prefixes are determined from the prefixes in the prefix tree, based on maximization of a total weight preserved by the predefined number of prefixes, where the predefined number of prefixes are associated with a maximum number of distinct strings. Each of the Top-prefixes is filled in a separate bucket, and the deploy weight of the node representing each Top-prefix is stored in the corresponding bucket.
[0036] After determining the Top-prefixes for the strings and filling up the buckets, the histogram generator 118 generates a histogram of the Top-prefixes. The histogram is generated based on the Top-prefixes and the corresponding deploy weights associated with the Top-prefixes in the buckets. The generated histograms can be used for applications, such as data mining, data analytics, and approximate query processing.
[0037] Figure 2 illustrates the histogram construction system 102, according to an implementation of the present subject matter. The histogram construction system 102 includes the processor(s) 110 and also interface(s) 202. The interface(s) 202 may include a variety of machine readable instruction based and hardware interfaces that allow the histogram construction system 102 to interact with the data sources 104 and the communication devices 106, as the case may be. Further, the interface(s) 202 may enable the histogram construction system 102 to communicate with other devices, such as network entities, web servers and other external repositories.
[0038] Further, the histogram construction system 102 includes memory 204, coupled to the processor(s) 110. The memory 204 may include any computer-readable medium including, for example, volatile memory (e.g., RAM), and/or non-volatile memory (e.g., EPROM, flash memory, NVRAM, memristor, etc.).
[0039] Further, the histogram construction system 102 includes module(s) 206 coupled to the processor(s) 110. The module(s) 206, amongst other things, include routines, programs, objects, components, data structures, and the like, which perform particular tasks or implement particular abstract data types. The module(s) 206 further include modules that supplement applications on the histogram construction system 102, for example, modules of an operating system.
[0040] The module(s) 206 of the histogram construction system 102 include the data acquiring module 112, the data structure module 114, the Top-prefix finder 116, the histogram generator 118, and other module(s) 210. The other module(s) 210 may include programs or coded instructions that supplement applications and functions, for example, programs in the operating system of the histogram construction system 102.
[0041] Further, the histogram construction system 102 includes data 208. The data 208 serves, amongst other things, as a repository for storing data that may be fetched, processed, received, or generated by the module(s) 206. Although the data 208 is shown internal to the histogram construction system 102, it may be understood that the data 208 can reside in an external repository (not shown in the figure), which may be coupled to the histogram construction system 102. The histogram construction system 102 may communicate with the external repository through the interface(s) 202 to obtain information from the data 208.
[0042] In an implementation, the data 208 of the histogram construction system 102 includes string data 212, prefix data 214, histogram data 216, and other data 218. The string data 212 stores the strings obtained by the histogram construction system 102. The prefix data 214 stores the deploy weights of the nodes, and the data in the buckets. The histogram data 216 stores the histograms generated by the histogram construction system 102. The other data 218 comprise data corresponding to other module(s) 210.
[0043] As mentioned earlier, the histograms can be generated by the histogram construction system 102 in an online environment and in an offline environment. Before describing the procedures for generation of histograms for string data in online and offline environments, a prefix tree that can be used as a data structure for representing strings in the string data is described. The prefix tree is a Trie data structure that distributes the strings into leaf nodes and branch nodes. A leaf node is a terminal node representing the longest prefix of one of the strings. A branch node represents a longest common prefix of one or more prefixes represented by child nodes branching out from that branch node.
[0044] Figure 3 illustrates a prefix tree 300 for representing string data, according to an example of the present subject matter. The prefix tree 300 is for the string data having the following strings: "address", "host", "hostname", "source", "sourcecode", and "sourcename". As shown, Rn0 is a root node of the prefix tree 300 from which nodes for the distinct strings branch out. Bn1 to Bn6 are the branch nodes and Ln1 to Ln6 are the leaf nodes.
[0045] The leaf node Ln1 is a terminal node for the string "address". The leaf node Ln1 represents a prefix "address" which is the longest prefix of the string "address". Similarly, as shown, the leaf nodes Ln2, Ln3, Ln4, Ln5, and Ln6 represent the longest prefix as "host", "source", "hostname", "sourcecode", and "sourcename", respectively, for the other strings. The branch node Bn1 represents a prefix "address" which is the longest common prefix of the prefix represented by the leaf node Ln1. Since only one leaf node Ln1 is branching out from the branch node Bn1, the longest common prefix at the branch node Bn1 is the same as the longest prefix at the leaf node Ln1. Similarly, the branch nodes Bn4, Bn5, and Bn6 represent the longest common prefix as "hostname", "sourcecode", and "sourcename", respectively, based on the respective leaf nodes. Further, the branch node Bn2 represents a prefix "host" which is the longest common prefix of the prefixes represented by the leaf node Ln2 and the branch node Bn4. The branch node Bn3 represents a prefix "source" which is the longest common prefix of the prefixes represented by the leaf node Ln3 and the branch nodes Bn5 and Bn6. Further, the nodes Bn2, Ln2, and Bn4 form a group of subtree nodes rooted at the branch node Bn2. Similarly, the nodes Bn3, Ln3, Bn5, and Bn6 form a group of subtree nodes rooted at the branch node Bn3. In an example, the prefix tree 300 for the string data may include other internal nodes; however, for the sake of simplicity the root node, the branch nodes and the leaf nodes, as described above, are illustrated.
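The branch-node relationships of the prefix tree 300 can be approximated with the following sketch. It is a simplification under stated assumptions: only branch nodes are modelled (each identified by its prefix), leaf nodes and other internal nodes are omitted, and a branch node's parent is taken to be the longest other string in the data that is a proper prefix of it. The function names `build_branch_parents` and `subtree_prefixes` are hypothetical.

```python
def build_branch_parents(strings):
    """Map each distinct string's branch node to its parent branch node.

    A parent of None stands for the root node.
    """
    distinct = set(strings)
    parents = {}
    for s in distinct:
        # Other strings in the data that are proper prefixes of s.
        candidates = [p for p in distinct if p != s and s.startswith(p)]
        parents[s] = max(candidates, key=len) if candidates else None
    return parents

def subtree_prefixes(parents, root):
    """Return root plus every prefix whose chain of parents reaches root."""
    members = {root}
    changed = True
    while changed:
        changed = False
        for child, parent in parents.items():
            if parent in members and child not in members:
                members.add(child)
                changed = True
    return members

strings = ["address", "host", "hostname", "source", "sourcecode", "sourcename"]
parents = build_branch_parents(strings)
print(parents["hostname"])                          # host
print(sorted(subtree_prefixes(parents, "source")))  # ['source', 'sourcecode', 'sourcename']
```

Under these assumptions, "host" and "source" play the roles of the branch nodes Bn2 and Bn3, and the sets returned by `subtree_prefixes` correspond to the prefixes of the subtree nodes rooted at them.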
[0046] The description below describes the generation of histograms by the histogram construction system 102 individually in the online environment and in the offline environment.
Histogram Generation in Online Environment
[0047] In an implementation, for the purpose of generation of histograms in an online environment, the data acquiring module 112 obtains string data online, in real-time, as data streams over the communication network 108. The string data includes strings which are received one-by-one from one or more communication devices 106. Based on the obtained strings, the data structure module 114 generates a prefix tree and iteratively revises the prefix tree to include the strings, as received one-by-one, in the prefix tree. Based on the prefix tree, the Top-prefix finder 116 assigns deploy weights to the nodes, and fills buckets based on the deploy weights. For the purposes of the present subject matter, since one bucket is filled with one distinct prefix, the number of buckets is equal to a predefined number of Top-prefixes to be determined from the prefix tree.
[0048] For determining the predefined number of Top-prefixes from the prefix tree, the Top-prefix finder 116 updates prefixes and corresponding deploy weights in a maximum of the predefined number of buckets for each revision of the prefix tree. The description below describes the process of assigning deploy weights and updating the buckets for determining the Top-prefixes by maximization of the total weight preserved by the prefixes in the buckets over a maximum number of distinct strings. Based on the Top-prefixes and the corresponding deploy weights in the buckets, a histogram can be generated by the histogram generator 118.
[0049] For the purposes of the description herein, let a string be denoted by s, a bucket be denoted by b, a prefix in a bucket b be denoted by pb, a deploy weight in a bucket b be denoted by Wb, and the longest common prefix for two prefixes pb and pb' be denoted by pb ∩ pb'. The prefix pb also refers to a prefix represented by a node, and the deploy weight Wb also refers to a deploy weight of the node representing the prefix pb. Also, the total number of buckets is equal to the predefined number of Top-prefixes that are to be determined for filling the buckets and generating a histogram. Let the predefined number be denoted by k.
[0050] Upon receiving a string s, the data structure module 114 updates the prefix tree to include the string s. The prefix tree may already have a branch with one or more branch nodes and a leaf node for the string s. If not, a new branch with a branch node and a leaf node is created from the root node for including the received string s.
[0051] Based on the revision of the prefix tree, the Top-prefix finder 116 compares the string s with the prefixes stored in the buckets to determine if the string s matches with any of the prefixes in the buckets. If the string s matches with a prefix pb in the bucket b, the deploy weight Wb in the bucket b is revised. The deploy weight Wb is revised based on the frequency of the string s in the obtained string data. For this, the frequency of each string in the string data is maintained. If the received string s is a string already represented in the prefix tree, the frequency of the string s is incremented by 1. If the received string s is a new string, the frequency of the string s is set as 1. Based on the frequency, the deploy weight Wb at the node representing the prefix pb is revised to make it equal to the frequency of the string s. The revised deploy weight Wb is assigned to the node, and the deploy weight Wb in the bucket b is replaced by the revised deploy weight Wb.
[0052] Further, if the string s does not match with any of the prefixes in the buckets, the Topprefix finder 1 16 finds an empty bucket from the total of k number of buckets. Upon finding an empty bucket, the longest prefix of the string s, represented by a leaf node, is filled in that empty bucket. The deploy weight equal to the frequency of the string s is assigned to the leaf node representing the longest prefix of the string s. The deploy weight assigned to the leaf node is stored as the deploy weight W b in the bucket b.
[0053] Further, if the string s does not match with any of the prefixes in the buckets, and no bucket is empty or unfilled, the Top-prefix finder 116 identifies a bucket pair b, b' with prefixes p_b, p_b' for which a loss weight is minimum. The loss weight is indicative of a loss in weight preserved upon filling one bucket b with the longest common prefix p_b ∩ p_b' and releasing or emptying the bucket b'. For the purposes of the description herein, the loss weight is denoted by lw. For the bucket pair b, b', the loss weight lw is computed based on equation (1) below:
$$ lw(b, b') = W_b \left(1 - \frac{|p_b \cap p_{b'}|}{|p_b|}\right) + W_{b'} \left(1 - \frac{|p_b \cap p_{b'}|}{|p_{b'}|}\right), \quad (1) $$
where W_b and W_b' are deploy weights of the prefixes p_b and p_b' in the buckets b and b', respectively, |p_b| is the length of prefix p_b, |p_b'| is the length of prefix p_b', and |p_b ∩ p_b'| is the length of the longest common prefix p_b ∩ p_b'.
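The loss weight can be checked numerically with a minimal Python sketch. It reads equation (1) as each bucket's deploy weight scaled by the fraction of its prefix that lies beyond the longest common prefix, which is the reading that reproduces the values 1 and 0.5 in the "host"/"hostname" example given later in this description. The function names `common_prefix_len` and `loss_weight` are illustrative assumptions and are not part of the described system.

```python
def common_prefix_len(p: str, q: str) -> int:
    """Length of the longest common prefix of p and q."""
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return n


def loss_weight(w_b: float, p_b: str, w_b2: float, p_b2: str) -> float:
    """Weight lost when two bucket prefixes are merged into their common prefix.

    Assumes both prefixes are non-empty (hypothetical helper, not from the source).
    """
    lcp = common_prefix_len(p_b, p_b2)
    return w_b * (1 - lcp / len(p_b)) + w_b2 * (1 - lcp / len(p_b2))


# Weights taken from the worked online example: "host" and "hostname".
first = loss_weight(15, "host", 2, "hostname")
second = loss_weight(14, "host", 1, "hostname")
```

The prefix "host" loses nothing because it is its own common prefix, so the whole loss comes from "hostname", half of whose length lies beyond "host".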
[0054] For identifying a bucket pair b, b' with a minimum loss weight, the loss weights for different pairs of buckets are computed. The pair with the minimum loss weight is identified for further updating of the buckets. In an example, the loss weight for a bucket pair b, b' with prefixes p_b and p_b' is computed if the prefix tree has a branch node representing the longest common prefix p_b ∩ p_b'.
[0055] Further, based on the value of loss weight for the identified pair of buckets, the Top-prefix finder 116 revises or updates the buckets to maximize the total weight preserved by the prefixes in the buckets, and to have the prefixes in the buckets, which are associated with a maximum number of distinct strings. For this, if the loss weight lw for the identified bucket pair b, b' with prefixes p_b and p_b' has a value less than 1, then the bucket b is filled with the longest common prefix p_b ∩ p_b' to replace the prefix p_b in the bucket b. For revision of the deploy weight W_b, the deploy weight of the branch node representing the longest common prefix p_b ∩ p_b' is computed as a sum of the deploy weights W_b and W_b' minus the loss weight lw. This deploy weight is assigned to the branch node representing the longest common prefix p_b ∩ p_b', and replaces the deploy weight W_b in the bucket b. In addition, the other bucket b' is emptied by removing the prefix p_b' and the corresponding deploy weight W_b', and the longest prefix represented by the leaf node for the string s is filled in the bucket b'. For the deploy weight W_b', the deploy weight of the leaf node representing the longest prefix of the string s is assigned to be equal to the frequency of the string s. Since the frequency of the string s is incremented by 1, the deploy weight of the leaf node is increased by 1. This deploy weight of the leaf node is stored as the deploy weight W_b' in the bucket b'.
[0056] The deploy weights in all the buckets are indicative of the total weight preserved by the prefixes in the buckets. With the loss weight for a bucket pair b, b' being less than 1 and by updating the buckets as described above, the total deploy weight in the buckets is reduced by a value less than 1 after merging the contribution of the prefixes p_b and p_b' in the bucket b. The total deploy weight in the buckets is increased by 1 by filling the prefix and the deploy weight associated with the string s in the bucket b'. This facilitates maximizing the total weight preserved by the prefixes in the buckets and filling up the buckets with prefixes associated with a maximum number of strings.
[0057] Further, if the loss weight lw for the identified bucket pair b, b' with prefixes p_b and p_b' has a value equal to 1 or more, then the string s is not considered, and the deploy weights in the buckets are reduced by 1.
[0058] With the revision of deploy weight in the buckets as described above, a deploy weight in one or more buckets may become less than 1 . In an implementation, the buckets for which the deploy weights become less than 1 are released or emptied, and made available for filling during the iterative cycle for the next string.
[0059] The description below describes the details of generating and revising a prefix tree for the incoming strings, revising and assigning deploy weights at the nodes, and updating buckets for generation of a histogram in an online environment through an illustrative example. Consider a case where the string data, obtained in an online environment, includes four strings: "host", "hostname", "address" and "server" with respective frequencies as 15, 2, 20 and 2, and three Top-prefixes are to be determined to fill in a maximum of three buckets for generation of a histogram. The strings are received serially, one-by-one, in real-time. Figures 4(a), 4(b), 4(c), 4(d), and 4(e) illustrate iterative revisions of a prefix tree for the strings in an online environment, according to an example of the present subject matter.
[0060] Initially, the prefix tree only has a root node Rn0, and all the three buckets are empty. In said example, let's say at first the string "host" is received. The prefix tree is revised to include the string "host". Figure 4(a) shows the prefix tree, revised to include the string "host". The prefix tree, as shown in Figure 4(a), has a leaf node Ln1 representing the longest prefix as "host", and has a branch node Bn1 representing the longest common prefix also as "host". As the string "host" is the first string received, the frequency f1 of the string "host" is set as 1 and maintained for the leaf node Ln1. Based on the frequency f1, a deploy weight is assigned to the leaf node Ln1. The deploy weight for the leaf node Ln1 is equal to the frequency f1 at the leaf node Ln1. Now, since all the buckets are empty, the longest prefix of the string "host", represented by the leaf node Ln1, is filled in a first bucket b1, and the deploy weight at the leaf node Ln1 is stored as the deploy weight Wb1 in the first bucket b1.
[0061] After this, let's say the string "host" is again received one-by-one 14 times. Each time, the prefix tree is revised to include the string "host", the frequency f1 at the leaf node Ln1 is incremented by 1, and the deploy weight at the leaf node Ln1 is also incremented by 1 in accordance with the frequency f1. With the string "host" matching each time with the prefix stored in the bucket b1, the deploy weight Wb1 in the bucket b1 is revised in accordance with the deploy weight at the leaf node Ln1. After the iterations, the frequency f1 becomes 15, the deploy weight at the leaf node Ln1 becomes 15, and the deploy weight Wb1 in the bucket b1 becomes 15, as shown in Figure 4(b).
[0062] After this, let's say the string "hostname" is received 2 times. Each time, the prefix tree is revised to include the string "hostname". Figure 4(b) shows the prefix tree revised to include the string "hostname". The prefix tree has a branch node Bn2 representing the longest common prefix as "hostname", and has a leaf node Ln2 representing the longest prefix as "hostname". The branch node Bn2 branches out from the branch node Bn1. The branch node Bn1 now represents the longest common prefix of the strings "host" and "hostname". When the string "hostname" is received for the first time, the frequency f2 of the string "hostname" is set as 1 and maintained for the leaf node Ln2. The deploy weight is assigned to the leaf node Ln2 based on the frequency f2. Further, since the string "hostname" does not match the prefix stored in the bucket b1, the longest prefix of the string "hostname" is filled in a second bucket b2, and the deploy weight at the leaf node Ln2 is stored as the deploy weight Wb2 in the second bucket b2. For the next reception of the string "hostname", the prefix tree is revised to include the string "hostname", the frequency f2 at the leaf node Ln2 is incremented by 1, and the deploy weight at the leaf node Ln2 is also incremented by 1. With the string "hostname" matching with the prefix in the bucket b2, the deploy weight Wb2 in the bucket b2 is revised in accordance with the deploy weight at the leaf node Ln2. After the iterations, the frequency f2 becomes 2, the deploy weight at the leaf node Ln2 becomes 2, and the deploy weight Wb2 in the bucket b2 becomes 2, as shown in Figure 4(b).
[0063] After this, let's say the string "address" is received 20 times. After the iterations for the string "address", the revised prefix tree has another branch node Bn3 representing the longest common prefix as "address" and has another leaf node Ln3 representing the longest prefix as "address", as shown in Figure 4(c). Also, the frequency f3 at the leaf node Ln3 becomes 20, and the deploy weight at the leaf node Ln3 also becomes 20. For the first iteration with the string "address", since the string "address" does not match the prefixes stored in the buckets b1 and b2, and a third bucket b3 is empty, the longest prefix of the string "address" is filled in the third bucket b3. After the iterations, the deploy weight Wb3 in the bucket b3 becomes 20, as shown in Figure 4(c).
[0064] After this, let's say the string "server" is received once. The prefix tree is again revised to include the string "server". The revised prefix tree, as shown in Figure 4(d), has another leaf node Ln4 representing the longest prefix as "server", and has another branch node Bn4 representing the longest common prefix as "server". The frequency of the string "server" is set as 1 and maintained for the leaf node Ln4. The deploy weight, equal to the frequency , is assigned to the leaf node Ln4.
[0065] Now, for updating the buckets, since the string "server" does not match the prefixes in the buckets b1, b2 and b3, and since no more empty buckets are available, a bucket pair is identified for which a loss weight is minimum. As mentioned earlier, the loss weight is computed for those bucket pairs for which the prefix tree has respective branch nodes representing the longest common prefixes of the prefixes in the respective bucket pairs. As shown in Figure 4(d), the prefix tree has one branch node Bn1 representing the longest common prefix of the prefixes in the buckets b1 and b2. The loss weight for the bucket pair b1 and b2 is computed through equation (1):
$$ lw(b_1, b_2) = 15 \left(1 - \frac{4}{4}\right) + 2 \left(1 - \frac{4}{8}\right) = 1. $$
Since the value of loss weight for buckets b1 and b2 is equal to 1, the string "server" is ignored and the deploy weights Wb1, Wb2, Wb3 in the buckets b1, b2, b3 are reduced by 1, as shown in Figure 4(d). In an example, the branch representing the string "server", the frequency f4, and the deploy weight at the leaf node Ln4, are removed from the prefix tree.
[0066] After this, let's say the string "server" is received once again. The revised prefix tree, as shown in Figure 4(e), again has a leaf node Ln4 representing the longest prefix as "server", and has a branch node Bn4 representing the longest common prefix as "server". The frequency f4 of the string "server" is set as 1 and maintained for the leaf node Ln4. The deploy weight, equal to the frequency, is assigned to the leaf node Ln4. Now, again for updating the buckets, since the string "server" does not match the prefixes in the buckets b1, b2 and b3, and since no more empty buckets are available, a bucket pair is again identified for which a loss weight is minimum. Again, buckets b1 and b2 are identified for loss weight computation, and the loss weight for the bucket pair b1 and b2 is computed through equation (1):
$$ lw(b_1, b_2) = 14 \left(1 - \frac{4}{4}\right) + 1 \left(1 - \frac{4}{8}\right) = 0.5. $$
[0067] Since the value of loss weight for buckets b1 and b2 is 0.5 (less than 1), the bucket b1 is filled with the longest common prefix represented by the branch node Bn1. The deploy weight of (14 + 1 - 0.5 = 14.5) is assigned to the branch node Bn1, and this deploy weight is stored as the deploy weight Wb1 in the bucket b1. Also, the prefix "hostname" and the corresponding deploy weight Wb2 are removed from the bucket b2. The longest prefix represented by the leaf node Ln4 is filled in the bucket b2, and the deploy weight at the leaf node Ln4 is stored as the deploy weight Wb2 in the bucket b2. The prefixes and the deploy weights Wb1, Wb2, Wb3 in the buckets b1, b2, b3 are as shown in Figure 4(e). With this, the total deploy weight of the buckets is reduced by 0.5 due to merging of contributions of the strings "host" and "hostname" in the bucket b1, and increased by 1 due to filling up of the bucket b2 with the contribution of the string "server". Also, with this, the prefixes in the buckets are associated with four distinct strings, instead of three distinct strings as shown in Figures 4(c) and 4(d). Further, the prefixes "host", "address" and "server" are the three Top-prefixes determined for filling up the three buckets, and the deploy weights Wb1, Wb2, Wb3 in the buckets b1, b2, b3 can be used for generation of a histogram over the Top-prefixes for the strings.
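The walk-through above can be replayed with a short Python sketch of the per-string bucket update. It is a simplification under stated assumptions, not the described implementation: buckets are held in a plain dictionary rather than alongside a prefix tree, a "match" is taken to be an exact match between the string and a bucket prefix, a matched or newly filled bucket simply counts occurrences, and a bucket pair is considered only when its common prefix is non-empty. The names `lcp`, `update_buckets` and `replay` are hypothetical.

```python
from itertools import combinations


def lcp(p, q):
    """Longest common prefix of two strings."""
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return p[:n]


def update_buckets(buckets, s, k):
    """Process one incoming string s against at most k buckets (prefix -> weight)."""
    if s in buckets:                       # assumption: exact match with a bucket prefix
        buckets[s] += 1
        return
    if len(buckets) < k:                   # an empty bucket is available
        buckets[s] = 1
        return
    best = None                            # (loss weight, p, q, common prefix)
    for p, q in combinations(list(buckets), 2):
        c = lcp(p, q)
        if not c:                          # only the root is shared: pair not considered
            continue
        lw = buckets[p] * (1 - len(c) / len(p)) + buckets[q] * (1 - len(c) / len(q))
        if best is None or lw < best[0]:
            best = (lw, p, q, c)
    if best is not None and best[0] < 1:
        lw, p, q, c = best
        merged = buckets.pop(p) + buckets.pop(q) - lw
        buckets[c] = buckets.get(c, 0) + merged   # merged bucket holds the common prefix
        buckets[s] = buckets.get(s, 0) + 1        # freed bucket takes the new string
    else:
        # Loss weight of 1 or more (or, by assumption, no mergeable pair):
        # ignore s, reduce every weight by 1, release buckets that fall below 1.
        for p in list(buckets):
            buckets[p] -= 1
            if buckets[p] < 1:
                del buckets[p]


def replay(stream, k):
    buckets = {}
    for s in stream:
        update_buckets(buckets, s, k)
    return buckets


stream = ["host"] * 15 + ["hostname"] * 2 + ["address"] * 20 + ["server"] * 2
result = replay(stream, 3)
```

With the four example strings and three buckets, the first "server" is ignored and all weights drop by 1, and the second "server" triggers the merge of "host" and "hostname", leaving weights 14.5, 1 and 19 for "host", "server" and "address".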
[0068] The space cost associated with the histogram generated in the online environment is O(k), as a maximum of k buckets are used for filling up with the k Top-prefixes for generation of the histogram. Further, the time cost associated with the histogram generated in the online environment for each iterative revision of the prefix tree based on a new string is O(k), as each update of a maximum of k buckets takes time of the order of k. The total time cost associated with the histogram depends on the number of strings received in the online environment.
[0069] Although the example of generation of a histogram in the online environment is described for a few strings, the histogram construction system 102 can perform the same procedure with a substantially large number of strings to determine a predefined number of Top-prefixes and generate histograms based on the Top-prefixes for the strings.
Histogram Generation in Offline Environment
[0070] In an implementation, for the purpose of generation of histograms in an offline environment, the data acquiring module 112 obtains string data in an offline manner from one or more data sources 104. The string data includes static strings with a predefined frequency distribution. The predefined frequency distribution has a frequency of each of the static strings in the string data. In an implementation, the frequencies of the strings can be obtained from the respective data sources 104, or can be determined by the data acquiring module 112 after obtaining the static strings.
[0071] The description below describes the process of generating a prefix tree, assigning deploy weights to the nodes, and determining a predefined number of Top-prefixes by maximization of total weight preserved by the prefixes in the buckets over a maximum number of distinct strings. For the purposes of the description herein, let a string be denoted by s, a frequency of string s be denoted by f(s), a node of the prefix tree be denoted by d, a fractional weight of node d be denoted by fw_d, and a prefix represented by node d be denoted by p_d. Also, the total number of buckets is equal to the predefined number of Top-prefixes that are to be determined for filling the buckets and generating a histogram. Let the predefined number be denoted by k.
[0072] Since, in the offline environment, the string data set with all the strings is known for generation of a histogram, the data structure module 114 generates a prefix tree for all the distinct strings in the string data set. For determining the predefined number of Top-prefixes from the prefix tree, in an implementation, the Top-prefix finder 116 performs a breadth first search to traverse the prefix tree and determine a reverse traverse order for the nodes. The reverse traverse order captures a sequential order of nodes from the bottom of the prefix tree, i.e., from the leaf nodes, towards the top of the prefix tree, i.e., towards the root node.
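A reverse traverse order of this kind can be obtained by running an ordinary breadth first search from the root and reversing the visit sequence, so that every node appears before its parent. The Python sketch below illustrates this under the assumption that the tree is given as a mapping from a node to its list of child nodes; the function name `reverse_traverse_order` and the sample node labels are illustrative only.

```python
from collections import deque


def reverse_traverse_order(children, root):
    """Return nodes in reverse breadth-first order (deepest levels first, root last)."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))   # nodes without an entry are leaves
    return order[::-1]


# Hypothetical tree: the root has two children, one of which has its own child.
children = {"root": ["host", "server"], "host": ["hostname"]}
order = reverse_traverse_order(children, "root")
```

Because a child is always visited after its parent in a breadth first search, reversing the sequence guarantees that per-node quantities of the child nodes are available when their parent is processed.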
[0073] After determining the reverse traverse order, the Top-prefix finder 116 computes a fractional weight for each of the nodes in the prefix tree in accordance with the reverse traverse order. The fractional weight of a j^{th} leaf node is computed based on equation (2) below:
$$ fw_{d_j} = f(s_j), \quad (2) $$
where f(s_j) is the frequency of the j^{th} string whose longest prefix p_{dj} is represented by the j^{th} leaf node. The fractional weight of a j^{th} branch node is computed based on equation (3) below:
$$ fw_{d_j} = \sum_{i=1}^{m} fw_{d_i} \times \frac{|p_{d_j}|}{|p_{d_i}|}, \quad (3) $$
where m is equal to the number of child nodes of the j^{th} branch node, fw_{di} is the fractional weight of the i^{th} child node of the j^{th} branch node, |p_{dj}| is the length of the prefix p_{dj} represented by the j^{th} branch node, and |p_{di}| is the length of the prefix p_{di} represented by the i^{th} child node of the j^{th} branch node.
[0074] Since the fractional weights are computed in accordance with the reverse traverse order, the fractional weights of childnodes are known for computing the fractional weight of a branch node. The fractional weight of a leaf node is a measure of a weight preserved by the leaf node with respect to the frequency of the string associated with the leaf node. And, the fractional weight of a branch node is a measure of a fractional weight preserved by the branch node depending on contributions of its childnodes for weight preservation. The fractional contributions for a branch node are governed by the ratios of the length of the prefix at the branch node and the length of the prefix at the respective childnodes.
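The bottom-up computation of fractional weights can be sketched in Python as follows. The sketch takes the fractional weight of a leaf node to be the frequency of its string and that of a branch node to be the sum of its children's fractional weights, each scaled by the ratio of the branch prefix length to the child prefix length. The node labels, the dictionary-based tree layout, and the sample frequencies (15 for "host" and 2 for "hostname", borrowed from the online example) are assumptions made for illustration.

```python
def fractional_weights(prefix, children, freq, order):
    """Compute fractional weights for all nodes, visiting children before parents.

    prefix:   node -> prefix string represented by the node
    children: node -> list of child nodes (leaf nodes have no entry)
    freq:     leaf node -> frequency of the string it represents
    order:    nodes listed so that every child precedes its parent
    """
    fw = {}
    for d in order:
        kids = children.get(d, [])
        if not kids:
            fw[d] = freq[d]                     # leaf node: frequency of its string
        else:
            fw[d] = sum(fw[c] * len(prefix[d]) / len(prefix[c]) for c in kids)
    return fw


# Hypothetical tree: branch "host" -> leaf "host" and branch "hostname" -> leaf "hostname".
prefix = {"B_host": "host", "L_host": "host",
          "B_hostname": "hostname", "L_hostname": "hostname"}
children = {"B_host": ["L_host", "B_hostname"], "B_hostname": ["L_hostname"]}
freq = {"L_host": 15, "L_hostname": 2}
order = ["L_hostname", "B_hostname", "L_host", "B_host"]
fw = fractional_weights(prefix, children, freq, order)
```

Here the branch node for "host" keeps the full weight 15 of the string "host" and half of the weight 2 of "hostname", since "host" covers 4 of the 8 characters of "hostname".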
[0075] After computing the fractional weights for all the nodes, the Top-prefix finder 116 assigns deploy weights to the nodes. For a node d, a number of deploy weights are computed and assigned to the node d depending on the number of buckets, from 1 to at most k buckets, which can be possibly filled by the prefixes at the subtree nodes rooted at the node d and by the prefixes at further subtree nodes rooted at child nodes of the node d. For the purposes of the description herein, let the deploy weight assigned to the node d be denoted by dw_d. Let dw_{d}^{1}, dw_{d}^{2}, ..., dw_{d}^{k} denote the deploy weight of the node d when 1, 2, ..., k buckets are filled with 1, 2, ..., k prefixes represented by the subtree nodes rooted at the node d and by the further subtree nodes rooted at the child nodes of the node d. The deploy weight dw_{d}^{t} is indicative of a maximum weight preserved upon filling t buckets with t prefixes represented by the subtree nodes rooted at the node d and by the further subtree nodes rooted at the child nodes of the node d.
[0076] In addition, for each node d and against each deploy weight dw_{d}^{t}, the combination of subtree nodes representing the prefixes, for which the weight preserved is maximum, is determined as an arrangement set. Let the arrangement set for the deploy weight dw_{d}^{t} be denoted by {arr_{d}^{t}}. The arrangement set {arr_{d}^{t}} is indicative of the subtree nodes at node d whose prefixes, if filled in the t buckets, will result in the maximum weight preservation.
[0077] In addition, for each node d and against each deploy weight dw_{d}^{t}, depending on the arrangement set {arr_{d}^{t}}, a leak weight is computed. Let the leak weight for the deploy weight dw_{d}^{t} and the arrangement set {arr_{d}^{t}} be denoted by lw_{d}^{t}. The leak weight lw_{d}^{t} is indicative of leaking information across the node d when t buckets are filled. The leak weight lw_{d}^{t} is a measure of total information of the subtree nodes at the node d minus the deploy weight dw_{d}^{t}.
[0078] The description below describes the computation and determination of the deploy weights dw_d, the leak weights lw_d and the arrangement sets {arr_d}, which can be followed for each node d. The deploy weights dw_d, the leak weights lw_d and the arrangement sets {arr_d} are computed and determined for the nodes in accordance with the reverse traverse order. With this, the deploy weights dw_d and the leak weights lw_d of child nodes are known for computing the deploy weights dw_d and the leak weights lw_d of a branch node.
[0079] For each leaf node, since there is no child branch node, only one bucket (t = 1) can be filled by the prefix represented by the leaf node. The deploy weight dw_{dj}^{1} for the j^{th} leaf node is computed based on equation (4) below:
$$ dw_{d_j}^{1} = fw_{d_j}, \quad (4) $$
where fw_{dj} is the fractional weight for the j^{th} leaf node. The leak weight lw_{dj}^{1} for the j^{th} leaf node is zero, and the corresponding arrangement set {arr_{dj}^{1}} refers to the leaf node.
[0080] For a node d other than the leaf nodes, one to at most k buckets (t = 1 to k) can possibly be filled by the prefixes at the subtree branch nodes rooted at that node d . The number of buckets that can be filled depends on the number of subtree child nodes rooted at that node d. Let's say the j^{th} node dj in the prefix tree has q number of child branch nodes in the subtree rooted at the node dj. Then the number of subtree branch nodes rooted at the node d_{j} is equal to q + 1.
[0081] For the j^{th} node d_j, with one bucket being possibly filled, i.e., t = 1, the deploy weight dw_{dj}^{1} is computed based on equation (5) below:
$$ dw_{d_j}^{1} = \max \{ fw_{d_j},\ dw_{d_i}^{1} : i = 1 \text{ to } q \}, \quad (5) $$
where fw_{dj} is the fractional weight for the node d_j, and dw_{di}^{1} is the deploy weight of the i^{th} child branch node of the node d_j for one filled bucket. The function max {} means that the deploy weight dw_{dj}^{1} takes the value which is maximum among fw_{dj} and the dw_{di}^{1} values.
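The single-bucket case can be illustrated with a small Python sketch that takes the maximum of the node's own fractional weight and the one-bucket deploy weights of its child branch nodes, and records which node supplied that maximum as the arrangement set. The function name `one_bucket` and the data layout are illustrative assumptions; the numeric inputs are the values quoted for the nodes d8 and d3 in the offline example later in this description.

```python
def one_bucket(node, fw_node, child_results):
    """Deploy weight and arrangement set of a node when exactly one bucket is filled.

    child_results maps each child branch node to its own (deploy weight, arrangement)
    for one bucket. Ties are resolved in favour of the node itself (an assumption).
    """
    best_dw, best_arr = fw_node, [node]
    for child, (dw, arr) in child_results.items():
        if dw > best_dw:
            best_dw, best_arr = dw, arr
    return best_dw, best_arr


# Node d8 with fractional weight 44/3 and one child branch node d11 (deploy weight 10).
dw_d8, arr_d8 = one_bucket("d8", 44 / 3, {"d11": (10, ["d11"])})

# Node d3 with fractional weight 92/3 and child branch nodes d6, d7, d8, d9.
dw_d3, arr_d3 = one_bucket(
    "d3", 92 / 3,
    {"d6": (9, ["d6"]), "d7": (5, ["d7"]), "d8": (dw_d8, arr_d8), "d9": (10, ["d9"])},
)
```

In both sample cases the node's own fractional weight is the largest value, so the single bucket is assigned to the node itself rather than to any of its descendants.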
[0082] Further, for t = 1, the arrangement set {arr_{dj}^{1}} refers to a node, from the subtree branch nodes rooted at the node d_j, whose value is taken as the deploy weight dw_{dj}^{1}. Further, for t = 1, the leak weight lw_{dj}^{1} is computed based on equation (6) below:
$$ lw_{d_j}^{1} = \begin{cases} 0, & \text{if } \{arr_{d_j}^{1}\} \text{ includes the node } d_j \\ \ldots, & \text{if } \{arr_{d_j}^{1}\} \text{ does not include the node } d_j \end{cases} \quad (6) $$
where lw_{di}^{0} = fw_{di}, |p_{dj}| is the length of the prefix at the node d_j, and |p_{di}| is the length of the prefix at the i^{th} child branch node of the node d_j.
[0083] Further, for the j^{th} node d_j, with the possible number of buckets filled being equal to the number of subtree branch nodes of the node d_j, i.e., t = q, the deploy weight dw_{dj}^{q} is computed based on equation (7) below:
$$ dw_{d_j}^{q} = f(s_j) + \sum_{i=1}^{q} f(s_i), \quad (7) $$
where f(s_j) is the frequency of the string s_j whose prefix is represented by the node d_j, and f(s_i) is the frequency of the string s_i whose prefix is represented by the i^{th} child branch node of the node d_j.
[0084] Further, for t = k, the arrangement set {arr_{dj}^{k}} refers to the subtree branch nodes rooted at the node d_j. Further, for t = k, the leak weight lw_{dj}^{k} is zero.
[0085] Further, for the j^{th} node d_j, with the possible number of buckets filled being more than one and less than the number of subtree branch nodes at the node d_j, i.e., 1 < t < k < q+1, and for computing the deploy weight dw_{dj}^{t}, a term "deployment factor" denoted by x is defined for the node d_j. The deployment factor x_i denotes a number of buckets that can be filled by or deployed on the subtree branch nodes rooted on the i^{th} child branch node of the node d_j. With q child branch nodes of the node d_j, x_1 refers to the number of buckets that can be filled by the subtree branch nodes rooted on the first child branch node, x_2 refers to the number of buckets that can be filled by the subtree branch nodes rooted on the second child branch node, and so on. Here x_0 refers to the number of buckets that can be filled by the node d_j. Thus, x_0 can be either 1 or 0, for a bucket filled by the node d_j and not filled by the node d_j, respectively. For various possible values of x_0, x_1, x_2, ..., x_q for the node d_j, each deployment factor set {X} is defined as {x_0, x_1, x_2, ..., x_q}.
[0086] Now, for computing the deploy weight dw_{dj}^{t}, all possible combinations of deployment factors x are enumerated in the deployment factor sets {X_t}, such that ∑ x_i = t, where i = 0 to q. With this, the deploy weight dw_{dj}^{t} is computed based on equation (8) below:
where dw_{di}^{xi} is the deploy weight of the i^{th} child branch node at the node d_j, lw_{di}^{xi} is the leak weight of the i^{th} child branch node at the node d_j, |p_{dj}| is the length of the prefix at the node d_j, and |p_{di}| is the length of the prefix at the i^{th} child branch node of the node d_j. Here lw_{di}^{0} = fw_{di}, and max_{X_t}{} means a value which is maximum over all the enumerated deployment factor sets {X_t} for the node d_j.
[0087] Further, the arrangement set {arr_{dj}^{t}} is determined based on the deployment factors in the deployment factor set {X_t} which decide the deploy weight dw_{dj}^{t}. Based on the determined arrangement set {arr_{dj}^{t}}, the leak weight lw_{dj}^{t} is computed through equation (9) below:
[0088] Based on equations (8) and (9), the deploy weight dw_{dj}^{t} and the leak weight lw_{dj}^{t} are computed, and the arrangement set {arr_{dj}^{t}} is determined with t = 2, 3, and so on, up to t < k < q+1 for each node d_j. These computations enable identifying the combinations of nodes in each branch rooted at the root node of the prefix tree, for which the weight preserved is maximum when 1 to at most k buckets are filled by the prefixes at those combinations of nodes.
[0089] After determining the deploy weights, the leak weights, and the arrangement sets for the leaf nodes and the branch nodes of the prefix tree, the deploy weights and the arrangement sets are computed and determined for the root node of the prefix tree in the manner as described above using equations (5), (7) and (8). For this, the node d is considered as the root node in equations (5), (7) and (8).
[0090] Based on the computations for the root node, the arrangement set {arr_{Rn0}^{k}} captures and refers to those k nodes whose prefixes when filled in the k buckets preserve the maximum weight. The prefixes represented by such k nodes are the Top-prefixes that can be filled in the k buckets. Subsequent to this, the histogram generator 118 generates a histogram for the strings received in the offline environment based on the deploy weights of those k nodes identified from the arrangement set {arr_{Rn0}^{k}}.
[0091] In an implementation, for each node d, the deploy weights dw_d, the leak weights lw_d and the arrangement sets {arr_{d}^{t}} are stored as elements of an array. Let the array for the node d be denoted by V_d.
[0092] The description below describes the details of generating a prefix tree for the static strings, assigning deploy weights to nodes, and determining a predefined number of Top-prefixes to fill in the predefined number of buckets for generation of a histogram in an offline environment through an illustrative example. Consider a case where the string data, obtained in an offline environment, includes strings s as listed in Table 1 below. Table 1 also lists frequencies f(s) for the received strings. Let's say three Top-prefixes are to be determined to fill a maximum of three buckets, i.e., the maximum value of k is 3, for generation of a histogram.
Table 1
[0093] Figure 5 illustrates a prefix tree for the strings in an offline environment, according to an example of the present subject matter. The prefix tree, as shown, has a root node, multiple branch nodes and multiple leaf nodes based on the strings. Initially, the prefix tree is traversed by performing a breadth first search, and a reverse traverse order for the nodes is determined. The nodes in the prefix tree are sequentially numbered in accordance with the reverse traverse order, as shown in Figure 5. For the purpose of the description herein, a node is denoted as d_j where j is the node number of that node. Table 2 lists the node number according to the reverse traverse order, and indicates the prefix p_d represented by the corresponding node d. The node d1 is the root node, the nodes d2, d3, d4, d5, d6, d7, d8, d9, d10, and d11 are the branch nodes, and the nodes d12, d13, d14, d15, d16, d17, d18, d19, d20, and d21 are the leaf nodes.
Table 2
[0094] After this, in accordance with the reverse traverse order, a fractional weight fw_d of each of the nodes is computed. The fractional weight fw_d of the leaf nodes is computed using equation (2) and the fractional weight fw_d of the branch nodes is computed using equation (3). The values of fractional weights of the nodes are listed in Table 2. Some example computations of the fractional weights are illustrated below:
For node d11: $$ fw_{d_{11}} = fw_{d_{21}} \times \frac{|hostnameABCD|}{|hostnameABCD|} = 10 \times \frac{12}{12} = 10, $$
For node d4: $$ fw_{d_4} = fw_{d_{14}} \times \frac{|p_{d_4}|}{|p_{d_{14}}|} + fw_{d_{10}} \times \frac{|p_{d_4}|}{|p_{d_{10}}|} = 5 + 3 = 8, $$ and
For node d3: $$ fw_{d_3} = fw_{d_{13}} \times \frac{|p_{d_3}|}{|p_{d_{13}}|} + fw_{d_6} \times \frac{|p_{d_3}|}{|p_{d_6}|} + fw_{d_7} \times \frac{|p_{d_3}|}{|p_{d_7}|} + fw_{d_8} \times \frac{|p_{d_3}|}{|p_{d_8}|} + fw_{d_9} \times \frac{|p_{d_3}|}{|p_{d_9}|} = \frac{92}{3}. $$
[0095] After computing the fractional weights fw_d for all the nodes, the deploy weights dw_{d}^{t}, the leak weights lw_{d}^{t}, and the arrangement sets {arr_{d}^{t}} are computed and/or determined for all the nodes, in accordance with the reverse traverse order. The computations and determinations are carried out in a manner as described earlier. In an example, for each node d, the deploy weights dw_{d}^{t}, the leak weights lw_{d}^{t}, and the arrangement sets {arr_{d}^{t}} are stored in an array V_d with at most k cells, where the t^{th} cell of the array V_d is {V_{d}^{t}} = {dw_{d}^{t}, lw_{d}^{t}, {arr_{d}^{t}}}. For a node d, t can take values from 1 up to the smaller of k and q+1, where q is the number of child branch nodes at the node d, and q+1 refers to the number of subtree branch nodes rooted at the node d.
[0096] Table 3 illustrates values of the deploy weights, the leak weights and the arrangement sets for the leaf nodes. Since only one bucket can be filled by the prefix represented by a leaf node, the value of t is equal to 1 and the array V_d has one cell for each leaf node. The value of dw_{d}^{1} for each leaf node is computed through equation (4).
Table 3
{V_{d12}^{1}} = {dw_{d12}^{1}, lw_{d12}^{1}, {arr_{d12}^{1}}} | {5, 0, {d12}}
[0097] Table 4 illustrates values of the deploy weights, the leak weights and the arrangement sets for the branch nodes. For the nodes d11, d10, d9, d7, d6, d5 and d2, only one bucket can possibly be filled by the prefix at the respective nodes. Thus, t is equal to 1, and the corresponding array V_d has one cell. For the node d4, one or two buckets can possibly be filled by the prefixes at the subtree branch nodes rooted at the node d4. Thus, t can be equal to 1 or 2, and the array V_{d4} has 2 cells, {V_{d4}^{1}} and {V_{d4}^{2}}. Similarly, for the node d8, t can be equal to 1 or 2, and the array V_{d8} has 2 cells, {V_{d8}^{1}} and {V_{d8}^{2}}. The values of deploy weights dw_{d}^{t} and leak weights lw_{d}^{t} for the branch nodes are computed through equations (5) to (9).
Table 4
[0098] Some example computations of the deploy weights, the leak weights, and the arrangement sets are illustrated below:
[0099] For the node d8, with t = 1:
$$ dw_{d_8}^{1} = \max\{fw_{d_8},\ dw_{d_{11}}^{1}\} = \max\left\{\frac{44}{3},\ 10\right\} = \frac{44}{3}, $$
$$ \{arr_{d_8}^{1}\} = \{d_8\}. $$
[0100] For the node d8, with t = 2:
$$ dw_{d_8}^{2} = 8 + 10 = 18, $$
$$ lw_{d_8}^{2} = 0, $$
$$ \{arr_{d_8}^{2}\} = \{d_8, d_{11}\}. $$
[0101] For the node d3, with t = 1:
$$ dw_{d_3}^{1} = \max\{fw_{d_3},\ dw_{d_6}^{1},\ dw_{d_7}^{1},\ dw_{d_8}^{1},\ dw_{d_9}^{1}\} = \max\left\{\frac{92}{3},\ 9,\ 5,\ \frac{44}{3},\ 10\right\} = \frac{92}{3}, $$
$$ lw_{d_3}^{1} = 0, $$
$$ \{arr_{d_3}^{1}\} = \{d_3\}. $$
[0102] For the node d3, with t = 2, the possible deployment factor sets {X_2} are shown in Table 5. The node d3 has four child branch nodes d6, d7, d8 and d9. The deploy weight dw_{d3}^{2} is computed using equation (8) over all the possible deployment factor sets {X_2}. The deploy weight dw_{d3}^{2} takes the value corresponding to the deployment factor set {1, 0, 0, 1, 0}. Thus, for the node d3:
dw_{d3}^{2} = [expression illegible in the source],
lw_{d3}^{2} = 0,
{arr_{d3}^{2}} = {d_{3}, d_{8}}.
Table 5
{0, 0, 0, 1, 1}
{0, 0, 0, 2, 0}
[0103] After this, the deploy weights and the arrangement sets are computed and determined for the root node of the prefix tree using equations (5), (7), and (8). For the root node, with t = 1: dw_{Rno}^{1} = 92/3 and {arr_{Rno}^{1}} = {d_{3}}. With t = 2: dw_{Rno}^{2} = 116/3 and {arr_{Rno}^{2}} = {d_{3}, d_{4}}. And, with t = 3: dw_{Rno}^{3} = 137/3 and {arr_{Rno}^{3}} = {d_{3}, d_{4}, d_{5}}. Based on the computations for the root node, the nodes d_{3}, d_{4} and d_{5}, as indicated in the arrangement set {arr_{Rno}^{3}}, are the three nodes whose prefixes, when filled in three buckets, preserve the maximum weight. Thus, the prefixes "host", "server" and "code" represented by the nodes d_{3}, d_{4} and d_{5} are the three Top-prefixes determined for filling up the three buckets, and the deploy weights associated with these nodes are stored in the buckets, which can be used for generation of a histogram for the strings.
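As a quick check on the root-node totals quoted above (92/3, 116/3 and 137/3 for t = 1, 2 and 3), the weight gained by each additional bucket can be derived by differencing. This is a minimal sketch; the helper name `marginal_gains` is hypothetical, and the per-bucket gains it prints are inferred from the totals rather than stated in the source.

```python
from fractions import Fraction

# Root-node deploy weights for t = 1, 2, 3, as quoted in the example.
root_deploy_weight = {
    1: Fraction(92, 3),
    2: Fraction(116, 3),
    3: Fraction(137, 3),
}

def marginal_gains(weights):
    """Weight added by the t-th bucket: total(t) minus total(t - 1)."""
    gains = {}
    previous = Fraction(0)
    for t in sorted(weights):
        gains[t] = weights[t] - previous
        previous = weights[t]
    return gains

gains = marginal_gains(root_deploy_weight)
for t, gain in gains.items():
    print(t, gain)
```

Under these totals, the second and third buckets each add less weight than the one before, which is what one would expect when prefixes are chosen in order of preserved weight.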
[0104] The space cost associated with the histogram generated in the offline environment is O(D·k·f), as the D distinct strings are represented by D leaf nodes, and a maximum of k buckets are filled with k prefixes. Here, f denotes the maximum fanout of the prefix tree, which is indicative of the maximum number of distinct characters that can be a part of a string. Further, the time cost associated with the histogram generated in the offline environment is O(D·k·k^{g}), as D leaf nodes are parsed to fill k buckets, and, for one node, a maximum of k buckets are distributed to the g child nodes of that node.
[0105] Although the example of generation of a histogram in the offline environment is described for a few strings, the histogram construction system 102 can perform the same procedure with a substantially large number of strings to determine a predefined number of Top-prefixes and generate histograms based on the Top-prefixes for the strings.
[0106] Figure 6 illustrates a method 600 of generation of a histogram for string data, according to an example of the present subject matter. Figure 7 illustrates a method 700 of generation of a histogram for string data in an online environment, according to an example of the present subject matter. Figure 8 illustrates a method 800 of generation of a histogram for string data in an offline environment, according to an example of the present subject matter. The order in which the methods 600, 700, and 800 are described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the methods 600, 700, and 800, or an alternative method. Additionally, individual blocks may be deleted from the methods 600, 700, and 800 without departing from the spirit and scope of the subject matter described herein.
[0107] Furthermore, the methods 600, 700, and 800 can be implemented by processor(s) or computing devices in any suitable hardware, non-transitory machine readable instructions, or combination thereof. It may be understood that steps of the methods 600, 700, and 800 may be executed based on instructions stored in a non-transitory computer readable medium, as will be readily understood. The non-transitory computer readable medium may include, for example, digital data storage media, digital memories, magnetic storage media, such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media.
[0108] Further, although the methods 600, 700, and 800 may be implemented in computing devices in different network environments for generation of histograms for string data, in the examples described in Figure 6, Figure 7, and Figure 8, the methods 600, 700, and 800 are explained in the context of the aforementioned histogram construction system 102, for ease of explanation.
[0109] Referring to Figure 6, at block 602, a prefix tree is generated for strings in string data. The strings are received and the prefix tree is generated by the histogram construction system 102. The strings may be received in an online environment or an offline environment. The prefix tree includes nodes that represent prefixes of the received strings.
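The prefix-tree construction of block 602 can be sketched as follows. This is a minimal illustration, assuming a plain character-level trie in which each node stores its prefix and a count of the strings passing through it; the `Node` class, the `build_prefix_tree` function and the sample strings are hypothetical, and the tree described in the source may be organized differently (for example, with compressed branches).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    prefix: str                                   # prefix represented by this node
    count: int = 0                                # strings that start with this prefix
    children: dict = field(default_factory=dict)  # next character -> child Node

def build_prefix_tree(strings):
    """Insert every string character by character, counting each visit."""
    root = Node("")
    for s in strings:
        node = root
        node.count += 1
        for i, ch in enumerate(s, start=1):
            if ch not in node.children:
                node.children[ch] = Node(s[:i])
            node = node.children[ch]
            node.count += 1
    return root

# Hypothetical input; repeated strings model frequencies greater than one.
tree = build_prefix_tree(["host1", "host2", "host1", "server", "code"])
host = tree.children["h"].children["o"].children["s"].children["t"]
print(host.prefix, host.count)  # host 3
```

Keeping a count on every node means the frequency of any prefix is available directly when weights are later assigned to the nodes.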
[0110] Based on the nodes in the prefix tree, deploy weights are assigned to the nodes at block 604. The deploy weights are assigned to the nodes based on lengths of the prefixes represented by subtree nodes rooted at the nodes and based on frequencies of the strings whose prefixes are represented by the subtree nodes. Each of the deploy weights of one node is indicative of a maximum weight preserved upon filling the buckets with at least one prefix represented by the subtree nodes rooted at that one node. The deploy weights are assigned by the histogram construction system 102.
[0111] At block 606, a predefined number of Top-prefixes of the strings are determined for filling the predefined number of buckets. The predefined number of Top-prefixes is determined from the prefixes represented by the nodes based on maximizing a total weight preserved by the prefixes in the buckets and over a maximum number of strings. The Top-prefixes are determined by the histogram construction system 102.
[0112] At block 608, a histogram is generated based on the deploy weights associated with the Top-prefixes in the buckets. The histogram is generated by the histogram construction system 102. The histogram may be generated for the purposes of data mining, data analytics, and approximate query answering.
[0113] Referring to Figure 7, the string data is received online, in real-time. The strings in the string data are serially received one-by-one. The prefix tree initially has a root node, and the predefined number of buckets that are to be filled by the Top-prefixes are empty. At block 702, a string is received and the prefix tree is updated to include the string. At block 704, it is checked whether the string matches a prefix in one bucket. For this, the string is compared with the prefixes in the buckets. If the string matches a prefix in a bucket ('Yes' branch from block 704), the deploy weight in the bucket having the prefix that matches the string is incremented by 1, at block 706. The revised deploy weight is assigned to the bucket, and the method 700 proceeds to receive the next string for processing and/or proceeds to generate a histogram at block 720.
[0114] If the string is not matched ('No' branch from block 704), it is checked at block 708 whether an empty or unfilled bucket, from the predefined number of buckets, exists. If an unfilled bucket is found ('Yes' branch from block 708), a longest prefix of the string is filled in the unfilled bucket and the deploy weight of the node representing the longest prefix is stored in the unfilled bucket, at block 710. For this, the deploy weight is assigned to the node representing the longest prefix, based on the frequency of the string, before storing the same in the unfilled bucket. The method 700 then proceeds to receive the next string for processing and/or proceeds to generate a histogram at block 720.
[0115] Further, if no unfilled bucket is found ('No' branch from block 708), a bucket pair with prefixes is identified, at block 712, for which a loss weight is minimum. For this, a loss weight for each bucket pair is computed as described earlier and the pair with the minimum loss weight is taken as the bucket pair for further processing.
[0116] At block 714, it is checked whether the value of the loss weight for the identified bucket pair is less than 1. If the value of the loss weight is ≥ 1 ('No' branch from block 714), the deploy weights in the buckets are reduced by 1, at block 716, and the method 700 proceeds to receive the next string for processing and/or proceeds to generate a histogram at block 720. And, if the value of the loss weight is < 1 ('Yes' branch from block 714), then, at block 718, one bucket of the identified bucket pair is filled by the longest common prefix of the prefixes in the bucket pair, the deploy weight in that one bucket is revised as a sum of the deploy weights associated with the prefixes in the bucket pair minus the loss weight, the other bucket of the bucket pair is filled with a longest prefix of the string, and the deploy weight of the node representing the longest prefix of the string is stored in that other bucket. For this, the deploy weight is assigned to the node representing the longest prefix of the string, based on the frequency of the string, before storing the same in the bucket. The method 700 then proceeds to receive the next string for processing and/or proceeds to generate a histogram at block 720.
[0117] At block 720, a histogram is generated based on the deploy weights associated with the prefixes in the buckets.
[0118] Referring to Figure 8, at block 802, string data having strings with a predefined frequency distribution is received. The string data is received offline, and the strings are static strings with fixed frequencies. At block 804, a prefix tree is generated for the received strings. The prefix tree is generated for distinct strings. Based on the prefix tree, a breadth-first search is performed to traverse the prefix tree and a reverse traverse order for the nodes is determined, at block 806.
[0119] Based on the reverse traverse order, fractional weights for the leaf nodes and the branch nodes in the prefix tree are computed, at block 808. After this, at block 810, a number of deploy weights are computed and assigned to each node. The deploy weights are computed for each node depending on the number of buckets, from 1 to at most the predefined number, which can be filled by the prefixes at subtree nodes rooted at that node and by the prefixes at further subtree nodes rooted at child nodes of that node. The deploy weights for the nodes are computed based on the reverse traverse order and based on the fractional weights of the subtree nodes, frequencies of the strings whose prefixes are represented by the subtree nodes, lengths of the prefixes represented by the subtree nodes, and the deploy weights of subtree nodes.
[0120] At block 812, deploy weights are computed for the root node of the prefix tree. The deploy weights of the root node are computed for the number of buckets, from 1 to at most the predefined number, which can be filled by the prefixes at subtree nodes rooted at the root node and at the further subtree nodes rooted at the child nodes of those subtree nodes. The deploy weights for the root node are computed based on the deploy weights of the subtree nodes rooted at the root node.
[0121] Based on the deploy weights of the root node, at block 814, the predefined number of Top-prefixes is determined from the prefixes based on which the deploy weights of the root node are computed. The Top-prefixes are those prefixes, represented by the subtree nodes rooted at the root node and by the further subtree nodes rooted at the child nodes of those subtree nodes, for which the deploy weight of the root node indicates a maximum weight preserved upon filling the predefined number of buckets.
[0122] At block 816, a histogram is generated based on the deploy weights associated with the predefined number of Top-prefixes determined based on the deploy weights of the root node.
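The breadth-first traversal and reverse order of block 806 can be sketched as follows. This is a minimal illustration, assuming the tree is given as a mapping from each node to its list of children; the function name `reverse_bfs_order` and the small tree fragment are hypothetical and only loosely modeled on the node names used in the example.

```python
from collections import deque

def reverse_bfs_order(children, root):
    """Visit nodes breadth-first from the root, then reverse the visit
    order so that every node appears after all of its descendants."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    order.reverse()
    return order

# Hypothetical fragment of a prefix tree: node -> list of child nodes.
children = {
    "root": ["d3", "d4"],
    "d3": ["d6", "d7", "d8", "d9"],
    "d8": ["d11"],
}
order = reverse_bfs_order(children, "root")
print(order)  # ['d11', 'd9', 'd8', 'd7', 'd6', 'd4', 'd3', 'root']
```

Processing nodes in this order guarantees that the weights of all child nodes are available before the weights of their parent are computed, which is what a bottom-up computation over the tree requires.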
[0123] Figure 9 illustrates a system environment 900 for generation of a histogram for string data, according to an example of the present subject matter. The system environment 900 may be a public networking environment or a private networking environment. In one implementation, the system environment 900 includes a processing resource 902 communicatively coupled to a computer readable medium 904 through a communication link 906.
[0124] For example, the processing resource 902 can be a computing device for generating histograms. The computer readable medium 904 can be, for example, an internal memory device or an external memory device. In one implementation, the communication link 906 may be a direct communication link, such as any memory read/write interface. In another implementation, the communication link 906 may be an indirect communication link, such as a network interface. In such a case, the processing resource 902 can access the computer readable medium 904 through a network 908. The network 908 may be a single network or a combination of multiple networks and may use a variety of different communication protocols.
[0125] The processing resource 902 and the computer readable medium 904 may also be communicatively coupled to data sources 910 through the communication link 906, and/or to communication devices 912 over the network 908. The coupling with the data sources 910 enables receiving the string data in an offline environment, and the coupling with the communication devices 912 enables receiving the string data in an online environment.
[0126] In one implementation, the computer readable medium 904 includes a set of computer readable instructions, such as the data acquiring module 112, the data structure module 114, the Top-prefix finder 116, and the histogram generator 118. The set of computer readable instructions can be accessed by the processing resource 902 through the communication link 906 and subsequently executed to perform acts for generating histograms for string data.
[0127] For example, the data acquiring module 112 can obtain string data comprising strings. Based on the obtained strings, the data structure module 114 can generate a prefix tree for distributing the strings into nodes that represent prefixes of the strings. Based on the nodes in the prefix tree, the Top-prefix finder 116 can assign deploy weights to the nodes.
[0128] Further, based on the deploy weights of the nodes, the Top-prefix finder 116 can determine or find a predefined number of Top-prefixes of the strings for filling up the predefined number of buckets. The Top-prefixes are determined from the prefixes in the prefix tree, based on maximization of a total weight preserved by the predefined number of prefixes, where the predefined number of prefixes is associated with a maximum number of distinct strings. Each of the Top-prefixes is filled in a separate bucket, and the deploy weight of the node representing each Top-prefix is stored in the corresponding bucket.
[0129] Further, after determining or finding the Top-prefixes for the strings and filling up the buckets, the histogram generator 118 can generate a histogram of the Top-prefixes. The histogram is generated based on the Top-prefixes and the deploy weights associated with the Top-prefixes.
[0130] Although implementations for generation of histograms for string data have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed and explained as example implementations for generation of histograms for string data.
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title 

PCT/CN2013/075033 WO2014176754A1 (en)  2013-04-30  2013-04-30  Histogram construction for string data 
Applications Claiming Priority (2)
Application Number  Priority Date  Filing Date  Title 

US14/787,548 US20160154854A1 (en)  2013-04-30  2013-04-30  TOP-K Prefix Histogram Construction for String Data 
PCT/CN2013/075033 WO2014176754A1 (en)  2013-04-30  2013-04-30  Histogram construction for string data 
Publications (1)
Publication Number  Publication Date 

WO2014176754A1 (en)  2014-11-06 
Family
ID=51843051
Family Applications (1)
Application Number  Title  Priority Date  Filing Date 

PCT/CN2013/075033 WO2014176754A1 (en)  2013-04-30  2013-04-30  Histogram construction for string data 
Country Status (2)
Country  Link 

US (1)  US20160154854A1 (en) 
WO (1)  WO2014176754A1 (en) 
Also Published As
Publication number  Publication date 

US20160154854A1 (en)  2016-06-02 
Legal Events
Date  Code  Title  Description 

121  Ep: the epo has been informed by wipo that ep was designated in this application 
Ref document number: 13883396 Country of ref document: EP Kind code of ref document: A1 

WWE  Wipo information: entry into national phase 
Ref document number: 14787548 Country of ref document: US 

NENP  Non-entry into the national phase in: 
Ref country code: DE 

122  Ep: PCT application non-entry in European phase 
Ref document number: 13883396 Country of ref document: EP Kind code of ref document: A1 