WO2014176754A1 - Histogram construction for string data - Google Patents

Histogram construction for string data

Info

Publication number
WO2014176754A1
Authority
WO
WIPO (PCT)
Prior art keywords
prefixes
nodes
prefix
buckets
bucket
Prior art date
Application number
PCT/CN2013/075033
Other languages
French (fr)
Inventor
Ge LUO
Li-Mei Jiao
Zhao CAO
Shimin CHEN
Meng Guo
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P.
Priority to PCT/CN2013/075033
Publication of WO2014176754A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing

Abstract

Methods and systems of generation of histograms for strings are described. In one implementation, a prefix tree having nodes representing prefixes of the strings is generated. For the prefix tree, deploy weights are assigned to the nodes based on lengths of the prefixes represented by sub-tree nodes rooted at the nodes and frequencies of the strings whose prefixes are represented by the sub-tree nodes. Each of the deploy weights of one node is indicative of a maximum weight preserved upon filling the buckets with at least one prefix represented by the sub-tree nodes rooted at that one node. A predefined number of Top-prefixes are determined for filling up the predefined number of buckets. The Top-prefixes are determined based on maximizing a total weight preserved by the prefixes in the buckets and over a maximum number of strings. A histogram is generated based on the deploy weights associated with the Top-prefixes.

Description

HISTOGRAM CONSTRUCTION FOR STRING DATA

BACKGROUND

[0001] In modern day environments, large volumes of data are generally captured from a variety of information sources, and managed in databases for various purposes including data analysis and database searching. In view of the large volume of data, database management systems utilize histograms to capture data distribution, and to summarize and represent the data in a concise form. To generate a histogram, the data is partitioned based on a degree of similarity in its characteristics. The histogram, in an example, represents a frequency distribution of occurrence of data with similar characteristics over the entire data.

BRIEF DESCRIPTION OF DRAWINGS

[0002] The detailed description is provided with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and components.

[0003] Figure 1(a) illustrates a system environment implementing a histogram construction system, according to an example of the present subject matter.

[0004] Figure 1(b) illustrates a histogram construction system, according to an example of the present subject matter.

[0005] Figure 2 illustrates the histogram construction system, according to an example of the present subject matter.

[0006] Figure 3 illustrates a prefix tree for string data, according to an example of the present subject matter.

[0007] Figures 4(a), 4(b), 4(c), 4(d), and 4(e) illustrate iterative revisions of a prefix tree for strings in an online environment, according to an example of the present subject matter.

[0008] Figure 5 illustrates a prefix tree for strings in an offline environment, according to an example of the present subject matter.

[0009] Figure 6 illustrates a method of generation of a histogram for string data, according to an example of the present subject matter.

[0010] Figure 7 illustrates a method of generation of a histogram for string data in an online environment, according to an example of the present subject matter.

[0011] Figure 8 illustrates a method of generation of a histogram for string data in an offline environment, according to an example of the present subject matter.

[0012] Figure 9 illustrates a system environment for generation of a histogram for string data, according to an example of the present subject matter.

DETAILED DESCRIPTION

[0013] The present subject matter relates to methods and systems for generation of histograms for string data. The string data include multiple sequences of characters in the form of strings. A histogram represents a statistical summary of the string data, which may be generated based on a frequency distribution of strings in the string data.

[0014] A histogram is generated by sampling of data into multiple buckets, where each bucket is filled with the data having similar characteristics. Each bucket generally has a defined bucket boundary or sampling span for filling up the data in that bucket. For example, the data may correspond to age of employees in a company. The age data can be sampled into buckets of different age spans. The buckets may have equal or unequal boundary widths. Each bucket may store frequency of occurrence of data lying within the respective bucket boundary. The frequency distribution stored in the buckets summarizes the data, which is referred to as a histogram synopsis. The histogram synopsis can then be used to generate a histogram for the data over the buckets.
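As an illustration of the age example above, the following minimal sketch builds a histogram synopsis with equal-width buckets. The function name `bucketize`, the sample ages, and the bucket width of 10 are assumptions made for illustration and do not come from the source.

```python
from collections import Counter


def bucketize(ages, width):
    """Count how many ages fall in each half-open span [lo, lo + width)."""
    counts = Counter((age // width) * width for age in ages)
    # Key each bucket by its boundary so the synopsis is self-describing.
    return {(lo, lo + width): n for lo, n in sorted(counts.items())}


# Hypothetical employee ages.
ages = [23, 27, 31, 35, 38, 41, 44, 52]
synopsis = bucketize(ages, 10)
```

Here each bucket stores only a boundary and a frequency, so the synopsis stays small regardless of how many employees are recorded.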

[0015] Histograms of data find their utility in various applications, such as data mining, data analytics, and approximate query answering. Histograms enable the data and its relevant information to be stored in a compact and concise manner, which in turn improves the performance of data mining, data analytics, and approximate query answering procedures when performed over the histograms. For data mining and big data analytics, the histograms make it possible to fetch required information, draw inferences, and identify deviations in the data distribution quickly. In approximate query answering, user queries can be executed on the histograms, instead of on the entire data, to obtain approximate but quick answers to the user queries.
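To make approximate query answering concrete, the sketch below estimates how many values fall in a range using only bucket counts. It assumes values are spread uniformly inside each bucket, which is a common simplification and not something the source prescribes; the function name `estimate_count` and the sample synopsis are likewise hypothetical.

```python
def estimate_count(synopsis, lo, hi):
    """Estimate how many values lie in [lo, hi) from bucket counts only."""
    total = 0.0
    for (b_lo, b_hi), n in synopsis.items():
        # Width of the part of this bucket that the query range covers.
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo))
        # Uniformity assumption: count is proportional to covered width.
        total += n * overlap / (b_hi - b_lo)
    return total


# Hypothetical synopsis: bucket boundary -> frequency.
synopsis = {(20, 30): 2, (30, 40): 3, (40, 50): 2, (50, 60): 1}
estimate = estimate_count(synopsis, 25, 45)
```

The answer is approximate because the exact positions of values inside a bucket are not stored, but it is computed without touching the underlying data.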

[0016] Presently available databases and applications deal with different types of data including numerical and string based data. Methods of generation of histograms for numerical data are common; however, such numerical data specific methods cannot be applied for generation of histograms for string data. Also, histogram generation methods are applicable on static data, i.e., on the data that is fixed and known prior to generation of histograms. Such methods cannot be used for generation of histograms for the data being streamed online in real-time.

[0017] Further, generating histograms has computation costs associated with it. The computation costs generally include time cost and space cost. The time cost refers to the amount of time taken for generation of a histogram, and the space cost refers to the amount of space, i.e., the memory utilized by a histogram. The methods of generation of histograms for the string data typically take time in a quadratic order of the number of data values being considered for the histogram generation, i.e., O(|n|^2) where n is the number of data values. With the number of data values being substantially large, the time cost of the histogram is substantially high. The histogram generally takes space in a linear order of the number of data values, i.e., O(|n|) where n is the number of data values for which the histogram is generated. For the histogram generated over a large number of data values, the space cost is also substantially large.

[0018] Methods and systems for generation of histograms for string data are described herein. With the methods and the systems of the present subject matter, histograms can be generated for string data which is static and predefined, and for string data which is streamed online in real-time. The histograms that are generated based on the methods and the systems of the present subject matter have substantially low time and space costs associated with them.

[0019] In accordance with the present subject matter, for generation of a histogram for string data, the strings in the string data are represented as a prefix tree. A prefix tree is a Trie data structure having nodes that represent prefixes of the strings. A prefix of a string is a sequence of characters which is either the same as that of the string or which is a substring of the string. The nodes in the prefix tree represent longest prefixes and longest common prefixes of the strings. A longest prefix refers to a sequence of characters which is equal to a string. A longest common prefix refers to a sequence of characters which is a common substring of one or more strings. For example, for two strings "host" and "hostname", the prefix tree will have a node representing the longest prefix as "host" for the string "host", a node representing the longest prefix as "hostname" for the string "hostname", and a node representing the longest common prefix as "host" for both strings.
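The longest common prefix described above can be computed with a simple character-by-character comparison. The helper below is a minimal sketch; the name `longest_common_prefix` is chosen for illustration and is not taken from the source.

```python
def longest_common_prefix(a, b):
    """Return the longest sequence of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


lcp_host = longest_common_prefix("host", "hostname")
lcp_source = longest_common_prefix("sourcecode", "sourcename")
```

For "host" and "hostname" the result is the whole shorter string, which is why the node for the longest common prefix and the node for the longest prefix of "host" carry the same characters.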

[0020] Based on the prefix tree, deploy weights are assigned to the nodes in the prefix tree. A deploy weight of a node is computed based on lengths of the prefixes represented by sub-tree nodes rooted at that node and based on frequencies of the strings whose prefixes are represented by the sub-tree nodes. The deploy weight of a node is indicative of a maximum weight preserved upon filling up at least one prefix, represented by the sub-tree nodes rooted at that node, in a respective bucket. The sub-tree nodes rooted at one node include that one node and the child-nodes of that one node. The values of deploy weights convey the levels of relevancy of the prefixes at the respective nodes for filling up the buckets. The higher valued deploy weights highlight the prefixes that are more relevant for filling up the buckets.

[0021] Further, based on the deploy weights associated with the prefixes of the strings, a predefined number of prefixes can be determined or found, from amongst the prefixes represented by the nodes of the prefix tree, for filling up the predefined number of buckets. The predefined number of prefixes are determined through maximization of a total weight preserved by the determined prefixes. The total weight preserved is the weight preserved by the determined prefixes, which can be determined based on the deploy weights of the determined prefixes. The predefined number of prefixes that are determined or found are referred to as Top-prefixes of the string data. Each bucket is filled with one distinct prefix. Also, the prefixes are determined to cover the prefixes associated with a maximum number of distinct strings. The deploy weights associated with the predefined number of prefixes can then be used to generate a histogram for the string data.

[0022] The methods and the systems of the present subject matter enable capturing the distribution of string data and generating histograms with a reduced number of Top-prefixes of strings. By maximizing the total weight preserved by the Top-prefixes, the histogram, in accordance with the present subject matter, captures as much statistical information as possible of the string data. Further, by considering the prefixes of the strings and maximizing the number of prefixes in the Top-prefixes, the coverage of the histogram is over a large (maximum) number of distinct strings in the string data.

[0023] In an example, the number of Top-prefixes may be less than the total number of distinct strings in the string data considered for generation of a histogram. Such a histogram of the Top-prefixes facilitates representing the string data in a substantially compact form, which can be used for data mining, data analytics, approximate query answering, etc. Further, since each of the distinct Top-prefixes is filled in a separate bucket, the number of buckets governs the size of the histogram. The space cost and the time cost of the histogram, in accordance with the subject matter, are based on the number of Top-prefixes or the number of buckets in the histogram. This facilitates reducing the space cost and the time cost associated with the histograms.

[0024] Further, the methods and the systems of the present subject matter enable the generation of histograms both in an offline environment and in an online environment. In an offline environment, the data is static and the complete data set along with the frequency distribution of strings are known in advance. The histograms may be generated for this predetermined static data set in the offline environment. In an online environment, the data is streamed and received, for example, one-by-one in real-time. The frequency distribution of the streamed strings is not known in advance. Thus, histograms may be generated and updated for the streamed data in real-time in the online environment.

[0025] The above methods and systems are further described in conjunction with Figures 1 to 9. It should be noted that the description and figures merely illustrate the principles of the present subject matter. It is thus understood that various arrangements can be devised that, although not explicitly described or shown herein, embody the principles of the present subject matter and are included within its spirit and scope. Moreover, all statements herein reciting principles, aspects, and embodiments of the present subject matter, as well as specific examples thereof, are intended to encompass equivalents thereof.

[0026] Figure 1(a) schematically illustrates a system environment 100 implementing a histogram construction system 102, according to an example of the present subject matter. The system environment 100 may be a public environment or a private environment. The histogram construction system 102 may be a machine readable instructions-based implementation or a hardware-based implementation or a combination thereof. The histogram construction system 102 described herein can be implemented in a computing device, such as a server. The histogram construction system 102 in a computing device enables the computing device to generate histograms for string data, in accordance with the present subject matter.

[0027] As shown in Figure 1(a), the histogram construction system 102 is communicatively coupled with a plurality of data sources 104-1, 104-2, ..., 104-N. The data sources 104-1, 104-2, ..., 104-N, hereinafter may be collectively referred to as data sources 104, and individually referred to as a data source 104. The data sources 104 may host data, including string data, in static form. In an example, the histogram construction system 102 can access the data sources 104 to receive the string data in static form, which also refers to a fixed data set, for the generation of histograms. Such an environment for generation of histograms refers to an offline environment.

[0028] Further, as shown in Figure 1(a), the histogram construction system 102 is communicatively coupled with a plurality of communication devices 106-1, 106-2, ..., 106-N through a communication network 108. The communication devices 106-1, 106-2, ..., 106-N, hereinafter may be collectively referred to as communication devices 106, and individually referred to as a communication device 106. The communication device 106 may include a computer, a laptop, a smart phone, a tablet, and the like. In an example, the histogram construction system 102 can communicate with the communication devices 106 to receive string data streamed online in real-time over the communication network 108, for the generation of histograms. Such an environment for generation of histograms refers to an online environment.

[0029] In an example, the communication device 106 may be communicatively coupled to the histogram construction system 102 over the communication network 108 through one or more communication links. The communication links between the communication devices 106 and the histogram construction system 102 are enabled through a desired form of communication, for example, via dial-up modem connections, cable links, and digital subscriber lines (DSL), wireless or satellite links, or any other suitable form of communication.

[0030] The communication network 108 may be a wireless network, a wired network, or a combination thereof. The communication network 108 can also be an individual network or a collection of many such individual networks, interconnected with each other and functioning as a single large network, e.g., the Internet or an intranet. The communication network 108 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and such. The communication network 108 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc., to communicate with each other.

[0031] The communication network 108 may also include individual networks, such as but not limited to, Global System for Mobile Communication (GSM) network, Universal Mobile Telecommunications System (UMTS) network, Long Term Evolution (LTE) network, Personal Communications Service (PCS) network, Time Division Multiple Access (TDMA) network, Code Division Multiple Access (CDMA) network, Next Generation Network (NGN), Public Switched Telephone Network (PSTN), and Integrated Services Digital Network (ISDN).

[0032] Figure 1(b) illustrates the histogram construction system 102, according to an implementation of the present subject matter. In an implementation, the histogram construction system 102 includes processor(s) 110. The processor(s) 110 may be implemented as microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 110 fetch and execute computer-readable instructions stored in the memory. The functions of the various elements shown in Figure 1(b), including any functional blocks labeled as "processor(s)", may be provided through the use of dedicated hardware as well as hardware capable of executing machine readable instructions.

[0033] As shown in Figure 1(b), the histogram construction system 102 includes a data acquiring module 112, a data structure module 114, a Top-prefix finder 116, and a histogram generator 118. The data acquiring module 112, the data structure module 114, the Top-prefix finder 116, and the histogram generator 118 are coupled to the processor(s) 110.

[0034] In an implementation, for the purpose of generation of histograms, the data acquiring module 112 obtains string data comprising strings. The data acquiring module 112 can obtain static string data offline from the data sources 104, and/or can obtain streamed string data online from the communication devices 106. Based on the obtained strings, the data structure module 114 generates a prefix tree for distributing the strings into nodes that represent prefixes of the strings. Based on the nodes in the prefix tree, the Top-prefix finder 116 assigns deploy weights to the nodes. A deploy weight of a node is indicative of a maximum weight preserved upon filling buckets with one or more prefixes represented by the sub-tree nodes rooted at that node, each in a separate bucket.

[0035] Based on the deploy weights of the nodes, the Top-prefix finder 116 determines or finds a predefined number of Top-prefixes of the strings for filling up the predefined number of buckets. In an example, the predefined number may be a system defined or a user defined number. This predefined number may be defined based on the number of buckets to be filled in for a histogram, and based on the size of histogram to be constructed. The Top-prefixes are determined from the prefixes in the prefix tree, based on maximization of a total weight preserved by the predefined number of prefixes, where the predefined number of prefixes are associated with a maximum number of distinct strings. Each of the Top-prefixes is filled in a separate bucket, and the deploy weight of the node representing each Top-prefix is stored in the corresponding bucket.

[0036] After determining the Top-prefixes for the strings and filling up the buckets, the histogram generator 118 generates a histogram of the Top-prefixes. The histogram is generated based on the Top-prefixes and the corresponding deploy weights associated with the Top-prefixes in the buckets. The generated histograms can be used for applications, such as data mining, data analytics, and approximate query processing.

[0037] Figure 2 illustrates the histogram construction system 102, according to an implementation of the present subject matter. The histogram construction system 102 includes the processor(s) 110 and also interface(s) 202. The interface(s) 202 may include a variety of machine readable instruction-based and hardware interfaces that allow the histogram construction system 102 to interact with the data sources 104 and the communication devices 106, as the case may be. Further, the interface(s) 202 may enable the histogram construction system 102 to communicate with other devices, such as network entities, web servers and other external repositories.

[0038] Further, the histogram construction system 102 includes memory 204, coupled to the processor(s) 110. The memory 204 may include any computer-readable medium including, for example, volatile memory (e.g., RAM), and/or non-volatile memory (e.g., EPROM, flash memory, NVRAM, memristor, etc.).

[0039] Further, the histogram construction system 102 includes module(s) 206 coupled to the processor(s) 110. The module(s) 206, amongst other things, include routines, programs, objects, components, data structures, and the like, which perform particular tasks or implement particular abstract data types. The module(s) 206 further include modules that supplement applications on the histogram construction system 102, for example, modules of an operating system.

[0040] The module(s) 206 of the histogram construction system 102 include the data acquiring module 112, the data structure module 114, the Top-prefix finder 116, the histogram generator 118, and other module(s) 210. The other module(s) 210 may include programs or coded instructions that supplement applications and functions, for example, programs in the operating system of the histogram construction system 102.

[0041] Further, the histogram construction system 102 includes data 208. The data 208 serves, amongst other things, as a repository for storing data that may be fetched, processed, received, or generated by the module(s) 206. Although the data 208 is shown internal to the histogram construction system 102, it may be understood that the data 208 can reside in an external repository (not shown in the figure), which may be coupled to the histogram construction system 102. The histogram construction system 102 may communicate with the external repository through the interface(s) 202 to obtain information from the data 208.

[0042] In an implementation, the data 208 of the histogram construction system 102 includes string data 212, prefix data 214, histogram data 216, and other data 218. The string data 212 stores the strings obtained by the histogram construction system 102. The prefix data 214 stores the deploy weights of the nodes, and the data in the buckets. The histogram data 216 stores the histograms generated by the histogram construction system 102. The other data 218 comprise data corresponding to other module(s) 210.

[0043] As mentioned earlier, the histograms can be generated by the histogram construction system 102 in an online environment and in an offline environment. Before describing the procedures for generation of histograms for string data in online and offline environments, a prefix tree that can be used as a data structure for representing strings in the string data is described. The prefix tree is a Trie data structure that distributes the strings into leaf nodes and branch nodes. A leaf node is a terminal node representing the longest prefix of one of the strings. A branch node represents a longest common prefix of one or more prefixes represented by child-nodes branching out from that branch node.

[0044] Figure 3 illustrates a prefix tree 300 for representing string data, according to an example of the present subject matter. The prefix tree 300 is for the string data having the following strings: "address", "host", "hostname", "source", "sourcecode", and "sourcename". As shown, Rn0 is a root node of the prefix tree 300 from which nodes for the distinct strings branch out. Bn1 to Bn6 are the branch nodes and Ln1 to Ln6 are the leaf nodes.

[0045] The leaf node Ln1 is a terminal node for the string "address". The leaf node Ln1 represents a prefix "address" which is the longest prefix of the string "address". Similarly, as shown, the leaf nodes Ln2, Ln3, Ln4, Ln5, and Ln6 represent the longest prefix as "host", "source", "hostname", "sourcecode", and "sourcename", respectively, for the other strings. The branch node Bn1 represents a prefix "address" which is the longest common prefix of the prefix represented by the leaf node Ln1. Since only one leaf node Ln1 is branching out from the branch node Bn1, the longest common prefix at the branch node Bn1 is the same as the longest prefix at the leaf node Ln1. Similarly, the branch nodes Bn4, Bn5 and Bn6 represent the longest common prefix as "hostname", "sourcecode" and "sourcename", respectively, based on the respective leaf nodes. Further, the branch node Bn2 represents a prefix "host" which is the longest common prefix of the prefixes represented by the leaf node Ln2 and the branch node Bn4. The branch node Bn3 represents a prefix "source" which is the longest common prefix of the prefixes represented by the leaf node Ln3 and the branch nodes Bn5 and Bn6. Further, the nodes Bn2, Ln2, and Bn4 form a group of sub-tree nodes rooted at the branch node Bn2. Similarly, the nodes Bn3, Ln3, Bn5, and Bn6 form a group of sub-tree nodes rooted at the branch node Bn3. In an example, the prefix tree 300 for the string data may include other internal nodes; however, for the sake of simplicity the root node, the branch nodes and the leaf nodes, as described above, are illustrated.
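The grouping of strings under the prefixes "host" and "source" can be reproduced with a small character-level trie. This is a minimal sketch under stated assumptions: the nested-dictionary layout, the `END` marker, and the function names `build_trie`, `strings_under`, and `subtree_strings` are illustrative choices, and the sketch does not model the separate branch and leaf node labels (Bn, Ln) of Figure 3.

```python
END = ""  # marker key: a stored string ends at this node (never a real character)


def build_trie(strings):
    """Build a nested-dict trie with one level per character."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def strings_under(node, prefix):
    """List, in sorted order, every stored string in the sub-tree at node."""
    found = []
    for key, child in sorted(node.items()):
        if key == END:
            found.append(prefix)
        else:
            found.extend(strings_under(child, prefix + key))
    return found


def subtree_strings(root, prefix):
    """Walk down to the node for prefix, then collect the strings below it."""
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    return strings_under(node, prefix)


trie = build_trie(["address", "host", "hostname",
                   "source", "sourcecode", "sourcename"])
host_group = subtree_strings(trie, "host")
source_group = subtree_strings(trie, "source")
```

With the six strings of the example, the sub-tree at "host" holds two strings and the sub-tree at "source" holds three, matching the groups of sub-tree nodes rooted at Bn2 and Bn3.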

[0046] The description below describes the generation of histograms by the histogram construction system 102 individually in the online environment and in the offline environment.

Histogram Generation in Online Environment

[0047] In an implementation, for the purpose of generation of histograms in an online environment, the data acquiring module 112 obtains string data online, in real-time, as data streams over the communication network 108. The string data includes strings which are received one-by-one from one or more communication devices 106. Based on the obtained strings, the data structure module 114 generates a prefix tree and iteratively revises the prefix tree to include the strings, as received one-by-one, in the prefix tree. Based on the prefix tree, the Top-prefix finder 116 assigns deploy weights to the nodes, and fills buckets based on the deploy weights. For the purposes of the present subject matter, since one bucket is filled with one distinct prefix, the number of buckets is equal to a predefined number of Top-prefixes to be determined from the prefix tree.

[0048] For determining the predefined number of Top-prefixes from the prefix tree, the Top-prefix finder 116 updates prefixes and corresponding deploy weights in at most the predefined number of buckets for each revision of the prefix tree. The description below describes the process of assigning deploy weights and updating the buckets for determining the Top-prefixes by maximization of total weight preserved by the prefixes in the buckets over a maximum number of distinct strings. Based on the Top-prefixes and the corresponding deploy weights in the buckets, a histogram can be generated by the histogram generator 118.

[0049] For the purposes of the description herein, let a string be denoted by s, a bucket be denoted by b, a prefix in a bucket b be denoted by pb, a deploy weight in a bucket b be denoted by Wb, and the longest common prefix for two prefixes pb and pb' be denoted by pb ∩ pb'. The prefix pb also refers to a prefix represented by a node, and the deploy weight Wb also refers to a deploy weight of the node representing the prefix pb. Also, the total number of buckets is equal to the predefined number of Top-prefixes that are to be determined for filling the buckets and generating a histogram. Let the predefined number be denoted by k.

[0050] Upon receiving a string s, the data structure module 114 updates the prefix tree to include the string s. The prefix tree may already have a branch with one or more branch nodes and a leaf node for the string s. If not, a new branch with a branch node and a leaf node is created from the root node for including the received string s.

[0051] Based on the revision of the prefix tree, the Top-prefix finder 116 compares the string s with the prefixes stored in the buckets to determine if the string s matches with any of the prefixes in the buckets. If the string s matches with a prefix pb in the bucket b, the deploy weight Wb in the bucket b is revised. The deploy weight Wb is revised based on the frequency of the string s in the obtained string data. For this, the frequency of each string in the string data is maintained. If the received string s is a string already represented in the prefix tree, the frequency of the string s is incremented by 1. If the received string s is a new string, the frequency of string s is set as 1. Based on the frequency, the deploy weight Wb at the node representing the prefix pb is revised to make it equal to the frequency of the string s. The revised deploy weight Wb is assigned to the node, and the deploy weight Wb in the bucket b is replaced by the revised deploy weight Wb.
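The tree revision and frequency bookkeeping described above can be sketched as follows. This assumes a plain character-level trie; the `Node` class and the `insert` function are hypothetical names introduced for illustration, not the structures used by the data structure module 114.

```python
class Node:
    """One trie node; frequency is non-zero only where a received string ends."""

    def __init__(self):
        self.children = {}
        self.frequency = 0


def insert(root, s):
    """Add one received string to the trie and return its updated frequency."""
    node = root
    for ch in s:
        # Reuse the existing branch where possible, otherwise extend it.
        if ch not in node.children:
            node.children[ch] = Node()
        node = node.children[ch]
    # A new string ends up with frequency 1; a repeated one is incremented by 1.
    node.frequency += 1
    return node.frequency


root = Node()
frequencies = [insert(root, s) for s in ["host", "hostname", "host"]]
```

The returned frequency is the value that the passage above assigns as the deploy weight when the string matches a prefix already held in a bucket.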

[0052] Further, if the string s does not match any of the prefixes in the buckets, the Top-prefix finder 116 finds an empty bucket from the total of k buckets. Upon finding an empty bucket, the longest prefix of the string s, represented by a leaf node, is filled in that empty bucket. A deploy weight equal to the frequency of the string s is assigned to the leaf node representing the longest prefix of the string s. The deploy weight assigned to the leaf node is stored as the deploy weight W_b in the bucket b.

[0053] Further, if the string s does not match any of the prefixes in the buckets, and no bucket is empty or unfilled, the Top-prefix finder 116 identifies a bucket pair b, b' with prefixes p_b, p_b' for which a loss weight is minimum. The loss weight is indicative of the loss in weight preserved upon filling one bucket b with the longest common prefix p_b ∩ p_b' and releasing or emptying the bucket b'. For the purposes of the description herein, the loss weight is denoted by lw. For the bucket pair b, b', the loss weight lw is computed based on equation (1) below:

$$lw = W_b\left(1 - \frac{|p_b \cap p_{b'}|}{|p_b|}\right) + W_{b'}\left(1 - \frac{|p_b \cap p_{b'}|}{|p_{b'}|}\right) \qquad (1)$$

where W_b and W_b' are the deploy weights of the prefixes p_b and p_b' in the buckets b and b', respectively, |p_b| is the length of the prefix p_b, |p_b'| is the length of the prefix p_b', and |p_b ∩ p_b'| is the length of the longest common prefix p_b ∩ p_b'.
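As a minimal sketch of the loss-weight computation, assuming equation (1) has the form implied by the worked computations in paragraphs [0065] and [0066] (each deploy weight scaled by the fraction of its prefix that lies beyond the longest common prefix), the Python function below evaluates lw for one bucket pair. The function name `loss_weight` and its argument names are illustrative and do not appear in the source.

```python
import os


def loss_weight(p_b: str, w_b: float, p_b2: str, w_b2: float) -> float:
    """Loss weight for merging two bucket prefixes into their longest common prefix."""
    # Length of the longest common prefix of the two bucket prefixes.
    common = len(os.path.commonprefix([p_b, p_b2]))
    return w_b * (1 - common / len(p_b)) + w_b2 * (1 - common / len(p_b2))


# Values taken from the "host" / "hostname" example in [0065] and [0066].
print(loss_weight("host", 15, "hostname", 2))  # 1.0
print(loss_weight("host", 14, "hostname", 1))  # 0.5
```

The prefix that already equals the longest common prefix contributes nothing to the loss, which is why only the longer prefix "hostname" adds to lw in both calls.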

[0054] For identifying a bucket pair b, b' with a minimum loss weight, the loss weights for different pairs of buckets are computed, and the pair with the minimum loss weight is identified for further updating of the buckets. In an example, the loss weight for a bucket pair b, b' with prefixes p_b and p_b' is computed only if the prefix tree has a branch node representing the longest common prefix p_b ∩ p_b'.

[0055] Further, based on the value of the loss weight for the identified pair of buckets, the Top-prefix finder 116 revises or updates the buckets to maximize the total weight preserved by the prefixes in the buckets, and to keep in the buckets the prefixes that are associated with a maximum number of distinct strings. For this, if the loss weight lw for the identified bucket pair b, b' with prefixes p_b and p_b' has a value less than 1, then the bucket b is filled with the longest common prefix p_b ∩ p_b' to replace the prefix p_b in the bucket b. For revision of the deploy weight W_b, the deploy weight of the branch node representing the longest common prefix p_b ∩ p_b' is computed as the sum of the deploy weights W_b and W_b' minus the loss weight lw. This deploy weight is assigned to the branch node representing the longest common prefix p_b ∩ p_b', and replaces the deploy weight W_b in the bucket b. In addition, the other bucket b' is emptied by removing the prefix p_b' and the corresponding deploy weight W_b', and the longest prefix represented by the leaf node for the string s is filled in the bucket b'. For the deploy weight W_b', the deploy weight of the leaf node representing the longest prefix of the string s is assigned to be equal to the frequency of the string s. Since the frequency of the string s is incremented by 1, the deploy weight of the leaf node is increased by 1. This deploy weight of the leaf node is stored as the deploy weight W_b' in the bucket b'.

[0056] The deploy weights in all the buckets are indicative of the total weight preserved by the prefixes in the buckets. With the loss weight for a bucket pair b, b' being less than 1, and by updating the buckets as described above, the total deploy weight in the buckets is reduced by a value less than 1 after merging the contributions of the prefixes p_b and p_b' in the bucket b. The total deploy weight in the buckets is increased by 1 by filling the prefix and the deploy weight associated with the string s in the bucket b'. This facilitates maximizing the total weight preserved by the prefixes in the buckets and filling up the buckets with prefixes associated with a maximum number of strings.

[0057] Further, if the loss weight lw for the identified bucket pair b, b' with prefixes p_b and p_b' has a value equal to 1 or more, then the string s is not considered, and the deploy weights in the buckets are reduced by 1.

[0058] With the revision of the deploy weights in the buckets as described above, a deploy weight in one or more buckets may become less than 1. In an implementation, the buckets for which the deploy weights become less than 1 are released or emptied, and made available for filling during the iterative cycle for the next string.

[0059] The description below describes, through an illustrative example, the details of generating and revising a prefix tree for the incoming strings, revising and assigning deploy weights at the nodes, and updating buckets for generation of a histogram in an online environment. Consider a case where the string data, obtained in an online environment, includes four strings: "host", "hostname", "address" and "server" with respective frequencies of 15, 2, 20 and 2, and three Top-prefixes are to be determined to fill a maximum of three buckets for generation of a histogram. The strings are received serially, one-by-one, in real-time. Figures 4(a), 4(b), 4(c), 4(d), and 4(e) illustrate iterative revisions of a prefix tree for the strings in an online environment, according to an example of the present subject matter.

[0060] Initially, the prefix tree only has a root node Rn0, and all the three buckets are empty. In said example, let's say at first the string "host" is received. The prefix tree is revised to include the string "host". Figure 4(a) shows the prefix tree, revised to include the string "host". The prefix tree, as shown in Figure 4(a), has a leaf node Ln1 representing the longest prefix as "host", and has a branch node Bn1 representing the longest common prefix also as "host". As the string "host" is the first string received, the frequency f1 of the string "host" is set as 1 and maintained for the leaf node Ln1. Based on the frequency f1, a deploy weight is assigned to the leaf node Ln1. The deploy weight for the leaf node Ln1 is equal to the frequency f1 at the leaf node Ln1. Now, since all the buckets are empty, the longest prefix of the string "host", represented by the leaf node Ln1, is filled in a first bucket b1, and the deploy weight at the leaf node Ln1 is stored as the deploy weight W_b1 in the first bucket b1.

[0061] After this, let's say the string "host" is again received, one-by-one, 14 times. Each time, the prefix tree is revised to include the string "host", the frequency f1 at the leaf node Ln1 is incremented by 1, and the deploy weight at the leaf node Ln1 is also incremented by 1 in accordance with the frequency f1. With the string "host" matching each time with the prefix stored in the bucket b1, the deploy weight W_b1 in the bucket b1 is revised in accordance with the deploy weight at the leaf node Ln1. After the iterations, the frequency f1 becomes 15, the deploy weight at the leaf node Ln1 becomes 15, and the deploy weight W_b1 in the bucket b1 becomes 15, as shown in Figure 4(b).

[0062] After this, let's say the string "hostname" is received 2 times. Each time, the prefix tree is revised to include the string "hostname". Figure 4(b) shows the prefix tree revised to include the string "hostname". The prefix tree has a branch node Bn2 representing the longest common prefix as "hostname", and has a leaf node Ln2 representing the longest prefix as "hostname". The branch node Bn2 branches out from the branch node Bn1. The branch node Bn1 now represents the longest common prefix of the strings "host" and "hostname". When the string "hostname" is received for the first time, the frequency f2 of the string "hostname" is set as 1 and maintained for the leaf node Ln2. The deploy weight is assigned to the leaf node Ln2 based on the frequency f2. Further, since the string "hostname" does not match the prefix stored in the bucket b1, the longest prefix of the string "hostname" is filled in a second bucket b2, and the deploy weight at the leaf node Ln2 is stored as the deploy weight W_b2 in the second bucket b2. For the next reception of the string "hostname", the prefix tree is revised to include the string "hostname", the frequency f2 at the leaf node Ln2 is incremented by 1, and the deploy weight at the leaf node Ln2 is also incremented by 1. With the string "hostname" matching the prefix in the bucket b2, the deploy weight W_b2 in the bucket b2 is revised in accordance with the deploy weight at the leaf node Ln2. After the iterations, the frequency f2 becomes 2, the deploy weight at the leaf node Ln2 becomes 2, and the deploy weight W_b2 in the bucket b2 becomes 2, as shown in Figure 4(b).

[0063] After this, let's say the string "address" is received 20 times. After the iterations for the string "address", the revised prefix tree has another branch node Bn3 representing the longest common prefix as "address" and has another leaf node Ln3 representing the longest prefix as "address", as shown in Figure 4(c). Also, the frequency f3 at the leaf node Ln3 becomes 20, and the deploy weight at the leaf node Ln3 also becomes 20. For the first iteration with the string "address", since the string "address" does not match the prefixes stored in the buckets b1 and b2, and a third bucket b3 is empty, the longest prefix of the string "address" is filled in the third bucket b3. After the iterations, the deploy weight W_b3 in the bucket b3 becomes 20, as shown in Figure 4(c).

[0064] After this, let's say the string "server" is received once. The prefix tree is again revised to include the string "server". The revised prefix tree, as shown in Figure 4(d), has another leaf node Ln4 representing the longest prefix as "server", and has another branch node Bn4 representing the longest common prefix as "server". The frequency f4 of the string "server" is set as 1 and maintained for the leaf node Ln4. The deploy weight, equal to the frequency f4, is assigned to the leaf node Ln4.

[0065] Now, for updating the buckets, since the string "server" does not match the prefixes in the buckets b1, b2 and b3, and since no more empty buckets are available, a bucket pair is identified for which a loss weight is minimum. As mentioned earlier, the loss weight is computed for those bucket pairs for which the prefix tree has respective branch nodes representing the longest common prefixes of the prefixes in the respective bucket pairs. As shown in Figure 4(d), the prefix tree has one branch node Bn1 representing the longest common prefix of the prefixes in the buckets b1 and b2. The loss weight for the bucket pair b1 and b2 is computed through equation (1):

$$lw(b_1, b_2) = 15\left(1 - \frac{4}{4}\right) + 2\left(1 - \frac{4}{8}\right) = 1.$$

Since the value of the loss weight for the buckets b1 and b2 is equal to 1, the string "server" is ignored and the deploy weights W_b1, W_b2, W_b3 in the buckets b1, b2, b3 are reduced by 1, as shown in Figure 4(d). In an example, the branch representing the string "server", the frequency f4, and the deploy weight at the leaf node Ln4 are removed from the prefix tree.

[0066] After this, let's say the string "server" is received once again. The revised prefix tree, as shown in Figure 4(e), again has a leaf node Ln4 representing the longest prefix as "server", and has a branch node Bn4 representing the longest common prefix as "server". The frequency f4 of the string "server" is set as 1 and maintained for the leaf node Ln4. The deploy weight, equal to the frequency f4, is assigned to the leaf node Ln4. Now, again for updating the buckets, since the string "server" does not match the prefixes in the buckets b1, b2 and b3, and since no more empty buckets are available, a bucket pair is again identified for which a loss weight is minimum. Again, the buckets b1 and b2 are identified for loss weight computation, and the loss weight for the bucket pair b1 and b2 is computed through equation (1):

$$lw(b_1, b_2) = 14\left(1 - \frac{4}{4}\right) + 1\left(1 - \frac{4}{8}\right) = 0.5.$$

[0067] Since the value of the loss weight for the buckets b1 and b2 is 0.5 (less than 1), the bucket b1 is filled with the longest common prefix represented by the branch node Bn1. The deploy weight of (14 + 1 - 0.5 = 14.5) is assigned to the branch node Bn1, and this deploy weight is stored as the deploy weight W_b1 in the bucket b1. Also, the prefix "hostname" and the corresponding deploy weight W_b2 are removed from the bucket b2. The longest prefix represented by the leaf node Ln4 is filled in the bucket b2, and the deploy weight at the leaf node Ln4 is stored as the deploy weight W_b2 in the bucket b2. The prefixes and the deploy weights W_b1, W_b2, W_b3 in the buckets b1, b2, b3 are as shown in Figure 4(e). With this, the total deploy weight of the buckets is reduced by 0.5 due to merging of the contributions of the strings "host" and "hostname" in the bucket b1, and increased by 1 due to filling up of the bucket b2 with the contribution of the string "server". Also, with this, the prefixes in the buckets are associated with four distinct strings, instead of three distinct strings as shown in Figures 4(c) and 4(d). Further, the prefixes "host", "address" and "server" are the three Top-prefixes determined for filling up the three buckets, and the deploy weights W_b1, W_b2, W_b3 in the buckets b1, b2, b3 can be used for generation of a histogram over the Top-prefixes for the strings.
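The bucket updates of paragraphs [0051] to [0058] can be sketched in Python as below and replayed on the example stream of [0059]. This is a simplified illustration rather than the described system: it keeps no explicit prefix tree, it treats a bucket pair as eligible whenever its longest common prefix is non-empty (standing in for the branch-node condition of [0054]), and it ignores the incoming string when no pair is eligible. The names `update`, `buckets`, and `freq` are hypothetical.

```python
import os
from itertools import combinations


def loss_weight(p, w, q, v):
    """Equation (1), in the form implied by the worked example."""
    c = len(os.path.commonprefix([p, q]))
    return w * (1 - c / len(p)) + v * (1 - c / len(q))


def update(buckets, freq, s, k):
    """Process one incoming string s against at most k buckets (prefix -> weight)."""
    freq[s] = freq.get(s, 0) + 1
    if s in buckets or len(buckets) < k:
        # s matches a stored prefix, or an empty bucket is available.
        buckets[s] = freq[s]
        return
    # Assumption: eligibility is approximated by a non-empty common prefix.
    pairs = [(loss_weight(p, buckets[p], q, buckets[q]), p, q)
             for p, q in combinations(buckets, 2)
             if os.path.commonprefix([p, q])]
    if pairs and min(pairs)[0] < 1:
        lw, p, q = min(pairs)
        merged = os.path.commonprefix([p, q])
        weight = buckets.pop(p) + buckets.pop(q) - lw
        # Assumption: weights are added if the merged prefix already has a bucket.
        buckets[merged] = buckets.get(merged, 0) + weight
        buckets[s] = buckets.get(s, 0) + freq[s]
    else:
        # s is ignored: every deploy weight drops by 1, and buckets whose
        # weight falls below 1 are released.
        for p in list(buckets):
            buckets[p] -= 1
            if buckets[p] < 1:
                del buckets[p]
        del freq[s]  # the branch created for s is removed again


buckets, freq = {}, {}
stream = ["host"] * 15 + ["hostname"] * 2 + ["address"] * 20 + ["server"] * 2
for s in stream:
    update(buckets, freq, s, k=3)
print(buckets)
```

Under these assumptions the final buckets hold "host" with 14.5, "address" with 19, and "server" with 1, matching the state described for Figure 4(e).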

[0068] The space cost associated with the histogram generated in the online environment is O(|k|), as a maximum of k buckets are used for filling up with the k Top-prefixes for generation of the histogram. Further, the time cost associated with the histogram generated in the online environment for each iterative revision of the prefix tree based on a new string is O(|k|), as each update of a maximum of k buckets takes time of the order of |k|. The total time cost associated with the histogram depends on the number of strings received in the online environment.

[0069] Although the example of generation of a histogram in the online environment is described for a few strings, the histogram construction system 102 can perform the same procedure with a substantially large number of strings to determine a predefined number of Top-prefixes and generate histograms based on the Top-prefixes for the strings.

Histogram Generation in Offline Environment

[0070] In an implementation, for the purpose of generation of histograms in an offline environment, the data acquiring module 112 obtains string data in an offline manner from one or more data sources 104. The string data includes static strings with a predefined frequency distribution. The predefined frequency distribution has a frequency of each of the static strings in the string data. In an implementation, the frequencies of the strings can be obtained from the respective data sources 104, or can be determined by the data acquiring module 112 after obtaining the static strings.

[0071] The description below describes the process of generating a prefix tree, assigning deploy weights to the nodes, and determining a predefined number of Top-prefixes by maximization of the total weight preserved by the prefixes in the buckets over a maximum number of distinct strings. For the purposes of the description herein, let a string be denoted by s, a frequency of the string s be denoted by f(s), a node of the prefix tree be denoted by d, a fractional weight of the node d be denoted by fw_d, and a prefix represented by the node d be denoted by p_d. Also, the total number of buckets is equal to the predefined number of Top-prefixes that are to be determined for filling the buckets and generating a histogram. Let the predefined number be denoted by k.

[0072] Since, in the offline environment, the string data set with all the strings is known for generation of a histogram, the data structure module 114 generates a prefix tree for all the distinct strings in the string data set. For determining the predefined number of Top-prefixes from the prefix tree, in an implementation, the Top-prefix finder 116 performs a breadth first search to traverse the prefix tree and determine a reverse traverse order for the nodes. The reverse traverse order captures a sequential order of nodes from the bottom of the prefix tree, i.e., from the leaf nodes, towards the top of the prefix tree, i.e., towards the root node.

[0073] After determining the reverse traverse order, the Top-prefix finder 116 computes a fractional weight for each of the nodes in the prefix tree in accordance with the reverse traverse order. The fractional weight of a jth leaf node is computed based on equation (2) below:

$$fw_{d_j} = f(s_j) \qquad (2)$$

where f(s_j) is the frequency of the jth string whose longest prefix p_dj is represented by the jth leaf node. The fractional weight of a jth branch node is computed based on equation (3) below:

$$fw_{d_j} = \sum_{i=1}^{m} fw_{d_i} \times \frac{|p_{d_j}|}{|p_{d_i}|} \qquad (3)$$

where m is equal to the number of child-nodes of the jth branch node, fw_di is the fractional weight of the ith child-node of the jth branch node, |p_dj| is the length of the prefix p_dj represented by the jth branch node, and |p_di| is the length of the prefix p_di represented by the ith child-node of the jth branch node.

[0074] Since the fractional weights are computed in accordance with the reverse traverse order, the fractional weights of child-nodes are known for computing the fractional weight of a branch node. The fractional weight of a leaf node is a measure of a weight preserved by the leaf node with respect to the frequency of the string associated with the leaf node. And, the fractional weight of a branch node is a measure of a fractional weight preserved by the branch node depending on contributions of its child-nodes for weight preservation. The fractional contributions for a branch node are governed by the ratios of the length of the prefix at the branch node and the length of the prefix at the respective child-nodes.
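The leaf-first computation of equations (2) and (3) can be sketched in Python as follows. The `Node` class and `assign_fractional_weights` function are hypothetical names. The two-level sub-tree used as input is an assumption about how the "hostname" node d8 and the "hostnameABCD" node d11 of the later example are arranged, with the frequencies 8 and 10 read off the dw = 8 + 10 computation in [0100]; exact fractions are used so the result can be compared with the value 44/3 stated there.

```python
from collections import deque
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Node:
    prefix: str
    freq: int = 0                      # frequency f(s); used by leaf nodes only
    children: list = field(default_factory=list)
    fw: Fraction = Fraction(0)         # fractional weight, filled in below


def assign_fractional_weights(root: Node) -> None:
    # Breadth-first traversal from the root.
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(node.children)
    # Reverse traverse order: every child is handled before its parent.
    for node in reversed(order):
        if not node.children:
            node.fw = Fraction(node.freq)                      # equation (2)
        else:
            node.fw = sum(                                     # equation (3)
                (c.fw * Fraction(len(node.prefix), len(c.prefix))
                 for c in node.children),
                Fraction(0),
            )


d11 = Node("hostnameABCD", children=[Node("hostnameABCD", freq=10)])
d8 = Node("hostname", children=[Node("hostname", freq=8), d11])
assign_fractional_weights(d8)
print(d11.fw, d8.fw)  # 10 44/3
```

The child "hostnameABCD" contributes only 8/12 of its weight to "hostname", which is the length-ratio discount that equation (3) applies to each child.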

[0075] After computing the fractional weights for all the nodes, the Top-prefix finder 116 assigns deploy weights to the nodes. For a node d, a number of deploy weights are computed and assigned to the node d depending on the number of buckets, from 1 to at most k buckets, which can possibly be filled by the prefixes at the sub-tree nodes rooted at the node d and by the prefixes at further sub-tree nodes rooted at child-nodes of the node d. For the purposes of the description herein, let the deploy weight assigned to the node d be denoted by dw_d. Let dw_d^1, dw_d^2, ..., dw_d^k denote the deploy weight of the node d when 1, 2, ..., k buckets are filled with 1, 2, ..., k prefixes represented by the sub-tree nodes rooted at the node d and by the further sub-tree nodes rooted at the child-nodes of the node d. The deploy weight dw_d^t is indicative of a maximum weight preserved upon filling t number of buckets with t number of prefixes represented by the sub-tree nodes rooted at the node d and by the further sub-tree nodes rooted at the child-nodes of the node d.

[0076] In addition, for each node d and against each deploy weight dw_d^t, the combination of sub-tree nodes representing the prefixes, for which the weight preserved is maximum, is determined as an arrangement set. Let the arrangement set for the deploy weight dw_d^t be denoted by {arr_d^t}. The arrangement set {arr_d^t} is indicative of the sub-tree nodes at the node d whose prefixes, if filled in the t number of buckets, will result in the maximum weight preservation.

[0077] In addition, for each node d and against each deploy weight dw_d^t, depending on the arrangement set {arr_d^t}, a leak weight is computed. Let the leak weight for the deploy weight dw_d^t and the arrangement set {arr_d^t} be denoted by lw_d^t. The leak weight lw_d^t is indicative of the information leaking across the node d when t number of buckets are filled. The leak weight lw_d^t is a measure of the total information of the sub-tree nodes at the node d minus the deploy weight dw_d^t.

[0078] The description below describes the computation and determination of the deploy weights dw_d, the leak weights lw_d and the arrangement sets {arr_d}, which can be followed for each node d. The deploy weights dw_d, the leak weights lw_d and the arrangement sets {arr_d} are computed and determined for the nodes in accordance with the reverse traverse order. With this, the deploy weights dw_d and the leak weights lw_d of the child-nodes are known for computing the deploy weights dw_d and the leak weights lw_d of a branch node.

[0079] For each leaf node, since there is no child branch node, only one bucket (t = 1) can be filled by the prefix represented by the leaf node. The deploy weight dw_dj^1 for the jth leaf node is computed based on equation (4) below:

$$dw_{d_j}^{1} = fw_{d_j} \qquad (4)$$

where fw_dj is the fractional weight for the jth leaf node. The leak weight lw_dj^1 for the jth leaf node is zero, and the corresponding arrangement set {arr_dj^1} refers to the leaf node.

[0080] For a node d other than the leaf nodes, one to at most k buckets (t = 1 to k) can possibly be filled by the prefixes at the sub-tree branch nodes rooted at that node d . The number of buckets that can be filled depends on the number of sub-tree child nodes rooted at that node d. Let's say the jth node dj in the prefix tree has q number of child branch nodes in the sub-tree rooted at the node dj. Then the number of sub-tree branch nodes rooted at the node dj is equal to q + 1.

[0081] For the jth node dj, with one bucket being possibly filled, i.e., t = 1, the deploy weight dw_dj^1 is computed based on equation (5) below:

$$dw_{d_j}^{1} = \max\{fw_{d_j},\ dw_{d_i}^{1} : i = 1 \text{ to } q\} \qquad (5)$$

where fw_dj is the fractional weight for the node dj, and dw_di^1 is the deploy weight of the ith child branch node of the node dj for one filled bucket. The function max{} means that the deploy weight dw_dj^1 takes the value that is the maximum among fw_dj and the dw_di^1 values.
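As an illustration of the single-bucket case, the short Python sketch below picks the larger of a node's own fractional weight and the single-bucket deploy weights of its child branch nodes, and also returns the node that attains that value (the single-element arrangement set). The function name and the dictionary layout are assumptions; on ties the node itself is preferred, which is a choice made here and is not stated in the source. The input numbers are those given for node d3 in [0101].

```python
from fractions import Fraction


def deploy_weight_one_bucket(node, fw, child_dw1):
    """Return (dw^1, arrangement node) for one node.

    child_dw1 maps each child branch node to (its dw^1, its arrangement node).
    """
    best_weight, best_node = fw, node
    for dw, arr in child_dw1.values():
        if dw > best_weight:           # strict: ties keep the node itself
            best_weight, best_node = dw, arr
    return best_weight, best_node


children_of_d3 = {
    "d6": (Fraction(9), "d6"),
    "d7": (Fraction(5), "d7"),
    "d8": (Fraction(44, 3), "d8"),
    "d9": (Fraction(10), "d9"),
}
print(deploy_weight_one_bucket("d3", Fraction(92, 3), children_of_d3))
```

With these inputs the node d3 itself is selected with weight 92/3, since its fractional weight exceeds every child's single-bucket deploy weight.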

[0082] Further, for t = 1, the arrangement set {arr_dj^1} refers to the node, from the sub-tree branch nodes rooted at the node dj, whose value is taken as the deploy weight dw_dj^1. Further, for t = 1, the leak weight lw_dj^1 is computed based on equation (6) below, which distinguishes the case in which {arr_dj^1} includes the node dj from the case in which {arr_dj^1} does not include the node dj:

[Equation (6): two-case expression, rendered as image imgf000024_0001 in the original publication]

where lw_di^0 = fw_di, |p_dj| is the length of the prefix at the node dj, and |p_di| is the length of the prefix at the ith child branch node of the node dj.

[0083] Further, for the jth node dj, with the possible number of buckets filled being equal to the number of sub-tree branch nodes of the node dj, i.e., t = q, the deploy weight dw_dj^q is computed based on equation (7) below:

$$dw_{d_j}^{q} = f(s_j) + \sum_{i=1}^{q} f(s_i) \qquad (7)$$

where f(s_j) is the frequency of the string s_j whose prefix is represented by the node dj, and f(s_i) is the frequency of the string s_i whose prefix is represented by the ith child branch node of the node dj.

[0084] Further, for t = k, the arrangement set {arr_dj^k} refers to the sub-tree branch nodes rooted at the node dj. Further, for t = k, the leak weight lw_dj^k is zero.

[0085] Further, for the jth node dj, with the possible number of buckets filled being more than one and less than the number of sub-tree branch nodes at the node dj, i.e., 1 < t < k < q+1, a "deployment factor", denoted by x, is defined for the node dj for computing the deploy weight dw_dj^t. The deployment factor x_i denotes the number of buckets that can be filled by, or deployed on, the sub-tree branch nodes rooted at the ith child branch node of the node dj. With q child branch nodes of the node dj, x_1 refers to the number of buckets that can be filled by the sub-tree branch nodes rooted at the first child branch node, x_2 refers to the number of buckets that can be filled by the sub-tree branch nodes rooted at the second child branch node, and so on. Here, x_0 refers to the number of buckets that can be filled by the node dj itself. Thus, x_0 is 1 if a bucket is filled by the node dj, and 0 if no bucket is filled by the node dj. For the various possible values of x_0, x_1, x_2, ..., x_q for the node dj, each deployment factor set {X} is defined as {x_0, x_1, x_2, ..., x_q}.

[0086] Now, for computing the deploy weight dw_dj^t, all possible combinations of the deployment factors x are enumerated in the deployment factor sets {X_t}, such that ∑ x_i = t, where i = 0 to q. With this, the deploy weight dw_dj^t is computed based on equation (8) below:

Figure imgf000025_0001

where dw_di^{x_i} is the deploy weight of the ith child branch node at the node dj, lw_di^{x_i} is the leak weight of the ith child branch node at the node dj, |p_dj| is the length of the prefix at the node dj, and |p_di| is the length of the prefix at the ith child branch node of the node dj. Here, lw_di^0 = fw_di, and max_{X_t}{} means the value that is maximum over all the enumerated deployment factor sets {X_t} for the node dj.

[0087] Further, the arrangement set {arr_dj^t} is determined based on the deployment factors in the deployment factor set {X_t} that decides the deploy weight dw_dj^t. Based on the determined arrangement set {arr_dj^t}, the leak weight lw_dj^t is computed through equation (9) below:

Figure imgf000025_0002

[0088] Based on equations (8) and (9), the deploy weight dw_dj^t and the leak weight lw_dj^t are computed, and the arrangement set {arr_dj^t} is determined with t = 2, 3, and so on, up to t < k < q+1 for each node dj. These computations enable identifying the combinations of nodes in each branch rooted at the root node of the prefix tree for which the weight preserved is maximum when 1 to at most k buckets are filled by the prefixes at those combinations of nodes.

[0089] After determining the deploy weights, the leak weights, and the arrangement sets for the leaf nodes and the branch nodes of the prefix tree, the deploy weights and the arrangement sets are computed and determined for the root node of the prefix tree in the manner described above using equations (5), (7) and (8). For this, the node d is considered as the root node in equations (5), (7) and (8).

[0090] Based on the computations for the root node, the arrangement set {arr_Rn0^k} captures and refers to those k nodes whose prefixes, when filled in the k buckets, preserve the maximum weight. The prefixes represented by such k nodes are the Top-prefixes that can be filled in the k buckets. Subsequent to this, the histogram generator 118 generates a histogram for the strings received in the offline environment based on the deploy weights of those k nodes identified from the arrangement set {arr_Rn0^k}.

[0091] In an implementation, for each node d, the deploy weights dw_d^t, the leak weights lw_d^t and the arrangement sets {arr_d^t} are stored as elements of an array. Let the array for the node d be denoted by V_d.

[0092] The description below describes, through an illustrative example, the details of generating a prefix tree for the static strings, assigning deploy weights to nodes, and determining a predefined number of Top-prefixes to fill in the predefined number of buckets for generation of a histogram in an offline environment. Consider a case where the string data, obtained in an offline environment, includes strings s as listed in Table 1 below. Table 1 also lists frequencies f(s) for the received strings. Let's say three Top-prefixes are to be determined to fill a maximum of three buckets, i.e., the maximum value of k is 3, for generation of a histogram.

Table 1

Figure imgf000026_0001

[0093] Figure 5 illustrates a prefix tree for the strings in an offline environment, according to an example of the present subject matter. The prefix tree, as shown, has a root node, multiple branch nodes and multiple leaf nodes based on the strings. Initially, the prefix tree is traversed by performing a breadth first search, and a reverse traverse order for the nodes is determined. The nodes in the prefix tree are sequentially numbered in accordance with the reverse traverse order, as shown in Figure 5. For the purpose of the description herein, a node is denoted as dj, where j is the node number of that node. Table 2 lists the node numbers according to the reverse traverse order, and indicates the prefix p_d represented by the corresponding node d. The node d1 is the root node, the nodes d2, d3, d4, d5, d6, d7, d8, d9, d10, and d11 are the branch nodes, and the nodes d12, d13, d14, d15, d16, d17, d18, d19, d20, and d21 are the leaf nodes.

Table 2

Figure imgf000027_0001

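[0094] After this, in accordance with the reverse traverse order, a fractional weight fw_d of each of the nodes is computed. The fractional weight fw_d of the leaf nodes is computed using equation (2), and the fractional weight fw_d of the branch nodes is computed using equation (3). The values of the fractional weights of the nodes are listed in Table 2. Some example computations of the fractional weights are illustrated below:

For node d11:

$$fw_{d_{11}} = fw_{d_{21}} \times \frac{|hostnameABCD|}{|hostnameABCD|} = 10 \times \frac{12}{12} = 10,$$

For node d4:

$$fw_{d_4} = fw_{d_{14}} \times \frac{|p_{d_4}|}{|p_{d_{14}}|} + fw_{d_{10}} \times \frac{|p_{d_4}|}{|p_{d_{10}}|} = 5 + 3 = 8,$$

and for node d3:

$$fw_{d_3} = fw_{d_{13}} \times \frac{|p_{d_3}|}{|p_{d_{13}}|} + fw_{d_6} \times \frac{|p_{d_3}|}{|p_{d_6}|} + fw_{d_7} \times \frac{|p_{d_3}|}{|p_{d_7}|} + fw_{d_8} \times \frac{|p_{d_3}|}{|p_{d_8}|} + fw_{d_9} \times \frac{|p_{d_3}|}{|p_{d_9}|} = \frac{92}{3}.$$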

[0095] After computing the fractional weights fw_d for all the nodes, the deploy weights dw_d^t, the leak weights lw_d^t, and the arrangement sets {arr_d^t} are computed and/or determined for all the nodes, in accordance with the reverse traverse order. The computations and determinations are carried out in the manner described earlier. In an example, for each node d, the deploy weights dw_d^t, the leak weights lw_d^t, and the arrangement sets {arr_d^t} are stored in an array V_d with at most k cells, where the tth cell of the array V_d is {V_d^t} = {dw_d^t, lw_d^t, {arr_d^t}}. For a node d, t can take values from 1 < t < k < q+1, where q is the number of child branch nodes at the node d, and q+1 refers to the number of sub-tree branch nodes rooted at the node d.

[0096] Table 3 illustrates values of the deploy weights, the leak weights and the arrangement sets for the leaf nodes. Since only one bucket can be filled by the prefix represented by a leaf node, the value of t is equal to 1 and the array Vd has one cell for each leaf node. The value of dWd 1 for each leaf node is computed through equation (4).

Table 3

Figure imgf000028_0001
{V_d12^1} = {dw_d12^1, lw_d12^1, {arr_d12^1}} | {5, 0, {d12}}

[0097] Table 4 illustrates values of the deploy weights, the leak weights and the arrangement sets for the branch nodes. For the nodes d11, d10, d9, d7, d6, d5 and d2, only one bucket can possibly be filled by the prefix at the respective nodes. Thus, t is equal to 1, and the corresponding array V_d has one cell. For the node d4, one or two buckets can possibly be filled by the prefixes at the sub-tree branch nodes rooted at the node d4. Thus, t can be equal to 1 or 2, and the array V_d4 has 2 cells, {V_d4^1} and {V_d4^2}. Similarly, for the node d8, t can be equal to 1 or 2, and the array V_d8 has 2 cells, {V_d8^1} and {V_d8^2}. The values of the deploy weights dw_d^t and the leak weights lw_d^t for the branch nodes are computed through equations (5) to (9).

Table 4

Figure imgf000029_0001

[0098] Some example computations of the deploy weights, the leak weights, and the arrangement sets are illustrated below:

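[0099] For the node d8, with t = 1:

$$dw_{d_8}^{1} = \max\{fw_{d_8},\ dw_{d_{11}}^{1}\} = \max\left\{\frac{44}{3},\ 10\right\} = \frac{44}{3},$$

$$\{arr_{d_8}^{1}\} = \{d_8\}.$$

[0100] For the node d8, with t = 2:

$$dw_{d_8}^{2} = 8 + 10 = 18,$$

$$lw_{d_8}^{2} = 0,$$

$$\{arr_{d_8}^{2}\} = \{d_8, d_{11}\}.$$

[0101] For the node d3, with t = 1:

$$dw_{d_3}^{1} = \max\{fw_{d_3},\ dw_{d_6}^{1},\ dw_{d_7}^{1},\ dw_{d_8}^{1},\ dw_{d_9}^{1}\} = \max\left\{\frac{92}{3},\ 9,\ 5,\ \frac{44}{3},\ 10\right\} = \frac{92}{3},$$

$$lw_{d_3}^{1} = 0,$$

$$\{arr_{d_3}^{1}\} = \{d_3\}.$$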

[0102] For the node d3, with t = 2, the possible deployment factor sets {X_2} are shown in Table 5. The node d3 has four child branch nodes d6, d7, d8 and d9. The deploy weight dw_d3^2 is computed using equation (8) over all the possible deployment factor sets {X_2}. The deploy weight dw_d3^2 takes the value corresponding to the deployment factor set {1, 0, 0, 1, 0}. Thus, for the node d3:

[Computation of dw_d3^2 through equation (8) for the deployment factor set {1, 0, 0, 1, 0}, rendered as image imgf000030_0001 in the original publication]

$$lw_{d_3}^{2} = 0,$$

$$\{arr_{d_3}^{2}\} = \{d_3, d_8\}.$$

Table 5

Figure imgf000030_0002
{0, 0, 0, 1 , 1 }

{0, 0, 0, 2, 0}
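The deployment factor sets for a node can be enumerated mechanically. The sketch below is an illustration under stated assumptions rather than the method of equation (8): each set is taken to be a tuple over (d3 itself, d6, d7, d8, d9) whose entries sum to t, and the per-position caps are inferred from the text (d3 itself holds at most one prefix, d8 can use up to two buckets, and d6, d7 and d9 one each). The function name and the `caps` parameter are hypothetical.

```python
from itertools import product

def deployment_factor_sets(t, caps):
    """Return every tuple x with 0 <= x[i] <= caps[i] and sum(x) == t."""
    ranges = [range(min(c, t) + 1) for c in caps]
    return [x for x in product(*ranges) if sum(x) == t]

# Positions: (d3 itself, d6, d7, d8, d9). The caps are an assumption.
caps = (1, 1, 1, 2, 1)
sets_t2 = deployment_factor_sets(2, caps)
for x in sets_t2:
    print(x)
```

Under these assumed caps the enumeration includes {1, 0, 0, 1, 0}, {0, 0, 0, 1, 1} and {0, 0, 0, 2, 0}; the deploy weight for t = 2 would then be the best value found over all enumerated sets.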

[0103] After this, the deploy weights and the arrangement sets are computed and determined for the root node of the prefix tree using equations (5), (7), and (8). For the root node, with t = 1: $dw_{Rno}^{1}$ = 92/3 and $\{arr_{Rno}^{1}\}$ = {d3}. With t = 2: $dw_{Rno}^{2}$ = 116/3 and $\{arr_{Rno}^{2}\}$ = {d3, d4}. And, with t = 3: $dw_{Rno}^{3}$ = 137/3 and $\{arr_{Rno}^{3}\}$ = {d3, d4, d5}. Based on the computations for the root node, the nodes d3, d4 and d5, as indicated in the arrangement set $\{arr_{Rno}^{3}\}$, are the three nodes whose prefixes, when filled in three buckets, preserve the maximum weight. Thus, the prefixes "host", "server" and "code" represented by the nodes d3, d4, and d5 are the three Top-prefixes determined for filling up the three buckets, and the deploy weights associated with these nodes are stored in the buckets, which can be used for generation of a histogram for the strings.

[0104] The space cost associated with the histogram generated in the offline environment is O(|D| k f), as the |D| distinct strings are represented by |D| leaf nodes, and a maximum of k buckets are used for filling up with k prefixes. Here, f denotes the maximum fan-out of the prefix tree, which is indicative of the maximum number of distinct characters that can be a part of a string. Further, the time cost associated with the histogram generated in the offline environment is O(|D| k k^g), as |D| leaf nodes are parsed to fill k buckets, and, for one node, a maximum of k buckets are distributed to the g child-nodes of that node.

[0105] Although the example of generation of a histogram in the offline environment is described for a few strings, the histogram construction system 102 can perform the same procedure with a substantially large number of strings to determine a predefined number of Top-prefixes and generate histograms based on the Top-prefixes for the strings.

[0106] Figure 6 illustrates a method 600 of generation of a histogram for string data, according to an example of the present subject matter. Figure 7 illustrates a method 700 of generation of a histogram for string data in an online environment, according to an example of the present subject matter. Figure 8 illustrates a method 800 of generation of a histogram for string data in an offline environment, according to an example of the present subject matter. The order in which the methods 600, 700, and 800 are described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the methods 600, 700, and 800, or an alternative method. Additionally, individual blocks may be deleted from the methods 600, 700, and 800 without departing from the spirit and scope of the subject matter described herein.

[0107] Furthermore, the methods 600, 700, and 800 can be implemented by processor(s) or computing devices in any suitable hardware, non-transitory machine readable instructions, or combination thereof. It may be understood that steps of the methods 600, 700, and 800 may be executed based on instructions stored in a non-transitory computer readable medium as will be readily understood. The non-transitory computer readable medium may include, for example, digital data storage media, digital memories, magnetic storage media, such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media.

[0108] Further, although the methods 600, 700, and 800 may be implemented in computing devices in different network environments for generation of histograms for string data, in examples described in Figure 6, Figure 7, and Figure 8, the methods 600, 700, and 800 are explained in context of the aforementioned histogram construction system 102, for ease of explanation.

[0109] Referring to Figure 6, at block 602, a prefix tree is generated for strings in string data. The strings are received and the prefix tree is generated by the histogram construction system 102. The strings may be received in an online environment or an offline environment. The prefix tree includes nodes that represent prefixes of the received strings.
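The prefix tree of block 602 can be sketched as follows. This is a minimal illustration, not the system's actual data structure module: it assumes leaf nodes hold whole strings and branch nodes hold the longest common prefix of their children, as recited in the claims. The names `Node`, `build` and `build_prefix_tree`, and the sample strings, are hypothetical.

```python
import os
from dataclasses import dataclass, field

@dataclass
class Node:
    prefix: str
    children: list = field(default_factory=list)

    @property
    def is_leaf(self):
        return not self.children

def build(strings):
    """Build a sub-tree for a non-empty list of distinct strings."""
    if len(strings) == 1:
        return Node(strings[0])           # leaf: the whole string
    lcp = os.path.commonprefix(strings)   # longest common prefix of the group
    node = Node(lcp)
    groups = {}
    for s in strings:
        # Group by the character following the common prefix; a string equal
        # to the prefix itself falls into its own group under the key "".
        key = s[len(lcp):len(lcp) + 1]
        groups.setdefault(key, []).append(s)
    for key in sorted(groups):
        node.children.append(build(groups[key]))
    return node

def build_prefix_tree(strings):
    """Return a root node with the empty prefix covering all distinct strings."""
    distinct = sorted(set(strings))
    top = build(distinct)
    # Keep an explicit root with the empty prefix above the top-level node.
    return top if top.prefix == "" and not top.is_leaf else Node("", [top])

root = build_prefix_tree(["host1", "host2", "hostname", "server", "code"])
print([child.prefix for child in root.children])  # ['code', 'host', 'server']
```

Because every branch node is split on the character that follows its prefix, each branch node has at least two children and no chain of single-child nodes is created.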

[0110] Based on the nodes in the prefix tree, deploy weights are assigned to the nodes at block 604. The deploy weights are assigned to the nodes based on lengths of the prefixes represented by sub-tree nodes rooted at the nodes and based on frequencies of the strings whose prefixes are represented by the sub-tree nodes. Each of the deploy weights of one node is indicative of a maximum weight preserved upon filling the buckets with at least one prefix represented by the sub-tree nodes rooted at that one node. The deploy weights are assigned by the histogram construction system 102.

[0111] At block 606, a predefined number of Top-prefixes of the strings are determined for filling the predefined number of buckets. The predefined number of Top-prefixes is determined from the prefixes represented by the nodes based on maximizing a total weight preserved by the prefixes in the buckets and over a maximum number of strings. The Top-prefixes are determined by the histogram construction system 102.

[0112] At block 608, a histogram is generated based on the deploy weights associated with the Top-prefixes in the buckets. The histogram is generated by the histogram construction system 102. The histogram may be generated for the purposes of data mining, data analytics, and approximate query answering.

[0113] Referring to Figure 7, the string data is received online, in real time. The strings in the string data are serially received one-by-one. The prefix tree initially has a root node, and the predefined number of buckets that are to be filled by the Top-prefixes are empty. At block 702, a string is received and the prefix tree is updated to include the string. At block 704, it is checked whether the string matches a prefix in one bucket. For this, the string is compared with the prefixes in the buckets. If the string matches a prefix in a bucket ('Yes' branch from block 704), the deploy weight in the bucket having the prefix that matches the string is incremented by 1, at block 706. The revised deploy weight is assigned to the bucket, and the method 700 proceeds to receive the next string for processing and/or proceeds to generate a histogram at block 720.

[0114] If the string is not matched ('No' branch from block 704), it is checked at block 708 whether an empty or unfilled bucket, from the maximum of predefined number of buckets, exists. If an unfilled bucket is found ('Yes' branch from block 708), a longest prefix of the string is filled in the unfilled bucket and the deploy weight of the node representing the longest prefix is stored in the unfilled bucket, at block 710. For this, the deploy weight is assigned to the node representing the longest prefix, based on the frequency of the string, before storing the same in the unfilled bucket. The method 700 then proceeds to receive the next string for processing and/or proceeds to generate a histogram at block 720.

[0115] Further, if no unfilled bucket is found ('No' branch from block 708), a bucket pair with prefixes is identified, at block 712, for which a loss weight is minimum. For this, a loss weight for each bucket pair is computed as described earlier and the pair with the minimum loss weight is taken as the bucket pair for further processing.

[0116] At block 714, it is checked whether the value of the loss weight for the identified bucket pair is less than 1. If the value of the loss weight is ≥ 1 ('No' branch from block 714), the deploy weights in the buckets are reduced by 1, at block 716, and the method 700 proceeds to receive the next string for processing and/or proceeds to generate a histogram at block 720. If the value of the loss weight is < 1 ('Yes' branch from block 714), then, at block 718, one bucket of the identified bucket pair is filled by the longest common prefix of the prefixes in the bucket pair, and the deploy weight in that one bucket is revised as a sum of the deploy weights associated with the prefixes in the bucket pair minus the loss weight. The other bucket of the bucket pair is filled with a longest prefix of the string, and the deploy weight of the node representing the longest prefix of the string is stored in that other bucket. For this, the deploy weight is assigned to the node representing the longest prefix of the string, based on the frequency of the string, before storing the same in the bucket. The method 700 then proceeds to receive the next string for processing and/or proceeds to generate a histogram at block 720.

[0117] At block 720, a histogram is generated based on the deploy weights associated with the prefixes in the buckets.
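The control flow of blocks 704 to 718 can be sketched as a single update step. This is a hedged illustration under several assumptions that the text does not fix: buckets are modelled as a dictionary from prefix to deploy weight, a match means the string starts with a bucket prefix (the longest such prefix wins), a newly stored string receives a deploy weight of 1, k is at least 2, and the loss weight is supplied by the caller as the hypothetical callable `loss_weight`, since its definition lies outside this passage.

```python
import os

def update_buckets(buckets, s, k, loss_weight):
    """Apply one online update step to `buckets` (prefix -> deploy weight)."""
    # Blocks 704/706: the string matches a prefix already held in a bucket.
    matches = [p for p in buckets if s.startswith(p)]
    if matches:
        buckets[max(matches, key=len)] += 1
        return
    # Blocks 708/710: an unfilled bucket exists, so store the string itself.
    if len(buckets) < k:
        buckets[s] = 1
        return
    # Block 712: identify the bucket pair with the minimum loss weight.
    prefixes = list(buckets)
    pairs = [(p, q) for i, p in enumerate(prefixes) for q in prefixes[i + 1:]]
    p, q = min(pairs, key=lambda pq: loss_weight(buckets, *pq))
    loss = loss_weight(buckets, p, q)
    if loss >= 1:
        # Block 716: reduce the deploy weight in every bucket by one.
        for key in buckets:
            buckets[key] -= 1
        return
    # Block 718: merge the pair into its longest common prefix and reuse the
    # released bucket for the new string. dict.get guards against the merged
    # prefix (or the string) coinciding with an existing key.
    merged = buckets.pop(p) + buckets.pop(q) - loss
    lcp = os.path.commonprefix([p, q])
    buckets[lcp] = buckets.get(lcp, 0) + merged
    buckets[s] = buckets.get(s, 0) + 1

def toy_loss(buckets, p, q):
    # Hypothetical stand-in for the loss weight, used only for this demo.
    return 0.5

buckets = {}
for s in ["host1", "host1", "host2", "server"]:
    update_buckets(buckets, s, k=2, loss_weight=toy_loss)
print(buckets)  # {'host': 2.5, 'server': 1}
```

With two buckets, the fourth string forces the pair "host1" and "host2" to merge into "host" with weight 2 + 1 - 0.5, and the released bucket takes "server". How weights that reach zero after block 716 are handled is not specified in this passage, so the sketch leaves them in place.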

[0118] Referring to Figure 8, at block 802, string data having strings with a predefined frequency distribution is received. The string data is received offline, and the strings are static strings with fixed frequencies. At block 804, a prefix tree is generated for the received strings. The prefix tree is generated for distinct strings. Based on the prefix tree, a breadth-first search is performed to traverse the prefix tree and a reverse traverse order for the nodes is determined, at block 806.

[0119] Based on the reverse traverse order, fractional weights for the leaf nodes and the branch nodes in the prefix tree are computed, at block 808. After this, at block 810, a number of deploy weights are computed and assigned to each node. The deploy weights are computed for each node depending on the number of buckets, from 1 to at most the predefined number, which can be filled by the prefixes at sub-tree nodes rooted at that node and by the prefixes at further sub-tree nodes rooted at child-nodes of that node. The deploy weights for the nodes are computed based on the reverse traverse order and based on the fractional weights of the sub-tree nodes, frequencies of the strings whose prefixes are represented by the sub-tree nodes, lengths of the prefixes represented by the sub-tree nodes, and the deploy weights of sub-tree nodes.
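The reverse traverse order of block 806 can be sketched as a breadth-first search whose visiting order is reversed, so that every node is processed after all of its descendants. The tree is represented here as a hypothetical dictionary from a node label to its list of child labels; the fragment used in the example is illustrative and is not the full tree of the worked example.

```python
from collections import deque

def reverse_bfs_order(root, children):
    """Return node labels in reverse breadth-first order (deepest level first)."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))  # enqueue children left to right
    order.reverse()
    return order

# Illustrative fragment: d3 has children d6 to d9, and d8 has child d11.
children = {"root": ["d3"], "d3": ["d6", "d7", "d8", "d9"], "d8": ["d11"]}
order = reverse_bfs_order("root", children)
print(order)  # ['d11', 'd9', 'd8', 'd7', 'd6', 'd3', 'root']
```

Processing nodes in this order ensures that the deploy weights of all child-nodes are available before the deploy weights of their parent are computed, with the root node handled last.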

[0120] At block 812, deploy weights are computed for the root node of the prefix tree. The deploy weights of the root node are computed for the number of buckets, from 1 to at most the predefined number, which can be filled by the prefixes at sub-tree nodes rooted at the root node and at the further sub-tree nodes rooted at the child-nodes of those sub-tree nodes. The deploy weights for the root node are computed based on the deploy weights of the sub-tree nodes rooted at the root node.

[0121] Based on the deploy weights of the root node, at block 814, the predefined number of Top-prefixes is determined from the prefixes based on which the deploy weights of the root node are computed. The Top-prefixes are those prefixes, represented by the sub-tree nodes rooted at the root node and by the further sub-tree nodes rooted at the child-nodes of those sub-tree nodes, for which the deploy weight of the root node indicates a maximum weight preserved upon filling the predefined number of buckets.

[0122] At block 816, a histogram is generated based on the deploy weights associated with the predefined number of Top-prefixes determined based on the deploy weights of the root node.

[0123] Figure 9 illustrates a system environment 900 for generation of a histogram for string data, according to an example of the present subject matter. The system environment 900 may be a public networking environment or a private networking environment. In one implementation, the system environment 900 includes a processing resource 902 communicatively coupled to a computer readable medium 904 through a communication link 906.

[0124] For example, the processing resource 902 can be a computing device for generating histograms. The computer readable medium 904 can be, for example, an internal memory device or an external memory device. In one implementation, the communication link 906 may be a direct communication link, such as any memory read/write interface. In another implementation, the communication link 906 may be an indirect communication link, such as a network interface. In such a case, the processing resource 902 can access the computer readable medium 904 through a network 908. The network 908 may be a single network or a combination of multiple networks and may use a variety of different communication protocols.

[0125] The processing resource 902 and the computer readable medium 904 may also be communicatively coupled to data sources 910 through the communication link 906, and/or to communication devices 912 over the network 908. The coupling with the data sources 910 enables receiving the string data in an offline environment, and the coupling with the communication devices 912 enables receiving the string data in an online environment.

[0126] In one implementation, the computer readable medium 904 includes a set of computer readable instructions, such as the data acquiring module 112, the data structure module 114, the Top-prefix finder 116, and the histogram generator 118. The set of computer readable instructions can be accessed by the processing resource 902 through the communication link 906 and subsequently executed to perform acts for generating histograms for string data.

[0127] For example, the data acquiring module 112 can obtain string data comprising strings. Based on the obtained strings, the data structure module 114 can generate a prefix tree for distributing the strings into nodes that represent prefixes of the strings. Based on the nodes in the prefix tree, the Top-prefix finder 116 can assign deploy weights to the nodes.

[0128] Further, based on the deploy weights of the nodes, the Top-prefix finder 116 can determine or find a predefined number of Top-prefixes of the strings for filling up the predefined number of buckets. The Top-prefixes are determined from the prefixes in the prefix tree, based on maximization of a total weight preserved by the predefined number of prefixes, where the predefined number of prefixes is associated with a maximum number of distinct strings. Each of the Top-prefixes is filled in a separate bucket, and the deploy weight of the node representing each Top-prefix is stored in the corresponding bucket.

[0129] Further, after determining or finding the Top-prefixes for the strings and filling up the buckets, the histogram generator 118 can generate a histogram of the Top-prefixes. The histogram is generated based on the Top-prefixes and the deploy weights associated with the Top-prefixes.

[0130] Although implementations for generation of histograms for string data have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed and explained as example implementations for generation of histograms for string data.

Claims

We claim:
1. A method of generation of a histogram for string data having strings, the method comprising:
generating, by a computing device, a prefix tree having nodes representing prefixes of the strings, the nodes comprising leaf nodes representing longest prefixes of the strings and branch nodes representing longest common prefixes of prefixes represented by child-nodes branching out from the respective branch nodes;
assigning, by the computing device, deploy weights to the nodes based on lengths of prefixes represented by sub-tree nodes rooted at the nodes and frequencies of the strings whose prefixes are represented by the subtree nodes, wherein each of the deploy weights of one node is indicative of a maximum weight preserved upon filling the buckets with at least one prefix represented by the sub-tree nodes rooted at that one node;
determining, by the computing device, a predefined number of Top-prefixes of the strings for filling up the predefined number of buckets, wherein the Top-prefixes are determined from the prefixes represented by the nodes based on maximizing a total weight preserved by the prefixes in the buckets and over a maximum number of strings; and
generating a histogram based on the deploy weights associated with the Top-prefixes in the buckets.
2. The method as claimed in claim 1, wherein the strings are data streams received online in real-time by the computing device from at least one communication device, and wherein the generating the prefix tree comprises iteratively revising the prefix tree to include the strings, one by one, in the prefix tree, and wherein the determining the predefined number of Top-prefixes comprises updating the buckets for each revision of the prefix tree to maximize the total weight preserved by the Top-prefixes in the buckets.
3. The method as claimed in claim 2, wherein the updating of the buckets comprises:
for each of the strings, comparing the each string with the prefixes in the buckets, and revising, based on a frequency of the each string, the deploy weight in the bucket having the prefix that matches with the each string; and
when the each string is not matched, finding an unfilled bucket, filling a longest prefix of the each string in the unfilled bucket, and storing the deploy weight of the node representing the longest prefix in the unfilled bucket.
4. The method as claimed in claim 3, wherein, when each of the strings is not matched with the prefixes in the buckets and no bucket is unfilled, the updating of the buckets comprises:
identifying a bucket pair with prefixes for which a loss weight is minimum, wherein the loss weight is indicative of a loss in weight preserved upon filling one bucket of the bucket pair with a longest common prefix associated with the prefixes in the bucket pair and releasing another bucket of the bucket pair; and
revising the buckets based on the loss weight.
5. The method as claimed in claim 4, wherein the revising of the buckets comprises:
reducing the deploy weights in the buckets by a value of one when the loss weight has a value of at least one; and
when the loss weight has a value of less than one,
filling one bucket of the bucket pair with the longest common prefix associated with the prefixes in the bucket pair;
revising the deploy weight in the one bucket as a sum of the deploy weights associated with the prefixes in the bucket pair minus the loss weight;
filling another bucket of the bucket pair with a longest prefix of the each string; and
storing the deploy weight of the node representing the longest prefix in that other bucket.
6. The method as claimed in claim 1, wherein the strings are static strings with a predetermined frequency distribution obtained by the computing device from at least one data source, and wherein the assigning the deploy weights to the nodes is based on a reverse traverse order for the nodes and based on frequencies of the strings as in the predetermined frequency distribution.
7. The method as claimed in claim 6, further comprising determining the reverse traverse order by traversing the prefix tree based on a breadth first search.
8. The method as claimed in claim 6, wherein the assigning of the deploy weights to the nodes is based on the reverse traverse order, wherein the assigning comprises:
computing a number of deploy weights for each of the nodes depending on a number of buckets, from one to at most the predefined number, which are fillable by prefixes represented by sub-tree nodes rooted at the each node and by further sub-tree nodes rooted at child-nodes of the each node.
9. The method as claimed in claim 8, wherein the assigning of the deploy weights to the nodes comprises computing deploy weights for a root node of the prefix tree depending on a number of buckets, from one to at most the predefined number, which are fillable by prefixes represented by sub-tree nodes rooted at the root node and by further sub-tree nodes rooted at child-nodes of the root node, wherein the deploy weights of the root node are computed based on the deploy weights of nodes rooted at the root node, and wherein the Top-prefixes are determined from the prefixes based on which deploy weights of the root node are computed.
10. A histogram construction system (102) for generation of a histogram for string data, the histogram construction system (102) comprising:
a processor (110);
a data acquiring module (112) coupled to the processor (110) to obtain the string data comprising strings;
a data structure module (114) coupled to the processor (110) to generate a prefix tree comprising nodes that represent prefixes of the strings;
a Top-prefix finder (116) coupled to the processor (110) to:
assign deploy weights to the nodes based on lengths of prefixes represented by sub-tree nodes rooted at the each node and frequencies of the strings whose prefixes are represented by the sub-tree nodes, wherein each of the deploy weights of one node is indicative of a maximum weight preserved upon filling buckets with at least one prefix represented by the sub-tree nodes rooted at that one node; and
determine a predefined number of Top-prefixes of the strings for filling up the predefined number of buckets, wherein the Top-prefixes are determined from the prefixes represented by the nodes based on maximizing a total weight preserved by the prefixes in the buckets over a maximum number of strings; and
a histogram generator (118) coupled to the processor (110) to generate a histogram based on the deploy weights of the nodes representing the Top-prefixes.
11. The histogram construction system (102) as claimed in claim 10, wherein the strings are streamed and received online in real-time from at least one communication device (106), wherein the data structure module (114) iteratively revises the prefix tree to include the strings, one by one, in the prefix tree, and wherein the Top-prefix finder (116), for each revision of the prefix tree, updates the buckets to maximize the total weight preserved by the Top-prefixes in the buckets.
12. The histogram construction system (102) as claimed in claim 11, wherein the Top-prefix finder (116):
compares each of the strings with the prefixes in the buckets, and revises the deploy weight in the bucket having the prefix that matches with the each string;
finds an unfilled bucket when the each string is not matched, fills a longest prefix of the each string in the unfilled bucket, and stores the deploy weight of the node representing the longest prefix in the unfilled bucket;
identifies a bucket pair with prefixes for which a loss weight is minimum when the each string is not matched with the prefixes in the buckets and no bucket is unfilled, wherein the loss weight is indicative of a loss in weight preserved upon filling one bucket of the bucket pair with a longest common prefix associated with the prefixes in the bucket pair and releasing another bucket of the bucket pair; and
revises the buckets based on the loss weight.
13. The histogram construction system (102) as claimed in claim 12, wherein, for revising the buckets, the Top-prefix finder (116):
reduces the deploy weights in the buckets by a value of one when the loss weight has a value of at least one; and
when the loss weight has a value of less than one;
fills one bucket of the bucket pair with the longest common prefix associated with the prefixes in the bucket pair;
revises the deploy weight in the one bucket as a sum of the deploy weights associated with the prefixes in the bucket pair minus the loss weight;
fills another bucket of the bucket pair with a longest prefix of the each string; and
stores the deploy weight of the node representing the longest prefix in that other bucket.
14. The histogram construction system (102) as claimed in claim 10, wherein the strings are static strings with a predetermined frequency distribution received from at least one data source (104), and wherein the Top-prefix finder (116) assigns the deploy weights to the nodes based on a reverse traverse order of the nodes and based on frequencies of the strings as in the predetermined frequency distribution.
15. The histogram construction system (102) as claimed in claim 14, wherein the Top-prefix finder (116) computes a number of deploy weights for each of the nodes based on the reverse traverse order and depending on a number of buckets, from one to at most the predefined number, which are fillable by prefixes represented by sub-tree nodes rooted at the each node and by further sub-tree nodes rooted at child-nodes of the each node.
16. The histogram construction system (102) as claimed in claim 15, wherein the Top-prefix finder (116) computes deploy weights for a root node of the prefix tree depending on a number of buckets, from one to at most the predefined number, which are fillable by prefixes represented by sub-tree nodes rooted at the root node and by further sub-tree nodes rooted at child-nodes of the root node, wherein the deploy weights of the root are computed based on the deploy weights of nodes rooted at the root node, and wherein the Top-prefixes are determined from the prefixes based on which the deploy weights of the root node are computed.
17. A non-transitory computer-readable medium comprising computer readable instructions that, when executed, cause a histogram construction system to:
obtain string data comprising strings;
determine a predefined number of Top-prefixes of the strings for filling up the predefined number of buckets, by:
generating a prefix tree having nodes representing prefixes of the strings, the nodes comprising leaf nodes representing longest prefixes of the strings and branch nodes representing longest common prefixes of prefixes represented by child-nodes branching out from the respective branch node; and
assigning deploy weights to the nodes based on lengths of the prefixes represented by sub-tree nodes rooted at the each node and frequencies of the strings whose prefixes are represented by the sub-tree nodes, wherein each of the deploy weights of one node is indicative of a maximum weight preserved upon filling the buckets with at least one prefix represented by the sub-tree nodes rooted at that one node;
wherein the Top-prefixes are determined from the prefixes represented by the nodes based on maximizing a total weight preserved by the prefixes in the buckets over a maximum number of strings; and
generate a histogram based on the deploy weights associated with the Top-prefixes in the buckets.
PCT/CN2013/075033 2013-04-30 2013-04-30 Histogram construction for string data WO2014176754A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2013/075033 WO2014176754A1 (en) 2013-04-30 2013-04-30 Histogram construction for string data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14/787,548 US20160154854A1 (en) 2013-04-30 2013-04-30 TOP-K Prefix Histogram Construction for String Data
PCT/CN2013/075033 WO2014176754A1 (en) 2013-04-30 2013-04-30 Histogram construction for string data

Publications (1)

Publication Number Publication Date
WO2014176754A1 true WO2014176754A1 (en) 2014-11-06

Family

ID=51843051

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/075033 WO2014176754A1 (en) 2013-04-30 2013-04-30 Histogram construction for string data

Country Status (2)

Country Link
US (1) US20160154854A1 (en)
WO (1) WO2014176754A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6735578B2 (en) * 2001-05-10 2004-05-11 Honeywell International Inc. Indexing of knowledge base in multilayer self-organizing maps with hessian and perturbation induced fast learning
US6978274B1 (en) * 2001-08-31 2005-12-20 Attenex Corporation System and method for dynamically evaluating latent concepts in unstructured documents
US7809744B2 (en) * 2004-06-19 2010-10-05 International Business Machines Corporation Method and system for approximate string matching

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9323695B2 (en) * 2012-11-12 2016-04-26 Facebook, Inc. Predictive cache replacement

Also Published As

Publication number Publication date
US20160154854A1 (en) 2016-06-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13883396

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14787548

Country of ref document: US

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13883396

Country of ref document: EP

Kind code of ref document: A1