WO2011067769A1 - Shared dictionary compression over HTTP proxy - Google Patents

Shared dictionary compression over HTTP proxy

Info

Publication number
WO2011067769A1
Authority
WO
WIPO (PCT)
Prior art keywords
sdch
implementing
over http
protocol according
compression over
Prior art date
Application number
PCT/IL2010/001023
Other languages
English (en)
Inventor
Ariel Yaloz
Roman Shterenzon
Original Assignee
Infogin Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Infogin Ltd.
Publication of WO2011067769A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70 Type of the data to be coded, other than image and sound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091 Data deduplication
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/04 Protocols for data compression, e.g. ROHC

Definitions

  • the present invention relates to web page compression techniques and in particular to Shared Dictionary Compression over HTTP (SDCH).
  • SDCH Shared Dictionary Compression over HTTP
  • the present invention provides a proxy server which implements the Shared Dictionary Compression over HTTP protocol.
  • a system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol including page metrics calculating and aggregating functionality operative to calculate and aggregate page metrics of a web page, compression decision functionality operative to utilize the page metrics to decide whether to compress the web page, feature vector generating functionality operative to generate a feature vector for the web page, web page comparison functionality operative to compare the feature vector of the web page to feature vectors of other web pages and to thereby identify similarities between the web page and the other web pages, web page cluster generation functionality operative to utilize the web page comparison functionality to generate clusters of similar web pages, web page cluster assignment functionality operative to utilize the web page comparison functionality to assign the web page to a cluster of similar web pages, dictionary generating functionality operative to generate and assign a dictionary for each of the clusters of similar web pages, and web page compression functionality operative to compress the web page assigned to the cluster using a dictionary assigned to the cluster.
  • the system resides on a proxy server.
  • the proxy is operative to intercept requests from at least one web client to at least one web server to download at least one web page.
  • the page metrics include the number of times the specific web page has been requested during a predefined time duration. Additionally, the page metrics also include the size of the average payload downloaded in response to the request. Additionally, the page metrics also include a product of the number of times the specific web page has been requested during a predefined time duration and the size of the average payload downloaded in response to the request.
  • the feature vector includes an array of page characteristics, the characteristics differentiating between groups of pages.
  • the characteristics include at least one of statistical analysis of HTML tags, DOM tree, CSS Class names, HTML tag IDs, page size, elements size, links to external resources and page keywords.
  • the feature vector generating functionality includes parsing functionality and tokenizing functionality.
  • the web page cluster generation functionality includes utilization of at least one similarity distance metric.
  • the at least one similarity distance metric is a Euclidean metric.
  • the at least one similarity distance metric is a Pearson metric.
  • calculation of at least one similarity distance metric is achieved by utilizing at least one of Hierarchical Clustering, K-Means and Fuzzy C-Means.
  • the similarities include similarities of web content creation suites used to create the web pages.
  • the web page comparison functionality is operative to disregard private web pages.
  • the private pages include pages to which a client was redirected from an HTTPS connection as specified by a "Referer" request header.
  • the private pages include pages which are accessed after a POST request which resulted in a cookie or in appended request parameters.
  • the private pages include pages which include a "Cache-Control: private" header.
  • the web page comparison functionality is operative to identify the private web pages by accessing URLs associated with potentially private web pages without the use of a session identifier, and comparing received responses to previously recorded responses received when a session identifier was used.
  • the web page comparison functionality is operative to compare content of HTML tags which are shared amongst at least two pages. Additionally, the web page comparison functionality is operative to compare content of HTML tags which are shared amongst at least two pages by utilizing a generic differencing algorithm.
  • the generic differencing algorithm is VCDiff.
  • the dictionary generating functionality is operative to generate and assign a dictionary for each of the clusters which include at least a predetermined number of web pages.
  • the system is operative to store compressed web pages.
  • the system is operative to compress web pages targeted at SDCH enabled clients.
  • the system is operative to identify SDCH enabled clients by analyzing the Accept-Encoding header for presence of the "sdch" string.
  • Fig. 1 is a simplified flowchart indicating steps in the operation of a system implementing the Shared Dictionary Compression over HTTP protocol, constructed and operative in accordance with a preferred embodiment of the present invention.
  • caching techniques such as HTTP caching store complete web pages or objects, and are therefore ineffective in optimizing bandwidth usage for web sites for which only a small portion of a page is modified between subsequent downloads.
  • on a site such as CNN.com (http://www.cnn.com), elements such as the color scheme are modified very rarely, while what is actually modified are the stories embedded in the page layout. This presents an opportunity to implement a mechanism for caching partial pages.
  • the HTTP/1.1 protocol supports response compression via the Accept-Encoding and Content-Encoding headers.
  • the most commonly used HTTP response compression encoding is gzip, which compresses data that is repeated multiple times within a given response.
  • the HTTP/1.1 protocol does not provide a mechanism for compressing data that is repeated over multiple responses.
  • Another class of encoding techniques, known as delta encoding, has proven to be effective in compressing inter-response data.
  • Previous efforts to extend the HTTP/1.1 protocol to support delta encoding have focused on encoding an HTTP response as a delta, or a difference, between the current response and a previous version of the current response.
  • One such approach is discussed in RFC3229 "Delta encoding in HTTP" (http://www.rfc-editor.org/rfc/rfc3229.txt). While RFC3229 is effective in reducing the size of the downloaded payload for many types of resources, it may not be suitable for certain classes of responses. Specifically, under RFC3229, deltas can only be applied to responses originating from the same URL.
  • the previously stored instance to which to apply the delta to recreate an entire response is identified using its Last-Modified timestamp or entity-tag.
  • Content hashes can be used to identify previously stored instances; however, this may result in false positives. Additionally, storing all previous responses on the server may not be practical.
  • a dictionary used by the SDCH protocol is a file downloaded by the client from the server which contains strings or elements that are likely to appear in subsequent HTTP responses. These elements can be stored in the dictionary which is available to both the client and to the server, and the server can substitute these elements with references to the dictionary, allowing the client to reconstruct the original page from these references. By substituting dictionary references for repeated elements in HTTP responses, the payload size can be reduced.
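  • The substitution idea described above can be illustrated with a minimal sketch. It does not implement SDCH or VCDIFF; as a stand-in it uses the preset-dictionary feature of zlib (the `zdict` argument in Python's standard library), which likewise lets the encoder refer to strings that both sides already hold. The dictionary contents and the page are hypothetical.

```python
import zlib

# Hypothetical shared dictionary: boilerplate expected in many responses.
SHARED_DICTIONARY = (
    b'<html><head><link rel="stylesheet" href="/static/site.css"></head>'
    b'<body><div class="header">Example News</div><div class="story">'
)


def compress_with_dictionary(payload: bytes, dictionary: bytes) -> bytes:
    """Compress payload; the encoder may reference strings in dictionary."""
    compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(payload) + compressor.flush()


def decompress_with_dictionary(blob: bytes, dictionary: bytes) -> bytes:
    """Rebuild the payload; the same dictionary must be available here."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


# A page that starts with the shared boilerplate and adds its own story.
page = SHARED_DICTIONARY + b"Local team wins the cup.</div></body></html>"
plain = zlib.compress(page)                               # no dictionary
shared = compress_with_dictionary(page, SHARED_DICTIONARY)
restored = decompress_with_dictionary(shared, SHARED_DICTIONARY)
```

  • In this sketch the dictionary-based output is smaller than the plain one because the boilerplate is sent as a back-reference rather than as literal bytes, which is the effect the SDCH dictionary aims for across responses.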
  • SDCH can only be beneficial when both the client (e.g. web browser) and the web server support the SDCH protocol.
  • adding support for the SDCH protocol requires substantial investment in the content generating software as well as in content generating work flows, and therefore content providers are reluctant to do so. Therefore, for example, users browsing CNN.com with an SDCH enabled web browser cannot benefit from SDCH because currently CNN.com, similarly to the vast majority of internet web servers, does not support SDCH.
  • the SDCH protocol has been implemented in the Google Chrome™ web browser, as well as in Microsoft Internet Explorer™ with the Google™ Toolbar installed.
  • the present invention seeks to provide an SDCH protocol implementation system wherein the SDCH dictionary and the dictionary references are added to HTTP responses by a server other than the web server hosting the web site.
  • the system acts as a proxy which accesses web servers using the standard HTTP 1.0/1.1 protocol, generates the dictionary, and provides the dictionary as well as the related compressed data to an SDCH-enabled browser.
  • This system removes the burden of dictionary generation and server side SDCH support from the web servers, while providing the benefits of this technology to SDCH enabled browsers.
  • the proxy operates in an intercepting and fully transparent mode leaving source and destination IP addresses intact, and does not interfere with the services and access controls provided to the client by the web server, which depend on requests from the client to the web server originating from the client's IP address.
  • the system of Fig. 1 includes page metrics calculating and aggregating functionality operative to calculate and aggregate page metrics of a web page, compression decision functionality operative to utilize said page metrics to decide whether to compress the specific web page, feature vector generating functionality operative to generate a feature vector for the web page, web page comparison functionality operative to compare the feature vectors of the web page to feature vectors of other web pages and to thereby identify similarities between the web page and the other web pages, web page cluster generation functionality operative to utilize the web page comparison functionality to generate clusters of similar web pages, web page cluster assignment functionality operative to utilize said web page comparison functionality to assign said web page to a cluster of similar web pages, dictionary generating functionality operative to generate and assign a dictionary for each of the clusters of similar web pages, and web page compression functionality operative to compress the web page assigned to the cluster in compliance with the SDCH protocol.
  • the system resides on a proxy server which intercepts requests from web clients to web servers to download web pages.
  • the system continuously monitors incoming requests, and for each web page requested the system tracks the number of times the web page is requested during a predefined time duration and the size of the average payload downloaded in response to the requests.
  • a product of these two numbers is used by the system as a page metric to determine whether a requested page surpasses a predetermined threshold, making it a candidate for compression.
  • a request made by a client to a server to download a particular page is intercepted by the proxy server (100).
  • the page metric for the particular page is calculated (102), and the metric is used by the system to decide whether the particular page should be compressed (104). If the metric is below the predetermined threshold, the page is sent to the client uncompressed (106). If the metric is above the predetermined threshold, the page is marked as a candidate for compression.
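  • Steps 102 to 106 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the class name, the threshold value and the assumption that statistics are reset at the end of each time window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-URL counters for one time window (window handling is assumed)."""
    request_count: int = 0
    total_payload_bytes: int = 0

    def record(self, payload_bytes: int) -> None:
        """Account for one intercepted response of the given size."""
        self.request_count += 1
        self.total_payload_bytes += payload_bytes

    @property
    def average_payload(self) -> float:
        if self.request_count == 0:
            return 0.0
        return self.total_payload_bytes / self.request_count

    @property
    def metric(self) -> float:
        # Product of request count and average payload size (step 102).
        return self.request_count * self.average_payload


COMPRESSION_THRESHOLD = 500_000  # hypothetical value


def is_compression_candidate(stats: PageStats,
                             threshold: float = COMPRESSION_THRESHOLD) -> bool:
    """Step 104: pages above the threshold become compression candidates."""
    return stats.metric > threshold


busy = PageStats()
for _ in range(10):
    busy.record(60_000)

quiet = PageStats()
for _ in range(2):
    quiet.record(1_000)
```

  • Note that the product of the request count and the average payload equals the total number of bytes served for the page during the window, so the metric favors pages that account for the most traffic.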
  • To compress the page, a dictionary must be chosen from multiple dictionaries available to the server.
  • One method of choosing a dictionary is to compress the page using all available dictionaries, and to choose the dictionary which provides the highest compression ratio for the page. In cases where a predefined minimum compression ratio is not achieved, a new dictionary is created.
  • compressing each page using all available dictionaries results in O(N^2) computational complexity, which uses computational resources inefficiently.
  • the SDCH compression mechanism is more efficient when a single dictionary is shared by multiple similar pages of a domain which share common elements. Therefore, when compressing a particular page, the proxy server identifies a cluster of similar pages in the domain and compresses the particular page using a dictionary corresponding to that cluster.
  • the present invention seeks to provide a more efficient compression mechanism by identifying similarities between pages using feature vectors.
  • a feature vector is an array of page characteristics which differentiate between groups of pages, such as statistical analysis of HTML tags, DOM tree, CSS Class names, HTML tag IDs, page size, elements size, links to external resources and page keywords.
  • a feature vector is created for a page to be compressed by parsing and tokenizing the page (108), and is then compared to existing page clusters (110).
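  • Step 108 might look like the sketch below, which uses Python's standard `html.parser` to tokenize a page and count a few of the characteristics named above (tags, CSS class names and tag IDs). The feature naming scheme (`tag:`, `class:`, `id:` prefixes) is an assumption made for illustration; the source does not specify an encoding.

```python
from collections import Counter
from html.parser import HTMLParser


class FeatureCollector(HTMLParser):
    """Counts tags, CSS class names and element IDs while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.features: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        self.features[f"tag:{tag}"] += 1
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. <input disabled>
                continue
            if name == "class":
                for class_name in value.split():
                    self.features[f"class:{class_name}"] += 1
            elif name == "id":
                self.features[f"id:{value}"] += 1


def feature_vector(html: str) -> Counter:
    """Parse and tokenize a page into a sparse feature vector."""
    collector = FeatureCollector()
    collector.feed(html)
    collector.close()
    return collector.features


example = feature_vector('<div class="a b" id="x"><p>hi</p><p>yo</p></div>')
```

  • A sparse mapping is used rather than a fixed-length array so that pages with different tag and class vocabularies can still be compared key by key.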
  • Clusters of similar pages are built using a similarity distance metric (e.g. Euclidean, Pearson).
  • Several algorithms may be used to compute the similarity distance metric, such as Hierarchical Clustering, K-Means and Fuzzy C-Means.
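  • The two metrics named above can be sketched over sparse feature vectors as follows. Treating a missing feature as zero, and defining the Pearson distance as one minus the correlation coefficient, are assumptions of this sketch; the source names the metrics but does not define them.

```python
import math
from typing import Mapping


def euclidean_distance(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Straight-line distance; features absent from a vector count as 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


def pearson_distance(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """1 - Pearson correlation: 0 for perfectly correlated, 2 for opposite."""
    keys = sorted(set(a) | set(b))
    if not keys:
        return 0.0
    xs = [a.get(k, 0) for k in keys]
    ys = [b.get(k, 0) for k in keys]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant vector; treat as uncorrelated.
        return 1.0
    return 1.0 - covariance / math.sqrt(var_x * var_y)
```

  • Euclidean distance is sensitive to page size (a longer page has larger counts), whereas the Pearson variant compares the shape of the feature profile, which may matter when pages of very different lengths share a template.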
  • the present invention also uses various heuristics (such as URL structure and page contents) to analyze the pages and to determine whether they have been created using common web content creation suites (for example, phpBB forum generating software). All pages sharing such commonalities are added to one cluster, obviating the need for any additional similarity related computations.
  • a dictionary comprises content which is shared among multiple pages in a cluster.
  • Content of a private nature, e.g. content which is accessed only by an authorized client, must not be shared. Therefore, when comparing pages of a cluster to determine which content elements should comprise the dictionary, content of a private nature must be identified and disregarded by the comparison process.
  • the following content elements are marked as potentially including content of a private nature:
  • a session identifier such as a cookie or a session parameter in the URL, such as JSESSIONID.
  • the received responses are compared to previously recorded responses received when a session identifier was used. If the responses do not match, i.e. alternative content was received when using a session identifier, the elements are considered to be private and are disregarded by the comparison process.
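  • The comparison described above can be sketched as follows. The set of session parameter names and the `fetch` callable are hypothetical; the sketch only handles session identifiers carried as query parameters, not cookies or path parameters such as `;jsessionid=...`.

```python
from typing import Callable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters treated as session identifiers.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid"}


def strip_session_params(url: str) -> str:
    """Return the URL with known session query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def looks_private(url: str, recorded_body: bytes,
                  fetch: Callable[[str], bytes]) -> bool:
    """Re-request the page without a session and compare with the recording.

    A mismatch means different content was served when the session
    identifier was present, so the content is treated as private.
    """
    anonymous_body = fetch(strip_session_params(url))
    return anonymous_body != recorded_body


def fake_fetch(url: str) -> bytes:
    """Stand-in for an HTTP client; always returns a login page."""
    return b"<html>Please log in</html>"
```

  • Passing the HTTP client in as a callable keeps the check testable without network access; a real proxy would substitute its own fetch routine.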
  • the remaining pages in the cluster are then first compared by identifying shared HTML tags, and then by identifying shared tag content using a generic differencing algorithm such as VCDiff (specified in RFC3284; http://www.rfc-editor.org/rfc/rfc3284.txt). Page elements which appear in a percentage of pages that is higher than a predefined percentage threshold are saved as the SDCH dictionary for the cluster.
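  • The percentage-threshold selection can be sketched as below. The sketch assumes each page has already been split into comparable elements and matches them by exact equality; it does not perform the VCDiff-based comparison of tag content, and the function name and default threshold are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def build_dictionary(pages: Sequence[Sequence[str]],
                     min_fraction: float = 0.5) -> List[str]:
    """Keep elements that occur in more than min_fraction of the pages."""
    if not pages:
        return []

    # Count each element once per page, however often it repeats there.
    page_counts: Counter = Counter()
    for elements in pages:
        page_counts.update(set(elements))

    cutoff = min_fraction * len(pages)

    # Emit qualifying elements in first-seen order for a stable result.
    selected: List[str] = []
    seen = set()
    for elements in pages:
        for element in elements:
            if element not in seen and page_counts[element] > cutoff:
                seen.add(element)
                selected.append(element)
    return selected


cluster_pages = [
    ["<nav>", "<h1>A</h1>", "<footer>"],
    ["<nav>", "<h1>B</h1>", "<footer>"],
    ["<nav>", "<h1>C</h1>"],
]
dictionary_elements = build_dictionary(cluster_pages, min_fraction=0.6)
```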
  • an SDCH dictionary may be applied to multiple pages, and a particular page may be compressed by multiple different dictionaries.
  • a page which has been marked as a candidate for compression is compared to existing page clusters (110) to determine whether the page shares any similarities with any of the existing clusters (112), and to thereby determine which dictionary is most relevant to be used for compressing the page. If the page is found not to share similarities with any existing clusters, or if the distance between the cluster found to be most similar to the page and the page, as calculated by the generic differencing algorithm, is greater than a predetermined threshold, the system creates a new cluster based on the page (114), and the page is sent uncompressed (116).
  • the system determines whether a dictionary exists for the cluster (118). If a dictionary for the cluster exists, the page is compressed using the dictionary (120), and the compressed page is sent (122). If a dictionary for the cluster does not exist and if the number of pages in the cluster exceeds a predetermined threshold (124), a dictionary is created for the cluster (126). If the number of pages in the cluster does not exceed the predetermined threshold, a dictionary is not created, and the page is sent uncompressed (128).
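  • The decision flow of steps 110 to 128 can be sketched as a single function. All names are hypothetical, the distance function is passed in rather than fixed, and one point is an assumption: the source does not say how the page is sent immediately after a dictionary is created (126), so the sketch sends it uncompressed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Vector = Dict[str, float]


@dataclass
class Cluster:
    centroid: Vector
    pages: List[str] = field(default_factory=list)
    dictionary: Optional[bytes] = None


def handle_candidate(url: str,
                     vector: Vector,
                     clusters: List[Cluster],
                     distance: Callable[[Vector, Vector], float],
                     max_distance: float,
                     min_pages: int,
                     make_dictionary: Callable[[Cluster], bytes]) -> str:
    """Return the action to take for a page marked as a candidate."""
    # Steps 110/112: find the most similar existing cluster, if any.
    nearest = min(clusters,
                  key=lambda c: distance(vector, c.centroid),
                  default=None)
    if nearest is None or distance(vector, nearest.centroid) > max_distance:
        clusters.append(Cluster(centroid=dict(vector), pages=[url]))  # 114
        return "send_uncompressed"                                    # 116

    if url not in nearest.pages:
        nearest.pages.append(url)

    if nearest.dictionary is not None:                                # 118
        return "compress_and_send"                                    # 120, 122

    if len(nearest.pages) > min_pages:                                # 124
        nearest.dictionary = make_dictionary(nearest)                 # 126
    # Assumption: the current page is sent uncompressed in both branches.
    return "send_uncompressed"                                        # 128


def toy_distance(a: Vector, b: Vector) -> float:
    """One-dimensional stand-in for a real similarity distance metric."""
    return abs(a["x"] - b["x"])


def toy_dictionary(cluster: Cluster) -> bytes:
    """Stand-in for real dictionary generation."""
    return b"dict"


clusters: List[Cluster] = []
first = handle_candidate("/a", {"x": 1.0}, clusters, toy_distance, 0.5, 1, toy_dictionary)
second = handle_candidate("/b", {"x": 1.1}, clusters, toy_distance, 0.5, 1, toy_dictionary)
third = handle_candidate("/c", {"x": 1.2}, clusters, toy_distance, 0.5, 1, toy_dictionary)
fourth = handle_candidate("/z", {"x": 9.0}, clusters, toy_distance, 0.5, 1, toy_dictionary)
```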
  • the software maintains a local cache of SDCH compressed pages in full compliance with the cache-related clauses of RFC2616 (http://www.rfc-editor.org/rfc/rfc2616.txt) between the web servers and the proxy, and between the proxy and the clients. It is appreciated that the system only compresses content targeted at SDCH enabled clients. SDCH enabled clients are identified by analyzing the Accept-Encoding header for presence of the "sdch" string.
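  • The client check can be sketched as follows. The source describes looking for the "sdch" string in the Accept-Encoding header; this sketch is slightly stricter in that it splits the header into comma-separated tokens and compares each one, which is an assumption. Quality values such as `q=0` are ignored.

```python
from typing import Optional


def client_accepts_sdch(accept_encoding: Optional[str]) -> bool:
    """Return True if the Accept-Encoding header value lists 'sdch'."""
    if not accept_encoding:
        return False
    for item in accept_encoding.split(","):
        # Drop any parameters such as ";q=0.8" and normalize case.
        token = item.split(";", 1)[0].strip().lower()
        if token == "sdch":
            return True
    return False
```

  • For example, `client_accepts_sdch("gzip, deflate, sdch")` is true, while a header of `"gzip, deflate"` or a missing header yields false.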

Abstract

The invention relates to a system for implementing the SDCH protocol, comprising: page metrics calculating and aggregating functionality operative to calculate and aggregate page metrics of a web page; compression decision functionality operative to utilize the page metrics to decide whether to compress the web page; feature vector generating functionality operative to generate a feature vector for the web page; web page comparison functionality operative to compare the feature vector of the web page to feature vectors of other web pages in order to identify similarities between the web page and the other web pages; web page cluster generation functionality operative to utilize the web page comparison functionality to generate clusters of similar web pages; web page cluster assignment functionality operative to utilize the web page comparison functionality to assign the web page to a cluster of similar web pages; dictionary generating functionality operative to generate and assign a dictionary for clusters of similar web pages; and web page compression functionality operative to compress the web pages assigned to the cluster using a dictionary assigned to the cluster.
PCT/IL2010/001023 2009-12-03 2010-12-02 Shared dictionary compression over HTTP proxy WO2011067769A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US26624709P 2009-12-03 2009-12-03
US61/266,247 2009-12-03

Publications (1)

Publication Number Publication Date
WO2011067769A1 2011-06-09

Family

ID=44114661

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2010/001023 WO2011067769A1 (en) 2009-12-03 2010-12-02 Shared dictionary compression over HTTP proxy

Country Status (1)

Country Link
WO (1) WO2011067769A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130114626A1 (en) * 2011-11-08 2013-05-09 Canon Kabushiki Kaisha Methods and network devices for communicating data packets
WO2013079999A1 (en) * 2011-12-02 2013-06-06 Canon Kabushiki Kaisha Methods and devices for encoding and decoding messages
US20150089052A1 (en) * 2012-05-04 2015-03-26 Qun Yang Lin Context-Aware HTTP Compression
US9973597B1 (en) 2014-12-10 2018-05-15 Amazon Technologies, Inc. Differential dictionary compression of network-accessible content
CN113194430A (en) * 2021-04-28 2021-07-30 杭州电力设备制造有限公司 Data compression method for switchgear sensor networks based on a periodic transmission model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7058843B2 (en) * 2001-01-16 2006-06-06 Infonet Services Corporation Method and apparatus for computer network analysis
US7246306B2 (en) * 2002-06-21 2007-07-17 Microsoft Corporation Web information presentation structure for web page authoring
US7343626B1 (en) * 2002-11-12 2008-03-11 Microsoft Corporation Automated detection of cross site scripting vulnerabilities
US20090064193A1 (en) * 2007-09-05 2009-03-05 Yahoo! Inc. Distributed Network Processing System including Selective Event Logging

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BUTLER ET AL.: "A Proposal for Shared Dictionary Compression over HTTP.", 8 September 2008 (2008-09-08), pages 1 - 5, 7, 9, 11, 14-16, Retrieved from the Internet <URL:http://old.nabble.com/attachmenU19381291/0/Shared_Dictionary_Compression_over_HTTP.pdf> [retrieved on 20110218] *
STREHL ET AL.: "Impact of Similarity Measures on Web-page Clustering", AAAI-2000: WORKSHOP OF ARTIFICIAL INTELLIGENCE FOR WEB SEARCH, July 2000 (2000-07-01), pages 1 - 2, Retrieved from the Internet <URL:http://www.ideal.ece.utexas.edu/papers/strehl_aaai00.pdf> [retrieved on 20110219] *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130114626A1 (en) * 2011-11-08 2013-05-09 Canon Kabushiki Kaisha Methods and network devices for communicating data packets
US9338258B2 (en) * 2011-11-08 2016-05-10 Canon Kabushiki Kaisha Methods and network devices for communicating data packets
WO2013079999A1 (en) * 2011-12-02 2013-06-06 Canon Kabushiki Kaisha Methods and devices for encoding and decoding messages
WO2013079277A1 (en) * 2011-12-02 2013-06-06 Canon Kabushiki Kaisha Methods and devices for encoding and decoding messages
US10051090B2 (en) 2011-12-02 2018-08-14 Canon Kabushiki Kaisha Methods and devices for encoding and decoding messages
US20150089052A1 (en) * 2012-05-04 2015-03-26 Qun Yang Lin Context-Aware HTTP Compression
US9973597B1 (en) 2014-12-10 2018-05-15 Amazon Technologies, Inc. Differential dictionary compression of network-accessible content
CN113194430A (en) * 2021-04-28 2021-07-30 杭州电力设备制造有限公司 Data compression method for switchgear sensor networks based on a periodic transmission model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10834310

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10834310

Country of ref document: EP

Kind code of ref document: A1