CN109981659A - Network resource prefetching method and system based on data deduplication technology - Google Patents

Network resource prefetching method and system based on data deduplication technology

Info

Publication number
CN109981659A
Authority
CN
China
Prior art keywords
resource
module
request
data
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910251873.2A
Other languages
Chinese (zh)
Other versions
CN109981659B (en
Inventor
姚瑶
王战红
丁颖
王会霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Institute of Technology
Original Assignee
Zhengzhou Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Institute of Technology filed Critical Zhengzhou Institute of Technology
Priority to CN201910251873.2A priority Critical patent/CN109981659B/en
Publication of CN109981659A publication Critical patent/CN109981659A/en
Application granted granted Critical
Publication of CN109981659B publication Critical patent/CN109981659B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/5681 Pre-fetching or pre-delivering data based on network characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention discloses a network resource prefetching method and system based on data deduplication technology. The method comprises the following steps: when a client sends an access request, a proxy server records the user's network access behavior and extracts an access log; Web mining and analysis of the log extract network behavior features and yield access rules; a prediction engine uses a prediction algorithm to determine in advance the network resources most likely to be accessed next and prefetches them into the cache; because the cache size is limited, resources are stored in the cache only after data deduplication processing. While maintaining prefetching efficiency, the invention saves storage space, improves the data transfer rate, reduces network delay, relieves traffic pressure during peak access periods, saves bandwidth, and improves system availability.

Description

Network resource prefetching method and system based on data deduplication technology
Technical field
The present invention relates to the field of network technology, and in particular to a network resource prefetching method and system based on data deduplication technology.
Background
With the rapid growth of Internet information and users, improving network service quality and accelerating the WWW has become an urgent problem. Web caching, Web prefetching, and data deduplication can all effectively reduce network delay. Web caching is based on the principle of temporal locality: using an efficient replacement algorithm, it caches in advance the resources a user may access, and it is applied in network environments such as proxy servers, P2P networks, and mobile networks, but its benefit is limited by the hit rate. Web prefetching attempts to fetch resources proactively before the user issues a request, which improves the hit rate to some extent and reduces access delay; however, because prefetching is a speculative mechanism, it can increase bandwidth consumption. It must therefore be controlled carefully, otherwise it greatly degrades performance and defeats its original purpose. Data deduplication aims to reduce occupied space by detecting and removing duplicate data. It has been found that two versions of a resource referencing the same keyword contain 55% duplicate, redundant data, and that the figure reaches 87% when the reference source is academic. Exploiting the information redundancy between data objects yields far better space utilization than conventional compression and incremental backup methods, reduces the number of bytes transmitted, frees part of the occupied bandwidth, and reduces network delay. Combining Web prefetching with data deduplication would therefore be of great significance for effectively reducing network delay.
Web prefetching to reduce perceived delay was first applied in the Mozilla Firefox browser and was later adopted by the Google search engine. The Google Web Accelerator software works together with Firefox or Internet Explorer to implement browser-based prefetching. Commercial software related to prefetching includes Robtex Viking Server and AllegroSurf, but their concrete prefetching schemes have not been disclosed. Because prefetching is speculative, it sometimes consumes additional bandwidth, so Web prefetching must be used with caution. It is precisely this potential bandwidth limitation that makes the commercial prospects of Web prefetching less than optimistic.
Summary of the invention
In view of the deficiencies and problems of the prior art, the present invention provides a network resource prefetching method and system based on data deduplication technology, which improves prefetching efficiency and reduces network delay by reducing the redundancy of transmitted data.
The technical solution adopted by the present invention is a network resource prefetching method based on data deduplication technology, comprising the following steps:
First, a proxy server is connected between the client and the server; when the client sends an access request, the proxy server records the user's network access behavior information and extracts an access log;
Second, the proxy server performs Web mining and analysis on the network access log, extracts user behavior features, and obtains network access rules. Mining the user's access preferences from the access log to extract network access behavior features comprises: cleaning and preprocessing the access log, removing records of failed accesses and non-cacheable objects from the log file, and extracting the user's browsing features from the preprocessed access sequence;
Meanwhile, a prediction engine uses a prediction algorithm to determine in advance the network resources the user is most likely to access next and prefetches them into the cache. Each time a resource is requested, the prediction engine predicts the pages likely to be accessed next: following the prediction algorithm, it generates a list of URLs of the most frequently accessed recent resources and puts the result into a decision database. User browsing features can be described accurately by a Markov chain model; the user's browsing behavior is modeled with a Markov tree, and a prediction algorithm based on access probability predicts the access request the user is most likely to issue next;
Finally, the prefetched resources are stored in the cache after data deduplication processing, wherein deduplicating a prefetched resource comprises the following steps:
The client-side data deduplication module (CDM) runs in the client browser; it stores the most recent versions of network resources and uses unique identifiers to indicate to the server-side SDM module which resource versions it holds;
The server-side data deduplication module (SDM) assembles the final response. When the SDM receives a request for a given resource, it reads the custom request header, sent by the CDM, that lists the identifiers of the reference resources, and then fetches the resource from the server. After receiving the complete response headers and data, the SDM assigns the resource a new identifier, divides the resource data into blocks, and stores the block meta-information in a data store. In this data store, the SDM keeps the meta-information of all blocks of all resource versions, indexed by block hash;
After the CDM receives the response, it reconstructs the original resource from the data received, by copying the referenced blocks from locally cached resources and copying the non-redundant data content of the response.
Further, the network access behavior information in the log file includes the access time of the user's request, the user's IP address, the file name or script of the accessed resource, and parameter fields.
Further, the SDM divides resource data into blocks using the LBFS algorithm, specifically:
When the client issues a request, the server divides the resource into several indexed chunks;
First, hashes are computed over the bytes of the resource content using a rolling (sliding) hash function:
Let $c_i$ be the $i$-th byte of the resource stream, $k$ the length of a Karp-Rabin block, and $b$ the base. The hash of a Karp-Rabin block is:
$$H(c_i \ldots c_{i+k-1}) = c_i \times b^{k-1} + c_{i+1} \times b^{k-2} + \cdots + c_{i+k-2} \times b + c_{i+k-1}$$
Since $b$ is a constant, the recursive nature of the function allows the hash of the block starting at the next byte to be computed as:
$$H(c_{i+1} \ldots c_{i+k}) = \left(H(c_i \ldots c_{i+k-1}) - c_i \times b^{k-1}\right) \times b + c_{i+k}$$
Then the hashes that represent the resource are selected: minimum and maximum chunk sizes are enforced, and once the chunk boundaries have been selected, the resulting larger chunks are hashed with the 64-bit MurmurHash function.
A network resource prefetching system based on data deduplication technology, in which a simulation system framework is added on the path between the user and the Web server. The framework comprises a client and a proxy server; the client simulates the user behavior of a prefetching-capable client browser; the client is connected to the proxy server, and the proxy server is connected to the Web server;
The client comprises six modules and two stores:
Read path module: reads the user's request sequence; its data structure is a first-in, first-out queue;
Prefetch management module: reads the recorded access queue and checks the prefetched object pool to determine whether the resource has already been prefetched; if not, it sends the request to the server. The prefetch management module creates multiple user request threads and waits for new requests. When a resource response is received from the server, the module checks whether its URL is in the prefetch queue; if so, it removes the URL and inserts the resource into the prefetched object pool. The module also checks whether the request queue is empty; if it is, prefetch requests may be sent to the server until a new user request arrives. When a new client request is inserted into the queue, the module deletes the hinted resources that have already been prefetched and empties the prefetch hint queue;
User request module: receives requests from the prefetch management module and passes them to the request module. When a resource response is received from the server, the user request module inserts the prefetch hints carried in the response headers into the prefetch queue and inserts the URL of the resource into the prefetched object pool;
Request module: connects to the Web server and handles low-level communication;
CDM module: the client data deduplication module intercepts the HTTP request messages sent by the user request module or by the prefetch management module, queries the resource version numbers to obtain the identifiers of all resources in the client cache, and notifies the communication interception module;
Communication interception module: adds the custom header "X-vrs" to the request, attaches it to the HTTP request headers, and passes the request to the request module, which finally forwards it to the server;
Prefetch queue: holds the information about objects that the server has told the client to prefetch;
Prefetched object pool: stores all prefetched objects and acts like a browser cache;
The proxy server comprises:
Listening module: listens on a given port and maintains a queue of threads waiting for client connections;
Server connection module: handles the connection between the client and the server and passes resource data from the server to the client;
Communication interception module: intercepts the original HTTP response from the server, stores it in a temporary buffer, and preprocesses it;
SDM module: the server data deduplication module performs the data splitting process on the message body delivered by the communication interception module;
Communication recombination module: prepares and sends the deduplicated version of the response message; it recomposes the response headers by updating or creating Content-Length with the length of the new message body and adding two new headers, the resource version identifier and the metadata length;
Prediction engine module: each time a resource is requested, predicts the pages likely to be accessed next; following the prediction algorithm, it generates a list of URLs of the most frequently accessed recent resources and puts the result into a decision database.
Beneficial effects of the present invention: to address the bandwidth limitation of current Web prefetching, the present invention provides an improved Web prefetching system that improves prefetching efficiency and reduces network delay by reducing the redundancy of transmitted data.
In one aspect, the present invention provides a network resource prefetching method. A log of Web requests is obtained on a Squid proxy server and analyzed to obtain access paths, which are input to the client module to simulate the user's access behavior. The client periodically sends requests to the server side at a specified time interval; the server connection module receives the request messages in order and passes them to the server; when the HTTP response headers arrive, the response message is intercepted, analyzed, and passed to the prediction engine module. Using a popularity-based prediction algorithm, the prediction engine module computes the pages accessed most frequently in the most recent period and puts the result into the decision database; at the same time, once all response headers have been received, it triggers an update of the state store in preparation for the next prediction. Finally, the server connection module forwards the complete response data to the client.
In another aspect, the present invention provides a network resource prefetching system that analyzes and deduplicates resources before prefetched downloads are placed in the cache. A communication interception module, an SDM module, and a communication recombination module are added at the proxy server, and a resource version module, a CDM module, and a communication interception module are added at the client. The system can respond quickly to users' network access requests, improve network service quality, reduce access delay, effectively save bandwidth, and improve system availability and prefetching efficiency.
Brief description of the drawings
Fig. 1 is an application scenario diagram of the Web page prefetching method according to an embodiment of the present invention.
Fig. 2 is a flowchart of the Web page prefetching system according to an embodiment of the present invention.
Fig. 3 is a diagram of the client modules according to an embodiment of the present invention.
Fig. 4 is a diagram of the proxy server modules of the prefetching system according to an embodiment of the present invention.
Fig. 5 is a structure diagram of data deduplication.
Fig. 6 is a schematic diagram of the algorithm of the SDM module according to an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further explained below with reference to the drawings and embodiments.
Embodiment 1: A network resource prefetching system based on data deduplication technology, in which a simulation system framework is added on the path between the user and the Web server. The framework comprises a client and a proxy server; the client simulates the user behavior of a prefetching-capable client browser; the client is connected to the proxy server, and the proxy server is connected to the Web server.
Fig. 1 shows the application scenario of the Web page prefetching system according to an embodiment of the present invention. In Fig. 1, client 101 sends proxy server 102 a request to access a Web server resource. When the main process of proxy server 102 detects a request from client A, it creates a child process to handle that request while the main process continues listening. The child process establishes a connection with client 101, reads and parses the client request, and checks the request against the access rule list preset on the proxy server. If the request satisfies the rules, the proxy server looks up the required information in its cache and handles subsequent requests for that information locally. The client thus obtains the desired resource quickly, and bandwidth is saved.
As shown in Fig. 3, the client comprises six modules and two stores. The modules are the read path module 31, prefetch management module 32, user request module 33, CDM module 34, communication interception module 35, and request module 36; the stores are the prefetch queue 37 and the prefetched object pool 38.
The read path module 31 records the user's access behavior, including the user's IP address or host name, the time the request was issued, the request method (GET, POST, etc.), the path (URL) of the accessed page, the status code returned by the server, and the number of bytes sent in the response.
The prefetch management module 32 selects a suitable prediction algorithm and predicts the user requests that need to be prefetched. It is also responsible for checking the prefetched object pool to determine whether a requested resource has already been prefetched.
The user request module 33 receives requests from the prefetch management module 32 and passes them to the request module.
The CDM module 34 performs client-side data deduplication: it intercepts the HTTP request messages sent by the user request module, queries the resource version numbers to obtain the identifiers of all resources in the client cache, and finally notifies the communication interception module 35.
The communication interception module 35 adds the custom header "X-vrs" to the request and passes it to the request module, which finally forwards it to the server.
The request module 36 connects to the Web server and handles low-level communication.
The prefetch queue 37 holds the information about objects that the server has told the client to prefetch.
The prefetched object pool 38 stores all prefetched objects and acts like a browser cache.
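By way of illustration only, the interplay between the prefetch queue 37 and the prefetched object pool 38 might be sketched as follows. The class `PrefetchManager`, its method names, and the `fetch` callable (standing in for the request module) are hypothetical and are not taken from the disclosure; the sketch also simplifies the behavior on a new user request to emptying the hint queue.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class PrefetchManager:
    """Hypothetical sketch of a prefetch queue plus prefetched object pool."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch                   # stands in for the request module
        self.queue: Deque[str] = deque()     # prefetch queue: URLs hinted by the server
        self.pool: Dict[str, bytes] = {}     # prefetched object pool: URL -> body

    def hint(self, url: str) -> None:
        """Record a server hint unless the object is already pooled or queued."""
        if url not in self.pool and url not in self.queue:
            self.queue.append(url)

    def prefetch_idle(self) -> None:
        """Issue one prefetch; meant to be called while no user request is pending."""
        if self.queue:
            url = self.queue.popleft()
            self.pool[url] = self.fetch(url)

    def request(self, url: str) -> Tuple[bytes, bool]:
        """Serve a user request; returns (body, served_from_pool)."""
        self.queue.clear()                   # a new user request discards pending hints
        if url in self.pool:
            return self.pool[url], True
        return self.fetch(url), False
```

A user request that hits the pool costs no network transfer, which is the effect the object pool is meant to provide.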
As shown in Fig. 4, the proxy server connects the server and the client and comprises a listening module 41, server connection module 42, request recombination module 43, SDM module 44, communication interception module 45, prediction engine module 46, and state update store 47.
Listening module 41: listens on a given port and maintains a queue of threads waiting for client connections.
Server connection module 42: handles the direct connection between the client and the server. When handling a response, it receives the original HTTP response headers first and passes them to the client, and then passes all resource data from the server to the client unchanged.
Communication interception module 45: intercepts the original HTTP response from the server, stores it in a temporary buffer, and preprocesses it.
SDM module 44: the server data deduplication module performs the data splitting process on the message body delivered by the communication interception module. The splitting process consists of two steps: block division and data deduplication.
Request recombination module 43: prepares and sends the deduplicated version of the response message. It recomposes the response headers by updating or creating Content-Length with the length of the new message body and adding two new headers, X-vrs (resource version identifier) and X-mtd (metadata length).
Prediction engine module 46: its main task is to predict, each time a resource is requested, the pages likely to be accessed next. Following the prediction algorithm, the module generates a list of URLs of the most frequently accessed recent resources and puts the result into the decision database.
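As a non-limiting sketch, a popularity-based prediction of the kind described for the prediction engine module could count accesses over a sliding window of recent requests and return the most frequent URLs. The class name `PopularityPredictor` and the parameters `window` and `top_n` are assumptions introduced for illustration; the disclosure does not specify them.

```python
from collections import Counter, deque
from typing import Deque, List


class PopularityPredictor:
    """Hypothetical sketch: predict the most frequently accessed recent URLs."""

    def __init__(self, window: int = 1000, top_n: int = 3) -> None:
        # Only the last `window` requests count as "recent" (assumed policy).
        self.recent: Deque[str] = deque(maxlen=window)
        self.top_n = top_n

    def observe(self, url: str) -> None:
        """Record one requested URL."""
        self.recent.append(url)

    def predict(self) -> List[str]:
        """Return up to top_n URLs ordered by access count within the window."""
        counts = Counter(self.recent)
        return [url for url, _ in counts.most_common(self.top_n)]
```

The returned list plays the role of the result placed in the decision database; how that database is stored is left open here.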
Embodiment 2:
A network resource prefetching method based on data deduplication technology, comprising the following steps:
First, a proxy server is connected between the client and the server; when the client sends an access request, the proxy server records the user's network access behavior information and extracts an access log. The network access behavior information in the log file mainly includes the access time of the user's request, the IP address, the file name or script of the accessed resource, and parameter fields.
Second, the proxy server performs Web mining and analysis on the network access log, extracts user behavior features, and obtains network access rules. Mining the user's access preferences from the access log to extract network access behavior features comprises: cleaning and preprocessing the access log, removing records of failed accesses and non-cacheable objects from the log file, and extracting the user's browsing features from the preprocessed access sequence.
Meanwhile it being provided by prediction engine using the network that prediction algorithm analyzes the access of future time user's most probable in advance Source, and be prefetched in caching;The prediction engine is the page for predicting access when resource each time is requested Face will generate a series of URL in the nearest accessed highest resource of frequency according to prediction algorithm, and result be put into certainly In plan database;User behavior characteristics can accurately describe user by Markov chain model and browse feature, utilize Markov Tree models browsing behavior of the user to webpage, and the prediction algorithm for being taken based on access probability predicts future time user and most may be used The access request that can be issued.
Finally, the resource for prefetching in the buffer is stored in the buffer after data deduplication technical treatment;Wherein, to pre- The resource that takes carries out the step of data deduplication processing and includes:
The data de-duplication module CDM of client operates in client browser, to store nearest newest network Resource, and the SDM module positioned at server end how is corresponded to according to unique identifier instruction respective resources.
The data de-duplication model SDM of server end is to combine the data block finally responded, when SDM receives one The request of given resource, by a customized request header of the CDM reference resource identifier sent, then it is retrieved SDM takes out the resource from server, and after receiving the header sufficiently responded to and data, SDM gives resource allocation one new mark Know symbol, resource data is divided into block, the metamessage of block is stored in data storage file;The SDM in this data repository It ensure that all pieces of all versions by the metamessage resource of the hash index of block.
After CDM receives response, for all original resources of data reconstruction, including the copy block from local cache resource Reference information and duplication receive the Non-redundant data content of response.
As shown in Fig. 1, client 101 sends proxy server 102 a request to access a Web server resource. When the main process of proxy server 102 detects a request from client A, it creates a child process to handle that request while the main process continues listening. The child process establishes a connection with client 101, reads and parses the client request, and checks the request against the access rule list preset on the proxy server. If the request satisfies the rules, the proxy server looks up the required information in its cache and handles subsequent requests for that information locally. The client thus obtains the desired resource quickly, and bandwidth is saved.
In the flowchart of the Web page prefetching system shown in Fig. 2, in step 201 the proxy server records the user's browsing log and monitors the user's browsing behavior.
The user's browsing behavior refers to the static history of the user's accesses; these records are stored in log files on the Web server or proxy server. Each record contains the following information about one request: the user's IP address or host name, the time the request was issued, the request method (GET, POST, etc.), the path (URL) of the accessed page, the status code returned by the server, and the number of bytes sent in the response. Web log files can easily be obtained from the Web server or proxy server.
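The record fields listed above, together with the cleaning step that removes failed accesses and non-cacheable objects, can be illustrated with a minimal sketch. The log layout assumed here is a Common Log Format style line; the disclosure does not fix a format, and the cacheability heuristics (GET only, no query string, no `/cgi-bin/` path) as well as the names `LogRecord`, `parse_log_line`, and `clean_log` are assumptions made for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout: host ident user [time] "METHOD url protocol" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


class LogRecord(NamedTuple):
    host: str
    time: str
    method: str
    url: str
    status: int
    size: int


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None when the line does not match."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return None
    size = 0 if m["size"] == "-" else int(m["size"])
    return LogRecord(m["host"], m["time"], m["method"], m["url"],
                     int(m["status"]), size)


def is_cacheable(rec: LogRecord) -> bool:
    """Assumed heuristic: only GET requests for static-looking URLs."""
    return rec.method == "GET" and "?" not in rec.url and "/cgi-bin/" not in rec.url


def clean_log(lines: Iterable[str]) -> List[LogRecord]:
    """Drop unparsable lines, failed accesses (non 2xx/3xx), and non-cacheable objects."""
    records = (parse_log_line(line) for line in lines)
    return [r for r in records
            if r is not None and 200 <= r.status < 400 and is_cacheable(r)]
```

The cleaned records, grouped per user and ordered by time, would form the access sequences used in the mining step.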
In step 202, the proxy server mines the user's preferred browsing paths through the prefetch management module.
As an example, suppose the network behavior of users at a university concentrates, during the working peak of 8:00-9:30 in the morning, on portal sites such as the campus network home page, Sina, and Sohu; the log recorded by the proxy server captures this behavior.
The path analyzer in the prefetch management module at the proxy server mines user access paths: by analyzing access sequences it builds a Markov prediction tree and from it generates a transition probability matrix. The prediction method based on the Markov chain model obtains in advance the transition probability matrix and the initial state probability vector from the model. All client requests are buffered in a client buffer and flushed from the buffer once the minimum sample threshold is exceeded or the session ends. Each client is assigned a separate buffer in which its request sequence is stored. The Markov chain model is updated as user access paths change; the update mainly smooths the default or current values of the matrix according to the newly added path sequences.
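To make the transition probability matrix concrete, the following minimal sketch estimates first-order Markov transition probabilities from per-client request sequences and predicts the most probable next URL. It is a simplified first-order chain rather than the full Markov prediction tree with smoothing described above, and the function names and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def build_transitions(sessions: Sequence[Sequence[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next | current) from observed request sequences."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    probs: Dict[str, Dict[str, float]] = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        probs[cur] = {url: c / total for url, c in nxts.items()}
    return probs


def predict_next(probs: Dict[str, Dict[str, float]], current: str,
                 threshold: float = 0.0) -> Optional[str]:
    """Return the most probable next URL, or None if unknown or below threshold."""
    candidates = probs.get(current)
    if not candidates:
        return None
    url, p = max(candidates.items(), key=lambda kv: kv[1])
    return url if p >= threshold else None
```

For example, with the sessions `["/", "/news", "/sport"]`, `["/", "/news"]`, and `["/", "/mail"]`, the estimated probability of moving from `/` to `/news` is 2/3, so `/news` would be the prefetch candidate after a request for `/`.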
In step 203, the prediction engine uses the prediction algorithm to predict the websites most likely to be accessed next, and downloads them into the cache in advance while the network is idle.
As an example, suppose the user's access pattern is to visit the NetEase website every morning between 8:00 and 9:30. Based on this pattern, the network proxy cache device can analyze the network state ahead of the next occurrence of that period (8:00-9:30) and use the prediction algorithm to predict the websites most likely to be accessed. When the network is idle, the relevant resources can be prefetched from the NetEase website and stored in the cache. When the user issues an access request, the requested resource is served directly from the cache, which avoids adding load at peak times, saves network bandwidth, and maintains service quality.
In step 204, the data deduplication system used is derived from the dedupHTTP system. The core task of a data deduplication system is to analyze file content; its basic unit is an abstract data object called a block (chunk). The main work of the system consists of five stages: block division, feature value computation, detection of identical or similar blocks, redundancy elimination, and data storage. The deduplication system built in this embodiment consists of two modules: the client-side data deduplication module (CDM) and the server-side data deduplication module (SDM). The CDM and SDM run in the client browser and on the Web server, respectively. The CDM stores the most recent resource versions and uses unique identifiers to tell the SDM which resource versions it holds.
About data deduplication, as shown in the data deduplication structure chart of Fig. 5, when SDM receives asking for a given resource It asks, it retrieves a customized request header by the CDM reference source identifier sent.Then SDM takes out from server The resource.After receiving the header sufficiently responded to and data, SDM gives resource allocation one new identifier.Resource data is drawn It is divided into block, the metamessage of block is stored in data storage file.SDM ensure that the hash index of block in this data repository All pieces of all versions of metamessage resource.SDM traverses all pieces of resource.For each resource block, it will have in reference source There is same Hash block.If one piece can search for without matched reference source, SDM in current resource response block.Therefore, redundancy is examined It surveys and is not only executed in resource in the cache of CDM, but also also executed when resource is delivered CDM.The final response money of SDM combination Issue CDM in source.Response is by metadata section.The size (byte) of metadata is stored in a customized http response header In.The content of metadata is since the resource identifier of response.Four-tuple is contained in each reference source, each tuple includes Information be required information that CDM can find each redundant block in the buffer.The field for specifically including currently responds Offset Offset, resource identifier Rescource identifier, the length of offset InOffset and block inside resource Degree.Array is arranged according to the sequence of the relevant block in original response.In the ending of meta data block, affix Non-redundant data Hold, sequentially stills remain in the sequence in former response.Due to the sequence maintenance to the tuple and Non-redundant data, CDM only needs to count First offset of group just may know how in Non-redundant data affix redundant data.After CDM receives response, it is by being The 
original resource of all data reconstructions.Response is received including the copy block reference information from local cache resource and duplication Non-redundant data content.CDM does not store any one piece of metamessage, also no need to send or storage Hash.
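The response layout and the CDM reconstruction described above can be illustrated with a minimal Python sketch. The names `BlockRef`, `encode`, and `reconstruct` are hypothetical and do not come from the patent; the lookup table is keyed by raw block bytes for brevity (an SDM as described would key it by block hash), and the search for redundancy inside the current response is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockRef:
    """One four-tuple of the metadata section (field names are illustrative)."""
    offset: int        # position of the block in the original response
    resource_id: str   # cached resource that already holds the block
    in_offset: int     # position of the block inside that cached resource
    length: int        # block length in bytes


def encode(resource: bytes, chunks, reference_index):
    """SDM side: split a response into block references and non-redundant bytes.

    chunks: ordered (start, end) pairs covering `resource`.
    reference_index: dict mapping block bytes -> (resource_id, in_offset);
        keyed by raw bytes here only to keep the sketch short.
    """
    refs, fresh = [], bytearray()
    for start, end in chunks:
        block = resource[start:end]
        hit = reference_index.get(block)
        if hit is not None:
            refs.append(BlockRef(start, hit[0], hit[1], end - start))
        else:
            fresh += block  # non-redundant data keeps its original order
    return refs, bytes(fresh)


def reconstruct(refs, fresh: bytes, cache) -> bytes:
    """CDM side: rebuild the resource from cached blocks and new bytes."""
    out = bytearray()
    pos = 0  # read position in the non-redundant data
    for ref in refs:  # refs are ordered by offset in the original response
        gap = ref.offset - len(out)  # new bytes that precede this block
        out += fresh[pos:pos + gap]
        pos += gap
        cached = cache[ref.resource_id]
        out += cached[ref.in_offset:ref.in_offset + ref.length]
    out += fresh[pos:]  # trailing non-redundant bytes
    return bytes(out)
```

Because both the tuples and the non-redundant bytes keep their original order, `reconstruct` needs only the first field of each tuple to decide how many new bytes to copy before each cached block, which mirrors the ordering argument in the text.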
The algorithm principle of the SDM module in this embodiment is shown in Fig. 6. The data deduplication system is based on a content-addressed storage system: duplicate data segments are deleted by comparing and analyzing the similarity and identity of the data, which saves space. The key is a reasonable choice of data partitioning method.
In the data deduplication system adopted here, the SDM divides each resource version that the server returns to the client into indexed data chunks. The chunking algorithm used is the LBFS algorithm.
Step 1: When the client issues a request, the server divides the resource into several indexed chunks and determines which parts of the resource the client already has and which parts are new and need to be sent.
Starting from the bytes, hashes of the resource content are created. These hashes represent small chunks of the resource. This is implemented with a rolling hash function.
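$c_i$ is the $i$-th byte of the resource stream, $k$ is the length of a Karp-Rabin block, and $b$ is the radix of the system. The hash of a Karp-Rabin block is as follows:
$$H(c_i \cdots c_{i+k}) = c_i \times b^{k-1} + c_{i+1} \times b^{k-2} + \cdots + c_{k-1} \times b + c_k$$
$b$ is a constant, and the incremental nature of the function allows the hash at the next byte to be calculated from the previous one, as follows:
$$H(c_{i+1} \cdots c_{i+1+k}) = \left(\left(H(c_i \cdots c_{i+k}) - c_i \times b^{k}\right) + c_{k+1}\right) \times b$$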
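A rolling hash of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: it uses the common textbook form of the Karp-Rabin update, in which the window covers the $k$ bytes $c_i \ldots c_{i+k-1}$, so its index bookkeeping differs from the formulas as printed; the radix `b = 257` and the modulus are arbitrary choices that the text does not specify.

```python
def window_hash(data: bytes, i: int, k: int, b: int, mod: int) -> int:
    """Hash the k bytes starting at position i directly (Horner's rule)."""
    h = 0
    for byte in data[i:i + k]:
        h = (h * b + byte) % mod
    return h


def rolling_hashes(data: bytes, k: int, b: int = 257,
                   mod: int = (1 << 61) - 1):
    """Yield (i, hash of data[i:i+k]) for every window, updated incrementally."""
    if len(data) < k:
        return
    top = pow(b, k - 1, mod)  # weight of the byte that leaves the window
    h = window_hash(data, 0, k, b, mod)
    yield 0, h
    for i in range(1, len(data) - k + 1):
        # Drop data[i-1], shift by one radix position, append data[i+k-1].
        h = ((h - data[i - 1] * top) * b + data[i + k - 1]) % mod
        yield i, h
```

Each step costs a constant number of arithmetic operations regardless of `k`, which is what makes it practical to hash every byte position of a resource.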
Step 2: Select the hashes that represent the resource. Although all hashes could be selected, this is impractical in terms of memory, because there is one hash for every byte of content. Suitable hashes are therefore selected as the boundaries of larger data blocks.
In this step, a windowing mechanism is chosen, which has proven in practice to provide better redundancy detection. Minimum and maximum chunk sizes are enforced, because content-based boundary selection can otherwise produce chunks that are much larger or much smaller than the expected chunk size.
After the chunk boundaries are selected, the larger chunks are hashed with the 64-bit MurmurHash method. This method keeps a low collision rate and is preferred here over cryptographic hashes such as MD5 or SHA1.
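Step 2 can be illustrated with the following Python sketch of content-defined chunking with enforced minimum and maximum chunk sizes. All parameter values (`window`, `mask`, `min_size`, `max_size`, the radix and modulus) are assumptions chosen for illustration, and the boundary test "low bits of the rolling hash are zero" is one common convention rather than the rule stated in the patent. Because the Python standard library has no MurmurHash, `chunk_digest` substitutes an 8-byte BLAKE2b digest as a stand-in for the 64-bit chunk hash.

```python
import hashlib


def split_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                 min_size: int = 32, max_size: int = 256,
                 b: int = 257, mod: int = (1 << 61) - 1):
    """Return ordered (start, end) pairs covering data with content-defined chunks."""
    top = pow(b, window - 1, mod)
    chunks, start, h = [], 0, 0
    for p, byte in enumerate(data):
        if p >= window:
            h = (h - data[p - window] * top) % mod  # drop the oldest byte
        h = (h * b + byte) % mod                    # append the newest byte
        size = p + 1 - start
        at_boundary = p + 1 >= window and (h & mask) == 0
        # Cut at a content boundary once the chunk is large enough,
        # or unconditionally when it reaches the maximum size.
        if (at_boundary and size >= min_size) or size >= max_size:
            chunks.append((start, p + 1))
            start = p + 1
    if start < len(data):
        chunks.append((start, len(data)))  # final, possibly short, chunk
    return chunks


def chunk_digest(block: bytes) -> bytes:
    """64-bit digest of a chunk (BLAKE2b stand-in for MurmurHash)."""
    return hashlib.blake2b(block, digest_size=8).digest()
```

The rolling hash runs over the whole stream and is not reset at chunk boundaries, so a boundary depends only on the preceding `window` bytes; this is what lets identical content produce identical chunks even when it appears at different offsets.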

Claims (4)

1. A network resource prefetching method based on data deduplication technology, characterized in that the method comprises the following steps:
First, a proxy server end is connected between the client and the server; while the client sends an access request, the proxy server records the user's network access behavior information and extracts the access log;
Second, the proxy server performs Web mining and analysis on the network access log, extracting user behavior characteristics to obtain network access rules; the step of mining the user's access preferences from the access log to extract the user's network access behavior characteristics comprises: performing data cleaning preprocessing on the access log, removing records of failed accesses and non-cacheable objects from the log file, and extracting the user's browsing features from the preprocessed network access sequence;
Meanwhile, a prediction engine uses a prediction algorithm to analyze in advance the network resources that the user is most likely to access at the next moment, and prefetches them into the cache; each time a resource is requested, the prediction engine predicts the pages that are likely to be accessed, generates, according to the prediction algorithm, a series of URLs of the most frequently accessed recent resources, and puts the result into the decision database; the user behavior characteristics are described accurately as browsing features by a Markov chain model, the user's web browsing behavior is modeled with a Markov tree, and a prediction algorithm based on access probability is used to predict the access request that the user is most likely to issue at the next moment;
Finally, the resources prefetched into the cache are stored in the cache after data deduplication processing; wherein the step of performing data deduplication processing on the prefetched resources comprises:
The client-side data deduplication module CDM runs in the client browser, stores the most recent network resources, and indicates to the SDM module at the server end, by unique identifier, the corresponding resources it holds;
The server-side data deduplication module SDM assembles the data blocks of the final response: when the SDM receives a request for a given resource, it reads a custom request header, sent by the CDM, that carries the identifiers of the reference resources; the SDM then fetches the resource from the server; after receiving the complete response headers and data, the SDM assigns the resource a new identifier, divides the resource data into blocks, and stores the block meta-information in a data store; in this data store, the SDM maintains the meta-information of all blocks of all resource versions, indexed by block hash;
After the CDM receives the response, it reconstructs the original resource from all the data, by copying the referenced blocks from locally cached resources and copying the non-redundant data content of the received response.
2. The network resource prefetching method based on data deduplication technology according to claim 1, characterized in that: the user network access behavior information in the log file includes the access time of the user's access request, the IP address, the file name or script of the accessed resource, and parameter fields.
3. The network resource prefetching method based on data deduplication technology according to claim 1, characterized in that: the algorithm by which the SDM divides resource data into blocks is the LBFS algorithm, specifically:
When the client issues a request, the server divides the resource into several indexed chunks;
Starting from the bytes, hashes of the resource content are created, implemented with a rolling hash function:
$c_i$ is the $i$-th byte of the resource stream, $k$ is the length of a Karp-Rabin block, and $b$ is the radix of the system; the hash of a Karp-Rabin block is as follows:
$$H(c_i \cdots c_{i+k}) = c_i \times b^{k-1} + c_{i+1} \times b^{k-2} + \cdots + c_{k-1} \times b + c_k$$
$b$ is a constant, and the incremental nature of the function allows the hash at the next byte to be calculated, as follows:
$$H(c_{i+1} \cdots c_{i+1+k}) = \left(\left(H(c_i \cdots c_{i+k}) - c_i \times b^{k}\right) + c_{k+1}\right) \times b$$
Selecting the hashes that represent the resource: minimum and maximum chunk sizes are enforced, and after the chunk boundaries are selected, the larger chunks are hashed with the 64-bit MurmurHash method.
4. A network resource prefetching system based on data deduplication technology, characterized in that: a simulation system framework is added between the user path and the Web server; the simulation system framework comprises a client and a proxy server end; the client can perform prefetching for the user behavior of the client browser; the client is connected to the proxy server end, and the proxy server end is connected to the Web server;
The client includes 6 modules and 2 storage files:
The 6 modules are:
Path reading module: reads the user's request sequence; its data structure is a first-in, first-out queue;
Prefetch management module: reads the access queue of the logging module and checks the prefetched object pool to confirm whether the resource has already been prefetched; if it has not been prefetched, a request is sent to the server; the prefetch management module creates multiple user request threads and waits for a new request; when a resource response is received from the server, the prefetch management module checks whether its URL is in the prefetch queue, and if so removes the URL and inserts it into the prefetched object pool; the prefetch management module also checks whether the request queue is empty, and if it is empty allows prefetch requests to be sent to the server until a new user request arrives; when a new client request is inserted into the queue, the prefetch management module deletes the resources hinted for prefetching and empties the hint queue data store;
User request module: receives requests from the prefetch management module and passes them to the request module; when a resource response is received from the server, the user request module inserts the queue carried in the response header into the prefetch queue data store, and inserts the resource URL into the prefetched object pool;
Request module: connects to the Web server and is responsible for handling low-level communication;
CDM module: the client-side data deduplication module intercepts the HTTP request messages sent by the client's user request module or prefetch request module; it queries the resource versions to obtain the resource identifiers of all resources in the client cache, and the CDM notifies the communication interception module;
Communication interception module: adds the custom header "X-vrs" for this information, attaches the header to the HTTP request headers, and passes the request to the request module, which finally forwards it to the server;
The 2 storage files are:
Prefetch queue: the object information that the server tells the client needs to be prefetched;
Prefetched object pool: stores all prefetched objects, acting similarly to the user's browser cache;
The proxy server end includes:
Listening module: given a port number, waits for connections to the client thread queue;
Server connection module: handles the connection between the client and the server end, and passes the resource data of the server end to the client;
Communication interception module: intercepts the original HTTP response from the server end and stores it in a temporary buffer for preprocessing;
SDM module: the server-side data deduplication module performs the data splitting process on the message entity delivered by the communication interception module;
Communication reassembly module: prepares and sends the deduplicated version of the response message; assembles the response headers: updates or creates Content-Length with the length of the new entity message data; adds two new headers, the resource identifier and the metadata length;
Prediction engine module: each time a resource is requested, predicts the pages that are likely to be accessed, generates, according to the prediction algorithm, a series of URLs of the most frequently accessed recent resources, and puts the result into the policy database.
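The access-probability prediction named in claim 1 and performed by the prediction engine module of claim 4 can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it uses a first-order Markov chain over URLs rather than the Markov tree named in the claims, and the function names, the session format, and the `top_n` and `min_prob` parameters are hypothetical.

```python
from collections import Counter, defaultdict


def build_transitions(sessions):
    """Count URL-to-next-URL transitions over cleaned access sequences."""
    counts = defaultdict(Counter)
    for urls in sessions:
        for current, nxt in zip(urls, urls[1:]):
            counts[current][nxt] += 1
    return counts


def predict_next(counts, current, top_n=2, min_prob=0.0):
    """Return up to top_n (url, probability) pairs most likely to follow current."""
    followers = counts.get(current)
    if not followers:
        return []  # no history for this page, so nothing to prefetch
    total = sum(followers.values())
    # Sort by descending count, then by URL for a deterministic order.
    ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(url, n / total) for url, n in ranked[:top_n]
            if n / total >= min_prob]
```

In a system like the one claimed, the returned URLs would be candidates for the prefetch queue; a `min_prob` threshold is one way to avoid prefetching resources that are unlikely to be requested.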
CN201910251873.2A 2019-03-29 2019-03-29 Network resource prefetching method and system based on data deduplication technology Active CN109981659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910251873.2A CN109981659B (en) 2019-03-29 2019-03-29 Network resource prefetching method and system based on data deduplication technology

Publications (2)

Publication Number Publication Date
CN109981659A true CN109981659A (en) 2019-07-05
CN109981659B CN109981659B (en) 2021-07-09

Family

ID=67081749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910251873.2A Active CN109981659B (en) 2019-03-29 2019-03-29 Network resource prefetching method and system based on data deduplication technology

Country Status (1)

Country Link
CN (1) CN109981659B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110320415A1 (en) * 2010-06-23 2011-12-29 International Business Machines Corporation Piecemeal list prefetch
CN105574004A (en) * 2014-10-10 2016-05-11 阿里巴巴集团控股有限公司 Webpage deduplication method and device
CN106547764A (en) * 2015-09-18 2017-03-29 北京国双科技有限公司 The method and device of web data duplicate removal
US20180081975A1 (en) * 2016-09-21 2018-03-22 Joseph DiTomaso System and method for web content matching
CN108769253A (en) * 2018-06-25 2018-11-06 湖北工业大学 A kind of adaptive prefetching control method of distributed system access performance optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RICARDO FILIPE,JOAO BARRETO: "End-to-end data deduplication", 《2011 IEEE INTERNATIONAL SYMPOSIUM ON NETWORK COMPUTING AND APPLICATIONS》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110609714A (en) * 2019-07-31 2019-12-24 百度在线网络技术(北京)有限公司 Data prefetching method, device and equipment and storage medium
CN111586020A (en) * 2020-04-29 2020-08-25 北京天融信网络安全技术有限公司 Probability model construction method and device, electronic equipment and storage medium
CN111586020B (en) * 2020-04-29 2021-09-10 北京天融信网络安全技术有限公司 Probability model construction method and device, electronic equipment and storage medium
CN112953894A (en) * 2021-01-26 2021-06-11 复旦大学 Multi-path request copying and distributing system and method
CN113064886A (en) * 2021-03-04 2021-07-02 广州中国科学院计算机网络信息中心 Method for storing and managing identification resources
CN113064886B (en) * 2021-03-04 2023-08-29 广州中国科学院计算机网络信息中心 Method for storing and marking management of identification resource
CN114221953A (en) * 2021-11-29 2022-03-22 平安证券股份有限公司 Resource acquisition method, device, equipment and storage medium
CN114785858A (en) * 2022-06-20 2022-07-22 武汉格蓝若智能技术有限公司 Resource active caching method and device applied to mutual inductor online monitoring system

Also Published As

Publication number Publication date
CN109981659B (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN109981659A (en) Internet resources forecasting method and system based on data deduplication technology
JP4025379B2 (en) Search system
US7024452B1 (en) Method and system for file-system based caching
CN104468807B (en) Carry out processing method, high in the clouds device, local device and the system of web cache
Cambazoglu et al. Scalability challenges in web search engines
KR20030048045A (en) A method for searching and analysing information in data networks
JP5705114B2 (en) Information processing apparatus, information processing method, program, and web system
Saygin et al. Exploiting data mining techniques for broadcasting data in mobile computing environments
Wu et al. Prediction of web page accesses by proxy server log
Kucukyilmaz et al. A machine learning approach for result caching in web search engines
Bhushan et al. Recommendation of optimized web pages to users using Web Log mining techniques
CN115269631A (en) Data query method, data query system, device and storage medium
CN107391555B (en) Spark-Sql retrieval-oriented metadata real-time updating method
Nimishan et al. An approach to improve the performance of web proxy cache replacement using machine learning techniques
Agrawal et al. A survey on content based crawling for deep and surface web
Feng et al. Markov tree prediction on web cache prefetching
Zhu et al. Prediction algorithm based on web mining for multimedia objects in next–generation Digital Earth
Lee et al. A proactive request distribution (prord) using web log mining in a cluster-based web server
Doraimani Filecules: A new granularity for resource management in grids
Zhang et al. SMURF: Efficient and Scalable Metadata Access for Distributed Applications from Edge to the Cloud
Temgire et al. Review on web prefetching techniques
Chaudhari et al. Proxy side web Prefetching scheme for efficient bandwidth usage: data mining approach
CN116680276A (en) Data tag storage management method, device, equipment and storage medium
Li et al. A hybrid cache and prefetch mechanism for scientific literature search engines
Du et al. A web cache replacement strategy for safety-critical systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant