CN109981659A - Internet resources forecasting method and system based on data deduplication technology - Google Patents
Internet resources forecasting method and system based on data deduplication technology
- Publication number
- CN109981659A (CN201910251873.2A)
- Authority
- CN
- China
- Prior art keywords
- resource
- module
- request
- data
- client
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
- H04L67/5681—Pre-fetching or pre-delivering data based on network characteristics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/22—Parsing or analysis of headers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Abstract
The invention discloses a network resource prefetching method and system based on data deduplication technology. The method comprises the following steps: when a client sends an access request, a proxy server records the user's network access behavior and extracts an access log; Web mining and analysis of the log extract network behavior features and yield access rules; a prediction engine applies a prediction algorithm to determine in advance the network resources most likely to be accessed next and prefetches them into the cache. Because the cache size is limited, resources are processed with data deduplication before they are stored in the cache. While maintaining prefetching efficiency, the invention saves storage space, improves the data transfer rate, reduces network delay, relieves network traffic pressure during peak access periods, saves bandwidth, and improves system availability.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a network resource prefetching method and system based on data deduplication technology.
Background
With the rapid growth of Internet information and users, improving network service quality and accelerating the World Wide Web has become an urgent problem. Web caching, Web prefetching, and data deduplication can all effectively reduce network delay. Web caching is based on the principle of temporal locality: using an efficient replacement algorithm, it caches in advance the resources a user is likely to access. It is applied in network environments such as proxy servers, P2P networks, and mobile networks, but it is limited by its hit rate. Web prefetching attempts to fetch resources proactively before the user issues a request, which improves the hit rate to some extent and reduces access delay. However, because Web prefetching is a speculative mechanism, it can increase bandwidth consumption. It must therefore be controlled carefully; otherwise it greatly degrades performance and defeats its original purpose. Data deduplication aims to reduce occupied space by detecting and removing duplicate data. It has been found that two versions of a resource that reference the same keyword contain 55% duplicate, redundant data; if the reference source is academic, the duplication reaches 87%. By exploiting the information redundancy between data objects, deduplication can achieve much better space utilization than conventional compression and incremental backup methods, reduce the number of bytes transmitted, free part of the occupied bandwidth, and reduce network delay. Combining Web prefetching with data deduplication would therefore be of great significance for effectively reducing network delay.
Existing Web prefetching technology for reducing expected delay was first applied in the Mozilla Firefox browser and was later adopted by the Google search engine. The Google Web Accelerator software works together with Firefox or Internet Explorer to implement browser-based prefetching. Commercial software involving prefetching includes Robtex Viking Server and AllegroSurf, but the specific prefetching schemes have not been disclosed. Because prefetching is speculative, it can sometimes consume additional bandwidth, so Web prefetching methods need to be used with caution. It is precisely because of this potential bandwidth limitation that the commercial prospects of Web prefetching are not considered optimistic.
Summary of the invention
In view of the deficiencies and problems of the prior art, the present invention provides a network resource prefetching method and system based on data deduplication technology, which improves prefetching efficiency and reduces network delay by reducing the redundancy of transmitted data.
The technical solution adopted by the present invention is a network resource prefetching method based on data deduplication technology, comprising the following steps:
First, a proxy server is connected between the client and the server. When the client sends an access request, the proxy server records the user's network access behavior and extracts an access log;
Second, the proxy server performs Web mining and analysis on the network access log, extracts user behavior features, and obtains network access rules. Mining the user's access preferences from the access log to extract network access behavior features includes: performing data cleaning and preprocessing on the access log, removing records of failed accesses and non-cacheable objects from the log file, and extracting the user's browsing features from the preprocessed access sequence;
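The cleaning step above can be illustrated with a minimal Python sketch. The record layout `(client_ip, url, status)`, the rule that only 2xx responses count as successful, and the `NON_CACHEABLE_MARKERS` heuristic are assumptions made for illustration; the patent does not specify how failed or non-cacheable entries are identified.

```python
# Hypothetical heuristic: URLs containing these markers are treated as
# dynamic and therefore non-cacheable. This is an assumption, not the
# patent's rule.
NON_CACHEABLE_MARKERS = ("?", "cgi-bin")


def clean_log(records):
    """Keep only successful requests for objects that look cacheable.

    records: iterable of (client_ip, url, status) tuples (assumed layout).
    """
    cleaned = []
    for ip, url, status in records:
        if not 200 <= status < 300:
            continue  # drop failed accesses
        if any(marker in url for marker in NON_CACHEABLE_MARKERS):
            continue  # drop objects assumed to be non-cacheable
        cleaned.append((ip, url))
    return cleaned


def sequences_by_user(cleaned):
    """Group cleaned records into one access sequence per client IP."""
    sequences = {}
    for ip, url in cleaned:
        sequences.setdefault(ip, []).append(url)
    return sequences
```

The per-user sequences returned by `sequences_by_user` are the kind of input a browsing-feature extractor or path predictor would consume.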
Meanwhile it being provided by prediction engine using the network that prediction algorithm analyzes the access of future time user's most probable in advance
Source, and be prefetched in caching;The prediction engine is the page for predicting access when resource each time is requested
Face will generate a series of URL in the nearest accessed highest resource of frequency according to prediction algorithm, and result be put into certainly
In plan database;User behavior characteristics can accurately describe user by Markov chain model and browse feature, utilize Markov
Tree models browsing behavior of the user to webpage, and the prediction algorithm for being taken based on access probability predicts future time user and most may be used
The access request that can be issued;
Finally, the resource for prefetching in the buffer is stored in the buffer after data deduplication technical treatment;Wherein, to pre-
The resource that takes carries out the step of data deduplication processing and includes:
The data de-duplication module CDM of client operates in client browser, to store nearest newest network
Resource, and the SDM module positioned at server end how is corresponded to according to unique identifier instruction respective resources;
The data de-duplication model SDM of server end is to combine the data block finally responded, when SDM receives one
The request of given resource, by a customized request header of the CDM reference resource identifier sent, then it is retrieved
SDM takes out the resource from server, and after receiving the header sufficiently responded to and data, SDM gives resource allocation one new mark
Know symbol, resource data is divided into block, the metamessage of block is stored in data storage file;The SDM in this data repository
It ensure that all pieces of all versions by the metamessage resource of the hash index of block;
After CDM receives response, for all original resources of data reconstruction, including the copy block from local cache resource
Reference information and duplication receive the Non-redundant data content of response.
Further, journal file subscriber network access behavioural information includes the access time of user access request, user
IP address, the filename for accessing resource or script and parameter field.
Further, the SDM uses the LBFS algorithm to divide resource data into blocks, specifically:
When the client issues a request, the server divides the resource into indexed chunks;
First, hashes are computed over the bytes of the resource content, using a rolling hash function:
Let c_i be the i-th byte of the resource stream, k the length of a Karp-Rabin block, and b the base. The hash of a Karp-Rabin block is:
$$H(c_i \dots c_{i+k-1}) = c_i \times b^{k-1} + c_{i+1} \times b^{k-2} + \dots + c_{i+k-2} \times b + c_{i+k-1}$$
b is a constant, and the iterative nature of the function allows the hash at the next byte to be computed from the previous one:
$$H(c_{i+1} \dots c_{i+k}) = \left(H(c_i \dots c_{i+k-1}) - c_i \times b^{k-1}\right) \times b + c_{i+k}$$
Then the hashes that represent the resource are selected: minimum and maximum chunk sizes are enforced, and after the chunk boundaries are selected, the larger chunks are hashed with the 64-bit MurmurHash method.
A network resource prefetching system based on data deduplication technology, in which a simulation system framework is added between the user path and the Web server. The simulation system framework includes a client and a proxy server; the client simulates the user behavior of a prefetch-capable client browser, the client is connected to the proxy server, and the proxy server is connected to the Web server;
The client includes 6 modules and 2 storage files:
Read path module: reads the user's request sequence; its data structure is a first-in, first-out queue;
Prefetch management module: reads the access queue of the logging module and checks the prefetched object pool to determine whether the resource has already been prefetched. If not, it sends the request to the server. The prefetch management module creates multiple user request threads and waits for a new request. When a resource response is received from the server, the module checks whether its URL is in the prefetch queue; if so, it removes the URL and inserts the resource into the prefetched object pool. The module also checks whether the request queue is empty; if it is, prefetch requests may be sent to the server until a new user request arrives. When a new client request is inserted into the queue, the prefetch management module deletes the hinted resources that have already been prefetched and empties the hint queue data store;
User request module: receives requests from the prefetch management module and passes them to the request module. When a resource response is received from the server, the user request module inserts the queue carried in the response headers into the prefetch queue data store and inserts the resource's URL into the prefetched object pool;
Request module: connects to the Web server and handles the low-level communication;
CDM module: the client-side data deduplication module intercepts the HTTP request messages sent by the user request module or the prefetch request module, queries the resource version numbers to obtain the resource identifiers of all resources in the client cache, and then notifies the communication interception module;
Communication interception module: adds the custom header "X-vrs" to the message, attaches the header information to the HTTP request header, and passes it to the request module, which finally forwards it to the server;
Prefetch queue: holds the object information that the server tells the client to prefetch;
Prefetched object pool: stores all prefetched objects; its role is similar to that of a user's browser cache;
The proxy server includes:
Listening module: waits for client connections on a thread queue, given a port number;
Server connection module: handles the connection between the client and the server, and passes the server's resource data to the client;
Communication interception module: intercepts the original HTTP response from the server, stores it in a temporary buffer, and preprocesses it;
SDM module: the server-side data deduplication module performs the data splitting process on the message entity delivered by the communication interception module;
Communication recombination module: prepares and sends the deduplicated version of the response message. It assembles the response headers: it updates or creates Content-Length with the length of the new entity data, and adds two new headers, the resource version identifier and the metadata length;
Prediction engine module: each time a resource is requested, predicts the pages likely to be accessed; according to the prediction algorithm, it generates a list of URLs of the resources accessed most frequently in the recent period and puts the result into the decision database.
Beneficial effects of the present invention: to address the bandwidth limitation of current Web prefetching, the present invention provides an improved Web prefetching system that improves prefetching efficiency and reduces network delay by reducing the redundancy of transmitted data.
On the one hand, the present invention provides a network resource prefetching method. Web request logs are obtained on a Squid proxy server and analyzed to obtain access paths, which are fed into the client module to simulate user access behavior. The client issues requests to the server side periodically at a specified time interval. The server connection module on the server side receives the request messages in order and passes them to the server; when the HTTP response headers arrive, it intercepts the response message, analyzes it, and passes it to the prediction engine module. The prediction engine module uses a popularity-based prediction algorithm to compute the pages accessed most frequently in the most recent period and puts the result into the decision database. At the same time, after all response headers have been received, it triggers an update of the state library in preparation for the next prediction. Finally, the server connection module forwards the complete response data to the client.
On the other hand, the present invention provides a network resource prefetching system that analyzes resources and deduplicates their data before prefetched resources are downloaded into the cache. A communication interception module, an SDM module, and a communication recombination module are added at the proxy server; a resource version module, a CDM module, and a communication interception module are added at the client. The system can respond quickly to users' network access requests, improve network service quality, reduce access delay, save bandwidth effectively, and improve system availability and prefetching efficiency.
Brief description of the drawings
Fig. 1 is an application scenario diagram of the Web page prefetching method of an embodiment of the present invention.
Fig. 2 is a flow chart of the Web page prefetching system of an embodiment of the present invention.
Fig. 3 is a diagram of the client modules of an embodiment of the present invention.
Fig. 4 is a diagram of the proxy server modules of the prefetching system of an embodiment of the present invention.
Fig. 5 is a structure diagram of data deduplication.
Fig. 6 is a schematic diagram of the algorithm of the SDM module of an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further explained below with reference to the drawings and embodiments.
Embodiment 1: a network resource prefetching system based on data deduplication technology, in which a simulation system framework is added between the user path and the Web server. The simulation system framework includes a client and a proxy server; the client simulates the user behavior of a prefetch-capable client browser, the client is connected to the proxy server, and the proxy server is connected to the Web server.
Fig. 1 shows the application scenario of the Web page prefetching system of the embodiment. In Fig. 1, client 101 sends a request to proxy server 102 to access a Web server resource. When the main process of proxy server 102 detects a request from client A, it creates a child process to handle that request, while the main process continues listening. The child process establishes a connection with client 101, reads and parses the client's request, and then checks the request against the access rule list preset on the proxy server. If the request satisfies the rules, the proxy looks up whether the required information exists in the proxy server cache and handles subsequent requests for that information locally. In this way the client obtains the desired resource quickly and bandwidth is saved.
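The request-handling flow described for Fig. 1 can be sketched in Python as follows. This is a simplified illustration: the function name `handle_request`, the prefix-based rule list, and the fallback to the origin server on a cache miss are assumptions, and the per-request child process is omitted.

```python
def handle_request(url, allowed_prefixes, cache, fetch_from_origin):
    """Handle one client request at the proxy.

    Returns (body, source), where source is "denied", "cache" or "origin".
    allowed_prefixes stands in for the preset access rule list (assumed form).
    fetch_from_origin is a hypothetical callable: url -> bytes.
    """
    # Check the request against the preset access rules.
    if not any(url.startswith(prefix) for prefix in allowed_prefixes):
        return None, "denied"
    # Serve locally when the proxy cache already holds the resource.
    if url in cache:
        return cache[url], "cache"
    # Assumed behavior on a miss: fetch from the Web server and cache it.
    body = fetch_from_origin(url)
    cache[url] = body
    return body, "origin"
```

Calling `handle_request` twice with the same permitted URL contacts the origin only once; the second call is answered from `cache`.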
As shown in Fig. 3, the client includes 6 modules and 2 storage files. The modules are read path module 31, prefetch management module 32, user request module 33, CDM module 34, communication interception module 35, and request module 36. The storage files are prefetch queue 37 and prefetched object pool 38.
Read path module 31 records the user's access behavior, including the user's IP address or host name, the time the request was issued, the request method (GET, POST, etc.), the path (URL) of the page accessed, the status code returned by the server, and the number of bytes sent in the response.
Prefetch management module 32 selects a suitable prediction algorithm and predicts the user requests that need to be prefetched. It is responsible for checking the prefetched object pool to determine whether the requested resource has already been prefetched.
User request module 33 receives requests from prefetch management module 32 and passes them to the request module.
CDM module 34 performs client-side data deduplication. It intercepts the HTTP request messages sent by the user request module, queries the resource version numbers to obtain the resource identifiers of all resources in the client cache, and finally notifies communication interception module 35.
Communication interception module 35 adds the custom header "X-vrs" to the request and passes it to the request module, which finally forwards it to the server.
Request module 36 connects to the Web server and handles the low-level communication.
Prefetch queue 37 holds the object information that the server tells the client to prefetch.
Prefetched object pool 38 stores all prefetched objects; its role is similar to that of a user's browser cache.
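The interplay between the prefetch management module, the prefetch queue, and the prefetched object pool can be sketched as below. The class `PrefetchManager`, its method names, and the `fetch` callable are hypothetical; threading and the HTTP layer are left out, so this only illustrates the bookkeeping, not the patent's implementation.

```python
from collections import deque


class PrefetchManager:
    def __init__(self, fetch):
        self.fetch = fetch             # hypothetical callable: url -> bytes
        self.prefetch_queue = deque()  # URLs the server hinted at
        self.pool = {}                 # prefetched object pool: url -> body

    def add_hints(self, urls):
        """Queue hinted URLs that are not already in the pool."""
        self.prefetch_queue.extend(u for u in urls if u not in self.pool)

    def prefetch_while_idle(self, user_waiting=lambda: False):
        """Issue prefetch requests only while no user request is waiting."""
        while self.prefetch_queue and not user_waiting():
            url = self.prefetch_queue.popleft()
            self.pool[url] = self.fetch(url)

    def user_request(self, url):
        """Serve a user request; returns (body, was_prefetched)."""
        # A new user request invalidates the outstanding hints.
        self.prefetch_queue.clear()
        if url in self.pool:
            return self.pool[url], True
        return self.fetch(url), False
```

A request for a URL that was prefetched during idle time is answered from `pool` without contacting the server again.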
As shown in Fig. 4, the proxy server connects the server and the client. It includes listening module 41, server connection module 42, request recombination module 43, SDM module 44, communication interception module 45, prediction engine module 46, and state update library 47.
Listening module 41: mainly waits for client connections on a thread queue, given a port number.
Server connection module 42: mainly handles the direct connection between the client and the server. In particular, when handling a response, the module receives the original HTTP response headers first and passes them to the client. It then passes all resource data from the server to the client intact.
Communication interception module 45: mainly intercepts the original HTTP response from the server, stores it in a temporary buffer, and preprocesses it.
SDM module 44: the server-side data deduplication module performs the data splitting process on the message entity delivered by the communication interception module. The splitting process is completed in two steps: block division and data deduplication.
Request recombination module 43: prepares and sends the deduplicated version of the response message. It assembles the response headers: it updates or creates Content-Length with the length of the new entity data, and adds two new headers, X-vrs (resource version identifier) and X-mtd (metadata length).
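The header recombination step can be illustrated with a short Python sketch. The header names `X-vrs` and `X-mtd` and the updated `Content-Length` come from the text; the function name `recombine_headers`, the dictionary representation of headers, and the body layout (metadata followed by non-redundant data) are assumptions for illustration.

```python
def recombine_headers(original_headers, resource_id, metadata, non_redundant):
    """Build the headers and body of the deduplicated response.

    metadata and non_redundant are bytes; the new body is assumed to be
    the metadata section followed by the non-redundant content.
    """
    body = metadata + non_redundant
    headers = dict(original_headers)              # keep the original headers
    headers["X-vrs"] = str(resource_id)           # resource version identifier
    headers["X-mtd"] = str(len(metadata))         # metadata length in bytes
    headers["Content-Length"] = str(len(body))    # length of the new entity
    return headers, body
```

A client that knows the `X-mtd` value can split the received body back into the metadata section and the non-redundant content.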
Prediction engine module 46: its main task is to predict, each time a resource is requested, the pages likely to be accessed. According to the prediction algorithm, the module generates a list of URLs of the resources accessed most frequently in the recent period and puts the result into the decision database.
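A popularity-style predictor of this kind can be sketched as below. The class name `PopularityPredictor`, the sliding window of recent requests, and the `top_n` cut-off are assumptions; the patent does not state how "recent" is bounded or how many URLs are returned.

```python
from collections import Counter, deque


class PopularityPredictor:
    def __init__(self, window=1000, top_n=5):
        # Only the last `window` requests count as "recent" (assumed policy).
        self.recent = deque(maxlen=window)
        self.top_n = top_n

    def record(self, url):
        """Record one requested URL."""
        self.recent.append(url)

    def predict(self):
        """Return the most frequently requested recent URLs, most popular first."""
        counts = Counter(self.recent)
        return [url for url, _ in counts.most_common(self.top_n)]
```

The list returned by `predict` plays the role of the result that the module would place in the decision database.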
Embodiment 2:
A network resource prefetching method based on data deduplication technology, comprising the following steps:
First, a proxy server is connected between the client and the server. When the client sends an access request, the proxy server records the user's network access behavior and extracts an access log. The user network access behavior information in the log file mainly includes the access time of the user's request, the IP address, the file name or script of the accessed resource, and parameter fields.
Second, the proxy server performs Web mining and analysis on the network access log, extracts user behavior features, and obtains network access rules. Mining the user's access preferences from the access log to extract network access behavior features includes: performing data cleaning and preprocessing on the access log, removing records of failed accesses and non-cacheable objects from the log file, and extracting the user's browsing features from the preprocessed access sequence.
Meanwhile, a prediction engine uses a prediction algorithm to determine in advance the network resources the user is most likely to access next and prefetches them into the cache. Each time a resource is requested, the prediction engine predicts the pages likely to be accessed: according to the prediction algorithm, it generates a list of URLs of the resources accessed most frequently in the recent period and puts the result into the decision database. User browsing features can be described accurately by a Markov chain model: a Markov tree models the user's browsing behavior across web pages, and a prediction algorithm based on access probability predicts the access request the user is most likely to issue next.
Finally, the prefetched resources are processed with data deduplication before being stored in the cache. The deduplication of prefetched resources includes the following steps:
The client-side data deduplication module (CDM) runs in the client browser. It stores the most recent versions of network resources and uses unique identifiers to indicate how each resource corresponds to the SDM module on the server side.
The server-side data deduplication module (SDM) assembles the data blocks of the final response. When the SDM receives a request for a given resource, it reads the custom request header sent by the CDM, which carries the reference resource identifiers. The SDM then fetches the resource from the server. After receiving the complete response headers and data, the SDM assigns the resource a new identifier, divides the resource data into blocks, and stores the block metadata in a data store file. In this data store, the SDM indexes the block metadata by block hash for all blocks of all versions of the resource.
After the CDM receives the response, it reconstructs the original resource from the data by copying the referenced blocks from locally cached resources and copying the non-redundant content from the received response.
As shown in Fig. 1, client 101 sends a request to proxy server 102 to access a Web server resource. When the main process of proxy server 102 detects a request from client A, it creates a child process to handle that request, while the main process continues listening. The child process establishes a connection with client 101, reads and parses the client's request, and then checks the request against the access rule list preset on the proxy server. If the request satisfies the rules, the proxy looks up whether the required information exists in the proxy server cache and handles subsequent requests for that information locally. In this way the client obtains the desired resource quickly and bandwidth is saved.
As shown in the flow chart of the Web page prefetching system in Fig. 2, in step 201 the proxy server records the user's browsing log and monitors the user's browsing behavior.
The user's browsing behavior refers to the static historical record of the user's accesses, which is stored in log files on the Web server or proxy server. Each record contains the following information about one request: the user's IP address or host name, the time the request was issued, the request method (GET, POST, etc.), the path (URL) of the page accessed, the status code returned by the server, and the number of bytes sent in the response. Web log files are readily available from a Web server or proxy server.
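As an illustration of the record fields listed above, the following Python sketch parses one log line. It assumes a layout in the style of the Common Log Format; the actual log format of the proxy used in the embodiment is not given in the text, so `LOG_PATTERN` and `parse_record` are hypothetical.

```python
import re

# Assumed layout: host ident user [time] "METHOD URL PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_record(line):
    """Return a dict with host, time, method, url, status, size, or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # line does not follow the assumed layout
    record = match.groupdict()
    record["status"] = int(record["status"])
    # "-" means no bytes were reported for the response.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

Each parsed record carries the six fields the paragraph names, which is enough input for the cleaning and path-mining steps.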
In step 202, the proxy server mines the user's preferred browsing paths through the prefetch management module.
For example, the network behavior of users at a certain university concentrates in the morning working peak, 8:00-9:30, when they mainly access the campus network home page and portal sites such as Sina and Sohu. The log recorded by the proxy server captures this behavior.
The path analyzer in the prefetch management module at the proxy server mines user access paths. By analyzing access sequences it builds a Markov prediction tree, from which it generates a transition probability matrix. The prediction method is based on a Markov chain model, which supplies the transition probability matrix and the initial state probability vector. All client requests are buffered in a client buffer and are flushed from the buffer once the minimum sample threshold is exceeded or the session ends. Each client is assigned a separate buffer in which its request sequence is stored. The Markov chain model is updated as user access paths change; the update smooths the default or current values of the matrix according to the newly added path sequences.
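A first-order Markov chain is the simplest instance of the model described above, and it can be sketched in Python as follows. The class name `MarkovPredictor`, the count-based update, and the probability `threshold` are assumptions; the patent's Markov prediction tree and its smoothing rule are not specified in enough detail to reproduce.

```python
from collections import defaultdict


class MarkovPredictor:
    def __init__(self):
        # counts[current][next] = number of observed transitions
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, path):
        """Add one access path (a list of URLs) to the transition counts."""
        for current, nxt in zip(path, path[1:]):
            self.counts[current][nxt] += 1

    def transition_probabilities(self, page):
        """Return one row of the transition probability matrix."""
        row = self.counts.get(page, {})
        total = sum(row.values())
        if total == 0:
            return {}
        return {nxt: n / total for nxt, n in row.items()}

    def predict(self, page, threshold=0.5):
        """Return next pages whose probability reaches the threshold."""
        probs = self.transition_probabilities(page)
        candidates = [p for p, pr in probs.items() if pr >= threshold]
        return sorted(candidates, key=lambda p: -probs[p])
```

Calling `update` with each flushed client buffer keeps the matrix in step with changing access paths; `predict` then yields the candidates to prefetch.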
In step 203, the prediction engine uses the prediction algorithm to predict the websites most likely to be accessed next, and downloads them into the cache in advance while the network is idle.
For example, suppose the user's access pattern is to visit the Netease website every morning between 8:00 and 9:30. Based on this pattern, the network proxy cache device can analyze the network state at the next occurrence of the same period (8:00-9:30) and predict the most likely website according to the prediction algorithm. When the network is idle, the related resources can be prefetched from the Netease website and saved in the cache. When the user issues an access request, the requested resources are served directly from the cache, which avoids adding to the network load during peak periods, saves network bandwidth, and guarantees service quality.
In step 204, the data deduplication system used is derived from the dedupHTTP system. Its core task is content analysis of files, and its basic unit is an abstract data object called a block (chunk). The main work of the system consists of five stages: block division, feature value computation, detection of identical or similar blocks, redundancy elimination, and data storage. The deduplication system built in this embodiment consists of two modules: the client-side data deduplication module (CDM) and the server-side data deduplication module (SDM). The CDM and SDM run in the client browser and the Web server, respectively. The CDM stores the most recent resource versions and uses unique identifiers to indicate how each resource version corresponds to the SDM.
Regarding data deduplication, as shown in the structure diagram of Fig. 5, when the SDM receives a request for a given resource, it reads the custom request header sent by the CDM, which carries the reference resource identifiers. The SDM then fetches the resource from the server. After receiving the complete response headers and data, the SDM assigns the resource a new identifier, divides the resource data into blocks, and stores the block metadata in a data store file. In this data store, the SDM indexes the block metadata by block hash for all blocks of all versions of the resource. The SDM iterates over all blocks of the resource. For each block, it looks for a block with the same hash in the reference resources. If no reference resource contains a matching block, the SDM searches the blocks of the current response. Redundancy detection is therefore performed not only against resources in the CDM's cache but also within the resource being delivered to the CDM. The SDM then assembles the final response and sends it to the CDM. The response begins with a metadata section, whose size in bytes is stored in a custom HTTP response header. The metadata starts with the resource identifier of the response. Each reference is a four-tuple holding the information the CDM needs to locate a redundant block in its cache: the offset in the current response (Offset), the resource identifier (Resource identifier), the offset inside that resource (InOffset), and the block length. The tuples are ordered by the position of the corresponding blocks in the original response. The non-redundant content is appended at the end of the metadata block, in the same order as in the original response. Because the order of both the tuples and the non-redundant data is maintained, the CDM only needs the offsets in the tuples to know where to splice redundant data into the non-redundant data. After the CDM receives the response, it reconstructs the original resource from the data by copying the referenced blocks from locally cached resources and copying the non-redundant content from the received response. The CDM does not store any block metadata, and it does not need to send or store hashes.
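The exchange between SDM and CDM can be sketched in Python using the four-tuple layout described above: (offset in the response, resource identifier, offset inside the cached resource, length). This is a simplified sketch under stated assumptions: the function names `sdm_encode` and `cdm_reconstruct` are hypothetical, blocks are looked up by their raw bytes rather than by hash, and redundancy inside the current response is not detected.

```python
def sdm_encode(blocks, reference_index):
    """Split a response into reference tuples and non-redundant data.

    blocks: the resource's blocks (bytes), in order.
    reference_index: dict mapping block bytes -> (resource_id, in_offset)
    for blocks the client is known to hold (assumed structure).
    """
    refs, literal, offset = [], bytearray(), 0
    for block in blocks:
        hit = reference_index.get(block)
        if hit is not None:
            refs.append((offset, hit[0], hit[1], len(block)))
        else:
            literal += block  # non-redundant content keeps its order
        offset += len(block)
    return refs, bytes(literal)


def cdm_reconstruct(refs, literal, cached):
    """Rebuild the resource from tuples, literal data and cached resources."""
    out, pos, lit_pos = bytearray(), 0, 0
    for offset, res_id, in_offset, length in refs:
        gap = offset - pos  # literal bytes that precede this reference
        out += literal[lit_pos:lit_pos + gap]
        lit_pos += gap
        out += cached[res_id][in_offset:in_offset + length]
        pos = offset + length
    out += literal[lit_pos:]  # trailing non-redundant bytes
    return bytes(out)
```

Because the tuples and the literal data keep their original order, the client needs only the offsets to interleave cached blocks with new content, as the paragraph describes.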
The algorithm principle of the SDM module in this embodiment is shown in Fig. 6. The data deduplication system is based on a content-addressable storage system: it compares and analyzes the similarity and identity of data and deletes duplicate data segments to save space. The key is a reasonable choice of data partitioning method.
In the data deduplication system adopted here, SDM divides each resource version that the server returns to the client into indexed data chunks. The chunk partitioning algorithm used is the LBFS algorithm.
Step 1: When the client issues a request, the server divides the resource into several indexed chunks and determines which parts of the resource the client already has and which parts are new and must be sent.
The process starts by computing hashes over the bytes of the resource content. These hashes represent small chunks of the resource and are computed with a sliding (rolling) hash function.
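Here $c_i$ is the i-th byte of the resource stream, $k$ is the length of a Karp-Rabin block, and $b$ is the base (radix) of the system. The hash of a Karp-Rabin block is:
$$H(c_i \ldots c_{i+k-1}) = c_i \times b^{k-1} + c_{i+1} \times b^{k-2} + \cdots + c_{i+k-2} \times b + c_{i+k-1}$$
$b$ is a constant. The incremental form of the function allows the hash at the next byte position to be computed from the previous one:
$$H(c_{i+1} \ldots c_{i+k}) = \bigl(H(c_i \ldots c_{i+k-1}) - c_i \times b^{k-1}\bigr) \times b + c_{i+k}$$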
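As a minimal sketch of the sliding hash described above, the following Python code computes a Karp-Rabin hash over every window of k bytes, updating the value incrementally instead of rehashing each window. The function names, the base `b = 257`, and the modulus are assumptions made for illustration; the formulas above do not state a modulus, which is added here only to keep the values bounded.

```python
from typing import List


def kr_hash(window: bytes, b: int, mod: int) -> int:
    """Direct Karp-Rabin hash of one window: sum of c_j * b^(k-1-j), reduced mod `mod`."""
    h = 0
    for c in window:
        h = (h * b + c) % mod
    return h


def rolling_hashes(data: bytes, k: int, b: int = 257,
                   mod: int = (1 << 61) - 1) -> List[int]:
    """Hashes of all k-byte windows of `data`, one per starting position."""
    if k <= 0 or len(data) < k:
        return []
    top = pow(b, k - 1, mod)          # weight of the byte that leaves the window
    h = kr_hash(data[:k], b, mod)     # first window is hashed directly
    hashes = [h]
    for i in range(len(data) - k):
        # Drop c_i, shift the remaining bytes by one power of b, add c_{i+k}.
        h = ((h - data[i] * top) * b + data[i + k]) % mod
        hashes.append(h)
    return hashes
```

Each update costs a constant number of arithmetic operations, which is what makes hashing at every byte position affordable.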
Step 2: Select the hashes that represent the resource. Although all hashes could be kept, this is impractical in terms of memory, because every byte of the content has a hash. Suitable hashes are therefore selected to serve as the boundaries of larger data blocks.
In this step a windowing mechanism is used for the selection, which has proven in practice to provide better redundancy detection. A minimum and a maximum chunk size are both enforced because, compared with the expected chunk size, a purely content-based selection may produce chunks that are too large or too small.
After the chunk boundaries are selected, the resulting larger chunks are hashed with the 64-bit MurmurHash method. According to this embodiment, the method compares favorably with cryptographic hashes such as MD5 or SHA-1 in keeping the collision rate low.
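The boundary selection with minimum and maximum chunk sizes can be sketched as follows. This is an illustrative LBFS-style variant under stated assumptions, not the embodiment's exact procedure: a position is treated as a boundary when the low bits of the sliding hash match a mask, and the parameters `window`, `mask`, `min_size`, and `max_size` are hypothetical values. Because the Python standard library has no MurmurHash, `hashlib.blake2b` with an 8-byte digest stands in as the 64-bit chunk hash.

```python
import hashlib
from typing import List


def split_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                 min_size: int = 32, max_size: int = 256,
                 b: int = 257, mod: int = (1 << 61) - 1) -> List[bytes]:
    """Content-defined chunking with enforced minimum and maximum chunk sizes."""
    top = pow(b, window - 1, mod)  # weight of the byte leaving the window
    chunks: List[bytes] = []
    start = 0                      # start offset of the chunk being built
    h = 0
    for i, c in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod  # slide: drop the oldest byte
        h = (h * b + c) % mod                       # slide: add the newest byte
        size = i + 1 - start
        boundary = (h & mask) == mask               # content-defined candidate
        # Too-small chunks are suppressed; too-large chunks are cut by force.
        if (boundary and size >= min_size) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # final partial chunk
    return chunks


def chunk_id(chunk: bytes) -> int:
    """64-bit identifier of a chunk (blake2b used here in place of MurmurHash)."""
    return int.from_bytes(hashlib.blake2b(chunk, digest_size=8).digest(), "big")
```

Since a boundary depends only on the bytes inside the sliding window, inserting data early in a resource shifts later boundaries along with the content, so unchanged regions tend to produce the same chunks and the same identifiers.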
Claims (4)
1. A network resource prefetching method based on data deduplication technology, characterized in that the method comprises the following steps:
First, a proxy server end is connected between the client and the server; while the client sends access requests, the proxy server records the user's network access behavior information and extracts an access log;
Second, the proxy server performs Web mining and analysis on the network access log, extracts user behavior features, and obtains network access rules; the step of mining the user's access preferences from the access log to extract the user's network access behavior features includes: performing data-cleaning preprocessing on the access log, removing records of failed accesses and non-cacheable objects from the log file, and extracting the user's browsing features from the preprocessed network access sequence;
Meanwhile, a prediction engine uses a prediction algorithm to determine in advance the network resources the user is most likely to access next and prefetches them into the cache; each time a resource is requested, the prediction engine predicts the pages that will be accessed, generates according to the prediction algorithm a series of URLs of the resources accessed most frequently in the recent period, and puts the result into the decision database; the user's browsing features are described accurately by a Markov chain model, the user's web browsing behavior is modeled with a Markov tree, and a prediction algorithm based on access probability predicts the access request the user is most likely to issue next;
Finally, the resources prefetched into the cache are stored in the cache after data deduplication processing; wherein the step of performing data deduplication processing on the prefetched resources includes:
The client-side data deduplication module CDM runs in the client browser, stores the most recent network resources, and indicates by unique identifier how each resource corresponds to the SDM module located at the server end;
The server-side data deduplication module SDM assembles the data blocks of the final response; when SDM receives a resource request, it retrieves the referenced resource identifiers that CDM sends in a custom request header; SDM then fetches the resource from the server; after receiving the complete response header and data, SDM assigns the resource a new identifier, divides the resource data into blocks, and stores the meta-information of the blocks in the data storage file; in this data repository SDM maintains the meta-information of all blocks of all resource versions, indexed by block hash;
After CDM receives the response, it reconstructs the original resource from all the data, which includes copying the referenced blocks from locally cached resources and copying the non-redundant data content of the received response.
2. The network resource prefetching method based on data deduplication technology according to claim 1, characterized in that: the user network access behavior information in the log file includes the access time of the user access request, the IP address, the file name or script of the accessed resource, and a parameter field.
3. The network resource prefetching method based on data deduplication technology according to claim 1, characterized in that: the algorithm by which SDM divides the resource data into blocks is the LBFS algorithm, specifically:
When the client issues a request, the server divides the resource into several indexed chunks;
The process starts by computing hashes over the bytes of the resource content, using a sliding hash function:
$c_i$ is the i-th byte of the resource stream, $k$ is the length of a Karp-Rabin block, and $b$ is the base (radix) of the system; the hash of a Karp-Rabin block is:
$$H(c_i \ldots c_{i+k-1}) = c_i \times b^{k-1} + c_{i+1} \times b^{k-2} + \cdots + c_{i+k-2} \times b + c_{i+k-1}$$
$b$ is a constant, and the incremental form of the function allows the hash at the next byte position to be computed as follows:
$$H(c_{i+1} \ldots c_{i+k}) = \bigl(H(c_i \ldots c_{i+k-1}) - c_i \times b^{k-1}\bigr) \times b + c_{i+k}$$
Selecting the hashes that represent the resource: a minimum and a maximum chunk size are both enforced, and after the chunk boundaries are selected, the larger chunks are hashed with the 64-bit MurmurHash method.
4. A network resource prefetching system based on data deduplication technology, characterized in that: a simulation system framework is added between the user path and the Web server; the simulation system framework includes a client and a proxy server end; the client can prefetch according to the user behavior of the client browser; the client is connected to the proxy server end, and the proxy server end is connected to the Web server;
The client includes 6 modules and 2 storage files:
The 6 modules are:
Read path module: reads the user's request sequence; its data structure is a first-in, first-out queue;
Prefetch management module: reads the access queue of the logging module, checks the prefetched-object pool, and confirms whether the resource has already been prefetched; if it has not, sends a request to the server; the prefetch management module creates multiple user request threads and waits for a new request; when a resource response is received from the server, the prefetch management module checks whether its URL is in the prefetch queue and, if so, removes the URL and inserts it into the prefetched-object pool; the prefetch module also checks whether the request queue is empty and, if it is, allows prefetch requests to be issued to the server until a new user request arrives; when a new client request is inserted into the queue, the prefetch management module deletes the hinted prefetch resources and empties the hint queue data store;
User request module: receives requests from the prefetch management module and passes them to the request module; when a resource response is received from the server, the user request module inserts the queue carried in the response header into the prefetch queue data store, and the URL of the resource is inserted into the prefetched-object pool;
Request module: connects to the Web server and handles the underlying communication;
CDM module: the client data deduplication module intercepts the HTTP request messages sent by the client's user request module or prefetch request module; it queries the resource version numbers to obtain the resource identifiers of all resources in the client cache, and CDM notifies the communication interception module;
Communication interception module: adds the custom header "X-vrs" for this information, attaches the header information to the HTTP request header, and passes it to the request module, which finally transmits it to the server;
The 2 storage files are:
Prefetch queue: the object information that the server tells the client needs to be prefetched;
Prefetched-object pool: stores all prefetched objects and acts like the user's browser cache;
The proxy server end includes:
Listening module: waits for connections from the client's thread queue on a given port number;
Server connection module: handles the connection between the client and the server end and passes the resource data of the server end to the client;
Communication interception module: intercepts the original HTTP response from the server end and stores it in a temporary buffer for preprocessing;
SDM module: the server data deduplication module performs the data splitting process on the message entity delivered by the communication interception module;
Communication recombination module: prepares and sends the deduplicated version of the response message; assembles the response header information: updates or creates Content-Length with the new entity message data length; adds two new header fields, the resource identifier and the metadata length;
Prediction engine module: each time a resource is requested, predicts the pages likely to be accessed, generates according to the prediction algorithm a series of URLs of the resources accessed most frequently in the recent period, and puts the result into the policy database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910251873.2A CN109981659B (en) | 2019-03-29 | 2019-03-29 | Network resource prefetching method and system based on data deduplication technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109981659A (en) | 2019-07-05 |
CN109981659B (en) | 2021-07-09 |
Family
ID=67081749
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910251873.2A Active CN109981659B (en) | 2019-03-29 | 2019-03-29 | Network resource prefetching method and system based on data deduplication technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109981659B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110609714A (en) * | 2019-07-31 | 2019-12-24 | 百度在线网络技术(北京)有限公司 | Data prefetching method, device and equipment and storage medium |
CN111586020A (en) * | 2020-04-29 | 2020-08-25 | 北京天融信网络安全技术有限公司 | Probability model construction method and device, electronic equipment and storage medium |
CN112953894A (en) * | 2021-01-26 | 2021-06-11 | 复旦大学 | Multi-path request copying and distributing system and method |
CN113064886A (en) * | 2021-03-04 | 2021-07-02 | 广州中国科学院计算机网络信息中心 | Method for storing and managing identification resources |
CN114221953A (en) * | 2021-11-29 | 2022-03-22 | 平安证券股份有限公司 | Resource acquisition method, device, equipment and storage medium |
CN114785858A (en) * | 2022-06-20 | 2022-07-22 | 武汉格蓝若智能技术有限公司 | Resource active caching method and device applied to mutual inductor online monitoring system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110320415A1 (en) * | 2010-06-23 | 2011-12-29 | International Business Machines Corporation | Piecemeal list prefetch |
CN105574004A (en) * | 2014-10-10 | 2016-05-11 | 阿里巴巴集团控股有限公司 | Webpage deduplication method and device |
CN106547764A (en) * | 2015-09-18 | 2017-03-29 | 北京国双科技有限公司 | The method and device of web data duplicate removal |
US20180081975A1 (en) * | 2016-09-21 | 2018-03-22 | Joseph DiTomaso | System and method for web content matching |
CN108769253A (en) * | 2018-06-25 | 2018-11-06 | 湖北工业大学 | A kind of adaptive prefetching control method of distributed system access performance optimization |
Non-Patent Citations (1)
Title |
---|
RICARDO FILIPE,JOAO BARRETO: "End-to-end data deduplication", 《2011 IEEE INTERNATIONAL SYMPOSIUM ON NETWORK COMPUTING AND APPLICATIONS》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110609714A (en) * | 2019-07-31 | 2019-12-24 | 百度在线网络技术(北京)有限公司 | Data prefetching method, device and equipment and storage medium |
CN111586020A (en) * | 2020-04-29 | 2020-08-25 | 北京天融信网络安全技术有限公司 | Probability model construction method and device, electronic equipment and storage medium |
CN111586020B (en) * | 2020-04-29 | 2021-09-10 | 北京天融信网络安全技术有限公司 | Probability model construction method and device, electronic equipment and storage medium |
CN112953894A (en) * | 2021-01-26 | 2021-06-11 | 复旦大学 | Multi-path request copying and distributing system and method |
CN113064886A (en) * | 2021-03-04 | 2021-07-02 | 广州中国科学院计算机网络信息中心 | Method for storing and managing identification resources |
CN113064886B (en) * | 2021-03-04 | 2023-08-29 | 广州中国科学院计算机网络信息中心 | Method for storing and marking management of identification resource |
CN114221953A (en) * | 2021-11-29 | 2022-03-22 | 平安证券股份有限公司 | Resource acquisition method, device, equipment and storage medium |
CN114785858A (en) * | 2022-06-20 | 2022-07-22 | 武汉格蓝若智能技术有限公司 | Resource active caching method and device applied to mutual inductor online monitoring system |
Also Published As
Publication number | Publication date |
---|---|
CN109981659B (en) | 2021-07-09 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109981659A (en) | Internet resources forecasting method and system based on data deduplication technology | |
JP4025379B2 (en) | Search system | |
US7024452B1 (en) | Method and system for file-system based caching | |
CN104468807B (en) | Carry out processing method, high in the clouds device, local device and the system of web cache | |
Cambazoglu et al. | Scalability challenges in web search engines | |
KR20030048045A (en) | A method for searching and analysing information in data networks | |
JP5705114B2 (en) | Information processing apparatus, information processing method, program, and web system | |
Saygin et al. | Exploiting data mining techniques for broadcasting data in mobile computing environments | |
Wu et al. | Prediction of web page accesses by proxy server log | |
Kucukyilmaz et al. | A machine learning approach for result caching in web search engines | |
Bhushan et al. | Recommendation of optimized web pages to users using Web Log mining techniques | |
CN115269631A (en) | Data query method, data query system, device and storage medium | |
CN107391555B (en) | Spark-Sql retrieval-oriented metadata real-time updating method | |
Nimishan et al. | An approach to improve the performance of web proxy cache replacement using machine learning techniques | |
Agrawal et al. | A survey on content based crawling for deep and surface web | |
Feng et al. | Markov tree prediction on web cache prefetching | |
Zhu et al. | Prediction algorithm based on web mining for multimedia objects in next–generation Digital Earth | |
Lee et al. | A proactive request distribution (prord) using web log mining in a cluster-based web server | |
Doraimani | Filecules: A new granularity for resource management in grids | |
Zhang et al. | SMURF: Efficient and Scalable Metadata Access for Distributed Applications from Edge to the Cloud | |
Temgire et al. | Review on web prefetching techniques | |
Chaudhari et al. | Proxy side web Prefetching scheme for efficient bandwidth usage: data mining approach | |
CN116680276A (en) | Data tag storage management method, device, equipment and storage medium | |
Li et al. | A hybrid cache and prefetch mechanism for scientific literature search engines | |
Du et al. | A web cache replacement strategy for safety-critical systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||