CN104408100B - Compression method for structured website logs - Google Patents

Info

Publication number
CN104408100B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410663256.0A
Other languages
Chinese (zh)
Other versions
CN104408100A (en)
Inventor
胡大祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING RONGHAI HENGXIN CONSULTING Co Ltd
Original Assignee
BEIJING RONGHAI HENGXIN CONSULTING Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention provides a compression method for structured website logs, characterized in that it comprises the following steps: a website log decomposition step, in which each record in the website log is decomposed into multiple fields according to its structure; an encoder establishment step, in which a corresponding encoder is generated for each field of the decomposed website log and a corresponding encoder lookup table is established; an encoding step, in which the website log is re-encoded using the encoders to obtain a FACT file, and the FACT file is analyzed to obtain a report file; and a decoding step, in which the report file obtained in the encoding step is decoded using the encoder lookup table obtained in the encoder establishment step, to obtain the final report file. The compression method of the present invention compresses website logs quickly and effectively without destroying the structure of the original log, and reduces the volume of log data that the analysis software must analyze.

Description

Compression method for structured website logs
Technical field
The present invention relates to data processing technology, and in particular to a compression method for structured website logs.
Background art
A website originally referred to a set of related web pages on the Internet, built with particular tools according to certain rules in order to present particular content. Put simply, a website is a communication tool: people can use it to publish information they want to make public, or to provide network services. People access a website through a web browser to obtain the information they need or to use network services. A website's performance is usually judged by factors such as its storage space, its location, its connection speed, its software configuration and the services it provides; the most direct measure is the website's actual traffic.
With the rapid development of Internet technology, website content has become increasingly rich and attractive. At the same time, the spread of computers and the rapid growth in the number of mobile Internet users have multiplied the ways of accessing websites, so that website visits, that is, actual traffic, have grown explosively. This rapid growth also brings problems for website administrators: insufficient space to display a growing volume of content, slower browsing as the number of visitors increases, and the question of how to improve service quality. Such problems can be solved by adding servers, hiring new staff and similar measures. As visits increase, however, analyzing and processing ever larger website logs has become a problem in its own right.
A website log is a file that records raw information such as the requests received and processed by the website server and run-time errors; strictly speaking, it is a server log. Its greatest value lies in recording how the website is operating, for example the state of its storage space and the access requests it has received. From the website log one can see clearly from which IP address, at what time, and with which operating system, browser and display resolution a user accessed which page of the website, and whether the access succeeded. Correct and effective analysis of website logs therefore makes it possible to discover website problems promptly and to analyze information such as visitors' browsing habits, so that the website can be continuously improved to suit those habits better.
At present, a website with about 10 million visits per day generates roughly 10 GB of logs daily, and a website with about 40 million visits per day generates roughly 50 GB. For larger websites, the daily log volume can exceed 100 GB.
Logs of this size make analysis difficult: ever longer analysis times and ever higher performance requirements are problems that website administrators have to face. In practice, however, a large part of the information in a website log is repeated, and it is wasteful for analysis software to repeat the same operations on identical information. If the log file is compressed before analysis, the workload of the analysis software is greatly reduced and efficiency improves.
At present, Hadoop is the main means of analyzing large-scale website logs. It runs on Java/Linux clusters, is built on the HDFS ultra-large-scale distributed file system, uses relatively inexpensive hardware, implements the Map/Reduce distributed computing model and is well suited to batch processing. Hadoop also has shortcomings, however: deployment is costly, requires many computers and is complex; experienced, highly skilled technical staff are scarce; computation relies on the MapReduce model, so the programming model is limited to a single paradigm; and ready-made modules for analyzing website logs are lacking.
Among existing compression techniques, ZIP compression cannot be used, because after compression the structural information of the log file is completely lost and the file can no longer be analyzed.
How to compress website logs effectively for analysis has therefore become a problem in urgent need of a solution.
Summary of the invention
The object of the present invention is to provide a compression method for structured website logs that compresses the website log, reduces the workload of the analysis software that analyzes it, and improves processing efficiency.
The compression method for structured website logs of the present invention is characterized in that it comprises the following steps:
A website log decomposition step: each record in the website log is decomposed into multiple fields according to its structure;
An encoder establishment step: a corresponding encoder is generated for each field of the decomposed website log, and a corresponding encoder lookup table is established;
An encoding step: the website log is re-encoded using the encoders to obtain a FACT file, and the FACT file is analyzed to obtain a report file;
A decoding step: the report file obtained in the encoding step is decoded using the encoder lookup table obtained in the encoder establishment step, to obtain the final report file.
The encoders of the encoder establishment step are implemented using Java hash tables;
The Java hash table is a combination of a group of hash tables, with capacity allocated to each hash table in advance.
The website log fields in the encoder lookup table obtained in the encoder establishment step are encoded again to obtain a secondary encoder lookup table.
The secondary encoder lookup table is encoded using BASE64 or MD5.
The compression method for structured website logs of the present invention compresses website logs quickly and effectively without destroying the structure of the original log, reduces the volume of log data that the analysis software must analyze, and makes the analysis of large-scale website logs faster and more effective.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of an embodiment of the compression method for structured website logs of the present invention;
Fig. 2 is a structural flow chart of an embodiment of the compression method for structured website logs of the present invention;
Fig. 3 is an encoding and compression flow chart of an embodiment of the compression method for structured website logs of the present invention;
Fig. 4 is a decoding and decompression flow chart of an embodiment of the compression method for structured website logs of the present invention;
Fig. 5 is a schematic diagram of a website log in an embodiment of the compression method for structured website logs of the present invention;
Fig. 6 is the encoder lookup table in an embodiment of the compression method for structured website logs of the present invention;
Fig. 7 is the FACT file in an embodiment of the compression method for structured website logs of the present invention;
Fig. 8 is the secondary encoder lookup table in an embodiment of the compression method for structured website logs of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a flow chart of an embodiment of the compression method for structured website logs of the present invention, Fig. 2 is a structural flow chart of the embodiment, Fig. 3 is its encoding and compression flow chart, and Fig. 4 is its decoding and decompression flow chart. As shown in Figs. 1 to 4, the compression method for structured website logs of this embodiment may comprise the following steps:
Website log decomposition step 101: each record in the website log is decomposed into multiple fields according to its structure.
Specifically, a website log is read and its content is parsed according to its pattern. Content with the same structure is assigned to one field, and every field of differing structure that occurs in the log is extracted; for example, the log is decomposed by structures such as date, time and IP address.
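The decomposition step can be sketched as follows. This is a minimal illustration, not the patented implementation: the space-delimited layout and the field names in FIELD_NAMES are assumptions made for the example, and real server logs declare their own layout.

```python
# Hypothetical record layout (an assumption for this sketch):
# date, time, client IP, method, URI, status.
FIELD_NAMES = ["date", "time", "c_ip", "method", "uri", "status"]


def decompose(record: str) -> dict:
    """Split one space-delimited log record into named fields."""
    values = record.split()
    if len(values) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(values)}"
        )
    return dict(zip(FIELD_NAMES, values))


# Sample record built from the field values quoted in the description.
record = "2012-07-09 16:12:36 14.113.241.249 GET /index.html 200"
fields = decompose(record)
```

Each key of the resulting dictionary corresponds to one field, so a separate encoder can later be attached to each key.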
Encoder establishment step 102: a corresponding encoder is generated for each field of the decomposed website log, and a corresponding encoder lookup table is established.
Specifically, encoders are established by means of Java hash tables for the differently structured fields obtained in website log decomposition step 101, and the corresponding encoder lookup table is generated;
Java hash tables are subject to rehashing: when the number of entries stored in a hash table exceeds a certain threshold, the table is reorganized at run time. This takes time, which can make the system lag, and while it runs it consumes about twice the memory of the original table. When processing large website logs, more time and memory are therefore spent on rehashing, and if the remaining space is insufficient the system may fail. The present invention solves this problem with a group of hash tables: the Java hash table is a combination of a group of hash tables, each with capacity allocated in advance. When the entries of one hash table reach its threshold, another hash table is combined with it, forming a new hash table with a larger threshold. Because the capacity of each hash table has already been allocated, rehashing no longer occurs.
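The grouping logic can be sketched as below. This is an assumption-laden model, not the patent's code: the class name GroupedTable and the parameter segment_capacity are hypothetical, and Python dictionaries cannot be pre-sized, so the sketch only shows how a full table is extended with another table instead of being rebuilt. In Java, each segment would be created with an initial capacity passed to the HashMap constructor.

```python
class GroupedTable:
    """A group of capped dictionaries that behaves like one lookup table."""

    def __init__(self, segment_capacity: int):
        self.segment_capacity = segment_capacity
        self.segments = [{}]  # start with a single segment

    def get(self, key):
        # Search every segment; a key lives in exactly one of them.
        for segment in self.segments:
            if key in segment:
                return segment[key]
        return None

    def put(self, key, value):
        # Update in place if the key is already stored somewhere.
        for segment in self.segments:
            if key in segment:
                segment[key] = value
                return
        # When the newest segment is full, add a segment instead of
        # growing (and rehashing) the existing one.
        if len(self.segments[-1]) >= self.segment_capacity:
            self.segments.append({})
        self.segments[-1][key] = value

    def __len__(self):
        return sum(len(segment) for segment in self.segments)


table = GroupedTable(segment_capacity=2)
for code, ip in enumerate(["10.0.0.1", "10.0.0.2", "10.0.0.3"]):
    table.put(ip, code)
```

The trade-off in this sketch is that a lookup may have to probe several segments, so the segment capacity should be large relative to the expected number of entries.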
Encoding step 103: the website log is re-encoded using the encoders to obtain a FACT file, and the FACT file is analyzed to obtain a report file.
Specifically, as shown in Fig. 3, the encoder lookup table obtained in encoder establishment step 102 is used to replace the corresponding content of the website log: every value in the log is replaced with its short code from the lookup table, yielding a new FACT file. This greatly reduces the volume of log data, improves the efficiency of log analysis and allows the analysis report file to be generated faster, while the compression does not distort the log content.
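A minimal sketch of the replacement follows. The integer codes, the function name encode_records and the in-memory list standing in for the FACT file are all assumptions; the actual FACT layout is the one shown in Fig. 7.

```python
def encode_records(records):
    """Replace every field value with a small integer code."""
    encoders = []   # one value -> code dictionary per field position
    fact_rows = []  # stands in for the FACT file in this sketch
    for record in records:
        values = record.split()
        while len(encoders) < len(values):
            encoders.append({})
        row = []
        for position, value in enumerate(values):
            encoder = encoders[position]
            # setdefault assigns the next free code the first time a
            # value is seen and returns the existing code afterwards.
            row.append(encoder.setdefault(value, len(encoder)))
        fact_rows.append(row)
    return encoders, fact_rows


records = [
    "2012-07-09 16:12:36 14.113.241.249",
    "2012-07-09 16:12:37 14.113.241.249",
    "2012-07-09 16:12:37 10.0.0.8",
]
encoders, fact_rows = encode_records(records)
```

Repeated values such as the date collapse to the same code in every row, which is where the size reduction comes from, and the row and column structure of the log is preserved.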
Decoding step 104: the report file obtained in encoding step 103 is decoded using the encoder lookup table obtained in encoder establishment step 102, to obtain the final report file.
Specifically, as shown in Fig. 4, when an encoded report file needs to be decoded, the entire report file is read and its structure is analyzed to determine the codes it contains. All encoder lookup tables are then read to find the original value corresponding to each code, and the original value is substituted at the position of the code. Each field of a record is decoded separately. Decoding stops once every code in the file has been replaced with its original value, and the final report file, which records the analysis results of the website log, is generated.
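Decoding is the inverse substitution and can be sketched as follows. The function name decode_rows, the integer codes and the sample tables are assumptions for illustration; the format of a real report file is not specified here.

```python
def decode_rows(coded_rows, encoders):
    """Replace each code with its original value, field by field."""
    # Invert every value -> code table into a code -> value table.
    decoders = [
        {code: value for value, code in encoder.items()}
        for encoder in encoders
    ]
    return [
        " ".join(decoders[position][code] for position, code in enumerate(row))
        for row in coded_rows
    ]


# One lookup table per field position (date, time, client IP).
encoders = [
    {"2012-07-09": 0},
    {"16:12:36": 0, "16:12:37": 1},
    {"14.113.241.249": 0},
]
coded_rows = [[0, 0, 0], [0, 1, 0]]
decoded = decode_rows(coded_rows, encoders)
```

Because each field has its own table, the same code can mean different things in different columns, which is why the position of a code must be known when decoding.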
Further, while a website log is being encoded, it must be determined at every moment whether a new value needs to be added to the encoder, so the whole encoder should be kept in memory where it can be read at any time. If the original values in the encoder lookup table are of variable length, however, the memory they will consume cannot be estimated, which may cause unpredictable problems when processing very large data sets. The encoder lookup table is therefore encoded again to solve this problem.
In this embodiment, the encoder lookup table is encoded again using the MD5 algorithm: each original value in the encoder lookup table is replaced with its MD5 value, yielding the secondary encoder lookup table. Because the secondary encoder lookup table stores only the MD5 values and the codes corresponding to the original log values, its entries have a fixed length and its size can be estimated.
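The effect of the second encoding can be sketched with Python's hashlib. The function name secondary_table and the sample values are hypothetical. Note one assumption that the text leaves open: MD5 is one-way, so the original values must still be stored somewhere (for example on disk) if the log is to be decoded later.

```python
import hashlib


def secondary_table(encoder):
    """Key the lookup table by the MD5 digest of each original value."""
    return {
        hashlib.md5(value.encode("utf-8")).hexdigest(): code
        for value, code in encoder.items()
    }


# Original values of very different lengths...
encoder = {"/index.html": 0, "/products/list?page=2&sort=price": 1}
# ...become keys of one fixed length (32 hexadecimal characters).
secondary = secondary_table(encoder)
```

With fixed-length keys, the memory needed for the in-memory table grows linearly with the number of distinct values, which is what makes its size predictable.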
Further, if logs from different days were always encoded with one and the same encoder, the encoder would grow ever larger and data processing would become ever less efficient. The encoder is therefore split by day, with a new encoder built each day. Although this increases memory consumption, it increases processing speed. Moreover, if one encoding run fails, only that day's data is affected and other days are not, so the problem is localized, which makes measures such as data recovery easier.
Further, if the website log is very large, the encoder for a single day can also become very large, so splitting the encoder by day alone is not enough and the encoder should be split further. In this embodiment it is split by field, that is, one encoder per field. Although this makes the encoding process more complex, it significantly improves the encoder's memory usage.
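Splitting by day and then by field amounts to a two-level index over small tables, sketched below. The names encoders and encode_value, the field name c_ip and the integer codes are assumptions made for this example.

```python
from collections import defaultdict

# encoders[day][field_name] is an independent value -> code table.
encoders = defaultdict(lambda: defaultdict(dict))


def encode_value(day, field_name, value):
    """Encode one value with the encoder of its day and field."""
    table = encoders[day][field_name]
    return table.setdefault(value, len(table))


a = encode_value("2012-07-09", "c_ip", "14.113.241.249")
b = encode_value("2012-07-09", "c_ip", "10.0.0.8")
# A new day starts from an empty encoder, so codes restart at 0.
c = encode_value("2012-07-10", "c_ip", "10.0.0.8")
```

In this sketch a damaged or finished day can be discarded on its own, for example with `del encoders["2012-07-09"]`, without touching the tables of other days.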
Further, the compression method for structured website logs of this embodiment can be applied to website logs generated by Apache Server, Microsoft IIS and Webtrends SDC.
The compression method for structured website logs of this embodiment allows large website log files to be encoded and compressed quickly and effectively without damaging the structure of the original file, which reduces the workload of the analysis software and improves analysis efficiency.
Specific embodiment:
Fig. 5 shows part of the website log of a certain website. The website log shown in Fig. 5 is encoded and compressed as follows.
According to website log decomposition step 101, this website log can be divided into fields such as 2012-07-09, 16:12:36 and 14.113.241.249.
According to encoder establishment step 102, the website log shown in Fig. 5 is analyzed and the corresponding encoder lookup table is established, as shown in Fig. 6.
In this embodiment, the Java hash table storing the encoder lookup table takes the form of a group of hash tables, and the capacity of each hash table is preset to 2,000,000 characters.
According to encoding step 103, the website log shown in Fig. 5 is encoded and compressed to obtain a new FACT file, as shown in Fig. 7.
Once the FACT file has been obtained, it is analyzed and a report file is generated to record the analysis results.
The encoder lookup table shown in Fig. 6, obtained in encoder establishment step 102, is encoded a second time using the MD5 algorithm, yielding the secondary encoder lookup table shown in Fig. 8.
When decoding is required, the report file is analyzed, all encoder lookup tables are read, and the codes in the file are replaced with the original values from the website log. Once all fields have been decoded, the original website log file, that is, the final report file, is obtained.
The compression method for structured website logs of this preferred embodiment compresses the website log substantially through encoding and makes analysis by the analysis software more efficient, while the encoded, compressed file can be restored to the original website log by decoding.
The compression method for structured website logs of the present invention compresses website logs quickly and effectively without destroying the structure of the original log, reduces the volume of log data that the analysis software must analyze, and makes the analysis of large-scale website logs faster and more effective.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or replace some of their technical features with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (1)

1. A compression method for structured website logs, characterized in that it comprises the following steps:
A website log decomposition step: each record in the website log is decomposed into multiple fields according to its structure;
An encoder establishment step: a corresponding encoder is generated for each field of the decomposed website log, and a corresponding encoder lookup table is established;
An encoding step: the website log is re-encoded using the encoders to obtain a FACT file, and the FACT file is analyzed to obtain a report file;
A re-encoding step: the website log fields in the encoder lookup table obtained in the encoder establishment step are encoded again using BASE64 or MD5 to obtain a secondary encoder lookup table;
A decoding step: the report file obtained in the encoding step is decoded using the encoder lookup table obtained in the encoder establishment step, to obtain the final report file;
Wherein the encoders of the encoder establishment step are implemented using Java hash tables;
The Java hash table is a combination of a group of hash tables, with capacity allocated to each hash table in advance; when the entries of one hash table reach its threshold, another hash table is combined with it, forming a new hash table with a larger threshold.
CN201410663256.0A (priority date 2014-11-19, filing date 2014-11-19): Compression method for structured website logs. Status: Active. Publication: CN104408100B (en)

Priority Applications (1)

Application Number    Priority Date    Filing Date    Title
CN201410663256.0A     2014-11-19       2014-11-19     Compression method for structured website logs

Publications (2)

Publication Number    Publication Date
CN104408100A          2015-03-11
CN104408100B          2018-04-27



GR01 Patent grant