CN104408100B - Compression method for structured website logs - Google Patents
Compression method for structured website logs
- Publication number
- CN104408100B (application CN201410663256.0A)
- Authority
- CN
- China
- Prior art keywords
- encoder
- log file
- web log
- file
- web
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present invention provides a compression method for structured website logs, comprising the following steps: a website log decomposition step, in which each record in the website log is decomposed into multiple fields according to its structure; an encoder establishment step, in which a corresponding encoder is generated for each field of the decomposed website log and a corresponding encoder lookup table is built; an encoding step, in which the website log is re-encoded using the encoders to obtain a FACT file, and the FACT file is analyzed to obtain a report file; and a decoding step, in which the report file obtained in the encoding step is decoded using the encoder lookup table obtained in the encoder establishment step, giving the final report file. The method compresses website logs quickly and effectively without destroying the structure of the original log, and reduces the volume of log data that the analysis software has to process.
Description
Technical field
The present invention relates to data processing technology, and in particular to a compression method for structured website logs.
Background art
A website originally referred to a collection of related web pages on the Internet, built with specific tools according to certain rules to present specific content. Put simply, a website is a communication tool: people can use it to publish information they want to make public, or to provide network services. People access a website through a web browser to obtain the information they need or to use network services. The performance of a website is usually judged by factors such as its storage space, its location, its connection speed, its software configuration, and the services it provides; the most direct measure is the website's actual traffic.
With the rapid development of Internet technology, website content has become increasingly rich and attractive. At the same time, the spread of computers and the rapid growth in mobile Internet users have multiplied the ways of accessing websites, so the number of visits, that is, the actual traffic, has grown explosively. This rapid growth also brings problems for website administrators: the website may lack the space to present more and more content, a growing number of users may slow down browsing, and service quality has to be improved. Such problems can be solved by adding servers, hiring new staff, and similar measures. However, as visits increase, analyzing and processing ever larger website logs has itself become a problem.
A website log is a file that records the raw information handled by a website server, such as received requests and runtime errors; strictly speaking, it is a server log. Its greatest value is that it records how the website is running, for example the state of the hosting space and the requests received. From the website log one can see clearly from which IP address, at what time, and with which operating system, browser, and display resolution a user accessed which page of the website, and whether the access succeeded. Correct and effective analysis of website logs therefore reveals website problems promptly, and also allows users' browsing habits to be analyzed so that the website can be continually improved to suit them better.
At present, for example, a website with about 10 million visits per day generates roughly 10 GB of website logs daily, and a website with 40 million visits per day generates about 50 GB. For larger websites, the daily log volume can exceed 100 GB.
Such large volumes of log data create problems for analysis software: ever longer analysis times and ever higher performance requirements are issues that website administrators have to face. In practice, however, much of the information in a website log is repeated, and having the analysis software repeat the same operations on identical information is wasted effort. If the log file is compressed before analysis, the workload of the analysis software is greatly reduced and efficiency improves.
At present, Hadoop is the main means of analyzing large-scale website logs. It runs on Java/Linux clusters, builds on the very-large-scale distributed file system HDFS, works with relatively inexpensive hardware, implements the Map/Reduce distributed computing model, and is suited to batch processing. Hadoop also has shortcomings: deployment cost is high, a large number of computers is required, deployment is complex, experienced technical staff are scarce, computation is restricted to the single MapReduce programming model, and there are no ready-made modules for analyzing website logs.
Among existing compression techniques, ZIP compression cannot be used, because after compression the structural information of the log file is completely lost and the file can no longer be analyzed.
How to compress website logs effectively for analysis has therefore become a problem that urgently needs solving.
Summary of the invention
The object of the present invention is to provide a compression method for structured website logs that compresses the logs, reduces the workload of the analysis software when analyzing them, and improves processing efficiency.
The compression method for structured website logs of the present invention comprises the following steps:
a website log decomposition step, in which each record in the website log is decomposed into multiple fields according to its structure;
an encoder establishment step, in which a corresponding encoder is generated for each field of the decomposed website log and a corresponding encoder lookup table is built;
an encoding step, in which the website log is re-encoded using the encoders to obtain a FACT file, and the FACT file is analyzed to obtain a report file;
a decoding step, in which the report file obtained in the encoding step is decoded using the encoder lookup table obtained in the encoder establishment step, giving the final report file.
The encoders of the encoder establishment step are implemented with Java hash tables.
Each Java hash table is a group of hash tables, and capacity is allocated to every hash table in the group in advance.
The website log fields in the encoder lookup table obtained in the encoder establishment step are encoded again to obtain a secondary encoder lookup table.
The secondary encoder lookup table is encoded using BASE64 or MD5.
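As a rough illustration of the BASE64 option, the sketch below encodes and decodes a single field value with the standard `java.util.Base64` class. The class name `Base64FieldCodec` and the sample IP address are assumptions made for this example; the patent does not specify how the encoding is applied.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: the class and method names are illustrative, not from the patent.
final class Base64FieldCodec {
    // Encodes one original field value as a Base64 string.
    static String encode(String value) {
        return Base64.getEncoder().encodeToString(value.getBytes(StandardCharsets.UTF_8));
    }

    // Restores the original field value from its Base64 form.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("14.113.241.249");
        System.out.println(encoded + " -> " + decode(encoded));
    }
}
```

Base64 is reversible but its output length still varies with the input, whereas an MD5 digest always has the same length.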
The compression method for structured website logs of the present invention compresses website logs quickly and effectively without destroying the structure of the original log, reduces the volume of log data that the analysis software has to process, and makes the analysis of large-scale website logs faster and more effective.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of an embodiment of the compression method for structured website logs of the present invention;
Fig. 2 is a structural flow chart of the embodiment;
Fig. 3 is a flow chart of encoding and compression in the embodiment;
Fig. 4 is a flow chart of decoding and decompression in the embodiment;
Fig. 5 is a schematic diagram of a website log in the embodiment;
Fig. 6 is the encoder lookup table of the embodiment;
Fig. 7 is the FACT file of the embodiment;
Fig. 8 is the secondary encoder lookup table of the embodiment.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a flow chart of an embodiment of the compression method for structured website logs of the present invention, Fig. 2 is a structural flow chart of the embodiment, Fig. 3 is a flow chart of its encoding and compression, and Fig. 4 is a flow chart of its decoding and decompression. As shown in Figs. 1 to 4, the compression method of this embodiment may include the following steps:
Website log decomposition step 101: each record in the website log is decomposed into multiple fields according to its structure.
Specifically, a website log is read and its content is parsed according to its pattern. Content with the same structure is assigned to one field, and every differently structured field that occurs in the log is separated out, for example by date, time, and IP address.
Encoder establishment step 102: a corresponding encoder is generated for each field of the decomposed website log, and the corresponding encoder lookup table is built.
Specifically, for the differently structured fields obtained in website log decomposition step 101, encoders are built using Java hash tables, and the corresponding encoder lookup table is generated.
Java hash tables are subject to rehashing: when the number of entries stored in a hash table exceeds a certain threshold, the Java Virtual Machine rearranges the hash table. The rearrangement takes time to complete, which can make the system lag, and while it runs it consumes about twice the memory of the original hash table. When processing large volumes of log data, more time and memory are therefore spent on rearrangement, and if the remaining memory is insufficient the system may fail. The present invention solves this problem with a group of hash tables: the Java hash table is a group of hash tables, each with capacity allocated in advance. When the entries of one hash table reach its threshold, another hash table is combined with it to form a new hash table with a larger threshold. Because the capacity of each hash table has already been allocated, rehashing no longer occurs.
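One possible reading of the grouped hash table is sketched below. It is an assumption-laden illustration, not the patented implementation: the class name `GroupedDictionary`, the integer codes, and the choice to keep the tables in a list and search them in order are all hypothetical. Each segment is a `java.util.HashMap` created with enough initial capacity that it never rehashes before the segment limit is reached.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a dictionary backed by a group of pre-sized hash tables.
final class GroupedDictionary {
    private final int segmentLimit;                       // max entries per hash table
    private final List<Map<String, Integer>> segments = new ArrayList<>();
    private int nextCode = 0;

    GroupedDictionary(int segmentLimit) {
        this.segmentLimit = segmentLimit;
        addSegment();
    }

    private void addSegment() {
        // Size the table so that segmentLimit entries stay within the default
        // 0.75 load factor, which means this segment is never resized.
        int capacity = (int) (segmentLimit / 0.75f) + 1;
        segments.add(new HashMap<>(capacity));
    }

    // Returns the existing code for a value, or assigns the next free code.
    int encode(String value) {
        for (Map<String, Integer> segment : segments) {
            Integer code = segment.get(value);
            if (code != null) {
                return code;
            }
        }
        Map<String, Integer> last = segments.get(segments.size() - 1);
        if (last.size() >= segmentLimit) {
            addSegment();                                  // extend the group instead of rehashing
            last = segments.get(segments.size() - 1);
        }
        last.put(value, nextCode);
        return nextCode++;
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        GroupedDictionary dictionary = new GroupedDictionary(2);
        for (String value : new String[] {"a", "b", "c", "a"}) {
            System.out.println(value + " -> " + dictionary.encode(value));
        }
    }
}
```

The trade-off in this sketch is that a lookup may probe several segments, in exchange for never copying an existing table.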
Encoding step 103: the website log is re-encoded using the encoders to obtain a FACT file, and the FACT file is analyzed to obtain a report file.
Specifically, as shown in Fig. 3, the encoder lookup table obtained in encoder establishment step 102 is used to replace the corresponding content of the website log: all content in the log is replaced with the short codes from the lookup table, yielding a new FACT file. The volume of log data is thereby greatly reduced, which improves the efficiency of log analysis and allows the analysis report file to be generated more quickly, while the compression does not distort the log content.
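The replacement of field values by short codes can be sketched as below. The FACT file layout is not given in the text, so the space-separated integer codes, the class name `FactEncoder`, and the use of plain `HashMap` dictionaries are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one dictionary per field position, integer codes as output.
final class FactEncoder {
    // dictionaries.get(i) maps the original values of field i to their codes.
    private final List<Map<String, Integer>> dictionaries = new ArrayList<>();

    // Encodes one whitespace-delimited record into a line of codes.
    String encode(String record) {
        String[] fields = record.trim().split("\\s+");
        StringBuilder fact = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            while (dictionaries.size() <= i) {
                dictionaries.add(new HashMap<>());
            }
            Map<String, Integer> dictionary = dictionaries.get(i);
            Integer code = dictionary.get(fields[i]);
            if (code == null) {
                code = dictionary.size();                  // next unused code for this field
                dictionary.put(fields[i], code);
            }
            if (i > 0) {
                fact.append(' ');
            }
            fact.append(code);
        }
        return fact.toString();
    }

    public static void main(String[] args) {
        FactEncoder encoder = new FactEncoder();
        System.out.println(encoder.encode("2012-07-09 16:12:36 14.113.241.249"));
        System.out.println(encoder.encode("2012-07-09 16:12:37 14.113.241.249"));
    }
}
```

Repeated values such as the date and IP address map to the same code on every record, which is where the size reduction comes from.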
Decoding step 104: the report file obtained in encoding step 103 is decoded using the encoder lookup table obtained in encoder establishment step 102, giving the final report file.
Specifically, as shown in Fig. 4, when an encoded report file needs to be decoded, the whole report file is read and its structure is analyzed to determine the codes it contains. All encoder lookup tables are then read to find the original value corresponding to each code, and the original value is substituted at the position of the code. Each field of a record is decoded separately. Decoding stops once every code in a website log has been replaced with its original value, and the final report file, which records the analysis results of the website log, is generated.
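A minimal sketch of the per-field decoding follows. It assumes the same hypothetical layout as an integer-coded, space-separated line, with one reverse lookup table (code to original value) per field position; the class name `FactDecoder` is illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: decodes a line of integer codes back to original values.
final class FactDecoder {
    // tables.get(i) maps the codes of field i back to their original values.
    static String decode(String encoded, List<Map<Integer, String>> tables) {
        String[] codes = encoded.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < codes.length; i++) {
            String value = tables.get(i).get(Integer.parseInt(codes[i]));
            if (value == null) {
                throw new IllegalArgumentException("unknown code " + codes[i] + " in field " + i);
            }
            if (i > 0) {
                out.append(' ');
            }
            out.append(value);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> dates = new HashMap<>();
        dates.put(0, "2012-07-09");
        Map<Integer, String> times = new HashMap<>();
        times.put(0, "16:12:36");
        times.put(1, "16:12:37");
        System.out.println(decode("0 1", Arrays.asList(dates, times)));
    }
}
```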
Further, while the website log is being encoded, the encoder must determine at every moment whether a new value needs to be encoded, so the whole encoder has to stay in memory where it can be read at any time. However, the original values in the encoder lookup table have variable length, so the memory they will consume cannot be estimated when they are stored, which may cause unpredictable problems when processing very large data sets. The encoder lookup table is therefore encoded again to solve this problem.
In this embodiment the encoder lookup table is encoded again using the MD5 algorithm: each original value in the encoder lookup table is replaced with its MD5 value, giving the secondary encoder lookup table. Because the secondary encoder lookup table stores only MD5 values and the codes corresponding to the original log values, its entries have a fixed length and its size can be estimated.
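The fixed-length property can be illustrated with the standard `java.security.MessageDigest` API. The class name `Md5Key` and the hexadecimal rendering are assumptions for this sketch; the patent does not say how the MD5 values are stored.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: turns an original field value into a fixed-length MD5 key.
final class Md5Key {
    // Returns the MD5 digest of the value as 32 lowercase hexadecimal characters.
    static String md5Hex(String value) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("14.113.241.249").length());
        System.out.println(md5Hex("/a/much/longer/request/path?with=parameters").length());
    }
}
```

Every key is 16 bytes (32 hex characters) regardless of the input length, so memory per entry is predictable. MD5 is a one-way digest, so the original values still have to be kept in the first-level lookup table for decoding.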
Further, if the same encoder were always used to encode the website logs of different days, the encoder would grow larger and larger and data processing would become less and less efficient. The encoder is therefore split by day, with a new encoder built each day. Although this increases memory consumption, it improves processing speed. In addition, if an encoding run fails, only that day's data is affected and other days are not; the problem is localized, which makes remedial measures such as data repair easier.
Further, if the website log is very large, the encoder for a single day can also be very large, so splitting the encoder by day alone is not enough and the encoder should be split further. In this embodiment it is split by field, that is, one encoder per field. Although this makes the encoding process considerably more complex, it significantly improves the encoder's memory usage.
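Splitting the encoder by day and then by field can be sketched with nested maps. The class name `PartitionedEncoders`, the field names, and the `dropDay` helper are hypothetical; they only illustrate that each (day, field) pair gets its own independent dictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: day -> field name -> (original value -> code).
final class PartitionedEncoders {
    private final Map<String, Map<String, Map<String, Integer>>> byDay = new HashMap<>();

    // Encodes a value using the dictionary that belongs to the given day and field.
    int encode(String day, String field, String value) {
        Map<String, Integer> dictionary = byDay
                .computeIfAbsent(day, d -> new HashMap<>())
                .computeIfAbsent(field, f -> new HashMap<>());
        Integer code = dictionary.get(value);
        if (code == null) {
            code = dictionary.size();                      // codes restart at 0 per day and field
            dictionary.put(value, code);
        }
        return code;
    }

    // Discards one day's encoders, for example after that day's log has been processed.
    void dropDay(String day) {
        byDay.remove(day);
    }

    public static void main(String[] args) {
        PartitionedEncoders encoders = new PartitionedEncoders();
        System.out.println(encoders.encode("2012-07-09", "ip", "14.113.241.249"));
        System.out.println(encoders.encode("2012-07-10", "ip", "14.113.241.249"));
    }
}
```

Because each day's dictionaries are independent, a failed run for one day can be redone without touching the codes of any other day.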
Further, the compression method for structured website logs of this embodiment can be applied to website logs generated by Apache servers, Microsoft IIS, and the Webtrends SDC.
The compression method for structured website logs of this embodiment allows large website log files to be encoded and compressed quickly and effectively without damaging the original file structure, reduces the workload of the analysis software, and improves analysis efficiency.
Specific embodiment:
Fig. 5 shows part of the website log of a certain website. The website log shown in Fig. 5 is encoded and compressed as follows.
According to website log decomposition step 101, this website log can be divided into fields such as 2012-07-09, 16:12:36, and 14.113.241.249.
According to encoder establishment step 102, the website log shown in Fig. 5 is analyzed and the corresponding encoder lookup table is built, as shown in Fig. 6.
In this embodiment, the Java hash table that stores the encoder lookup table takes the form of a group of hash tables, and the preset capacity of each hash table is 2,000,000 characters.
According to encoding step 103, the website log shown in Fig. 5 is encoded and compressed to obtain a new FACT file, as shown in Fig. 7.
After the FACT file is obtained, it is analyzed and a report file is generated to record the analysis results.
The encoder lookup table shown in Fig. 6, obtained in encoder establishment step 102, is encoded a second time using the MD5 algorithm, giving the secondary encoder lookup table shown in Fig. 8.
When the decoding step is needed, the report file is analyzed, all encoder lookup tables are read, and the codes in the file are replaced with the original values from the website log. Once all fields have been decoded, the original website log file, that is, the final report file, is obtained.
The compression method for structured website logs of this preferred embodiment encodes and compresses website logs substantially, making the analysis performed by the analysis software more efficient, while the encoded and compressed file can be restored to the original website log by the decoding operation.
The compression method for structured website logs of the present invention compresses website logs quickly and effectively without destroying the structure of the original log, reduces the volume of log data that the analysis software has to process, and makes the analysis of large-scale website logs faster and more effective.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art will understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (1)
1. A compression method for structured website logs, characterized by comprising the following steps:
a website log decomposition step, in which each record in the website log is decomposed into multiple fields according to its structure;
an encoder establishment step, in which a corresponding encoder is generated for each field of the decomposed website log and a corresponding encoder lookup table is built;
an encoding step, in which the website log is re-encoded using the encoders to obtain a FACT file, and the FACT file is analyzed to obtain a report file;
a re-encoding step, in which the website log fields in the encoder lookup table obtained in the encoder establishment step are encoded again using BASE64 or MD5 to obtain a secondary encoder lookup table;
a decoding step, in which the report file obtained in the encoding step is decoded using the encoder lookup table obtained in the encoder establishment step, giving the final report file;
wherein the encoders of the encoder establishment step are implemented with Java hash tables;
the Java hash table is a group of hash tables, with capacity allocated to each hash table in advance; when the entries of one hash table reach its threshold, another hash table is combined with it, forming a new hash table with a larger threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410663256.0A CN104408100B (en) | 2014-11-19 | 2014-11-19 | Compression method for structured website logs |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410663256.0A CN104408100B (en) | 2014-11-19 | 2014-11-19 | Compression method for structured website logs |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104408100A CN104408100A (en) | 2015-03-11 |
CN104408100B true CN104408100B (en) | 2018-04-27 |
Family
ID=52645731
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410663256.0A Active CN104408100B (en) | 2014-11-19 | 2014-11-19 | Compression method for structured website logs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104408100B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105117403B (en) * | 2015-07-16 | 2019-10-11 | 中国人民大学 | Daily record data fragment and querying method and device |
CN106055452B (en) * | 2016-05-25 | 2019-06-14 | 北京百度网讯科技有限公司 | The method and apparatus for creating interchanger log template |
CN106354617B (en) * | 2016-08-29 | 2019-04-12 | 广州华多网络科技有限公司 | Program compaction journal file output method and device |
CN107241394A (en) * | 2017-05-24 | 2017-10-10 | 努比亚技术有限公司 | A kind of log transmission method, device and computer-readable recording medium |
CN107391583B (en) * | 2017-06-23 | 2020-07-28 | 微梦创科网络科技(中国)有限公司 | Method and system for converting website login log information into vectorized data |
CN109901978A (en) * | 2017-12-08 | 2019-06-18 | 航天信息股份有限公司 | A kind of Hadoop log lossless compression method and system |
CN108133033B (en) * | 2018-01-08 | 2020-06-12 | 武汉斗鱼网络科技有限公司 | Method and device for data storage and computer equipment |
CN109885549A (en) * | 2019-03-04 | 2019-06-14 | 安克创新科技股份有限公司 | A kind of log collecting method, device, system and computer storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1842021A (en) * | 2005-03-28 | 2006-10-04 | 华为技术有限公司 | Log information storage method |
CN103379136A (en) * | 2012-04-17 | 2013-10-30 | 中国移动通信集团公司 | Compression method and decompression method of log acquisition data, compression apparatus and decompression apparatus of log acquisition data |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101540964B (en) * | 2008-03-18 | 2011-09-28 | 中国移动通信集团公司 | Method and system for sending updated parameter and device to be updated |
Also Published As
Publication number | Publication date |
---|---|
CN104408100A (en) | 2015-03-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||