CN102880703B - Chinese web page data encoding, coding/decoding method and system - Google Patents


Info

Publication number
CN102880703B
Authority
CN
China
Prior art keywords
web page
page data
chinese web
unicode
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210361682.XA
Other languages
Chinese (zh)
Other versions
CN102880703A (en
Inventor
梁捷
俞永福
何小鹏
朱顺炎
田文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Dongjing Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Dongjing Computer Technology Co Ltd filed Critical Guangzhou Dongjing Computer Technology Co Ltd
Priority to CN201210361682.XA priority Critical patent/CN102880703B/en
Publication of CN102880703A publication Critical patent/CN102880703A/en
Application granted granted Critical
Publication of CN102880703B publication Critical patent/CN102880703B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for encoding Chinese web page data, comprising: starting from the first character of the Chinese web page data currently being processed, performing word segmentation according to a preset dictionary to determine whether there is a segment that starts with this first character and matches a word in the dictionary; when such a matching segment exists, replacing the segment with the Unicode code point corresponding to the matched word, or, when no such segment exists, replacing the first character with its own Unicode code point; and removing the part that has been replaced from the data currently being processed, taking the remainder as the next data to be processed, and repeating the above process until the Chinese web page data has been completely replaced by a stream of Unicode code points. The method reduces the space occupied by the encoded data stream, and thus the storage space and transmission volume of Chinese web page data.

Description

Chinese web page data encoding and decoding methods and systems
Technical field
The present invention relates to the field of mobile communications, and more specifically to a method and device for encoding Chinese web page data, a server having the encoding device, a method and device for decoding Chinese web page data, and a mobile terminal having the decoding device.
Background art
To reduce users' data traffic, when web page content is transmitted from a server to a browser client on a mobile terminal, the browser's back-end server compresses the web page before transmission. The compression algorithms currently used by servers are usually based on LZ77, for example the LZ77 and LZMA compression algorithms, and use compressed formats such as gzip and 7zip. The web page http://en.wikipedia.org/wiki/LZ77 describes the LZ77 compression algorithm, and the web page http://en.wikipedia.org/wiki/Lempel–Ziv–Markov_chain_algorithm describes the LZMA compression algorithm. The content disclosed on these web pages is incorporated into this application by reference.
The basic principle of these compression algorithms is to find repeated character strings in the text, build a "dictionary" of the repeated strings, and replace each such string in the output with its dictionary index. The dictionary does not need to be transmitted with the encoded strings; the decompressor can rebuild the original strings by running the inverse of the algorithm.
Fig. 1 shows a flowchart of the LZW compression algorithm.
As shown in Fig. 1, the dictionary is first initialized to contain all strings of length 1 (step S110). Next, the longest string W in the dictionary that matches the current input is found (step S120). W is then replaced in the output with its dictionary index and removed from the input (step S130), and W together with the character that follows it in the input is added to the dictionary (step S140). The process then returns to step S120 and repeats until the input is empty.
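The steps S110 to S140 can be illustrated with a minimal Python sketch. It assumes byte-level input and returns plain integer indices rather than a packed bit stream; the function name `lzw_compress` is hypothetical and does not come from the patent.

```python
def lzw_compress(data: bytes) -> list:
    """Return the list of dictionary indices produced by LZW for `data`."""
    # Step S110: the dictionary initially holds every string of length 1.
    dictionary = {bytes([i]): i for i in range(256)}
    output = []
    w = b""
    for byte in data:
        candidate = w + bytes([byte])
        if candidate in dictionary:
            # Step S120: keep extending W while it still matches the dictionary.
            w = candidate
        else:
            # Step S130: emit the index of the longest match W.
            output.append(dictionary[w])
            # Step S140: add W plus the following character to the dictionary.
            dictionary[candidate] = len(dictionary)
            w = bytes([byte])
    if w:
        # Flush the final match once the input is empty.
        output.append(dictionary[w])
    return output
```

Because the dictionary only grows as the input is scanned, the first few indices always refer to single bytes, which is the weakness for short documents discussed in the background section.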
The LZW algorithm is language-agnostic: because it identifies repeated patterns at the byte level, it can be applied to compressing Chinese web pages, but for the same reason it cannot exploit the characteristics of the language itself. Semantically, for example, Chinese text is made up of relatively fixed words, and the algorithm does not take this into account. As a compression method, the algorithm depends on repeated patterns in the text; if a text contains no repeated patterns, or few repeated strings, the algorithm fails or compresses poorly. Moreover, because repeated patterns are identified gradually as the text is scanned, only short patterns are recognized at first and longer ones only later. The compression ratio for the initial part of a document is therefore poor, which is a disadvantage when compressing short web pages. According to rough statistics on news web pages, the compression ratio of the body text of Chinese web pages is between 60% and 90% (a smaller ratio means better compression), clearly worse than that of js files, css files, html tags and other content composed of English.
Summary of the invention
In view of the above problems, one object of the present invention is to provide a method and device for encoding Chinese web page data. The method and device encode Chinese web page content using Unicode code points that are allocated, from the private use areas or the reserved space of the Unicode code point space, to each word in a preset dictionary, thereby improving the compression efficiency of Chinese web page data.
Another object of the present invention is to provide an intermediate server having the above Chinese web page data encoding device.
Another object of the present invention is to provide a method and device for decoding Chinese web page data that can decode a Unicode stream encoded as above to recover the original Chinese web page data.
Another object of the present invention is to provide a mobile terminal having the above Chinese web page data decoding device.
According to one aspect of the present invention, a method for encoding Chinese web page data is provided, comprising: starting from the first character of the obtained Chinese web page data to be compressed, repeating the following process until all of the obtained Chinese web page data has been replaced by Unicode code points: starting from the first character of the Chinese web page data currently being processed, performing word segmentation on the data according to a preset dictionary, to determine whether there is a segment that starts with this first character and matches a word in the preset dictionary; when such a segment exists, replacing the segment, in the Chinese web page data currently to be compressed, with the Unicode code point corresponding to the matched word, or, when no such segment exists, replacing the first character, in the Chinese web page data currently to be compressed, with the Unicode code point of that character; and removing the part that has been replaced by a Unicode code point from the Chinese web page data currently being processed, and taking the remainder as the next Chinese web page data to be processed.
In one or more examples of the above aspect, each word in the dictionary is pre-allocated one Unicode code point in the private use areas or the reserved space of the Unicode code point space.
In one or more examples of the above aspect, the segment that is determined to match a word in the dictionary and to start with the first character of the data currently being processed is the longest segment starting with that first character that matches a word in the dictionary.
In one or more examples of the above aspect, the words in the dictionary are ordered by word frequency and are allocated Unicode code points in that order, wherein code points in the private use areas are allocated first, and code points in the reserved space are allocated after the code points in the private use areas have been fully allocated.
In one or more examples of the above aspect, the private use areas comprise one private use area in the basic plane and two private use areas in the supplementary planes; a code point in the private use area of the basic plane occupies three bytes, and a code point in a private use area of the supplementary planes occupies four bytes; code points in the private use area of the basic plane are allocated to the words first, and code points in the private use areas of the supplementary planes are allocated only after the code points in the private use area of the basic plane have been fully allocated.
In one or more examples of the above aspect, the code points in the reserved space are allocated in order from back to front.
In one or more examples of the above aspect, the Chinese web page data is transmitted in UTF-8 format.
According to another aspect of the present invention, a device for encoding Chinese web page data is provided, comprising: a word segmentation unit for performing, starting from the first character of the Chinese web page data currently being processed, word segmentation on the data according to a preset dictionary, to determine whether there is a segment that starts with this first character and matches a word in the preset dictionary; an encoding unit for replacing the segment, in the Chinese web page data currently to be compressed, with the Unicode code point corresponding to the matched word when such a segment exists, or for replacing the first character with its own Unicode code point when no such segment exists; and a current-data updating unit for removing the part that has been replaced by a Unicode code point from the Chinese web page data currently being processed and taking the remainder as the next Chinese web page data to be processed, wherein, starting from the first character of the obtained Chinese web page data to be compressed, the processing of the word segmentation unit, the encoding unit and the current-data updating unit is repeated until all of the obtained Chinese web page data has been replaced by Unicode code points.
According to another aspect of the present invention, an intermediate server is provided, comprising the Chinese web page data encoding device described above.
According to another aspect of the present invention, a method for decoding Chinese web page data is provided, comprising: receiving, from an intermediate server, a Unicode code point stream encoded by the Chinese web page data encoding method described above; and decoding the received stream into the corresponding Chinese web page data according to a dictionary preset in the mobile terminal, the dictionary preset in the mobile terminal being identical to the dictionary preset in the intermediate server.
According to another aspect of the present invention, a device for decoding Chinese web page data is provided, comprising: a receiving unit for receiving, from an intermediate server, a Unicode code point stream encoded by the Chinese web page data encoding method described above; and a decoding unit for decoding the received stream into the corresponding Chinese web page data according to a dictionary preset in the decoding device, the dictionary preset in the decoding device being identical to the dictionary preset in the intermediate server.
According to another aspect of the present invention, a mobile terminal is provided, comprising the Chinese web page data decoding device described above.
With the Chinese web page data encoding method of the present invention, Chinese web page content can be encoded using the Unicode code points that are allocated, from the private use areas or the reserved space of the Unicode code point space, to each word in the preset dictionary. This reduces the space occupied by the encoded data stream, and thus the storage space and transmission volume of Chinese web page data.
To achieve the above and related objects, one or more aspects of the present invention comprise the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings set out certain illustrative aspects of the invention in detail. These aspects, however, indicate only some of the various ways in which the principles of the invention may be used. Furthermore, the invention is intended to include all such aspects and their equivalents.
Brief description of the drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following detailed description given with reference to the accompanying drawings. In the drawings:
Fig. 1 shows a flowchart of the compression process based on the LZW compression algorithm;
Fig. 2 shows a flowchart of the Chinese web page data encoding process according to the present invention;
Fig. 3 shows a flowchart of one example of performing word segmentation on the Chinese web page data to be processed according to the present invention;
Fig. 4 shows the Chinese web page data before encoding in one example of the Chinese web page data encoding process according to the present invention;
Fig. 5 illustrates word segmentation of the Chinese web page data in Fig. 4;
Fig. 6 shows the result obtained after the above word segmentation;
Fig. 7 shows a block diagram of the Chinese web page data encoding device according to the present invention;
Fig. 8 shows a block diagram of the intermediate server according to the present invention;
Fig. 9 shows a flowchart of the Chinese web page data decoding method according to the present invention;
Fig. 10 shows a block diagram of the Chinese web page data decoding device according to the present invention; and
Fig. 11 shows a block diagram of the mobile terminal according to the present invention.
The same reference numerals indicate similar or corresponding features or functions throughout the drawings.
Detailed description of the embodiments
Various aspects of the present disclosure are described below. It should be understood that the teachings herein may be embodied in a wide variety of forms, and that any specific structure, function, or both disclosed herein is merely representative. Based on the teachings herein, those skilled in the art should understand that an aspect disclosed herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, a device may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, such a device may be implemented or such a method may be practiced using other structures or functions, or structures and functions in addition to or other than one or more of the aspects set forth herein. Furthermore, any aspect described herein may comprise at least one element of a claim.
Before the embodiments of the invention are described, the Unicode standard used in the invention is briefly introduced.
Unicode, also known as the unified code, universal code, or single code, is an industry standard in computer science. It organizes and encodes most of the world's writing systems so that computers can present and process text more simply.
According to the Unicode specification, Unicode defines 1,114,112 code points in the range 0 to 0x10FFFF. This space is divided into 17 planes, numbered 0 to 16. Plane 0 is called the basic plane and covers 0000-FFFF; planes 1 to 16 are called supplementary planes and cover 10000-10FFFF.
In addition, according to the usage specified by the Unicode standard, Unicode code points are divided into public space, private use space, and reserved space. Public space is used by the standard to encode the scripts of the world, private use space can be used freely by private organizations, and reserved space is space that is not yet in use.
According to the Unicode standard, the private use space is divided into three areas: the private use area of the basic plane, Private Use Area: U+E000..U+F8FF (6,400 characters); a first private use area in the supplementary planes, Supplementary Private Use Area-A: U+F0000..U+FFFFD (65,534 characters); and a second private use area in the supplementary planes, Supplementary Private Use Area-B: U+100000..U+10FFFD (65,534 characters). In addition, a code point in the basic plane (0000-FFFF) occupies at most 3 bytes in UTF-8, and a code point in the supplementary planes (10000-10FFFF) occupies 4 bytes.
The reserved space is: Unassigned: 30000-DFFFF (720,896 characters).
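The range sizes and UTF-8 byte lengths quoted above can be checked with a short Python sketch. The constant and function names (`BMP_PUA`, `range_size`, `utf8_len`, and so on) are illustrative assumptions; only the numeric bounds come from the text.

```python
# Inclusive code point ranges quoted in the description.
BMP_PUA = (0xE000, 0xF8FF)        # private use area of the basic plane
SUPP_PUA_A = (0xF0000, 0xFFFFD)   # Supplementary Private Use Area-A
SUPP_PUA_B = (0x100000, 0x10FFFD) # Supplementary Private Use Area-B
RESERVED = (0x30000, 0xDFFFF)     # unassigned space used as a fallback


def range_size(bounds):
    """Number of code points in an inclusive (low, high) range."""
    low, high = bounds
    return high - low + 1


def utf8_len(code_point):
    """Number of bytes the code point occupies when encoded as UTF-8."""
    return len(chr(code_point).encode("utf-8"))


# Total private use capacity available before the reserved space is needed.
PRIVATE_TOTAL = sum(range_size(r) for r in (BMP_PUA, SUPP_PUA_A, SUPP_PUA_B))
```

Here `range_size(BMP_PUA)` gives 6,400 and `PRIVATE_TOTAL` gives 137,468, matching the figures in the description, while `utf8_len` returns 3 for the basic plane private use area and 4 for the other three ranges.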
Embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 2 shows a flowchart of the Chinese web page data encoding process according to the present invention. The encoding process is performed by an intermediate server, which can be a server of any type.
As shown in Fig. 2, after the intermediate server obtains the Chinese web page data to be compressed, in step S210 the obtained data is taken as the Chinese web page data currently to be processed, and the encoding process begins.
Then, in step S220, starting from the first character of the Chinese web page data currently being processed, word segmentation is performed on the data according to a preset dictionary, to determine whether the data contains a segment that starts with this first character and matches a word in the preset dictionary. In a preferred example of the invention, each word in the dictionary is pre-allocated one Unicode code point in the private use areas or the reserved space of the Unicode code point space.
When Unicode code points are pre-allocated to the words in the dictionary, the words are first ordered by word frequency and code points are then allocated in that order. Words that come first in the order, that is, words used most frequently, are allocated code points in the private use areas first. Because the total size of the private use areas is only 137,468 code points, they may not be large enough to hold a large dictionary. In that case, part of the reserved space can also be used. When code points are allocated to entries, code points in the reserved space are generally allocated only after the code points in the private use areas have been fully allocated.
Moreover, to avoid conflicts with future versions of the standard as far as possible, the reserved space can be used (that is, code points in it can be allocated) from back to front, and the amount of reserved space occupied equals the size of the dictionary minus the size of the private use areas.
In addition, the private use areas comprise one private use area in the basic plane and two private use areas in the supplementary planes. A code point in the private use area of the basic plane occupies three bytes, and a code point in a private use area of the supplementary planes occupies four bytes. When code points in the private use areas are allocated to words, those in the private use area of the basic plane are allocated first. Code points in the private use areas of the supplementary planes are generally allocated only after the code points in the private use area of the basic plane have been fully allocated.
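The allocation order described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the word frequencies are supplied as a plain mapping, the function names `code_point_order` and `assign_codes` are hypothetical, and a dictionary larger than the available space is silently truncated by `zip` rather than reported.

```python
from itertools import chain


def code_point_order():
    """Yield code points in the allocation order described in the text."""
    return chain(
        range(0xE000, 0xF8FF + 1),        # basic plane private use area (3 bytes in UTF-8)
        range(0xF0000, 0xFFFFD + 1),      # Supplementary Private Use Area-A (4 bytes)
        range(0x100000, 0x10FFFD + 1),    # Supplementary Private Use Area-B (4 bytes)
        range(0xDFFFF, 0x30000 - 1, -1),  # reserved space, allocated from back to front
    )


def assign_codes(word_freq):
    """Map each word to a code point, giving the most frequent words the shortest codes."""
    ranked = sorted(word_freq, key=word_freq.get, reverse=True)
    return dict(zip(ranked, code_point_order()))
```

For example, `assign_codes({"a": 1, "b": 5})` gives the more frequent word "b" the first code point U+E000 and "a" the next one, U+E001.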
Word segmentation of the Chinese web page data, starting from the first character of the data currently being processed and using the preset dictionary, can be carried out in various ways. Preferably, in one example of the invention, the segmentation is such that the segment determined to match a word in the dictionary is the longest segment, starting with the first character of the data currently being processed, that matches a word in the dictionary. Fig. 3 shows a flowchart of one example of performing word segmentation on the Chinese web page data to be processed according to the present invention.
In the example shown in Fig. 3, the entries of the dictionary are stored as a Chinese dictionary in the form of a TRIE index tree. This Chinese dictionary comprises a first-character hash table and TRIE index tree nodes.
The first-character hash function of an entry is given by the Unicode code of the Chinese character. A single hash operation directly locates the character's index in the first-character hash table. Each unit of the first-character hash table contains two fields: the entry count (2 bytes), which is the number of words whose first character is this character; and the first entry pointer (4 bytes), which points to the root node of the TRIE index tree for this character.
A TRIE index tree node is an array, sorted by keyword, whose units have the following structure: keyword (2 bytes): a single Chinese character, sorted by its Unicode code; subtree size (2 bytes): the number of words that share, as a prefix, the substring formed by the keywords from the root node to the current unit, and that differ in the following character; subtree pointer (4 bytes): points to the subtree when the subtree size is non-zero, and to a leaf otherwise.
Fig. 3 shows the process of looking up an arbitrary word W[n] in the TRIE tree, where n is the number of characters in the word.
As shown in Fig. 3, first, in step S310, i is set to 1. Then, in step S320, the root node of the TRIE index tree for w[1] is obtained from the first-character hash table and denoted P. Then, in step S330, i is incremented by 1, and the process proceeds to step S340.
In step S340, a binary search for w[i] is performed among the keywords of node P. Then, in step S350, it is determined whether any keyword of node P matches w[i]. If a keyword of node P matches w[i], P is set to the root node of the subtree corresponding to that keyword, and the process returns to step S330. Otherwise, P is treated as a leaf node, and the process proceeds to step S360.
In step S360, it is determined whether i is greater than n. If i is greater than n, the query is considered successful and w[n] is an entry in the dictionary. If i<n, the query is considered to have failed, and w[n-1] is determined to be an entry in the dictionary.
The word segmentation process has been described above with reference to Fig. 3, but this example is only an illustration of the invention; word segmentation can also be carried out in other ways known in the art.
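As a rough illustration of trie-based longest matching, the following Python sketch uses nested dicts in place of the first-character hash table and the sorted keyword arrays with binary search, so it simplifies the storage layout of Fig. 3 rather than reproducing it. The names `build_trie` and `longest_match`, the empty-string end-of-word marker, and the sample Chinese words are all assumptions made for the example.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "" marks the end of a dictionary word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def longest_match(trie, text, start=0):
    """Return the longest dictionary word that begins at text[start], or None."""
    node = trie
    best = None
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no keyword matches this character: stop descending
        if "" in node:
            best = text[start:i + 1]  # remember the longest complete word so far
    return best
```

A dict lookup per character plays the role of the binary search in step S340; the sorted arrays in the description trade that convenience for a compact, fixed-size node layout.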
Returning to Fig. 2, after word segmentation has been performed on the Chinese web page data currently to be processed in step S220, it is determined in step S230 whether the data contains a segment that starts with the first character and matches a word in the preset dictionary. When such a segment exists, that is, when the result of step S230 is yes, the segment is replaced in step S240, in the Chinese web page data currently to be compressed, with the Unicode code point corresponding to the matched word.
When no such segment exists, that is, when the result of step S230 is no, the first character is replaced in step S250, in the Chinese web page data currently to be compressed, with the Unicode code point of that character.
Then, in step S260, the part that has been replaced by a Unicode code point is removed from the Chinese web page data currently being processed, and the remainder is taken as the next Chinese web page data to be processed. Subsequently, in step S270, it is determined whether the next Chinese web page data to be processed, obtained after the above replacement, is empty.
If the next Chinese web page data to be processed is empty, the flow ends. If it is not empty, the process returns to step S220 and the loop is repeated for this next data, until all of the obtained Chinese web page data has been replaced by Unicode code points.
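The loop of steps S220 to S270 can be sketched in Python as a greedy longest-match encoder. This is a minimal sketch, not the patented implementation: it tries prefixes from longest to shortest against a plain mapping instead of walking a TRIE, the name `encode` is hypothetical, and the Chinese strings in the usage note are assumed back-translations of the words in the Fig. 5 example.

```python
def encode(text, word_codes):
    """Replace dictionary words with their assigned code points, longest match first."""
    max_len = max(map(len, word_codes), default=1)
    out = []
    pos = 0
    while pos < len(text):  # S270: stop once no data remains
        # S220: try the longest candidate that starts at the current first character.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in word_codes:  # S230: a dictionary word matches
                out.append(chr(word_codes[candidate]))  # S240: emit the word's code point
                pos += length  # S260: drop the replaced part
                break
        else:
            out.append(text[pos])  # S250: keep the character's own code point
            pos += 1  # S260
    return "".join(out)
```

With `word_codes = {"菲律宾": 0xE68C, "专属经济区": 0x328C5}`, encoding the 9-character string "菲律宾在专属经济区" yields three code points, which take 10 bytes in UTF-8 instead of the original 27.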
In the present invention, Chinese web page data is usually transmitted in UTF-8 format. In other embodiments of the invention, other formats such as UTF-16 may also be used. In UTF-8, each Chinese character occupies 3 bytes, whereas if words are used as the basic unit of transmission, each word occupies only three or four bytes. The benefit of the encoding process according to the invention is illustrated below, taking transmission in UTF-8 format as an example.
Fig. 4 shows the Chinese web page data before encoding in one example of the Chinese web page data encoding process according to the present invention.
Fig. 4 shows a passage of Chinese web page data taken from Sina News. The passage contains 78 characters; because each character occupies 3 bytes, the total size is 78 × 3 = 234 bytes.
Fig. 5 illustrates word segmentation of the Chinese web page data in Fig. 4. As shown in Fig. 5, the segmentation process first identifies the word "Philippines" and replaces it with 59500 (0xe68c), reducing the 9 bytes occupied by its three characters to at most 4 bytes. Similarly, when "exclusive economic zone (EEZ)" is identified, it is replaced with 207045 (0x328c5), so that 15 bytes are replaced by 4 bytes. Word segmentation of the rest of the Chinese web page data in Fig. 4 proceeds in the same way.
Fig. 6 shows the result obtained after the above word segmentation, with words separated by spaces. As can be seen from Fig. 6, after the Chinese web page data encoding process according to the invention, the 78 characters in Fig. 4 are broken down into 41 words. In UTF-8, each word occupies only three or four bytes, so the size of the encoded text is at most 41 × 4 = 164 bytes. The saving is therefore (234 - 164)/234 ≈ 30%. Note that the encoding of the present invention segments and encodes as it goes: as soon as a segment is obtained, it is replaced with its Unicode code point. After all segmentation is complete, the result is therefore a Unicode code point stream rather than the result shown in Fig. 6. Fig. 6 is provided only to aid understanding of the invention; it shows the segments without replacing them with Unicode code points.
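The arithmetic of this example can be reproduced with a few lines of Python. The variable names are illustrative; the figures 78, 3, 41 and 4 come from the description, and the result is a worst-case bound because words coded in the basic plane private use area take only 3 bytes.

```python
chars = 78               # characters in the passage of Fig. 4
bytes_per_char = 3       # UTF-8 size of one Chinese character
words = 41               # words after segmentation (Fig. 6)
max_bytes_per_word = 4   # worst case: every word coded outside the basic plane

original_size = chars * bytes_per_char    # 234 bytes before encoding
worst_case_size = words * max_bytes_per_word  # 164 bytes after encoding, at most
saving = (original_size - worst_case_size) / original_size

print(f"{original_size} -> {worst_case_size} bytes, saving {saving:.1%}")
```

The computed saving is 70/234, about 29.9%, which the description rounds to 30%.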
As can be seen from above, with carry out compared with transmission after directly original Chinese web page being compressed in prior art, transmission after compressing again after utilization coding method according to the present invention carries out recompile to original Chinese web page, the size text that must transmit can be made less, can volume of transmitted data be reduced thus.
Fig. 7 shows the block diagram according to Chinese web page data coding device 700 of the present invention.As shown in Figure 7, Chinese web page data coding device 700 comprises word segmentation processing unit 710, coding unit 720 and works as pre-treatment data updating unit 730.
Word segmentation processing unit 710 is for the first character from the Chinese web page data when pre-treatment, according to the dictionary pre-set, word segmentation processing is carried out to these Chinese web page data, to determine whether there is the participle started with this first character mated with the word in the dictionary pre-set in these Chinese web page data.In a preferred embodiment of the invention, each word in described dictionary is preassigned the private room in Unicode code bit space or a Unicode coding in retaining space.
When for there is the participle started with this first character mated with the word in the dictionary pre-set in Chinese web page data in coding unit 720, current will by the Chinese web page data compressed, utilization is encoded with the corresponding Unicode of the word that this participle mates and is replaced this participle, or when there is not the participle started with this first character mated with the word in the dictionary pre-set in Chinese web page data, will, by the Chinese web page data compressed, the Unicode coding of this first character be utilized to replace this first character current.
When pre-treatment data updating unit 730 for removing the part being replaced by Unicode coding from the Chinese web page data when pre-treatment, work as the Chinese web page data of pre-treatment as next.
When utilizing Chinese web page data coding device 700 according to the present invention to obtained will coding by the Chinese web page data compressed, from obtained will by the first character of Chinese web page data that compresses, repeat described word segmentation processing unit 710, coding unit 720 and the processing procedure when pre-treatment data updating unit 730, until these Chinese web page data obtained all replace to Unicode coding.
Fig. 8 shows a block diagram of an intermediate server 10 according to the present invention. As shown in Fig. 8, the intermediate server 10 comprises the Chinese web page data encoding device 700 shown in Fig. 7.
Fig. 9 shows a flowchart of a Chinese web page data decoding method according to the present invention.
As shown in Fig. 9, in step S910 the mobile terminal receives from the intermediate server the Unicode code stream produced by the Chinese web page data encoding method described above. After receiving the Unicode code stream, the mobile terminal decodes it into the corresponding Chinese web page data according to a dictionary preset in the mobile terminal, where the dictionary preset in the mobile terminal is identical to the dictionary preset in the intermediate server.
Fig. 10 shows a block diagram of a Chinese web page data decoding device 1000 according to the present invention. As shown in Fig. 10, the Chinese web page data decoding device 1000 comprises a receiving unit 1010 and a decoding unit 1020.
The receiving unit 1010 receives from the intermediate server the Unicode code stream produced by the Chinese web page data encoding method described above. After the Unicode code stream is received, the decoding unit 1020 decodes it into the corresponding Chinese web page data according to the dictionary preset in the mobile terminal, where the dictionary preset in the mobile terminal is identical to the dictionary preset in the intermediate server. For example, after the segmentation-based encoding shown in Fig. 5, when the Unicode code stream received on the mobile terminal (browser client) contains "0xe68c", it is decoded as "Philippine".
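The decoding step can be sketched in Python as a reverse lookup over the shared dictionary. This is only an illustration under stated assumptions: the function name `decode` is hypothetical, the Chinese word paired with code point 0xE68C is a guess at the word translated above as "Philippine", and the sketch assumes the original page text never contains the code points reserved for dictionary words.

```python
def decode(stream, dictionary):
    """Expand dictionary code points back into words (illustrative sketch).

    `dictionary` is the same word -> one-character code mapping used by
    the encoder; it must be identical on server and terminal.
    """
    # Invert the mapping once: assigned code point -> dictionary word.
    reverse = {code: word for word, code in dictionary.items()}
    # Characters without a dictionary entry are ordinary text and pass through.
    return "".join(reverse.get(ch, ch) for ch in stream)


# Hypothetical shared dictionary; the entry is assumed for the demo.
shared_dictionary = {"菲律宾": "\ue68c"}
print(decode("\ue68c人", shared_dictionary))
```

Because each dictionary word maps to exactly one code point, decoding needs no segmentation at all, which keeps the work on the mobile terminal small.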
Fig. 11 shows a block diagram of a mobile terminal 20 according to the present invention. As shown in Fig. 11, the mobile terminal 20 comprises the Chinese web page data decoding device 1000 shown in Fig. 10.
With the Chinese web page data encoding method according to the present invention, Chinese web page content can be encoded using the preset dictionary together with the Unicode code points, in the private use area or the reserved area of the Unicode code point space, that are assigned to the words in the dictionary. This reduces the space occupied by the encoded data stream, and thereby reduces the storage space and the data transmission volume of Chinese web page data.
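The size reduction can be checked with simple byte counting, assuming the data is transmitted in UTF-8 as recited in claim 7. In UTF-8, a common Chinese character occupies three bytes, a code point in the basic-plane private use area (U+E000 to U+F8FF) occupies three bytes, and a code point in a supplementary-plane private use area occupies four bytes. The helper names and the sample words below are assumptions chosen for illustration.

```python
def utf8_size(text):
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def bytes_saved(word, code_point):
    """Bytes saved by replacing `word` with a single assigned code point."""
    return utf8_size(word) - utf8_size(chr(code_point))


# Hypothetical sample words; code points are taken from the private use areas.
print(bytes_saved("菲律宾", 0xE68C))    # 9-byte word -> 3-byte code
print(bytes_saved("菲律宾", 0xF0000))   # 9-byte word -> 4-byte code
print(bytes_saved("中国", 0xE000))      # 6-byte word -> 3-byte code
```

The arithmetic also shows why the three-byte basic-plane codes are worth reserving for the most frequent words: each replacement saves one byte more than a four-byte supplementary-plane code would.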
In addition, the mobile terminal of the present invention may typically be any of various handheld terminal devices, such as a mobile phone or a PDA (Personal Digital Assistant); the protection scope of the present invention should therefore not be limited to a particular type of mobile terminal.
In addition, the method according to the present invention can also be implemented as a computer program executed by a CPU. When the computer program is executed by the CPU, it performs the above-described functions defined in the method of the present invention.
In addition, the above method steps and system units can also be realized by a controller together with a computer readable storage device that stores a computer program causing the controller to realize the above steps or unit functions.
In addition, it is to be understood that the computer readable storage device described herein (for example, a memory) can be volatile memory or nonvolatile memory, or can comprise both volatile memory and nonvolatile memory. By way of example and not limitation, nonvolatile memory can comprise read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can comprise random access memory (RAM), which can serve as external cache memory. By way of example and not limitation, RAM is available in many forms, such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to include, without being limited to, these and other suitable types of memory.
Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components, blocks, modules, circuits, and steps have been described generally in terms of their functionality. Whether such functionality is implemented as software or as hardware depends on the particular application and on the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in various ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Although the foregoing disclosure shows exemplary embodiments of the present invention, it should be noted that various changes and modifications can be made without departing from the scope of the present invention as defined by the claims. The functions, steps, and/or actions of the method claims according to the embodiments described herein need not be performed in any particular order. In addition, although elements of the present invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
Although the embodiments of the present invention have been described above with reference to the drawings, those skilled in the art will appreciate that various improvements can be made to the embodiments proposed above without departing from the content of the present invention. Therefore, the protection scope of the present invention should be determined by the content of the appended claims.

Claims (12)

1. A Chinese web page data encoding method, comprising:
Starting from the first character of obtained Chinese web page data to be compressed, repeating the following process until all of the obtained Chinese web page data has been replaced with Unicode codes:
Performing word segmentation on the Chinese web page data currently being processed, starting from its first character and according to a preset dictionary, to determine whether there exists a segmented word that starts with this first character and matches a word in the preset dictionary;
When a segmented word that starts with this first character and matches a word in the preset dictionary exists, replacing, in the Chinese web page data currently to be compressed, the segmented word with the Unicode code corresponding to the matching dictionary word, or, when no segmented word that starts with this first character and matches a word in the preset dictionary exists, replacing, in the Chinese web page data currently to be compressed, this first character with the Unicode code of this first character; and
Removing the part that has been replaced by a Unicode code from the Chinese web page data currently being processed, and taking the remainder as the next Chinese web page data to be processed.
2. The Chinese web page data encoding method as claimed in claim 1, wherein each word in the dictionary is assigned in advance one Unicode code in the private use area or the reserved area of the Unicode code point space.
3. The Chinese web page data encoding method as claimed in claim 1, wherein the segmented word that is determined to match a word in the dictionary and that starts with the first character of the Chinese web page data currently being processed is the longest segmented word starting with this first character that can match a word in the dictionary.
4. The Chinese web page data encoding method as claimed in claim 2, wherein the words in the dictionary are arranged according to word frequency, and Unicode codes are assigned to the words according to that order,
Wherein the words are preferentially assigned Unicode codes in the private use area, and Unicode codes in the reserved area are assigned after the Unicode codes in the private use area have all been assigned.
5. The Chinese web page data encoding method as claimed in claim 4, wherein the private use area comprises one private use area located in the basic plane and two private use areas located in supplementary planes, a Unicode code in the private use area of the basic plane occupies three bytes, a Unicode code in a private use area of a supplementary plane occupies four bytes, the words are preferentially assigned Unicode codes in the private use area of the basic plane, and Unicode codes in the private use areas of the supplementary planes are assigned after the Unicode codes in the private use area of the basic plane have all been assigned.
6. The Chinese web page data encoding method as claimed in claim 5, wherein the Unicode codes in the reserved area are assigned in order from back to front.
7. The Chinese web page data encoding method as claimed in claim 1, wherein the Chinese web page data is transmitted in UTF-8 format.
8. A Chinese web page data encoding device, comprising:
A word segmentation processing unit, configured to perform word segmentation on the Chinese web page data currently being processed, starting from its first character and according to a preset dictionary, to determine whether there exists a segmented word that starts with this first character and matches a word in the preset dictionary;
A coding unit, configured to, when a segmented word that starts with this first character and matches a word in the preset dictionary exists, replace, in the Chinese web page data currently to be compressed, the segmented word with the Unicode code corresponding to the matching dictionary word, or, when no segmented word that starts with this first character and matches a word in the preset dictionary exists, replace, in the Chinese web page data currently to be compressed, this first character with the Unicode code of this first character; and
A current-processing data updating unit, configured to remove the part that has been replaced by a Unicode code from the Chinese web page data currently being processed, and to take the remainder as the next Chinese web page data to be processed,
Wherein, starting from the first character of obtained Chinese web page data to be compressed, the processing of the word segmentation processing unit, the coding unit, and the current-processing data updating unit is repeated until all of the obtained Chinese web page data has been replaced with Unicode codes.
9. An intermediate server, comprising the Chinese web page data encoding device as claimed in claim 8.
10. A Chinese web page data decoding method, comprising:
Receiving, from an intermediate server, a Unicode code stream encoded by the Chinese web page data encoding method according to claim 1; and
Decoding the received Unicode code stream into the corresponding Chinese web page data according to a dictionary preset in a mobile terminal,
Wherein the dictionary preset in the mobile terminal is identical to the dictionary preset in the intermediate server.
11. A Chinese web page data decoding device, comprising:
A receiving unit, configured to receive, from an intermediate server, a Unicode code stream encoded by the Chinese web page data encoding method according to claim 1; and
A decoding unit, configured to decode the received Unicode code stream into the corresponding Chinese web page data according to a dictionary preset in the Chinese web page data decoding device, the dictionary preset in the Chinese web page data decoding device being identical to the dictionary preset in the intermediate server.
12. A mobile terminal, comprising the Chinese web page data decoding device as claimed in claim 11.
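
The allocation order recited in claims 4 to 6 can be sketched in Python as follows. The private use area ranges are those defined by the Unicode Standard. Everything else is an assumption made for illustration: the function name `assign_code_points`, the choice to put the most frequent word first, the reading of "back to front" as highest code point first, and the `reserved_ranges` parameter, whose actual ranges are not given in the claims.

```python
from itertools import chain

# Private use areas defined by the Unicode Standard.
BMP_PUA = range(0xE000, 0xF900)           # U+E000..U+F8FF, 3 bytes in UTF-8
PLANE15_PUA = range(0xF0000, 0xFFFFE)     # U+F0000..U+FFFFD, 4 bytes in UTF-8
PLANE16_PUA = range(0x100000, 0x10FFFE)   # U+100000..U+10FFFD, 4 bytes in UTF-8


def assign_code_points(word_freq, reserved_ranges=()):
    """Assign one code point per word in frequency order (illustrative).

    `word_freq` maps word -> frequency. `reserved_ranges` is a hypothetical
    sequence of `range` objects standing in for the reserved area.
    """
    # Assumed ordering: most frequent words get the shortest codes first.
    ordered = sorted(word_freq, key=word_freq.get, reverse=True)
    # Reserved code points are handed out from back to front.
    reserved_back_to_front = chain.from_iterable(
        reversed(r) for r in reversed(list(reserved_ranges))
    )
    pool = chain(BMP_PUA, PLANE15_PUA, PLANE16_PUA, reserved_back_to_front)
    # zip stops when either side runs out, so words beyond the pool's
    # capacity are simply left without a code in this sketch.
    return {word: chr(cp) for word, cp in zip(ordered, pool)}


table = assign_code_points({"中国": 120, "菲律宾": 45, "网页": 300})
print({word: hex(ord(code)) for word, code in table.items()})
```

The basic-plane private use area holds 6400 code points, so in this sketch only the 6400 most frequent words receive three-byte codes; later words fall through to the four-byte supplementary-plane codes.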
CN201210361682.XA 2012-09-25 2012-09-25 Chinese web page data encoding, coding/decoding method and system Active CN102880703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210361682.XA CN102880703B (en) 2012-09-25 2012-09-25 Chinese web page data encoding, coding/decoding method and system

Publications (2)

Publication Number Publication Date
CN102880703A CN102880703A (en) 2013-01-16
CN102880703B true CN102880703B (en) 2016-03-16

Family

ID=47482029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210361682.XA Active CN102880703B (en) 2012-09-25 2012-09-25 Chinese web page data encoding, coding/decoding method and system

Country Status (1)

Country Link
CN (1) CN102880703B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102063566B1 (en) * 2014-02-23 2020-01-09 삼성전자주식회사 Operating Method For Text Message and Electronic Device supporting the same
CN105843854B (en) * 2015-03-16 2019-02-05 国家计算机网络与信息安全管理中心 A kind of thematic document system for rapidly identifying of network-oriented data
CN105528420A (en) * 2015-12-07 2016-04-27 北京金山安全软件有限公司 Character encoding and decoding method and device and electronic equipment
CN108108267B (en) * 2016-11-25 2021-06-22 北京国双科技有限公司 Data recovery method and device
CN111178065B (en) * 2019-12-12 2023-06-27 建信金融科技有限责任公司 Word segmentation recognition word stock construction method, chinese word segmentation method and Chinese word segmentation device
CN111178061B (en) * 2019-12-20 2023-03-10 沈阳雅译网络技术有限公司 Multi-lingual word segmentation method based on code conversion
CN112632909B (en) * 2020-10-30 2024-06-11 中核核电运行管理有限公司 English coding method and device for data object

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101013420A (en) * 2006-12-31 2007-08-08 中国科学院计算技术研究所 Method for identifying coding form of Chinese text
CN101729075A (en) * 2008-10-10 2010-06-09 英华达(上海)电子有限公司 Data compression method, data compression device, data decompression method and data decompression device
CN101751451A (en) * 2008-12-11 2010-06-23 高德软件有限公司 Chinese data compression method and Chinese data decompression method and related devices
CN102508824A (en) * 2011-09-29 2012-06-20 苏州大学 Compression coding and decoding method and device for microblog information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200702

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 2, 16, 301 rooms, 510665 Yun Yun Road, Tianhe District, Guangdong, Guangzhou

Patentee before: GUANGZHOU UCWEB COMPUTER TECHNOLOGY Co.,Ltd.
