CN102880703A - Methods and systems for encoding and decoding Chinese webpage data - Google Patents

Publication number
CN102880703A
Authority
CN
China
Prior art keywords
web page
page data
chinese web
unicode
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210361682XA
Other languages
Chinese (zh)
Other versions
CN102880703B (en)
Inventor
梁捷
俞永福
何小鹏
朱顺炎
田文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Dongjing Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Dongjing Computer Technology Co Ltd filed Critical Guangzhou Dongjing Computer Technology Co Ltd
Priority to CN201210361682.XA priority Critical patent/CN102880703B/en
Publication of CN102880703A publication Critical patent/CN102880703A/en
Application granted granted Critical
Publication of CN102880703B publication Critical patent/CN102880703B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a method for encoding Chinese web page data. Starting from the first character of the Chinese web page data currently being processed, word segmentation is performed against a preset dictionary to determine whether there is a word segment that begins with that first character and matches a word in the dictionary. If such a segment exists, it is replaced with the Unicode code point assigned to the matching word; otherwise, the first character is replaced with its own Unicode code point. The part that has been replaced is then removed from the data currently being processed to obtain the next data to be processed, and these steps are repeated until the Chinese web page data has been completely replaced by a Unicode stream. The method reduces the space occupied by the encoded data stream, and thus the storage space and transmission traffic required for Chinese web page data.

Description

Methods and systems for encoding and decoding Chinese web page data
Technical field
The present invention relates to the field of mobile communications and, more specifically, to a method and device for encoding Chinese web page data, a server having the encoding device, a method and device for decoding Chinese web page data, and a mobile terminal having the decoding device.
Background
To save users' data traffic, when web page content is transmitted from a server to the browser client of a mobile terminal, the browser's back-end server compresses the page before transmission. Servers currently tend to use compression algorithms based on LZ77, such as the LZ77 and LZMA compression algorithms, in compressed formats such as gzip and 7zip. The web page http://en.wikipedia.org/wiki/LZ77 describes the LZ77 compression algorithm, and the web page http://en.wikipedia.org/wiki/Lempel–Ziv–Markov_chain_algorithm describes the LZMA compression algorithm. The content disclosed on these web pages is incorporated into the present application by reference.
The basic principle of these compression algorithms is to find strings that repeat in the text, build a "dictionary" of the repeated strings, and replace each such string in the output with its dictionary index. The dictionary does not need to be transmitted with the encoded strings; the decompressor can rebuild the original strings by running the algorithm in reverse.
Fig. 1 shows a flowchart of the LZW compression algorithm.
As shown in Fig. 1, the dictionary is first initialized to contain all strings of length 1 (step S110). Next, the longest string W in the dictionary that matches the current input is found (step S120). W is then replaced with its dictionary index in the output and removed from the input (step S130), and W followed by the next character in the input is added to the dictionary (step S140). The process then returns to step S120 and repeats until the input is empty.
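The four LZW steps can be sketched in Python as follows. This is a minimal illustration, not code from the patent: the function name `lzw_compress` and the choice of bytes as the input alphabet (so that the initial dictionary has 256 entries) are assumptions made for the example.

```python
def lzw_compress(data: bytes) -> list:
    """Compress bytes into a list of dictionary indices using LZW."""
    # Step S110: the dictionary starts with every string of length 1.
    dictionary = {bytes([i]): i for i in range(256)}
    output = []
    w = b""
    for value in data:
        candidate = w + bytes([value])
        if candidate in dictionary:
            # Step S120: keep extending to find the longest match W.
            w = candidate
        else:
            # Step S130: emit the index of W and drop W from the input.
            output.append(dictionary[w])
            # Step S140: add W plus the following character to the dictionary.
            dictionary[candidate] = len(dictionary)
            w = bytes([value])
    if w:
        # Flush the last pending match once the input is empty.
        output.append(dictionary[w])
    return output
```

For short inputs the dictionary has little time to grow, so most output indices still refer to single bytes, which is the weakness for short pages that the background section goes on to describe.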
The LZW algorithm is language-agnostic: because it defines repeated patterns at the byte level, it can be applied to Chinese web pages, but for the same reason it cannot exploit the characteristics of the language itself. Chinese text, for example, is semantically composed of relatively fixed "words", a property the algorithm does not take into account. The algorithm also depends on repeated patterns in the text; if a text contains no repeated patterns, or few repeated strings, the algorithm fails or compresses poorly. Moreover, because repeated patterns are identified gradually as the text is scanned, only short patterns are recognized at first and longer ones only later. The beginning of a document is therefore compressed poorly, which is a disadvantage for short web pages. According to rough statistics on news web pages, the compression ratio of the body text of Chinese web pages is between 60% and 90% (a smaller ratio means better compression), clearly worse than for js files, css files, HTML tags, and other content composed of English text.
Summary of the invention
In view of the above problems, one object of the present invention is to provide a method and device for encoding Chinese web page data that assign each word in a preset dictionary a Unicode code point in the private use areas or the reserved space of the Unicode code space, and use these code points to encode Chinese web page content, thereby improving the compression efficiency of Chinese web page data.
Another object of the present invention is to provide an intermediate server having the above Chinese web page data encoding device.
Another object of the present invention is to provide a method and device for decoding Chinese web page data that can decode a Unicode stream encoded as above to recover the original Chinese web page data.
Another object of the present invention is to provide a mobile terminal having the above Chinese web page data decoding device.
According to one aspect of the present invention, a method for encoding Chinese web page data is provided, comprising: starting from the first character of the obtained Chinese web page data to be compressed, repeating the following process until all of the obtained data has been replaced with Unicode code points: starting from the first character of the Chinese web page data currently being processed, performing word segmentation on the data according to a preset dictionary to determine whether there is a word segment that begins with the first character and matches a word in the preset dictionary; when such a word segment exists, replacing the word segment in the data to be compressed with the Unicode code point corresponding to the matching word, or, when no such word segment exists, replacing the first character with the Unicode code point of that character; and removing the part that has been replaced with a Unicode code point from the data currently being processed, the remainder serving as the next data to be processed.
In one or more examples of the above aspect, each word in the dictionary is pre-assigned a Unicode code point in the private use areas or the reserved space of the Unicode code space.
In one or more examples of the above aspect, the word segment determined to begin with the first character of the data currently being processed and to match a word in the dictionary is the longest segment beginning with that character that matches a word in the dictionary.
In one or more examples of the above aspect, the words in the dictionary are ordered by word frequency and assigned Unicode code points in that order, wherein code points in the private use areas are assigned first, and code points in the reserved space are assigned only after all code points in the private use areas have been assigned.
In one or more examples of the above aspect, the private use areas comprise one private use area in the basic plane and two private use areas in the supplementary planes; a code point in the basic-plane private use area occupies three bytes, and a code point in a supplementary-plane private use area occupies four bytes; code points in the basic-plane private use area are assigned to the words first, and code points in the supplementary-plane private use areas are assigned only after all code points in the basic-plane private use area have been assigned.
In one or more examples of the above aspect, the code points in the reserved space are assigned in order from back to front.
In one or more examples of the above aspect, the Chinese web page data is transmitted in UTF-8 format.
According to another aspect of the present invention, a device for encoding Chinese web page data is provided, comprising: a word segmentation unit configured to perform, starting from the first character of the Chinese web page data currently being processed, word segmentation on the data according to a preset dictionary to determine whether there is a word segment that begins with the first character and matches a word in the preset dictionary; an encoding unit configured to replace, when such a word segment exists, the word segment in the data to be compressed with the Unicode code point corresponding to the matching word, or, when no such word segment exists, to replace the first character with the Unicode code point of that character; and a current-data updating unit configured to remove the part that has been replaced with a Unicode code point from the data currently being processed, the remainder serving as the next data to be processed; wherein, starting from the first character of the obtained Chinese web page data to be compressed, the processing of the word segmentation unit, the encoding unit, and the current-data updating unit is repeated until all of the obtained data has been replaced with Unicode code points.
According to another aspect of the present invention, an intermediate server is provided, comprising the Chinese web page data encoding device described above.
According to another aspect of the present invention, a method for decoding Chinese web page data is provided, comprising: receiving from an intermediate server a Unicode encoded stream produced by the Chinese web page data encoding method described above; and decoding the received Unicode encoded stream into the corresponding Chinese web page data according to a dictionary preset in the mobile terminal, the dictionary preset in the mobile terminal being identical to the dictionary preset in the intermediate server.
According to another aspect of the present invention, a device for decoding Chinese web page data is provided, comprising: a receiving unit configured to receive from an intermediate server a Unicode encoded stream produced by the Chinese web page data encoding method described above; and a decoding unit configured to decode the received Unicode encoded stream into the corresponding Chinese web page data according to a dictionary preset in the decoding device, the dictionary preset in the decoding device being identical to the dictionary preset in the intermediate server.
According to another aspect of the present invention, a mobile terminal is provided, comprising the Chinese web page data decoding device described above.
With the Chinese web page data encoding method according to the present invention, Chinese web page content can be encoded using a preset dictionary and the Unicode code points, in the private use areas or the reserved space of the Unicode code space, that are assigned to the words in the dictionary. This reduces the space occupied by the encoded data stream and thereby the storage space and transmission traffic required for Chinese web page data.
To achieve the above and related objects, one or more aspects of the present invention comprise the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings set forth certain illustrative aspects of the invention in detail. These aspects, however, indicate only some of the various ways in which the principles of the invention may be employed. Moreover, the invention is intended to include all such aspects and their equivalents.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will become more apparent from the following detailed description taken with reference to the accompanying drawings, in which:
Fig. 1 shows a flowchart of the compression process based on the LZW compression algorithm;
Fig. 2 shows a flowchart of the Chinese web page data encoding process according to the present invention;
Fig. 3 shows a flowchart of an example of performing word segmentation on the Chinese web page data to be processed according to the present invention;
Fig. 4 shows the Chinese web page data before encoding in an example of the encoding process according to the present invention;
Fig. 5 illustrates word segmentation of the Chinese web page data of Fig. 4;
Fig. 6 shows the result obtained after the above word segmentation;
Fig. 7 shows a block diagram of the Chinese web page data encoding device according to the present invention;
Fig. 8 shows a block diagram of the intermediate server according to the present invention;
Fig. 9 shows a flowchart of the Chinese web page data decoding method according to the present invention;
Fig. 10 shows a block diagram of the Chinese web page data decoding device according to the present invention; and
Fig. 11 shows a block diagram of the mobile terminal according to the present invention.
The same reference numerals indicate similar or corresponding features or functions throughout the drawings.
Detailed description
Various aspects of the present disclosure are described below. It should be understood that the teachings herein may be embodied in a wide variety of forms, and that any specific structure, function, or both disclosed herein is merely representative. Based on the teachings herein, those skilled in the art will appreciate that an aspect disclosed herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, a device may be implemented or a method practiced using any number of the aspects set forth herein. In addition, such a device may be implemented or such a method practiced using other structure, functionality, or structure and functionality in addition to or other than one or more of the aspects set forth herein. Furthermore, any aspect described herein may comprise at least one element of a claim.
Before the embodiments of the present invention are described, the Unicode standard used in the present invention is briefly introduced.
Unicode, known in Chinese under several names such as "unified code" and "universal code", is an industry standard in computer science. It organizes and encodes most of the world's writing systems so that computers can present and process text in a simpler way.
Unicode defines a code space of 1,114,112 code points in the range 0 to 0x10FFFF. This space is divided into 17 planes, numbered 0 to 16. Plane 0 is called the basic plane and covers 0000-FFFF; planes 1 to 16 are called supplementary planes and cover 10000-10FFFF.
In addition, according to the usage rules of the Unicode standard, Unicode code points are divided into public space, private use areas, and reserved space. Code points in the public space are assigned by the standard to the world's scripts, the private use areas can be used freely by private parties, and the reserved space is space that is not yet in use.
According to the Unicode standard, the private use space is divided into three areas: the private use area of the basic plane, Private Use Area: U+E000..U+F8FF (6,400 characters); and two private use areas in the supplementary planes, Supplementary Private Use Area-A: U+F0000..U+FFFFD (65,534 characters) and Supplementary Private Use Area-B: U+100000..U+10FFFD (65,534 characters). In UTF-8, a code point in the basic plane (0000-FFFF) occupies at most 3 bytes (exactly 3 bytes in the basic-plane private use area), and a code point in the supplementary planes (10000-10FFFF) occupies 4 bytes.
The reserved space is: Unassigned: 30000-DFFFF (720,896 characters).
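The range sizes and UTF-8 byte lengths quoted above can be checked with a few lines of Python. The constant and function names are assumptions introduced for this sketch; the range bounds are the ones given in the text.

```python
# Inclusive code point ranges quoted in the text.
BMP_PUA = (0xE000, 0xF8FF)         # Private Use Area (basic plane)
SUPP_PUA_A = (0xF0000, 0xFFFFD)    # Supplementary Private Use Area-A
SUPP_PUA_B = (0x100000, 0x10FFFD)  # Supplementary Private Use Area-B
RESERVED = (0x30000, 0xDFFFF)      # space treated as reserved in the text

def range_size(bounds):
    """Number of code points in an inclusive (first, last) range."""
    first, last = bounds
    return last - first + 1

def utf8_len(code_point):
    """Number of bytes the code point occupies when encoded as UTF-8."""
    return len(chr(code_point).encode("utf-8"))

# Total private use capacity: 6,400 + 65,534 + 65,534 code points.
PRIVATE_TOTAL = sum(range_size(r) for r in (BMP_PUA, SUPP_PUA_A, SUPP_PUA_B))
```

Here `utf8_len(0xE000)` is 3 and `utf8_len(0xF0000)` is 4, which is why a dictionary word mapped to the basic-plane private use area costs one byte less than a word mapped anywhere else.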
Embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 2 shows a flowchart of the Chinese web page data encoding process according to the present invention. The encoding process is performed by an intermediate server, which may be a server of any type.
As shown in Fig. 2, after the intermediate server obtains the Chinese web page data to be compressed, in step S210 it takes the obtained data as the Chinese web page data currently to be processed and begins the encoding process.
Next, in step S220, starting from the first character of the data currently being processed, word segmentation is performed on the data according to a preset dictionary to determine whether the data contains a word segment that begins with the first character and matches a word in the preset dictionary. In a preferred example of the present invention, each word in the dictionary is pre-assigned a Unicode code point in the private use areas or the reserved space of the Unicode code space.
To pre-assign Unicode code points to the words in the dictionary, the words are first sorted by word frequency, and code points are then assigned in that order. Words near the front of the order, that is, the most frequently used words, are assigned code points in the private use areas first. Because the private use areas hold only 137,468 code points in total, they may not be large enough for a large dictionary. In that case, part of the reserved space can also be used. In general, code points in the reserved space are assigned only after all code points in the private use areas have been assigned.
To avoid conflicts with future versions of the standard as far as possible, the reserved space is used from back to front. The amount of reserved space occupied equals the size of the dictionary minus the size of the private use areas.
In addition, the private use areas comprise one area in the basic plane and two areas in the supplementary planes. A code point in the basic-plane private use area occupies three bytes, and a code point in a supplementary-plane private use area occupies four bytes. When code points in the private use areas are assigned to words, those in the basic-plane private use area are assigned first. In general, code points in the supplementary-plane private use areas are assigned only after all code points in the basic-plane private use area have been assigned.
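The assignment order described above can be sketched as follows. The function names and the tie-breaking rule for equally frequent words are assumptions made for the example, and the text does not say which of the two supplementary private use areas is used first; the sketch arbitrarily takes Area-A before Area-B.

```python
from itertools import chain

def allocation_order():
    """Yield candidate code points in the priority order described above."""
    return chain(
        range(0xE000, 0xF8FF + 1),        # basic-plane private use area (3 bytes in UTF-8)
        range(0xF0000, 0xFFFFD + 1),      # Supplementary Private Use Area-A (4 bytes)
        range(0x100000, 0x10FFFD + 1),    # Supplementary Private Use Area-B (4 bytes)
        range(0xDFFFF, 0x30000 - 1, -1),  # reserved space, taken from back to front
    )

def assign_code_points(word_freq):
    """Map each word to a code point, most frequent words first.

    word_freq maps word -> frequency. Ties are broken alphabetically
    (an assumption; the text only specifies ordering by frequency).
    """
    ranked = sorted(word_freq, key=lambda w: (-word_freq[w], w))
    # zip stops at the shorter input, so words beyond the available
    # code points would simply be left out of the mapping.
    return dict(zip(ranked, allocation_order()))
```

With this ordering, the 6,400 most frequent words receive three-byte code points, and the reserved space is touched only for the 137,469th word onward.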
Word segmentation of the data currently being processed, starting from its first character and using the preset dictionary, can be performed in various ways. Preferably, in one example of the present invention, the segmentation is performed such that the word segment determined to begin with the first character of the data currently being processed and to match a word in the dictionary is the longest segment in that data that begins with the first character and matches a word in the dictionary. Fig. 3 shows a flowchart of an example of performing word segmentation on the Chinese web page data to be processed according to the present invention.
In the example shown in Fig. 3, the entries in the dictionary are stored as a Chinese dictionary in the form of a TRIE index tree. The Chinese dictionary comprises a first-character hash table and TRIE index tree nodes.
The first-character hash function is derived from the Unicode code of the Chinese character, so a single hash operation directly locates the character's position in the first-character hash table. A unit of the first-character hash table contains two items: an entry count (2 bytes), the number of words that begin with this character; and a first-entry pointer (4 bytes), pointing to the root node of the character's TRIE index tree.
A TRIE index tree node is an array, sorted by key, of units with the following structure: key (2 bytes), a single Chinese character, ordered by its Unicode code; subtree size (2 bytes), the number of distinct words that have as their prefix the string formed by the keys from the root node to the current unit; subtree pointer (4 bytes), pointing to the subtree when the subtree size is non-zero, and to a leaf otherwise.
Fig. 3 shows the process of looking up an arbitrary word w[n] in the TRIE tree, where n is the number of characters in the word.
As shown in Fig. 3, first, in step S310, i is set to 1. Then, in step S320, the root node of the TRIE index tree for w[1] is obtained from the first-character hash table and denoted P. Then, in step S330, i is incremented by 1, and the process proceeds to step S340.
In step S340, a binary search for w[i] is performed among the keys of node P. Then, in step S350, it is determined whether a key matching w[i] exists among the keys of node P. If a key of node P matches w[i], P is set to the root node of the subtree corresponding to that key, and the process returns to step S330. Otherwise, P is treated as a leaf node, and the process proceeds to step S360.
In step S360, it is determined whether i is greater than n. If i is greater than n, the query is considered successful, and w[n] is an entry in the dictionary. If i < n, the query is considered to have failed, and w[n-1] is determined to be an entry in the dictionary.
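A longest-match lookup of this kind can be sketched with a simplified trie. The sketch is an illustration under stated assumptions rather than the structure in Fig. 3: nested Python dicts stand in for both the first-character hash table and the sorted node arrays with binary search, and the names `build_trie`, `longest_match`, and the `END` marker are hypothetical.

```python
END = ""  # hypothetical marker key: "a dictionary word ends at this node"

def build_trie(words):
    """Build a nested-dict trie from an iterable of dictionary words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def longest_match(trie, text, start=0):
    """Return the length of the longest dictionary word beginning at
    text[start], or 0 if no dictionary word begins there."""
    node = trie
    best = 0
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no key matches this character: stop descending
        if END in node:
            best = i - start + 1  # remember the longest complete word so far
    return best
```

Tracking `best` separately from the current depth matters because a path in the trie can continue past a complete word without reaching another one; the lookup then falls back to the longest complete word it has seen.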
The word segmentation process has been described above with reference to Fig. 3, but this example is only an illustration; the word segmentation can also be performed in other ways known in the art.
Returning to Fig. 2, after word segmentation has been performed on the data currently being processed in step S220, it is determined in step S230 whether the data contains a word segment that begins with the first character and matches a word in the preset dictionary. When such a word segment exists, that is, when the result of step S230 is yes, in step S240 the word segment in the data to be compressed is replaced with the Unicode code point corresponding to the matching word.
When no such word segment exists, that is, when the result of step S230 is no, in step S250 the first character in the data to be compressed is replaced with the Unicode code point of that character.
Then, in step S260, the part that has been replaced with a Unicode code point is removed from the data currently being processed, and the remainder becomes the next data to be processed. Subsequently, in step S270, it is determined whether the next data to be processed obtained after the replacement is empty.
If the next data to be processed is empty, the flow ends. If it is not empty, the process returns to step S220 and repeats for the next data to be processed, until all of the obtained Chinese web page data has been replaced with Unicode code points.
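The loop of steps S220 to S270 can be sketched as below. This is a minimal sketch, not the patented implementation: the function name `encode` and the `word_to_cp` mapping (word to assigned code point) are assumptions, and the longest match is found by trying candidate lengths from longest to shortest instead of walking a TRIE.

```python
def encode(text, word_to_cp):
    """Greedy longest-match encoding of text into a string of code points.

    word_to_cp is a hypothetical mapping from dictionary words to the
    code points assigned to them.
    """
    max_len = max(map(len, word_to_cp), default=1)
    out = []
    pos = 0
    while pos < len(text):  # S270: stop once nothing is left to process
        # S220: look for the longest dictionary word starting at pos.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            word = text[pos:pos + length]
            if word in word_to_cp:  # S230 yes -> S240: emit the word's code point
                out.append(chr(word_to_cp[word]))
                pos += length       # S260: drop the replaced part
                break
        else:                       # S230 no -> S250: keep the character's own code point
            out.append(text[pos])
            pos += 1                # S260
    return "".join(out)
```

Advancing an index into the text, rather than physically removing the replaced prefix, gives the same result as step S260 without copying the remaining data on every iteration.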
In the present invention, the Chinese web page data is usually transmitted in UTF-8 format. In other embodiments of the present invention, the data may also be transmitted in other formats, such as UTF-16. In UTF-8, each Chinese character occupies 3 bytes; if words are used as the basic transmission unit, each word likewise occupies only three or four bytes. The benefit obtained by the encoding process according to the present invention is described below, taking text transmission in UTF-8 as an example.
Fig. 4 shows the Chinese web page data before encoding in an example of the encoding process according to the present invention.
Fig. 4 shows a passage of Chinese web page data taken from Sina News. The passage contains 78 characters; because each character occupies 3 bytes, its total size is 78 × 3 = 234 bytes.
Word segmentation is then performed on the Chinese web page data of Fig. 4 in the manner shown in Fig. 5. As shown in Fig. 5, the three-character word "Philippines" is identified first and replaced with 59020 (0xe68c), so that the 9 bytes occupied by its three characters are reduced to at most 4 bytes. Similarly, when the five-character word "exclusive economic zone" is identified, it is replaced with 207045 (0x328c5), so that 15 bytes are replaced with 4 bytes. The rest of the Chinese web page data in Fig. 4 is segmented in the same way.
Fig. 6 shows the result obtained after the above word segmentation, with words separated by spaces. As can be seen from Fig. 6, after encoding according to the present invention, the 78 characters of Fig. 4 are broken down into 41 words. In UTF-8, each word occupies only three or four bytes, so the encoded text is at most 41 × 4 = 164 bytes. The saving is therefore at least (234 - 164) / 234 ≈ 30%. Note that the encoding of the present invention segments and encodes at the same time; that is, as soon as a word segment is obtained, it is replaced with its Unicode code point. After all word segmentation is complete, what is obtained is therefore a Unicode encoded stream rather than the result shown in Fig. 6. Fig. 6 shows the word segments in place of the Unicode code points only to make the invention easier to understand.
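The size estimate for this example can be reproduced directly. The variable names are assumptions introduced for the sketch; the figures (78 characters, 41 words, 3 bytes per Chinese character, at most 4 bytes per encoded word) are the ones stated in the text.

```python
chars_in_passage = 78      # characters in the passage of Fig. 4
words_after_encoding = 41  # word segments in Fig. 6

original_bytes = chars_in_passage * 3         # each Chinese character: 3 bytes in UTF-8
encoded_bytes_max = words_after_encoding * 4  # each encoded word: at most 4 bytes

# Fraction of bytes saved in the worst case (every word costing 4 bytes).
saving = (original_bytes - encoded_bytes_max) / original_bytes
```

Because words assigned to the basic-plane private use area cost only 3 bytes, 164 bytes is an upper bound on the encoded size, and the roughly 30% figure is a lower bound on the saving for this passage.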
As can be seen from the above, compared with the prior art, in which the original Chinese web page is compressed directly and then transmitted, re-encoding the original Chinese web page with the encoding method according to the present invention and then compressing it before transmission makes the text to be transmitted smaller and thereby reduces the volume of transmitted data.
Fig. 7 shows a block diagram of a Chinese web page data encoding device 700 according to the present invention. As shown in Fig. 7, the Chinese web page data encoding device 700 comprises a word segmentation unit 710, an encoding unit 720, and a current-data updating unit 730.
The word segmentation unit 710 is configured to perform word segmentation on the Chinese web page data currently being processed, starting from its first character and according to a preset dictionary, to determine whether the data contains a segment that begins with that first character and matches a word in the preset dictionary. In a preferred embodiment of the present invention, each word in the dictionary is assigned in advance a Unicode code from the private use area or the reserved area of the Unicode code space.
The encoding unit 720 is configured to act on the Chinese web page data currently to be compressed as follows: when the data contains a segment that begins with the first character and matches a word in the preset dictionary, it replaces that segment with the Unicode code corresponding to the matched word; when no such segment exists, it replaces the first character with the Unicode code of that character.
The current-data updating unit 730 is configured to remove the part that has been replaced with a Unicode code from the Chinese web page data currently being processed, and to take the remainder as the next Chinese web page data to be processed.
When the Chinese web page data encoding device 700 according to the present invention encodes the acquired Chinese web page data to be compressed, the processing of the word segmentation unit 710, the encoding unit 720, and the current-data updating unit 730 is repeated, starting from the first character of the acquired data, until all of the acquired Chinese web page data has been replaced with Unicode codes.
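The loop carried out by units 710, 720, and 730 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `encode`, the sample dictionary `LEXICON`, and the code point assigned to "总统" are assumptions made for the example, and the Chinese word paired with 0xE68C is a guess at the original of the "Philippines" example given in the description. Choosing the longest match follows claim 3.

```python
def encode(text: str, lexicon: dict[str, str]) -> str:
    """Greedy segment-and-replace encoder (illustrative sketch only)."""
    max_len = max((len(word) for word in lexicon), default=1)
    out = []
    remaining = text  # the data currently being processed
    while remaining:
        # Segmentation step: look for a dictionary word that starts at the
        # first character; trying longer candidates first gives the longest match.
        match = None
        for n in range(min(max_len, len(remaining)), 0, -1):
            if remaining[:n] in lexicon:
                match = remaining[:n]
                break
        if match is not None:
            out.append(lexicon[match])   # replace the segment with the word's code
            consumed = len(match)
        else:
            out.append(remaining[0])     # keep the character's own Unicode code
            consumed = 1
        # Update step: drop the part that has already been replaced.
        remaining = remaining[consumed:]
    return "".join(out)


# Hypothetical dictionary: words mapped to private use area code points.
LEXICON = {"菲律宾": "\ue68c", "总统": "\ue001"}

encoded = encode("菲律宾总统说", LEXICON)
print(len("菲律宾总统说".encode("utf-8")), len(encoded.encode("utf-8")))
```

Re-slicing `remaining` on every iteration mirrors the wording of the description; an index-based scan would avoid the repeated copying on long pages.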
Fig. 8 shows a block diagram of an intermediate server 10 according to the present invention. As shown in Fig. 8, the intermediate server 10 comprises the Chinese web page data encoding device 700 shown in Fig. 7.
Fig. 9 shows a flowchart of a Chinese web page data decoding method according to the present invention.
As shown in Fig. 9, at step S910 the mobile terminal receives, from the intermediate server, the Unicode code stream produced by the Chinese web page data encoding method described above. After receiving the Unicode code stream, the mobile terminal decodes it into the corresponding Chinese web page data according to the dictionary preset in the mobile terminal, where the dictionary preset in the mobile terminal is identical to the dictionary preset in the intermediate server.
Fig. 10 shows a block diagram of a Chinese web page data decoding device 1000 according to the present invention. As shown in Fig. 10, the Chinese web page data decoding device 1000 comprises a receiving unit 1010 and a decoding unit 1020.
The receiving unit 1010 receives, from the intermediate server, the Unicode code stream produced by the Chinese web page data encoding method described above. After the Unicode code stream is received, the decoding unit 1020 decodes it into the corresponding Chinese web page data according to the dictionary preset in the mobile terminal, where the dictionary preset in the mobile terminal is identical to the dictionary preset in the intermediate server. For example, after the segmentation-based encoding shown in Fig. 5, when the Unicode code stream received by the mobile terminal (browser client) contains "0xe68c", it is decoded as "Philippines".
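Decoding is the reverse lookup of the same dictionary. The sketch below assumes the dictionary is held as a word-to-code mapping on both sides; the name `decode`, the sample `LEXICON`, and the Chinese word paired with 0xE68C are illustrative assumptions rather than details taken from the patent.

```python
def decode(stream: str, lexicon: dict[str, str]) -> str:
    """Expand dictionary codes back into words (illustrative sketch only)."""
    # Invert the shared word-to-code dictionary into a code-to-word table.
    reverse = {code: word for word, code in lexicon.items()}
    # Code points that are not dictionary codes are ordinary characters
    # and pass through unchanged.
    return "".join(reverse.get(ch, ch) for ch in stream)


# Hypothetical dictionary, identical on the server and the terminal.
LEXICON = {"菲律宾": "\ue68c", "总统": "\ue001"}

print(decode("\ue68c\ue001说", LEXICON))
```

The round trip only works if both sides hold exactly the same dictionary, which is why the description requires the terminal's dictionary to be identical to the server's.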
Fig. 11 shows a block diagram of a mobile terminal 20 according to the present invention. As shown in Fig. 11, the mobile terminal 20 comprises the Chinese web page data decoding device 1000 shown in Fig. 10.
With the Chinese web page data encoding method according to the present invention, Chinese web page content can be encoded using a preset dictionary in which each word is assigned a Unicode code point from the private use area or the reserved area of the Unicode code space. This reduces the space occupied by the encoded data stream, and thus reduces the storage space and transmission traffic of Chinese web page data.
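One way to build such a dictionary is sketched below, assuming the words arrive already sorted by descending frequency so that the most frequent words receive the shorter three-byte codes, as claims 4 and 5 describe. The ranges are the standard Unicode private use areas; the function name `build_lexicon` is hypothetical, and the fallback to the reserved area (claim 6) is deliberately left out because which code points are unassigned depends on the Unicode version.

```python
from itertools import chain

# Standard Unicode private use areas.
BMP_PUA = range(0xE000, 0xF8FF + 1)            # basic plane, 3 bytes in UTF-8
PLANE15_PUA = range(0xF0000, 0xFFFFD + 1)      # supplementary plane 15, 4 bytes
PLANE16_PUA = range(0x100000, 0x10FFFD + 1)    # supplementary plane 16, 4 bytes


def build_lexicon(words_by_frequency):
    """Assign code points in frequency order: basic plane first, then planes 15 and 16."""
    code_points = chain(BMP_PUA, PLANE15_PUA, PLANE16_PUA)
    # zip stops at the shorter input, so words beyond the available
    # private use code points are silently left out of this sketch.
    return {word: chr(cp) for word, cp in zip(words_by_frequency, code_points)}


lexicon = build_lexicon(["的", "中国", "菲律宾"])
print({word: hex(ord(code)) for word, code in lexicon.items()})
```

The basic-plane private use area holds 6,400 code points, so only the 6,400 most frequent words get three-byte codes under this scheme.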
In addition, the mobile terminal of the present invention can typically be any of various handheld terminal devices, such as a mobile phone or a personal digital assistant (PDA), so the scope of protection of the present invention should not be limited to a particular type of mobile terminal.
In addition, the method according to the present invention can also be implemented as a computer program executed by a CPU. When executed by the CPU, the computer program performs the functions defined above in the method of the present invention.
In addition, the above method steps and system units can also be implemented with a controller and a computer-readable storage device that stores a computer program causing the controller to carry out the above steps or unit functions.
In addition, it should be understood that the computer-readable storage device described herein (for example, a memory) can be volatile memory or nonvolatile memory, or can include both. By way of example and not limitation, nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random-access memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in many forms, such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to include, without being limited to, these and other suitable types of memory.
Those skilled in the art will also appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both. To illustrate this interchangeability of hardware and software clearly, various illustrative components, blocks, modules, circuits, and steps have been described generally in terms of their functionality. Whether such functionality is implemented as software or as hardware depends on the particular application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in various ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Although the foregoing disclosure shows exemplary embodiments of the present invention, it should be noted that various changes and modifications can be made without departing from the scope of the present invention as defined by the claims. The functions, steps, and/or actions of the method claims according to the embodiments described herein need not be performed in any particular order. Furthermore, although elements of the present invention may be described or claimed in the singular, the plural is also contemplated unless limitation to the singular is explicitly stated.
Although the embodiments according to the present invention have been described above with reference to the drawings, those skilled in the art will understand that various improvements can be made to the embodiments proposed above without departing from the content of the present invention. Therefore, the scope of protection of the present invention should be determined by the content of the appended claims.

Claims (12)

1. A Chinese web page data encoding method, comprising:
starting from the first character of the acquired Chinese web page data to be compressed, repeating the following process until all of the acquired Chinese web page data has been replaced with Unicode codes:
performing word segmentation on the Chinese web page data currently being processed, starting from its first character and according to a preset dictionary, to determine whether there is a segment that begins with the first character and matches a word in the preset dictionary;
when there is a segment that begins with the first character and matches a word in the preset dictionary, replacing, in the Chinese web page data currently to be compressed, that segment with the Unicode code corresponding to the matched word, or, when there is no such segment, replacing, in the Chinese web page data currently to be compressed, the first character with the Unicode code of the first character; and
removing the part that has been replaced with a Unicode code from the Chinese web page data currently being processed, and taking the remainder as the next Chinese web page data to be processed.
2. The Chinese web page data encoding method of claim 1, wherein each word in the dictionary is assigned in advance a Unicode code from the private use area or the reserved area of the Unicode code space.
3. The Chinese web page data encoding method of claim 1, wherein the determined segment that matches a word in the dictionary and begins with the first character of the Chinese web page data currently being processed is the longest segment beginning with the first character that matches a word in the dictionary.
4. The Chinese web page data encoding method of claim 1, wherein the words in the dictionary are arranged by word frequency and are assigned Unicode codes in that order,
wherein the words are preferentially assigned Unicode codes in the private use area, and after all Unicode codes in the private use area have been assigned, Unicode codes in the reserved area are assigned.
5. The Chinese web page data encoding method of claim 4, wherein the private use area comprises one private use area located in the basic plane and two private use areas located in supplementary planes; a Unicode code in the private use area of the basic plane occupies three bytes, and a Unicode code in a private use area of a supplementary plane occupies four bytes; the words are preferentially assigned Unicode codes in the private use area of the basic plane, and after all Unicode codes in the private use area of the basic plane have been assigned, Unicode codes in the private use areas of the supplementary planes are assigned.
6. The Chinese web page data encoding method of claim 5, wherein the Unicode codes in the reserved area are assigned in order from back to front.
7. The Chinese web page data encoding method of claim 1, wherein the Chinese web page data is transmitted in UTF-8 format.
8. A Chinese web page data encoding device, comprising:
a word segmentation unit, configured to perform word segmentation on the Chinese web page data currently being processed, starting from its first character and according to a preset dictionary, to determine whether there is a segment that begins with the first character and matches a word in the preset dictionary;
an encoding unit, configured to replace, in the Chinese web page data currently to be compressed, the segment with the Unicode code corresponding to the matched word when there is a segment that begins with the first character and matches a word in the preset dictionary, or to replace, in the Chinese web page data currently to be compressed, the first character with the Unicode code of the first character when there is no such segment; and
a current-data updating unit, configured to remove the part that has been replaced with a Unicode code from the Chinese web page data currently being processed, and to take the remainder as the next Chinese web page data to be processed,
wherein, starting from the first character of the acquired Chinese web page data to be compressed, the processing of the word segmentation unit, the encoding unit, and the current-data updating unit is repeated until all of the acquired Chinese web page data has been replaced with Unicode codes.
9. An intermediate server, comprising the Chinese web page data encoding device of claim 8.
10. A Chinese web page data decoding method, comprising:
receiving, from an intermediate server, a Unicode code stream encoded by the Chinese web page data encoding method of claim 1; and
decoding the received Unicode code stream into the corresponding Chinese web page data according to a dictionary preset in a mobile terminal,
wherein the dictionary preset in the mobile terminal is identical to the dictionary preset in the intermediate server.
11. A Chinese web page data decoding device, comprising:
a receiving unit, configured to receive, from an intermediate server, a Unicode code stream encoded by the Chinese web page data encoding method of claim 1; and
a decoding unit, configured to decode the received Unicode code stream into the corresponding Chinese web page data according to a dictionary preset in the Chinese web page data decoding device, wherein the dictionary preset in the Chinese web page data decoding device is identical to the dictionary preset in the intermediate server.
12. A mobile terminal, comprising the Chinese web page data decoding device of claim 11.
CN201210361682.XA 2012-09-25 2012-09-25 Chinese web page data encoding, coding/decoding method and system Active CN102880703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210361682.XA CN102880703B (en) 2012-09-25 2012-09-25 Chinese web page data encoding, coding/decoding method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210361682.XA CN102880703B (en) 2012-09-25 2012-09-25 Chinese web page data encoding, coding/decoding method and system

Publications (2)

Publication Number Publication Date
CN102880703A true CN102880703A (en) 2013-01-16
CN102880703B CN102880703B (en) 2016-03-16

Family

ID=47482029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210361682.XA Active CN102880703B (en) 2012-09-25 2012-09-25 Chinese web page data encoding, coding/decoding method and system

Country Status (1)

Country Link
CN (1) CN102880703B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105528420A (en) * 2015-12-07 2016-04-27 北京金山安全软件有限公司 Character encoding and decoding method and device and electronic equipment
CN105843854A (en) * 2015-03-16 2016-08-10 国家计算机网络与信息安全管理中心 Network data oriented rapid recognition system for topic document
CN108108267A (en) * 2016-11-25 2018-06-01 北京国双科技有限公司 The restoration methods and device of data
CN110601963A (en) * 2014-02-23 2019-12-20 三星电子株式会社 Message processing method and electronic device supporting same
CN111178061A (en) * 2019-12-20 2020-05-19 沈阳雅译网络技术有限公司 Multi-lingual word segmentation method based on code conversion
CN111178065A (en) * 2019-12-12 2020-05-19 中国建设银行股份有限公司 Word segmentation recognition word stock construction method, Chinese word segmentation method and device
CN112632909A (en) * 2020-10-30 2021-04-09 中核核电运行管理有限公司 Data object English coding method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101013420A (en) * 2006-12-31 2007-08-08 中国科学院计算技术研究所 Method for identifying coding form of Chinese text
CN101729075A (en) * 2008-10-10 2010-06-09 英华达(上海)电子有限公司 Data compression method, data compression device, data decompression method and data decompression device
CN101751451A (en) * 2008-12-11 2010-06-23 高德软件有限公司 Chinese data compression method and Chinese data decompression method and related devices
CN102508824A (en) * 2011-09-29 2012-06-20 苏州大学 Compression coding and decoding method and device for microblog information

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110601963A (en) * 2014-02-23 2019-12-20 三星电子株式会社 Message processing method and electronic device supporting same
US11582173B2 (en) 2014-02-23 2023-02-14 Samsung Electronics Co., Ltd. Message processing method and electronic device supporting the same
CN105843854A (en) * 2015-03-16 2016-08-10 国家计算机网络与信息安全管理中心 Network data oriented rapid recognition system for topic document
CN105843854B (en) * 2015-03-16 2019-02-05 国家计算机网络与信息安全管理中心 A kind of thematic document system for rapidly identifying of network-oriented data
CN105528420A (en) * 2015-12-07 2016-04-27 北京金山安全软件有限公司 Character encoding and decoding method and device and electronic equipment
CN108108267A (en) * 2016-11-25 2018-06-01 北京国双科技有限公司 The restoration methods and device of data
CN108108267B (en) * 2016-11-25 2021-06-22 北京国双科技有限公司 Data recovery method and device
CN111178065A (en) * 2019-12-12 2020-05-19 中国建设银行股份有限公司 Word segmentation recognition word stock construction method, Chinese word segmentation method and device
CN111178065B (en) * 2019-12-12 2023-06-27 建信金融科技有限责任公司 Word segmentation recognition word stock construction method, chinese word segmentation method and Chinese word segmentation device
CN111178061A (en) * 2019-12-20 2020-05-19 沈阳雅译网络技术有限公司 Multi-lingual word segmentation method based on code conversion
CN111178061B (en) * 2019-12-20 2023-03-10 沈阳雅译网络技术有限公司 Multi-lingual word segmentation method based on code conversion
CN112632909A (en) * 2020-10-30 2021-04-09 中核核电运行管理有限公司 Data object English coding method and device

Also Published As

Publication number Publication date
CN102880703B (en) 2016-03-16

Similar Documents

Publication Publication Date Title
CN102880703B (en) Chinese web page data encoding, coding/decoding method and system
US9223765B1 (en) Encoding and decoding data using context model grouping
CN101350624B (en) Method for compressing Chinese text supporting ANSI encode
CN107836083B (en) Method, apparatus and system for semantic value data compression and decompression
CN101783788B (en) File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device
US20130141259A1 (en) Method and system for data compression
CN106202172B (en) Text compression methods and device
US10735025B2 (en) Use of data prefixes to increase compression ratios
US11722148B2 (en) Systems and methods of data compression
CN103546161A (en) Lossless compression method based on binary processing
CN101534124B (en) Compression algorithm for short natural language
US10897270B2 (en) Dynamic dictionary-based data symbol encoding
CN115189696A (en) Hardware compression and decompression method based on Huffman decoding table
CN100578943C (en) Optimized Huffman decoding method and device
CN109981108B (en) Data compression method, decompression method, device and equipment
CN103605730A (en) XML (extensible markup language) compressing method and device based on flexible-length identification codes
US7023365B1 (en) System and method for compression of words and phrases in text based on language features
US8872679B1 (en) System and method for data compression using multiple small encoding tables
US9235610B2 (en) Short string compression
RU2437148C1 (en) Method to compress and to restore messages in systems of text information processing, transfer and storage
KR20040087503A (en) Data compression method for multi-byte character language
Shanmugasundaram et al. Text preprocessing using enhanced intelligent dictionary based encoding (EIDBE)
Arif et al. An enhanced static data compression scheme of Bengali short message
Ahn et al. Effective algorithms for cache-level compression
CN105630870B (en) Searching request processing method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200702

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 2, 16, 301 rooms, 510665 Yun Yun Road, Tianhe District, Guangdong, Guangzhou

Patentee before: GUANGZHOU UCWEB COMPUTER TECHNOLOGY Co.,Ltd.