CN1267963A - Data compression equipment and data restorer

Info

Publication number: CN1267963A
Application number: CN 00100994
Authority: CN
Inventors: 矢作裕纪; 吉田茂
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1999-03-12
Filing date: 2000-01-18
Publication date: 2000-09-27
Also published as: JP2000269822A

Abstract

The object is to compress data at a high compression rate, for example in a compressor and a decompressor for document data, and thereby reduce the required storage capacity, by using an auxiliary dictionary that stores character strings such as words specific to the document data. A static word dictionary 4 stores commonly used character strings such as ordinary words and phrases. A character string detection section 1 detects the character strings in the original document data, together with the character strings such as words and phrases contained in the static word dictionary 4, and registers them in an extended dictionary 3. From the character strings registered in the extended dictionary 3, an auxiliary dictionary registration section 2 registers those that are neither meaningless for registration nor already registered in the static word dictionary 4. After the character strings specific to the document data have been registered in the auxiliary dictionary 5, a word division section 6 divides the original document data, searching the static word dictionary 4 and the auxiliary dictionary 5 to read the data corresponding to each character string, and a variable length coding section 7 compresses that data. The compressed coded data is restored by the data restoring device.

Description

Data compression device and data recovery apparatus

The present invention relates to a data compression device and a data recovery apparatus for compressing and restoring data such as document data.

In recent years, various types of data, such as image data, have come to be processed by computers. With the spread of computer networks such as the Internet, intranets, and extranets, more and more electronic documents, such as e-mail, are in use. Such electronic documents are expected to be used ever more frequently, and document sizes are in fact growing. There is therefore a strong need for data compression technology that removes the redundant part of the data, so that storage capacity is reduced and data can be sent to a remote destination in a short time.

One existing method performs data compression by providing a static dictionary and converting data that has first been converted into fixed-length codes into variable-length codes. Fig. 1 shows this data compression method. The data compression system comprises a word division unit 50 and a variable length coding unit 51. The word division unit 50 receives the original document data and divides it into the words it contains by referring to the static dictionary. The static dictionary contains character strings such as common words and phrases registered in advance, and each matching character string is output to the variable length coding unit 51 as a fixed-length code (intermediate code).

The variable length coding unit 51 converts the supplied fixed-length codes (intermediate codes) into compressed codes, which are, for example, written to a document memory or sent to another computer over the Internet, an intranet, or the like.

When a computer receives compressed data, on the other hand, it decodes the compressed data with the decompression system shown in Fig. 2. That is, a variable length decoding unit 52 decodes the compressed codes into fixed-length codes, and a word restoring unit 53 restores the original document data by referring to the static dictionary.

In the data compression method described above, compression is performed using a static dictionary and the character strings registered in it in advance. However, because the content of the dictionary is fixed, new words and words specific to an individual document cannot be handled appropriately. In such a method those words are divided and encoded character by character, which lowers the compression ratio.

The present invention aims to provide a data compression device and a compressed data restoring device that use an auxiliary dictionary storing character strings such as the words and phrases specific to each document, and that can thereby compress data at a high compression rate, reduce the amount of data, and transfer data at high speed.

That is, according to a first aspect of the invention, the above object is achieved by providing a data compression device comprising: a static dictionary containing character strings of words and phrases in advance; a character string detecting unit for scanning the document data to be compressed and detecting character strings not contained in the static dictionary; a selection and input unit for selecting, from the character strings detected by the character string detecting unit, the character strings specific to each document and registering the selected character strings in an auxiliary dictionary; a word division unit for searching the static dictionary and the auxiliary dictionary against the document data to be compressed and converting character data registered in the static dictionary or the auxiliary dictionary into fixed-length codes; and a variable length coding unit for converting the fixed-length codes output from the word division unit into compressed codes.

With this configuration, common words and phrases are registered in advance in the static dictionary in a hierarchical structure. The character string detecting unit searches the static dictionary to detect the character strings of the original document data. New character strings not registered in the static dictionary are extracted into, for example, an extended dictionary, and only the character strings that the selection and input unit selects from among the new character strings are registered.

The selection process uses a predetermined threshold on the number of nodes and a threshold on the string length to keep meaningless character strings from being registered, so that only the character strings specific to the document data are registered in the auxiliary dictionary.

In addition, the word division unit uses the static dictionary and the auxiliary dictionary to convert into fixed-length codes not only the character strings registered in the static dictionary but also those registered in the auxiliary dictionary, and the compression process is completed on that basis.

With this configuration, the character strings specific to the original document data can also be converted into compressed codes, which provides a data compression device that compresses efficiently and reduces the amount of compressed data.

According to a second aspect of the invention, the object is achieved by providing a data recovery apparatus comprising: a static dictionary storing character strings of words and phrases in advance; an auxiliary dictionary storing character strings that were obtained by scanning the document data to be compressed, detecting the character strings of words and phrases specific to the document data that are not contained in the static dictionary, and then selecting from among the detected character strings; a decoding unit for decoding the compressed codes of the document data; and a data restoring unit that uses the static dictionary and the auxiliary dictionary to restore the fixed-length codes decoded by the decoding unit to the original document data.

With the configuration of this aspect, data converted into compressed codes by the data compression device can be decoded. After the decoding unit decodes the compressed codes, the fixed-length codes are restored to the original document data using the static dictionary, which stores common words and phrases in advance, and the newly prepared auxiliary dictionary.

With this configuration, an efficient data restoring process can be carried out on a small amount of compressed data.
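Once the fixed-length codes are available, the restoring step of this aspect amounts to a table lookup. The following is a minimal sketch under stated assumptions, not the patent's implementation: the function name `restore_document`, the ASCII placeholder strings, and the numbering scheme (static entries numbered 1..N, auxiliary entries continuing from N+1) are all hypothetical.

```python
def restore_document(codes, static_entries, aux_entries):
    """Map fixed-length codes back to text using both dictionaries."""
    # Assumed numbering: static entries are 1..N, auxiliary entries N+1, N+2, ...
    table = [None] + list(static_entries) + list(aux_entries)
    pieces = []
    for code in codes:
        if not 1 <= code < len(table):
            raise ValueError(f"code {code} is not in either dictionary")
        pieces.append(table[code])
    return "".join(pieces)


# Placeholder dictionaries: 11 static entries, then 2 auxiliary entries.
STATIC = ["E", "K", "H", "T", "D", "N", "Y", "A", "EK", "TD", "NY"]
AUX = ["EKH", "TDN"]
print(restore_document([12, 13, 7, 8], STATIC, AUX))
```

The sketch only works if the restoring side holds exactly the same auxiliary entries, in the same order, as the compressing side.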

According to a third aspect of the invention, the object is achieved by providing a data compression device comprising: a static dictionary containing character strings of words and phrases in advance; a character string detecting unit for scanning the document data to be compressed and detecting character strings not contained in the static dictionary; a selection and input unit for selecting, from the character strings detected by the character string detecting unit, the character strings specific to each document and registering the selected character strings in an auxiliary dictionary; a word division unit for searching the static dictionary and the auxiliary dictionary against the document data to be compressed and converting character data registered in the static dictionary or the auxiliary dictionary into fixed-length codes; a variable length coding unit for converting the fixed-length codes output from the word division unit into compressed codes; and a transmission unit for sending the data to a communication network with the character string data registered in the auxiliary dictionary added to the head of the compressed codes produced by the variable length coding unit.

This aspect is based on a configuration in which data converted into compressed codes by the data compression device according to the invention is sent to another computer over a communication line such as the Internet and stored by the computer on the receiving side. The structure and content of the character strings registered in the auxiliary dictionary are therefore the same as in the first aspect of the invention; the only difference lies in the configuration relating to the communication line.

That is, the data of the auxiliary dictionary prepared by the data compression device must be sent to the recipient over the communication line. The transmission unit therefore sends the data of the prepared auxiliary dictionary before the compressed codes are output.

With this configuration, the compressed data can be sent to another computer connected through a communication line. Moreover, because the character strings specific to the original document data are also converted into compressed codes, the amount of data to be sent is reduced and the data can be transmitted at high speed.
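Placing the auxiliary dictionary ahead of the compressed codes can be illustrated with a simple length-prefixed header. This is only a sketch: the byte layout below (a 2-byte entry count, then each entry as a 2-byte length followed by UTF-8 bytes) and the names `pack_output` and `unpack_output` are assumptions made for illustration, since this passage does not give the actual layout of the transmitted data.

```python
import struct


def pack_output(aux_entries, compressed):
    """Prefix the compressed codes with the auxiliary dictionary entries."""
    # Hypothetical layout: big-endian 2-byte count, then length-prefixed strings.
    parts = [struct.pack(">H", len(aux_entries))]
    for entry in aux_entries:
        raw = entry.encode("utf-8")
        parts.append(struct.pack(">H", len(raw)) + raw)
    return b"".join(parts) + compressed


def unpack_output(blob):
    """Read the auxiliary dictionary first, then return the remaining codes."""
    (count,) = struct.unpack_from(">H", blob, 0)
    offset = 2
    entries = []
    for _ in range(count):
        (size,) = struct.unpack_from(">H", blob, offset)
        offset += 2
        entries.append(blob[offset:offset + size].decode("utf-8"))
        offset += size
    return entries, blob[offset:]


blob = pack_output(["EKH", "TDN"], b"\x01\x02")
print(unpack_output(blob))
```

Because the dictionary comes first, the receiving side can rebuild its auxiliary dictionary before it sees any compressed code.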

According to a fourth aspect of the invention, the object is achieved by providing a data recovery apparatus comprising: a static dictionary storing character strings of words and phrases in advance; an auxiliary dictionary storage unit storing auxiliary dictionary entry data sent over a communication network; a decoding unit for decoding the compressed codes of the document data; and a data restoring unit that uses the static dictionary and the auxiliary dictionary to restore the fixed-length codes decoded by the decoding unit to the original document data.

This aspect restores data that was converted into compressed codes by the data compression device according to the third aspect: the data converted into compressed codes is received over a communication line such as the Internet and then restored to the original document data.

Accordingly, the auxiliary dictionary entry data supplied over the communication network is stored in the auxiliary dictionary storage unit, and the compressed codes that arrive afterwards are restored to the original document data.

With this configuration, compressed data prepared by another computer connected through a communication line can be restored, and because the available auxiliary dictionary is identical to the dictionary used in the data compression process, the restoring process is carried out efficiently.

Fig. 1 shows the system configuration of a data compression device;

Fig. 2 shows the system configuration of a data recovery apparatus;

Fig. 3 shows the system configuration of the data compression device according to the first embodiment of the invention;

Fig. 4 shows the data registered in the static dictionary;

Fig. 5 shows a model of the hierarchical structure according to an example of the invention;

Fig. 6 is a flow chart of the overall process according to the first embodiment of the invention;

Fig. 7 shows a concrete example of the character string detection process performed by the character string detecting unit;

Fig. 8 shows a concrete example of part of the document data;

Fig. 9 is a flow chart showing in detail the process of registering data in the auxiliary dictionary;

Fig. 10 is a flow chart showing in detail the calculation of character string codes;

Fig. 11 shows an example of the data structure of the auxiliary dictionary;

Fig. 12 shows an example of the data structure of the character strings of the static dictionary and the auxiliary dictionary;

Fig. 13 shows a model of the hierarchical structure of the data registered in the static dictionary and the auxiliary dictionary;

Fig. 14 shows the code space of the static dictionary and the auxiliary dictionary;

Fig. 15 shows the word division process in detail;

Fig. 16 is a flow chart of the actual process of detecting the longest character string S;

Fig. 17 shows an example of the code values corresponding to character strings;

Fig. 18 shows another example of the code values corresponding to character strings;

Fig. 19 shows the system configuration of the restoring device according to the first embodiment of the invention;

Fig. 20 is a flow chart of the restoring process;

Fig. 21 shows the system configuration of the data compression device according to the second embodiment of the invention;

Fig. 22 is a flow chart of the overall process according to the second embodiment of the invention;

Fig. 23 is a flow chart concerning the auxiliary dictionary information in an output file;

Fig. 24 shows the structure of the auxiliary dictionary portion of the output file;

Fig. 25 shows the overall structure of the output file;

Fig. 26 shows the system configuration of the data recovery apparatus according to the second embodiment of the invention;

Fig. 27 is a flow chart of the process performed by the data recovery apparatus according to the second embodiment of the invention;

Fig. 28 is a flow chart of the process of reading the auxiliary dictionary information; and

Fig. 29 shows a system configuration for performing the data compression process and the data restoring process using a storage medium.

Embodiments of the present invention are described below with reference to the accompanying drawings.

<First embodiment>

Fig. 3 shows the system configuration of the data compression device according to the first embodiment of the invention. The data compression device of this embodiment comprises a character string detecting unit 1, an auxiliary dictionary input unit 2, an extended dictionary 3, a static dictionary 4, an auxiliary dictionary 5, a word division unit 6, and a variable length coding unit 7. The character string detecting unit 1 detects the character strings of the original data to be input, by referring to the static dictionary 4. The static dictionary 4 contains character strings such as common words and phrases in advance, in a hierarchical structure (a so-called tree structure). The character string detecting unit 1 detects the character strings in the input original document data by referring to the static dictionary 4 sequentially.

The extended dictionary 3 stores all the character strings that the character string detecting unit 1 extracts from the original document data. That is, besides the character strings registered in the static dictionary 4, it may contain character strings such as words and idioms specific to the document data, as well as character strings whose meaning is unclear.

The auxiliary dictionary input unit 2 stores in the auxiliary dictionary 5 those character strings extracted into the extended dictionary 3 that are not stored in the static dictionary 4, excluding character strings whose meaning is unclear; these are, for example, character strings such as words and idioms specific to the original document data.

The auxiliary dictionary 5 stores the character strings specific to the original document data, as a result of the selection and registration process carried out by the auxiliary dictionary input unit 2. The character strings registered in the auxiliary dictionary 5 may be, for example, expressions peculiar to the document, new words, or buzzwords. They can also be registered in a hierarchical structure (a so-called tree structure).

After the character strings specific to the original document data have been registered in the auxiliary dictionary 5, the original document data is read again and the word division unit 6 divides it into words. The word division unit 6 searches the entries of the static dictionary 4, in which character strings such as common idioms are registered in advance, and the entries of the newly produced auxiliary dictionary 5, and then divides the original document data.

The variable length coding unit 7 compresses the character string data divided by the word division unit 6. The data output from the word division unit 6 is a series of fixed-length codes, one specific to each divided word, and the variable length coding unit 7 converts the fixed-length codes into compressed codes. The data converted into compressed codes by the variable length coding unit 7 is output to, for example, a document memory and is used in the restoring process described later.

The static dictionary 4 stores character strings such as common words and idioms in advance in the hierarchical structure described above. Fig. 4 shows the data registered in the static dictionary 4. The static dictionary 4 stores, for example, node indexes (hereinafter simply called nodes) and the character data registered at each node.

For example, the character code of one character (a Chinese character) is registered as the character data at node 1, the character code of the character glossed 'study' is registered as the character data at node 2, and the character code of 'merchant' (commerce) is registered as the character data at node 3. Character data is thus registered at each node, as shown in Fig. 4.

The static dictionary 4 is designed as the tree structure described above, with links between nodes: the character data registered at each node is connected to the nodes linked above it and below it. The highest-level node, however, has no higher-level link, and a lowest-level node has no lower-level link.

Fig. 5 shows a model of this hierarchical structure. Here, one character (a Chinese character) is registered at node 1. The character data 'gas' (atmosphere) at node 4 and the character data 'son' (child) at node 5 are connected to it as its lower-linked data. Likewise, a character is registered at node 2, and the character data 'meeting' at node 6 is connected to it as its lower-linked data. Through the data joined by these links, the static dictionary 4 contains in advance ordinary character strings of words and phrases such as 'electric', 'electronics', and 'association'.
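The linked-node layout of Fig. 4 and Fig. 5 can be pictured as a small trie in which every node carries an index and one character. The following is a minimal sketch, assuming that every node is itself a dictionary entry and that nodes are numbered in insertion order; the class names, the method names, and the ASCII placeholder characters are hypothetical and stand in for the characters of the figures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    index: int                                     # node number, as in Fig. 4
    char: str                                      # character data at this node
    children: dict = field(default_factory=dict)   # lower links: char -> Node


class Trie:
    def __init__(self):
        self.root = Node(0, "")      # index 0 is reserved for "no entry"
        self.nodes = [self.root]

    def insert(self, word):
        """Add a string, creating nodes as needed; return its node index."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                child = Node(len(self.nodes), ch)
                node.children[ch] = child
                self.nodes.append(child)
            node = node.children[ch]
        return node.index

    def longest_match(self, text, pos=0):
        """Follow lower links from pos; return (matched length, node index)."""
        node = self.root
        i = pos
        while i < len(text) and text[i] in node.children:
            node = node.children[text[i]]
            i += 1
        return i - pos, node.index


trie = Trie()
# Nodes 1..3 hold single characters; nodes 4..6 hang below nodes 1 and 2.
for word in ["a", "b", "c", "ax", "ay", "bz"]:
    trie.insert(word)
print(trie.longest_match("axq"))
```

Here "ax" plays the role of a two-character word whose second character sits at node 4, below node 1.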

The processing operation of this embodiment with the above configuration is described below with reference to the accompanying flow charts.

Fig. 6 is a flow chart of the overall process according to this embodiment. In this embodiment the process is completed by reading the original document data twice. In the first read (the first pass), the character strings specific to the original document data, such as idioms, are extracted, and the extracted character strings are registered in the auxiliary dictionary 5.

That is, the character string detecting unit 1 detects character strings (step 1; steps are hereinafter denoted by S). All the character strings contained in the original document data are detected and assigned to the extended dictionary 3. The auxiliary dictionary input unit 2 applies a selection process to the character strings assigned to the extended dictionary 3 to extract the character strings that are meaningful and specific to the original document data, and registers the extracted character strings in the auxiliary dictionary 5 (S2). In the second read (the second pass), a word division process is carried out using the auxiliary dictionary 5 (S4), and the fixed-length code data obtained by the processes (S3 and S4) is converted into compressed codes by the variable length coding unit 7 (S5).

In the flow chart of Fig. 6, the processes enclosed by thick lines are the ones specific to this embodiment.

Each process is described in detail below. Fig. 7 shows a concrete example of the character string detection process (S1). In this process, all the characters and all the character strings in the static dictionary 4 are first assigned to the extended dictionary 3 (S11). The pointer into the input data is initialized to position 1, and the dictionary number n is set to N. N is the number of character string entries registered in advance in the static dictionary 4, that is, the dictionary number of the last character string registered in the static dictionary 4.

Next, the longest character string S that matches a dictionary character string at the pointer position is detected (S12). At initialization the pointer position is 1, as described above, so the character strings of the original document data are searched for the longest character string S starting at the initial position (1).

For example, the text shown in Fig. 8 is part of the original document data. In this example, pointer position 1 is the start of the original document data and points to the first character (a Chinese character), and the longest character string S starting at the initial position (1) is searched for. The static dictionary 4 is searched for that character, and node 1 is found to hold it. The pointer position is then updated, and it is checked whether the character data 'gas' (atmosphere) at pointer position 2 is registered in the static dictionary 4. In this example the corresponding data is found at node 4, and it is then checked whether the character data 'は' at pointer position 3 is registered in the static dictionary 4. If, for example, the character data 'は' is not registered there, the first longest character string S starting at pointer position 1 is 'electric'.

Next, the dictionary number n is incremented from N to N+1 (S13), C is set to the character that follows the character string S, and the string SC is added to the dictionary under dictionary number N+1 (S14). The pointer is then moved to the character that follows the character string S. In this step, therefore, the character data 'は', which forms part of the character string 'electric は', is extracted as corresponding to dictionary number N+1.

Control then returns to process (S12), and the longest matching character string S is searched for starting from the character at pointer position 3, that is, the character data 'は'. If the static dictionary 4 contains no character string that continues from 'は', the longest character string S is 'は'; the dictionary number n is incremented to N+2 (S13), and the character string SC is added to the dictionary and extracted as a candidate for registration under dictionary number N+2 (S14). In this case the character string 'は the present' is extracted.

By repeating similar processing, in the example of Fig. 8 the character strings 'today must' and 'necessary な' are extracted into the extended dictionary 3 in turn.
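The first pass described in steps S11 to S14 can be sketched as follows. This is an illustrative reading rather than the patent's code: the function name `detect_strings`, the dictionary represented as a plain Python `dict`, the fallback for a character that is missing from the dictionary, and the ASCII placeholder text are all assumptions. "EK", "TD", and "NY" stand in for two-character words in the static dictionary.

```python
def detect_strings(text, static_entries):
    """Pass 1: return the extended dictionary as {string: dictionary number}."""
    # S11: seed the extended dictionary with the static entries, numbered 1..N.
    extended = {s: i for i, s in enumerate(static_entries, start=1)}
    n = len(extended)
    max_len = max(map(len, extended), default=1)
    pos = 0
    while pos < len(text):
        # S12: longest entry S matching the text at the pointer.
        s = text[pos]  # assumption: an unknown single character still counts as S
        for length in range(min(max_len, len(text) - pos), 1, -1):
            if text[pos:pos + length] in extended:
                s = text[pos:pos + length]
                break
        end = pos + len(s)
        if end < len(text):
            sc = s + text[end]           # S followed by the next character C
            if sc not in extended:
                n += 1                   # S13: next dictionary number
                extended[sc] = n         # S14: add SC to the dictionary
                max_len = max(max_len, len(sc))
        pos = end                        # pointer moves to the character after S
    return extended


STATIC = ["E", "K", "H", "T", "D", "N", "Y", "A", "EK", "TD", "NY"]
extended = detect_strings("EKHTDNYA", STATIC)
print({s: k for s, k in extended.items() if k > len(STATIC)})
```

With these placeholders the new entries are "EKH", "HT", "TDN", and "NYA", which mirrors the way the text's example extracts a word plus its following character at each step.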

When the extraction process has been completed for all the character strings of the original document data, the registration process for the auxiliary dictionary (S2 in Fig. 6) is carried out. The flow chart in Fig. 9 shows this process in detail. It is performed by the auxiliary dictionary input unit 2, which determines which of the character strings extracted into the extended dictionary 3 are to be registered in the auxiliary dictionary 5 and then registers them. Because the data up to node N has already been registered in the static dictionary 4, the data from node N onward is what is processed.

That is, the nodes are traversed in link order through the hierarchical structure. When a node n greater than N is reached, the character string selection process is carried out ('no' in step S21): it is determined whether the number of child nodes exceeds a threshold (S22) and whether the length of the character string being processed exceeds a threshold (S23). If the determination (S22) is 'no', the leading characters of the character string being processed are unlikely to form part of an idiom. If the determination (S23) is 'no', the character string is likely to be a meaningless character string. In such a case the character string is not registered, even though it was extracted into the extended dictionary 3 (S24, after 'no' in S22 and 'no' in S23). For example, when the string length threshold in the above example is 2, the character string 'は the present' is excluded from registration, whereas a character string such as 'necessary な' is registered.

A character string that satisfies the above conditions, that is, one for which both determinations (S22 and S23), or either of them, is 'yes', is selected for registration in the auxiliary dictionary 5. This processing is repeated until the determination has been made for every node n (n > N).

When the determination has been completed for every node n (n > N) ('yes' in S21), the character strings that satisfy the above conditions are registered in the auxiliary dictionary 5 (S25).
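The selection rule of steps S21 to S25 can be sketched as a filter over the new entries of the extended dictionary. This is a minimal sketch under stated assumptions: the function name `select_for_auxiliary` and the child-count threshold of 1 are hypothetical (the text gives no value for it), while the length threshold of 2 follows the example in the text. The extended dictionary is assumed to be a `dict` from string to dictionary number.

```python
def select_for_auxiliary(extended, n_static, child_threshold=1, length_threshold=2):
    """Return the new strings (number > n_static) that pass S22 or S23."""

    def child_count(s):
        # Children of s in the tree: entries extending s by exactly one character.
        return sum(1 for t in extended if len(t) == len(s) + 1 and t.startswith(s))

    selected = []
    for s, number in extended.items():
        if number <= n_static:
            continue                                        # static entry, skip
        many_children = child_count(s) > child_threshold    # S22
        long_enough = len(s) > length_threshold             # S23
        if many_children or long_enough:
            selected.append(s)                              # registered in S25
        # otherwise the string is dropped (S24)
    return selected


static = ["E", "K", "H", "T", "D", "N", "Y", "A", "EK", "TD", "NY"]
extended = {s: i for i, s in enumerate(static, start=1)}
extended.update({"EKH": 12, "HT": 13, "TDN": 14, "NYA": 15})
print(select_for_auxiliary(extended, n_static=len(static)))
```

In this run the two-character entry "HT" has no children and does not exceed the length threshold, so it is dropped, in the same way that the two-character string in the text's example is excluded.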

Next, the codes of the character strings corresponding to the nodes n (n > N) are calculated (S26). Fig. 10 is a flow chart showing the character string code calculation in detail. First, the last character string YYY of the static dictionary 4 is input (S26-1), and the code l of the words registered in the auxiliary dictionary 5 is input (S26-2). Then the minimum value m satisfying 2^m > l is calculated (S26-3). For example, if the code l of the words registered in the auxiliary dictionary 5 is 1000, the minimum value m is 10; if it is 2000, the minimum value m is 11. The value of m set in this way is substituted into the formula that sets the code of each character string in the auxiliary dictionary 5 (S26-4). The calculated code values are registered in the auxiliary dictionary 5 (S27 in Fig. 9).
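The calculation in S26-3 is the smallest bit width m with 2^m > l. A short sketch, with the hypothetical function name `min_code_bits`, reproduces the two values quoted in the text; the later formula of S26-4 is not covered here.

```python
def min_code_bits(l):
    """Smallest m such that 2**m > l, for a non-negative integer l."""
    m = 0
    while (1 << m) <= l:
        m += 1
    return m


for l in (1000, 2000):
    print(l, min_code_bits(l))
```

For non-negative integers this is the same value as Python's built-in `int.bit_length()`.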

Through the above processing, the new character strings are registered in the auxiliary dictionary 5. Fig. 11 shows an example of the data structure of the auxiliary dictionary 5 produced in this way. Fig. 12 shows an example of the structure of the character string data of the static dictionary 4 and the auxiliary dictionary 5. Fig. 13 shows a model of the tree structure of the data registered in the static dictionary 4 and the auxiliary dictionary 5. In this example, a character (a Chinese character) is registered at node 1, and the character data 'gas' (atmosphere) at node 4 and the character data 'son' (child) at node 5 are connected to it as lower-linked data. The character data 'words' (talk) is additionally connected at node N+3. Further, for the character data 'son' (child) at node 5, the character data 'go out' (distribution) registered at node N+6 is connected as lower-linked data, and the character data 'version' (distribution) registered at node N+7 is connected as the lower-linked data of 'go out'.

For the other nodes, similarly, the character data 'ask' (problem) is newly registered at node N+2 in the auxiliary dictionary 5 as the lower-linked data of the character data (study) at node 2. The data 'on' is newly registered at node N+4 in the auxiliary dictionary 5 as the lower-linked data of the character data 'shell' (sales) at node 7, and the data 'hand' (hand) is registered at node N+5 as the lower-linked data of the character data 'on' (above). The data registered in the auxiliary dictionary 5 is shown as shaded circles.

Fig. 14 shows the code space of the static dictionary 4 and the auxiliary dictionary 5. The entries of the static dictionary 4 and the auxiliary dictionary 5 are registered as code data; for example, 2-byte characters are registered. ESC in Fig. 14 denotes the escape probability. It indicates the probability that the auxiliary dictionary 5 is needed, and it is set in advance to a predetermined value.

Next, the second read (the second pass) of the original data is carried out using the static dictionary 4 and the newly produced auxiliary dictionary 5.

Fig. 15 shows the word division process (S3 and S4 in Fig. 6) in detail. The character strings are assigned to the extended dictionary 3, the static dictionary 4, and the auxiliary dictionary 5, and the pointer into the input original document data is initialized to position 1 (S31).

Next, the longest character string S that matches a dictionary character string starting at the pointer position is detected (S32). The flow chart in Fig. 16 shows the actual process of detecting the longest character string S. For example, when the character string 'electric は' appears first in the document data, as in the above example, the first character W (pointer position 1) is the first character of 'electric' (S32-1). That character data is taken as the target character (S32-2).

Next, it is determined whether the static dictionary 4 contains a character that matches the target character (S32-3). If the static dictionary 4 has no such character, it is further determined whether a matching character exists in the auxiliary dictionary 5 (S32-4). In this case the first character (a Chinese character) is contained in the static dictionary 4, so the pointer position is updated and the same processing is carried out for the next character (a Chinese character) 'gas' (S32-5).

For the next character (Chinese character) data 'gas', the static dictionary 4 is searched (S32-5 after "Yes" in S32-3), and it is then determined whether there is a character that matches the next character 'は'. The character data 'は' is found at node N+1 in the auxiliary dictionary 5 (S32-5 after S32-4). The next character data 'the present' (now) is then searched for. In this example, neither the static dictionary 4 nor the auxiliary dictionary 5 contains a character following 'electric は', so 'electric は' is output as the longest matching character string S beginning with the leading character W (S32-6; S32 shown in Figure 15).

Control then returns to the process shown in Figure 15, and the code for the character string S is output with a fixed number of bits (S33). Thus the fixed-length data for the character string 'electric は' is output. In this case, the fixed-length code corresponding to node N+1 of the character data 'は' is output as the code corresponding to the character string 'electric は'.

Next, the pointer position is moved to the character position following the character string S (S34), and the above process is repeated with the next character data 'the present' as the target character.

By repeating the above process, the fixed-length code corresponding to the character string 'today' (today), the fixed-length code corresponding to the character string 'necessity' (necessity), and so on are output in order to the variable-length coding unit 7.
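The word division loop (S31 to S34) can be illustrated with a minimal Python sketch. It is not the implementation described in the figures: the flow in Figure 16 walks the dictionary tree one character at a time, whereas this sketch simply tries candidate substrings from longest to shortest. The function name `divide_into_codes`, the dictionary contents, and the code numbers are all hypothetical, and Latin letters stand in for the document's characters.

```python
def divide_into_codes(text, static_dict, aux_dict):
    """Greedy longest-match division of text into fixed-length codes."""
    longest = max(len(s) for s in (*static_dict, *aux_dict))
    codes = []
    pos = 0  # pointer position (S31); 0-based here, 1-based in the text
    while pos < len(text):
        # S32: try the longest candidate first, then shorter ones.
        for length in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            code = static_dict.get(candidate)      # static dictionary first (S32-3)
            if code is None:
                code = aux_dict.get(candidate)     # then auxiliary dictionary (S32-4)
            if code is not None:
                codes.append(code)                 # output the code for S (S33)
                pos += length                      # move past S (S34)
                break
        else:
            raise KeyError(f"no dictionary entry starts at position {pos}")
    return codes


# Hypothetical entries: string -> fixed-length code (node number).
STATIC = {"a": 1, "b": 2, "c": 3, "ab": 4}
AUX = {"abc": 101}  # 101 plays the role of a node numbered above N

print(divide_into_codes("abcab", STATIC, AUX))
```

Here "abc" is matched through the auxiliary dictionary and the following "ab" through the static dictionary, mirroring how 'electric は' is matched as a whole before the pointer moves on.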

When the fixed-length code data is supplied to the variable-length coding unit 7, the unit converts each fixed-length code into a compressed code (S5 shown in Fig. 6). This process converts the fixed-length codes output from the word division unit 6 into compressed codes; Figures 17 and 18 show examples of the compressed code values. Figure 17 shows the code values of character strings registered only in the static dictionary 4, and Figure 18 shows the code values of character strings registered in the auxiliary dictionary 5.

For example, when the word division unit 6 supplies the fixed-length code indicating the character string 'electric' (electric), the variable-length coding unit 7 outputs the corresponding code value 0000001. When the fixed-length code of the character string 'electronics' (electronics) is supplied, it outputs the corresponding code value 0000011. On the other hand, when the code of the character string 'electric は' is supplied, the variable-length coding unit 7 outputs the code value YYY000001 shown in Figure 18, and when the fixed-length code of the character string 'electronic publishing' (electronic publishing) is supplied, it outputs the code value YYY000010 shown in Figure 18.

When other fixed-length codes are output from the word division unit 6, the corresponding code values shown in Figures 17 and 18 are output. The compressed codes output in this way are written to, for example, a document memory not shown in the drawings.

Carrying out the above process improves the compression ratio compared with compression using only a conventional static dictionary, so less compressed data is output. As a result, the compressed data can be written to a document memory of smaller capacity.

Next, the process of restoring the compressed code (compressed data) written to, for example, the above document memory is explained.

Figure 19 shows the system configuration of the recovery apparatus according to the present invention. The apparatus comprises a variable-length decoding unit 10, a word recovery unit 11, the static dictionary 4, and the auxiliary dictionary 5. The static dictionary 4 and the auxiliary dictionary 5 are connected to the word recovery unit 11, which uses them when it restores compressed data to the original document data.

Figure 20 is a flowchart of the recovery process. First, the variable-length decoding unit 10 receives the compressed code (compressed data) stored in, for example, the document memory, applies to it the expansion process that is the inverse of the compression process, and decodes it into the original fixed-length codes (S41). In other words, by referring to Figures 17 and 18, each compressed code value is decoded into the corresponding fixed-length code (fixed-length data). For example, when the compressed code value 0000001 is received, the fixed-length code of the character string 'electric' (electric) is output. When the compressed code value 0000011 is received, the fixed-length code of the character string 'electronics' (electronics) is output. On the other hand, when the compressed code value YYY000001 is received, the fixed-length code of the character string 'electric は' is output, and when the compressed code value YYY000010 is received, the fixed-length code of the character string 'electronic publishing' (electronic publishing) is output.
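The mapping between fixed-length codes and compressed code values can be sketched as a simple table lookup in both directions. The Python below is only an illustration under stated assumptions: the four code values are the ones quoted in the text, "YYY" is kept as the source's own placeholder rather than replaced by real bits, the code stream is modelled as a string of symbols, and the string keys stand in for the fixed-length codes, whose actual values are not given. The names `CODEBOOK`, `encode`, and `decode` are hypothetical.

```python
# Keys stand in for fixed-length codes; values are the code values quoted
# in the text. "YYY" is the source's placeholder and is kept verbatim.
CODEBOOK = {
    "electric": "0000001",
    "electronics": "0000011",
    "electric は": "YYY000001",
    "electronic publishing": "YYY000010",
}


def encode(entries):
    """Concatenate the compressed code value of each entry."""
    return "".join(CODEBOOK[entry] for entry in entries)


def decode(stream):
    """Split a symbol stream back into entries.

    Works because no code value in CODEBOOK is a prefix of another.
    """
    inverse = {value: key for key, value in CODEBOOK.items()}
    entries, buffer = [], ""
    for symbol in stream:
        buffer += symbol
        if buffer in inverse:
            entries.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("stream ends in the middle of a code value")
    return entries


stream = encode(["electric", "electric は"])
print(stream)
print(decode(stream))
```

Decoding is unambiguous here only because the listed code values form a prefix-free set; how the actual code values in Figures 17 and 18 are assigned is not reproduced in this sketch.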

Next, the fixed-length code data decoded by the variable-length decoding unit 10 is supplied to the word recovery unit 11, which first searches the static dictionary 4 to carry out the word recovery process (S42). For example, when the fixed-length code supplied by the variable-length decoding unit 10 corresponds to the character string 'electric', the character codes corresponding to 'electric' are read in order from the static dictionary 4 and output as original document data. When the fixed-length code corresponds to the character string 'electronics', the character codes corresponding to 'electronics' are read in order from the static dictionary 4 and output as original document data. On the other hand, when the fixed-length code corresponds to the character string 'electric は', which is not registered in the static dictionary 4, the word recovery unit 11 searches the auxiliary dictionary 5 and reads the character codes corresponding to 'electric は' (S43). Likewise, when the code value corresponds to the character string 'electronic publishing', which is not registered in the static dictionary 4, the auxiliary dictionary 5 is searched and the character codes corresponding to 'electronic publishing' are read.

The above process is repeated, and the word recovery unit 11 restores the coded information to the original document data in order while searching the static dictionary 4 and the auxiliary dictionary 5. When all processing is finished, the original data has been completely reproduced.
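The word recovery step (S42, S43) amounts to looking each fixed-length code up in the static dictionary and falling back to the auxiliary dictionary. A minimal Python sketch follows; the function name `restore`, the code numbers, and the Latin-letter strings are hypothetical stand-ins, and plain mappings replace the tree-structured dictionaries of the embodiment.

```python
def restore(codes, static_by_code, aux_by_code):
    """Rebuild the original text from a sequence of fixed-length codes."""
    parts = []
    for code in codes:
        string = static_by_code.get(code)   # search the static dictionary first (S42)
        if string is None:
            string = aux_by_code[code]      # otherwise the auxiliary dictionary (S43)
        parts.append(string)
    return "".join(parts)


# Hypothetical entries: fixed-length code -> character string.
STATIC_BY_CODE = {1: "a", 2: "b", 3: "c", 4: "ab"}
AUX_BY_CODE = {101: "abc"}

print(restore([101, 4], STATIC_BY_CODE, AUX_BY_CODE))
```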

In the above recovery process, the original document data can be regenerated from the compressed code (compressed data). Because what is transferred from the document memory is the compressed code encoded using the auxiliary dictionary 5, the transfer can be completed in a short time.

In the above explanation, the data compression device and the data recovery apparatus according to the present invention have been described separately. However, a device that includes both of them can operate in a similar way.

<Second embodiment>

The second embodiment of the present invention is described below.

Unlike the first embodiment, the data compression device according to this embodiment sends the generated compressed data to another computer through a communication line such as the Internet, where the data is restored to the original data. The auxiliary dictionary generated on the transmitting side is also sent to the other computer through the communication line.

Figure 21 shows the system configuration of the data compression device according to this embodiment. The device comprises a character string detecting unit 21, an auxiliary dictionary input unit 22, an extended dictionary 23, a static dictionary 24, an auxiliary dictionary 25, a word division unit 26, a variable-length coding unit 27, and a supplementary unit 28. As described above, conventional character strings such as words and phrases are registered in the static dictionary 24, and the character string detecting unit 21 detects the character strings included in the original document data by referring to the static dictionary 24.

In addition, character strings such as idioms that are not registered in the static dictionary 24 are stored in the auxiliary dictionary 25. This registration is carried out by the auxiliary dictionary input unit 22, which registers the character strings peculiar to the original document data from among the character strings extracted into the extended dictionary 23.

As described above, after the character strings such as words and phrases have been registered, the original document data is read again and the word division unit 26 divides it into words. The variable-length coding unit 27 then applies the data compression process to the data divided into words.

The supplementary unit 28 reads the information generated in the auxiliary dictionary 25 according to this embodiment and outputs it to a communication line such as the Internet, transmitting this information before the compressed code output from the variable-length coding unit 27.

The configuration of the static dictionary 24, in which conventional character strings such as words and phrases are registered in advance, is similar to the configuration described above. The character strings are registered in a hierarchical structure.

The processing operation of this embodiment with the above configuration is described below with reference to flowcharts.

Figure 22 is a flowchart of the processing operation according to the second embodiment of the present invention. In this embodiment as well, the original document data is read twice. In the first reading (first pass), the character strings peculiar to the original document data are extracted and registered in the auxiliary dictionary 25. That is, the character string detection process is carried out first (step S51) to extract the character strings peculiar to the original document data, and the extracted character strings are registered in the auxiliary dictionary 25 (S52).

The character string detection process (S51) is the same as the process described in the first embodiment (see the flowchart shown in Figure 6), and extracts the character strings of the original document data into the extended dictionary 23. The process of registering the character strings peculiar to the original document data is also the same as the process described in the first embodiment (the flowchart shown in Fig. 7); only the character strings peculiar to the document are registered in the auxiliary dictionary 25. Through the above process, the data shown in Figure 12 is registered in the static dictionary 4 and the auxiliary dictionary 5.

According to this embodiment, as shown in the flowchart of Figure 22, the data registered in the auxiliary dictionary 25 is added to the header of the output file (S53). This process is carried out by the supplementary unit 28, in practice according to the flowchart shown in Figure 23.

Figure 24 shows the file format generated in this process.

First, it is determined whether the target character string is a character string that ends at a node n > N (S61). For example, the character string 'electric' ends at node 4 (n < N), so the determination is No (No in S61). On the other hand, the character string 'electric は' ends at node N+1 (n > N), so the determination is Yes (Yes in S61). In that case, the parent node at n ≤ N is retrieved. For example, for the character string 'electric は', the parent node 4 of 'は' is looked up and the character string is registered (S62, S63). In this process, the parent node (index) of 'は' and the character code of 'は' are written to the file shown in Figure 24. The codeword YYY000001 of the corresponding character string is also written (S64).

Likewise, when the character string is 'electronic publishing', the parent node 5 of 'publication' (publication) is looked up and the character string is registered. The parent node (index) of 'publication' and the character code of 'publication' are written to the file shown in Figure 24, and the codeword YYY000010 of the corresponding character string is also written.

The file format shown in Figure 24 represents part of the contents of all the entries in the auxiliary dictionary. After the information in the auxiliary dictionary has been written, the compressed code is appended after the end of the auxiliary dictionary and the file is then output. Figure 25 shows the format of the whole output file, in which the compressed code is appended to the auxiliary dictionary part holding the written auxiliary dictionary. The compressed code output from the variable-length coding unit 27, described later, is placed in the compressed code part.
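The layout of the output file, an auxiliary dictionary part followed by a compressed code part, can be sketched in Python as below. The byte-level encoding is entirely an assumption made for illustration (a 4-byte length prefix and a JSON header); the actual formats are the ones in Figures 24 and 25, which are not reproduced here. Each entry is a (parent node index, character, codeword) triple as described in the text; the function names are hypothetical, and the second entry uses the translation's gloss 'publication' in place of the original character data.

```python
import json
import struct


def build_output_file(aux_entries, compressed):
    """Auxiliary dictionary part first, compressed code part after it."""
    header = json.dumps(aux_entries, ensure_ascii=False).encode("utf-8")
    # Assumed framing: a 4-byte big-endian length marks where the header ends.
    return struct.pack(">I", len(header)) + header + compressed


def split_output_file(blob):
    """Inverse of build_output_file: returns (aux_entries, compressed)."""
    (header_len,) = struct.unpack(">I", blob[:4])
    header = blob[4:4 + header_len]
    entries = [tuple(entry) for entry in json.loads(header.decode("utf-8"))]
    return entries, blob[4 + header_len:]


# (parent node index, character, codeword), following the examples in the text.
entries = [(4, "は", "YYY000001"), (5, "publication", "YYY000010")]
blob = build_output_file(entries, b"\x01\x02")
print(split_output_file(blob))
```

Placing the dictionary part first means a receiver can register the auxiliary entries before it starts decoding the compressed code that follows.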

Next, the word division unit 26 carries out the process of dividing the original document data into character strings (S54 and S55 shown in Figure 22). This process is similar to the one described above: while the original document data is searched against the static dictionary 24 and the auxiliary dictionary 25, fixed-length codes for the character strings are generated. For example, for the character string 'electric', the corresponding fixed-length code is output by searching the static dictionary 24; for the character string 'electric は', the corresponding fixed-length code is output by searching the auxiliary dictionary 25.

The variable-length coding unit 27 converts the fixed-length codes output from the word division unit 26 into compressed codes. This process is the same as in the first embodiment. For example, for the fixed-length code of the character string 'electric', the variable-length coding unit 27 outputs the corresponding compressed code 0000001; for the fixed-length code of the character string 'electric は', it outputs the corresponding compressed code YYY000001.

Through the above process, the actual compressed code is output after the supplementary unit 28 has output the contents of the auxiliary dictionary 25. That is, the contents of the auxiliary dictionary 25 and the compressed code data are written to an output file in the format shown in Figure 25, which is output to another computer through a communication line such as the Internet, where the process of regenerating the data from the compressed code is carried out.

Thus, according to this embodiment, the character strings peculiar to the original document data are encoded on the basis of the data registered in the auxiliary dictionary 25, so the data is compressed to a very small amount. As a result, the data transfer time can be shortened when the data is transmitted through a communication line such as the Internet.

The following describes how the compressed data supplied through the communication line is restored.

Figure 26 shows the system configuration of the recovery apparatus according to this embodiment. The apparatus comprises a variable-length decoding unit 30, a word recovery unit 31, a static dictionary 34, an auxiliary dictionary 33, and an auxiliary dictionary input unit 32. The word recovery unit 31 is connected to the static dictionary 34 and the auxiliary dictionary 33 and searches them when it restores data to the original document data. The auxiliary dictionary input unit 32 registers in the auxiliary dictionary 33 the auxiliary dictionary information supplied through the communication line.

Figure 27 is a flowchart of the recovery process according to the second embodiment with the above configuration. In this embodiment, an auxiliary dictionary input process is carried out first (S71): the auxiliary dictionary input unit 32 on the data recovery apparatus side detects, for example, the data supplied in the file format shown in Figure 24 and registers it in the auxiliary dictionary 33.

Figure 28 is a flowchart of this process. The parent node (index) of each character string included in the auxiliary dictionary part is read (S81). In the example shown in Figure 24, the parent node 4 (index) of 'は' is read, and then the character code of 'は' and the codeword YYY000001 of the character string are read (S82, S83). Next, the parent node 5 (index) of 'publication' is read, followed by the corresponding character code and the codeword YYY000010 of the character string.

By repeating the above process, the data of the auxiliary dictionary 25 on the data compression device side is registered in the auxiliary dictionary 33.
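Rebuilding the auxiliary dictionary on the receiving side from (parent node, character, codeword) entries can be sketched as follows. This Python is a simplification under stated assumptions: every parent is taken to be a static-dictionary node whose full string the receiver already knows, the mapping `static_node_strings` and the function name `rebuild_aux` are hypothetical, and Latin letters stand in for the document's characters. The parent indices 4 and 5 and the two codewords follow the example in the text.

```python
def rebuild_aux(entries, static_node_strings):
    """Map each codeword to the string spelled by parent string + new character."""
    aux = {}
    for parent, char, codeword in entries:            # S81 to S83, once per entry
        aux[codeword] = static_node_strings[parent] + char
    return aux


# Hypothetical receiver-side knowledge: static node index -> string it spells.
static_node_strings = {4: "ab", 5: "ac"}
received = [(4, "c", "YYY000001"), (5, "d", "YYY000010")]

print(rebuild_aux(received, static_node_strings))
```

Sending only the parent index and the added character, rather than the whole string, works because both sides hold the same static dictionary.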

The variable-length decoding unit 30 then carries out the process of decoding the data supplied through the communication line such as the Internet (S72 shown in Figure 27). The compressed code (compressed data) supplied through the communication line is input and converted into fixed-length codes. This process is the same as the one described above. For example, the compressed code 0000001 is decoded into the fixed-length code of the character string 'electric', and the compressed code YYY000001 is decoded into the fixed-length code of the character string 'electric は'.

Next, the fixed-length codes decoded by the variable-length decoding unit 30 are supplied to the word recovery unit 31, which searches the static dictionary 34 and carries out the word recovery process (S73). For example, when the code supplied from the variable-length decoding unit 30 corresponds to the character string 'electric', the character code corresponding to 'electric' is output. Likewise, when the code corresponds to the character string 'electronics', the character code corresponding to 'electronics' is output.

On the other hand, when the code value supplied from the variable-length decoding unit 30 corresponds to the character string 'electric は', which is not registered in the static dictionary 34, the word recovery unit 31 searches the auxiliary dictionary 33 (S74). At this point the data stored in the auxiliary dictionary 25 on the data compression device side has already been written to the auxiliary dictionary 33 through the communication line such as the Internet, so the word recovery unit 31 can find the corresponding character code by searching the auxiliary dictionary 33. Likewise, when the code value indicates the character string 'electronic publishing', which is not registered in the static dictionary 34, the word recovery unit 31 searches the auxiliary dictionary 33 and can find the corresponding character code.

The character code data found in this way is output in order as the original document data. As the process continues, the word recovery unit 31 restores the code information to the original document data in order while searching the static dictionary 34 and the auxiliary dictionary 33. When all processing is finished, the recovery of the original document data is complete.

As described above, according to the second embodiment of the present invention, the information generated on the data compression device side and stored in the auxiliary dictionary is input to the computer on the receiving side through a communication line such as the Internet, and the compressed data is restored to the original document data using the auxiliary dictionary and the general static dictionary. Once the auxiliary dictionary has been transmitted, the data recovery process can be carried out in a short time.

Figure 29 shows a system in which the program of the data access process of this embodiment is stored on a portable storage medium such as a floppy disk or CD-ROM, or in the memory of an external storage device such as a hard disk, and the process according to this embodiment is realized by inserting the storage medium into a drive of the computer.

The process can also be realized by downloading the program of this embodiment from a program provider to the computer through a communication line such as the Internet, a local area network (LAN), or a wide area network (WAN).

As described above, according to the present invention, character strings of words and phrases peculiar to the document data can be registered in the auxiliary dictionary, which improves the compression ratio and reduces the amount of compressed data output.

Because only a small amount of compressed data is output, the compressed data can be stored in a document memory or the like of small capacity.

In addition, when the compressed data is transmitted through a communication line such as the Internet, it can be transmitted in a short time because the amount of data is small.

Claims

1. A data compression device comprising:

a static dictionary that contains character strings of words and phrases in advance;

a character string detecting unit that searches the document data to be compressed and detects character strings not included in said static dictionary;

a selecting and registering unit that selects, from the character strings detected by said character string detecting unit, the character strings peculiar to each document and registers the selected character strings in an auxiliary dictionary;

a word division unit that searches said static dictionary and said auxiliary dictionary for the document data to be compressed and converts the character data registered in said static dictionary or said auxiliary dictionary into fixed-length codes; and

a variable-length coding unit that converts the fixed-length codes output from said word division unit into compressed codes.

2. The device according to claim 1, further comprising:

a decoding unit that decodes the compressed codes into fixed-length codes; and

a data recovery unit that uses said static dictionary and said auxiliary dictionary to restore the fixed-length codes decoded by said decoding unit to the original document data.

3. A data recovery apparatus comprising:

a static dictionary that stores character strings of words and phrases in advance;

an auxiliary dictionary that stores character strings selected from among the character strings of words and phrases peculiar to the document data that are detected, by searching the document data to be compressed, as not included in said static dictionary;

a decoding unit that decodes the compressed codes of the document data into fixed-length codes; and

a data recovery unit that uses said static dictionary and said auxiliary dictionary to restore the fixed-length codes decoded by said decoding unit to the original document data.

4. A data compression device comprising:

a word division unit that searches said static dictionary and said auxiliary dictionary for the document data to be compressed and converts the character data registered in said static dictionary or said auxiliary dictionary into fixed-length codes;

a variable-length coding unit that converts the fixed-length codes output from said word division unit into compressed codes; and

a transmitting unit that adds the character string data registered in said auxiliary dictionary to the header of the compressed codes produced by said variable-length coding unit and transmits the resulting data to a communication network.

5. The device according to claim 4, further comprising:

an auxiliary dictionary input control unit that receives the data registered in said auxiliary dictionary and transmitted through a communication network, and registers the data in an auxiliary dictionary storage unit;

a decoding unit that decodes the compressed codes into fixed-length codes; and

a data recovery unit that uses said static dictionary and said auxiliary dictionary storage unit to restore the fixed-length codes decoded by said decoding unit to the original document data.

6. A data recovery apparatus comprising:

a static dictionary that stores character strings of words and phrases in advance;

an auxiliary dictionary storage unit that stores the auxiliary dictionary data transmitted through a communication network;

a data recovery unit that restores the fixed-length codes decoded by said decoding unit to the original document data by searching said static dictionary and the auxiliary dictionary stored in said auxiliary dictionary storage unit.

7. A data compression method comprising:

a character string detection process that uses a static dictionary containing character strings of words and phrases in advance, searches the document data to be compressed, and detects character strings not included in said static dictionary;

a selecting and registering process that selects, from the character strings detected in said character string detection process, the character strings peculiar to each document and registers the selected character strings in an auxiliary dictionary;

a word division process that searches said static dictionary and said auxiliary dictionary for the document data to be compressed and converts the character data registered in said static dictionary or said auxiliary dictionary into fixed-length codes; and

a variable-length coding process that converts the fixed-length codes obtained in said word division process into compressed codes.

8. A data recovery method comprising:

a selecting and registering process that uses a static dictionary storing character strings of words and phrases in advance, searches the document data to be compressed, detects character strings of words and phrases peculiar to the document data that are not included in said static dictionary, further selects character strings from the detected character strings, and registers the selected character strings in an auxiliary dictionary;

a decoding process that decodes the compressed codes of the document data into fixed-length codes; and

a data recovery process that uses said static dictionary and said auxiliary dictionary to restore the fixed-length codes decoded in said decoding process to the original document data.

9. A data compression method comprising:

a word division process that searches said static dictionary and said auxiliary dictionary for the document data to be compressed and converts the character data registered in said static dictionary or said auxiliary dictionary into fixed-length codes;

a variable-length coding process that converts the fixed-length codes into compressed codes; and

a transmitting process that adds the character string data registered in said auxiliary dictionary to the header of the compressed codes produced in said variable-length coding process and transmits the resulting data to a communication network.

10. A data recovery method comprising:

an auxiliary dictionary storing process that stores the auxiliary dictionary data transmitted through a communication network;

11. A computer-readable recording medium storing a program that directs a computer to carry out the following processes:

a character string detection process that uses a static dictionary containing character strings of words and phrases in advance, searches the document data to be compressed, and detects character strings not included in said static dictionary;

a variable-length coding process that converts fixed-length codes into compressed codes.