CN106549674B

CN106549674B - Data compression and decompression method for electronic medical records

Info

Publication number: CN106549674B
Application number: CN201610961205.5A
Authority: CN
Inventors: 于海龙; 李建元; 温晓岳
Original assignee: Enjoyor Co Ltd
Current assignee: Yinjiang Technology Co.,Ltd.
Priority date: 2016-10-28
Filing date: 2016-10-28
Publication date: 2019-07-23
Anticipated expiration: 2036-10-28
Also published as: CN106549674A

Abstract

The present invention relates to a data compression and decompression method for electronic medical records. Characters in the GB2312 Chinese character set, which are originally represented by an area code and a position code, are instead represented by an area number, a section number, and a position number. Each of these numbers is then identified by an ASCII character; no requirement or assumption is made about which ASCII character corresponds to which number. A single Chinese character can thus be represented by three ASCII characters, and a Chinese character string becomes an ASCII string. Compression and decompression are realized by counting and encoding these ASCII characters. The invention is oriented toward Chinese character compression: its compression efficiency does not depend on the repetition rate of Chinese characters, it also relieves the processing load on the server, and it is convenient to use and highly reliable.

Description

Data compression and decompression method for electronic medical records

Technical field

The present invention relates to the field of medical record data management, and in particular to a data compression and decompression method for electronic medical records.

Background technique

As hospital informatization matures and expands, electronic medical record information is growing at an unprecedented scale. The large-scale growth of the total volume of electronic medical record information, together with the continuous expansion of each individual record, causes problems not only in data storage and retrieval but also in transmission for applications such as remote diagnosis and cross-regional information sharing.

Compression is one of the effective means of improving the efficiency of information storage and transmission, and a variety of compression methods are in use. The most representative traditional method is the Huffman algorithm, which encodes according to the frequency with which each single character occurs in the information: high-frequency characters are represented by short codes and low-frequency characters by long codes. This statistical approach remained dominant until the end of the 1970s. Algorithms that compress information with static or dynamic dictionaries later broke this pattern; the most representative are LZ77, which compresses with a dynamic dictionary based on a sliding window, and its derivatives. These algorithms compress better the higher the repetition rate of character strings, and the algorithms used in today's mainstream compression tools are all closely related to LZ77 and its derivatives.

At present, data compression is rarely applied in electronic medical record systems in China. The main reason is that compressing the entire data of an electronic medical record system as a whole is impractical, because compressed data are not suited to the continual insert, delete, and update operations of an online application system. In addition, the information in a single case is small, ranging from a few dozen Chinese characters to a few thousand. Moreover, the character repetition rate of information such as a patient's symptoms, examinations, diagnosis, treatment, and treatment results is low, so neither the Huffman algorithm nor LZ77 and its derivatives can show their strengths.

Many experts in China have also explored and studied the compression of Chinese characters. "Research on a universal and simple compression method for Chinese texts", Journal of South China Normal University (Natural Science Edition), 2001, No. 2, offers a distinctive approach to Chinese character compression. Its core idea is to convert each Chinese character into a number and then exploit the fact that the highest 3 binary bits (the 14th, 15th, and 16th) of the two-byte stored value are all 0, so that these three bits can be discarded before storage. The strength of this scheme is that it needs no dictionary: the numeric code of a Chinese character is obtained from its area-position code through a pre-established functional relationship, which is simple and efficient. Compression ratio here means the ratio of the file size after compression to the size before compression; for example, if a 100 MB file is 90 MB after compression, the compression ratio is 90/100 × 100% = 90%. The compression ratio of that method should therefore be 13/16 × 100%, rather than the 3/16 × 100% claimed in the article. By comparison, the present invention provides a better compression ratio in electronic medical record systems.

"Order-0 model arithmetic coding of Chinese text with a dynamic alphabet", Journal of Chinese Information Processing, Vol. 14, No. 1, 2000, builds a statistical model from an analysis of existing text and combines it with the PPM algorithm to compress Chinese characters. The compression relies on a suitable statistical model, which must be built from a large body of text and is then used to predict and compress the characters in the text; this is a very effective method. In an electronic medical record system, however, records are written by many authors with different writing styles, and the symptom descriptions, diagnostic approaches, and treatments differ from one record to another, which makes it challenging to apply such a model.

The "Chinese text compression method" of application No. 201410614796.X uses a compression method based on word-frequency statistics, which is essentially a simple application of the Huffman algorithm. The key to that method is how to segment Chinese character strings that have no delimiter such as a space, and the application does not explain this. Moreover, as noted above, the character repetition rate of the symptoms, examinations, diagnosis, treatment, and treatment results in a single case is low, so even if an effective word-segmentation algorithm produces a correct segmentation, the text still cannot be compressed effectively.

Application No. 089126954, "A method suitable for compressing mixed Chinese and English document information" (from Taiwan, China), also uses Huffman coding. Its distinctive feature is that the code of each Chinese character is split into a high-order part and a low-order part, which are counted and encoded separately alongside English characters to obtain an English coding dictionary, a Chinese high-order coding dictionary, and a Chinese low-order coding dictionary. Flags are added to the output character codes to distinguish the dictionaries and prevent conflicts. That method focuses on handling mixed Chinese and English text and does not consider compression efficiency carefully: once flags are added, their effect on the decoding of the Huffman codes must be taken into account, and to keep the flags from interfering with decoding, both the flags and the Huffman codes must be constrained, which seriously reduces compression efficiency.

Summary of the invention

To overcome the above shortcomings, an object of the present invention is to provide a data compression method for electronic medical records. For the scenario of electronic medical record data compression, the invention proposes a new compression method oriented toward Chinese characters. Its compression efficiency does not depend on the repetition rate of Chinese characters, so it compresses effectively even when the characters do not repeat, while also relieving the processing load on the server.

Another object of the present invention is to provide a data decompression method for electronic medical records. The decompression method is the counterpart of the compression method described above, and it is convenient to use and highly reliable.

The present invention achieves the above objects through the following technical solution: a data compression method for electronic medical records, comprising the following steps:

(1) Create an N1 × N2 × N3 three-dimensional array arry_3rd whose three dimensions are area, section, and position, where N1 is the maximum number of areas, N2 is the maximum number of sections, and N3 is the maximum number of positions; and fill the three-dimensional array arry_3rd with characters in order;

(2) Identify the areas, sections, and positions of the three-dimensional array arry_3rd with ASCII characters;

(3) Create the character identification table dic_character from the three-dimensional array arry_3rd and the identifiers assigned in step (2);

(4) Create a sorted coding table in which, if the i-th code is $i_1 i_2 \ldots i_n$ and the j-th code is $j_1 j_2 \ldots j_n \ldots j_r$, where n and r are the code lengths and $1 \le n < r$, then $i_1 i_2 \ldots i_n \neq j_1 j_2 \ldots j_n$; the total number of codes is the maximum of N1, N2, and N3;

(5) Referring to the character identification table dic_character, convert all characters in the electronic medical record into a string of ASCII characters, then count the ASCII identifiers of the area, section, and position separately and sort each set of counts in descending order;

(6) Select a compression method according to the result of step (5) and generate the compression dictionary dic_output;

(7) Encode the ASCII characters that represent the Chinese characters according to the generated compression dictionary dic_output to complete the compression.

Preferably, N1 × N2 × N3 is 20 × 20 × 20.

Preferably, in step (1) the three-dimensional array arry_3rd is filled in order by first filling in the Chinese characters of the GB2312 code table and then filling the remaining space of arry_3rd with ASCII characters and control characters.

Preferably, the character identification table dic_character of step (3) stores the one-to-one mapping between every character in the three-dimensional array arry_3rd and the ASCII identifiers of its area, section, and position. The table is created once and used many times, and it is updated at predetermined intervals.
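The construction of dic_character can be sketched as follows. This is only an illustration under stated assumptions: the 20 identifier symbols `0-9a-j`, the helper names `gb2312_hanzi` and `build_dic_character`, the cell order, the choice of control characters, and the sample text are all hypothetical, so the identifiers produced here will not match those in the patent's Table 2.

```python
# Assumption: the patent leaves the choice of ASCII identifier symbols open;
# 20 symbols are picked here only for illustration.
SYMBOLS = "0123456789abcdefghij"
N1 = N2 = N3 = len(SYMBOLS)  # 20 x 20 x 20 = 8000 cells


def gb2312_hanzi():
    """Yield the Chinese characters of GB2312 in area/position order (areas 16-87)."""
    for area in range(16, 88):
        for pos in range(1, 95):
            raw = bytes([0xA0 + area, 0xA0 + pos])
            try:
                yield raw.decode("gb2312")
            except UnicodeDecodeError:
                continue  # skip code points that the codec does not define


def build_dic_character():
    """Fill the cells in order and return {character: 3-symbol identifier}."""
    chars = list(gb2312_hanzi())
    chars += [chr(c) for c in range(0x20, 0x7F)]  # printable ASCII
    chars += ["\n", "\t", "\r"]                   # a few control characters
    assert len(chars) <= N1 * N2 * N3
    table = {}
    for index, ch in enumerate(chars):
        area, rest = divmod(index, N2 * N3)
        section, position = divmod(rest, N3)
        table[ch] = SYMBOLS[area] + SYMBOLS[section] + SYMBOLS[position]
    return table


dic_character = build_dic_character()
reverse = {code: ch for ch, code in dic_character.items()}

text = "患者于四天前"  # hypothetical sample, not the patent's example text
encoded = "".join(dic_character[ch] for ch in text)
decoded = "".join(reverse[encoded[i:i + 3]] for i in range(0, len(encoded), 3))
```

Because the table is a one-to-one mapping, reading the identifier string back three symbols at a time recovers the original text.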

Preferably, in step (5) the counting and descending sort are performed by counting the number of times each ASCII identifier occurs and sorting the identifiers from the highest to the lowest count.

Preferably, in step (6) the compression method used to generate the compression dictionary dic_output is selected according to the distribution of character occurrences, as follows:

I) If the occurrences are concentrated on a few characters, the Huffman algorithm is used to assign binary codes to the sorted ASCII identifiers of the area, section, and position, generating the compression dictionary dic_output;

II) If the occurrences tend toward a uniform distribution, the codes of the sorted coding table are used: the sorted ASCII identifiers of the area, section, and position are mapped one-to-one, in order, to the codes of the sorted coding table, generating the compression dictionary dic_output.

Preferably, whether the occurrences are concentrated or uniform is judged by whether the four highest-ranked characters together account for more than 50% of all characters: if they do, the codes are generated with the Huffman algorithm; otherwise the sorted coding table is used.
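A minimal sketch of this selection rule for one dimension (area, section, or position) is given below. The function names, the `sorted_table` parameter (a list of prefix-free codes ordered from shortest to longest), and the strict "greater than 50%" comparison are assumptions made for illustration; the Huffman construction is the textbook one built on Python's `heapq`, not necessarily the exact variant used by the inventors.

```python
import heapq
from collections import Counter
from itertools import count


def top4_share(counts):
    """Fraction of all occurrences covered by the four most frequent identifiers."""
    total = sum(counts.values())
    return sum(n for _, n in counts.most_common(4)) / total


def huffman_codes(counts):
    """Return {symbol: bit string} for a standard Huffman code."""
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(n, next(tie), {sym: ""}) for sym, n in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, left = heapq.heappop(heap)
        n1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n0 + n1, next(tie), merged))
    return heap[0][2]


def build_dictionary(identifiers, sorted_table):
    """Pick the coding scheme for one dimension and return {identifier: code}."""
    counts = Counter(identifiers)
    if top4_share(counts) > 0.5:
        return huffman_codes(counts)          # concentrated distribution
    ranked = [sym for sym, _ in counts.most_common()]
    return dict(zip(ranked, sorted_table))    # near-uniform distribution
```

Calling `build_dictionary` three times, once each for the area, section, and position identifiers, would yield the three parts of dic_output.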

A decompression method matching the data compression method described above comprises the following steps:

1) Referring to the compression dictionary dic_output, extract binary strings from the compressed file one after another: the first extraction yields the binary code of the area identifier, the second yields the binary code of the section identifier, and the third yields the binary code of the position identifier;

2) Convert each binary code into the corresponding ASCII identifier according to the compression dictionary dic_output and append it to the character string out1;

3) Scan the character string out1 from left to right, three characters at a time, and look up each group of three characters in the character identification table dic_character to find the corresponding Chinese character, completing the decompression.
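The three decompression steps can be sketched as below. The function names and the toy dictionaries in the usage note are hypothetical; the sketch assumes the compressed data is available as a string of "0" and "1" characters and that each of the three dictionaries is prefix-free.

```python
def decode_bits(bits, area_codes, section_codes, position_codes):
    """Turn a bit string back into the identifier string out1."""
    # Invert each dictionary: bit string -> identifier symbol.
    tables = [{code: sym for sym, code in d.items()}
              for d in (area_codes, section_codes, position_codes)]
    out1 = []
    current = ""
    turn = 0  # 0 = area, 1 = section, 2 = position
    for bit in bits:
        current += bit
        symbol = tables[turn].get(current)
        if symbol is not None:  # prefix-free codes: the first match is the only match
            out1.append(symbol)
            current = ""
            turn = (turn + 1) % 3
    if current or turn != 0:
        raise ValueError("truncated or corrupt bit stream")
    return "".join(out1)


def identifiers_to_text(out1, dic_character):
    """Look up every group of three identifiers in dic_character."""
    reverse = {code: ch for ch, code in dic_character.items()}
    return "".join(reverse[out1[i:i + 3]] for i in range(0, len(out1), 3))
```

For example, with the toy dictionaries `{"0": "0", "1": "1"}`, `{"a": "0", "b": "10", "c": "11"}`, and `{"x": "1", "y": "0"}`, the bit string `0101` followed by `1110` decodes to the identifier string `0bx1cy`.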

Preferably, the compression dictionary dic_output and the character identification table dic_character are the same as those used for data compression.

The beneficial effects of the present invention are as follows. For the scenario of electronic medical record data compression, the invention proposes a new compression approach oriented toward Chinese characters. Its compression efficiency does not depend on the repetition rate of Chinese characters, and the more Chinese characters there are, the higher the compression efficiency, while the processing load on the server is also relieved. A matching decompression method is provided as well, which is convenient to use and highly reliable.

Brief description of the drawings

Fig. 1 is a flow diagram of the data compression method of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to specific embodiments, but the scope of protection of the invention is not limited thereto:

Embodiment 1: As shown in Fig. 1, a data compression method for electronic medical records comprises the following steps:

Step 1: Build a 20 × 20 × 20 three-dimensional array arry_3rd whose three dimensions correspond to area, section, and position, where N1 is the maximum number of areas, N2 is the maximum number of sections, and N3 is the maximum number of positions.

Step 2: Fill in the characters in order. The characters include Chinese characters, ASCII symbols, and control characters.

Specifically, the Chinese characters can be those of the GB2312 code table, and the remaining space is filled with other necessary characters such as ASCII symbols and control characters.

Step 3: Identify the areas, sections, and positions of the three-dimensional array arry_3rd with ASCII characters.

Step 4: Create a character identification table dic_character that stores the one-to-one mapping between every character in the three-dimensional array arry_3rd and the ASCII identifiers of its area, section, and position.

Step 5: Create a sorted coding table, as shown in Table 1, in which, if the i-th code is $i_1 i_2 \ldots i_n$ and the j-th code is $j_1 j_2 \ldots j_n \ldots j_r$, where n and r are the code lengths and $1 \le n < r$, then $i_1 i_2 \ldots i_n \neq j_1 j_2 \ldots j_n$; the total number of codes is the maximum of N1, N2, and N3;

Table 1

Step 6: Referring to the table dic_character generated in step 4, convert all Chinese characters and other characters in the electronic medical record into a string of ASCII characters, so that every Chinese character is converted into ASCII characters.

Step 7: For all Chinese characters and other characters in the electronic medical record, count the ASCII identifiers of the area, section, and position separately, that is, count the number of times each ASCII identifier occurs, and sort the counts in descending order.

Step 8: According to the result of step 7, map the sorted ASCII identifiers of the area, section, and position one-to-one, in order, to the codes of the sorted coding table to generate the compression dictionary dic_output. Alternatively, use the Huffman algorithm to assign binary codes to the ASCII identifiers of the area, section, and position to generate the compression dictionary dic_output. The choice depends on the distribution of character occurrences: if the occurrences are concentrated on a few characters, the codes are generated with the Huffman algorithm; if the distribution tends toward uniform, the codes of the sorted coding table are recommended. A simple criterion is whether the four highest-ranked characters together account for more than 50% of all characters: if they do, the Huffman algorithm is used; otherwise the sorted coding table is used.

Step 9: Encode the ASCII characters that represent the Chinese characters according to the compression dictionary dic_output generated in step 8 to complete the compression.

The decompression method based on the above compression is as follows:

Step 1: Referring to the compression dictionary dic_output, extract binary strings from the compressed file one after another: the first yields the binary code of the area identifier, the second the binary code of the section identifier, and the third the binary code of the position identifier.

In the sorted coding table, the i-th code is $i_1 i_2 \ldots i_n$ and the j-th code is $j_1 j_2 \ldots j_n \ldots j_r$, where n and r are the code lengths, $1 \le n < r$, and $i_1 i_2 \ldots i_n \neq j_1 j_2 \ldots j_n$. Therefore no code in the compression dictionary is a prefix of any other code, so the correct code is obtained unambiguously each time.

Step 2: Convert each binary code obtained into the corresponding ASCII identifier according to the compression dictionary and append it to the character string out1;

Step 3: Scan the character string out1 from left to right, three characters at a time, and look up each group of three characters in the dic_character dictionary table to find the corresponding Chinese character, completing the decompression.

Embodiment 2: Because space is limited and the present invention is directed at Chinese character compression, this embodiment uses one of the most common cases in electronic medical records and illustrates the compression process with 20 characters as an example:

Four days ago the patient developed discomfort in the lower left chest without obvious cause (English rendering of the 20-character Chinese sample text)

Step 1: Use the previously built character identification table dic_character; the ASCII identifiers it uses are shown in Table 2. The steps for generating dic_character are as described above and are not repeated here.

Table 2

Take the area identifier, section identifier, and position identifier of each character of the above case in turn to generate an ASCII string. In each group of three characters, the first is the area identifier, the second is the section identifier, and the third is the position identifier:

41c 4f7 aj6 a9d 8aj 8hd 7b3 967 6fi 9ab ac2 a63 30d 9ad b9e 99d 9f9 2d4 2d0 86b

Step 2: Count the number of occurrences of the area identifiers, section identifiers, and position identifiers separately and sort them in descending order, as shown in Table 3:

Table 3

Step 3: The four highest-ranked characters together account for more than 50%, so the Huffman algorithm is used to generate the binary codes of the area identifiers, section identifiers, and position identifiers, forming the compression dictionary dic_output, as shown in Table 4:

Table 4

Step 4: Convert the area identifiers, section identifiers, and position identifiers according to the compression dictionary generated in step 3 to complete the compression.

The case uses 20 Chinese characters. Each Chinese character occupies 2 bytes and each byte is 8 bits, so the text occupies 320 bits before compression.

From the compression dictionary obtained in step 3 and the statistics obtained in step 2, the compressed text is calculated to occupy 207 bits, so the compression ratio of this case is 207 ÷ 320 × 100% = 64.6875%.
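The selection criterion and the ratio for this embodiment can be checked directly from the 20 identifier triples listed above. The sketch below is only a verification aid with hypothetical variable names; the 207-bit figure is taken from the text as given, since Table 4 is not reproduced here.

```python
from collections import Counter

# The 20 identifier triples of Embodiment 2, copied from the text.
triples = ("41c 4f7 aj6 a9d 8aj 8hd 7b3 967 6fi 9ab "
           "ac2 a63 30d 9ad b9e 99d 9f9 2d4 2d0 86b").split()

shares = {}
for name, index in (("area", 0), ("section", 1), ("position", 2)):
    counts = Counter(t[index] for t in triples)
    top4 = sum(n for _, n in counts.most_common(4))
    shares[name] = top4 / len(triples)  # share of the four most frequent identifiers

bits_before = len(triples) * 2 * 8  # 2 bytes per Chinese character
bits_after = 207                    # figure reported in the embodiment
ratio = bits_after / bits_before

print(shares)              # top-4 share per dimension
print(bits_before, ratio)  # 320 bits before, ratio as a fraction
```

The top-4 shares come out at 70% for the area identifiers, 60% for the section identifiers, and 55% for the position identifiers, all above the 50% threshold, which is consistent with the choice of the Huffman algorithm in step 3.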

Decompression is the inverse of the above process and is not described again here.

Embodiment 3: Because space is limited and the present invention is directed at Chinese character compression, this embodiment uses one of the most common cases in electronic medical records and illustrates the compression process with 200 characters as an example:

Four days ago the patient developed discomfort in the lower left chest without obvious cause, with no cough or expectoration and no chills or fever. It was not taken seriously and no treatment was given. Six hours ago, after eating cold dishes, the lower left chest discomfort recurred, with no diarrhea, nausea, or vomiting. It was again not taken seriously, and the patient took March Powder on their own without relief of symptoms. The patient visited the local health center and was given oral medication and an intramuscular injection (details unknown), but the left chest discomfort did not clearly improve, and chills and cold extremities appeared. Myocardial enzyme tests at our hospital showed elevated aspartate aminotransferase and lactate dehydrogenase. The patient was admitted to our department through the emergency department for further treatment. Since onset, the patient's spirits have been poor and sleep has been unsatisfactory. Bowel and bladder function are normal.

Step 1: Use the previously built character identification table dic_character; the ASCII identifiers it uses are shown in Table 2. The steps for generating dic_character are as described above and are not repeated here. Take the area identifier, section identifier, and position identifier of each character of the above case in turn to generate an ASCII string. In each group of three characters, the first is the area identifier, the second is the section identifier, and the third is the position identifier:

41c 4f7 aj6 aca a9d 8aj 8hd 7b3 967 6fi 9ab ac2 a63 30d 9ad b9e 99d 9f9 2d4 2d0 86b 09j 967 5ef 8bf 001 5ef 8ei 09j 967 948 60h 3f5 7h1 002 945 a6e 79d b5d a57 09j 945 9f1 7h7 4bb b33 641 002 66c 9ce 856 7b3 a63 57f 858 638 2dg 4d8 ae6 337 30d 9ad b9e 99d 9f9 2d4 2d0 86b 09j 967 412 9dj 4j9 3ea 9e9 001 726 905 002 945 a6e 79d b3d 870 09j b83 3jh 9f1 5cd 7jh b0j b6b 967 4f5 56d 002 5a6 b02 36c 38b 94h 845 8dc aa7 5fd 3jh a2a 974 4j9 4ie b5d a37 8h6 09f 5b4 8h6 7ba 9bd 09g b9e 9f9 2d4 2d0 86b 967 6fi 9ab 4b0 b61 7ch 30d 9ad 948 60h 09j b18 8h6 2b8 60h 002 58e 95f acf 6d8 b02 529 2f6 9e9 4ie 6cf 788 85j 46b 2ee b61 232 6cf 4j9 7ie 8c8 910 7dd 6cf 430 002 93f 57f a38 2d2 b33 641 4ja b02 95f acf 872 7ig 95f 5ed 002 3f5 2be a4d 5i9 09j 58c 83g 2fb 09j 89i 6ee 7ba 512 002 354 9ce 2a6 b0g 2ga 002

Step 2: Count the number of occurrences of the area identifiers, section identifiers, and position identifiers separately and sort them in descending order, as shown in Table 5:

Table 5

Step 3: The four highest-ranked characters together account for less than 50%, so the binary codes of the area identifiers, section identifiers, and position identifiers are taken from Attached Table 2 (the sorted coding table) to generate the compression dictionary dic_output, as shown in Table 6:

Table 6

The total length of the compressed binary string is 2490 bits, and the compression ratio is 2490/3200 × 100% = 77.8125%.
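As a quick check of this figure, using the document's definition of compression ratio (size after compression divided by size before) and 16 bits per uncompressed Chinese character, the 200-character case works out as follows; the per-character and per-identifier averages are derived here and are not stated in the source:

```latex
\[
\frac{2490}{200 \times 16} = \frac{2490}{3200} = 77.8125\%,
\qquad
\frac{2490}{200} = 12.45\ \text{bits per character},
\qquad
\frac{2490}{3 \times 200} = 4.15\ \text{bits per identifier}.
\]
```

An average of 4.15 bits per identifier is below the 5-bit limit discussed later in the description, which is why the text is still compressed even though the sorted coding table rather than the Huffman algorithm is used.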

In conclusion it is an object of the present invention to provide a kind of compressing data method, actually should during, we are right Common English medical terms and scalar-unit such as HBsAg, mmHg are placed directly in the section of blank and are identified, for it Influence of his ascii character individually occurred in the case where frequency of occurrence is seldom to reduced overall effect is not very big.For High compression effect, we periodically collect the high frequency Chinese character in primary electron case history in practical applications, high frequency Chinese character is concentrated It is stored in the section of dic_character, the version number of dic_character is added in binary string upon compression to prevent Only character mapping error.It in addition directly will be entire for common Chinese character medical terms such as congenital heart disease, duodenum etc. Character string, which is put into blank section to be identified, (in order to make full use of the limited space dic_character, deletes GB2312 In the character such as character in the 04th area to 09th area and the character shaped like these radicals of Bing, Tou, Yan that take less than substantially).

GB2312 divides its characters into 94 areas of 94 positions each, giving 8836 code points. The 3755 level-1 Chinese characters are in areas 16 to 55 and the 3008 level-2 Chinese characters are in areas 56 to 87; there are also 682 other characters such as Latin and Greek letters, for a total of 7445 characters. To achieve lossless compression, the printable characters and control characters of the ASCII code table are added as well, giving about 7545 characters in total. Compressing these characters directly with the Huffman algorithm clearly cannot achieve the purpose of compression. Converting each Chinese character into a transcode of two or more symbols reduces the number of distinct characters that can occur in the source and is one effective way to limit the growth of the compressed code length. At the same time, converting a single Chinese character into a transcode is itself an expansion of the character: the longer the transcode, the fewer distinct characters can occur in the source and the shorter the average code in the compression dictionary, but the more uniform the character distribution becomes and the lower the compression efficiency. Practice shows that converting each Chinese character into a 3-symbol transcode gives the best result. To limit the code length in the compression dictionary, this scheme uses 20 areas, each with 20 sections, each section holding 20 positions, to accommodate the 7545 characters. The 20 areas, sections, and positions are then encoded separately to achieve compression. During compression, depending on the distribution of the area, section, and position identifiers, the compression dictionary codes are generated either with the Huffman algorithm or from Attached Table 2. Clearly, when the Huffman algorithm generates the code for the character ranked 10th, the code length can reach or exceed 5 bits. Because each Chinese character is represented by 3 symbols, compression is achieved only if the binary code of each symbol averages no more than 5 bits. Encoding the characters ranked after 10th is in effect an expansion, so in general only when the characters ranked after 10th account for no more than 25% of all characters can this 25% expansion be offset and compression achieved. The criterion is therefore: if the four highest-ranked characters together account for more than 50% of all characters, the Huffman algorithm is used to generate the compression dictionary codes; otherwise the codes of Attached Table 2 are used. The codes of Attached Table 2 are not universally optimal, but they guarantee compression in every case, which is also why 20 areas of 20 sections of 20 positions are used to accommodate the 7545 characters.
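The two numerical conditions behind this design can be written out explicitly. The second line assumes that no code in the sorted coding table is longer than 5 bits, as in the construction rule the description gives for the code table; the resulting 93.75% bound is derived here and is not stated in the source:

```latex
\[
N1 \times N2 \times N3 = 20^3 = 8000 \ge 7545,
\]
\[
3 \times 5 = 15 < 16
\quad\Longrightarrow\quad
\text{compression ratio} \le \frac{15}{16} = 93.75\%.
\]
```

The first line shows that the array has enough cells for all characters; the second shows that three identifiers of at most 5 bits each always occupy fewer bits than the 16 bits of an uncompressed Chinese character.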

In addition, the code table of the present invention is constructed as follows. Three bits of 0 and 1 give 8 possibilities, "000, 001, 010, 011, 100, 101, 110, 111"; two of them, "000" and "011", are taken as the first 2 codes. A 0 or 1 is appended to each of the remaining 6, giving 12 possibilities, "0010, 0011, 0100, 0101, 1001, 1000, 1010, 1011, 1100, 1101, 1110, 1111"; 5 of them are taken as codes 3 to 7. A 0 or 1 is appended to each of the remaining 7, giving 14 possibilities; "10000" is removed and the remaining 13 serve as codes 8 to 20. This allows the area, section, and position to be separated correctly during decoding. As for the property of the codes: suppose the i-th code is $d_1 d_2 \ldots d_{n_i}$ and the j-th code is $d_1 d_2 \ldots d_{r_j} \ldots d_{n_j}$, with $1 \le n_i \le n_j$ for $i < j$; then $d_1 d_2 \ldots d_{n_i} \neq d_1 d_2 \ldots d_{r_j}$, where $n_i$ and $n_j$ are the code lengths and $1 \le r_j \le n_j$.
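The construction rule above can be sketched in a few lines. The text does not say which 5 of the 12 four-bit candidates become codes 3 to 7; the sketch assumes they are the first five in the order the text lists them, which leaves "1000" among the remainder so that "10000" exists to be removed. The function name is hypothetical.

```python
from itertools import product


def build_sorted_coding_table():
    """Build 20 prefix-free codes: 2 of length 3, 5 of length 4, 13 of length 5."""
    three = ["".join(p) for p in product("01", repeat=3)]  # 8 candidates
    first = ["000", "011"]                                 # codes 1-2
    rest3 = [c for c in three if c not in first]           # 6 left
    four = [c + b for c in rest3 for b in "01"]            # 12 candidates
    # Assumption: codes 3-7 are the first five in the order listed in the text.
    mid = ["0010", "0011", "0100", "0101", "1001"]
    rest4 = [c for c in four if c not in mid]              # 7 left
    five = [c + b for c in rest4 for b in "01"]            # 14 candidates
    five.remove("10000")                                   # dropped by the rule
    return first + mid + five                              # 2 + 5 + 13 = 20


table = build_sorted_coding_table()
for rank, code in enumerate(table, start=1):
    print(rank, code)
```

Because every longer code extends a 3-bit or 4-bit string that was not itself chosen as a code, no code is a prefix of another, and the longest code is 5 bits.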

The above describes specific embodiments of the present invention and the technical principles applied. Any change made according to the concept of the present invention, provided the resulting function does not go beyond the spirit covered by the specification and drawings, shall fall within the scope of protection of the present invention.

Claims

1. A data compression method for electronic medical records, characterized by comprising the following steps:

(1) Create an N1 × N2 × N3 three-dimensional array arry_3rd whose three dimensions are area, section, and position, where N1 is the maximum number of areas, N2 is the maximum number of sections, and N3 is the maximum number of positions; and fill the three-dimensional array arry_3rd with characters in order, the characters including Chinese characters, ASCII symbols, and control characters;

(4) Create a sorted coding table in which, if the i-th code is $i_1 i_2 \ldots i_n$ and the j-th code is $j_1 j_2 \ldots j_n \ldots j_r$, where n and r are the code lengths and $1 \le n < r$, then $i_1 i_2 \ldots i_n \neq j_1 j_2 \ldots j_n$; the total number of codes is the maximum of N1, N2, and N3;

(5) Referring to the character identification table dic_character, convert all characters in the electronic medical record into a string of ASCII characters, then count the ASCII identifiers of the area, section, and position separately and sort each set of counts in descending order;

2. The data compression method for electronic medical records according to claim 1, characterized in that N1 × N2 × N3 is preferably 20 × 20 × 20.

3. The data compression method for electronic medical records according to claim 1, characterized in that in step (1) the three-dimensional array arry_3rd is filled in order by first filling in the Chinese characters of the GB2312 code table and then filling the remaining space of arry_3rd with ASCII characters and control characters.

4. The data compression method for electronic medical records according to claim 1, characterized in that the character identification table dic_character of step (3) stores the one-to-one mapping between every character in the three-dimensional array arry_3rd and the ASCII identifiers of its area, section, and position; the character identification table dic_character is created once and used many times, and is updated at predetermined intervals.

5. The data compression method for electronic medical records according to claim 1, characterized in that in step (5) the counting and descending sort are performed by counting the number of times each ASCII identifier occurs and sorting the identifiers from the highest to the lowest count.

6. The data compression method for electronic medical records according to claim 1, characterized in that in step (6) the compression method used to generate the compression dictionary dic_output is selected according to the distribution of character occurrences, as follows:

I) If the occurrences are concentrated on a few characters, the Huffman algorithm is used to assign binary codes to the sorted ASCII identifiers of the area, section, and position, generating the compression dictionary dic_output;

II) If the occurrences tend toward a uniform distribution, the codes of the sorted coding table are used: the sorted ASCII identifiers of the area, section, and position are mapped one-to-one, in order, to the codes of the sorted coding table, generating the compression dictionary dic_output.

7. The data compression method for electronic medical records according to claim 6, characterized in that whether the occurrences are concentrated or uniform is judged by whether the four highest-ranked characters together account for more than 50% of all characters: if they do, the codes are generated with the Huffman algorithm; otherwise the sorted coding table is used.

8. A decompression method matching the data compression method according to claim 1, characterized by comprising the following steps:

1) Referring to the compression dictionary dic_output, extract binary strings from the compressed file one after another: the first extraction yields the binary code of the area identifier, the second yields the binary code of the section identifier, and the third yields the binary code of the position identifier;

2) Convert each binary code into the corresponding ASCII identifier according to the compression dictionary dic_output and append it to the character string out1;

3) Scan the character string out1 from left to right, three characters at a time, and look up each group of three characters in the character identification table dic_character to find the corresponding Chinese character, completing the decompression.

9. The decompression method according to claim 8, characterized in that the compression dictionary dic_output and the character identification table dic_character are the same as those used for data compression.