CN106549674B - Data compression and decompression method for electronic medical records - Google Patents

Data compression and decompression method for electronic medical records

Info

Publication number
CN106549674B
Authority
CN
China
Prior art keywords
character
coding
dic
ascii
compression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610961205.5A
Other languages
Chinese (zh)
Other versions
CN106549674A (en)
Inventor
于海龙
李建元
温晓岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yinjiang Technology Co.,Ltd.
Original Assignee
Enjoyor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Enjoyor Co Ltd
Priority to CN201610961205.5A
Publication of CN106549674A
Application granted
Publication of CN106549674B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70 Type of the data to be coded, other than image and sound
    • G06Q50/24

Abstract

The present invention relates to a data compression and decompression method for electronic medical records. The characters of the GB2312 Chinese coded character set are re-coded so that each character, instead of being represented by a zone-position code, is represented by a zone number, a segment number and a position number. These zone, segment and position numbers are then identified by ASCII characters; no requirement or assumption is made as to which ASCII character corresponds to which number. A single Chinese character can thus be represented by three ASCII characters, and a Chinese character string becomes an ASCII string. Compression and decompression are achieved by counting and coding these ASCII characters. The invention is oriented to Chinese character compression: its compression efficiency does not depend on the repetition rate of Chinese characters, and it also relieves the computational load on the server. It is convenient to use and highly reliable.

Description

Data compression and decompression method for electronic medical records
Technical field
The present invention relates to the field of medical record data management, and in particular to a data compression and decompression method for electronic medical records.
Background
As hospital informatization matures and expands, electronic medical record information has grown to an unprecedented scale. The massive growth in the volume of electronic medical record information and the continual expansion of individual records cause problems not only in data storage and retrieval but also in transmission for applications such as remote diagnosis and cross-regional information sharing.
Compression is one of the effective means of improving the efficiency of information storage and transmission, and many compression methods are in use. The most representative traditional method is the Huffman algorithm, whose idea is to code according to the frequency with which single characters occur in the information: high-frequency characters get short codes and low-frequency characters get long codes. This idea dominated until the end of the 1970s. Later, algorithms that compress information with static or dynamic dictionaries broke this pattern; the most representative are LZ77, which compresses with a dynamic dictionary based on a sliding window, and its derivatives. These algorithms compress better the higher the repetition rate of character strings, and the algorithms used by today's mainstream compression tools are all closely tied to LZ77 and its derivatives.
In practice, data compression is rarely applied in China's electronic medical record systems at present. The main reason is that compressing the entire data set of an electronic medical record system as a whole is impractical, because compressed data does not suit the continual insert, delete and update operations of an online application system. The information in a single case record, moreover, is small: as few as a few dozen Chinese characters and at most a few thousand. In addition, the character repetition rate of patient information such as symptoms, examinations, diagnosis, treatment and treatment outcomes is low, so neither Huffman compression nor LZ77 and its derivatives can play to their strengths.
Many experts in China have also been exploring Chinese character compression. "Research on a Universal and Simple Compression Method for Chinese Texts", Journal of South China Normal University (Natural Science Edition), 2001, No. 2, offers a distinctive approach to Chinese character compression. Its core idea is to convert each Chinese character to a numeric value and then exploit the fact that "the highest 3 binary bits (bits 14, 15 and 16) of the two-byte stored value are all 0" by dropping those three bits when storing. The highlight of this scheme is that it needs no dictionary: the value is obtained from the zone-position code of the Chinese character through a functional relation established in advance, which is simple and efficient. Compression ratio here means the ratio of the size after compression to the size before compression; for example, if a 100 MB file is 90 MB after compression, the compression ratio is 90/100 × 100% = 90%. The compression ratio of that method should therefore be 13/16 × 100%, not the 3/16 × 100% claimed in the paper. By comparison, the present invention provides a better compression ratio in electronic medical record systems.
"Order-0 Model Arithmetic Coding of Chinese Text with a Dynamic Alphabet", Journal of Chinese Information Processing, 2000, Vol. 14, No. 1, builds a statistical model from an analysis of existing text and combines it with the PPM algorithm to compress Chinese characters. The compression relies on a suitable statistical model, which is established from a large amount of text and then used to predict and compress the characters in a text; this is a very effective method. In an electronic medical record system, however, records are written by many authors with different writing styles, and the symptom descriptions, diagnostic approaches and treatments differ from record to record, which challenges the applicability of the model.
"Compression Method for Chinese Texts", Application No. 201410614796.X, uses a compression method based on word-frequency statistics and is essentially a simple application of the Huffman algorithm. The key point of that method is how to segment Chinese character strings, which have "no markers such as spaces" as separators, into words, and the method does not explain this. Moreover, as noted above, the character repetition rate of the symptoms, examinations, diagnosis, treatment and treatment outcomes in a single case is low, so even if an effective word-segmentation algorithm yields correct segmentation, the text still cannot be compressed effectively.
"A Method Suitable for Compressing Mixed Chinese-English Document Information", Application No. 089126954 (from Taiwan, China), also uses Huffman coding. Its highlight is that the code of each Chinese character is split into a high part and a low part, which are then counted and coded separately alongside the English characters, yielding an English code dictionary, a Chinese high-part code dictionary and a Chinese low-part code dictionary. To distinguish these and prevent conflicts, flags are added to the output character codes. The method focuses on handling mixed Chinese and English text and does not consider compression efficiency carefully: once flag bits are added, their effect on the decoding of the Huffman codes must be taken into account, and to avoid disturbing that decoding both the flags and the Huffman codes must be constrained, which seriously reduces compression efficiency.
Summary of the invention
The present invention aims to overcome the above shortcomings and to provide a data compression method for electronic medical records. Set in the scenario of electronic medical record data compression, the invention proposes a new compression method characterized in that it is oriented to Chinese character compression, its compression efficiency does not depend on the repetition rate of Chinese characters, it compresses effectively even when no Chinese character repeats, and it also relieves the computational load on the server.
Another object of the present invention is to provide a data decompression method for electronic medical records. This decompression method is the counterpart of the compression method described above; it is convenient to use and highly reliable.
The present invention achieves the above objects through the following technical solution: a data compression method for electronic medical records, comprising the following steps:
(1) create an N1 × N2 × N3 three-dimensional array arry_3rd whose three dimensions are zone, segment and position, where N1 is the maximum number of zones, N2 is the maximum number of segments and N3 is the maximum number of positions, and fill characters into arry_3rd in order;
(2) identify the zones, segments and positions of arry_3rd with ASCII characters;
(3) create the character identifier table dic_character from arry_3rd and the identification result of step (2);
(4) create a sorted code table in which, if the i-th code is $i_1 i_2 \dots i_n$ and the j-th code is $j_1 j_2 \dots j_n \dots j_r$, where n and r are the code lengths and $1 \le n < r$, then $i_1 i_2 \dots i_n \ne j_1 j_2 \dots j_n$; the total number of codes is the maximum of N1, N2 and N3;
(5) with reference to dic_character, convert all characters in the electronic medical record into a string of ASCII characters, then count the ASCII identifiers of the zones, segments and positions separately and sort each count in descending order;
(6) select a compression method according to the result of step (5) and generate the compression dictionary dic_output;
(7) encode the ASCII characters representing the Chinese characters according to the generated dic_output, completing the compression.
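Steps (1) to (3) and the conversion in step (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the identifier alphabet `IDS`, the helper names `build_dic_character` and `to_identifier_string`, and the toy character set are all hypothetical, since the text leaves the choice of ASCII identifiers open and the real fill order follows the GB2312 code table.

```python
# Hypothetical identifier alphabet: 20 ASCII symbols, one per index 0..19.
# The patent makes no assumption about which ASCII character marks which index.
IDS = "0123456789abcdefghij"
N = len(IDS)  # 20 zones x 20 segments x 20 positions


def build_dic_character(charset):
    """Map each character to a 3-character (zone, segment, position) identifier,
    assigning array cells in fill order."""
    if len(charset) > N ** 3:
        raise ValueError("character set does not fit into the N x N x N array")
    table = {}
    for k, ch in enumerate(charset):
        zone, rest = divmod(k, N * N)
        segment, position = divmod(rest, N)
        table[ch] = IDS[zone] + IDS[segment] + IDS[position]
    return table


def to_identifier_string(text, dic_character):
    """Replace every character of the text by its three ASCII identifiers."""
    return "".join(dic_character[ch] for ch in text)


# Toy character set standing in for the GB2312 fill order (assumption).
charset = [chr(0x4E00 + i) for i in range(500)]
dic_character = build_dic_character(charset)

sample = charset[0] + charset[21] + charset[401]
encoded = to_identifier_string(sample, dic_character)
print(encoded)  # 000011101
```

Because the array index is simply split into three base-20 digits, every character receives a distinct triple, which is what makes the later table lookup during decompression unambiguous.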
Preferably, N1 × N2 × N3 is 20 × 20 × 20.
Preferably, in step (1) the characters are filled into the three-dimensional array arry_3rd in order by first filling in the Chinese characters of the GB2312 code table and then filling ASCII characters and control characters into the remaining space of arry_3rd.
Preferably, the character identifier table dic_character of step (3) stores the one-to-one mapping between every character in arry_3rd and the ASCII characters identifying its zone, segment and position; dic_character is created once, used many times, and updated at predetermined intervals.
Preferably, when step (5) counts and sorts the ASCII identifiers, the number of occurrences of each ASCII identifier is counted and the identifiers are sorted by occurrence count from largest to smallest.
Preferably, in step (6) the compression method used to generate the compression dictionary dic_output is selected according to the distribution of character occurrences, as follows:
I) if the occurrences are concentrated on a few characters, the Huffman algorithm is used to binary-code the sorted zone, segment and position ASCII characters, generating the compression dictionary dic_output;
II) if the distribution tends toward uniform, the codes of the sorted code table are used: the sorted zone, segment and position ASCII characters are put into one-to-one correspondence with the codes of the sorted code table in order, generating the compression dictionary dic_output.
Preferably, whether the characters are concentrated or uniformly distributed is judged by whether the four top-ranked characters together account for more than 50% of the total number of characters: if they do, the Huffman algorithm is used to generate the codes; otherwise the sorted code table is used.
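The selection rule can be sketched as below. The function name `choose_method` and its parameters are hypothetical, and how the exact 50% boundary is handled is an assumption (here a share of exactly 50% falls back to the sorted code table).

```python
from collections import Counter


def choose_method(identifiers, top=4, threshold=0.5):
    """Return 'huffman' if the `top` most frequent identifiers together make up
    more than `threshold` of all occurrences, else 'sorted_table'."""
    if not identifiers:
        raise ValueError("no identifiers to analyse")
    ranked = Counter(identifiers).most_common()  # sorted by count, descending
    top_share = sum(n for _, n in ranked[:top]) / len(identifiers)
    return "huffman" if top_share > threshold else "sorted_table"


concentrated = "aaaaabbbcd"          # only 4 distinct symbols: top-4 share is 100%
uniform = "0123456789abcdefghij"     # 20 symbols once each: top-4 share is 20%
print(choose_method(concentrated))   # huffman
print(choose_method(uniform))        # sorted_table
```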
A decompression method that is the counterpart of the data compression method described above comprises the following steps:
1) with reference to the compression dictionary dic_output, successively extract binary strings from the compressed file: the first extraction yields the binary code of a zone identifier, the second the binary code of a segment identifier, and the third the binary code of a position identifier;
2) convert each binary code obtained into the corresponding ASCII identifier according to the compression dictionary dic_output and append it to the character string out1;
3) scan the character string out1 from left to right, three characters at a time, and look up the Chinese character corresponding to each group of three characters in the character identifier table dic_character, completing the decompression.
Preferably, the compression dictionary dic_output and the character identifier table dic_character are identical to those used for data compression.
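The three decompression steps can be sketched as follows. This is an illustration under assumptions: `dic_output` is modelled as three separate identifier-to-codeword dictionaries (zone, segment, position), which is one reading of the text, and the toy dictionaries and the names `decode_bits` and `decompress` are hypothetical.

```python
def decode_bits(bits, dic_output):
    """Steps 1) and 2): read codewords in zone, segment, position rotation
    and collect the matching ASCII identifiers into out1."""
    inverse = [{code: ident for ident, code in d.items()} for d in dic_output]
    out1, buf, slot = [], "", 0
    for b in bits:
        buf += b
        if buf in inverse[slot]:      # prefix-free codes: first match is the codeword
            out1.append(inverse[slot][buf])
            buf = ""
            slot = (slot + 1) % 3
    if buf:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out1)


def decompress(bits, dic_output, dic_character):
    """Step 3): scan out1 three characters at a time and look up each triple."""
    out1 = decode_bits(bits, dic_output)
    lookup = {ident: ch for ch, ident in dic_character.items()}
    return "".join(lookup[out1[i:i + 3]] for i in range(0, len(out1), 3))


# Toy dictionaries (assumptions, not the patent's tables).
dic_output = [
    {"0": "0", "1": "1"},                # zone identifier -> codeword
    {"a": "0", "b": "10", "c": "11"},    # segment identifier -> codeword
    {"x": "0", "y": "1"},                # position identifier -> codeword
]
dic_character = {"患": "0ax", "者": "1cy"}

print(decompress("0001111", dic_output, dic_character))  # 患者
```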
The beneficial effects of the present invention are as follows. Set in the scenario of electronic medical record data compression, the invention proposes a new approach to compression characterized in that it is oriented to Chinese character compression, its compression efficiency does not depend on the repetition rate of Chinese characters, the more Chinese characters there are the higher the compression efficiency, and it also relieves the computational load on the server. A matching decompression method is provided as well, which is convenient to use and highly reliable.
Brief description of the drawings
Fig. 1 is a flow diagram of the data compression method of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to specific embodiments, but the protection scope of the present invention is not limited thereto:
Embodiment 1: As shown in Fig. 1, a data compression method for electronic medical records comprises the following steps:
Step 1, construct a 20 × 20 × 20 three-dimensional array arry_3rd whose three dimensions correspond to zone, segment and position, where N1 is the maximum number of zones, N2 is the maximum number of segments and N3 is the maximum number of positions.
Step 2, fill in the characters in order; the characters include Chinese characters, ASCII symbols and control characters.
Specifically, the Chinese characters can be taken from the GB2312 code table, and other necessary characters such as ASCII symbols and control characters can be filled into the remaining space.
Step 3, identify the zones, segments and positions of the three-dimensional array arry_3rd with ASCII characters.
Step 4, create a character identifier table dic_character that stores the one-to-one mapping between every character in arry_3rd and the ASCII characters identifying its zone, segment and position.
Step 5, create the sorted code table shown in Table 1, in which, if the i-th code is $i_1 i_2 \dots i_n$ and the j-th code is $j_1 j_2 \dots j_n \dots j_r$, where n and r are the code lengths and $1 \le n < r$, then $i_1 i_2 \dots i_n \ne j_1 j_2 \dots j_n$; the total number of codes is the maximum of N1, N2 and N3;
Table 1
Step 6, with reference to the table dic_character generated in step 4, convert all Chinese characters and other characters in the electronic medical record into a string of ASCII characters, so that every Chinese character is converted into ASCII characters.
Step 7, for all Chinese characters and other characters in the electronic medical record, count the ASCII identifiers of the zones, segments and positions separately and sort each count in descending order: the number of occurrences of each ASCII character is counted, and the characters are sorted by occurrence count from largest to smallest.
Step 8, according to the result of step 7, put the sorted zone, segment and position ASCII characters into one-to-one correspondence with the codes of the sorted code table in order, generating the compression dictionary dic_output. Alternatively, use the Huffman algorithm to binary-code the zone, segment and position ASCII characters, generating the compression dictionary dic_output. The choice depends on the distribution of character occurrences: if the occurrences are concentrated on a few characters, the Huffman algorithm is used to generate the codes; if the distribution tends toward uniform, the codes of the sorted code table are recommended. A simple criterion is whether the four top-ranked characters together account for more than 50% of the total: if they do, the Huffman algorithm is used to generate the codes; otherwise the sorted code table is used.
Step 9, encode the ASCII characters representing the Chinese characters according to the compression dictionary dic_output generated in step 8, completing the compression.
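For the Huffman branch of step 8, a minimal sketch of generating a code dictionary from identifier counts is shown below. It uses a textbook Huffman construction with Python's `heapq`; the function name `huffman_codes` and the toy counts are assumptions, and the patent's own Table 4 codes may differ, since Huffman codes are not unique.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freqs):
    """Build a prefix-free binary code from a non-empty {symbol: count} mapping."""
    if len(freqs) == 1:                       # degenerate case: a single symbol
        return {next(iter(freqs)): "0"}
    tie = count()                             # tie-breaker so dicts are never compared
    heap = [(n, next(tie), {sym: ""}) for sym, n in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)       # two least frequent subtrees
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (n0 + n1, next(tie), merged))
    return heap[0][2]


# Toy identifier stream: 'a' x 8, 'b' x 4, 'c' x 2, 'd' x 1 (assumption).
freqs = Counter("aaaaaaaabbbbccd")
codes = huffman_codes(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print({s: len(c) for s, c in codes.items()}, total_bits)
```

With these counts the frequent symbol `a` gets a 1-bit code and the rare symbols `c` and `d` get 3-bit codes, which is the behaviour the concentrated-distribution case relies on.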
The decompression method based on the above compression is as follows:
Step 1, with reference to the compression dictionary dic_output, successively extract binary strings from the compressed file: the first extraction yields the binary code of a zone identifier, the second the binary code of a segment identifier, and the third the binary code of a position identifier.
In the sorted code table, if the i-th code is $i_1 i_2 \dots i_n$ and the j-th code is $j_1 j_2 \dots j_n \dots j_r$, where n and r are the code lengths and $1 \le n < r$, then $i_1 i_2 \dots i_n \ne j_1 j_2 \dots j_n$. Consequently no code in the binary codes of the compression dictionary is a prefix of any other code, so the correct code is obtained uniquely on every extraction.
Step 2, convert each binary code obtained into the corresponding ASCII identifier according to the compression dictionary and append it to the character string out1;
Step 3, scan the character string out1 from left to right, three characters at a time, and look up the Chinese character corresponding to each group of three characters in the dic_character table, completing the decompression.
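The prefix property that step 1 relies on can be checked mechanically. The sketch below is illustrative only; the helper name `is_prefix_free` and the two sample code lists are assumptions.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of (or equal to) another."""
    ordered = sorted(codewords)
    # After lexicographic sorting, a prefix is always directly followed by
    # a word that starts with it, so checking neighbours is sufficient.
    return all(not nxt.startswith(cur) for cur, nxt in zip(ordered, ordered[1:]))


good = ["000", "011", "0010", "0011"]   # no code starts another code
bad = ["1000", "10000", "1010"]         # "1000" is a prefix of "10000"

print(is_prefix_free(good))  # True
print(is_prefix_free(bad))   # False
```

A decoder reading bits one at a time can therefore stop at the first match when the dictionary passes this check, without look-ahead or separators.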
Embodiment 2: Owing to limited space, and because the present invention targets Chinese character compression, this embodiment illustrates the compression process with one of the most common cases in electronic medical records, taking 20 characters as an example (the original sample is 20 Chinese characters, given here in English translation):
Four days ago the patient developed discomfort in the lower left chest without obvious cause
Step 1: The character identifier table dic_character has already been built, and the ASCII identifiers it uses are shown in Table 2. The steps for generating dic_character are as described above and are not repeated here.
Table 2
The zone identifier, segment identifier and position identifier of each character of the above case are taken in turn to generate an ASCII string. In each group of three characters, the first is the zone identifier, the second is the segment identifier and the third is the position identifier:
41c 4f7 aj6 a9d 8aj 8hd 7b3 967 6fi 9ab ac2 a63 30d 9ad b9e 99d 9f9 2d4 2d0 86b
Step 2: Count the occurrences of the zone identifiers, segment identifiers and position identifiers separately and sort each in descending order, as shown in Table 3:
Table 3
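The counting in step 2 can be reproduced directly from the identifier string of step 1. The sketch below only tallies the three identifier slots and the share of the four most frequent identifiers in each; the variable names are hypothetical, and it does not reconstruct the layout of Table 3.

```python
from collections import Counter

# Identifier string of the 20-character sample, copied from step 1.
codes = ("41c 4f7 aj6 a9d 8aj 8hd 7b3 967 6fi 9ab "
         "ac2 a63 30d 9ad b9e 99d 9f9 2d4 2d0 86b").split()

slots = {"zone": 0, "segment": 1, "position": 2}

# Per slot: identifiers sorted by occurrence count, largest first.
ranked = {name: Counter(c[i] for c in codes).most_common()
          for name, i in slots.items()}

# Occurrences covered by the four most frequent identifiers of each slot.
top4 = {name: sum(n for _, n in r[:4]) for name, r in ranked.items()}

for name in slots:
    print(name, ranked[name][0], f"top-4 share = {top4[name] / len(codes):.0%}")
# zone ('9', 5) top-4 share = 70%
# segment ... top-4 share = 60%
# position ('d', 5) top-4 share = 55%
```

All three shares are above 50%, which is consistent with the Huffman branch being chosen for this sample.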
Step 3: According to the criterion, the four top-ranked characters together account for more than 50%, so the Huffman algorithm is used to generate the binary codes of the zone identifiers, segment identifiers and position identifiers, which form the compression dictionary dic_output shown in Table 4:
Table 4
Step 4: Convert the zone identifiers, segment identifiers and position identifiers according to the compression dictionary generated in step 3, completing the compression.
The case uses 20 Chinese characters. Each Chinese character occupies 2 bytes and each byte is 8 bits, so the text occupies 320 bits before compression.
From the compression dictionary obtained in step 3 and the statistics obtained in step 2, the compressed characters are found to occupy 207 bits in total, so the compression ratio of this case is 207 ÷ 320 × 100% = 64.6875%.
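The ratio above follows from the definition used throughout the document (size after compression divided by size before). A small check, with the hypothetical helper `compression_ratio` and the 16-bits-per-character assumption stated in the text:

```python
def compression_ratio(compressed_bits, n_chars, bits_per_char=16):
    """Size after compression divided by size before compression."""
    return compressed_bits / (n_chars * bits_per_char)


ratio = compression_ratio(207, 20)   # 207 bits for 20 two-byte characters
print(f"{ratio:.4%}")                # 64.6875%
```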
Decompression procedure is the inverse process of the above process, and details are not described herein again.
Embodiment 3: Owing to limited space, and because the present invention targets Chinese character compression, this embodiment illustrates the compression process with one of the most common cases in electronic medical records, taking 200 characters as an example (the original sample is in Chinese, given here in English translation):
Four days ago the patient developed discomfort in the lower left chest without obvious cause, with no cough or expectoration and no chills or fever. The patient paid no attention to it and sought no treatment. Six hours ago, discomfort in the lower left chest recurred after eating cold food, with no diarrhea, nausea or vomiting. The patient again paid little attention and took March Powder on their own, without relief. The patient then visited the local health center and was given oral medication and an intramuscular injection (details unknown); the left chest discomfort did not clearly improve, and chills and cold extremities appeared. Examination at our hospital showed elevated aspartate aminotransferase and lactate dehydrogenase in the myocardial enzyme panel. The patient was admitted to our department from the hospital's emergency room for further treatment. Since onset, the patient's spirits have been poor and sleep has been unsatisfactory. Bowel and bladder function are normal.
Step 1: The character identifier table dic_character has already been built, and the ASCII identifiers it uses are shown in Table 2. The steps for generating dic_character are as described above and are not repeated here. The zone identifier, segment identifier and position identifier of each character of the above case are taken in turn to generate an ASCII string. In each group of three characters, the first is the zone identifier, the second is the segment identifier and the third is the position identifier:
41c 4f7 aj6 aca a9d 8aj 8hd 7b3 967 6fi 9ab ac2 a63 30d 9ad b9e 99d 9f9 2d4 2d0 86b 09j 967 5ef 8bf 001 5ef 8ei 09j 967 948 60h 3f5 7h1 002 945 a6e 79d b5d a57 09j 945 9f1 7h7 4bb b33 641 002 66c 9ce 856 7b3 a63 57f 858 638 2dg 4d8 ae6 337 30d 9ad b9e 99d 9f9 2d4 2d0 86b 09j 967 412 9dj 4j9 3ea 9e9 001 726 905 002 945 a6e 79d b3d 870 09j b83 3jh 9f1 5cd 7jh b0j b6b 967 4f5 56d 002 5a6 b02 36c 38b 94h 845 8dc aa7 5fd 3jh a2a 974 4j9 4ie b5d a37 8h6 09f 5b4 8h6 7ba 9bd 09g b9e 9f9 2d4 2d0 86b 967 6fi 9ab 4b0 b61 7ch 30d 9ad 948 60h 09j b18 8h6 2b8 60h 002 58e 95f acf 6d8 b02 529 2f6 9e9 4ie 6cf 788 85j 46b 2ee b61 232 6cf 4j9 7ie 8c8 910 7dd 6cf 430 002 93f 57f a38 2d2 b33 641 4ja b02 95f acf 872 7ig 95f 5ed 002 3f5 2be a4d 5i9 09j 58c 83g 2fb 09j 89i 6ee 7ba 512 002 354 9ce 2a6 b0g 2ga 002
Step 2: Count the occurrences of the zone identifiers, segment identifiers and position identifiers separately and sort each in descending order, as shown in Table 5:
Table 5
Step 3: The four top-ranked characters together account for less than 50%, so the binary codes of the zone identifiers, segment identifiers and position identifiers are taken from attached Table 2 to form the compression dictionary dic_output, as shown in Table 6:
Table 6
Step 4: Convert the zone identifiers, segment identifiers and position identifiers according to the compression dictionary generated in step 3, completing the compression.
The total length of the compressed binary string is 2490 bits, so the compression ratio is 2490/3200 × 100% = 77.8125%.
Decompression procedure is the inverse process of the above process, and details are not described herein again.
In summary, the object of the present invention is to provide a data compression method. In practical application, common English medical terms and units of measurement such as HBsAg and mmHg are placed directly into blank segments and given identifiers; other ASCII characters that occur individually have little effect on overall compression as long as they occur rarely. To improve the compression effect, in practice we periodically collect the high-frequency Chinese characters in the electronic medical records and store them together in segments of dic_character, and the version number of dic_character is added to the compressed binary string to prevent character-mapping errors. In addition, for common Chinese medical terms such as congenital heart disease and duodenum, the entire character string is placed into a blank segment and given an identifier. (To make full use of the limited space of dic_character, characters in GB2312 that are essentially never used are deleted, such as the characters in zones 04 to 09 and characters that are bare radicals such as Bing, Tou and Yan.)
GB2312 partitions the characters it contains into 94 zones of 94 positions each, giving 8836 code points. Of these, 3755 level-1 Chinese characters are distributed over zones 16-55 and 3008 level-2 Chinese characters over zones 56-87; a further 682 characters such as Latin and Greek letters bring the total to 7445 characters. To achieve lossless compression, the printable characters and control characters of the ASCII code table are added as well, for about 7545 characters in total. If the Huffman algorithm were applied directly to these characters, the purpose of compression clearly could not be achieved.
Converting each Chinese character into a transcode of two or more digits reduces the number of distinct characters that can occur in the source, which is one effective way to limit the growth of code length. At the same time, converting a single Chinese character into a transcode is itself an expansion of that character: the more digits the transcode has, the fewer distinct characters can occur in the source and the shorter the average code length of the compression dictionary, but the character repetition rates also become more uniform and compression efficiency falls. Practice shows that converting each Chinese character into a 3-digit transcode gives the best result. To limit the code length of the compression dictionary, this scheme uses 20 zones, each with 20 segments, each holding 20 positions, to accommodate the 7545 characters. The 20 zones, segments and positions are then coded separately to achieve compression. Depending on the distribution of the zone, segment and position code characters, either the Huffman algorithm is used to generate the compression dictionary codes or the codes of attached Table 2 (the sorted code table) are used.
Clearly, when the Huffman algorithm generates the code of the 10th-ranked character, the code length can reach or exceed 5 bits. Because each Chinese character is represented by 3 characters, compression is achieved only if the average binary code length per character after compression does not exceed 5 bits. Coding the characters ranked after the 10th in the statistics is therefore in effect an expansion, and in general this 25% expansion can be offset, and compression achieved, only when the characters ranked after the 10th together account for no more than 25% of the total number of characters. Our criterion is therefore as follows: if the four top-ranked characters together account for more than 50%, the Huffman algorithm is used to generate the compression dictionary codes; otherwise the codes of attached Table 2 are used. The code of attached Table 2 is not a universally applicable code, but it guarantees a compression effect in every case. This is also why we use 20 zones, each with 20 segments of 20 positions, to accommodate the 7545 characters.
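The figures in this passage can be checked with a few lines of arithmetic. The sketch below is only a sanity check under the numbers quoted in the text; the variable names are hypothetical, and the observation that 19 × 19 × 19 would be too small is an added calculation, not a statement from the source.

```python
import math

gb2312_code_points = 94 * 94             # 94 zones x 94 positions
gb2312_chars = 3755 + 3008 + 682         # level-1 + level-2 + other characters
total_chars = 7545                       # approximate total quoted in the text

side = 20
capacity = side ** 3                     # cells in a 20 x 20 x 20 array
smaller_capacity = (side - 1) ** 3       # a 19 x 19 x 19 array, for comparison

max_code_len = 5                         # longest code in the sorted code table
worst_case_bits = 3 * max_code_len       # three identifiers per Chinese character
uncompressed_bits = 16                   # two bytes per Chinese character

print(gb2312_code_points, gb2312_chars)      # 8836 7445
print(capacity, smaller_capacity)            # 8000 6859
print(worst_case_bits, uncompressed_bits)    # 15 16
print(math.ceil(math.log2(side)))            # 5 bits for a fixed-length code of 20 symbols
```

With at most 5 bits per identifier, a character costs at most 15 bits against 16 uncompressed, which matches the claim that the table code always yields some compression.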
In addition, the code table of the present invention is constructed by the following rule. Three bits of 0 and 1 give 8 possible combinations: "000, 001, 010, 011, 100, 101, 110, 111". Two of them, "000" and "011", are taken as the first 2 codes. A 0 or 1 is appended to each of the remaining 6, giving 12 possible combinations: "0010, 0011, 0100, 0101, 1001, 1000, 1010, 1011, 1100, 1101, 1110, 1111". Five of these are taken as codes 3 to 7. A 0 or 1 is appended to each of the remaining 7, giving 14 possible combinations; "10000" is removed and the remaining 13 serve as codes 8 to 20. This allows the zone, segment and position to be separated correctly during decoding. As to the coding characteristic: let the i-th code be $d_1 d_2 \dots d_{n_i}$ and the j-th code be $d_1 d_2 \dots d_{r_j} \dots d_{n_j}$, with $1 \le n_i \le n_j$ ($i < j$) and $d_1 d_2 \dots d_{n_i} \ne d_1 d_2 \dots d_{r_j}$, where $n_i$ and $n_j$ are the code lengths and $1 \le r_j \le n_j$.
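The construction rule can be followed step by step in code. One detail is an assumption: the text does not say which five 4-bit codes become codes 3 to 7, so the sketch takes the first five in the order listed ("0010, 0011, 0100, 0101, 1001"); Table 1 of the patent may choose differently.

```python
from fractions import Fraction
from itertools import product

three_bit = ["".join(p) for p in product("01", repeat=3)]        # 8 combinations
first_two = ["000", "011"]                                       # codes 1-2
rest3 = [c for c in three_bit if c not in first_two]             # 6 left over

four_bit = [c + b for c in rest3 for b in "01"]                  # 12 combinations
# Assumption: codes 3-7 are the first five 4-bit codes as listed in the text.
codes_3_to_7 = ["0010", "0011", "0100", "0101", "1001"]
rest4 = [c for c in four_bit if c not in codes_3_to_7]           # 7 left over

five_bit = [c + b for c in rest4 for b in "01"]                  # 14 combinations
codes_8_to_20 = [c for c in five_bit if c != "10000"]            # 13 codes

sorted_code_table = first_two + codes_3_to_7 + codes_8_to_20

# No code may be a prefix of another, otherwise decoding would be ambiguous.
prefix_free = not any(a != b and b.startswith(a)
                      for a in sorted_code_table for b in sorted_code_table)
# Kraft sum: how much of the binary code space the 20 codes occupy.
kraft = sum(Fraction(1, 2 ** len(c)) for c in sorted_code_table)

print(len(sorted_code_table), prefix_free, kraft)   # 20 True 31/32
```

The resulting table has 2 codes of 3 bits, 5 of 4 bits and 13 of 5 bits, so the most frequent identifiers get the shortest codes when the table is paired with the descending frequency ranking.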
The above describes specific embodiments of the present invention and the technical principles employed. Any change made according to the conception of the present invention, where the resulting function does not go beyond the spirit covered by the specification and drawings, shall fall within the protection scope of the present invention.

Claims (9)

1. A data compression method for electronic health records, characterized by comprising the following steps:
(1) create an N1 × N2 × N3 three-dimensional array arry_3rd whose three dimensions are area, section, and position, where N1 is the maximum number of areas, N2 is the maximum number of sections, and N3 is the maximum number of positions; and fill the character sequence into the three-dimensional array arry_3rd; the characters include Chinese characters, ASCII symbols, and control characters;
(2) label the areas, sections, and positions of the three-dimensional array arry_3rd with ASCII characters;
(3) create the character mark table dic_character based on the three-dimensional array arry_3rd and the labeling result of step (2);
(4) create a sorting coding table; if the i-th code in the table is $i_1 i_2 \dots i_n$ and the j-th code is $j_1 j_2 \dots j_n \dots j_r$, where n and r are the code lengths and $1 \le n < r$, then $i_1 i_2 \dots i_n \ne j_1 j_2 \dots j_n$; the total number of codes is the maximum of N1, N2, and N3;
(5) with reference to the character mark table dic_character, convert all characters in the electronic health record into a character string composed of ASCII characters, and count and sort in descending order the ASCII character identifiers of the area, section, and position corresponding to all characters, respectively;
(6) select a compression method according to the result of step (5) and generate the compression dictionary dic_output;
(7) encode the ASCII characters that represent the Chinese characters according to the generated compression dictionary dic_output, completing the compression.
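Steps (1) to (3) and the conversion in step (5) can be sketched as follows. This is a minimal illustration, not the patented implementation: the claim does not say which ASCII characters label the areas, sections, and positions, so the letters A to T are an assumption, as are the function names and the flat fill order. The three-dimensional array is represented implicitly by computing (area, section, position) from a running index.

```python
N = 20
LABELS = "ABCDEFGHIJKLMNOPQRST"  # hypothetical identifiers for indices 0..19

def gb2312_chars():
    """Enumerate every character defined in the GB2312 two-byte range."""
    chars = []
    for hi in range(0xA1, 0xF8):
        for lo in range(0xA1, 0xFF):
            try:
                chars.append(bytes([hi, lo]).decode("gb2312"))
            except UnicodeDecodeError:
                pass  # unassigned code point in the table
    return chars

def build_dic_character():
    """Map each character to a 3-letter (area, section, position) identifier."""
    # GB2312 characters first, then 7-bit ASCII (printable and control).
    chars = list(dict.fromkeys(gb2312_chars() + [chr(c) for c in range(128)]))
    if len(chars) > N ** 3:
        raise ValueError("array too small for the character set")
    dic_character = {}
    for index, ch in enumerate(chars):
        area, rest = divmod(index, N * N)
        section, position = divmod(rest, N)
        dic_character[ch] = LABELS[area] + LABELS[section] + LABELS[position]
    return dic_character

def to_identifiers(text, dic_character):
    """Step (5), first half: rewrite text as a string of ASCII identifiers."""
    return "".join(dic_character[ch] for ch in text)
```

With this layout every character, Chinese or ASCII, becomes exactly three identifier letters, so the identifier string is always three times as long as the input text.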
2. The data compression method for electronic health records according to claim 1, characterized in that: N1 × N2 × N3 is preferably 20 × 20 × 20.
3. The data compression method for electronic health records according to claim 1, characterized in that: in step (1), the character sequence is filled into the three-dimensional array arry_3rd by first filling in the Chinese characters of the GB2312 coding table and then filling the ASCII characters and control characters into the remaining space of the three-dimensional array arry_3rd.
4. The data compression method for electronic health records according to claim 1, characterized in that: the character mark table dic_character of step (3) stores the one-to-one mapping between every character in the three-dimensional array arry_3rd and the ASCII character identifiers of that character's area, section, and position; the character mark table dic_character is created once and used repeatedly, and is updated at predetermined intervals.
5. The data compression method for electronic health records according to claim 1, characterized in that: in step (5), the ASCII character identifiers are counted and sorted in descending order by counting the number of times each identifier occurs and sorting by number of occurrences from largest to smallest.
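The counting and descending sort of step (5) can be illustrated with a brief Python sketch. It assumes that the identifier string consists of consecutive three-character groups (area, section, position) and that each of the three streams is ranked separately; the function name and the tie-breaking rule are hypothetical.

```python
from collections import Counter

def rank_identifiers(id_string):
    """Rank area, section, and position identifiers by descending frequency."""
    ranked = {}
    for offset, name in enumerate(("area", "section", "position")):
        counts = Counter(id_string[offset::3])  # every third character
        # Descending count; ties broken alphabetically so the result is stable.
        ranked[name] = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked

# Three characters encoded as ABC, ABD, AEC:
print(rank_identifiers("ABCABDAEC"))
```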
6. The data compression method for electronic health records according to claim 1, characterized in that: in step (6), the compression method used to generate the compression dictionary dic_output is selected according to the distribution of character occurrences, specifically as follows:
I) if the occurrences are concentrated on a small number of characters, the Huffman algorithm is used to binary-code the ranking results of the ASCII characters of the area, section, and position, generating the compression dictionary dic_output;
II) if the character distribution tends toward uniform, the codes of the sorting coding table are used: the ranking results of the ASCII characters of the area, section, and position are matched one to one with the code order of the sorting coding table, generating the compression dictionary dic_output.
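The two branches of claim 6 can be sketched for a single identifier stream as below. This is an illustration only: `huffman_codes` is a textbook Huffman construction using `heapq`, not necessarily the variant the inventors used, and `rank_table_codes` simply pairs the k-th ranked identifier with the k-th code of whatever sorting coding table is supplied. Both names, and the short table in the usage lines, are hypothetical.

```python
import heapq
from itertools import count

def huffman_codes(ranked):
    """Branch I: Huffman codes for a list of (identifier, count) pairs."""
    if not ranked:
        return {}
    if len(ranked) == 1:
        return {ranked[0][0]: "0"}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(n, next(tie), {sym: ""}) for sym, n in ranked]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]

def rank_table_codes(ranked, table):
    """Branch II: the k-th ranked identifier receives the k-th table code."""
    if len(ranked) > len(table):
        raise ValueError("more identifiers than codes in the table")
    return {sym: table[k] for k, (sym, _) in enumerate(ranked)}

ranked = [("A", 8), ("B", 3), ("C", 1)]      # already sorted, descending
print(huffman_codes(ranked))
print(rank_table_codes(ranked, ["000", "011", "0010"]))
```

One possible layout for dic_output, assumed here rather than stated in the claim, is one such mapping for each of the area, section, and position streams.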
7. The data compression method for electronic health records according to claim 6, characterized in that: the principle for judging whether the characters are concentrated or uniformly distributed is whether the occurrences of the 4 top-ranked characters together account for 50% or more of the total number of characters; if the occurrences of the 4 top-ranked characters exceed 50%, the codes are generated with the Huffman algorithm; otherwise the sorting coding table is used.
8. A decompression method matching the data compression method according to claim 1, characterized by comprising the following steps:
1) with reference to the compression dictionary dic_output, extract binary strings from the compressed file in sequence: the first extraction yields the binary code of the area identifier, the second yields the binary code of the section identifier, and the third yields the binary code of the position identifier;
2) convert the binary codes obtained into the corresponding ASCII character identifiers according to the compression dictionary dic_output and output them to the character string out1;
3) scan the character string out1 from left to right, three characters at a time, and for each group of three characters look up the corresponding Chinese character in the character mark table dic_character, completing the decompression.
9. The decompression method according to claim 8, characterized in that: the compression dictionary dic_output and the character mark table dic_character are the same as the compression dictionary dic_output and the character mark table dic_character used for data compression.
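The three decompression steps can be sketched as follows. The sketch assumes that dic_output holds one prefix-free code table for each of the area, section, and position streams (the claims do not spell out its layout), that the compressed data is available as a string of "0" and "1" characters, and that dic_character maps each character to a three-letter identifier. The function name and the tiny dictionaries in the usage lines are made up for illustration.

```python
def decompress(bits, dic_output, dic_character):
    """Decode a bit string back to text using the two dictionaries."""
    order = ("area", "section", "position")
    # Invert each code table: bit string -> identifier.
    decoders = {
        name: {code: sym for sym, code in dic_output[name].items()}
        for name in order
    }
    triple_to_char = {ids: ch for ch, ids in dic_character.items()}

    # Steps 1) and 2): cycle area -> section -> position while reading bits.
    out1, buffer, dim = [], "", 0
    for bit in bits:
        buffer += bit
        symbol = decoders[order[dim]].get(buffer)
        if symbol is not None:
            out1.append(symbol)
            buffer, dim = "", (dim + 1) % 3
    if buffer or len(out1) % 3:
        raise ValueError("bit string ends in the middle of a character")

    # Step 3): look up each group of three identifiers.
    out1 = "".join(out1)
    return "".join(triple_to_char[out1[i:i + 3]] for i in range(0, len(out1), 3))

# Toy dictionaries (hypothetical values, two characters only).
dic_character = {"病": "AAB", "历": "ABA"}
dic_output = {
    "area": {"A": "0"},
    "section": {"A": "0", "B": "1"},
    "position": {"A": "1", "B": "0"},
}
print(decompress("000011", dic_output, dic_character))
```

Reading bit by bit works here only because each table is prefix-free, which is the property the sorting coding table is required to have.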
CN201610961205.5A 2016-10-28 2016-10-28 A kind of data compression and decompressing method towards electronic health record Active CN106549674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610961205.5A CN106549674B (en) 2016-10-28 2016-10-28 A kind of data compression and decompressing method towards electronic health record

Publications (2)

Publication Number Publication Date
CN106549674A CN106549674A (en) 2017-03-29
CN106549674B true CN106549674B (en) 2019-07-23

Family

ID=58393924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610961205.5A Active CN106549674B (en) 2016-10-28 2016-10-28 A kind of data compression and decompressing method towards electronic health record

Country Status (1)

Country Link
CN (1) CN106549674B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368509A (en) * 2020-03-05 2020-07-03 薛昌熵 Method and system for encoding and decoding generic characters
CN112131865B (en) * 2020-09-11 2023-12-08 成都运达科技股份有限公司 Track traffic report digital compression processing method, device and storage medium
CN116153452B (en) * 2023-04-18 2023-06-30 济南科汛智能科技有限公司 Medical electronic medical record storage system based on artificial intelligence

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350624A (en) * 2008-09-11 2009-01-21 中国科学院计算技术研究所 Method for compressing Chinese text supporting ANSI encode
CN102122960A (en) * 2011-01-18 2011-07-13 西安理工大学 Multi-character combination lossless data compression method for binary data
CN102664634A (en) * 2012-04-16 2012-09-12 中国航空无线电电子研究所 Data compression method used during Big Dipper reception and transmission of Chinese character text messages
US8929402B1 (en) * 2005-09-29 2015-01-06 Silver Peak Systems, Inc. Systems and methods for compressing packet data by predicting subsequent data
CN104467868A (en) * 2014-11-04 2015-03-25 深圳市元征科技股份有限公司 Chinese text compression method
CN105933009A (en) * 2016-05-19 2016-09-07 浪潮(北京)电子信息产业有限公司 Data compression method, data compression system, data decompression method and data decompression system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
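Title
"COMPRESSING CHINESE TEXT FILES USING AN ADAPTIVE"; TGhim Hwee Ong et al.; Proceedings of IEEE Singapore International Conference on Networks/International Conference on Information Engineering '93; 20020806; 1-5
"Research on a general and simple Chinese text compression method" (通用简易中文文本压缩方法研究); 游荣彦 et al.; Journal of South China Normal University (Natural Science Edition) (华南师范大学学报(自然科学版)); 20010531 (No. 2); 84-88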

Also Published As

Publication number Publication date
CN106549674A (en) 2017-03-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province

Patentee after: Yinjiang Technology Co.,Ltd.

Address before: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province

Patentee before: ENJOYOR Co.,Ltd.