CN106775909A

CN106775909A - Method and device for determining the encoding format of Java files and byte streams

Info

Publication number: CN106775909A
Application number: CN201611041686.4A
Authority: CN
Inventors: 王同庆
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2016-11-22
Filing date: 2016-11-22
Publication date: 2017-05-31

Abstract

The invention provides a method and device for determining the encoding format of Java files and byte streams. The method includes: reading the first four bytes of a file or byte stream; and determining the encoding format of the file or byte stream according to the Unicode encoding rules and those first four bytes. Because the method and device of the embodiments of the present invention determine the encoding format of files and byte streams according to the Unicode encoding rules, they have the advantages of a small workload, concise programs, and accurate determination.

Description

Method and device for determining the encoding format of Java files and byte streams

Technical field

The present invention relates to the field of data processing, in particular to methods for determining an encoding format, and specifically to a method and device for determining the encoding format of Java files and byte streams.

Background

This section is intended to provide background or context for the embodiments of the present invention set forth in the claims. The description herein is not admitted to be prior art merely because it is included in this section.

Character strings in memory are not limited to strings loaded directly from class code. Many are read from text files, some are read from databases, and some may be built from byte arrays. Most of these sources are not Unicode-encoded, and the reason is simple: storage optimization.

Various encoding problems therefore have to be handled. Before processing, the encoding of the "source" must be identified, so that the data can be read into memory correctly using the specified encoding.
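To illustrate why the source encoding has to be known before reading, the following is a minimal sketch, not code from the invention. The class name `DecodeWithKnownCharset` and the method `decode` are hypothetical names chosen for illustration. It decodes the same bytes once with the charset they were written in and once with a different one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeWithKnownCharset {
    // Decode raw bytes with an explicitly named charset rather than the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String original = "\u4e25"; // one CJK character, written as an escape
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset that produced the bytes restores the text.
        String right = decode(raw, StandardCharsets.UTF_8);
        // Decoding with a different charset yields three unrelated Latin-1 characters.
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original));
        System.out.println(wrong.equals(original));
        System.out.println(wrong.length());
    }
}
```

The sketch only shows the consequence of choosing the wrong charset; how the correct charset is found is the subject of the rest of this description.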

The currently popular method of determining the encoding format of files and byte streams checks the code range of every byte of the file stream and of the byte stream. It suffers from a heavy workload, complex programs, and a tendency to produce errors.

Summary of the invention

The embodiments of the present invention determine the encoding format of files and byte streams using the Unicode encoding rules, in order to solve the problems that existing determination methods involve a heavy workload and are error-prone.

To achieve the above object, an embodiment of the present invention provides a method for determining the encoding format of Java files and byte streams, including: reading the first four bytes of a file or byte stream; and determining the encoding format of the file or byte stream according to the Unicode encoding rules and the first four bytes of the file or byte stream.

Further, in one embodiment, determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically includes:

If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;

If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;

If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.

Further, in one embodiment, if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.
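The negative values in these rules arise because a Java `byte` is signed. The following minimal sketch (the class name `SignedBomBytes` and the helper `unsigned` are hypothetical, introduced only for illustration) prints the unsigned hexadecimal value behind each signed value used above.

```java
class SignedBomBytes {
    // Convert a signed Java byte (-128..127) to its unsigned value (0..255).
    static int unsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte[] signed = {-1, -2, -17, -69, -65};
        for (byte b : signed) {
            // Prints, for example, "-1 -> 0xFF".
            System.out.printf("%d -> 0x%02X%n", b, unsigned(b));
        }
    }
}
```

Read this way, -1 and -2 are the bytes FF and FE, and -17, -69, -65 are the bytes EF, BB, BF.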

Further, in one embodiment, determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically includes:

The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail. If iHead >= 0xa1 and iHead <= 0xf7, and iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte stream is GB2312;

The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail. If iHead >= 0x81 and iHead <= 0xf7, and (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is GBK;

The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail. If iHead >= 0xa1 and iHead <= 0xf9, and (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is BIG5;

The 1st byte head and the 2nd byte tail2 are each ANDed with 0xff to obtain iHead and iTail2 (iTail3 and iTail4 are obtained from the 3rd and 4th bytes in the same way). If iHead >= 0x81 and iHead <= 0xf9, and (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), and (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte stream is UTF8.

Further, in one embodiment, the method also includes:

Determining the encoding format of a character string by determining the encoding format of a byte stream, which specifically includes:

Setting the encoding format of the original string, whose encoding format is unknown, to a certain encoding format;

Converting the original string into a byte stream in the set encoding format, then converting that byte stream back into a new string, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the set encoding format.

To achieve the above object, an embodiment of the present invention also provides a device for determining the encoding format of Java files and byte streams, including: a byte reading module for reading the first four bytes of a file or byte stream; and a judging module for determining the encoding format of the file or byte stream according to the Unicode encoding rules and the first four bytes of the file or byte stream.

Further, in one embodiment, the judging module determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically includes:

If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;

If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;

Further, in one embodiment, the judging module determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically includes:

Further, in one embodiment, the judging module is also used to determine the encoding format of a character string by determining the encoding format of a byte stream, which specifically includes:

Because the method and device of the embodiments of the present invention determine the encoding format of files and byte streams according to the Unicode encoding rules, they have the advantages of a small workload, concise programs, and accurate determination.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of the method for determining the encoding format of Java files and byte streams according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the device for determining the encoding format of Java files and byte streams according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Those skilled in the art know that embodiments of the present invention can be implemented as a system, device, apparatus, method, or computer program product. The present disclosure can therefore take the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.

The principle and spirit of the present invention are explained in detail below with reference to several representative embodiments.

The principle of the present invention is as follows: the encoding format of files and byte streams is determined according to the Unicode encoding rules. Compared with the existing method of checking the code range of every byte of the file stream and of the byte stream, the workload is small and the determination is accurate.

Fig. 1 is a flow chart of the method for determining the encoding format of Java files and byte streams according to an embodiment of the present invention. As shown in the figure, the method includes: step S101, reading the first four bytes of a file or byte stream; and step S102, determining the encoding format of the file or byte stream according to the Unicode encoding rules and the first four bytes of the file or byte stream.

First, in the present invention, the Unicode encoding rules are as follows, illustrated with a file containing the character "严":

1) ANSI: the file is encoded as the two bytes "D1 CF", which is exactly the GB2312 encoding of "严"; this also implies that GB2312 is stored in big-endian order.

2) Unicode: the encoding is the four bytes "FF FE 25 4E", where "FF FE" indicates little-endian storage and the actual code is 4E25.

3) Unicode big endian: the encoding is the four bytes "FE FF 4E 25", where "FE FF" indicates big-endian storage.

4) UTF-8: the encoding is the six bytes "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate that this is UTF-8 encoding, and the last three bytes "E4 B8 A5" are the actual encoding of "严", stored in the same order as they are encoded.
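The byte values quoted in these rules can be checked with a short program. The following is a minimal sketch rather than part of the invention. It uses the character U+4E25, whose encodings D1 CF, 4E 25 and E4 B8 A5 are listed above. The class name `YanBytes` and the helper `hex` are hypothetical, and the sketch assumes the JDK in use ships the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class YanBytes {
    // Format a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String yan = "\u4e25"; // the example character from the rules above

        // UTF_16LE and UTF_16BE encode without a byte order mark,
        // so only the character's own bytes are printed.
        System.out.println("GB2312:   " + hex(yan.getBytes(Charset.forName("GB2312"))));
        System.out.println("UTF-16LE: " + hex(yan.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-16BE: " + hex(yan.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-8:    " + hex(yan.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The leading bytes FF FE, FE FF and EF BB BF in the rules are signatures written in front of these character bytes by the editor that saved the file; the Java encoders used here do not add them.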

Embodiment one:

Determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file includes the following steps:

1) Obtain the encoding format of the current operating system's running environment:

String dc = Charset.defaultCharset().name();

2) Convert the input stream into a Unicode input stream:

UnicodeInputStream uin = new UnicodeInputStream(in, dc);

3) Read the first four bytes of the file stream:

byte[] head = new byte[4];

in.read(head);

4) Define the default encoding format as GBK:

String code = "GBK";

5) Judge according to the Unicode encoding rules: if the 1st byte is -1 and the 2nd byte is -2 (the signed-byte values of FF FE), the encoding format is UTF-16:

if (head[0] == -1 && head[1] == -2)

code = "UTF-16";

6) Judge according to the Unicode encoding rules: if the 1st byte is -2 and the 2nd byte is -1 (the signed-byte values of FE FF), the encoding format is Unicode:

if (head[0] == -2 && head[1] == -1)

code = "Unicode";

7) Judge according to the Unicode encoding rules: if the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65 (the signed-byte values of EF BB BF), the encoding format is UTF-8:

if (head[0] == -17 && head[1] == -69 && head[2] == -65)

code = "UTF-8";

In step 6), if the encoding format is Unicode, the encoding format returned for the file is UTF-16. The encoding format finally returned for the file is therefore one of GBK, UTF-8, and UTF-16.

In this embodiment, the code implementing the determination of the file's encoding format is as follows:
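The original listing is not reproduced in this text. As a stand-in, the steps of embodiment one can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' code: the class name `FileEncodingSketch` and the methods `fromHead` and `fromStream` are hypothetical, and the `UnicodeInputStream` wrapper of step 2) is left out because its definition is not given.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class FileEncodingSketch {
    // Applies the three signature checks to the bytes actually read (at most four).
    static String fromHead(byte[] head) {
        String code = "GBK"; // default from step 4) when no signature matches
        if (head.length >= 2 && head[0] == -1 && head[1] == -2) {
            code = "UTF-16"; // FF FE
        }
        if (head.length >= 2 && head[0] == -2 && head[1] == -1) {
            code = "UTF-16"; // FE FF: judged "Unicode", returned as UTF-16
        }
        if (head.length >= 3 && head[0] == -17 && head[1] == -69 && head[2] == -65) {
            code = "UTF-8"; // EF BB BF
        }
        return code;
    }

    // Reads up to four bytes; read() may deliver fewer bytes per call, so loop.
    static String fromStream(InputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream before four bytes were available
            }
            n += r;
        }
        return fromHead(Arrays.copyOf(head, n));
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8WithSignature = {
            (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0xE4, (byte) 0xB8, (byte) 0xA5
        };
        System.out.println(fromStream(new ByteArrayInputStream(utf8WithSignature)));
    }
}
```

The length checks are an addition of this sketch so that files shorter than four bytes fall back to the default instead of matching stale zero bytes. Returning "UTF-16" for both byte orders works with Java's decoder, because its UTF-16 charset reads the leading byte order mark to choose the byte order.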

Embodiment two:

Determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream is implemented as follows:

1) Determine whether the byte stream is GB2312-encoded: obtain the first two bytes of the byte[] byte stream, and AND the 1st byte head and the 2nd byte tail each with 0xff to obtain iHead and iTail. If iHead >= 0xa1 and iHead <= 0xf7, and iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte array is GB2312.

2) Determine whether the byte stream is GBK-encoded: obtain the first two bytes of the byte[] byte stream, and AND the 1st byte head and the 2nd byte tail each with 0xff to obtain iHead and iTail. If iHead >= 0x81 and iHead <= 0xf7, and (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte array is GBK.

3) Determine whether the byte stream is BIG5-encoded: obtain the first two bytes of the byte[] byte stream, and AND the 1st byte head and the 2nd byte tail each with 0xff to obtain iHead and iTail. If iHead >= 0xa1 and iHead <= 0xf9, and (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte array is BIG5.

4) Determine whether the byte stream is UTF-8-encoded: obtain the first 4 bytes of the byte[] byte stream, and AND the 1st byte head and the 2nd byte tail2 each with 0xff to obtain iHead and iTail2 (iTail3 and iTail4 are obtained from the 3rd and 4th bytes in the same way). If iHead >= 0x81 and iHead <= 0xf9, and (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), and (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte array is UTF8.

In this embodiment, the code implementing the determination of the byte stream's encoding format is as follows:
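The original listing is not reproduced in this text. The two-byte range checks of steps 1) to 3) can be sketched as follows. This is a minimal illustration, not the inventors' code. The class name `ByteRangeSketch` and all method names are hypothetical, and the four-byte check of step 4) is left out of the sketch.

```java
class ByteRangeSketch {
    static boolean between(int value, int low, int high) {
        return value >= low && value <= high;
    }

    // Step 1): head in A1..F7 and tail in A1..FE.
    static boolean looksLikeGb2312(byte[] b) {
        if (b.length < 2) {
            return false;
        }
        int iHead = b[0] & 0xff;
        int iTail = b[1] & 0xff;
        return between(iHead, 0xa1, 0xf7) && between(iTail, 0xa1, 0xfe);
    }

    // Step 2): head in 81..F7 and tail in 40..7E or 80..FE.
    static boolean looksLikeGbk(byte[] b) {
        if (b.length < 2) {
            return false;
        }
        int iHead = b[0] & 0xff;
        int iTail = b[1] & 0xff;
        return between(iHead, 0x81, 0xf7)
                && (between(iTail, 0x40, 0x7e) || between(iTail, 0x80, 0xfe));
    }

    // Step 3): head in A1..F9 and tail in 40..7E or A1..FE.
    static boolean looksLikeBig5(byte[] b) {
        if (b.length < 2) {
            return false;
        }
        int iHead = b[0] & 0xff;
        int iTail = b[1] & 0xff;
        return between(iHead, 0xa1, 0xf9)
                && (between(iTail, 0x40, 0x7e) || between(iTail, 0xa1, 0xfe));
    }

    // Checks in the order the steps are listed; returns null if none matches.
    static String classify(byte[] b) {
        if (looksLikeGb2312(b)) {
            return "GB2312";
        }
        if (looksLikeGbk(b)) {
            return "GBK";
        }
        if (looksLikeBig5(b)) {
            return "BIG5";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(classify(new byte[] {(byte) 0xD1, (byte) 0xCF}));
        System.out.println(classify(new byte[] {(byte) 0x81, 0x40}));
    }
}
```

The three ranges overlap (the bytes D1 CF satisfy all of them), so the result depends on the order of the checks. The sketch uses the order in which the steps are listed, which is an assumption about the intended behaviour.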

Embodiment three:

The method of the present invention for determining the encoding format of Java files and byte streams also includes determining the encoding format of a character string by determining the encoding format of a byte stream, which specifically includes: setting the encoding format of the original string, whose encoding format is unknown, to a certain encoding format; converting the original string into a byte stream in the set encoding format, then converting that byte stream back into a new string, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the set encoding format.

In a specific implementation, the determination of a string's encoding builds on the determination of a byte stream's encoding, and works as follows:

1) First, set the encoding format of the original string A, whose encoding format is unknown, to GB2312.

2) Convert the original string A into a byte[] byte stream under the assumed encoding format GB2312, then convert that GB2312 byte[] byte stream into a string B, and compare the original string A with the string B; if the two strings are identical, the encoding format of the original string A is GB2312.

3) Next, set the encoding format of the original string A, whose encoding format is unknown, to ISO-8859-1.

4) Convert the original string A into a byte[] byte stream under the assumed encoding format ISO-8859-1, then convert that ISO-8859-1 byte[] byte stream into a string B, and compare the original string A with the string B; if the two strings are identical, the encoding format of the original string A is ISO-8859-1.

5) Next, set the encoding format of the original string A, whose encoding format is unknown, to UTF-8.

6) Convert the original string A into a byte[] byte stream under the assumed encoding format UTF-8, then convert that UTF-8 byte[] byte stream into a string B, and compare the original string A with the string B; if the two strings are identical, the encoding format of the original string A is UTF-8.

7) Next, set the encoding format of the original string A, whose encoding format is unknown, to GBK.

8) Convert the original string A into a byte[] byte stream under the assumed encoding format GBK, then convert that GBK byte[] byte stream into a string B, and compare the original string A with the string B; if the two strings are identical, the encoding format of the original string A is GBK.

9) Next, set the encoding format of the original string A, whose encoding format is unknown, to BIG5.

10) Convert the original string A into a byte[] byte stream under the assumed encoding format BIG5, then convert that BIG5 byte[] byte stream into a string B, and compare the original string A with the string B; if the two strings are identical, the encoding format of the original string A is BIG5.

In this embodiment, the code implementing the determination of the string's encoding format is as follows:
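The original listing is not reproduced in this text. The round-trip comparison of steps 1) to 10) can be sketched as follows. This is a minimal illustration, not the inventors' code. The class name `RoundTripSketch` and the methods `roundTrips` and `firstMatch` are hypothetical, and the sketch assumes the JDK in use ships the GB2312, GBK and Big5 charsets.

```java
import java.nio.charset.Charset;

class RoundTripSketch {
    // Candidate encodings in the order used by steps 1) to 10).
    static final String[] CANDIDATES = {"GB2312", "ISO-8859-1", "UTF-8", "GBK", "BIG5"};

    // Encode the string with the candidate charset, decode it again,
    // and report whether the result equals the original string.
    static boolean roundTrips(String original, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        byte[] bytes = original.getBytes(charset);
        String restored = new String(bytes, charset);
        return original.equals(restored);
    }

    // Returns the first candidate that survives the round trip, or null if none does.
    static String firstMatch(String original) {
        for (String name : CANDIDATES) {
            if (roundTrips(original, name)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\u4e25"));       // a CJK character
        System.out.println(firstMatch("\uD83D\uDE00")); // a character outside the CJK charsets
    }
}
```

A round trip succeeds whenever every character of the string can be represented in the candidate charset, so the order of the candidates decides the result when several charsets qualify. A plain ASCII string, for instance, already matches the first candidate.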

It should be noted that although the operations of the method of the present invention are described in the drawings in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the operations shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps.

Having described the method of the exemplary embodiments of the present invention, the device for determining the encoding format of Java files and byte streams according to the exemplary embodiments is introduced next with reference to Fig. 2. For the implementation of the device, refer to the implementation of the method above; repeated details are omitted. The terms "module" and "unit" used below may refer to software and/or hardware that implements a predetermined function. Although the modules described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 2 is a schematic structural diagram of the device for determining the encoding format of Java files and byte streams according to an embodiment of the present invention. As shown in Fig. 2, the device includes:

A byte reading module 101 for reading the first four bytes of a file or byte stream;

A judging module 102 for determining the encoding format of the file or byte stream according to the Unicode encoding rules and the first four bytes of the file or byte stream.

In this embodiment, the judging module 102 determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically includes:

If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;

If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode; if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16;

In this embodiment, the judging module 102 determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically includes:

In this embodiment, the judging module 102 is also used to determine the encoding format of a character string by determining the encoding format of a byte stream, which specifically includes:

Setting the encoding format of the original string, whose encoding format is unknown, to a certain encoding format; converting the original string into a byte stream in the set encoding format, then converting that byte stream back into a new string, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the set encoding format.

In addition, although several units of the device for determining the encoding format of Java files and byte streams are mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of the present invention, the features and functions of two or more of the units described above may be embodied in one unit. Likewise, the features and functions of one unit described above may be further divided among multiple units.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Specific embodiments have been used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for determining the encoding format of Java files and byte streams, characterized by including:

Reading the first four bytes of a file or byte stream;

Determining the encoding format of the file or byte stream according to the Unicode encoding rules and the first four bytes of the file or byte stream.

2. The method for determining the encoding format of Java files and byte streams according to claim 1, characterized in that determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically includes:

If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;

If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;

If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.

3. The method for determining the encoding format of Java files and byte streams according to claim 2, characterized in that if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.

4. The method for determining the encoding format of Java files and byte streams according to claim 1, characterized in that determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically includes:

The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail; if iHead >= 0xa1 and iHead <= 0xf7, and iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte stream is GB2312;

The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail; if iHead >= 0x81 and iHead <= 0xf7, and (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is GBK;

The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail; if iHead >= 0xa1 and iHead <= 0xf9, and (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is BIG5;

The 1st byte head and the 2nd byte tail2 are each ANDed with 0xff to obtain iHead and iTail2; if iHead >= 0x81 and iHead <= 0xf9, and (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), and (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte stream is UTF8.

5. The method for determining the encoding format of Java files and byte streams according to claim 1, characterized by also including:

Converting the original string into a byte stream in the set encoding format, then converting that byte stream back into a new string, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the set encoding format.

6. A device for determining the encoding format of Java files and byte streams, characterized by including:

A byte reading module for reading the first four bytes of a file or byte stream;

A judging module for determining the encoding format of the file or byte stream according to the Unicode encoding rules and the first four bytes of the file or byte stream.

7. The device for determining the encoding format of Java files and byte streams according to claim 6, characterized in that the judging module determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically includes:

If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;

If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;

8. The device for determining the encoding format of Java files and byte streams according to claim 7, characterized in that if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.

9. The device for determining the encoding format of Java files and byte streams according to claim 6, characterized in that the judging module determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically includes:

10. The device for determining the encoding format of Java files and byte streams according to claim 6, characterized in that the judging module is also used to determine the encoding format of a character string by determining the encoding format of a byte stream, which specifically includes: