CN106775909A - Method and device for determining the encoding format of Java files and byte streams - Google Patents

Method and device for determining the encoding format of Java files and byte streams

Info

Publication number
CN106775909A
Authority
CN
China
Prior art keywords
byte
coded format
itail
ihead
stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611041686.4A
Other languages
Chinese (zh)
Inventor
王同庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN201611041686.4A
Publication of CN106775909A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and device for determining the encoding format of Java files and byte streams. The method comprises: reading the first four bytes of a file or byte stream; and determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream. Because the method and device of the embodiments of the invention determine the encoding format of files and byte streams according to Unicode encoding rules, they have the advantages of a small workload, a concise program, and accurate determination.

Description

Method and device for determining the encoding format of Java files and byte streams
Technical field
The present invention relates to the field of data processing, in particular to a method for determining an encoding format, and specifically to a method and device for determining the encoding format of Java files and byte streams.
Background
This section is intended to provide background or context for the embodiments of the invention recited in the claims. The description herein is not admitted to be prior art merely because it is included in this section.
Strings in memory are not limited to those loaded directly from class code. Many strings are read from text files, some are read from databases, and some may be built from byte arrays. These sources are generally not Unicode-encoded, for a simple reason: storage optimization.
Various encoding problems therefore have to be handled. Before processing, the encoding of the "source" must be established, and the data must then be read into memory correctly using the specified encoding.
The currently popular way to determine the encoding format of a file or byte stream is to check the encoding range of every byte of the file stream or byte stream, which has the drawbacks of a heavy workload, a complex program, and a tendency to produce errors.
Summary of the invention
The embodiments of the invention use the encoding rules of Unicode to determine the encoding format of files and byte streams, in order to solve the heavy workload and error-proneness of existing determination methods.
To achieve the above object, an embodiment of the invention provides a method for determining the encoding format of Java files and byte streams, comprising: reading the first four bytes of a file or byte stream; and determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
Further, in one embodiment, determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically comprises:
If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.
Further, in one embodiment, if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.
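The negative values in these rules arise because Java's `byte` type is signed: -1, -2, -17, -69 and -65 are the signed views of 0xFF, 0xFE, 0xEF, 0xBB and 0xBF. The following is a minimal sketch of that correspondence; the class name `SignedByteDemo` and the helper `unsigned` are hypothetical and do not come from the patent.

```java
final class SignedByteDemo {
    /** Returns the unsigned value (0..255) of a signed Java byte. */
    static int unsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte[] signed = {-1, -2, -17, -69, -65};
        for (byte b : signed) {
            // Prints each signed value next to its two-digit hex form,
            // for example "-1 -> 0xFF".
            System.out.printf("%d -> 0x%02X%n", b, unsigned(b));
        }
    }
}
```

Masking with 0xff is the same operation the byte-stream rules use to obtain iHead and iTail.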
Further, in one embodiment, determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically comprises:
ANDing the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf7, and the second new byte satisfies iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
ANDing the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf7, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is GBK;
ANDing the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf9, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
ANDing the 1st byte head and the 2nd to 4th bytes tail2, tail3 and tail4 with 0xff to obtain iHead, iTail2, iTail3 and iTail4; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf9, the second new byte satisfies (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), the third new byte satisfies (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and the fourth new byte satisfies (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte stream is UTF-8.
Further, in one embodiment, the method also comprises:
Determining the encoding format of a string by determining the encoding format of a byte stream, which specifically comprises:
Assuming that the original string, whose encoding format is unknown, has a certain encoding format;
Converting the original string into a byte stream in the assumed encoding format, then converting that byte stream back into a new string using the assumed encoding format, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the assumed encoding format.
To achieve the above object, an embodiment of the invention further provides a device for determining the encoding format of Java files and byte streams, comprising: a byte reading module for reading the first four bytes of a file or byte stream; and a judgment module for determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
Further, in one embodiment, the judgment module determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically comprises:
If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.
Further, in one embodiment, if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.
Further, in one embodiment, the judgment module determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically comprises:
ANDing the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf7, and the second new byte satisfies iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
ANDing the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf7, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is GBK;
ANDing the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf9, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
ANDing the 1st byte head and the 2nd to 4th bytes tail2, tail3 and tail4 with 0xff to obtain iHead, iTail2, iTail3 and iTail4; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf9, the second new byte satisfies (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), the third new byte satisfies (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and the fourth new byte satisfies (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte stream is UTF-8.
Further, in one embodiment, the judgment module is also used to determine the encoding format of a string by determining the encoding format of a byte stream, which specifically comprises:
Assuming that the original string, whose encoding format is unknown, has a certain encoding format;
Converting the original string into a byte stream in the assumed encoding format, then converting that byte stream back into a new string using the assumed encoding format, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the assumed encoding format.
Because the method and device of the embodiments of the invention determine the encoding format of files and byte streams according to Unicode encoding rules, they have the advantages of a small workload, a concise program, and accurate determination.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for determining the encoding format of Java files and byte streams according to an embodiment of the invention;
Fig. 2 is a structural diagram of the device for determining the encoding format of Java files and byte streams according to an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Those skilled in the art will appreciate that embodiments of the invention may be implemented as a system, device, apparatus, method, or computer program product. The disclosure may therefore take the form of entirely hardware, entirely software (including firmware, resident software, microcode, and the like), or a combination of hardware and software.
The principle and spirit of the invention are explained in detail below with reference to several representative embodiments.
The principle of the invention is to determine the encoding format of files and byte streams according to Unicode encoding rules. Compared with the existing approach of checking the encoding range of every byte of the file stream or byte stream, this involves a smaller workload and gives an accurate determination.
Fig. 1 is a flowchart of the method for determining the encoding format of Java files and byte streams according to an embodiment of the invention. As shown in the figure, the method comprises: step S101, reading the first four bytes of a file or byte stream; and step S102, determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
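Step S101 can be sketched as follows. This is only an illustration under stated assumptions: the class `HeadReader` and the method `peekHead` are hypothetical names, and wrapping the input in a `PushbackInputStream` so that the bytes can be pushed back after inspection is a design choice of this sketch, not something the patent specifies.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.Arrays;

final class HeadReader {
    /**
     * Reads up to four bytes from the stream and pushes them back, so the
     * caller can still read the stream from the beginning afterwards.
     * The stream must have a pushback buffer of at least four bytes.
     */
    static byte[] peekHead(PushbackInputStream in) throws IOException {
        byte[] buf = new byte[4];
        int total = 0;
        // read() may return fewer bytes than requested, so loop until
        // four bytes have arrived or the stream ends.
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        if (total > 0) {
            in.unread(buf, 0, total);
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = new ByteArrayInputStream(new byte[] {-17, -69, -65, 'a', 'b'});
        PushbackInputStream in = new PushbackInputStream(raw, 4);
        System.out.println(Arrays.toString(peekHead(in))); // [-17, -69, -65, 97]
        System.out.println(in.read()); // 239: the stream is still at its start
    }
}
```

The returned array is shorter than four bytes when the input itself is shorter, so step S102 should check the length before indexing.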
First, in the present invention, the Unicode encoding rules are as follows:
1) ANSI: the file is encoded as two bytes, "D1 CF", which is exactly the GB2312 encoding of the character "严"; this also implies that GB2312 is stored in big-endian order.
2) Unicode: the encoding is four bytes, "FF FE 25 4E", where "FF FE" indicates little-endian storage and the actual code is 4E25.
3) Unicode big endian: the encoding is four bytes, "FE FF 4E 25", where "FE FF" indicates big-endian storage.
4) UTF-8: the encoding is six bytes, "EF BB BF E4 B8 A5"; the first three bytes, "EF BB BF", indicate that this is UTF-8 encoding, and the last three, "E4 B8 A5", are the actual encoding of "严", stored in the same order as they are encoded.
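The byte values in these rules can be reproduced by encoding the character "严" (U+4E25, the code 4E25 mentioned above) in Java. This is a minimal sketch; the class name `YanBytes` and the helper `hex` are hypothetical, and the sketch assumes the running JDK provides the GB2312 charset. The charsets used here emit no byte order mark, so the output shows only the payload bytes that follow "FF FE", "FE FF" or "EF BB BF" in the rules.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class YanBytes {
    /** Formats bytes as space-separated uppercase hex, e.g. "D1 CF". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String yan = "\u4E25"; // the character 严
        // Assumption: the JDK in use ships the GB2312 charset.
        System.out.println(hex(yan.getBytes(Charset.forName("GB2312")))); // D1 CF
        System.out.println(hex(yan.getBytes(StandardCharsets.UTF_16LE))); // 25 4E
        System.out.println(hex(yan.getBytes(StandardCharsets.UTF_16BE))); // 4E 25
        System.out.println(hex(yan.getBytes(StandardCharsets.UTF_8)));    // E4 B8 A5
    }
}
```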
Embodiment one:
Determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file comprises the following steps:
1) Obtain the encoding format of the current operating system environment:
String dc = Charset.defaultCharset().name();
2) Convert the input stream into a Unicode input stream:
UnicodeInputStream uin = new UnicodeInputStream(in, dc);
3) Read the first four bytes of the file stream:
byte[] head = new byte[4];
in.read(head);
4) Set the default encoding format to GBK:
String code = "GBK";
5) Judge according to the Unicode encoding rules: if the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16:
if (head[0] == -1 && head[1] == -2)
    code = "UTF-16";
6) Judge according to the Unicode encoding rules: if the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode:
if (head[0] == -2 && head[1] == -1)
    code = "Unicode";
7) Judge according to the Unicode encoding rules: if the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8:
if (head[0] == -17 && head[1] == -69 && head[2] == -65)
    code = "UTF-8";
In step 6), if the encoding format is Unicode, the encoding format returned for the file is UTF-16; the encoding format finally returned for the file is therefore one of GBK, UTF-8, and UTF-16.
In this embodiment, the code for determining the encoding format of the file is implemented as follows:
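As a minimal sketch of steps 4) to 7), and not the original listing, the rules can be combined into a single method. The class name `FileEncodingSniffer` and the method `detect` are hypothetical; the length checks are an addition of this sketch so that inputs shorter than the byte order mark do not cause an exception.

```java
final class FileEncodingSniffer {
    /**
     * Maps the leading bytes of a file to an encoding name using the
     * byte order mark rules of this embodiment. Returns "GBK" when no
     * rule matches.
     */
    static String detect(byte[] head) {
        String code = "GBK"; // default from step 4)
        if (head.length >= 2 && head[0] == -1 && head[1] == -2) {
            code = "UTF-16"; // FF FE
        } else if (head.length >= 2 && head[0] == -2 && head[1] == -1) {
            code = "Unicode"; // FE FF
        } else if (head.length >= 3 && head[0] == -17 && head[1] == -69 && head[2] == -65) {
            code = "UTF-8"; // EF BB BF
        }
        // "Unicode" is reported to the caller as UTF-16.
        return code.equals("Unicode") ? "UTF-16" : code;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {-1, -2, 0x25, 0x4E}));     // UTF-16
        System.out.println(detect(new byte[] {-2, -1, 0x4E, 0x25}));     // UTF-16
        System.out.println(detect(new byte[] {-17, -69, -65, 0x41}));    // UTF-8
        System.out.println(detect(new byte[] {0x41, 0x42, 0x43, 0x44})); // GBK
    }
}
```

A file without a byte order mark always falls through to the GBK default in this sketch, which mirrors the steps above rather than verifying that the content really is GBK.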
Embodiment two:
Determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream is implemented as follows:
1) Determine whether the byte stream is GB2312-encoded: obtain the first two bytes of the byte[] byte stream and AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf7, and the second new byte satisfies iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte array is GB2312.
2) Determine whether the byte stream is GBK-encoded: obtain the first two bytes of the byte[] byte stream and AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf7, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte array is GBK.
3) Determine whether the byte stream is BIG5-encoded: obtain the first two bytes of the byte[] byte stream and AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf9, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte array is BIG5.
4) Determine whether the byte stream is UTF-8-encoded: obtain the first 4 bytes of the byte[] byte stream and AND the 1st byte head and the 2nd to 4th bytes tail2, tail3 and tail4 with 0xff to obtain iHead, iTail2, iTail3 and iTail4; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf9, the second new byte satisfies (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), the third new byte satisfies (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and the fourth new byte satisfies (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte array is UTF-8.
In this embodiment, the code for determining the encoding format of the byte stream is implemented as follows:
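The range checks of steps 1) to 4) can be sketched as one predicate per rule. This is not the original listing: the class `ByteRangeRules` and its method names are hypothetical, and the ranges are transcribed as stated in this embodiment. The ranges overlap, so one byte pair can satisfy several predicates, and the four-byte rule is the rule as written here rather than a general UTF-8 validator.

```java
final class ByteRangeRules {
    private static boolean between(int v, int lo, int hi) {
        return v >= lo && v <= hi;
    }

    /** Step 1): lead byte 0xa1..0xf7, trail byte 0xa1..0xfe. */
    static boolean isGb2312(byte head, byte tail) {
        int iHead = head & 0xff;
        int iTail = tail & 0xff;
        return between(iHead, 0xa1, 0xf7) && between(iTail, 0xa1, 0xfe);
    }

    /** Step 2): lead byte 0x81..0xf7, trail byte 0x40..0x7e or 0x80..0xfe. */
    static boolean isGbk(byte head, byte tail) {
        int iHead = head & 0xff;
        int iTail = tail & 0xff;
        return between(iHead, 0x81, 0xf7)
                && (between(iTail, 0x40, 0x7e) || between(iTail, 0x80, 0xfe));
    }

    /** Step 3): lead byte 0xa1..0xf9, trail byte 0x40..0x7e or 0xa1..0xfe. */
    static boolean isBig5(byte head, byte tail) {
        int iHead = head & 0xff;
        int iTail = tail & 0xff;
        return between(iHead, 0xa1, 0xf9)
                && (between(iTail, 0x40, 0x7e) || between(iTail, 0xa1, 0xfe));
    }

    /** Step 4): the four-byte rule that the text labels UTF-8. */
    static boolean matchesFourByteRule(byte[] b) {
        if (b.length < 4 || !between(b[0] & 0xff, 0x81, 0xf9)) {
            return false;
        }
        for (int i = 1; i < 4; i++) {
            int t = b[i] & 0xff;
            if (!(between(t, 0x40, 0x7e) || between(t, 0xa1, 0xfe))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isGb2312((byte) 0xD1, (byte) 0xCF)); // true
        System.out.println(isGb2312((byte) 0x81, (byte) 0x40)); // false
        System.out.println(isGbk((byte) 0x81, (byte) 0x40));    // true
        System.out.println(isBig5((byte) 0xA4, (byte) 0x40));   // true
        System.out.println(matchesFourByteRule(
                new byte[] {(byte) 0xE4, (byte) 0xB8, (byte) 0xA5, 0x41})); // true
    }
}
```

Because of the overlap, the order in which a caller applies these predicates decides which encoding name is reported.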
Embodiment three:
The method of the invention for determining the encoding format of Java files and byte streams also comprises determining the encoding format of a string by determining the encoding format of a byte stream, which specifically comprises: assuming that the original string, whose encoding format is unknown, has a certain encoding format; converting the original string into a byte stream in the assumed encoding format, then converting that byte stream back into a new string using the assumed encoding format, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the assumed encoding format.
In a specific implementation, determining the encoding of a string builds on the rules for determining the encoding of a byte stream, and works as follows:
1) First assume that the encoding format of the original string A, whose encoding format is unknown, is GB2312.
2) Convert the original string A into a byte[] byte stream under the assumed encoding format GB2312, then convert that byte[] byte stream back into a string B using GB2312, and compare the original string A with string B; if the two strings are identical, the encoding format of the original string A is GB2312.
3) Next assume that the encoding format of the original string A is ISO-8859-1.
4) Convert the original string A into a byte[] byte stream under the assumed encoding format ISO-8859-1, then convert that byte[] byte stream back into a string B using ISO-8859-1, and compare the original string A with string B; if the two strings are identical, the encoding format of the original string A is ISO-8859-1.
5) Next assume that the encoding format of the original string A is UTF-8.
6) Convert the original string A into a byte[] byte stream under the assumed encoding format UTF-8, then convert that byte[] byte stream back into a string B using UTF-8, and compare the original string A with string B; if the two strings are identical, the encoding format of the original string A is UTF-8.
7) Next assume that the encoding format of the original string A is GBK.
8) Convert the original string A into a byte[] byte stream under the assumed encoding format GBK, then convert that byte[] byte stream back into a string B using GBK, and compare the original string A with string B; if the two strings are identical, the encoding format of the original string A is GBK.
9) Next assume that the encoding format of the original string A is BIG5.
10) Convert the original string A into a byte[] byte stream under the assumed encoding format BIG5, then convert that byte[] byte stream back into a string B using BIG5, and compare the original string A with string B; if the two strings are identical, the encoding format of the original string A is BIG5.
In this embodiment, the code for determining the encoding format of the string is implemented as follows:
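The round-trip comparison of steps 1) to 10) can be sketched as a loop over the candidate encodings in the order the steps list them. This is not the original listing: the class `StringEncodingGuess` and the methods `roundTrips` and `guess` are hypothetical names, and the sketch assumes the running JDK provides the GB2312, GBK and Big5 charsets.

```java
import java.nio.charset.Charset;

final class StringEncodingGuess {
    // Candidate order taken from steps 1) to 10).
    private static final String[] CANDIDATES = {
        "GB2312", "ISO-8859-1", "UTF-8", "GBK", "BIG5"
    };

    /** Returns true if encoding and then decoding with the charset reproduces the string. */
    static boolean roundTrips(String original, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return original.equals(new String(original.getBytes(cs), cs));
    }

    /** Returns the first candidate that round-trips, or null if none does. */
    static String guess(String original) {
        for (String name : CANDIDATES) {
            if (roundTrips(original, name)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(guess("abc"));    // GB2312
        System.out.println(guess("\u4E25")); // GB2312 (the character 严)
        System.out.println(guess("\uD55C")); // UTF-8 (a Hangul syllable)
    }
}
```

In this sketch the comparison succeeds whenever every character of the string can be represented in the candidate charset, so the candidate order decides the result; since UTF-8 can represent every well-formed string, the candidates listed after it are rarely reached.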
It should be noted that although the operations of the method of the invention are described in a particular order in the drawings, this does not require or imply that the operations must be performed in that order, or that all of the operations shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be divided into multiple steps.
Having described the method of the exemplary embodiments of the invention, the device for determining the encoding format of Java files and byte streams according to an exemplary embodiment of the invention is introduced next with reference to Fig. 2. For the implementation of the device, reference may be made to the implementation of the method above; repeated details are omitted. The terms "module" and "unit" used below may refer to software and/or hardware that implements a predetermined function. Although the modules described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 2 is a structural diagram of the device for determining the encoding format of Java files and byte streams according to an embodiment of the invention. As shown in Fig. 2, the device comprises:
A byte reading module 101, for reading the first four bytes of a file or byte stream;
A judgment module 102, for determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
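The two-module structure can be sketched in Java as two small interfaces composed by a device class. All names here (`EncodingJudgeDevice`, `ByteReadModule`, `JudgeModule`, `determine`) are hypothetical, the input is modeled as a byte array for brevity, and the rule used in `main` is a placeholder for illustration rather than the full rule set of the judgment module.

```java
import java.util.Arrays;

final class EncodingJudgeDevice {
    /** Corresponds to byte reading module 101: supplies the leading bytes of the input. */
    interface ByteReadModule {
        byte[] readHead(byte[] source);
    }

    /** Corresponds to judgment module 102: maps the leading bytes to an encoding name. */
    interface JudgeModule {
        String judge(byte[] head);
    }

    private final ByteReadModule reader;
    private final JudgeModule judge;

    EncodingJudgeDevice(ByteReadModule reader, JudgeModule judge) {
        this.reader = reader;
        this.judge = judge;
    }

    /** Runs the reading module and passes its result to the judgment module. */
    String determine(byte[] source) {
        return judge.judge(reader.readHead(source));
    }

    public static void main(String[] args) {
        ByteReadModule reader = src -> Arrays.copyOf(src, Math.min(4, src.length));
        // Placeholder rule: only the UTF-8 byte order mark is recognized here.
        JudgeModule judge = h ->
                (h.length >= 3 && h[0] == -17 && h[1] == -69 && h[2] == -65) ? "UTF-8" : "GBK";
        EncodingJudgeDevice device = new EncodingJudgeDevice(reader, judge);
        System.out.println(device.determine(new byte[] {-17, -69, -65, 0x41})); // UTF-8
    }
}
```

Keeping the two modules behind separate interfaces lets the reading step be swapped between files and in-memory byte streams without touching the rules.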
In this embodiment, the judgment module 102 determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically comprises:
If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode; if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.
In this embodiment, the judgment module 102 determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically comprises:
ANDing the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf7, and the second new byte satisfies iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
ANDing the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf7, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is GBK;
ANDing the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf9, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
ANDing the 1st byte head and the 2nd to 4th bytes tail2, tail3 and tail4 with 0xff to obtain iHead, iTail2, iTail3 and iTail4; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf9, the second new byte satisfies (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), the third new byte satisfies (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and the fourth new byte satisfies (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte stream is UTF-8.
In this embodiment, the judgment module 102 is also used to determine the encoding format of a string by determining the encoding format of a byte stream, which specifically comprises:
Assuming that the original string, whose encoding format is unknown, has a certain encoding format; converting the original string into a byte stream in the assumed encoding format, then converting that byte stream back into a new string using the assumed encoding format, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the assumed encoding format.
Furthermore, although several units of the device for determining the encoding format of Java files and byte streams are mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of the invention, the features and functions of two or more of the units described above may be embodied in one unit. Likewise, the features and functions of one unit described above may be further divided among multiple units.
Because the method and device of the embodiments of the invention determine the encoding format of files and byte streams according to Unicode encoding rules, they have the advantages of a small workload, a concise program, and accurate determination.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system, or a computer program product. The invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Specific embodiments have been used herein to explain the principles and implementations of the invention; the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A method for determining the encoding format of Java files and byte streams, characterized by comprising:
Reading the first four bytes of a file or byte stream;
Determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
2. The method for determining the encoding format of Java files and byte streams according to claim 1, characterized in that determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically comprises:
If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.
3. The method for determining the encoding format of Java files and byte streams according to claim 2, characterized in that if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.
4. The method for determining the encoding format of JAVA files and byte streams according to claim 1, characterized in that determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically comprises:
The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead>=0xa1 and iHead<=0xf7, and the second new byte satisfies iTail>=0xa1 and iTail<=0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead>=0x81 and iHead<=0xf7, and the second new byte satisfies (iTail>=0x40 and iTail<=0x7e) or (iTail>=0x80 and iTail<=0xfe), then according to the Unicode encoding rules the byte stream is GBK;
The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead>=0xa1 and iHead<=0xf9, and the second new byte satisfies (iTail>=0x40 and iTail<=0x7e) or (iTail>=0xa1 and iTail<=0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
The 1st byte head and the 2nd byte tail2 are each ANDed with 0xff to obtain iHead and iTail2; if the first new byte satisfies iHead>=0x81 and iHead<=0xf9, the second new byte satisfies (iTail2>=0x40 and iTail2<=0x7e) or (iTail2>=0xa1 and iTail2<=0xfe), the 3rd new byte satisfies (iTail3>=0x40 and iTail3<=0x7e) or (iTail3>=0xa1 and iTail3<=0xfe), and the 4th new byte satisfies (iTail4>=0x40 and iTail4<=0x7e) or (iTail4>=0xa1 and iTail4<=0xfe), then according to the Unicode encoding rules the byte stream is UTF8.
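The range checks above can be sketched in Java as four separate predicates over the masked values. The class name `RangeRules`, the method names, and the decision to return every matching label rather than only the first are assumptions for this example; the claim does not state an evaluation order. The sketch also assumes that the 3rd and 4th bytes are masked with 0xff in the same way as the first two, and that the 4th byte is tested against the same range as the 2nd and 3rd.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names and the "return all matches" design are illustrative only.
final class RangeRules {

    private static boolean in(int v, int lo, int hi) {
        return v >= lo && v <= hi;
    }

    // Trailing-byte set used by the BIG5 rule and by the four-byte rule.
    private static boolean tailA(int t) {
        return in(t, 0x40, 0x7e) || in(t, 0xa1, 0xfe);
    }

    static boolean isGb2312(int iHead, int iTail) {
        return in(iHead, 0xa1, 0xf7) && in(iTail, 0xa1, 0xfe);
    }

    static boolean isGbk(int iHead, int iTail) {
        return in(iHead, 0x81, 0xf7) && (in(iTail, 0x40, 0x7e) || in(iTail, 0x80, 0xfe));
    }

    static boolean isBig5(int iHead, int iTail) {
        return in(iHead, 0xa1, 0xf9) && tailA(iTail);
    }

    // The rule the claim labels "UTF8", applied to all four leading bytes.
    static boolean isFourByteRule(int iHead, int iTail2, int iTail3, int iTail4) {
        return in(iHead, 0x81, 0xf9) && tailA(iTail2) && tailA(iTail3) && tailA(iTail4);
    }

    /** Returns the label of every rule whose ranges the leading bytes satisfy. */
    static List<String> candidates(byte[] b) {
        List<String> out = new ArrayList<>();
        if (b.length < 2) {
            return out;
        }
        int iHead = b[0] & 0xff; // mask to get the unsigned value 0..255
        int iTail = b[1] & 0xff;
        if (isGb2312(iHead, iTail)) out.add("GB2312");
        if (isGbk(iHead, iTail)) out.add("GBK");
        if (isBig5(iHead, iTail)) out.add("BIG5");
        if (b.length >= 4 && isFourByteRule(iHead, iTail, b[2] & 0xff, b[3] & 0xff)) {
            out.add("UTF8");
        }
        return out;
    }
}
```

Returning all matches makes the overlap between the ranges visible: any byte pair that satisfies the GB2312 ranges also satisfies the GBK ranges, so a caller that wants a single answer has to choose a priority order itself.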
5. The method for determining the encoding format of JAVA files and byte streams according to claim 1, characterized by further comprising:
Determining the encoding format of a character string by determining the encoding format of a byte stream, which specifically comprises:
Assuming that the encoding format of the original character string, whose encoding format is unknown, is a certain encoding format;
Converting the original character string into a byte stream in the assumed encoding format, then converting that byte stream back into a new character string, and comparing the original character string with the new character string; if the two character strings are identical, the encoding format of the original character string is the assumed encoding format.
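The round trip described above can be sketched with the standard `String.getBytes(Charset)` and `new String(byte[], Charset)` calls. The class name `RoundTrip` and the method name `survives` are assumptions for this example. Note that a passing round trip only shows that every character of the string can be represented in the candidate charset, so several charsets may pass for the same string.

```java
import java.nio.charset.Charset;

// Hypothetical helper: names are illustrative only.
final class RoundTrip {

    /**
     * Encodes the string with the candidate charset, decodes the bytes again
     * with the same charset, and reports whether the result equals the original.
     */
    static boolean survives(String original, String charsetName) {
        Charset candidate = Charset.forName(charsetName);
        byte[] encoded = original.getBytes(candidate); // unmappable characters become a replacement byte
        String rebuilt = new String(encoded, candidate);
        return original.equals(rebuilt);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        System.out.println(survives(text, "UTF-8"));
        System.out.println(survives(text, "ISO-8859-1"));
    }
}
```

A caller could try a list of candidate charsets in a preferred order and accept the first one for which `survives` returns `true`.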
6. A device for determining the encoding format of JAVA files and byte streams, characterized by comprising:
A byte reading module, configured to read the first four bytes of a file or byte stream;
A judging module, configured to determine the encoding format of the file or byte stream according to the Unicode encoding rules and the first four bytes of the file or byte stream.
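One way to picture the two modules is a reader that looks at the first four bytes of a stream and a judge that maps those bytes to a label. In the Java sketch below, the names `EncodingDevice`, `peekHead`, `Judge` and `run` are hypothetical, and the use of `PushbackInputStream` to return the bytes to the stream after inspection is a design choice of this example rather than something the claims require.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.util.Arrays;

// Hypothetical structure: all names here are illustrative only.
final class EncodingDevice {

    /** Judging module: maps the leading bytes to an encoding label. */
    interface Judge {
        String judge(byte[] head);
    }

    /**
     * Byte reading module: reads up to four bytes and pushes them back,
     * so later readers still see the stream from its start.
     * The stream must have a pushback buffer of at least four bytes.
     */
    static byte[] peekHead(PushbackInputStream in) throws IOException {
        byte[] buf = new byte[4];
        int n = 0;
        while (n < 4) {
            int r = in.read(buf, n, 4 - n);
            if (r < 0) {
                break; // end of stream before four bytes were available
            }
            n += r;
        }
        if (n > 0) {
            in.unread(buf, 0, n);
        }
        return Arrays.copyOf(buf, n);
    }

    /** Wires the two modules together. */
    static String run(PushbackInputStream in, Judge judge) throws IOException {
        return judge.judge(peekHead(in));
    }
}
```

For example, a caller could wrap a file stream as `new PushbackInputStream(stream, 4)`, pass it to `run` with a lambda implementing `Judge`, and then continue reading the same stream with the detected charset.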
7. The device for determining the encoding format of JAVA files and byte streams according to claim 6, characterized in that the judging module determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically comprises:
If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.
8. The device for determining the encoding format of JAVA files and byte streams according to claim 7, characterized in that, if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.
9. The device for determining the encoding format of JAVA files and byte streams according to claim 6, characterized in that the judging module determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically comprises:
The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead>=0xa1 and iHead<=0xf7, and the second new byte satisfies iTail>=0xa1 and iTail<=0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead>=0x81 and iHead<=0xf7, and the second new byte satisfies (iTail>=0x40 and iTail<=0x7e) or (iTail>=0x80 and iTail<=0xfe), then according to the Unicode encoding rules the byte stream is GBK;
The 1st byte head and the 2nd byte tail are each ANDed with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead>=0xa1 and iHead<=0xf9, and the second new byte satisfies (iTail>=0x40 and iTail<=0x7e) or (iTail>=0xa1 and iTail<=0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
The 1st byte head and the 2nd byte tail2 are each ANDed with 0xff to obtain iHead and iTail2; if the first new byte satisfies iHead>=0x81 and iHead<=0xf9, the second new byte satisfies (iTail2>=0x40 and iTail2<=0x7e) or (iTail2>=0xa1 and iTail2<=0xfe), the 3rd new byte satisfies (iTail3>=0x40 and iTail3<=0x7e) or (iTail3>=0xa1 and iTail3<=0xfe), and the 4th new byte satisfies (iTail4>=0x40 and iTail4<=0x7e) or (iTail4>=0xa1 and iTail4<=0xfe), then according to the Unicode encoding rules the byte stream is UTF8.
10. The device for determining the encoding format of JAVA files and byte streams according to claim 6, characterized in that the judging module is further configured to determine the encoding format of a character string by determining the encoding format of a byte stream, which specifically comprises:
Assuming that the encoding format of the original character string, whose encoding format is unknown, is a certain encoding format;
Converting the original character string into a byte stream in the assumed encoding format, then converting that byte stream back into a new character string, and comparing the original character string with the new character string; if the two character strings are identical, the encoding format of the original character string is the assumed encoding format.
CN201611041686.4A 2016-11-22 2016-11-22 The determination methods and device of the coded format of a kind of JAVA files and byte stream Pending CN106775909A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611041686.4A CN106775909A (en) 2016-11-22 2016-11-22 The determination methods and device of the coded format of a kind of JAVA files and byte stream

Publications (1)

Publication Number Publication Date
CN106775909A true CN106775909A (en) 2017-05-31

Family

ID=58975167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611041686.4A Pending CN106775909A (en) 2016-11-22 2016-11-22 The determination methods and device of the coded format of a kind of JAVA files and byte stream

Country Status (1)

Country Link
CN (1) CN106775909A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567293A (en) * 2010-12-13 2012-07-11 汉王科技股份有限公司 Coded format detection method and coded format detection device for text files
CN104156373A (en) * 2013-05-15 2014-11-19 宏碁股份有限公司 Coding format detection method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
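CAESAR525: "判断是不是GB2312" [Determining whether data is GB2312], 《HTTPS://BBS.CSDN.NET./TOPICS/19172569》 *
HOORAY520: "判断字符串编码" [Determining the encoding of a string], 《HTTPS://BLOG.CSDN.NET/HOORAY520/ARTICLE/DETAILS/83916560》 *
OICQXIESIDILIERIC: "Java正确判别出文件的字符集" [Correctly identifying a file's character set in Java], 《HTTPS://BLOG.CSDN.NET/OICQXIESIDILIERIC/ARTICLE/DETAILS/8464630》 *
风叙: "对各字符集编码范围的总结" [A summary of the encoding ranges of various character sets], 《HTTPS://WWW.CNBLOGS.COM/JUNEAPPLE/ARTICLES/1768983.HTML》 *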

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943761A (en) * 2017-11-14 2018-04-20 北京思特奇信息技术股份有限公司 A kind of method of calibration and system of TXT document codings character set
CN109542507A (en) * 2018-10-26 2019-03-29 深圳点猫科技有限公司 A kind of GBK code processing method and electronic equipment based on educational system
CN110609684A (en) * 2019-09-18 2019-12-24 四川长虹电器股份有限公司 Method for converting video into character animation under Spring boot frame
CN110609684B (en) * 2019-09-18 2022-08-19 四川长虹电器股份有限公司 Method for converting video into character animation under Spring boot frame
CN111368508A (en) * 2020-03-03 2020-07-03 深信服科技股份有限公司 Data processing method, device, equipment and medium
CN111368508B (en) * 2020-03-03 2024-04-09 深信服科技股份有限公司 Data processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN106775909A (en) The determination methods and device of the coded format of a kind of JAVA files and byte stream
CN106293677B (en) A kind of code conversion method and device
CN105247472B (en) Processor, method, system and instruction for the variable-length codes point transcoding to Unicode character
CN111144128B (en) Semantic analysis method and device
CN109460219B (en) Method for quickly serializing interface control file
CN113778449B (en) Avionic interface data adaptation conversion system
CN104050077A (en) Fusible instructions and logic to provide or-test and and-test functionality using multiple test sources
CN107566090B (en) Fixed-length/variable-length text message processing method and device
CN109933602A (en) A kind of conversion method and device of natural language and structured query language
CN103827815A (en) Instruction and logic to provide vector loads and stores with strides and masking functionality
CN106940743A (en) A kind of ventilation shaft mechanical analyzing method and system
CN115455382A (en) Semantic comparison method and device for binary function codes
CN112364631A (en) Chinese grammar error detection method and system based on hierarchical multitask learning
CN104572102A (en) Method for solving Chinese messy codes in JAVA
CN112860584B (en) Workflow model-based testing method and device
CN114138243A (en) Function calling method, device, equipment and storage medium based on development platform
CN104021147B (en) A kind of code stream analyzing method and device
CN105653506B (en) It is a kind of based on character code conversion GPU in text-processing method and device
CN111124541B (en) Configuration file generation method, device, equipment and medium
JP2011090526A (en) Compression program, method, and device, and decompression program, method, and device
CN104462157A (en) Method and device for secondary structuralizing of text data
CN105793842B (en) Conversion method and device between serialized message
CN116644180A (en) Training method and training system for text matching model and text label determining method
CN109460236A (en) Program version building and inspection method and system
CN114297408A (en) Relation triple extraction method based on cascade binary labeling framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170531
