CN106775909A - Method and device for determining the encoding format of Java files and byte streams - Google Patents
- Publication number
- CN106775909A CN201611041686.4A
- Authority
- CN
- China
- Prior art keywords
- byte
- coded format
- itail
- ihead
- stream
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method and device for determining the encoding format of Java files and byte streams. The method includes: reading the first four bytes of a file or byte stream; and determining the encoding format of the file or byte stream according to Unicode encoding rules and those first four bytes. Because the method and device of the embodiments determine the encoding format of files and byte streams according to Unicode encoding rules, they have the advantages of a small workload, concise programs, and accurate determination.
Description
Technical field
The present invention relates to the field of data processing, in particular to a method for determining an encoding format, and specifically to a method and device for determining the encoding format of Java files and byte streams.
Background
This section is intended to provide background or context for the embodiments of the invention recited in the claims. The description here is not admitted to be prior art merely because it is included in this section.
Strings in memory are not limited to those loaded directly from class code. Some are read from text files, some are read from databases, and some may be built from byte arrays. These sources are generally not Unicode-encoded, for the simple reason of storage optimization. Various encoding problems therefore have to be handled: before processing, the encoding of the "source" must be made clear, and the data must then be read into memory correctly using the specified encoding.
The currently popular method of determining the encoding format of a file or byte stream is to check the encoding range of every byte of the file stream or byte stream. This involves a heavy workload and complex programs, and is error-prone.
Summary of the invention
The embodiments of the present invention use Unicode encoding rules to determine the encoding format of files and byte streams, in order to solve the problems that existing determination methods involve a heavy workload and are error-prone.
To achieve the above object, an embodiment of the present invention provides a method for determining the encoding format of Java files and byte streams, including: reading the first four bytes of a file or byte stream; and determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
Further, in one embodiment, determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically includes:
If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.
Further, in one embodiment, if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.
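The decimal values in these rules are the signed representation that Java gives to bytes at or above 0x80. The following sketch prints the unsigned hexadecimal value behind each one; the class name SignedBomBytes and the helper unsigned are illustrative only and do not appear in the patent.

```java
public class SignedBomBytes {
    // Converts a signed Java byte to its unsigned value in the range 0..255.
    static int unsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        // The signed values used in the file rules above.
        byte[] values = {-1, -2, -17, -69, -65};
        for (byte b : values) {
            System.out.printf("%d -> 0x%02X%n", b, unsigned(b));
        }
    }
}
```

Run as written, it shows that -1 and -2 correspond to FF and FE, and that -17, -69 and -65 correspond to EF, BB and BF.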
Further, in one embodiment, determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically includes:
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf7, and the second new byte satisfies iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf7, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is GBK;
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf9, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
AND the 1st byte head and the 2nd byte tail2 with 0xff to obtain iHead and iTail2; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf9, the second new byte satisfies (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), the third new byte satisfies (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and the fourth new byte satisfies (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte stream is UTF8.
Further, in one embodiment, the method also includes:
Determining the encoding format of a string by determining the encoding format of a byte stream, which specifically includes:
Assuming that the encoding format of the original string, whose encoding format is unknown, is a certain encoding format;
Converting the original string into a byte stream in the assumed encoding format, then converting that byte stream back into a new string using the assumed encoding format, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the assumed encoding format.
To achieve the above object, an embodiment of the present invention also provides a device for determining the encoding format of Java files and byte streams, including: a byte reading module, for reading the first four bytes of a file or byte stream; and a judging module, for determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
Further, in one embodiment, the judging module determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically includes:
If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.
Further, in one embodiment, if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.
Further, in one embodiment, the judging module determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically includes:
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf7, and the second new byte satisfies iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf7, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is GBK;
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf9, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
AND the 1st byte head and the 2nd byte tail2 with 0xff to obtain iHead and iTail2; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf9, the second new byte satisfies (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), the third new byte satisfies (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and the fourth new byte satisfies (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte stream is UTF8.
Further, in one embodiment, the judging module is also used to determine the encoding format of a string by determining the encoding format of a byte stream, which specifically includes:
Assuming that the encoding format of the original string, whose encoding format is unknown, is a certain encoding format;
Converting the original string into a byte stream in the assumed encoding format, then converting that byte stream back into a new string using the assumed encoding format, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the assumed encoding format.
The method and device for determining the encoding format of Java files and byte streams according to the embodiments of the present invention determine the encoding format of files and byte streams according to Unicode encoding rules, and have the advantages of a small workload, concise programs, and accurate determination.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the method for determining the encoding format of Java files and byte streams according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the device for determining the encoding format of Java files and byte streams according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Those skilled in the art know that embodiments of the present invention may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may take the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.
The principle and spirit of the invention are explained in detail below with reference to several representative embodiments.
The principle of the invention is to determine the encoding format of files and byte streams according to Unicode encoding rules. Compared with the existing approach of checking the encoding range of every byte of the file stream or byte stream, the workload is small and the determination is accurate.
Fig. 1 is a flow chart of the method for determining the encoding format of Java files and byte streams according to an embodiment of the present invention. As shown in the figure, the method includes: step S101, reading the first four bytes of a file or byte stream; and step S102, determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
First, in the present invention, the Unicode encoding rules are as follows, taking a file that contains the single character 严 (U+4E25) as the example:
1) ANSI: the file is encoded as the two bytes "D1 CF", which is exactly the GB2312 encoding of 严; this also implies that GB2312 is stored big-endian.
2) Unicode: the encoding is the four bytes "FF FE 25 4E", where "FF FE" indicates little-endian storage and the actual code is 4E25.
3) Unicode big endian: the encoding is the four bytes "FE FF 4E 25", where "FE FF" indicates big-endian storage.
4) UTF-8: the encoding is the six bytes "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate that this is UTF-8, and the last three bytes "E4 B8 A5" are the actual encoding of 严, stored in the same order as the code.
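The byte sequences listed above can be reproduced, without the leading byte order marks, using the standard Java charset API. This is only an illustrative sketch: the class name YanEncodings and the helper hex are assumptions, and it presumes that the runtime ships the GB2312 charset, which is not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class YanEncodings {
    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String yan = "\u4e25"; // the character U+4E25 used in the rules above
        // The UTF-16LE and UTF-16BE charsets do not emit a byte order mark.
        System.out.println("GB2312   : " + hex(yan.getBytes(Charset.forName("GB2312"))));
        System.out.println("UTF-16LE : " + hex(yan.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-16BE : " + hex(yan.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-8    : " + hex(yan.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The marks "FF FE", "FE FF" and "EF BB BF" in the rules are prefixes written in front of these payload bytes by the program that saved the file.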
Embodiment one:
Determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file includes the following specific steps:
1) Obtain the encoding format of the current operating system's runtime environment:
String dc = Charset.defaultCharset().name();
2) Convert the input stream into a Unicode input stream:
UnicodeInputStream uin = new UnicodeInputStream(in, dc);
3) Read the first four bytes of the file stream:
byte[] head = new byte[4];
in.read(head);
4) Define the encoding format as GBK by default:
String code = "GBK";
5) Judge according to the Unicode encoding rules: if the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16:
if (head[0] == -1 && head[1] == -2)
    code = "UTF-16";
6) Judge according to the Unicode encoding rules: if the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode:
if (head[0] == -2 && head[1] == -1)
    code = "Unicode";
7) Judge according to the Unicode encoding rules: if the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8:
if (head[0] == -17 && head[1] == -69 && head[2] == -65)
    code = "UTF-8";
In step 6), if the encoding format is Unicode, the encoding format returned for the file is UTF-16; the encoding format finally returned for the file is therefore one of GBK, UTF-8, and UTF-16.
In this embodiment, the code that implements the determination of the file's encoding format is as follows:
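The listing itself is not reproduced in this text. The following is a minimal sketch assembled from steps 3) to 7) above, not the patent's own code. The class name BomSniffer and the methods fromHead and detect are hypothetical, and the UnicodeInputStream wrapper from step 2) is left out because it is not a JDK class.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BomSniffer {
    // Maps the first n bytes of a file to an encoding name.
    static String fromHead(byte[] head, int n) {
        String code = "GBK"; // default when no byte order mark is recognised
        if (n >= 2 && head[0] == -1 && head[1] == -2) {
            code = "UTF-16"; // FF FE
        } else if (n >= 2 && head[0] == -2 && head[1] == -1) {
            code = "UTF-16"; // FE FF, called "Unicode" in the text and returned as UTF-16
        } else if (n >= 3 && head[0] == -17 && head[1] == -69 && head[2] == -65) {
            code = "UTF-8"; // EF BB BF
        }
        return code;
    }

    // Reads up to four bytes from the stream and classifies them.
    static String detect(InputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream before four bytes were available
            }
            n += r;
        }
        return fromHead(head, n);
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(detect(new ByteArrayInputStream(utf8WithBom))); // prints UTF-8
    }
}
```

The read loop is a design choice of this sketch: a single call to read may return fewer than four bytes, so the loop keeps reading until four bytes arrive or the stream ends.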
Embodiment two:
Determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream is implemented as follows:
1) Determine whether the byte stream is GB2312-encoded: take the first two bytes of the byte[] byte stream and AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf7, and the second new byte satisfies iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte array is GB2312.
2) Determine whether the byte stream is GBK-encoded: take the first two bytes of the byte[] byte stream and AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf7, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte array is GBK.
3) Determine whether the byte stream is BIG5-encoded: take the first two bytes of the byte[] byte stream and AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf9, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte array is BIG5.
4) Determine whether the byte stream is UTF-8-encoded: take the first four bytes of the byte[] byte stream and AND the 1st byte head and the 2nd byte tail2 with 0xff to obtain iHead and iTail2; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf9, the second new byte satisfies (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), the third new byte satisfies (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and the fourth new byte satisfies (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte array is UTF8.
In this embodiment, the code that implements the determination of the byte stream's encoding format is as follows:
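The listing is not reproduced in this text, so the two-byte range tests of items 1) to 3) are sketched below under stated assumptions. The class and method names are hypothetical, the four-byte UTF8 test of item 4) is omitted for brevity, and the sketch does not choose between formats when more than one test passes.

```java
public class LeadingBytePatterns {
    // True when lo <= v <= hi.
    static boolean inRange(int v, int lo, int hi) {
        return v >= lo && v <= hi;
    }

    // GB2312 test from item 1): head 0xa1..0xf7, tail 0xa1..0xfe.
    static boolean looksLikeGb2312(byte head, byte tail) {
        int iHead = head & 0xff; // AND with 0xff removes the sign
        int iTail = tail & 0xff;
        return inRange(iHead, 0xa1, 0xf7) && inRange(iTail, 0xa1, 0xfe);
    }

    // GBK test from item 2): head 0x81..0xf7, tail 0x40..0x7e or 0x80..0xfe.
    static boolean looksLikeGbk(byte head, byte tail) {
        int iHead = head & 0xff;
        int iTail = tail & 0xff;
        return inRange(iHead, 0x81, 0xf7)
                && (inRange(iTail, 0x40, 0x7e) || inRange(iTail, 0x80, 0xfe));
    }

    // BIG5 test from item 3): head 0xa1..0xf9, tail 0x40..0x7e or 0xa1..0xfe.
    static boolean looksLikeBig5(byte head, byte tail) {
        int iHead = head & 0xff;
        int iTail = tail & 0xff;
        return inRange(iHead, 0xa1, 0xf9)
                && (inRange(iTail, 0x40, 0x7e) || inRange(iTail, 0xa1, 0xfe));
    }

    public static void main(String[] args) {
        byte head = (byte) 0xD1;
        byte tail = (byte) 0xCF;
        System.out.println("GB2312: " + looksLikeGb2312(head, tail));
        System.out.println("GBK:    " + looksLikeGbk(head, tail));
        System.out.println("BIG5:   " + looksLikeBig5(head, tail));
    }
}
```

The three ranges overlap: the pair D1 CF used in main passes all three tests, so a caller has to decide in which order to apply them.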
Embodiment three:
The method of the invention for determining the encoding format of Java files and byte streams also includes determining the encoding format of a string by determining the encoding format of a byte stream, which specifically includes: assuming that the encoding format of the original string, whose encoding format is unknown, is a certain encoding format; converting the original string into a byte stream in the assumed encoding format, then converting that byte stream back into a new string using the assumed encoding format, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the assumed encoding format.
In a specific implementation, the rule for determining the encoding of a string builds on the rule for determining the encoding of a byte stream, and works as follows:
1) First assume that the encoding format of the original string A, whose encoding format is unknown, is GB2312.
2) Convert A into a byte[] byte stream in the assumed GB2312 encoding, then convert that byte[] byte stream back into a string B using GB2312, and compare A with B; if the two strings are identical, the encoding format of A is GB2312.
3) Next assume that the encoding format of the original string A is ISO-8859-1.
4) Convert A into a byte[] byte stream in the assumed ISO-8859-1 encoding, then convert that byte[] byte stream back into a string B using ISO-8859-1, and compare A with B; if the two strings are identical, the encoding format of A is ISO-8859-1.
5) Next assume that the encoding format of the original string A is UTF-8.
6) Convert A into a byte[] byte stream in the assumed UTF-8 encoding, then convert that byte[] byte stream back into a string B using UTF-8, and compare A with B; if the two strings are identical, the encoding format of A is UTF-8.
7) Next assume that the encoding format of the original string A is GBK.
8) Convert A into a byte[] byte stream in the assumed GBK encoding, then convert that byte[] byte stream back into a string B using GBK, and compare A with B; if the two strings are identical, the encoding format of A is GBK.
9) Next assume that the encoding format of the original string A is BIG5.
10) Convert A into a byte[] byte stream in the assumed BIG5 encoding, then convert that byte[] byte stream back into a string B using BIG5, and compare A with B; if the two strings are identical, the encoding format of A is BIG5.
See the code:
In this embodiment, the code that implements the determination of the string's encoding format is as follows:
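The listing is not reproduced in this text. The round-trip comparison of steps 1) to 10) can be sketched as follows; this is an illustration under assumptions rather than the patent's code. The names RoundTripCheck, survives and firstMatch are hypothetical, candidates that the runtime does not support are skipped, and the candidate order simply follows the steps above.

```java
import java.nio.charset.Charset;

public class RoundTripCheck {
    // Candidate encodings in the order of steps 1) to 10).
    static final String[] CANDIDATES = {"GB2312", "ISO-8859-1", "UTF-8", "GBK", "BIG5"};

    // True if encoding and then decoding with the charset reproduces the string.
    static boolean survives(String original, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        String roundTripped = new String(original.getBytes(cs), cs);
        return original.equals(roundTripped);
    }

    // Returns the first supported candidate that round-trips, or null if none does.
    static String firstMatch(String original) {
        for (String name : CANDIDATES) {
            if (Charset.isSupported(name) && survives(original, name)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\u4e25"));
        System.out.println(survives("\u4e25", "ISO-8859-1"));
    }
}
```

Because a string made only of ASCII characters survives the round trip in every candidate, the order of the candidates decides the answer for such input.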
It should be noted that although the operations of the method are described in a particular order in the drawings, this does not require or imply that the operations must be performed in that order, or that all of the illustrated operations must be performed, to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps.
Having described the method of the exemplary embodiment of the invention, the device for determining the encoding format of Java files and byte streams according to the exemplary embodiment is introduced next with reference to Fig. 2. For the implementation of the device, refer to the implementation of the method above; repeated parts are not described again. The terms "module" and "unit" used below may be software and/or hardware that implements a predetermined function. Although the modules described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 2 is a structural diagram of the device for determining the encoding format of Java files and byte streams according to an embodiment of the present invention. As shown in Fig. 2, the device includes:
A byte reading module 101, for reading the first four bytes of a file or byte stream;
A judging module 102, for determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
In this embodiment, the judging module 102 determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically includes:
If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode; if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.
In this embodiment, the judging module 102 determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically includes:
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf7, and the second new byte satisfies iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf7, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is GBK;
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf9, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
AND the 1st byte head and the 2nd byte tail2 with 0xff to obtain iHead and iTail2; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf9, the second new byte satisfies (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), the third new byte satisfies (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and the fourth new byte satisfies (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte stream is UTF8.
In this embodiment, the judging module 102 is also used to determine the encoding format of a string by determining the encoding format of a byte stream, which specifically includes:
Assuming that the encoding format of the original string, whose encoding format is unknown, is a certain encoding format; converting the original string into a byte stream in the assumed encoding format, then converting that byte stream back into a new string using the assumed encoding format, and comparing the original string with the new string; if the two strings are identical, the encoding format of the original string is the assumed encoding format.
Furthermore, although several units of the device for determining the encoding format of Java files and byte streams are mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of the invention, the features and functions of two or more of the units described above may be embodied in one unit. Likewise, the features and functions of one unit described above may be further divided among multiple units.
The method and device for determining the encoding format of Java files and byte streams according to the embodiments of the present invention determine the encoding format of files and byte streams according to Unicode encoding rules, and have the advantages of a small workload, concise programs, and accurate determination.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Specific embodiments have been used herein to explain the principle and implementation of the invention; the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (10)
1. A method for determining the encoding format of Java files and byte streams, characterised by including:
Reading the first four bytes of a file or byte stream;
Determining the encoding format of the file or byte stream according to Unicode encoding rules and the first four bytes of the file or byte stream.
2. The method for determining the encoding format of Java files and byte streams according to claim 1, characterised in that determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically includes:
If the 1st byte is -1 and the 2nd byte is -2, the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, the encoding format is Unicode;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, the encoding format is UTF-8.
3. The method for determining the encoding format of Java files and byte streams according to claim 2, characterised in that if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.
4. The method for determining the encoding format of Java files and byte streams according to claim 1, characterised in that determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically includes:
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf7, and the second new byte satisfies iTail >= 0xa1 and iTail <= 0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf7, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0x80 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is GBK;
AND the 1st byte head and the 2nd byte tail with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead >= 0xa1 and iHead <= 0xf9, and the second new byte satisfies (iTail >= 0x40 and iTail <= 0x7e) or (iTail >= 0xa1 and iTail <= 0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
AND the 1st byte head and the 2nd byte tail2 with 0xff to obtain iHead and iTail2; if the first new byte satisfies iHead >= 0x81 and iHead <= 0xf9, the second new byte satisfies (iTail2 >= 0x40 and iTail2 <= 0x7e) or (iTail2 >= 0xa1 and iTail2 <= 0xfe), the third new byte satisfies (iTail3 >= 0x40 and iTail3 <= 0x7e) or (iTail3 >= 0xa1 and iTail3 <= 0xfe), and the fourth new byte satisfies (iTail4 >= 0x40 and iTail4 <= 0x7e) or (iTail4 >= 0xa1 and iTail4 <= 0xfe), then according to the Unicode encoding rules the byte stream is UTF8.
5. The method for determining the encoding format of a JAVA file and byte stream according to claim 1, characterized in that it further comprises:
Determining the encoding format of a character string by determining the encoding format of a byte stream, which specifically comprises:
Assuming a candidate encoding format for the original character string, whose encoding format is unknown;
Converting the original character string into a byte stream in the assumed encoding format, converting that byte stream back into a new character string using the same encoding format, and comparing the original character string with the new character string; if the two character strings are identical, the encoding format of the original character string is the assumed encoding format.
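The round-trip comparison in claim 5 can be illustrated with standard `String` and `Charset` calls. This is a sketch under assumptions: the class `RoundTripCheck` and the method `survivesRoundTrip` are hypothetical names, and the candidate encoding is passed in by the caller rather than chosen by any rule from the patent.

```java
import java.nio.charset.Charset;

/** Illustrative sketch of the round-trip test in claim 5; all names are hypothetical. */
final class RoundTripCheck {
    private RoundTripCheck() {}

    /**
     * Encodes the string with the candidate charset, decodes the bytes with the
     * same charset, and reports whether the result equals the original string.
     */
    static boolean survivesRoundTrip(String original, Charset candidate) {
        byte[] encoded = original.getBytes(candidate);    // unmappable characters are replaced
        String decoded = new String(encoded, candidate);
        return original.equals(decoded);
    }
}
```

Because several charsets can represent the same text, more than one candidate may pass this test for a given string; plain ASCII text, for instance, round-trips through both US-ASCII and UTF-8.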
6. A device for determining the encoding format of a JAVA file and byte stream, characterized by comprising:
A byte reading module, configured to read the first four bytes of a file or byte stream;
A determination module, configured to determine the encoding format of the file or byte stream according to the Unicode encoding rules and the first four bytes of the file or byte stream.
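One possible shape for the byte reading module of claim 6 is sketched below. The names `HeadReader` and `readHead` are hypothetical, and returning a shorter array when the input holds fewer bytes than requested is an assumption; the claim does not address short inputs. The loop is needed because a single `InputStream.read` call may return fewer bytes than were asked for.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Illustrative sketch of a byte reading module; all names are hypothetical. */
final class HeadReader {
    private HeadReader() {}

    /**
     * Reads up to {@code limit} bytes from the start of the stream.
     * The returned array is shorter than {@code limit} if the stream ends early.
     */
    static byte[] readHead(InputStream in, int limit) throws IOException {
        byte[] buffer = new byte[limit];
        int filled = 0;
        while (filled < limit) {
            int n = in.read(buffer, filled, limit - filled);
            if (n < 0) {
                break;                 // end of stream reached
            }
            filled += n;
        }
        return Arrays.copyOf(buffer, filled);
    }
}
```

Calling `readHead(stream, 4)` yields the four bytes that the determination module would then inspect; for a file, the stream could come from `java.nio.file.Files.newInputStream`.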
7. The device for determining the encoding format of a JAVA file and byte stream according to claim 6, characterized in that the determination module determining the encoding format of the file according to the Unicode encoding rules and the first four bytes of the file specifically comprises:
If the 1st byte is -1 and the 2nd byte is -2, then the encoding format is UTF-16;
If the 1st byte is -2 and the 2nd byte is -1, then the encoding format is Unicode;
If the 1st byte is -17, the 2nd byte is -69, and the 3rd byte is -65, then the encoding format is UTF-8.
8. The device for determining the encoding format of a JAVA file and byte stream according to claim 7, characterized in that, if the encoding format is determined to be Unicode, the encoding format returned for the file is UTF-16.
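The negative values in claim 7 arise because Java's `byte` type is signed. The short sketch below (the class `SignedByteDemo` and its methods are hypothetical names) shows how each signed value corresponds to the unsigned hexadecimal byte usually quoted for byte order marks.

```java
/** Illustrative mapping between signed Java bytes and unsigned hex values; names are hypothetical. */
final class SignedByteDemo {
    private SignedByteDemo() {}

    /** Converts a signed Java byte to its unsigned value in the range 0..255. */
    static int unsigned(byte b) {
        return b & 0xff;
    }

    /** Formats a byte as two uppercase hexadecimal digits. */
    static String hex(byte b) {
        return String.format("%02X", unsigned(b));
    }

    public static void main(String[] args) {
        byte[] marks = {-1, -2, -17, -69, -65};
        for (byte b : marks) {
            // e.g. -17 corresponds to 0xEF, the first byte of the UTF-8 byte order mark
            System.out.println(b + " -> 0x" + hex(b));
        }
    }
}
```

Read this way, the three rules test for the byte sequences FF FE, FE FF, and EF BB BF respectively.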
9. The device for determining the encoding format of a JAVA file and byte stream according to claim 6, characterized in that the determination module determining the encoding format of the byte stream according to the Unicode encoding rules and the first four bytes of the byte stream specifically comprises:
The 1st byte head and the 2nd byte tail are each bitwise-ANDed with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead>=0xa1 and iHead<=0xf7, and the second new byte satisfies iTail>=0xa1 and iTail<=0xfe, then according to the Unicode encoding rules the byte stream is GB2312;
The 1st byte head and the 2nd byte tail are each bitwise-ANDed with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead>=0x81 and iHead<=0xf7, and the second new byte satisfies (iTail>=0x40 and iTail<=0x7e) or (iTail>=0x80 and iTail<=0xfe), then according to the Unicode encoding rules the byte stream is GBK;
The 1st byte head and the 2nd byte tail are each bitwise-ANDed with 0xff to obtain iHead and iTail; if the first new byte satisfies iHead>=0xa1 and iHead<=0xf9, and the second new byte satisfies (iTail>=0x40 and iTail<=0x7e) or (iTail>=0xa1 and iTail<=0xfe), then according to the Unicode encoding rules the byte stream is BIG5;
The 1st byte head and the 2nd byte tail2 are each bitwise-ANDed with 0xff to obtain iHead and iTail2; if the first new byte satisfies iHead>=0x81 and iHead<=0xf9, the second new byte satisfies (iTail2>=0x40 and iTail2<=0x7e) or (iTail2>=0xa1 and iTail2<=0xfe), the third new byte satisfies (iTail3>=0x40 and iTail3<=0x7e) or (iTail3>=0xa1 and iTail3<=0xfe), and the fourth new byte satisfies (iTail4>=0x40 and iTail4<=0x7e) or (iTail4>=0xa1 and iTail4<=0xfe), then according to the Unicode encoding rules the byte stream is UTF8.
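The two-byte ranges in claim 9 overlap: any pair that satisfies the GB2312 condition as written also satisfies the GBK and BIG5 conditions. The claim does not state how such ties are resolved, so the sketch below simply collects every matching label in the order the claim lists them. The class `CandidateCollector`, the method `candidates`, and the choice to return a list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch that lists every two-byte rule a byte pair satisfies; names are hypothetical. */
final class CandidateCollector {
    private CandidateCollector() {}

    private static boolean between(int value, int low, int high) {
        return value >= low && value <= high;
    }

    /** Returns the labels of all claimed two-byte ranges that the pair falls into. */
    static List<String> candidates(byte head, byte tail) {
        int iHead = head & 0xff;   // unsigned value of the first byte
        int iTail = tail & 0xff;   // unsigned value of the second byte
        List<String> matches = new ArrayList<>();
        if (between(iHead, 0xa1, 0xf7) && between(iTail, 0xa1, 0xfe)) {
            matches.add("GB2312");
        }
        if (between(iHead, 0x81, 0xf7)
                && (between(iTail, 0x40, 0x7e) || between(iTail, 0x80, 0xfe))) {
            matches.add("GBK");
        }
        if (between(iHead, 0xa1, 0xf9)
                && (between(iTail, 0x40, 0x7e) || between(iTail, 0xa1, 0xfe))) {
            matches.add("BIG5");
        }
        return matches;
    }
}
```

With the pair 0xB0 0xA1 all three labels are returned, which suggests that an implementation following the claim would report whichever rule it happens to test first.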
10. The device for determining the encoding format of a JAVA file and byte stream according to claim 6, characterized in that the determination module is further configured to determine the encoding format of a character string by determining the encoding format of a byte stream, which specifically comprises:
Assuming a candidate encoding format for the original character string, whose encoding format is unknown;
Converting the original character string into a byte stream in the assumed encoding format, converting that byte stream back into a new character string using the same encoding format, and comparing the original character string with the new character string; if the two character strings are identical, the encoding format of the original character string is the assumed encoding format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611041686.4A CN106775909A (en) | 2016-11-22 | 2016-11-22 | The determination methods and device of the coded format of a kind of JAVA files and byte stream |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106775909A true CN106775909A (en) | 2017-05-31 |
Family
ID=58975167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611041686.4A Pending CN106775909A (en) | 2016-11-22 | 2016-11-22 | The determination methods and device of the coded format of a kind of JAVA files and byte stream |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106775909A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567293A (en) * | 2010-12-13 | 2012-07-11 | 汉王科技股份有限公司 | Coded format detection method and coded format detection device for text files |
CN104156373A (en) * | 2013-05-15 | 2014-11-19 | 宏碁股份有限公司 | Coding format detection method and device |
Non-Patent Citations (4)
Title |
---|
CAESAR525: "判断是不是GB2312", 《HTTPS://BBS.CSDN.NET./TOPICS/19172569》 * |
HOORAY520: "判断字符串编码", 《HTTPS://BLOG.CSDN.NET/HOORAY520/ARTICLE/DETAILS/83916560》 * |
OICQXIESIDILIERIC: "Java正确判别出文件的字符集", 《HTTPS://BLOG.CSDN.NET/OICQXIESIDILIERIC/ARTICLE/DETAILS/8464630》 * |
风叙: "对各字符集编码范围的总结", 《HTTPS://WWW.CNBLOGS.COM/JUNEAPPLE/ARTICLES/1768983.HTML》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107943761A (en) * | 2017-11-14 | 2018-04-20 | 北京思特奇信息技术股份有限公司 | A kind of method of calibration and system of TXT document codings character set |
CN109542507A (en) * | 2018-10-26 | 2019-03-29 | 深圳点猫科技有限公司 | A kind of GBK code processing method and electronic equipment based on educational system |
CN110609684A (en) * | 2019-09-18 | 2019-12-24 | 四川长虹电器股份有限公司 | Method for converting video into character animation under Spring boot frame |
CN110609684B (en) * | 2019-09-18 | 2022-08-19 | 四川长虹电器股份有限公司 | Method for converting video into character animation under Spring boot frame |
CN111368508A (en) * | 2020-03-03 | 2020-07-03 | 深信服科技股份有限公司 | Data processing method, device, equipment and medium |
CN111368508B (en) * | 2020-03-03 | 2024-04-09 | 深信服科技股份有限公司 | Data processing method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106775909A (en) | The determination methods and device of the coded format of a kind of JAVA files and byte stream | |
CN106293677B (en) | A kind of code conversion method and device | |
CN105247472B (en) | Processor, method, system and instruction for the variable-length codes point transcoding to Unicode character | |
CN111144128B (en) | Semantic analysis method and device | |
CN109460219B (en) | Method for quickly serializing interface control file | |
CN113778449B (en) | Avionic interface data adaptation conversion system | |
CN104050077A (en) | Fusible instructions and logic to provide or-test and and-test functionality using multiple test sources | |
CN107566090B (en) | Fixed-length/variable-length text message processing method and device | |
CN109933602A (en) | A kind of conversion method and device of natural language and structured query language | |
CN103827815A (en) | Instruction and logic to provide vector loads and stores with strides and masking functionality | |
CN106940743A (en) | A kind of ventilation shaft mechanical analyzing method and system | |
CN115455382A (en) | Semantic comparison method and device for binary function codes | |
CN112364631A (en) | Chinese grammar error detection method and system based on hierarchical multitask learning | |
CN104572102A (en) | Method for solving Chinese messy codes in JAVA | |
CN112860584B (en) | Workflow model-based testing method and device | |
CN114138243A (en) | Function calling method, device, equipment and storage medium based on development platform | |
CN104021147B (en) | A kind of code stream analyzing method and device | |
CN105653506B (en) | It is a kind of based on character code conversion GPU in text-processing method and device | |
CN111124541B (en) | Configuration file generation method, device, equipment and medium | |
JP2011090526A (en) | Compression program, method, and device, and decompression program, method, and device | |
CN104462157A (en) | Method and device for secondary structuralizing of text data | |
CN105793842B (en) | Conversion method and device between serialized message | |
CN116644180A (en) | Training method and training system for text matching model and text label determining method | |
CN109460236A (en) | Program version building and inspection method and system | |
CN114297408A (en) | Relation triple extraction method based on cascade binary labeling framework |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170531 |