WO2015043072A1 - Method and system for selecting an encoding format for reading a target document - Google Patents
Method and system for selecting an encoding format for reading a target document
- Publication number
- WO2015043072A1 (application PCT/CN2013/088745)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- encoding format
- garbled
- target document
- characters
- read
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/146—Coding or compression of tree-structured data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/451—Execution arrangements for user interfaces
- G06F9/454—Multi-language systems; Localisation; Internationalisation
Definitions
- The present invention relates to a method and system for selecting an encoding format for reading a target document, and belongs to the technical field of electric digital data processing.
- An encoding format is a predetermined way of digitizing text, numbers, or other objects.
- Encoding formats are widely used in fields such as electronic computers and television.
- A file encoding format, also known as a character encoding format, specifies how characters are represented when text is processed. When a Chinese file is read with an encoding format that does not match the one it was written in, the read may raise an exception or produce an incorrect result.
- Common Chinese character encoding formats include GB2312, BIG5, GBK, and UTF-8.
- The prior-art method for determining the encoding format of a text document is to read the first few bytes of the document and infer the encoding format from the values of those bytes.
- However, the first few bytes of a text document do not always carry encoding format information, in which case the encoding format cannot be obtained this way. A mechanism that selects the correct encoding format for reading a document would greatly reduce the problems caused by file encoding formats and improve development efficiency.
- The technical problem to be solved by the present invention is that the prior art only reads the first few bytes of the target document and infers the encoding format from their values; when those bytes do not carry the document's encoding format information, the encoding format cannot be obtained.
- The present invention is achieved by the following technical solutions:
- A method for selecting an encoding format for reading a target document, comprising: reading a reference document using at least one reference encoding format, and determining all or part of the garbled pattern obtained when the reference document is read using the reference encoding format; reading the target document using one encoding format at a time; for each encoding format, comparing the data produced when the target document is read using that encoding format with the determined garbled pattern, thereby determining the garbled characters produced when the target document is read using that encoding format; and counting and comparing the garbled characters produced when the target document is read using each encoding format, and then determining the encoding format for reading the target document.
- The reference encoding format belongs to an encoding format set that includes all or part of the encoding formats.
- The reference document is read using all of the reference encoding formats to determine all or part of the garbled pattern obtained when the reference document is read using the reference encoding formats.
- The process of determining all or part of the garbled pattern obtained when the reference document is read using the reference encoding format is as follows: for the garbled string obtained when the reference document is read using the reference encoding format, delete the non-valid judgment characters from the garbled string to obtain the valid judgment characters; then count the number of occurrences of each garbled character among the valid judgment characters to obtain the garbled pattern.
- The non-valid judgment characters are English letters, digits, and whitespace characters; the valid judgment characters are all characters other than the non-valid judgment characters.
- In one option, when the occurrences of garbled characters among the valid judgment characters are counted to obtain the garbled pattern, a threshold on the number of occurrences is preset, and all garbled characters whose number of occurrences is greater than the threshold are saved as the garbled pattern.
- In another option, when the occurrences of garbled characters among the valid judgment characters are counted to obtain the garbled pattern, the garbled characters are sorted in descending order of number of occurrences; the leading portion of the sorted garbled characters is taken and saved as the garbled pattern.
- Specifically, the garbled characters ranked in the first k% are taken and saved as the garbled pattern, where k is a positive number and 50 ≤ k < 100.
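The two ways of turning occurrence counts into a garbled pattern can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: all function names are hypothetical, "English letters" and "digits" are assumed to mean ASCII letters and digits, and "the first k%" is assumed to mean the first k% of the distinct garbled characters after sorting, rounded down.

```python
from collections import Counter


def is_valid_judgment_char(ch: str) -> bool:
    """Return True unless ch is an English letter, a digit, or whitespace.

    Assumption: "English letters" and "digits" mean ASCII letters and digits.
    """
    if ch.isspace():
        return False
    if ch.isascii() and ch.isalnum():
        return False
    return True


def count_garbled_chars(garbled_text: str) -> Counter:
    """Drop non-valid judgment characters, then count each remaining character."""
    return Counter(ch for ch in garbled_text if is_valid_judgment_char(ch))


def pattern_by_threshold(counts: Counter, threshold: int) -> set:
    """Keep every garbled character seen more than `threshold` times."""
    return {ch for ch, n in counts.items() if n > threshold}


def pattern_by_top_percent(counts: Counter, k: float) -> set:
    """Keep the first k% of distinct characters, sorted by descending count."""
    if not 50 <= k < 100:
        raise ValueError("k is expected to satisfy 50 <= k < 100")
    ranked = [ch for ch, _ in counts.most_common()]
    keep = int(len(ranked) * k / 100)  # assumption: round down
    return set(ranked[:keep])
```

Either function returns a plain set, so the later "does the pattern contain this character" check is a constant-time membership test.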
- Each time the target document is read using one encoding format, only part of its content is read, until a preset number of valid judgment characters is obtained; if the preset number has not been reached after the entire document is read, the number of valid judgment characters actually obtained is used. The preset number of valid judgment characters is 50 to 1000.
- Comparing the data produced when the target document is read using an encoding format with the determined garbled pattern, and determining the garbled characters produced, is done as follows: each piece of data produced when the target document is read using each encoding format is compared one by one with the garbled characters in the garbled pattern. If the garbled pattern contains the data, the data is judged to be garbled; otherwise it is not.
- Counting and comparing the garbled characters produced when the target document is read using each encoding format, and then determining the encoding format for reading the target document, is done as follows: the garbled ratio produced when the target document is read using each encoding format is computed, and the encoding format with the lowest garbled ratio is selected as the encoding format for reading the target document; or the garbled ratio produced when the target document is read using each encoding format is computed, and an encoding format whose garbled ratio is lower than a preset threshold is selected as the encoding format for reading the target document.
- The garbled ratio is the ratio of garbled characters to valid judgment characters.
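The garbled ratio and the two selection rules described above reduce to a few lines. The sketch below assumes the garbled pattern is held as a Python set and that the per-encoding ratios have already been computed; the function names are hypothetical, and treating an empty sample as a ratio of 0 is an assumption the text does not address.

```python
def garbled_ratio(valid_chars: str, pattern: set) -> float:
    """Share of valid judgment characters that appear in the garbled pattern."""
    if not valid_chars:
        return 0.0  # assumption: an empty sample counts as no garbling
    garbled = sum(1 for ch in valid_chars if ch in pattern)
    return garbled / len(valid_chars)


def select_lowest(ratios: dict) -> str:
    """Rule 1: pick the encoding format with the lowest garbled ratio."""
    return min(ratios, key=ratios.get)


def select_first_below(ratios: dict, threshold: float):
    """Rule 2: pick the first encoding format whose ratio is below the threshold.

    Returns None when no encoding qualifies (an assumption; the text does not
    say what happens in that case).
    """
    for encoding, ratio in ratios.items():
        if ratio < threshold:
            return encoding
    return None
```

Rule 1 needs a ratio for every encoding in the set, while rule 2 can stop as soon as one encoding is good enough.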
- Each encoding format used to read the target document belongs to the encoding format set.
- A system for selecting an encoding format for reading a target document, comprising: a garbled pattern generating module, configured to read a reference document using at least one reference encoding format and determine all or part of the garbled pattern obtained when the reference document is read using the reference encoding format; a target document reading module, configured to read the target document using one encoding format at a time; a garbled character determining module, configured to, for each encoding format, compare the data produced when the target document is read using that encoding format with the determined garbled pattern and determine the garbled characters produced when the target document is read using that encoding format; and a reading encoding format selection module, configured to count and compare the garbled characters produced when the target document is read using each encoding format and then determine the encoding format for reading the target document.
- The reference encoding format belongs to an encoding format set that includes all or part of the encoding formats.
- The reference document is read using all of the reference encoding formats to determine all or part of the garbled pattern obtained when the reference document is read using the reference encoding formats.
- The process of determining all or part of the garbled pattern obtained when the reference document is read using the reference encoding format is as follows: for the garbled string obtained when the reference document is read using the reference encoding format, delete the non-valid judgment characters from the garbled string to obtain the valid judgment characters; then count the number of occurrences of each garbled character among the valid judgment characters to obtain the garbled pattern.
- The non-valid judgment characters are English letters, digits, and whitespace characters; the valid judgment characters are all characters other than the non-valid judgment characters.
- In one option, when the occurrences of garbled characters among the valid judgment characters are counted to obtain the garbled pattern, a threshold on the number of occurrences is preset, and all garbled characters whose number of occurrences is greater than the threshold are saved as the garbled pattern.
- In another option, when the occurrences of garbled characters among the valid judgment characters are counted to obtain the garbled pattern, the garbled characters are sorted in descending order of number of occurrences; the leading portion of the sorted garbled characters is taken and saved as the garbled pattern.
- Specifically, the garbled characters ranked in the first k% are taken and saved as the garbled pattern, where k is a positive number and 50 ≤ k < 100.
- Each time the target document is read using one encoding format, only part of its content is read, until a preset number of valid judgment characters is obtained; if the preset number has not been reached after the entire document is read, the number of valid judgment characters actually obtained is used. The preset number of valid judgment characters is 50 to 1000.
- Comparing the data produced when the target document is read using an encoding format with the determined garbled pattern, and determining the garbled characters produced, is done as follows: each piece of data produced when the target document is read using each encoding format is compared one by one with the garbled characters in the garbled pattern. If the garbled pattern contains the data, the data is judged to be garbled; otherwise it is not.
- Counting and comparing the garbled characters produced when the target document is read using each encoding format, and then determining the encoding format for reading the target document, is done as follows: the garbled ratio produced when the target document is read using each encoding format is computed, and the encoding format with the lowest garbled ratio is selected as the encoding format for reading the target document; or the garbled ratio produced when the target document is read using each encoding format is computed, and an encoding format whose garbled ratio is lower than a preset threshold is selected as the encoding format for reading the target document.
- The garbled ratio is the ratio of garbled characters to valid judgment characters.
- Each encoding format used to read the target document belongs to the encoding format set.
- Compared with the prior art, the above technical solution has one or more of the following advantages:
- The method and system for selecting the encoding format for reading a target document first read the reference document using the reference encoding format to obtain the garbled pattern. Then, when reading the target document, for each encoding format, the data produced when the target document is read using that encoding format is compared with the determined garbled pattern to determine the garbled characters produced. Finally, the garbled characters produced for each encoding format are counted, and the encoding format with the smallest garbled ratio, or one whose garbled ratio is lower than a preset threshold, is determined as the encoding format for reading the target document.
- This avoids the prior-art problem in which the first few bytes of the target document are read and the encoding format is inferred from their values, but the encoding format cannot be obtained when those bytes do not carry the document's encoding format information.
- Garbled characters with a higher number of occurrences are used as the garbled pattern according to a certain ratio, which filters out uncommon garbled characters and improves the efficiency of subsequently selecting the encoding format for reading the target document.
- The method selects the encoding format for reading the target document according to the minimum garbled ratio; the approach is simple and easy to implement.
- In the method, when the garbled ratio of an encoding format is less than a preset threshold, that encoding format is used as the encoding format for reading the document. This avoids the longer processing time needed to read the target document with every encoding format before selecting one, and further improves the efficiency of selecting the encoding format.
- FIG. 1 is an embodiment of the method for selecting an encoding format for reading a target document according to the present invention.
- FIG. 2 is a block diagram of an embodiment of a system for selecting an encoding format for reading a target document according to the present invention.
- DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS. Embodiment 1: This embodiment provides a method for selecting an encoding format for reading a target document, as shown in FIG. 1. The steps include:
- The document in this embodiment is a text document.
- The reference encoding format belongs to an encoding format set that includes all encoding formats.
- The process of determining all or part of the garbled pattern obtained when the reference document is read using the reference encoding format is as follows: for the garbled string obtained when the reference document is read using the reference encoding format, delete the non-valid judgment characters from the garbled string to obtain the valid judgment characters; then count the number of occurrences of each garbled character among the valid judgment characters to obtain the garbled pattern.
- The non-valid judgment characters are English letters, digits, and whitespace characters; the valid judgment characters are all characters other than the non-valid judgment characters.
- Deleting the non-valid judgment characters from the garbled string obtained when the reference document is read using the reference encoding format reduces the number of characters to be processed, which improves processing speed and also improves the accuracy of the resulting garbled pattern.
- When the occurrences of garbled characters among the valid judgment characters are counted to obtain the garbled pattern, a threshold on the number of occurrences is preset, and all garbled characters whose number of occurrences is greater than the threshold form the garbled pattern.
- Using the garbled characters with a high number of occurrences as the garbled pattern, according to a certain ratio, filters out uncommon garbled characters, which improves the efficiency of subsequently selecting the encoding format for reading the target document.
- Alternatively, the number of occurrences of each garbled character among the valid judgment characters is counted, the garbled characters are sorted in descending order of number of occurrences, and the leading portion of the sorted garbled characters is taken and saved as the garbled pattern.
- In this embodiment, the first 80% of the garbled characters are taken, and these garbled characters form the garbled pattern.
- In other embodiments, the first k% of the garbled characters are taken and saved as the garbled pattern, where k is a positive number with 60 ≤ k ≤ 90. k may take values such as 60, 70, 75, or 90, chosen according to the user's needs.
- The encoding format used to read the target document belongs to the encoding format set described above; that is, the reference encoding format and the encoding format selected for reading the target document belong to the same set.
- As a result, the garbled pattern has a high recognition rate for the garbled characters produced. If the preset number of valid judgment characters has not been obtained after the entire target document is read, the number of valid judgment characters actually obtained is used. The preset number of valid judgment characters is 50 to 1000.
- In this embodiment, the number of valid judgment characters is 100.
- In other embodiments, the number of valid judgment characters may take values such as 70, 150, 200, 300, 500, 700, or 1000, chosen according to the user's needs.
- In the method of this embodiment, only part of the target document is read, until the preset number of valid judgment characters is obtained. The selected valid judgment characters remain representative, while the efficiency of selecting the encoding format for reading the target document is further improved.
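Reading only as much of the target document as is needed to collect the preset number of valid judgment characters can be sketched with Python's incremental decoders. This is an illustrative sketch, not the patent's implementation: the function names and the `chunk_size` parameter are hypothetical, and undecodable bytes are assumed to be replaced with U+FFFD rather than raising an error.

```python
import codecs
import io


def is_valid_judgment_char(ch: str) -> bool:
    """True unless ch is an ASCII letter, an ASCII digit, or whitespace."""
    return not (ch.isspace() or (ch.isascii() and ch.isalnum()))


def read_valid_chars(stream, encoding: str, n: int = 100, chunk_size: int = 256) -> str:
    """Decode a binary stream chunk by chunk until n valid judgment characters are found.

    If the stream ends first, the characters actually obtained are returned.
    """
    # Assumption: undecodable bytes become U+FFFD instead of raising an error.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    collected = []
    while len(collected) < n:
        chunk = stream.read(chunk_size)
        # final=True on the last (empty) read flushes any incomplete byte sequence.
        text = decoder.decode(chunk, final=not chunk)
        collected.extend(ch for ch in text if is_valid_judgment_char(ch))
        if not chunk:
            break
    return "".join(collected[:n])


sample = io.BytesIO(("abc 123 " + "中文编码格式" * 50).encode("utf-8"))
first_100 = read_valid_chars(sample, "utf-8", n=100)
```

The incremental decoder keeps multi-byte characters intact even when a chunk boundary falls in the middle of one, which a plain `bytes.decode` on each chunk would not.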
- For each encoding format, the data produced when the target document is read using that encoding format is compared with the determined garbled pattern, and the garbled characters produced when the target document is read using that encoding format are determined.
- The specific process is: each piece of data produced when the target document is read using each encoding format is compared one by one with the garbled characters in the garbled pattern. If the garbled pattern contains the data, the data is judged to be garbled; otherwise it is not.
- The garbled characters are then counted and compared. The specific process is: the garbled ratio produced when the target document is read using each encoding format is computed, and the encoding format with the lowest garbled ratio is selected as the encoding format for reading the target document.
- The garbled ratio is the proportion of garbled characters among the valid judgment characters read.
- The method of this embodiment selects the encoding format for reading the target document according to the smallest garbled ratio; the approach is simple and easy to implement.
- Alternatively, the garbled ratio produced when the target document is read using each encoding format is computed, and an encoding format whose garbled ratio is lower than the preset threshold is selected as the encoding format for reading the target document.
- In this variant, when the garbled ratio of an encoding format is less than the preset threshold, that encoding format is used as the encoding format for reading the document. This avoids the longer processing time needed to read the target document with every encoding format before selecting one, and further improves the efficiency of selecting the encoding format.
- In summary, the method and system of this embodiment first read the reference document using the reference encoding format to obtain the garbled pattern. Then, when reading the target document, for each encoding format, the data produced when the target document is read using that encoding format is compared with the determined garbled pattern to determine the garbled characters produced. Finally, the garbled characters produced for each encoding format are counted, and the encoding format with the smallest garbled ratio, or one whose garbled ratio is lower than the preset threshold, is determined as the encoding format for reading the target document.
- Embodiment 2: First, 500 reference documents are collected. The reference documents are read using at least one reference encoding format, and all or part of the garbled pattern obtained when the reference documents are read using the reference encoding format is determined. Take one of these reference documents, written in the UTF-8 encoding format, as an example, with the original text "Mid-Autumn Festival, our company sent me a fee of 1,500 yuan, everyone is very happy!". The encoding format of this document is UTF-8; reading it using GB2312 yields a garbled string.
- The target document is read using the encoding format GB2312, and 100 valid judgment characters are obtained.
- For the encoding format UTF-8 and the encoding format GB2312, the data produced when the target document is read using each encoding format is compared with the garbled characters in the determined garbled pattern, and the garbled characters produced when the target document is read using each encoding format are determined.
- Of the 100 valid judgment characters obtained when the target document is read using UTF-8, 86 are counted as garbled according to the garbled pattern, so the garbled ratio is 86%.
- For the data obtained when the target document is read using GB2312, the garbled ratio is calculated to be 0%. Under the lowest-ratio rule, GB2312 is therefore selected as the encoding format for reading the target document.
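The way a garbled string arises in this embodiment (bytes written as UTF-8, then read as GB2312) can be reproduced in a few lines of Python. The sample text below is a hypothetical stand-in, not the sentence from the embodiment, and the `misread` helper is an illustrative name; undecodable bytes are assumed to be replaced with U+FFFD.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one encoding, then decode the bytes with another."""
    raw = text.encode(written_as)
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode(read_as, errors="replace")


sample = "中秋节快乐"  # hypothetical sample text, not taken from the patent
garbled = misread(sample, "utf-8", "gb2312")   # wrong encoding: garbled output
restored = misread(sample, "utf-8", "utf-8")   # matching encoding: original text

print(repr(garbled))
print(restored)
```

Running the same mismatch over many reference documents is what supplies the garbled characters that are counted to form the garbled pattern.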
- Embodiment 3 provides a method for selecting the encoding format for reading a Chinese text document. It includes two stages: the first stage obtains a garbled pattern through statistics, and the second stage selects the encoding format for reading the target document.
- The non-valid judgment characters are English letters (upper and lower case), digits, and whitespace characters (spaces, tabs, line breaks, and so on); the valid judgment characters are all characters other than English letters (upper and lower case), digits, and whitespace characters (spaces, tabs, line breaks, and so on).
- The first stage, obtaining the garbled pattern through statistics, proceeds as follows.
- The encoding format of the file is expressed as / ( )
- An encoding format that outputs garbled characters is used as the reference encoding format.
- The reference encoding format is expressed as C: , which means that garbled characters are produced when the file is read using that encoding format.
- The third step: delete the non-valid judgment characters from the garbled string.
- The fourth step: scan the garbled string and count the number of occurrences of each garbled character in it.
- The second to fourth steps are repeated for each reference document, and the number of occurrences of each garbled character is accumulated.
- The garbled characters are sorted in descending order of number of occurrences.
- The first m% of the garbled characters are taken.
- These garbled characters form the garbled pattern.
- The value of m is greater than or equal to 50 and less than 100; preferably, m is between 60 and 90.
- The second stage, selecting the encoding format for reading the target document, proceeds as follows.
- The target document here is also a Chinese text document. There are two implementations of this stage.
- The first implementation is as follows:
- The first step: an encoding format c is taken from the encoding format set C, and part of the content of the target document is read using c to obtain the first n valid judgment characters. If fewer than n valid judgment characters have been obtained when the entire file has been read, the number actually obtained is used. Assume that this step finally obtains m valid judgment characters. The value of n is greater than or equal to 10, preferably in the range [50, 1000].
- The second step: the number m' of garbled characters among the m valid judgment characters is counted according to the garbled pattern, and the garbled ratio m'/m is calculated.
- The encoding format c and the garbled ratio m'/m are added to the list L. One method of counting the garbled characters is: set m' to zero and read each of the m characters in turn; if the character belongs to the garbled pattern, add one to m'; after all m characters have been traversed, the value of m' is the number of garbled characters.
- The third step: for the other encoding formats in the encoding format set C, repeat the first and second steps.
- Finally, the encoding format with the smallest garbled ratio in the list L is returned as the encoding format for reading the target document.
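The first implementation of the second stage can be sketched as a loop that fills the list L and then takes its minimum. This is a hedged illustration rather than the patent's code: the function names are hypothetical, the whole byte string is decoded for brevity instead of reading only part of the file, and undecodable bytes are assumed to become U+FFFD.

```python
def is_valid_judgment_char(ch: str) -> bool:
    """True unless ch is an ASCII letter, an ASCII digit, or whitespace."""
    return not (ch.isspace() or (ch.isascii() and ch.isalnum()))


def first_valid_chars(data: bytes, encoding: str, n: int) -> str:
    """First n valid judgment characters of data when read with `encoding`."""
    # Assumption: decode everything at once; a real reader would stop early.
    text = data.decode(encoding, errors="replace")
    return "".join([ch for ch in text if is_valid_judgment_char(ch)][:n])


def select_by_lowest_ratio(data: bytes, encoding_set, pattern: set, n: int = 100):
    """Return (best encoding, list L of (encoding, garbled ratio) entries)."""
    results = []  # the list L
    for c in encoding_set:
        chars = first_valid_chars(data, c, n)
        m = len(chars)
        m_garbled = 0  # m' in the text
        for ch in chars:
            if ch in pattern:
                m_garbled += 1
        ratio = m_garbled / m if m else 0.0  # assumption: empty sample gives 0
        results.append((c, ratio))
    best = min(results, key=lambda entry: entry[1])[0]
    return best, results


# Toy data: a GB2312 document, with a pattern built from misreading it as UTF-8.
data = "中文编码格式测试".encode("gb2312")
pattern = set(first_valid_chars(data, "utf-8", 100))
best, entries = select_by_lowest_ratio(data, ["utf-8", "gb2312"], pattern)
```

Every encoding in the set is tried before a decision is made, which is the cost that the threshold-based variant is meant to avoid.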
- The second stage can also be performed as follows:
- The first step: an encoding format c is taken from the encoding format set C, and part of the content of the target document is read using the encoding format c.
- The second step: the number m' of garbled characters among the m valid judgment characters is counted according to the garbled pattern, and the garbled ratio m'/m is calculated.
- One method of counting the garbled characters is: set m' to zero and read each of the m characters in turn; if the character belongs to the garbled pattern, add one to m'; after all m characters have been traversed, the value of m' is the number of garbled characters.
- The third step: if the garbled ratio m'/m is greater than or equal to the threshold, the first and second steps are repeated with another encoding format. If the garbled ratio is less than the threshold, the encoding format c is returned.
- The threshold is greater than or equal to 1%, preferably between 5% and 50%, and is 15% in this embodiment.
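The threshold-based variant stops at the first encoding format whose garbled ratio falls below the threshold. The sketch below uses the 15% value from this embodiment as the default; the function names are hypothetical, the `tried` list exists only to make the early exit visible, and returning `None` when no encoding qualifies is an assumption, since the text does not cover that case.

```python
def is_valid_judgment_char(ch: str) -> bool:
    """True unless ch is an ASCII letter, an ASCII digit, or whitespace."""
    return not (ch.isspace() or (ch.isascii() and ch.isalnum()))


def select_first_below_threshold(data: bytes, encoding_set, pattern: set,
                                 n: int = 100, threshold: float = 0.15):
    """Return (chosen encoding or None, encodings tried so far)."""
    tried = []
    for c in encoding_set:
        tried.append(c)
        # Assumption: undecodable bytes are replaced with U+FFFD.
        text = data.decode(c, errors="replace")
        chars = [ch for ch in text if is_valid_judgment_char(ch)][:n]
        m = len(chars)
        m_garbled = sum(1 for ch in chars if ch in pattern)  # m' in the text
        ratio = m_garbled / m if m else 0.0
        if ratio < threshold:
            return c, tried  # early exit: remaining encodings are never read
    return None, tried  # assumption: no encoding met the threshold


# Toy data: a GB2312 document, with a pattern built from misreading it as UTF-8.
data = "中文编码格式测试".encode("gb2312")
misread_text = data.decode("utf-8", errors="replace")
pattern = {ch for ch in misread_text if is_valid_judgment_char(ch)}

chosen, tried = select_first_below_threshold(data, ["gb2312", "utf-8"], pattern)
```

When a likely encoding is placed early in the set, only one read of the target document is needed; the trade-off is that the first acceptable encoding is returned even if a later one would have had a lower ratio.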
- In this embodiment, the garbled pattern is obtained through statistics, and the encoding format for reading the target document (a Chinese text document) is selected automatically according to the garbled pattern.
- Embodiment 4: This embodiment provides a method for selecting an encoding format for reading a target document, including a first phase and a second phase, as follows. In the first phase, a garbled pattern is obtained through statistics. The encoding set of interest is {UTF-8, GB2312}. In the first step, 1000 Chinese text documents are collected as reference documents to serve as a Chinese training corpus. Of these, 500 files are UTF-8 encoded and the other 500 are GB2312 encoded.
- The text documents in this embodiment are files with the suffix .txt.
- The second step: take a UTF-8 encoded file and read it using the GB2312 encoding to obtain a garbled string.
- The third step: delete the non-valid judgment characters from the garbled string.
- The fourth step: count the number of occurrences of each garbled character in the garbled string.
- The fifth step: for each reference document, repeat the second to fourth steps, accumulating the number of occurrences of each garbled character.
- The sixth step: sort the garbled characters in descending order of number of occurrences.
- The seventh step: take the first 80% of the garbled characters.
- The resulting garbled pattern contains the common garbled features under the encoding set of interest {UTF-8, GB2312}.
- the garbled pattern obtained is [ ⁇ ].
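The first phase of this embodiment, run over a whole corpus, might look like the sketch below. Several points are assumptions rather than statements from the patent: the function and parameter names are hypothetical, each reference document is assumed to be read with every encoding in the set other than its own (the text only spells out the UTF-8 file read as GB2312), undecodable bytes are replaced with U+FFFD, and "the first 80%" is taken over distinct characters, rounded down.

```python
from collections import Counter


def is_valid_judgment_char(ch: str) -> bool:
    """True unless ch is an ASCII letter, an ASCII digit, or whitespace."""
    return not (ch.isspace() or (ch.isascii() and ch.isalnum()))


def build_garbled_pattern(corpus, encoding_set, top_percent: float = 80) -> set:
    """Build a garbled pattern from (raw_bytes, true_encoding) reference documents."""
    counts = Counter()
    for raw, true_encoding in corpus:
        for wrong in encoding_set:
            if wrong == true_encoding:
                continue  # only mismatched reads produce garbled strings
            # Step 2: read the document with a non-matching encoding.
            garbled = raw.decode(wrong, errors="replace")
            # Steps 3 to 5: drop non-valid characters and accumulate counts.
            counts.update(ch for ch in garbled if is_valid_judgment_char(ch))
    # Step 6: sort by descending count. Step 7: keep the first top_percent.
    ranked = [ch for ch, _ in counts.most_common()]
    keep = int(len(ranked) * top_percent / 100)
    return set(ranked[:keep])


# Toy corpus of two documents; the embodiment uses 1000 .txt files instead.
text = "中文编码格式测试"
corpus = [(text.encode("utf-8"), "utf-8"), (text.encode("gb2312"), "gb2312")]
pattern = build_garbled_pattern(corpus, ["utf-8", "gb2312"])
```

With a corpus this small the pattern is dominated by the U+FFFD replacement character; a realistic corpus would surface the recurring garbled characters that the embodiment describes.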
- In the second phase, the first step is to take an encoding, UTF-8, from the encoding set of interest {UTF-8, GB2312}, and read the first 100 valid judgment characters of the target document using UTF-8.
- The second step: according to the garbled pattern, the number of garbled characters among the 100 valid judgment characters is 86, so the garbled ratio is 86%. The entry (UTF-8, 86%) is added to the list L.
- The second phase may also be implemented as follows. First, an encoding, UTF-8, is taken from the encoding set of interest {UTF-8, GB2312}, and the first 100 valid judgment characters of the target document are read using UTF-8.
- FIG. 2 is a structural diagram of an embodiment of a system for selecting an encoding format of a target document according to the present invention.
- the embodiment provides a system for selecting an encoding format of a target document, including: an L code mode generating module 21 And for reading the reference document by using at least one reference encoding format to determine all or part of the garbled mode obtained when the reference document is read by using the reference encoding format.
- the reference encoding format belongs to a set of encoding formats including all or part of the encoding format, and the reference encoding format is an encoding format that generates garbled characters when the reference format is read in the encoding format.
- the process of determining all or part of the garbled pattern obtained when reading the reference document by using the reference encoding format is as follows: For the garbled character string obtained when the reference document is read using the reference encoding format, the non-valid character in the I ⁇ L code string is deleted. , obtain a valid judgment character; statistically determine the number of occurrences of garbled characters in the character, and obtain a garbled pattern.
- the non-effective judgment characters refer to English letters, numbers and blank characters; the effective judgment characters refer to all characters except the valid judgment characters.
- the illegible character string obtained when the reference document is read by using the reference encoding format is deleted, and the non-effective judgment character in the garbled character string is deleted, so that the number of processed characters is less, and the processing is further improved.
- Speed JL also improves the accuracy of getting garbled.
- When the occurrences of garbled characters among the valid judgment characters are counted to obtain the garbled pattern, a threshold number of occurrences may be preset; all garbled characters whose occurrence count is greater than the threshold form the garbled pattern.
- Garbled characters with a high occurrence count are kept as the garbled pattern according to a certain ratio, and uncommon garbled characters are filtered out, which provides a good basis for accurately selecting the encoding format of the target document in the subsequent steps.
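The threshold variant can be sketched in a few lines. The function name and the sample counts are hypothetical; the only rule taken from the text is that a character is kept when its occurrence count is strictly greater than the preset threshold.

```python
from collections import Counter


def pattern_by_threshold(counts: Counter, threshold: int) -> set:
    """Keep every garbled character that occurs more often than the preset threshold."""
    return {ch for ch, n in counts.items() if n > threshold}


# Hypothetical occurrence counts of garbled characters.
counts = Counter({"ç": 3, "¼": 2, "é": 1})
print(sorted(pattern_by_threshold(counts, threshold=1)))
```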
- When the occurrences of garbled characters among the valid judgment characters are counted to obtain the garbled pattern, the garbled characters may instead be sorted in descending order of occurrence count; the garbled characters in the leading portion are taken and saved in the garbled pattern.
- The garbled characters ranked in the first k% are taken and saved in the garbled pattern, where k is a positive number and 50 ≤ k ≤ 100.
- For example, the garbled characters in the first 80% are taken, and these make up the garbled pattern to be acquired.
- Alternatively, the garbled characters in the first k% are taken and saved in the garbled pattern, where k is a positive number in the range 60 ≤ k ≤ 90. k may be set to different values such as 60, 70, 75, or 90, according to the user's needs.
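A rough sketch of the ranked variant follows. It assumes that "the first k%" refers to k% of the distinct garbled characters after sorting, rounded up; the text does not spell out the rounding, and the function name is hypothetical.

```python
import math
from collections import Counter


def pattern_by_top_percent(counts: Counter, k: float = 80.0) -> list:
    """Sort garbled characters by descending count and keep the first k percent."""
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    ranked = [ch for ch, _ in counts.most_common()]  # descending by occurrence count
    keep = math.ceil(len(ranked) * k / 100)  # rounding up is an assumption
    return ranked[:keep]


# Hypothetical counts: five distinct garbled characters, keep the first 80%.
counts = Counter({"ç": 5, "¼": 4, "æ": 3, "å": 2, "é": 1})
print(pattern_by_top_percent(counts, k=80))
```

With five distinct characters and k = 80, the four most frequent characters are kept and the rarest one is filtered out.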
- The target document reading module 22 is configured to read the target document using one encoding format at a time.
- On each read, the module obtains a preset number of valid judgment characters; the encoding format used belongs to the same encoding-format set as the reference encoding format. If the preset number of valid judgment characters has not been reached after the entire content of the document has been read, the number of valid judgment characters actually obtained is used instead.
- The number of valid judgment characters is 50 to 1000.
- For example, the number of valid judgment characters is 100.
- The number of valid judgment characters may also be set to other values such as 70, 150, 200, 300, 500, 700, or 1000, chosen according to the user's needs.
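The reading step might look like the sketch below. For simplicity it decodes the whole byte string and then stops collecting once the preset number is reached, whereas a real reader could stop reading the file early. The function name and the `limit` parameter are hypothetical, and `errors="replace"` is an assumption.

```python
def read_valid_chars(target_bytes: bytes, encoding: str, limit: int = 100) -> list:
    """Decode the target with one candidate encoding and collect up to `limit` valid judgment characters."""
    text = target_bytes.decode(encoding, errors="replace")
    valid = []
    for ch in text:
        # Skip English letters, digits and blank characters.
        if (ch.isascii() and ch.isalnum()) or ch.isspace():
            continue
        valid.append(ch)
        if len(valid) >= limit:
            break
    # If the document ends first, the characters actually obtained are returned.
    return valid


data = ("abc 123 " + "编码" * 10).encode("utf-8")
print(len(read_valid_chars(data, "utf-8", limit=100)))
```

Here the document holds only 20 valid judgment characters, so the call returns those 20 even though 100 were requested.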
- The garbled-character determining module 23 is configured to compare, for each encoding format, the data produced when the target document is read with that encoding format against the determined garbled pattern, and to determine the garbled characters produced when the target document is read with that encoding format.
- Specifically, the data produced when the target document is read with each encoding format is compared one by one with the garbled characters in the garbled pattern; if the garbled pattern contains the data, the data is determined to be garbled; otherwise, the data is not considered garbled.
- The non-valid judgment characters in the garbled string are deleted to obtain the valid judgment characters; the occurrences of garbled characters among the valid judgment characters are counted to obtain the garbled pattern.
- The garbled ratio produced when the target document is read with each encoding format is computed, and the encoding format with the lowest garbled ratio is used as the encoding format of the target document.
- Alternatively, the garbled ratio produced when the target document is read with each encoding format is computed, and an encoding format whose garbled ratio is lower than a preset threshold is selected as the encoding format for reading the target document.
- The garbled ratio is the proportion of garbled characters among the valid judgment characters read.
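The comparison and selection steps can be sketched together as follows. The helper names, the candidate list, and the hand-written pattern are hypothetical; treating a read with no valid judgment characters as ratio 0.0 is an assumption that the text does not address.

```python
def valid_judgment_chars(text: str, limit: int = 100) -> list:
    """Drop English letters, digits and blanks; keep at most `limit` characters."""
    out = []
    for ch in text:
        if (ch.isascii() and ch.isalnum()) or ch.isspace():
            continue
        out.append(ch)
        if len(out) >= limit:
            break
    return out


def garbled_ratio(target_bytes: bytes, encoding: str, pattern: set, limit: int = 100) -> float:
    """Share of the valid judgment characters that appear in the garbled pattern."""
    chars = valid_judgment_chars(target_bytes.decode(encoding, errors="replace"), limit)
    if not chars:
        return 0.0  # assumption: nothing to judge counts as not garbled
    return sum(ch in pattern for ch in chars) / len(chars)


def select_encoding(target_bytes: bytes, candidates: list, pattern: set, limit: int = 100):
    """Return the candidate with the lowest garbled ratio, plus all ratios."""
    ratios = {enc: garbled_ratio(target_bytes, enc, pattern, limit) for enc in candidates}
    return min(ratios, key=ratios.get), ratios


# Hypothetical pattern: characters that show up when UTF-8 Chinese text is read as Latin-1.
pattern = {"ç", "¼", "æ", "å"}
best, ratios = select_encoding("编码格式".encode("utf-8"), ["latin-1", "utf-8"], pattern)
print(best, ratios)
```

The threshold variant would instead accept any candidate whose ratio falls below a preset value, for instance by filtering `ratios` rather than taking the minimum.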
- By using the method of the present invention for selecting the encoding format of a target document, the system of this embodiment avoids a limitation of the prior art, which reads only the first few bytes of the target document and infers the encoding format from the values of those bytes; when the first few bytes do not carry the document's encoding information, the encoding format cannot be obtained that way. It is apparent that the above embodiments are merely examples given for clarity and are not intended to limit the implementations. Those skilled in the art can make other variations or modifications in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here.
- Embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage) that contain computer-usable program code.
- The computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means.
- The instruction means implements the functions specified in one or more flows of the flowchart and/or in one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Document Processing Apparatus (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/025,513 US10366143B2 (en) | 2013-09-29 | 2013-12-06 | Method and system for selecting encoding format for reading target document |
EP13894578.7A EP3051428B1 (en) | 2013-09-29 | 2013-12-06 | Method and system for selecting an encoding format for reading a target document |
JP2016517326A JP6280211B2 (ja) | 2013-09-29 | 2013-12-06 | Method and system for selecting an encoding format for reading a target document |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310456276.6A CN104516862B (zh) | 2013-09-29 | 2013-09-29 | Method and system for selecting an encoding format for reading a target document |
CN201310456276.6 | 2013-09-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015043072A1 (zh) | 2015-04-02 |
Family
ID=52741913
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2013/088745 WO2015043072A1 (zh) | 2013-09-29 | 2013-12-06 | Method and system for selecting an encoding format for reading a target document |
Country Status (5)
Country | Link |
---|---|
US (1) | US10366143B2 (zh) |
EP (1) | EP3051428B1 (zh) |
JP (1) | JP6280211B2 (zh) |
CN (1) | CN104516862B (zh) |
WO (1) | WO2015043072A1 (zh) |
2013
Date | Office | Application Number | Publication | Status |
---|---|---|---|---|
2013-09-29 | CN | CN201310456276.6A | CN104516862B | Not active: Expired - Fee Related |
2013-12-06 | US | US15/025,513 | US10366143B2 | Not active: Expired - Fee Related |
2013-12-06 | EP | EP13894578.7A | EP3051428B1 | Not active: Not-in-Force |
2013-12-06 | WO | PCT/CN2013/088745 | WO2015043072A1 | Active: Application Filing |
2013-12-06 | JP | JP2016517326A | JP6280211B2 | Not active: Expired - Fee Related |
Also Published As
Publication number | Publication date |
---|---|
JP6280211B2 (ja) | 2018-02-14 |
CN104516862B (zh) | 2018-05-01 |
EP3051428A1 (en) | 2016-08-03 |
JP2016540269A (ja) | 2016-12-22 |
EP3051428B1 (en) | 2019-08-14 |
EP3051428A4 (en) | 2017-06-07 |
US20160239467A1 (en) | 2016-08-18 |
CN104516862A (zh) | 2015-04-15 |
US10366143B2 (en) | 2019-07-30 |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 13894578; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2016517326; Country of ref document: JP; Kind code of ref document: A |
WWE | WIPO information: entry into national phase | Ref document number: 15025513; Country of ref document: US |
NENP | Non-entry into the national phase | Ref country code: DE |
REEP | Request for entry into the European phase | Ref document number: 2013894578; Country of ref document: EP |
WWE | WIPO information: entry into national phase | Ref document number: 2013894578; Country of ref document: EP |