CN113064862B

CN113064862B - File code identification method based on forward and reverse word stock and storage medium

Info

Publication number: CN113064862B
Application number: CN202110207815.7A
Authority: CN
Inventors: 刘德建; 陈丛亮; 郭玉湖
Original assignee: Fujian TQ Digital Co Ltd
Current assignee: Fujian TQ Digital Co Ltd
Priority date: 2019-04-19
Filing date: 2019-04-19
Publication date: 2022-06-07
Anticipated expiration: 2039-04-19
Also published as: CN110096481A; CN110096481B; CN113064862A; CN113064863B; CN113064863A

Abstract

The invention discloses a file encoding identification method based on forward and reverse libraries (word stocks), and a storage medium. The method comprises: collecting sample files; converting each sample file into each preset encoding and generating a forward library for each encoding; decoding each sample file with encodings other than its own file encoding to obtain garbled files, and recording the encoding conversion direction of each garbled file; generating, from each garbled file, a reverse library for its encoding conversion direction; acquiring a file to be identified; decoding the file to be identified with one encoding at a time; extracting the words and single characters of the decoded file and matching them against the corresponding forward library and reverse libraries to obtain a forward matching number and a reverse matching number; and, if the forward matching number is greater than the reverse matching number, taking that encoding as the file encoding of the file to be identified. The invention can correctly identify the file encoding.

Description

File code identification method based on forward and reverse word stock and storage medium

The present application is a divisional application of the invention patent application entitled "method for identifying a document code and computer-readable storage medium", filed on April 19, 2019, with application number 201910317628.7.

Technical Field

The present invention relates to the field of code identification technologies, and in particular, to a file code identification method and a computer-readable storage medium.

Background

Many character encodings are currently in use, so the encoding of a text file must be known when the file is opened; otherwise the file is decoded with the wrong encoding and garbled text appears.

In the prior art, the only encoding that can be detected is UTF-8 (8-bit Unicode Transformation Format, a variable-length character encoding for Unicode, also called the universal code), which is recognized from the first 3 bytes of the file. Other encodings have no such distinguishing feature, so the user has to choose the encoding used to view the file, and garbled text appears if the chosen encoding is incorrect.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a file encoding identification method and a computer-readable storage medium that can correctly identify the encoding of a file and prevent garbled text.

To solve this technical problem, the invention adopts the following technical solution: a file encoding identification method comprising the following steps:

collecting sample files, wherein the sample files comprise non-garbled text in various languages;

converting the file encoding of the sample files into each encoding in a preset encoding set, and generating a forward library corresponding to each encoding from the converted sample files;

decoding the sample files with the other encodings in the encoding set that differ from their file encoding to obtain garbled files, and recording the encoding conversion direction of each garbled file, wherein the encoding conversion direction comprises a file encoding and a decoding encoding;

generating, from each garbled file, a reverse library corresponding to the encoding conversion direction of that garbled file;

acquiring a file to be identified;

decoding the file to be identified with one encoding in the encoding set at a time;

extracting the words and single characters of the decoded file to be identified, and matching them against the forward library corresponding to the one encoding and the reverse libraries corresponding to a first encoding conversion direction, to obtain a forward matching number and a reverse matching number, wherein the decoding encoding of the first encoding conversion direction is the one encoding;

and if the forward matching number is greater than the reverse matching number, taking the one encoding as the file encoding of the file to be identified.

The invention also relates to a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps as described above.

The invention has the following beneficial effects: a forward library and a reverse library are generated by analyzing the collected sample files, and the file encoding of the file to be identified is then obtained from the results of matching that file against the forward and reverse libraries. The invention can correctly identify the encoding of a file whose encoding is unknown, effectively avoiding garbled text.

Drawings

FIG. 1 is a flow chart of a method for identifying a document code according to the present invention;

FIG. 2 is a first flowchart of a method according to a first embodiment of the present invention;

FIG. 3 is a second flowchart of a method according to a first embodiment of the present invention.

Detailed Description

In order to explain technical contents, objects and effects of the present invention in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.

The key concept of the invention is as follows: a forward library and a reverse library are generated by analyzing the collected sample files, and the file encoding of the file to be identified is then obtained from the results of matching that file against the forward and reverse libraries.

Referring to fig. 1, a method for identifying a file code includes:

acquiring a file to be identified;

As can be seen from the above description, the beneficial effects of the invention are that the file encoding can be correctly identified and garbled text is prevented.

Further, after collecting the sample files, the method further comprises:

replacing first characters in the sample files with spaces, wherein the first characters are the letters and symbols represented in ASCII.

Further, after decoding the file to be identified with one encoding in the encoding set, the method further comprises:

removing the first characters from the decoded file to be identified.

As can be seen from the above description, ASCII is the most widely shared encoding (its characters are represented identically in many encodings) and would therefore distort the subsequent matching numbers, so removing the letters and symbols represented in ASCII improves identification accuracy.
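The ASCII replacement described above can be sketched in a few lines of Python. This is only an illustration under the assumption that "letters and symbols represented in ASCII" means every character with a code point below 128; the function name `blank_ascii` is hypothetical and does not come from the patent.

```python
def blank_ascii(text: str) -> str:
    """Replace every ASCII character (code point 0-127) with a space.

    Assumption: all ASCII characters count as "first characters".
    """
    return "".join(" " if ord(ch) < 128 else ch for ch in text)


# The ASCII tail is blanked out while the Chinese characters are kept.
cleaned = blank_ascii("测试文件gb2312")
print(repr(cleaned))
```

Replacing with spaces rather than deleting keeps characters that were separated by ASCII text from being joined into a spurious two-character word.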

Further, converting the file encoding of the sample files into each encoding in the preset encoding set and generating a forward library corresponding to each encoding from the converted sample files specifically comprises:

converting the file encoding of a sample file into one encoding in the preset encoding set;

extracting all single characters in the converted sample file, and generating a forward character library corresponding to the one encoding;

extracting all pairs of two consecutive non-space characters in the converted sample file, and generating a forward word library corresponding to the one encoding.

As can be seen from the above description, the forward library corresponding to an encoding stores how each character or character combination is displayed under that encoding.
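A minimal sketch of the forward library generation described above, assuming the sample is already available as a Python string and that a library is simply a set. The name `build_forward_libraries` and the use of `errors="ignore"` for characters the target encoding cannot represent are assumptions, not details from the patent.

```python
from typing import Set, Tuple


def build_forward_libraries(sample_text: str, encoding: str) -> Tuple[Set[str], Set[str]]:
    """Return (forward character library, forward word library) for one encoding."""
    # Convert the sample into the target encoding and read it back correctly.
    converted = sample_text.encode(encoding, errors="ignore").decode(encoding)
    # Character library: every non-space character.
    chars = {ch for ch in converted if not ch.isspace()}
    # Word library: every pair of two consecutive non-space characters.
    words = {a + b for a, b in zip(converted, converted[1:])
             if not (a.isspace() or b.isspace())}
    return chars, words


forward_chars, forward_words = build_forward_libraries("测试 文件", "gb2312")
```

Because the space breaks the sequence, the pair spanning it is not recorded as a word.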

Further, generating the reverse library corresponding to the encoding conversion direction of a garbled file specifically comprises:

extracting all single characters in the garbled file, and generating a reverse character library corresponding to the encoding conversion direction of the garbled file;

extracting all pairs of two consecutive non-space characters in the garbled file, and generating a reverse word library corresponding to the encoding conversion direction of the garbled file.

As can be seen from the above description, the reverse library corresponding to an encoding conversion direction stores how each character or character combination is displayed under that encoding conversion direction.
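The reverse libraries can be sketched the same way, keyed by encoding conversion direction. The representation of a direction as a `(file_encoding, decoding_encoding)` tuple and the name `build_reverse_libraries` are illustrative assumptions.

```python
from typing import Dict, Set, Tuple

Direction = Tuple[str, str]  # (file encoding, decoding encoding), a hypothetical representation


def build_reverse_libraries(
    garbled_by_direction: Dict[Direction, str]
) -> Dict[Direction, Tuple[Set[str], Set[str]]]:
    """Map each direction to (reverse character library, reverse word library)."""
    libraries = {}
    for direction, text in garbled_by_direction.items():
        chars = {ch for ch in text if not ch.isspace()}
        words = {a + b for a, b in zip(text, text[1:])
                 if not (a.isspace() or b.isspace())}
        libraries[direction] = (chars, words)
    return libraries


# Placeholder garbled text, used only to show the shape of the result.
reverse_libs = build_reverse_libraries({("gbk", "utf-8"): "ab cd"})
```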

Further, the forward library comprises a forward character library and a forward word library, and the reverse library comprises a reverse character library and a reverse word library;

the steps from decoding the file to be identified with one encoding in the encoding set at a time, through taking the one encoding as the file encoding of the file to be identified if the forward matching number is greater than the reverse matching number, specifically comprise:

obtaining one encoding from the encoding set, and decoding the file to be identified with the one encoding;

extracting the words in the decoded file to be identified, wherein a word is two consecutive non-space characters;

matching the words against the forward word library corresponding to the one encoding to obtain a first forward matching number;

matching the words against each reverse word library corresponding to a first encoding conversion direction to obtain a first reverse matching number for each reverse word library, wherein the decoding encoding of the first encoding conversion direction is the one encoding;

summing the first reverse matching numbers of the reverse word libraries to obtain a second reverse matching number;

if the first forward matching number is greater than the second reverse matching number, taking the one encoding as the file encoding of the file to be identified;

if the first forward matching number is smaller than the second reverse matching number, obtaining the file encoding of the encoding conversion direction corresponding to the reverse word library with the largest first reverse matching number, taking that file encoding as the one encoding, and returning to the step of decoding the file to be identified with the one encoding;

if the first forward matching number and the second reverse matching number are equal and non-zero, obtaining the next encoding in the encoding set, taking it as the one encoding, and returning to the step of decoding the file to be identified with the one encoding;

if the first forward matching number and the second reverse matching number are both zero, extracting the single characters in the decoded file to be identified;

matching the single characters against the forward character library corresponding to the one encoding to obtain a second forward matching number;

matching the single characters against each reverse character library corresponding to the first encoding conversion direction to obtain a third reverse matching number for each reverse character library;

summing the third reverse matching numbers of the reverse character libraries to obtain a fourth reverse matching number;

if the second forward matching number is greater than the fourth reverse matching number, taking the one encoding as the file encoding of the file to be identified;

if the second forward matching number is smaller than the fourth reverse matching number, obtaining the file encoding of the encoding conversion direction corresponding to the reverse character library with the largest third reverse matching number, taking that file encoding as the one encoding, and returning to the step of decoding the file to be identified with the one encoding;

and if the second forward matching number and the fourth reverse matching number are equal, obtaining the next encoding in the encoding set, taking it as the one encoding, and returning to the step of decoding the file to be identified with the one encoding.

As can be seen from the above description, recording the encoding conversion direction allows the correct file encoding to be deduced more quickly, which improves identification efficiency.

The invention also proposes a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps as described above.

Example one

Referring to FIGS. 2-3, the first embodiment of the invention is a file encoding identification method comprising two parts: the first part collects sample files and generates the forward and reverse libraries, and the second part identifies the encoding of a file according to those libraries.

As shown in FIG. 2, the first part includes the following steps:

S101: Collect a preset number of sample files, wherein the sample files comprise non-garbled text in various languages, such as articles in Chinese, Japanese and so on. Because the sample files are used to generate the forward and reverse libraries, the more sample files there are, the better the identification result.

S102: Convert the file encoding of the sample files into each encoding in a preset encoding set, and generate a forward library corresponding to each encoding from the converted sample files, wherein the forward library comprises a forward character library and a forward word library.

Specifically, an encoding set may be preset that contains common file encodings such as UTF-8, GBK and GB2312. The collected sample files are then copied once for each encoding in the encoding set, so that each encoding corresponds to one copy of the sample files, and the following operations are performed on each copy:

converting the file encoding of the sample file into the one encoding, that is, converting the sample file from its original encoding to one encoding in the encoding set;

Preferably, before this step, the first characters in the sample file are replaced with spaces, the first characters being the letters and symbols represented in ASCII. For example, for a sample file whose file encoding is UTF-8, because UTF-8 encodes ASCII characters exactly as ASCII does, all characters encoded in the range 00000000 to 01111111 can be removed and replaced with spaces.

S103: Decode each sample file with the other encodings in the encoding set that differ from its file encoding to obtain garbled files, and record the encoding conversion direction of each garbled file according to the file encoding of the sample file and the encoding used to decode it.

That is, each sample file is decoded with encodings other than its own file encoding; because the file encoding and the decoding encoding differ, a garbled file is obtained. Each garbled file corresponds to one encoding conversion direction, which comprises a file encoding parameter and a decoding encoding parameter: the value of the file encoding parameter is the file encoding of the sample file, and the value of the decoding encoding parameter is the encoding used to decode the sample file.

For example, if a sample file is a GBK-encoded Chinese file and is decoded with UTF-8, then in the encoding conversion direction of the resulting garbled file the file encoding is GBK and the decoding encoding is UTF-8, and the direction can be expressed as "GBK to UTF-8".

In this step, if the file encodings of the sample files cover the whole encoding set, the recorded encoding conversion directions can be every ordered combination of two different encodings in the encoding set.
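Step S103 can be sketched as follows, assuming the sample is held as a Python string and re-encoded into each file encoding on the fly. The names `ENCODING_SET` and `make_garbled_texts`, and the use of `errors="replace"` so that decoding never aborts, are assumptions made for this illustration.

```python
from itertools import permutations
from typing import Dict, List, Tuple

ENCODING_SET: List[str] = ["utf-8", "gbk", "gb2312"]  # example encoding set


def make_garbled_texts(sample_text: str, encodings: List[str]) -> Dict[Tuple[str, str], str]:
    """Mis-decode the sample for every (file encoding, decoding encoding) pair."""
    garbled = {}
    for file_enc, decode_enc in permutations(encodings, 2):
        raw = sample_text.encode(file_enc, errors="ignore")
        # Decoding with the wrong encoding produces the garbled text for this direction.
        garbled[(file_enc, decode_enc)] = raw.decode(decode_enc, errors="replace")
    return garbled


garbled_texts = make_garbled_texts("测试文件", ENCODING_SET)
print(sorted(garbled_texts))  # the recorded encoding conversion directions
```

With three encodings this records six directions. Note that closely related encodings can decode each other's bytes without visible garbling for some samples, so a direction's "garbled" text is not always different from the original.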

S104: Generate, from each garbled file, a reverse library corresponding to the encoding conversion direction of that garbled file. Specifically, for a garbled file, all single characters in it are extracted to generate a reverse character library corresponding to its encoding conversion direction, and all pairs of two consecutive non-space characters in it are extracted to generate a reverse word library corresponding to its encoding conversion direction. That is, each encoding conversion direction corresponds to one reverse character library and one reverse word library.

Further, if the garbled file contains characters outside the range that the decoding encoding can represent, these characters are added to the reverse character library corresponding to the encoding conversion direction of the garbled file. For example, the code range of GB2312 is hexadecimal A1A1 to FEFE; for a character coded A1A0, no corresponding GB2312 character can be found, so A1A0 is recorded directly in the reverse character library.
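One possible way to keep such out-of-range codes in Python is the standard `surrogateescape` error handler, which turns each undecodable byte into a lone surrogate character instead of raising an error. This is an assumed technique for illustration, not something the patent specifies, and it records single bytes rather than byte pairs such as A1A0; the function name is hypothetical.

```python
from typing import Set, Tuple


def decode_keeping_invalid_bytes(raw: bytes, decoding: str) -> Tuple[str, Set[str]]:
    """Decode raw bytes, keeping undecodable bytes as surrogate characters."""
    text = raw.decode(decoding, errors="surrogateescape")
    # Undecodable byte 0xNN appears as the code point U+DCNN.
    out_of_range = {ch for ch in text if "\udc80" <= ch <= "\udcff"}
    return text, out_of_range


data = bytes.fromhex("b2e2cad4")  # GB2312 bytes that are not valid UTF-8
decoded, extra_chars = decode_keeping_invalid_bytes(data, "utf-8")
```

The characters in `extra_chars` could then be added to the reverse character library of the corresponding direction, and the original bytes remain recoverable by encoding with the same error handler.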

The above steps complete the generation of the forward and reverse libraries. The file encoding of a file to be identified is then identified, as shown in FIG. 3, by the following steps:

S201: Acquire the file to be identified.

S202: a code is obtained from the set of codes.

S203: Decode the file to be identified with the one encoding. Note that each time a new encoding is used, it is the original file to be identified that is decoded, so several copies of the file may be made in advance, after step S201 and before this step.

Preferably, after this step, the first characters are removed from the decoded file to be identified, that is, the letters and symbols represented in ASCII are removed.

S204: Extract the words in the decoded file to be identified, wherein a word is two consecutive non-space characters.

S205: Match the words against the forward word library corresponding to the one encoding to obtain a first forward matching number. That is, the extracted words are looked up in the forward word library corresponding to the one encoding, and the number of words found is the first forward matching number.

S206: Match the words against each reverse word library corresponding to a first encoding conversion direction to obtain a first reverse matching number for each reverse word library, wherein the decoding encoding of the first encoding conversion direction is the one encoding; then sum the first reverse matching numbers of the reverse word libraries to obtain a second reverse matching number.

The second reverse matching number is the number of matches of the words across all reverse word libraries corresponding to the first encoding conversion direction. Specifically, every encoding conversion direction whose decoding encoding is the one encoding is obtained, the reverse word library of each of these directions is obtained, and the words are looked up in each of these reverse word libraries; the number of words found in a given reverse word library is its first reverse matching number, and the first reverse matching numbers of all these reverse word libraries are summed to obtain the second reverse matching number.
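The computation in step S206 can be sketched as below, assuming reverse word libraries are sets keyed by `(file_encoding, decoding_encoding)` tuples and that every occurrence of a word counts as a match. Both choices, and the function name, are assumptions; the text does not say whether repeated words are counted once or several times.

```python
from typing import Dict, List, Set, Tuple

Direction = Tuple[str, str]  # (file encoding, decoding encoding)


def reverse_match_counts(
    words: List[str],
    reverse_word_libs: Dict[Direction, Set[str]],
    decoding: str,
) -> Tuple[Dict[Direction, int], int]:
    """Return the first reverse matching number per library and their sum."""
    counts = {
        direction: sum(1 for w in words if w in library)
        for direction, library in reverse_word_libs.items()
        if direction[1] == decoding  # only directions decoded with the current encoding
    }
    return counts, sum(counts.values())


# Placeholder libraries, used only to show the arithmetic.
libs = {
    ("gbk", "utf-8"): {"xy"},
    ("gb2312", "utf-8"): {"xy", "zz"},
    ("utf-8", "gbk"): {"xy"},
}
per_library, second_reverse = reverse_match_counts(["xy", "zz", "qq"], libs, "utf-8")
```

Here the library for `("utf-8", "gbk")` is skipped because its decoding encoding is not the current one.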

S207: Determine whether the first forward matching number is greater than the second reverse matching number; if so, go to step S215; if not, go to step S208.

S208: Determine whether the first forward matching number is smaller than the second reverse matching number; if so, go to step S209; otherwise the two numbers are equal, and step S210 is performed.

S209: Obtain the file encoding of the encoding conversion direction corresponding to the reverse word library with the largest first reverse matching number, take that file encoding as the one encoding, and return to step S203.

In step S206 the words were matched against every reverse word library whose encoding conversion direction has the one encoding as its decoding encoding, giving a first reverse matching number for each. This step first selects, from those reverse word libraries, the one with the largest first reverse matching number, that is, the one in which the most words matched; it then obtains the file encoding of the corresponding encoding conversion direction (whose decoding encoding is the one encoding), takes that file encoding as the new encoding for decoding the file to be identified, and returns to step S203, so that the file is next decoded with that file encoding.

S210: Determine whether the first forward matching number and the second reverse matching number are both zero. If so, nothing matched in either the forward word library or the reverse word libraries, so matching is performed with the forward character library and the reverse character libraries, and step S211 is performed. Otherwise the two numbers are equal but non-zero; the next encoding is obtained from the encoding set to decode the file to be identified, that is, the method returns to step S202.

S211: Extract the single characters in the decoded file to be identified.

S212: Match the single characters against the forward character library corresponding to the one encoding to obtain a second forward matching number. That is, the extracted single characters are looked up in that forward character library, and the number of characters found is the second forward matching number.

S213: Match the single characters against each reverse character library corresponding to the first encoding conversion direction to obtain a third reverse matching number for each reverse character library, and sum the third reverse matching numbers of the reverse character libraries to obtain a fourth reverse matching number.

The fourth reverse matching number is the number of matches of the single characters across all reverse character libraries corresponding to the first encoding conversion direction. Specifically, every encoding conversion direction whose decoding encoding is the one encoding is obtained, the reverse character library of each of these directions is obtained, and the single characters are looked up in each of these reverse character libraries; the number of characters found in a given reverse character library is its third reverse matching number, and the third reverse matching numbers of all these reverse character libraries are summed to obtain the fourth reverse matching number.

S214: Determine whether the second forward matching number is greater than the fourth reverse matching number; if so, go to step S215; if not, go to step S216.

S215: Take the one encoding as the file encoding of the file to be identified, that is, judge that the file encoding of the file to be identified is the one encoding.

S216: Determine whether the second forward matching number is smaller than the fourth reverse matching number; if so, go to step S217; otherwise the two numbers are equal, in which case the next encoding is obtained from the encoding set to decode the file to be identified, that is, the method returns to step S202.

S217: Obtain the file encoding of the encoding conversion direction corresponding to the reverse character library with the largest third reverse matching number, take that file encoding as the one encoding, and return to step S203.

In step S213 the single characters were matched against every reverse character library whose encoding conversion direction has the one encoding as its decoding encoding, giving a third reverse matching number for each. This step first selects, from those reverse character libraries, the one with the largest third reverse matching number, that is, the one in which the most single characters matched; it then obtains the file encoding of the corresponding encoding conversion direction (whose decoding encoding is the one encoding), takes that file encoding as the new encoding for decoding the file to be identified, and returns to step S203, so that the file is decoded with that file encoding.
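Steps S201 to S217 can be put together in a compact sketch. Everything about the data layout is an assumption: libraries are `(character set, word set)` tuples, directions are `(file_encoding, decoding_encoding)` tuples, undecodable bytes become U+FFFD via `errors="replace"`, and a `max_rounds` guard (not in the patent) stops the loop if two encodings keep pointing at each other. The names `detect_encoding` and `make_library` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Library = Tuple[Set[str], Set[str]]   # (character library, word library)
Direction = Tuple[str, str]           # (file encoding, decoding encoding)


def make_library(text: str) -> Library:
    chars = {c for c in text if not c.isspace()}
    words = {a + b for a, b in zip(text, text[1:]) if not (a.isspace() or b.isspace())}
    return chars, words


def detect_encoding(raw: bytes, encodings: List[str], forward: Dict[str, Library],
                    reverse: Dict[Direction, Library], max_rounds: int = 20) -> Optional[str]:
    pending = list(encodings)
    code = pending.pop(0)                                          # S202
    for _ in range(max_rounds):                                    # cycle guard (an addition)
        text = raw.decode(code, errors="replace")                  # S203
        text = "".join(" " if ord(c) < 128 else c for c in text)   # drop ASCII
        words = [a + b for a, b in zip(text, text[1:])
                 if not (a.isspace() or b.isspace())]              # S204
        chars = [c for c in text if not c.isspace()]               # S211
        jump = None
        for level, units in ((1, words), (0, chars)):              # words first, then characters
            fwd = sum(u in forward[code][level] for u in units)    # S205 / S212
            rev = {d: sum(u in lib[level] for u in units)
                   for d, lib in reverse.items() if d[1] == code}  # S206 / S213
            total = sum(rev.values())
            if fwd > total:
                return code                                        # S215
            if fwd < total:
                jump = max(rev, key=rev.get)[0]                    # S209 / S217
                break
            if fwd != 0:
                break                                              # equal, non-zero: next encoding
        if jump is None:
            if not pending:
                return None
            jump = pending.pop(0)
        elif jump in pending:
            pending.remove(jump)
        code = jump
    return None


sample = "测试文件"
encs = ["utf-8", "gb2312"]
forward_libs = {e: make_library(sample) for e in encs}
reverse_libs = {(f, d): make_library(sample.encode(f).decode(d, errors="replace"))
                for f in encs for d in encs if f != d}
print(detect_encoding("测试文件gb2312".encode("gb2312"), encs, forward_libs, reverse_libs))
```

In this toy setup the libraries come from a single four-character sample, so the result is only meaningful for inputs that share its vocabulary; a real library would be built from many sample files, as step S101 notes.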

Further, because UTF-8 is the more widely used encoding, preferably only the forward library corresponding to UTF-8 is generated in step S102. In step S103, only the UTF-8-encoded sample files may be decoded with the other encodings, and the sample files in other encodings may be decoded only with UTF-8, so that every encoding conversion direction includes UTF-8. In step S202, UTF-8 is obtained first.
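The restriction to directions that involve UTF-8 can be illustrated with a small filter. The function name and the tuple representation of a direction are assumptions for this sketch.

```python
from itertools import permutations
from typing import List, Tuple


def directions_through_utf8(encodings: List[str]) -> List[Tuple[str, str]]:
    """Keep only (file encoding, decoding encoding) pairs that include UTF-8."""
    return [(f, d) for f, d in permutations(encodings, 2) if "utf-8" in (f, d)]


utf8_directions = directions_through_utf8(["utf-8", "gbk", "gb2312"])
print(utf8_directions)
```

For an encoding set of size n this keeps 2(n - 1) directions instead of n(n - 1), which reduces the number of reverse libraries to build and search.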

Two specific examples of the present embodiment are given below.

Suppose the content of file 1 to be identified (encoded in GB2312) is "测试文件gb2312" ("test file gb2312"). It is first decoded with UTF-8, giving: [B2][E2][CA][D4][CE][C4][BC][FE]gb2312; after the letters and symbols represented in ASCII are removed: [B2][E2][CA][D4][CE][C4][BC][FE]. These characters are then matched against the forward library corresponding to UTF-8 and the reverse libraries of the encoding conversion directions whose decoding encoding is UTF-8. The number of characters matched in the forward library is 0 and the number matched in the reverse libraries is 8; because the forward matching number is smaller than the reverse matching number, the file encoding of file 1 is judged not to be UTF-8. Analysis then shows that the encoding conversion direction of the reverse library with the most matches is "GB2312 to UTF-8", so file 1 is decoded again with GB2312, giving "测试文件gb2312"; after the letters and symbols represented in ASCII are removed: "测试文件". These characters are then matched against the forward library corresponding to GB2312 and the reverse libraries of the encoding conversion directions whose decoding encoding is GB2312. The number of words matched in the forward library is 2 ("测试", test, and "文件", file) and the number matched in the reverse libraries is 0, so the file encoding of file 1 is judged to be GB2312.
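The byte values in this example can be reproduced with Python's built-in codecs. The snippet assumes the example text is "测试文件gb2312", whose four Chinese characters encode in GB2312 to the eight bytes listed above; it only demonstrates the encoding facts, not the library matching.

```python
text = "测试文件gb2312"
raw = text.encode("gb2312")

# Bytes of the four Chinese characters: B2 E2 CA D4 CE C4 BC FE
print(raw[:8].hex(" ").upper())

# A strict UTF-8 decode of these bytes fails, which is why a plain decoder
# cannot be used as-is and the invalid bytes need special handling.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

restored = raw.decode("gb2312")  # decoding with the right encoding restores the text
```

A real UTF-8 decoder may also group some of these bytes into a valid multi-byte sequence, so the count of unmatched units it produces can differ slightly from the byte-by-byte illustration in the text.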

Similarly, suppose the content of file 2 to be identified (encoded in UTF-8) is "测试文件UTF8" ("test file UTF8"). Decoding it with GB2312 gives garbled characters followed by "UTF8". Lookup shows that the matching number in the forward word library corresponding to GB2312 is 0 and the matching number in the reverse word libraries whose decoding encoding is GB2312 is 2 (two garbled two-character words), and that the encoding conversion direction of the matching reverse word library is "UTF-8 to GB2312". File 2 is therefore decoded again with UTF-8, giving "测试文件UTF8". Lookup shows that the matching number in the forward word library corresponding to UTF-8 is 2 ("测试" and "文件") and the matching number in the reverse word libraries is 0, so the file encoding of file 2 is judged to be UTF-8.
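The garbling in this second example can also be reproduced. As an assumption, the sketch decodes with Python's `gbk` codec, a superset of GB2312, because the strict `gb2312` codec rejects some of the UTF-8 bytes outright; the example text "测试文件UTF8" is likewise assumed.

```python
original = "测试文件UTF8"
raw = original.encode("utf-8")

# Twelve UTF-8 bytes read as six two-byte GBK characters, then the ASCII tail.
garbled = raw.decode("gbk")
print(garbled)

# Decoding with the right encoding restores the text.
restored = raw.decode("utf-8")
```

The garbled result is deterministic, which is what makes a reverse library useful: the same wrong decoding of similar text keeps producing the same characteristic character pairs.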

This embodiment can intelligently identify the correct encoding of a file that carries no encoding marker (such as file 1 to be identified), and, by recording the encoding conversion direction, can deduce the correct file encoding more quickly, which improves identification efficiency. In addition, removing the letters and symbols represented in ASCII improves identification accuracy.

Example two

This embodiment is a computer-readable storage medium corresponding to the above embodiment, on which a computer program is stored, the program, when executed by a processor, implementing the steps of:

acquiring a file to be identified;

Further, after the acquiring the sample file, the method further comprises:

and eliminating the first character in the decoded file to be recognized.

converting the file code of the sample file into a code in a preset code set;

and acquiring all the two continuous non-blank characters in the converted sample file, and generating a forward word bank corresponding to the code.

the step of sequentially decoding the file to be identified by one code in the code set to the step of taking the one code as the file code of the file to be identified if the forward matching number is greater than the reverse matching number specifically includes:

matching the single character with the forward character library corresponding to the code to obtain a second forward matching number;

In summary, the file encoding identification method and the computer-readable storage medium provided by the invention generate a forward library and a reverse library by analyzing the collected sample files, and then obtain the file encoding of the file to be identified from the results of matching that file against the forward and reverse libraries. The invention can correctly identify the encoding of a file whose encoding is unknown, effectively avoiding garbled text; removing the letters and symbols represented in ASCII improves identification accuracy; and recording the encoding conversion direction allows the correct file encoding to be deduced more quickly, improving identification efficiency.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims

1. A file coding identification method based on a forward and reverse word stock is characterized by comprising the following steps:

acquiring a file to be identified;

if the forward matching number is larger than the reverse matching number, the code is used as the file code of the file to be identified;

the step of converting the file encoding of the sample files into each encoding in a preset encoding set and generating a forward library corresponding to each encoding from the converted sample files specifically comprises:

converting the file encoding of a sample file into one encoding in the preset encoding set;

extracting all pairs of two consecutive non-space characters in the converted sample file, and generating a forward word library corresponding to the one encoding;

the step of generating a reverse library corresponding to the encoding conversion direction of the garbled file specifically comprises:

extracting all pairs of two consecutive non-space characters in the garbled file, and generating a reverse word library corresponding to the encoding conversion direction of the garbled file;

2. The method for identifying file codes based on the forward and reverse word library according to claim 1, wherein after the sample file is collected, the method further comprises:

3. The method for identifying files based on the forward/reverse word library according to claim 2, wherein after decoding the file to be identified by a code in the code set, the method further comprises:

and eliminating the first character in the decoded file to be recognized.

4. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1-3.