CN113064862B - File code identification method based on forward and reverse word stock and storage medium - Google Patents

Info

Publication number
CN113064862B
Authority
CN
China
Prior art keywords
code
file
reverse
matching number
codes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110207815.7A
Other languages
Chinese (zh)
Other versions
CN113064862A (en)
Inventor
刘德建
陈丛亮
郭玉湖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian TQ Digital Co Ltd
Original Assignee
Fujian TQ Digital Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian TQ Digital Co Ltd
Priority to CN202110207815.7A
Publication of CN113064862A
Application granted
Publication of CN113064862B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70 Type of the data to be coded, other than image and sound
    • H03M7/705 Unicode

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a file encoding identification method based on forward and reverse word stocks, and a storage medium. The method comprises the following steps: collecting sample files; converting each sample file into each preset encoding and generating a forward word stock corresponding to each encoding; decoding each sample file with encodings other than its own file encoding to obtain garbled files, and recording the encoding conversion direction of each garbled file; generating, from each garbled file, a reverse word stock corresponding to its encoding conversion direction; acquiring a file to be identified; decoding the file to be identified with one encoding at a time; extracting words and single characters from the decoded file and matching them against the corresponding forward word stock and reverse word stocks to obtain a forward matching number and a reverse matching number; and, if the forward matching number is greater than the reverse matching number, taking that encoding as the file encoding of the file to be identified. The invention can correctly identify the file encoding.

Description

File code identification method based on forward and reverse word stock and storage medium
The present application is a divisional application of the invention patent application entitled "Method for identifying a file encoding and computer-readable storage medium", filed on April 19, 2019, with application number 201910317628.7.
Technical Field
The present invention relates to the field of encoding identification, and in particular to a file encoding identification method and a computer-readable storage medium.
Background
Many character encodings are currently in use, so the encoding of a text file must be known when the file is opened; otherwise the file is decoded with the wrong encoding and garbled text appears.
In the prior art, the only check that can be made is whether a file is encoded in UTF-8 (8-bit Unicode Transformation Format, a variable-length character encoding for Unicode, the universal character code), which is judged from the first 3 bytes of the file. Other encodings have no such obvious signature, so the user has to choose an encoding for viewing the file, and garbled text appears if the chosen encoding is incorrect.
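The 3-byte check described above presumably refers to the UTF-8 byte order mark (bytes EF BB BF) that some tools write at the start of a file. A minimal Python sketch of such a check follows; the function name `has_utf8_bom` is hypothetical and is not taken from the patent.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    # codecs.BOM_UTF8 is the 3-byte sequence b"\xef\xbb\xbf".
    return data.startswith(codecs.BOM_UTF8)


with_bom = codecs.BOM_UTF8 + "测试".encode("utf-8")
gbk_bytes = "测试".encode("gbk")
print(has_utf8_bom(with_bom), has_utf8_bom(gbk_bytes))
```

A file that lacks the mark may still be UTF-8, so a check of this kind cannot decide the encoding on its own.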
Disclosure of Invention
The technical problem to be solved by the invention is to provide a file encoding identification method and a computer-readable storage medium that can correctly identify the encoding of a file and prevent garbled text.
To solve this technical problem, the invention adopts the following technical solution: a file encoding identification method comprising the following steps:
collecting sample files, wherein the sample files comprise non-garbled text in various languages;
converting each sample file from its file encoding into each encoding in a preset encoding set, and generating a forward word stock corresponding to each encoding from the converted sample files;
decoding each sample file with the encodings in the encoding set other than its file encoding to obtain garbled files, and recording the encoding conversion direction of each garbled file, wherein the encoding conversion direction comprises a file encoding and a decoding encoding;
generating, from each garbled file, a reverse word stock corresponding to the encoding conversion direction of that garbled file;
acquiring a file to be identified;
decoding the file to be identified with one encoding from the encoding set at a time;
extracting words and single characters from the decoded file to be identified, and matching them against the forward word stock corresponding to the one encoding and against the reverse word stocks corresponding to a first encoding conversion direction to obtain a forward matching number and a reverse matching number, wherein the decoding encoding of the first encoding conversion direction is the one encoding;
and, if the forward matching number is greater than the reverse matching number, taking the one encoding as the file encoding of the file to be identified.
The invention also relates to a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program carries out the steps described above.
The beneficial effects of the invention are as follows: the collected sample files are analyzed and processed to generate forward and reverse word stocks, and the file encoding of the file to be identified is then obtained from the results of matching that file against the forward and reverse word stocks. The invention can correctly identify the encoding of a file whose encoding is unknown and effectively avoids garbled text.
Drawings
FIG. 1 is a flowchart of a file encoding identification method according to the present invention;
FIG. 2 is a first flowchart of the method according to the first embodiment of the present invention;
FIG. 3 is a second flowchart of the method according to the first embodiment of the present invention.
Detailed Description
To explain the technical content, objects, and effects of the present invention in detail, the following description is given with reference to the embodiments and the accompanying drawings.
The key concept of the invention is as follows: the collected sample files are analyzed and processed to generate forward and reverse word stocks, and the file encoding of the file to be identified is then obtained from the results of matching that file against the forward and reverse word stocks.
Referring to FIG. 1, a file encoding identification method comprises:
collecting sample files, wherein the sample files comprise non-garbled text in various languages;
converting each sample file from its file encoding into each encoding in a preset encoding set, and generating a forward word stock corresponding to each encoding from the converted sample files;
decoding each sample file with the encodings in the encoding set other than its file encoding to obtain garbled files, and recording the encoding conversion direction of each garbled file, wherein the encoding conversion direction comprises a file encoding and a decoding encoding;
generating, from each garbled file, a reverse word stock corresponding to the encoding conversion direction of that garbled file;
acquiring a file to be identified;
decoding the file to be identified with one encoding from the encoding set at a time;
extracting words and single characters from the decoded file to be identified, and matching them against the forward word stock corresponding to the one encoding and against the reverse word stocks corresponding to a first encoding conversion direction to obtain a forward matching number and a reverse matching number, wherein the decoding encoding of the first encoding conversion direction is the one encoding;
and, if the forward matching number is greater than the reverse matching number, taking the one encoding as the file encoding of the file to be identified.
As can be seen from the above, the beneficial effects of the present invention are that the file encoding can be correctly identified and garbled text is prevented.
Further, after the sample files are collected, the method further comprises:
replacing first characters in the sample files with spaces, wherein the first characters are the letters and symbols represented by ASCII codes.
Further, after the file to be identified is decoded with one encoding in the encoding set, the method further comprises:
removing the first characters from the decoded file to be identified.
As can be seen from the above, ASCII characters are shared by most encodings and would therefore distort the subsequent matching numbers, so removing the letters and symbols represented by ASCII codes improves the identification accuracy.
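The ASCII replacement step can be sketched in Python as follows. This is a minimal illustration, assuming that "letters and symbols represented by ASCII codes" means every code point in the range 0x00 to 0x7F; the function name `replace_ascii_with_spaces` is hypothetical.

```python
import re

# Assumption: all code points 0x00-0x7F count as "first characters".
_ASCII = re.compile(r"[\x00-\x7f]")


def replace_ascii_with_spaces(text: str) -> str:
    """Replace each ASCII character with a single space."""
    return _ASCII.sub(" ", text)


cleaned = replace_ascii_with_spaces("测试文件gb2312")
print(repr(cleaned))
print(repr(cleaned.strip()))
```

Replacing with spaces rather than deleting keeps characters that were separated by ASCII text from being joined into a spurious two-character word.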
Further, converting each sample file from its file encoding into each encoding in the preset encoding set and generating a forward word stock corresponding to each encoding specifically comprises:
converting the file encoding of the sample file into one encoding in the preset encoding set;
acquiring all single characters in the converted sample file and generating a forward character library corresponding to the one encoding;
and acquiring every pair of two consecutive non-space characters in the converted sample file and generating a forward word library corresponding to the one encoding.
As can be seen from the above, the forward word stock corresponding to an encoding stores how each character and each character combination appears under that encoding.
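The extraction of single characters and two-character words can be sketched in Python as below. The function name `chars_and_words` is hypothetical, and treating any Unicode whitespace as a "space" is an assumption.

```python
from typing import Set, Tuple


def chars_and_words(text: str) -> Tuple[Set[str], Set[str]]:
    """Return (character library, word library) for decoded text.

    A word is two consecutive non-whitespace characters.
    """
    chars = {c for c in text if not c.isspace()}
    words = {
        a + b
        for a, b in zip(text, text[1:])
        if not (a.isspace() or b.isspace())
    }
    return chars, words


chars, words = chars_and_words("测试 文件")
print(sorted(chars), sorted(words))
```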
Further, generating the reverse word stock corresponding to the encoding conversion direction of a garbled file specifically comprises:
acquiring all single characters in the garbled file and generating a reverse character library corresponding to the encoding conversion direction of the garbled file;
and acquiring every pair of two consecutive non-space characters in the garbled file and generating a reverse word library corresponding to the encoding conversion direction of the garbled file.
As can be seen from the above, the reverse word stock corresponding to an encoding conversion direction stores how each character and each character combination appears under that encoding conversion direction.
Further, the forward word stock comprises a forward character library and a forward word library, and the reverse word stock comprises a reverse character library and a reverse word library;
the steps from decoding the file to be identified with one encoding from the encoding set at a time through taking the one encoding as the file encoding of the file to be identified if the forward matching number is greater than the reverse matching number specifically comprise:
acquiring one encoding from the encoding set, and decoding the file to be identified with the one encoding;
acquiring the words in the decoded file to be identified, wherein a word is two consecutive non-space characters;
matching the words against the forward word library corresponding to the one encoding to obtain a first forward matching number;
matching the words against each reverse word library corresponding to a first encoding conversion direction to obtain a first reverse matching number for each reverse word library, wherein the decoding encoding of the first encoding conversion direction is the one encoding;
adding up the first reverse matching numbers of the reverse word libraries to obtain a second reverse matching number;
if the first forward matching number is greater than the second reverse matching number, taking the one encoding as the file encoding of the file to be identified;
if the first forward matching number is smaller than the second reverse matching number, acquiring the file encoding of the encoding conversion direction corresponding to the reverse word library with the largest first reverse matching number, taking that file encoding as the one encoding, and returning to the step of decoding the file to be identified with the one encoding;
if the first forward matching number and the second reverse matching number are equal and non-zero, acquiring the next encoding in the encoding set, taking it as the one encoding, and returning to the step of decoding the file to be identified with the one encoding;
if the first forward matching number and the second reverse matching number are both zero, acquiring the single characters in the decoded file to be identified;
matching the single characters against the forward character library corresponding to the one encoding to obtain a second forward matching number;
matching the single characters against each reverse character library corresponding to the first encoding conversion direction to obtain a third reverse matching number for each reverse character library;
adding up the third reverse matching numbers of the reverse character libraries to obtain a fourth reverse matching number;
if the second forward matching number is greater than the fourth reverse matching number, taking the one encoding as the file encoding of the file to be identified;
if the second forward matching number is smaller than the fourth reverse matching number, acquiring the file encoding of the encoding conversion direction corresponding to the reverse character library with the largest third reverse matching number, taking that file encoding as the one encoding, and returning to the step of decoding the file to be identified with the one encoding;
and, if the second forward matching number and the fourth reverse matching number are equal, acquiring the next encoding in the encoding set, taking it as the one encoding, and returning to the step of decoding the file to be identified with the one encoding.
As can be seen from the above, recording the encoding conversion direction allows the correct file encoding to be deduced in reverse more quickly, which improves identification efficiency.
The invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program carries out the steps described above.
Example one
Referring to FIGS. 2 and 3, the first embodiment of the present invention is a file encoding identification method comprising two parts: the first part collects sample files and generates the forward and reverse word stocks, and the second part identifies the encoding of a file according to those word stocks.
As shown in fig. 2, the first part includes the following steps:
S101: Collect a preset number of sample files, wherein the sample files comprise non-garbled text in various languages, such as articles in Chinese, Japanese, and so on. Because the sample files are used to generate the forward and reverse word stocks, the more sample files there are, the better the identification results.
S102: Convert each sample file from its file encoding into each encoding in a preset encoding set, and generate a forward word stock corresponding to each encoding from the converted sample files, wherein the forward word stock comprises a forward character library and a forward word library.
Specifically, an encoding set may be preset that contains common file encodings such as UTF-8, GBK, and GB2312. The collected sample files are then copied once for each encoding in the set, so that every encoding in the set has its own copy of each sample file, and the following operations are performed on each copy:
converting the file encoding of the sample file into one encoding, that is, converting the sample file from its original encoding into one encoding in the encoding set;
acquiring all single characters in the converted sample file and generating a forward character library corresponding to the one encoding;
and acquiring every pair of two consecutive non-space characters in the converted sample file and generating a forward word library corresponding to the one encoding.
Preferably, before this step, the first characters in the sample file are replaced with spaces, the first characters being the letters and symbols represented by ASCII codes. For example, for a sample file whose file encoding is UTF-8, because UTF-8 encodes ASCII characters with the same single-byte values as ASCII, all characters whose encoded value lies in the range 00000000 to 01111111 can be removed and replaced with spaces.
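Step S102 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: "converting" a sample into an encoding is modelled as an encode/decode round trip that silently drops characters the target encoding cannot represent, and the name `forward_libraries` is hypothetical.

```python
import re
from typing import Dict, Iterable, Set, Tuple

_ASCII = re.compile(r"[\x00-\x7f]")


def forward_libraries(
    sample: str, encodings: Iterable[str]
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Map each encoding to (forward character library, forward word library)."""
    cleaned = _ASCII.sub(" ", sample)  # ASCII letters and symbols become spaces
    libs = {}
    for enc in encodings:
        # Assumption: characters not representable in `enc` are dropped.
        converted = cleaned.encode(enc, errors="ignore").decode(enc)
        chars = {c for c in converted if not c.isspace()}
        words = {
            a + b
            for a, b in zip(converted, converted[1:])
            if not (a.isspace() or b.isspace())
        }
        libs[enc] = (chars, words)
    return libs


libs = forward_libraries("测试 文件 abc", ["utf-8", "gbk", "gb2312"])
for enc, (chars, words) in libs.items():
    print(enc, sorted(chars), sorted(words))
```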
S103: Decode each sample file with the encodings in the encoding set other than its file encoding to obtain garbled files, and record the encoding conversion direction of each garbled file according to the file encoding of the sample file and the encoding used to decode it.
That is, each sample file is decoded with encodings other than its own file encoding. Because the file encoding differs from the decoding encoding, the result is a garbled file. Each garbled file corresponds to one encoding conversion direction, which comprises a file encoding parameter and a decoding encoding parameter: the value of the file encoding parameter is the file encoding of the sample file, and the value of the decoding encoding parameter is the encoding used to decode the sample file.
For example, if a sample file is a GBK-encoded Chinese file and is decoded as UTF-8, then for the resulting garbled file the file encoding is GBK and the decoding encoding is UTF-8, and the encoding conversion direction can be written as "GBK to UTF-8".
In this step, if the file encodings of the sample files cover the whole encoding set, the recorded encoding conversion directions can be every combination of two encodings in the encoding set.
S104: Generate, from each garbled file, a reverse word stock corresponding to its encoding conversion direction. Specifically, for a garbled file, all single characters in the file are acquired and a reverse character library corresponding to the encoding conversion direction of the file is generated; every pair of two consecutive non-space characters in the file is acquired and a reverse word library corresponding to the encoding conversion direction of the file is generated. That is, each encoding conversion direction corresponds to one reverse character library and one reverse word library.
Further, if the garbled file contains characters outside the range that the decoding encoding can represent, these characters are added to the reverse word stock corresponding to the encoding conversion direction of the garbled file. For example, the code range of GB2312 is hexadecimal A1A1 to FEFE; if a character is A1A0, no corresponding GB2312 character can be found, so A1A0 is recorded directly in the reverse character library.
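Steps S103 and S104 can be sketched together in Python. The sketch keys each reverse library by a `(file encoding, decoding encoding)` tuple, which is one possible representation of the encoding conversion direction. As an assumption that differs from the text above, undecodable bytes are turned into U+FFFD by `errors="replace"` instead of being recorded as raw byte values; the name `reverse_libraries` is hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Direction = Tuple[str, str]  # (file encoding, decoding encoding)


def reverse_libraries(
    sample: str, encodings: List[str]
) -> Dict[Direction, Tuple[Set[str], Set[str]]]:
    """Map each direction to (reverse character library, reverse word library)."""
    libs = {}
    for file_enc, decode_enc in permutations(encodings, 2):
        raw = sample.encode(file_enc, errors="ignore")
        # Decoding with the wrong encoding yields the garbled text.
        garbled = raw.decode(decode_enc, errors="replace")
        chars = {c for c in garbled if not c.isspace()}
        words = {
            a + b
            for a, b in zip(garbled, garbled[1:])
            if not (a.isspace() or b.isspace())
        }
        libs[(file_enc, decode_enc)] = (chars, words)
    return libs


libs = reverse_libraries("测试文件", ["utf-8", "gbk"])
for direction, (chars, words) in libs.items():
    print(direction, len(chars), len(words))
```

Note that two closely related encodings (for example GB2312 and GBK) can decode the same bytes to identical text, in which case the "garbled" file is not actually garbled; the sketch does not handle that case specially.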
The above steps complete the generation of the forward and reverse word stocks. The encoding of a file to be identified is then identified, as shown in FIG. 3, by the following steps:
S201: Acquire the file to be identified.
S202: Acquire one encoding from the encoding set.
S203: Decode the file to be identified with the one encoding. Each time a new encoding is used, it is the original file to be identified that is decoded, so several copies of the file may be made in advance, after step S201 and before this step.
Preferably, after this step, the first characters are removed from the decoded file to be identified, that is, the letters and symbols represented by ASCII codes are removed.
S204: Acquire the words in the decoded file to be identified, wherein a word is two consecutive non-space characters.
S205: Match the words against the forward word library corresponding to the one encoding to obtain a first forward matching number. That is, the acquired words are looked up in the forward word library corresponding to the one encoding, and the number of words found is the first forward matching number.
S206: Match the words against each reverse word library corresponding to a first encoding conversion direction to obtain a first reverse matching number for each reverse word library, wherein the decoding encoding of the first encoding conversion direction is the one encoding; then add up the first reverse matching numbers of the reverse word libraries to obtain a second reverse matching number.
The second reverse matching number is the number of matches of the words across all reverse word libraries corresponding to the first encoding conversion directions. Specifically, every encoding conversion direction whose decoding encoding is the one encoding is acquired, then the reverse word library corresponding to each of these directions is acquired, and the words are looked up in each of these reverse word libraries. The number of words found in a given reverse word library is the first reverse matching number of that library, and the first reverse matching numbers of all these libraries are finally summed to give the second reverse matching number.
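The counting in steps S205 and S206 can be sketched as below. The libraries here are toy placeholders with ASCII strings standing in for real words, every name is hypothetical, and counting each occurrence (rather than each distinct word) is an assumption, since the text does not say which is meant.

```python
from typing import Dict, List, Set, Tuple

Direction = Tuple[str, str]  # (file encoding, decoding encoding)


def count_matches(items: List[str], library: Set[str]) -> int:
    """Number of items found in the library (occurrences, not distinct items)."""
    return sum(1 for item in items if item in library)


def reverse_match_counts(
    words: List[str],
    reverse_word_libs: Dict[Direction, Set[str]],
    decode_enc: str,
) -> Dict[Direction, int]:
    """First reverse matching number for each direction decoded with decode_enc."""
    return {
        d: count_matches(words, lib)
        for d, lib in reverse_word_libs.items()
        if d[1] == decode_enc
    }


# Toy data: placeholder strings, not real garbled text.
reverse_word_libs = {
    ("gbk", "utf-8"): {"ab", "bc"},
    ("big5", "utf-8"): {"bc"},
    ("utf-8", "gbk"): {"ab"},
}
words = ["ab", "bc", "cd"]
first = reverse_match_counts(words, reverse_word_libs, "utf-8")
second = sum(first.values())  # second reverse matching number
print(first, second)
```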
S207: Determine whether the first forward matching number is greater than the second reverse matching number; if so, perform step S215; if not, perform step S208.
S208: Determine whether the first forward matching number is smaller than the second reverse matching number; if so, perform step S209; if not, the two numbers are equal, and step S210 is performed.
S209: Acquire the file encoding of the encoding conversion direction corresponding to the reverse word library with the largest first reverse matching number, take that file encoding as the one encoding, and continue with step S203.
In step S206, the words were matched against each reverse word library whose encoding conversion direction has the one encoding as its decoding encoding, giving a first reverse matching number for each library. This step first selects, from those libraries, the one with the largest first reverse matching number, that is, the one in which the most words matched. It then acquires the file encoding of the corresponding encoding conversion direction (whose decoding encoding is the one encoding), takes that file encoding as the new encoding for decoding the file to be identified, and continues with step S203, so that the file is next decoded with that file encoding.
S210: Determine whether the first forward matching number and the second reverse matching number are both zero. If so, nothing matched in either the forward word library or the reverse word libraries, so matching falls back to the forward and reverse character libraries and step S211 is performed. If not, the two numbers are equal but non-zero; the next encoding is acquired from the encoding set to decode the file to be identified, that is, execution continues with step S202.
S211: Acquire the single characters in the decoded file to be identified.
S212: Match the single characters against the forward character library corresponding to the one encoding to obtain a second forward matching number. That is, the acquired single characters are looked up in the forward character library corresponding to the one encoding, and the number of single characters found is the second forward matching number.
S213: Match the single characters against each reverse character library corresponding to the first encoding conversion direction to obtain a third reverse matching number for each reverse character library, and add up the third reverse matching numbers of the reverse character libraries to obtain a fourth reverse matching number.
The fourth reverse matching number is the number of matches of the single characters across all reverse character libraries corresponding to the first encoding conversion directions. Specifically, every encoding conversion direction whose decoding encoding is the one encoding is acquired, then the reverse character library corresponding to each of these directions is acquired, and the single characters are looked up in each of these reverse character libraries. The number of single characters found in a given reverse character library is the third reverse matching number of that library, and the third reverse matching numbers of all these libraries are finally summed to give the fourth reverse matching number.
S214: Determine whether the second forward matching number is greater than the fourth reverse matching number; if so, perform step S215; if not, perform step S216.
S215: Take the one encoding as the file encoding of the file to be identified, that is, judge that the file encoding of the file to be identified is the one encoding.
S216: Determine whether the second forward matching number is smaller than the fourth reverse matching number; if so, perform step S217; if not, the two numbers are equal, in which case the next encoding is acquired from the encoding set to decode the file to be identified, that is, execution continues with step S202.
S217: Acquire the file encoding of the encoding conversion direction corresponding to the reverse character library with the largest third reverse matching number, take that file encoding as the one encoding, and continue with step S203.
In step S213, the single characters were matched against each reverse character library whose encoding conversion direction has the one encoding as its decoding encoding, giving a third reverse matching number for each library. This step first selects, from those libraries, the one with the largest third reverse matching number, that is, the one in which the most single characters matched. It then acquires the file encoding of the corresponding encoding conversion direction (whose decoding encoding is the one encoding), takes that file encoding as the new encoding for decoding the file to be identified, and continues with step S203, so that the file is next decoded with that file encoding.
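The control flow of steps S201 to S217 can be sketched in Python as follows. This is an illustrative reading of the steps, not the patent's implementation. All names are hypothetical, and three things are assumptions: ASCII is blanked to spaces rather than deleted, undecodable bytes become U+FFFD, and a `tried` set prevents revisiting an encoding (the text does not describe loop protection).

```python
from typing import Dict, List, Optional, Set, Tuple

Direction = Tuple[str, str]       # (file encoding, decoding encoding)
Libs = Tuple[Set[str], Set[str]]  # (character library, word library)


def extract(text: str) -> Tuple[List[str], List[str]]:
    """Blank out ASCII, then return (single characters, two-character words)."""
    text = "".join(" " if ord(c) < 128 else c for c in text)
    chars = [c for c in text if not c.isspace()]
    words = [a + b for a, b in zip(text, text[1:])
             if not (a.isspace() or b.isspace())]
    return chars, words


def detect(raw: bytes, encodings: List[str], forward: Dict[str, Libs],
           reverse: Dict[Direction, Libs]) -> Optional[str]:
    queue = list(encodings)
    tried: Set[str] = set()           # assumption: never retry an encoding
    while queue:
        enc = queue.pop(0)            # S202: take the next candidate
        if enc in tried:
            continue
        tried.add(enc)
        chars, words = extract(raw.decode(enc, errors="replace"))  # S203
        # Level 1 = word libraries (S204-S210), level 0 = character libraries.
        for level, items in ((1, words), (0, chars)):
            fwd = sum(item in forward[enc][level] for item in items)
            rev = {d: sum(item in libs[level] for item in items)
                   for d, libs in reverse.items() if d[1] == enc}
            total = sum(rev.values())
            if fwd > total:           # S215: accept this encoding
                return enc
            if fwd < total:           # S209 / S217: jump to the likely source
                queue.insert(0, max(rev, key=rev.get)[0])
                break
            if fwd != 0:              # equal and non-zero: next encoding
                break
            # both zero: fall through from words to single characters
    return None


def libs_of(text: str) -> Libs:
    chars, words = extract(text)
    return set(chars), set(words)


sample = "测试文件"
encodings = ["utf-8", "gbk"]
forward = {enc: libs_of(sample) for enc in encodings}
reverse = {(src, dst): libs_of(sample.encode(src).decode(dst, errors="replace"))
           for src in encodings for dst in encodings if src != dst}

print(detect("测试文件gb2312".encode("gbk"), encodings, forward, reverse))
print(detect("测试文件UTF8".encode("utf-8"), ["gbk", "utf-8"], forward, reverse))
```

In both calls the first candidate is the wrong encoding; the reverse libraries point to the actual source encoding, which is then tried next.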
Further, because UTF-8 is the more widely used encoding, it is preferable in step S102 to generate only the forward word stock corresponding to UTF-8. In step S103, UTF-8 sample files are then decoded only with the other encodings, and sample files in other encodings are decoded only as UTF-8; that is, every encoding conversion direction necessarily includes UTF-8. In step S202, UTF-8 is acquired first.
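The effect of this restriction on the set of encoding conversion directions can be sketched as follows; the function names are hypothetical, and the encoding names are written as Python codec names.

```python
from typing import List, Tuple


def utf8_centric_directions(encodings: List[str]) -> List[Tuple[str, str]]:
    """Directions (file encoding, decoding encoding) that all involve UTF-8."""
    others = [e for e in encodings if e != "utf-8"]
    return [("utf-8", e) for e in others] + [(e, "utf-8") for e in others]


def utf8_first(encodings: List[str]) -> List[str]:
    """Order the candidates so that UTF-8 is tried first (stable sort)."""
    return sorted(encodings, key=lambda e: e != "utf-8")


encodings = ["gbk", "utf-8", "gb2312"]
print(utf8_centric_directions(encodings))
print(utf8_first(encodings))
```

With n encodings this yields 2(n - 1) directions instead of the n(n - 1) produced by pairing every encoding with every other one.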
Two specific examples of the present embodiment are given below.
Suppose the content of file 1 to be identified (encoded in GB2312) is: 测试文件gb2312 ("test file gb2312"). It is first decoded as UTF-8, which gives: [B2][E2][CA][D4][CE][C4][BC][FE]gb2312. After the letters and symbols represented by ASCII codes are removed, this leaves: [B2][E2][CA][D4][CE][C4][BC][FE]. These characters are then matched against the forward word stock corresponding to UTF-8 and against the reverse word stocks whose decoding encoding is UTF-8. The number of matches in the forward word stock is 0 and the number of matches in the reverse word stocks is 8; because the forward matching number is smaller than the reverse matching number, the file encoding of file 1 is judged not to be UTF-8. Analysis then shows that the reverse word stock with the largest matching number corresponds to the encoding conversion direction "GB2312 to UTF-8", so file 1 is decoded again, this time as GB2312, which gives: 测试文件gb2312. After the letters and symbols represented by ASCII codes are removed, this leaves: 测试文件. These characters are then matched against the forward word stock corresponding to GB2312 and against the reverse word stocks whose decoding encoding is GB2312. The number of matches in the forward word stock is 2 ("测试" and "文件", that is, "test" and "file") and the number of matches in the reverse word stocks is 0, so the file encoding of file 1 is judged to be GB2312.
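The byte values in this example can be reproduced with Python's built-in codecs. The sketch assumes that the original file content is 测试文件gb2312, which is reconstructed here from the eight byte values listed above.

```python
# Assumed original content, reconstructed from the byte values listed above.
text = "测试文件gb2312"
raw = text.encode("gb2312")

# The first eight bytes are the four Chinese characters.
print(raw[:8].hex().upper())

# A strict UTF-8 decode of these bytes fails, since B2 cannot start a sequence.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)

# Decoding with the actual file encoding restores the text.
print(raw.decode("gb2312"))
```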
Similarly, suppose the content of file 2 to be identified (encoded in UTF-8) is: 测试文件UTF8 ("test file UTF8"). Decoding it as GB2312 gives garbled characters followed by UTF8. A query shows that the number of matches in the forward word stock corresponding to GB2312 is 0 and the number of matches in the reverse word stocks whose decoding encoding is GB2312 is 2 (two garbled two-character words), and that the matched reverse word stock corresponds to the encoding conversion direction "UTF-8 to GB2312". File 2 is therefore decoded again, this time as UTF-8, which gives: 测试文件UTF8. A query shows that the number of matches in the forward word stock corresponding to UTF-8 is 2 ("测试" and "文件", that is, "test" and "file") and the number of matches in the reverse word stocks is 0, so the file encoding of file 2 is judged to be UTF-8.
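The second example can be explored the same way. This sketch assumes the file content is 测试文件UTF8 and uses Python's `gbk` codec in place of GB2312, because GBK is a superset of GB2312 and can decode every byte pair that occurs here; both choices are assumptions, not statements from the patent.

```python
# Assumed file content; GBK stands in for GB2312 (assumption).
text = "测试文件UTF8"
raw = text.encode("utf-8")

# Decoding UTF-8 bytes with the wrong encoding produces garbled text.
garbled = raw.decode("gbk")
print(garbled)

# Reversing the encoding conversion direction recovers the original text.
restored = garbled.encode("gbk").decode("utf-8")
print(restored)
```

The round trip works here only because every byte pair happens to be valid GBK; in general a wrong decode can lose information.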
This embodiment can intelligently identify the correct encoding of a file that carries no encoding signature (such as file 1 to be identified), and by recording the encoding conversion direction it can deduce the correct file encoding in reverse more quickly, which improves identification efficiency. In addition, removing the letters and symbols represented by ASCII codes improves identification accuracy.
Example two
This embodiment is a computer-readable storage medium corresponding to the above embodiment, on which a computer program is stored, the program, when executed by a processor, implementing the steps of:
collecting a sample file, wherein the sample file comprises non-messy code texts of various languages;
respectively converting the file codes of the sample files into codes in a preset code set, and generating forward word libraries corresponding to the codes according to the converted sample files;
respectively decoding the sample file through other codes different from the file codes in the code set to obtain a messy code file, and recording the code conversion direction of the messy code file, wherein the code conversion direction comprises file codes and decoding codes;
generating a reverse word stock corresponding to the coding conversion direction of the messy code file according to the messy code file;
acquiring a file to be identified;
sequentially decoding the file to be identified through one code in the code set;
acquiring words and single characters in the decoded file to be recognized, and matching them respectively in the forward word bank corresponding to the one code and in the reverse word bank corresponding to a first code conversion direction to obtain a forward matching number and a reverse matching number, wherein the decoding code of the first code conversion direction is the one code;
and if the forward matching number is greater than the reverse matching number, taking the code as the file code of the file to be identified.
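The step of producing messy code files and recording their code conversion direction can be sketched as follows. This is a minimal illustration under stated assumptions: the two-entry `CODE_SET`, the function name `make_garbled`, and the use of `errors="replace"` for undecodable bytes are all hypothetical choices, not details given in the text.

```python
from itertools import permutations

# Assumed code set; the preset code set of the method is not enumerated here.
CODE_SET = ["utf-8", "gbk"]

def make_garbled(sample, code_set=CODE_SET):
    """Map (file_code, decode_code) -> sample text decoded with the wrong code."""
    garbled = {}
    for file_code, decode_code in permutations(code_set, 2):
        try:
            raw = sample.encode(file_code)
        except UnicodeEncodeError:
            continue  # the sample cannot be represented in this file code
        # Undecodable bytes become U+FFFD (an assumption of this sketch).
        garbled[(file_code, decode_code)] = raw.decode(decode_code, errors="replace")
    return garbled

directions = make_garbled("测试文件")
for direction, text in directions.items():
    print(direction, repr(text))
```

Keying each garbled text by the pair (file code, decoding code) is what later allows a match in a reverse word bank to point directly at the likely true file code.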
Further, after the acquiring the sample file, the method further comprises:
and replacing a first character in the sample file with a blank space, wherein the first character is a letter and a symbol represented by ASCII code.
Further, after the decoding the file to be identified by an encoding in the encoding set, the method further includes:
and eliminating the first character in the decoded file to be recognized.
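The two ASCII-handling steps differ slightly: in sample files the first characters are replaced with spaces, while in the decoded file to be recognized they are eliminated. A minimal sketch follows, assuming that "letters and symbols represented by ASCII code" covers every printable ASCII character except the space (digits included, as in the worked example where "gb2312" is dropped); the function names are hypothetical.

```python
import re

# Assumption: printable ASCII from "!" (0x21) to "~" (0x7E) counts as a
# "first character"; the space itself is kept.
_ASCII_PRINTABLE = re.compile(r"[\x21-\x7e]")

def blank_ascii(sample_text):
    """Sample files: replace each ASCII letter or symbol with a space."""
    return _ASCII_PRINTABLE.sub(" ", sample_text)

def strip_ascii(decoded_text):
    """File to be identified: remove ASCII letters and symbols entirely."""
    return _ASCII_PRINTABLE.sub("", decoded_text)

print(repr(blank_ascii("测试a文件")))
print(repr(strip_ascii("测试文件gb2312")))
```

Replacing with a space in the samples keeps characters on either side of an ASCII run from being joined into a two-character word that never occurred in the original text.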
Further, the converting the file codes of the sample file into codes in a preset code set, and generating a forward word library corresponding to each code according to the converted sample file specifically includes:
converting the file code of the sample file into a code in a preset code set;
acquiring all the single characters in the converted sample file, and generating a forward character library corresponding to the code;
and acquiring all the two continuous non-blank characters in the converted sample file, and generating a forward word bank corresponding to the code.
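Extracting the entries for the forward character library and the forward word bank can be sketched as below. The helper names are hypothetical, and representing each library as a Python set is an assumption of this sketch; the text only specifies that single characters and pairs of consecutive non-blank characters are collected.

```python
def single_chars(text):
    """All distinct non-blank characters in the text."""
    return {ch for ch in text if not ch.isspace()}

def two_char_words(text):
    """All distinct pairs of consecutive non-blank characters."""
    return {a + b for a, b in zip(text, text[1:])
            if not a.isspace() and not b.isspace()}

converted = "测试 文件"  # sample text after ASCII characters became spaces
forward_char_library = single_chars(converted)
forward_word_bank = two_char_words(converted)
print(sorted(forward_char_library), sorted(forward_word_bank))
```

Because pairs that straddle a space are skipped, "试文" is not recorded for this sample, whereas it would be if the two words were adjacent.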
Further, the generating of the reverse word stock corresponding to the code conversion direction of the garbled file is specifically as follows:
acquiring all single characters in a messy code file, and generating a reverse character library corresponding to the code conversion direction of the messy code file;
acquiring all continuous and non-space two characters in a messy code file, and generating a reverse word stock corresponding to the code conversion direction of the messy code file.
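A reverse character library and reverse word stock for one code conversion direction might be built as in the following sketch. The function names, the dictionary layout, and the `errors="replace"` policy are assumptions made for illustration; GBK again stands in for GB2312 so that all byte pairs decode.

```python
def ngrams(text, n):
    """Distinct runs of n consecutive characters containing no whitespace."""
    return {text[i:i + n] for i in range(len(text) - n + 1)
            if not any(ch.isspace() for ch in text[i:i + n])}

def reverse_library(sample, file_code, decode_code):
    """Garble the sample in one direction and collect its characters and words."""
    garbled = sample.encode(file_code).decode(decode_code, errors="replace")
    return {
        "direction": (file_code, decode_code),  # recorded conversion direction
        "chars": ngrams(garbled, 1),            # reverse character library
        "words": ngrams(garbled, 2),            # reverse word stock
    }

lib = reverse_library("测试文件", "utf-8", "gbk")
print(lib["direction"], len(lib["chars"]), len(lib["words"]))
```

One such library is kept per direction, so a later match identifies both the decoding code that was tried and the file code that most plausibly produced the garbled text.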
Further, the forward word stock comprises a forward character library and a forward word bank, and the reverse word stock comprises a reverse character library and a reverse word bank;
the steps from sequentially decoding the file to be identified through one code in the code set up to taking the one code as the file code of the file to be identified if the forward matching number is greater than the reverse matching number specifically include:
acquiring a code in the code set, and decoding the file to be identified through the code;
acquiring words in the decoded file to be recognized, wherein the words are two continuous non-blank characters;
matching the words with a forward word bank corresponding to the code to obtain a first forward matching number;
matching the words with each reverse lexicon corresponding to a first coding conversion direction respectively to obtain a first reverse matching number of each reverse lexicon, wherein the decoding codes in the first coding conversion direction are the codes;
adding the first reverse matching numbers of the reverse word banks to obtain a second reverse matching number;
if the first forward matching number is larger than the second reverse matching number, the code is used as a file code of the file to be identified;
if the first forward matching number is smaller than the second reverse matching number, acquiring a file code in a code conversion direction corresponding to a reverse lexicon with the maximum first reverse matching number, taking the file code as a code, and continuously executing the step of decoding the file to be identified through the code;
if the first forward matching number and the second reverse matching number are equal and are not zero, acquiring a next code in the code set, taking the next code as a code, and continuing to execute the step of decoding the file to be identified through the code;
if the first forward matching number and the second reverse matching number are both zero, acquiring a single character in the decoded file to be identified;
matching the single character with the forward character library corresponding to the code to obtain a second forward matching number;
respectively matching the single characters with the reverse character libraries corresponding to the first coding conversion direction to obtain a third reverse matching number of each reverse character library;
adding the third reverse matching numbers of the reverse character libraries to obtain a fourth reverse matching number;
if the second forward matching number is larger than the fourth reverse matching number, the code is used as a file code of the file to be identified;
if the second forward matching number is smaller than the fourth reverse matching number, acquiring a file code in a code conversion direction corresponding to the reverse character library with the largest third reverse matching number, taking the file code as a code, and continuously executing the step of decoding the file to be identified through the code;
and if the second forward matching number and the fourth reverse matching number are equal, acquiring a next code in the code set, taking the next code as a code, and continuously executing the step of decoding the file to be identified through the code.
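The decision procedure above (word-level comparison first, single-character fallback when both word counts are zero, a jump to the file code of the best reverse library, or a move to the next code on a tie) can be sketched as follows. This is a simplified illustration, not the patented implementation: the helper names, the library layout, the `errors="replace"` policy, the iteration guard, and returning `None` when the code set is exhausted are all assumptions.

```python
from itertools import permutations

def strip_ascii(text):
    # Drop printable ASCII letters, digits and symbols (assumed reading).
    return "".join(ch for ch in text if not ("!" <= ch <= "~"))

def chars_of(text):
    return [ch for ch in text if not ch.isspace()]

def words_of(text):
    # A "word" is two consecutive non-blank characters.
    return [a + b for a, b in zip(text, text[1:])
            if not (a.isspace() or b.isspace())]

def build_libraries(sample, code_set):
    """Return (forward, reverse) libraries built from one clean sample."""
    forward, reverse = {}, {}
    for code in code_set:
        text = sample.encode(code).decode(code)
        forward[code] = {"words": set(words_of(text)), "chars": set(chars_of(text))}
    for file_code, decode_code in permutations(code_set, 2):
        text = sample.encode(file_code).decode(decode_code, errors="replace")
        reverse[(file_code, decode_code)] = {
            "words": set(words_of(text)), "chars": set(chars_of(text))}
    return forward, reverse

def count(items, library):
    return sum(1 for item in items if item in library)

def identify(raw, code_set, forward, reverse):
    index, code = 0, code_set[0]
    for _ in range(2 * len(code_set)):  # guard against endless jumps (assumption)
        text = strip_ascii(raw.decode(code, errors="replace"))
        for level, items in (("words", words_of(text)), ("chars", chars_of(text))):
            fwd = count(items, forward[code][level])
            # Only reverse libraries whose decoding code is the current code.
            rev = {d: count(items, lib[level])
                   for d, lib in reverse.items() if d[1] == code}
            total = sum(rev.values())
            if level == "words" and fwd == 0 and total == 0:
                continue  # no word matched at all: fall back to single characters
            break
        if fwd > total:
            return code
        if fwd < total:
            code = max(rev, key=rev.get)[0]  # file code of the best reverse library
        else:
            index += 1  # tie: try the next code in the set
            if index == len(code_set):
                return None
            code = code_set[index]
    return None

forward, reverse = build_libraries("测试文件", ["utf-8", "gbk"])
print(identify("测试文件gb2312".encode("gbk"), ["utf-8", "gbk"], forward, reverse))
```

In this toy setup the libraries come from a single four-character sample, so the match counts differ from those quoted in the worked examples; only the direction of each comparison carries over.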
In summary, the file code recognition method and the computer-readable storage medium provided by the present invention generate the forward word stock and the reverse word stock by analyzing and processing the collected sample files, and then obtain the file code of the file to be recognized from the results of matching that file against the forward word stock and the reverse word stock. The invention can correctly identify the code of a file whose coding mode is unknown, thereby effectively avoiding messy codes; removing the letters and symbols represented by ASCII codes improves recognition accuracy; and recording the code conversion direction allows the correct file code to be inferred more quickly, which improves recognition efficiency.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims (4)

1. A file coding identification method based on a forward and reverse word stock is characterized by comprising the following steps:
collecting a sample file, wherein the sample file comprises non-messy code texts of various languages;
respectively converting the file codes of the sample files into codes in a preset code set, and generating forward word libraries corresponding to the codes according to the converted sample files;
respectively decoding the sample file through other codes different from the file codes in the code set to obtain a messy code file, and recording the code conversion direction of the messy code file, wherein the code conversion direction comprises file codes and decoding codes;
generating a reverse word stock corresponding to the coding conversion direction of the messy code file according to the messy code file;
acquiring a file to be identified;
sequentially decoding the file to be identified through one code in the code set;
acquiring words and single characters in the decoded file to be recognized, and matching them respectively in the forward word bank corresponding to the one code and in the reverse word bank corresponding to a first code conversion direction to obtain a forward matching number and a reverse matching number, wherein the decoding code of the first code conversion direction is the one code;
if the forward matching number is larger than the reverse matching number, the code is used as the file code of the file to be identified;
the step of converting the file codes of the sample file into codes in a preset code set respectively, and generating a forward word stock corresponding to each code according to the converted sample file specifically comprises the following steps:
converting the file code of the sample file into a code in a preset code set;
acquiring all the single characters in the converted sample file, and generating a forward character library corresponding to the code;
acquiring all continuous and non-space two characters in the converted sample file, and generating a forward lexicon corresponding to the code;
the step of generating a reverse word library corresponding to the coding conversion direction of the messy code file specifically comprises the following steps:
acquiring all single characters in a messy code file, and generating a reverse character library corresponding to the code conversion direction of the messy code file;
acquiring all continuous and non-space two characters in a messy code file, and generating a reverse word stock corresponding to the code conversion direction of the messy code file;
the steps from sequentially decoding the file to be identified through one code in the code set up to taking the one code as the file code of the file to be identified if the forward matching number is greater than the reverse matching number specifically comprise:
acquiring a code in the code set, and decoding the file to be identified through the code;
acquiring words in the decoded file to be recognized, wherein the words are two continuous non-blank characters;
matching the words with a forward word bank corresponding to the code to obtain a first forward matching number;
matching the words with each reverse lexicon corresponding to a first coding conversion direction respectively to obtain a first reverse matching number of each reverse lexicon, wherein the decoding codes in the first coding conversion direction are the codes;
adding the first reverse matching numbers of the reverse word banks to obtain a second reverse matching number;
if the first forward matching number is larger than the second reverse matching number, the code is used as a file code of the file to be identified;
if the first forward matching number is smaller than the second reverse matching number, acquiring a file code in a code conversion direction corresponding to a reverse lexicon with the maximum first reverse matching number, taking the file code as a code, and continuously executing the step of decoding the file to be identified through the code;
if the first forward matching number and the second reverse matching number are equal and are not zero, acquiring a next code in the code set, taking the next code as a code, and continuing to execute the step of decoding the file to be identified through the code;
if the first forward matching number and the second reverse matching number are both zero, acquiring a single character in the decoded file to be identified;
matching the single character with a forward character library corresponding to the code to obtain a second forward matching number;
respectively matching the single characters with the reverse character libraries corresponding to the first coding conversion direction to obtain a third reverse matching number of each reverse character library;
adding the third reverse matching numbers of the reverse character libraries to obtain a fourth reverse matching number;
if the second forward matching number is larger than the fourth reverse matching number, the code is used as a file code of the file to be identified;
if the second forward matching number is smaller than the fourth reverse matching number, acquiring a file code in a code conversion direction corresponding to the reverse character library with the largest third reverse matching number, taking the file code as a code, and continuously executing the step of decoding the file to be identified through the code;
and if the second forward matching number and the fourth reverse matching number are equal, acquiring a next code in the code set, taking the next code as a code, and continuously executing the step of decoding the file to be identified through the code.
2. The method for identifying file codes based on the forward and reverse word library according to claim 1, wherein after the sample file is collected, the method further comprises:
and replacing a first character in the sample file with a blank space, wherein the first character is a letter and a symbol represented by ASCII code.
3. The method for identifying file codes based on the forward and reverse word library according to claim 2, wherein after decoding the file to be identified by a code in the code set, the method further comprises:
and eliminating the first character in the decoded file to be recognized.
4. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1-3.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110207815.7A CN113064862B (en) 2019-04-19 2019-04-19 File code identification method based on forward and reverse word stock and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110207815.7A CN113064862B (en) 2019-04-19 2019-04-19 File code identification method based on forward and reverse word stock and storage medium
CN201910317628.7A CN110096481B (en) 2019-04-19 2019-04-19 Method for identifying file code and computer readable storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910317628.7A Division CN110096481B (en) 2019-04-19 2019-04-19 Method for identifying file code and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113064862A CN113064862A (en) 2021-07-02
CN113064862B true CN113064862B (en) 2022-06-07

Family

ID=67445271

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201910317628.7A Active CN110096481B (en) 2019-04-19 2019-04-19 Method for identifying file code and computer readable storage medium
CN202110207832.0A Active CN113064863B (en) 2019-04-19 2019-04-19 Method for automatically recognizing file code and computer readable storage medium
CN202110207815.7A Active CN113064862B (en) 2019-04-19 2019-04-19 File code identification method based on forward and reverse word stock and storage medium

Country Status (1)

Country Link
CN (3) CN110096481B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807807A (en) * 2021-08-16 2021-12-17 深圳市云采网络科技有限公司 Component parameter identification method and device, electronic equipment and readable medium
CN114492311B (en) * 2021-12-21 2024-08-06 成都鲁易科技有限公司 Method and device for recognizing messy code data, storage medium and computer equipment
CN114139498B (en) * 2022-01-26 2022-05-03 统信软件技术有限公司 Text decoding method and device, text reader and computing equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1575467A (en) * 2001-10-22 2005-02-02 数码世界语有限公司 Computerized coder-decoder without being restricted by language and method
US7610192B1 (en) * 2006-03-22 2009-10-27 Patrick William Jamieson Process and system for high precision coding of free text documents against a standard lexicon
CN103970913A (en) * 2014-05-28 2014-08-06 广州视源电子科技股份有限公司 UTF-8 and ANSI code identification method and device
CN107122342A (en) * 2017-04-21 2017-09-01 东莞中国科学院云计算产业技术创新与育成中心 Text code recognition methods and device
CN108197087A (en) * 2018-01-18 2018-06-22 北京奇安信科技有限公司 Character code recognition methods and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8670976B2 (en) * 2011-03-31 2014-03-11 King Abdulaziz City for Science & Technology System and methods for encoding and decoding multi-lingual text in a matrix code symbol
US8874430B2 (en) * 2011-03-31 2014-10-28 King Abdulaziz City For Science And Technology Applications for encoding and decoding multi-lingual text in a matrix code symbol
CN104360988B (en) * 2014-10-17 2017-10-20 北京锐安科技有限公司 The recognition methods of the coded system of Chinese character and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
多语种eml文件编码及语种识别算法研究;张健,任炜,蒋欣,陈辰,赖跃群,袁保社;《新疆大学学报(自然科学版)》;20101130;第27卷(第4期);第482-485页 *

Also Published As

Publication number Publication date
CN110096481A (en) 2019-08-06
CN113064863A (en) 2021-07-02
CN113064863B (en) 2022-06-07
CN110096481B (en) 2021-03-23
CN113064862A (en) 2021-07-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant