CN114580353A

CN114580353A - ZIP file decoding identification method, encoding correction method, computer-readable storage medium and system

Info

Publication number: CN114580353A
Application number: CN202210239459.1A
Authority: CN
Inventors: 盛钺清; 殷正航
Original assignee: Shanghai I Cloud Network Technology Co ltd
Current assignee: Shanghai I Cloud Network Technology Co ltd
Priority date: 2022-03-11
Filing date: 2022-03-11
Publication date: 2022-06-03

Abstract

The invention relates to the technical field of document processing, and in particular to a decoding identification method for ZIP files, an encoding correction method, a decoding identification system and a computer-readable storage medium. The encoding correction method first reads the extra header field of the ZIP file and checks whether the field carries the UTF-8 flag bit. If the flag bit is present, the file is decoded directly; if it is absent, the decoding identification method is executed: a plurality of file names are extracted from the directory source data and spliced together, and the spliced file names are used as a single identification input for identifying the encoding format. Because the spliced text forms a long identification input, the accuracy of encoding identification is improved.

Description

ZIP file decoding identification method, encoding correction method, computer-readable storage medium and system

Technical Field

The invention relates to the technical field of document processing, and in particular to a decoding identification method for ZIP files, an encoding correction method for ZIP files, a decoding identification system for ZIP files and a computer-readable storage medium.

Background

ZIP (a compressed file format) is an old specification that first appeared on IBM's DOS system and remains one of the mainstream compression formats today. DOS at that time could not support Unicode and UTF-8 encoding as systems do today; computers in different countries had to install different code pages and were compatible only with local (country/region) characters. As a result, although the Simplified Chinese GBK encoding, the Traditional Chinese Big5 encoding and the Japanese Shift-JIS encoding all contain a large number of identical Chinese characters, those characters are encoded as completely different bytes because the encoding specifications differ. Like DOS, ZIP did not consider unified Unicode encoding in its initial design, so files are stored according to the default encoding of whichever operating system performs the compression.

Nowadays, with the rise of Unicode and UTF-8 encoding, more and more systems support the UTF-8 specification (an encoding capable of representing characters from all over the world). A new flag bit has been added to ZIP to indicate whether the ZIP file is encoded in UTF-8. However, the ZIP compression code in mainstream operating systems has gone unmaintained for a long time, many functions do not conform to the latest ZIP standard, and the file systems of different operating systems do not support encoding formats uniformly. Linux does not support GBK encoding by default. The default Chinese encoding of the Windows operating system is GBK, and even Windows 10 still determines the system language through a compatible code page; ZIP compression on Windows therefore uses the local encoding (GBK by default) without setting the UTF-8 flag bit, but uses the special ZIP extended file name field and stores the UTF-8 encoded file name in that extension field. macOS uses UTF-8 as its default Chinese encoding; because the Mac code page is UTF-8, it compresses in UTF-8 but does not set the UTF-8 flag bit. Moreover, different operating systems treat upper-case and lower-case file names inconsistently: Linux is case-sensitive, while Mac and Windows are case-insensitive by default. That is, A.txt and a.txt are regarded as the same file on Mac and Windows but as different files on Linux. Because the file identification system cannot know on which system the ZIP file to be decoded was created, it cannot supply a matching decoding method.
These inconsistent encoding schemes across operating systems make it difficult for the server side of the file identification system to decode a ZIP file and correctly identify its file names.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a decoding recognition and encoding correction method for a ZIP file, which can recognize and decode ZIP files created under different operating systems, a computer-readable storage medium storing a computer program that implements the method when executed, and a system including the storage medium.

A decoding identification method for a ZIP file is provided, comprising the following identification steps: obtaining directory source data of the ZIP file, extracting file name data from the obtained directory source data, identifying the encoding format of the extracted file name data, and decoding the ZIP file with the identified encoding format as the source encoding format of the ZIP file. The extracted file name data comprises a plurality of spliced file names, and when the encoding format of the extracted file name data is identified, the spliced file names are used as a single identification input.

Preferably, splicing the plurality of file names comprises: splicing all the file names in the obtained directory source data in sequence according to their hierarchical order in the directory source data.

Preferably, identifying the encoding format of the extracted file name data comprises an encoding detection step: the extracted file name data is matched against a preset library of encoding formats, and if an encoding format whose matching degree with the file name data reaches a preset degree is detected in the library, the detected encoding format is taken as the source encoding format of the ZIP file.

Preferably, the encoding detection step uses an encoding identification tool for the matching detection, and the encoding identification tool comprises one or more of universalchardet, iconv character encoding conversion, ICU string encoding detection and enca encoding conversion.

Preferably, in the identification step, obtaining the directory source data of the ZIP file comprises a mapping conversion step: a mapping table of restricted characters is established for the plurality of file name data in the directory source data, and the file name data is extracted by referencing the mapping table through links. The restricted characters comprise one or more of letters, digits, underscores and hyphens.

Preferably, in the mapping conversion step, establishing the restricted-character mapping table for the plurality of file name data in the directory source data means: obtaining the hierarchical structure of the directory source data, configuring restricted characters for the files of each level of the hierarchical structure, and constructing a mapping table in which each configured restricted-character string points to the file name of its file.

Preferably, configuring the restricted characters comprises using the unique string generator nanoid to randomly generate a character string composed of restricted characters for each of the files of each level.

An encoding correction method for a ZIP file is also provided, comprising a flag bit confirmation step: reading the extra header field of the ZIP file and checking whether the field carries the UTF-8 flag bit; if the field does not carry the UTF-8 flag bit, executing the ZIP file decoding identification method according to any one of claims 1-7 to obtain the encoding format of the ZIP file, and decoding the ZIP file in the obtained encoding format.

There is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, is capable of implementing the above-described decoding recognition method of the ZIP file and/or the above-described encoding modification method of the ZIP file.

A ZIP file decoding identification system is also provided, comprising a ZIP file extraction module and an encoding format identification module. The ZIP file extraction module is used to obtain the directory source data of the ZIP file and to extract file name data from it; the encoding format identification module is used to identify the encoding format of the extracted file name data. The system further comprises a processor in which the above computer-readable storage medium is pre-stored, and the computer program on the computer-readable storage medium can be executed by the processor.

Beneficial effects: in the encoding correction method, the ZIP file decoding identification system first reads the extra header field of the ZIP file and checks whether the field carries the UTF-8 flag bit. If the flag bit is present, the file is decoded directly according to the UTF-8 flag bit. If it is absent, the decoding identification method is executed: the directory source data of the ZIP file is obtained, a plurality of file names from it are spliced, and the spliced file names are used as a single identification input to identify the encoding format.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below.

FIG. 1 is a flowchart of an encoding recognition and modification process of a ZIP file decoding recognition system according to the present invention.

FIG. 2 is a diagram of the original hierarchy of a ZIP file.

FIG. 3 is a schematic diagram of the hierarchical structure of the garbled files obtained by directly identifying the ZIP file of FIG. 2.

FIG. 4 is a schematic view of a hierarchical structure of a ZIP file obtained by the ZIP file decoding recognition system of the present invention performing a mapping conversion step on the ZIP file of FIG. 2.

FIG. 5 is an expanded view of the file name data of each level of the ZIP file of FIG. 2.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

The ZIP file decoding identification system is deployed at the decoding server of a file server and comprises a ZIP file extraction module, an encoding format identification module and a decoding module. After the decoding server receives a ZIP file, the ZIP file is first parsed with CP437 encoding, and the decoding module of the system then executes the encoding correction method shown in FIG. 1: the extra header field of the ZIP file (i.e., the ZIP extra header) is read and checked for the UTF-8 flag bit. If the flag bit is present, the decoding module decodes the ZIP file directly. If the flag bit is absent, the decoding server records that the UTF-8 flag bit of the ZIP file is missing, restores the data that was parsed with CP437 to its original bytes, and then notifies the decoding module to execute the ZIP file decoding identification method to identify the source encoding format of the ZIP file. The encoding of the ZIP file is thereby corrected, and the ZIP file is decoded in the identified source encoding format.
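The flag check and the CP437 restoration described above can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation. It assumes that the "UTF-8 flag bit" corresponds to bit 11 (0x800) of the ZIP general purpose bit flag, and that the entry name held in `ZipInfo.filename` was produced by CP437 decoding when that bit is clear. The helper names `raw_name_bytes` and `decode_name` are hypothetical.

```python
import zipfile

UTF8_FLAG = 0x800  # assumption: bit 11 of the general purpose bit flag marks UTF-8 names


def raw_name_bytes(info: zipfile.ZipInfo) -> bytes:
    """Recover the name bytes as they were stored in the archive."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename.encode("utf-8")
    # Without the flag the name was parsed as CP437; encoding it back
    # with CP437 restores the original, still unidentified, bytes.
    return info.filename.encode("cp437")


def decode_name(info: zipfile.ZipInfo, detected_encoding: str) -> str:
    """Decode directly when the flag is set, else use the identified encoding."""
    raw = raw_name_bytes(info)
    if info.flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode(detected_encoding)


# Usage: simulate an entry whose GBK name was parsed as CP437.
entry = zipfile.ZipInfo("placeholder")
entry.filename = "课件.txt".encode("gbk").decode("cp437")
entry.flag_bits = 0
print(decode_name(entry, "gbk"))
```

The value passed as `detected_encoding` would come from the decoding identification method; here it is supplied by hand.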

The following details the operation of the ZIP file decoding and identifying method of the present embodiment.

The ZIP file extraction module of the ZIP file decoding identification system obtains the directory source data of the ZIP file (see the original hierarchical structure of the ZIP file "courseware" in FIG. 2), extracts a plurality of file names from the obtained data and splices them; that is, all the file names in the obtained directory source data are spliced in sequence from the upper level to the lower level according to their hierarchical order in the directory source data. When the encoding format of the extracted file name data is identified, the spliced file names are used as a single identification input, and the identified encoding format is finally used as the source encoding format for decoding the ZIP file. In actual decoding tests, compared with the common approach of using each file name separately as an identification input, using the long identification text obtained by splicing a plurality of file names improved the accuracy of encoding identification and therefore the accuracy of decoding.
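The level-by-level splicing can be sketched as below. The sketch assumes the raw entry paths are available as byte strings separated by "/", and it joins the names with a newline; the separator, the de-duplication of repeated names within a level and the function name `splice_by_level` are illustrative assumptions that the description does not specify.

```python
def splice_by_level(raw_paths, sep=b"\n"):
    """Join path components level by level, upper levels first."""
    levels = {}
    for path in raw_paths:
        parts = [p for p in path.split(b"/") if p]
        for depth, part in enumerate(parts):
            names = levels.setdefault(depth, [])
            if part not in names:  # keep each name once per level
                names.append(part)
    return sep.join(name for depth in sorted(levels) for name in levels[depth])


# Usage: three entries spread over three levels.
spliced = splice_by_level([b"a/b/c.txt", b"a/d.txt", b"a/b/e.txt"])
print(spliced.split(b"\n"))
```

The resulting byte string is what would be handed to the encoding identification tool as a single input.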

Preferably, all file names in the obtained directory source data are spliced in sequence from the upper level to the lower level according to their hierarchical order in the directory source data. Referring to the expanded view of the file name data of each level of the ZIP file "courseware" in FIG. 5, the ZIP file name "courseware" is placed first; the file names "closure & decorator" and "gitwood" of the same level are spliced next; then the file names of the next level, "decorator use", "decorator", "index", "images", "fonts", "plugs", "app" and "style"; then the file names of the level after that, "apple-touch-composition-152", "favicon", "Fowensome", "Fowesome", "GIwensome", "gihook-plug-in-position" and "giuin-wood-composition"; and finally the file names of the last level, "b-webbed", "b-courseware", "fontawesome-webfont.eot", "plugin", "buttons" and "website". Alternatively, the file names in FIG. 2 are spliced in the order obtained by fully expanding every level: after the ZIP file name "courseware", the file name "closure & decorator" of folder F11 is spliced together with the file names "decorator use", "decorator" and "index" of all files in its next level; then the file name "gitwood" of folder F12, which is at the same level as folder F11, is spliced together with the file names of all folders and files below it, for example the file name "images" of folder F21 followed by the file names "ap-touch-icon-compressed-152" and "favicon" of files D31 and D32 in its next level, and so on. Splicing in this fully expanded order suits a decoding server whose file server can expand all file names in one operation, and it speeds up extraction of the file name directory source data.
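The alternative order, in which each folder is followed immediately by everything beneath it, amounts to a depth-first walk of the directory tree. A minimal sketch follows, again assuming "/"-separated byte paths and a newline separator; `splice_depth_first` is a hypothetical name.

```python
def splice_depth_first(raw_paths, sep=b"\n"):
    """Join path components so each folder is followed by its own contents."""
    tree = {}
    for path in raw_paths:
        node = tree
        for part in path.split(b"/"):
            if part:
                node = node.setdefault(part, {})

    ordered = []

    def walk(node):
        # dicts preserve insertion order, so siblings keep archive order
        for name, children in node.items():
            ordered.append(name)
            walk(children)

    walk(tree)
    return sep.join(ordered)


# Usage: folder F11 and its files come before folder F12.
print(splice_depth_first([b"top/F11/a", b"top/F12/c", b"top/F11/b"]))
```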

Preferably, identifying the encoding format of the extracted file name data comprises an encoding detection step: the type of the encoding format is detected, and if an encoding format whose matching degree reaches a preset degree is detected, that encoding format is used as the source encoding format for decoding. The encoding detection step uses an encoding identification tool to detect the encoding format type; such tools include universalchardet, iconv character encoding conversion, ICU string encoding detection and enca encoding conversion. This embodiment uses the conventional universalchardet encoding identification tool.

The matching degree reaches the preset degree in either of two cases. In the first case, the matching degree between the current identification input and one encoding format reaches an absolute threshold: for example, if the matching degree with the Simplified Chinese GBK encoding format reaches 85%, the identification input is regarded as GBK. In the second case, the matching degree with one encoding format exceeds the matching degrees of all other encoding formats by a preset value (for example, 20%): if the matching degrees of the identification input with the Simplified Chinese GBK, Japanese Shift-JIS and Traditional Chinese Big5 encoding formats are 40%, 50% and 75% respectively, the matching degree with Big5 exceeds those of the other two formats by more than 20%, so the current identification input is regarded as Traditional Chinese Big5.
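The two acceptance rules can be written as a small decision function. The sketch below takes the matching degrees as a dictionary of scores between 0 and 1, which in practice would come from the encoding identification tool; the function name `pick_encoding`, the parameter names and the choice to return `None` when neither rule applies are assumptions made for illustration.

```python
def pick_encoding(scores, absolute=0.85, margin=0.20):
    """Return the accepted encoding name, or None if no rule is satisfied."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranked[0]
    # Rule 1: the best matching degree reaches the absolute threshold.
    if best_score >= absolute:
        return best_name
    # Rule 2: the best matching degree leads every other one by the margin.
    if len(ranked) > 1 and best_score - ranked[1][1] >= margin:
        return best_name
    return None


# Usage with the figures from the example above.
print(pick_encoding({"GBK": 0.40, "Shift-JIS": 0.50, "Big5": 0.75}))
```

With the example figures, Big5 leads the runner-up by 25 percentage points, so the second rule accepts it even though it is below 85%.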

Example two

During the decoding tests of the first embodiment, the inventors found cases in which files from the ZIP file could be neither identified nor deleted. Their analysis is that this is caused by performing encoding identification of the files directly as a system-layer task: a file whose encoding is identified incorrectly may be decoded into a file with a garbled name. For example, the ZIP file name "courseware" in FIG. 2 is identified as garbled text in FIG. 3, which pollutes the file system so that the file can be neither identified nor deleted. The inventors therefore add a mapping conversion step to the step of obtaining the directory source data of the ZIP file in the identification step of the first embodiment: a mapping table that maps to restricted characters is established for the plurality of file names in the directory source data (see FIG. 4 for the hierarchical structure of the mapped files obtained by performing the mapping conversion step on the ZIP file). Specifically, the hierarchical structure of the directory source data is obtained (see FIG. 2), and the unique string generator nanoid is used to randomly generate a character string for each level, so as to configure restricted characters for each level of the hierarchical structure. After the mapping conversion step has been executed, the extraction of file name data in the step of obtaining the directory source data is carried out by referencing the mapping table through links. That is, the file name read when the file is handled at the system layer is the restricted-character string produced by the mapping conversion step (see FIG. 4, ZIP file name "e_4kqzsbd4"); the real file name corresponding to that string (see FIG. 2, ZIP file name "courseware") is then obtained at the application layer by referencing the mapping table through links, and encoding identification is performed at the application layer. If identification succeeds, the source encoding format for decoding the ZIP file is obtained; if it fails, only a deletable garbled file is left at the application layer, and the system layer is not affected. Because reading file names through the mapping table is an application-layer operation, even if subsequent encoding identification mistakenly produces garbled names, the garbled files can be deleted directly at the application layer, which prevents files with garbled names from polluting the file system at the system layer.
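The mapping table can be sketched as a dictionary from randomly generated safe identifiers to the raw file name bytes. The embodiment uses nanoid; the sketch below substitutes Python's standard `secrets` module to draw characters from the same restricted alphabet (letters, digits, underscore, hyphen). The identifier length of 10 and the names `random_id` and `build_mapping` are assumptions.

```python
import secrets
import string

# Restricted alphabet: letters, digits, underscore and hyphen.
ALPHABET = string.ascii_letters + string.digits + "_-"


def random_id(size=10):
    """Generate a random identifier made only of restricted characters."""
    return "".join(secrets.choice(ALPHABET) for _ in range(size))


def build_mapping(raw_names):
    """Map a fresh safe identifier to each raw (undecoded) file name."""
    mapping = {}
    for raw in raw_names:
        safe = random_id()
        while safe in mapping:  # regenerate on the unlikely collision
            safe = random_id()
        mapping[safe] = raw
    return mapping


# Usage: files on disk carry the safe identifier; the real name bytes
# stay in the table and are only decoded at the application layer.
table = build_mapping([b"\xbf\xce\xbc\xfe", b"index.html"])
for safe_name, raw_name in table.items():
    print(safe_name, raw_name)
```

Because only the safe identifiers ever reach the file system, a wrong guess about the encoding affects the table lookup result but never produces an undeletable file name.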

Example three

On the basis of the first or second embodiment, the ZIP file extraction module executes a file name screening step while extracting the plurality of file name data: if the hierarchical structure of the directory source data has three or more levels, the file names at the lowest level are selectively retained. Referring to the expanded view of the file name data of each level of the ZIP file, folder 31 "FontAwesome" is located at the third level under the ZIP file name "courseware", and two files are located one level below folder 31, that is, at the fourth level: file D41 "fontawesome-webfont" and file D42 "fontawesome-webfont.eot". The ZIP file extraction module executes the file name screening step to retain the file name "fontawesome-webfont" of file D41 and to discard the file name "fontawesome-webfont.eot" of file D42, thereby reducing the total length of the spliced file names. An overlong identification input may cause the encoding format identification process to stall or to produce occasional errors, so limiting the length of the identification input through the file name screening step reduces occasional errors in ZIP file decoding identification and increases its speed.
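One possible reading of the screening step is sketched below. The description does not state how the retained names are chosen, so the rule used here (keep the first `keep` entries per parent folder at the deepest level) is an assumption, as are the function name `screen_lowest_level` and the sample paths.

```python
def screen_lowest_level(paths, keep=1):
    """Drop surplus entries at the deepest level when there are 3+ levels."""
    split = [[p for p in path.split("/") if p] for path in paths]
    depth = max((len(parts) for parts in split), default=0)
    if depth < 3:
        return list(paths)  # shallow archives are left untouched
    kept, seen = [], {}
    for path, parts in zip(paths, split):
        if len(parts) == depth:
            parent = tuple(parts[:-1])
            seen[parent] = seen.get(parent, 0) + 1
            if seen[parent] > keep:
                continue  # discard surplus names under the same folder
        kept.append(path)
    return kept


# Usage with hypothetical paths modelled on the example above.
print(screen_lowest_level([
    "top/mid/FontAwesome/fontawesome-webfont",
    "top/mid/FontAwesome/fontawesome-webfont.eot",
    "top/mid/index",
]))
```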

The above-described embodiments of the ZIP file decoding and identifying system are merely illustrative, wherein the modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Finally, it should be noted that the decoding identification method and the encoding correction method for ZIP files disclosed in the embodiments of the present invention are only preferred embodiments, used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A ZIP file decoding identification method, characterized by comprising the following identification steps: obtaining directory source data of the ZIP file, extracting file name data from the obtained directory source data, identifying the encoding format of the extracted file name data, and decoding the ZIP file with the identified encoding format as the source encoding format of the ZIP file; wherein the extracted file name data comprises a plurality of spliced file names, and when the encoding format of the extracted file name data is identified, the plurality of spliced file names are used as a single identification input to identify the encoding format.
2. The ZIP file decoding recognition method of claim 1, wherein the concatenating the plurality of filenames comprises: and sequentially splicing all the file names in the acquired directory source data from the upper layer to the lower layer according to the hierarchical sequence of the file names in the directory source data.
3. The ZIP file decoding recognition method as claimed in claim 1, wherein the recognizing the encoding format of the extracted filename data includes an encoding detection step of: and matching and detecting the extracted file name data in a preset coding format library, and if a coding format with the matching degree of the file name data reaching a preset degree is detected in the coding format library, taking the detected coding format as the source coding format of the ZIP file.
4. The ZIP file decoding identification method of claim 3, wherein the encoding detection step uses an encoding identification tool for the matching detection, the encoding identification tool comprising one or more of universalchardet, iconv character encoding conversion, ICU string encoding detection and enca encoding conversion.
5. The ZIP file decoding identification method of claim 1 or 2, wherein obtaining the directory source data of the ZIP file in the identification step comprises a mapping conversion step: establishing a restricted-character mapping table for a plurality of file name data in the directory source data, wherein the file name data is extracted by referencing the mapping table through links, and the restricted characters comprise one or more of letters, digits, underscores and hyphens.
6. The ZIP file decoding identification method of claim 5, wherein in the mapping conversion step, establishing the restricted-character mapping table for the plurality of file name data in the directory source data means: obtaining the hierarchical structure of the directory source data, configuring restricted characters for the files of each level of the hierarchical structure, and constructing a mapping table in which each configured restricted-character string points to the file name of its file.
7. The ZIP file decoding identification method of claim 6, wherein configuring the restricted characters comprises using the unique string generator nanoid to randomly generate a character string composed of restricted characters for each of the files of each level.
8. An encoding correction method for a ZIP file, characterized by comprising the following step: reading an extra header field of the ZIP file and checking whether the field carries the UTF-8 flag bit, and if the field does not carry the UTF-8 flag bit, executing the ZIP file decoding identification method according to any one of claims 1-7 to identify the encoding format of the ZIP file, so as to decode the ZIP file in the identified encoding format.
9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a controller, is capable of implementing the decoding recognition method for the ZIP file according to any one of claims 1 to 7 and/or the encoding modification method for the ZIP file according to claim 8.
10. A decoding identification system for a ZIP file, comprising a ZIP file extraction module for obtaining directory source data of a ZIP file and extracting file name data from the obtained directory source data, and an encoding format identification module for identifying the encoding format of the file name data extracted by the ZIP file extraction module, and further comprising a processor in which the computer-readable storage medium of claim 9 is pre-stored, wherein the computer program on the computer-readable storage medium is executable by the processor.