CN115712599A - Method and device for detecting file codes, storage medium and electronic equipment - Google Patents

Method and device for detecting file codes, storage medium and electronic equipment

Info

Publication number
CN115712599A
Authority
CN
China
Prior art keywords
file
detected
byte stream
encoding
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211521149.5A
Other languages
Chinese (zh)
Inventor
朱宏波
徐东明
马单
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202211521149.5A priority Critical patent/CN115712599A/en
Publication of CN115712599A publication Critical patent/CN115712599A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for detecting file encoding, a storage medium, and electronic equipment. The method comprises: receiving a file to be detected; acquiring target bytes of the byte stream data corresponding to the file to be detected, and judging from the target bytes whether the file to be detected is of a canonical encoding format type; when the file to be detected is of a non-canonical encoding format type, uniformly dividing the byte stream data to which the file belongs to obtain a byte stream segment set, wherein the non-canonical encoding format type comprises a plurality of specified encoding formats; and determining the encoding format of the file to be detected, according to the encoding format of each byte stream segment, from the encoding format set corresponding to the non-canonical encoding format type. The method and the device solve the technical problem of low detection efficiency caused by a single encoding format type detection method in the prior art.

Description

Method and device for detecting file codes, storage medium and electronic equipment
Technical Field
The present application relates to the field of encoding, and in particular, to a method, an apparatus, a storage medium, and an electronic device for detecting a file encoding.
Background
As the range of computer program applications expands, programs need to process more and more types of data. Some of this data is stored in a database (e.g., Oracle or MySQL), some in caching or messaging middleware (e.g., Redis or RabbitMQ), and some directly on a server in the form of text files. These files are generated on different platforms and in different ways, so they come in multiple encoding formats, such as UTF-8, GBK, and ISO-8859-1. When a text file stored on a server is read or sent, improper handling easily produces garbled text, and reading the garbled file again carries a risk of losing file information.
Existing approaches to detecting file encoding formats include methods for canonically encoded files and methods for non-canonically encoded files, but each targets only a single encoding format type, so detection efficiency is low.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the application provides a method, a device, a storage medium and electronic equipment for detecting file codes, so as to at least solve the technical problem of low detection efficiency caused by a single coding format type detection method in the prior art.
According to an aspect of the embodiments of the present application, there is provided a method for detecting file encoding, including: receiving a file to be detected; acquiring target bytes of byte stream data corresponding to a file to be detected, and judging whether the file to be detected is of a standard coding format type or not according to the target bytes; under the condition that the file to be detected is in a non-standard coding format type, uniformly dividing byte stream data to which the file to be detected belongs to obtain a byte stream segment set, wherein the non-standard coding format type comprises a plurality of specified coding formats; and determining the encoding format of the file to be detected from the encoding format set corresponding to the non-standard encoding format type according to the encoding format of each byte stream segment.
Optionally, judging whether the file to be detected is of a canonical encoding format type according to the target byte includes: judging whether the target byte has a byte order mark BOM or not; determining the encoding format type of the file to be detected as a standard encoding format type under the condition that the target byte has the BOM; and under the condition that the target byte does not have the BOM, determining that the encoding format type of the file to be detected is a non-standard encoding format type.
Optionally, the uniformly dividing the byte stream data to which the file to be detected belongs includes: determining a total length of byte stream data; determining a unit length from the total length and a predetermined number, wherein the sum of the unit lengths of the predetermined number is equal to the total length; the byte stream data is uniformly divided by unit length.
Optionally, determining the encoding format of the file to be detected from the encoding format set corresponding to the non-canonical encoding format type according to the encoding format of each byte stream segment includes: detecting the coding format of each byte stream segment; determining the number of byte stream segments in a byte stream segment set, wherein the byte stream segments belong to a messy code format; comparing the number of byte stream segments with a preset threshold; and judging whether the file to be detected is in a messy code format or not according to the comparison result.
Optionally, judging whether the file to be detected is in a messy code format according to the comparison result, including: determining that the file to be detected is in a messy code format under the condition that the number of the byte stream segments is greater than a preset threshold value; under the condition that the number of the byte stream segments is smaller than a preset threshold value, determining the number of the byte stream segments of other coding formats in the byte stream segment set; and determining the encoding format of the file to be detected according to the number of the byte stream segments in other encoding formats, wherein the other encoding formats are encoding formats except the messy code format.
Optionally, determining the encoding format of the file to be detected according to the number of byte stream segments in other encoding formats includes: determining the encoding format shared by the largest number of byte stream segments in the byte stream segment set, and taking that encoding format as the encoding format of the file to be detected.
Optionally, determining the encoding format shared by the largest number of byte stream segments in the byte stream segment set includes: when the messy code format accounts for the largest number of byte stream segments in the set but that number does not exceed the preset threshold, determining the number of byte stream segments in each of the other encoding formats; and determining the encoding format, other than the messy code format, that is shared by the largest number of byte stream segments in the set.
According to another aspect of the embodiments of the present application, there is also provided a method for detecting a file encoding, including: receiving, by a target device, a file to be detected; determining the encoding type of the file to be detected, wherein the encoding type is one of a canonical encoding format type and a non-canonical encoding format type; determining, from detection methods pre-stored on the target device, a target detection method corresponding to the encoding type, wherein a detection method is used for detecting the encoding format of the file to be detected, and the pre-stored detection methods include a first detection method corresponding to the canonical encoding format type and a second detection method corresponding to the non-canonical encoding format type; and determining the encoding format of the file to be detected by using the target detection method.
According to another aspect of the embodiments of the present application, there is also provided an apparatus for detecting a file encoding, including: a receiving module, configured to receive a file to be detected; a judging module, configured to acquire target bytes of the byte stream data corresponding to the file to be detected and to judge from the target bytes whether the file to be detected is of a canonical encoding format type; a dividing module, configured to uniformly divide the byte stream data of the file to be detected, when the file is of a non-canonical encoding format type, to obtain a byte stream segment set, wherein the non-canonical encoding format type includes a plurality of specified encoding formats; and a determining module, configured to determine the encoding format of the file to be detected, according to the encoding format of each byte stream segment, from the encoding format set corresponding to the non-canonical encoding format type.
According to another aspect of the embodiments of the present application, there is also provided a nonvolatile storage medium comprising a stored program, wherein, when the program runs, the device on which the storage medium is located is controlled to execute any of the above methods for detecting file encoding.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement any method for detecting the encoding of the file.
In the embodiments of the present application, the detection method is selected according to the encoding type of the file to be detected: the file to be detected is received; target bytes of the byte stream data corresponding to the file are acquired, and whether the file is of a canonical encoding format type is judged according to the target bytes; when the file is of a non-canonical encoding format type, the byte stream data to which the file belongs is uniformly divided to obtain a byte stream segment set; and the encoding format of the file is determined, according to the encoding format of each byte stream segment, from the encoding format set corresponding to the non-canonical encoding format type. This improves detection efficiency, reduces the risk of file loss caused by garbled text when the file is read or sent, and thereby solves the technical problem of low detection efficiency caused by a single encoding format type detection method in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating a method for detecting document encoding according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative method of detecting document encoding according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an apparatus for detecting document encoding according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of an electronic device 400 according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For the convenience of better understanding of the embodiments related to the present application, technical terms or partial terms that may be referred to in the present application are now explained:
BOM (byte order mark): a special mark inserted at the beginning of data encoded in UTF-8, UTF-16, or UTF-32 that identifies the encoding type of the data. The code table defines a character U+FEFF that has no visible symbol of its own. When U+FEFF appears at the beginning of a byte stream, it identifies the byte order (endianness) of the stream, that is, whether the high-order byte precedes the low-order byte.
In accordance with an embodiment of the present application, there is provided an embodiment of a method for detecting file encoding, it should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than that presented herein.
FIG. 1 is a flowchart of a method for detecting a file encoding according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
Step S102, receiving a file to be detected;
Step S104, acquiring target bytes of the byte stream data corresponding to the file to be detected, and judging from the target bytes whether the file to be detected is of a canonical encoding format type;
It should be noted that the canonical encoding format type means that the file to be detected contains only one encoding format.
Step S106, when the file to be detected is of a non-canonical encoding format type, uniformly dividing the byte stream data to which the file to be detected belongs to obtain a byte stream segment set, wherein the non-canonical encoding format type includes a plurality of specified encoding formats;
It should be noted that the non-canonical encoding format type means that the file to be detected may contain multiple encoding formats, where the encoding formats include, but are not limited to, the Unicode encoding format, the Unicode big endian encoding format, and the UTF-8 encoding format.
Step S108, determining the encoding format of the file to be detected, according to the encoding format of each byte stream segment, from the encoding format set corresponding to the non-canonical encoding format type.
It is understood that the non-canonical encoding format type covers many encoding formats; that is, the encoding format set corresponding to the non-canonical encoding format type includes, but is not limited to, the Unicode encoding format, the Unicode big endian encoding format, and the UTF-8 encoding format.
In the embodiments of the present application, the detection method is selected according to the encoding type of the file to be detected: the file to be detected is received; target bytes of the byte stream data corresponding to the file are acquired, and whether the file is of a canonical encoding format type is judged according to the target bytes; when the file is of a non-canonical encoding format type, the byte stream data to which the file belongs is uniformly divided to obtain a byte stream segment set; and the encoding format of the file is determined, according to the encoding format of each byte stream segment, from the encoding format set corresponding to the non-canonical encoding format type. This improves detection efficiency, reduces the risk of file loss caused by garbled text when the file is read or sent, and thereby solves the technical problem of low detection efficiency caused by a single encoding format type detection method in the prior art.
In an exemplary embodiment of the present application, determining whether the file to be detected is of a canonical encoding format type according to the target byte includes: judging whether the target byte has a byte order mark BOM or not; determining the encoding format type of the file to be detected as a standard encoding format type under the condition that the target byte has the BOM; and under the condition that the target byte does not have the BOM, determining that the encoding format type of the file to be detected is a non-standard encoding format type.
For example, when the byte order mark is FF FE, the encoding format of the file to be detected is determined to be Unicode (that is, UTF-16 little endian); when the byte order mark is FE FF, the encoding format is determined to be Unicode big endian (that is, UTF-16 big endian); and when the byte order mark is EF BB BF, the encoding format is determined to be UTF-8.
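The BOM check described above can be sketched in Go as follows. This is a minimal illustration rather than the application's actual code: the function name `detectBOM` and the convention of returning an empty string when no BOM is found are assumptions made for the example.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectBOM (hypothetical name) inspects the leading bytes of data and
// returns the encoding name used in the text, or "" when no BOM is present.
func detectBOM(data []byte) string {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8"
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "Unicode" // UTF-16 little endian
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "Unicode big endian" // UTF-16 big endian
	default:
		return "" // no BOM: treated as the non-canonical type
	}
}

func main() {
	fmt.Println(detectBOM([]byte{0xEF, 0xBB, 0xBF, 'h', 'i'}))
	fmt.Println(detectBOM([]byte("plain text")) == "")
}
```

A file for which the function returns an empty string would then be handed to the segment-based detection.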
As an optional implementation manner, the uniformly dividing the byte stream data to which the file to be detected belongs includes: determining a total length of byte stream data; determining a unit length from the total length and a predetermined number, wherein the sum of the unit lengths of the predetermined number is equal to the total length; the byte stream data is divided evenly according to unit length.
For example, byte stream data of length 10 may be divided uniformly into 10 segments. The larger the number of segments, the more accurate the finally determined encoding format.
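The uniform division can be sketched as below. The text only covers the case in which the unit lengths sum exactly to the total length; letting the last segment absorb any remainder, and clamping the segment count to the data length, are assumptions of this sketch, as is the name `splitEven`.

```go
package main

import "fmt"

// splitEven (hypothetical name) divides data into n segments of equal unit
// length. Assumption: when len(data) is not a multiple of n, the last
// segment absorbs the remaining bytes so that no data is dropped.
func splitEven(data []byte, n int) [][]byte {
	if n <= 0 || len(data) == 0 {
		return nil
	}
	if n > len(data) {
		n = len(data) // the segment count cannot exceed the data length
	}
	unit := len(data) / n
	segments := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		start := i * unit
		end := start + unit
		if i == n-1 {
			end = len(data)
		}
		segments = append(segments, data[start:end])
	}
	return segments
}

func main() {
	for _, seg := range splitEven([]byte("abcdefg"), 3) {
		fmt.Printf("%q\n", seg)
	}
}
```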
In some optional embodiments of the present application, determining, according to the encoding format of each byte stream segment, the encoding format of the file to be detected from the encoding format set corresponding to the non-canonical encoding format type includes: detecting the coding format of each byte stream segment; determining the number of the byte stream segments belonging to a messy code format in the byte stream segment set; comparing the number of byte stream segments with a preset threshold; and judging whether the file to be detected is in a messy code format or not according to the comparison result.
In an exemplary embodiment of the application, determining whether the file to be detected is in a messy code format according to the comparison result includes: determining that the file to be detected is in a messy code format when the number of byte stream segments in the messy code format is greater than a preset threshold; when that number is smaller than the preset threshold, determining the number of byte stream segments of the other encoding formats in the byte stream segment set; and determining the encoding format of the file to be detected according to the number of byte stream segments in the other encoding formats, wherein the other encoding formats are the encoding formats other than the messy code format.
For example, with a preset threshold of 50: if 60 byte stream segments are in the messy code format, that number exceeds the threshold and the file to be detected is in the messy code format; if 40 byte stream segments are in the messy code format, that number is below the threshold and the file to be detected is not in the messy code format.
Optionally, determining the encoding format of the file to be detected according to the number of byte stream segments in other encoding formats includes: determining the encoding format shared by the largest number of byte stream segments in the byte stream segment set, and taking that encoding format as the encoding format of the file to be detected.
In some optional embodiments of the present application, determining the encoding format shared by the largest number of byte stream segments in the byte stream segment set includes: when the messy code format accounts for the largest number of byte stream segments in the set but that number does not exceed the preset threshold, determining the number of byte stream segments in each of the other encoding formats; and determining the encoding format, other than the messy code format, that is shared by the largest number of byte stream segments in the set.
For example, suppose the preset threshold is 50 and the byte stream segments comprise 40 in the messy code format, 30 in the Unicode encoding format, 20 in the Unicode big endian encoding format, and 10 in the UTF-8 encoding format. The messy code format has the largest number of segments, but that number is below the preset threshold, so the encoding format of the file to be detected is the one with the largest number of segments apart from the messy code format, namely the Unicode encoding format.
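The threshold comparison and the vote among the remaining formats can be sketched in Go as follows. The function name `decide`, the label "garbled" for the messy code format, and the alphabetical tie-break are assumptions of this sketch. The text does not say what happens when the count equals the threshold; here that case is treated as not garbled.

```go
package main

import "fmt"

// garbled is a hypothetical label for the messy code format.
const garbled = "garbled"

// decide (hypothetical name) applies the threshold to the garbled count and
// otherwise returns the format with the most segments, excluding garbled.
// Assumption: ties are broken alphabetically so the result is deterministic.
func decide(counts map[string]int, threshold int) string {
	if counts[garbled] > threshold {
		return garbled
	}
	best, bestCount := "", 0
	for name, c := range counts {
		if name == garbled {
			continue
		}
		if c > bestCount || (c == bestCount && c > 0 && name < best) {
			best, bestCount = name, c
		}
	}
	return best
}

func main() {
	counts := map[string]int{
		garbled:              40,
		"Unicode":            30,
		"Unicode big endian": 20,
		"UTF-8":              10,
	}
	fmt.Println(decide(counts, 50)) // prints "Unicode"
}
```

With the counts from the example above and a threshold of 50, the garbled format is skipped even though it has the most segments, and the Unicode format is selected.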
Fig. 2 is a schematic diagram of another alternative method for detecting file encoding according to an embodiment of the present application, as shown in fig. 2, the method includes the following steps:
Step S202, a target device receives a file to be detected;
Step S204, determining the encoding type of the file to be detected, wherein the encoding type is one of a canonical encoding format type and a non-canonical encoding format type;
Step S206, determining, from the detection methods pre-stored on the target device, a target detection method corresponding to the encoding type, wherein a detection method is used to detect the encoding format of the file to be detected, and the pre-stored detection methods include a first detection method corresponding to the canonical encoding format type and a second detection method corresponding to the non-canonical encoding format type;
Step S208, determining the encoding format of the file to be detected by using the target detection method.
It is easy to note that this detection method can detect files of different coding types, instead of detecting only for a single coding type.
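Steps S204 to S208 amount to a lookup of a pre-stored method by encoding type. The Go sketch below illustrates one way this could look. All identifiers (`encodingType`, `detector`, `typeOf`, `detect`) are hypothetical, and the two detection methods are placeholders that only return a label.

```go
package main

import (
	"bytes"
	"fmt"
)

// encodingType distinguishes the two types named in step S204.
type encodingType int

const (
	canonical    encodingType = iota // a BOM is present
	nonCanonical                     // no BOM is present
)

// detector stands for a detection method pre-stored on the target device.
type detector func(data []byte) string

// typeOf determines the encoding type from the leading bytes (BOM check).
func typeOf(data []byte) encodingType {
	boms := [][]byte{{0xEF, 0xBB, 0xBF}, {0xFF, 0xFE}, {0xFE, 0xFF}}
	for _, bom := range boms {
		if bytes.HasPrefix(data, bom) {
			return canonical
		}
	}
	return nonCanonical
}

// detect selects the method matching the file's encoding type (S206) and
// applies it (S208). It returns "" when no method is stored for the type.
func detect(data []byte, methods map[encodingType]detector) string {
	method, ok := methods[typeOf(data)]
	if !ok {
		return ""
	}
	return method(data)
}

func main() {
	methods := map[encodingType]detector{
		canonical:    func([]byte) string { return "first method (BOM lookup)" },
		nonCanonical: func([]byte) string { return "second method (segment vote)" },
	}
	fmt.Println(detect([]byte{0xFF, 0xFE, 'a', 0}, methods))
	fmt.Println(detect([]byte("no BOM here"), methods))
}
```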
To facilitate better understanding of the technical solutions of the present application by those skilled in the art, the description is now made with reference to a specific embodiment, and the method includes the following steps:
(1) Package the Go detection code;
(2) Import the package wherever it is needed;
(3) Place the target file under a given path;
(4) Run the detection program on the file to be detected;
(5) If the target file uses a canonical encoding, inspect its first three bytes directly and output the file format. When the file to be detected uses a canonical encoding:
the data in the file is stored according to a given character set and the encoding information is stored in the first three bytes of the file, so the encoding format is determined from the values of those bytes;
if the first two bytes are FF FE, the encoding format of the file to be detected is Unicode;
if the first two bytes are FE FF, the encoding format of the file to be detected is Unicode big endian;
if the first three bytes are EF BB BF, the encoding format of the file to be detected is UTF-8;
it should be noted that if the first three bytes are -17, -69, -65 (the values that EF BB BF take when each byte is read as a signed 8-bit integer), the encoding format of the file to be detected is also UTF-8.
(6) If the target file uses a non-canonical encoding, identify the encoding format of the file to be detected by using an improved detector function. When the file to be detected uses a non-canonical encoding or is garbled:
the data in the file may not be stored according to a given character set, so no encoding information is stored in the leading bytes of the file, and data in multiple encoding formats may coexist in the file;
divide the byte stream data corresponding to the file to be detected into n equal parts, where n is less than or equal to the length of the file;
detect the encoding format of each byte stream segment in the resulting byte stream segment set;
output the encoding format that occurs most often as the encoding format of the file to be detected;
it should be noted that the larger n is, the more accurate the finally obtained encoding format.
Set a garbled-file threshold: when the number of byte stream segments in the messy code format is greater than the threshold, the file to be detected is determined to be in the messy code format. If the messy code format accounts for the largest number of byte stream segments but does not exceed the threshold, the encoding format of the file to be detected is the one with the largest number of byte stream segments among the formats other than the messy code format.
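The per-segment detection and counting in step (6) can be sketched as follows. The application does not specify how a single segment is classified, so `classifySegment` is a hypothetical stand-in: it uses the standard library's `utf8.Valid` and labels everything else as garbled, whereas a real detector would distinguish more formats.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// garbled is a hypothetical label for the messy code format.
const garbled = "garbled"

// classifySegment is a hypothetical stand-in for the per-segment detector:
// a segment that is valid UTF-8 is labeled "UTF-8", anything else garbled.
func classifySegment(seg []byte) string {
	if utf8.Valid(seg) {
		return "UTF-8"
	}
	return garbled
}

// tally counts how many segments were assigned to each encoding format.
func tally(segments [][]byte) map[string]int {
	counts := make(map[string]int)
	for _, seg := range segments {
		counts[classifySegment(seg)]++
	}
	return counts
}

func main() {
	segments := [][]byte{
		[]byte("hello"),
		[]byte("世界"),
		[]byte{0xFF, 0xFE, 0xFD}, // not valid UTF-8
	}
	counts := tally(segments)
	fmt.Println(counts["UTF-8"], counts[garbled]) // prints "2 1"
}
```

One point to consider with this approach: cutting the byte stream at fixed offsets can split a multi-byte character across two segments, so a classifier like the one above may label the segments on either side of a cut as garbled even when the file as a whole is well formed.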
It is easy to notice that the detection method adopted by the application can quickly and accurately identify the coding format of the file to be detected, and is suitable for standard coding, non-standard coding and messy code files. Meanwhile, the method is easy to deploy and has good implementability.
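The signed values -17, -69, -65 mentioned in step (5) are the UTF-8 BOM bytes EF BB BF reinterpreted as signed 8-bit integers, which is how they appear in languages whose byte type is signed. A short Go check of that arithmetic (the helper name `asSigned` is an assumption of this sketch):

```go
package main

import "fmt"

// asSigned (hypothetical name) reinterprets each byte as a signed 8-bit
// integer, so values of 0x80 and above become negative.
func asSigned(data []byte) []int8 {
	out := make([]int8, len(data))
	for i, b := range data {
		out[i] = int8(b)
	}
	return out
}

func main() {
	fmt.Println(asSigned([]byte{0xEF, 0xBB, 0xBF})) // prints [-17 -69 -65]
}
```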
Fig. 3 is a device for detecting document encoding according to an embodiment of the present application, as shown in fig. 3, the device includes:
a receiving module 30, configured to receive a file to be detected;
the judging module 32 is configured to obtain a target byte of byte stream data corresponding to the file to be detected, and judge whether the file to be detected is of a standard encoding format type according to the target byte;
the dividing module 34 is configured to, when the file to be detected is of a non-canonical encoding format type, uniformly divide the byte stream data to which the file to be detected belongs to obtain a byte stream segment set, where the non-canonical encoding format type includes multiple specified encoding formats;
and the determining module 36 is configured to determine, according to the encoding format of each byte stream segment, the encoding format of the file to be detected from the encoding format set corresponding to the non-canonical encoding format type.
In the device, a receiving module 30 is used for receiving a file to be detected; the judging module 32 is configured to obtain a target byte of byte stream data corresponding to the file to be detected, and judge whether the file to be detected is of a standard encoding format type according to the target byte; the dividing module 34 is configured to, when the file to be detected is of a non-canonical encoding format type, uniformly divide the byte stream data to which the file to be detected belongs to obtain a byte stream segment set, where the non-canonical encoding format type includes multiple specified encoding formats; the determining module 36 is configured to determine the encoding format of the file to be detected from the encoding format set corresponding to the non-canonical encoding format type according to the encoding format of each byte stream segment, so as to achieve the purpose of improving the detection efficiency, thereby achieving the technical effect of reducing the risk of file loss caused by code confusion when the file is read or sent, and further solving the technical problem of low detection efficiency in the prior art due to a single encoding format type detection method.
In some embodiments of the present application, the determining module 32 is further configured to determine whether the target byte has a byte order marker BOM; determining the encoding format type of the file to be detected as a standard encoding format type under the condition that the target byte has the BOM; and under the condition that the target byte does not have the BOM, determining that the encoding format type of the file to be detected is a non-standard encoding format type.
The dividing module 34 is further configured to determine a total length of the byte stream data; determining a unit length from the total length and a predetermined number, wherein the sum of the unit lengths of the predetermined number is equal to the total length; the byte stream data is divided evenly according to unit length.
The determining module 36 is further configured to determine the encoding format shared by the largest number of byte stream segments in the byte stream segment set, and to use that encoding format as the encoding format of the file to be detected.
The embodiment of the application also provides a nonvolatile storage medium, which comprises a stored program, wherein when the program runs, a device where the nonvolatile storage medium is located is controlled to execute any method for detecting the file codes.
Specifically, the storage medium is used for storing program instructions of the following functions, and the following functions are realized:
receiving a file to be detected; obtaining target bytes of the byte stream data corresponding to the file to be detected, and judging, according to the target bytes, whether the file to be detected is of a standard encoding format type; when the file to be detected is of a non-standard encoding format type, uniformly dividing the byte stream data of the file to be detected to obtain a byte stream segment set, where the non-standard encoding format type includes multiple specified encoding formats; and determining, according to the encoding format of each byte stream segment, the encoding format of the file to be detected from the encoding format set corresponding to the non-standard encoding format type.
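The sequence of functions above can be sketched end to end as follows. This is an illustration under stated assumptions, not the claimed implementation: the candidate list `CANDIDATES`, the trial-decoding classifier `classify_segment`, the `"garbled"` label, the default `segment_count` and `garbled_threshold`, and the ceiling-based unit length are all hypothetical choices. The garbled-format threshold follows the comparison described in claims 4 and 5.

```python
import codecs
from collections import Counter

BOMS = [  # longest signatures first; the recognised set is an assumption
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
CANDIDATES = ("utf-8", "gb18030")  # hypothetical "specified encoding formats"
GARBLED = "garbled"


def classify_segment(segment: bytes) -> str:
    """Label a segment with the first candidate that decodes it cleanly."""
    for name in CANDIDATES:
        try:
            segment.decode(name)
            return name
        except UnicodeDecodeError:
            pass
    return GARBLED


def detect_file_encoding(data: bytes, segment_count=4, garbled_threshold=2):
    # Step 1: a BOM in the leading bytes means a standard encoding format type.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 2: non-standard type, so divide the byte stream into uniform segments.
    unit = max(1, -(-len(data) // segment_count))  # ceiling division
    segments = [data[i:i + unit] for i in range(0, len(data), unit)]
    # Step 3: classify each segment and count the results.
    counts = Counter(classify_segment(s) for s in segments)
    if counts[GARBLED] > garbled_threshold:
        return GARBLED
    # Otherwise vote among the non-garbled formats only.
    counts.pop(GARBLED, None)
    if not counts:
        return None  # no decision; this case is not covered by the text
    return counts.most_common(1)[0][0]
```

Trial decoding is only one possible per-segment detector, and it is sensitive to multi-byte characters cut at segment boundaries; a statistical detector could be substituted for `classify_segment` without changing the surrounding flow.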
Optionally, in this embodiment, the storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In an exemplary embodiment of the present application, there is also provided a computer program product, comprising a computer program, which when executed by a processor, implements any of the above methods for detecting file encoding.
Optionally, the computer program may, when executed by a processor, implement the steps of:
receiving a file to be detected; obtaining target bytes of the byte stream data corresponding to the file to be detected, and judging, according to the target bytes, whether the file to be detected is of a standard encoding format type; when the file to be detected is of a non-standard encoding format type, uniformly dividing the byte stream data of the file to be detected to obtain a byte stream segment set, where the non-standard encoding format type includes multiple specified encoding formats; and determining, according to the encoding format of each byte stream segment, the encoding format of the file to be detected from the encoding format set corresponding to the non-standard encoding format type.
An embodiment according to the present application provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any one of the above methods for detecting file encoding.
Optionally, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Fig. 4 is a schematic block diagram of an electronic device 400 according to an embodiment of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in Fig. 4, the device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. The RAM 403 can also store various programs and data required for the operation of the device 400. The computing unit 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
A number of components in device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, or the like; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408 such as a magnetic disk, optical disk, or the like; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 401 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 401 executes the methods and processes described above, such as the method for detecting file encoding. For example, in some embodiments, the method for detecting file encoding may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the method for detecting file encoding described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the method for detecting file encoding by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SoCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present application, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (11)

1. A method for detecting file encoding, comprising:
receiving a file to be detected;
obtaining target bytes of byte stream data corresponding to the file to be detected, and judging, according to the target bytes, whether the file to be detected is of a standard encoding format type;
when the file to be detected is of a non-standard encoding format type, uniformly dividing the byte stream data of the file to be detected to obtain a byte stream segment set, wherein the non-standard encoding format type comprises a plurality of specified encoding formats;
and determining, according to the encoding format of each byte stream segment, the encoding format of the file to be detected from an encoding format set corresponding to the non-standard encoding format type.
2. The method according to claim 1, wherein judging whether the file to be detected is of a standard encoding format type according to the target bytes comprises:
judging whether the target bytes contain a byte order mark (BOM);
determining that the encoding format type of the file to be detected is a standard encoding format type when the target bytes contain the BOM;
and determining that the encoding format type of the file to be detected is a non-standard encoding format type when the target bytes do not contain the BOM.
3. The method according to claim 1, wherein the uniformly dividing the byte stream data to which the file to be detected belongs comprises:
determining a total length of the byte stream data;
determining a unit length from the total length and a predetermined number, wherein the sum of the predetermined number of unit lengths equals the total length;
and uniformly dividing the byte stream data according to the unit length.
4. The method according to claim 1, wherein determining the encoding format of the file to be detected from the encoding format set corresponding to the non-standard encoding format type according to the encoding format of each byte stream segment comprises:
detecting the encoding format of each of the byte stream segments;
determining the number of byte stream segments in the byte stream segment set that are in a garbled format;
comparing the number of byte stream segments in the garbled format with a preset threshold;
and judging, according to the comparison result, whether the file to be detected is in the garbled format.
5. The method of claim 4, wherein judging whether the file to be detected is in the garbled format according to the comparison result comprises:
determining that the file to be detected is in the garbled format when the number of byte stream segments in the garbled format is larger than the preset threshold;
when the number of byte stream segments in the garbled format is smaller than the preset threshold, determining the number of byte stream segments of each other encoding format in the byte stream segment set;
and determining the encoding format of the file to be detected according to the number of byte stream segments in the other encoding formats, wherein the other encoding formats are encoding formats other than the garbled format.
6. The method according to claim 5, wherein determining the encoding format of the file to be detected according to the number of byte stream segments in the other encoding formats comprises:
determining the encoding format shared by the largest number of byte stream segments in the byte stream segment set, and taking that encoding format as the encoding format of the file to be detected.
7. The method of claim 6, wherein determining the encoding format shared by the largest number of byte stream segments in the byte stream segment set comprises:
when the garbled format accounts for the largest number of byte stream segments in the byte stream segment set but that number does not exceed the preset threshold, determining the number of byte stream segments of each other encoding format in the byte stream segment set;
and determining, among the encoding formats other than the garbled format, the encoding format shared by the largest number of byte stream segments.
8. A method for detecting file encoding, comprising:
receiving, by a target device, a file to be detected;
determining the encoding type of the file to be detected, wherein the encoding type comprises: a standard encoding format type and a non-standard encoding format type;
determining, from detection methods prestored on the target device, a target detection method corresponding to the encoding type, wherein the detection methods are used for detecting the encoding format of the file to be detected and comprise: a first detection method corresponding to the standard encoding format type and a second detection method corresponding to the non-standard encoding format type;
and determining the encoding format of the file to be detected by using the target detection method.
9. An apparatus for detecting file encoding, comprising:
the receiving module is used for receiving the file to be detected;
the judging module is used for obtaining target bytes of byte stream data corresponding to the file to be detected and judging, according to the target bytes, whether the file to be detected is of a standard encoding format type;
the dividing module is used for uniformly dividing the byte stream data of the file to be detected, when the file to be detected is of a non-standard encoding format type, to obtain a byte stream segment set, wherein the non-standard encoding format type comprises a plurality of specified encoding formats;
and the determining module is used for determining, according to the encoding format of each byte stream segment, the encoding format of the file to be detected from an encoding format set corresponding to the non-standard encoding format type.
10. A non-volatile storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, the device where the storage medium is located is controlled to execute the method for detecting file encoding according to any one of claims 1 to 8.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of detecting file encoding according to any one of claims 1 to 8.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211521149.5A CN115712599A (en) 2022-11-30 2022-11-30 Method and device for detecting file codes, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211521149.5A CN115712599A (en) 2022-11-30 2022-11-30 Method and device for detecting file codes, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115712599A true CN115712599A (en) 2023-02-24

Family

ID=85235525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211521149.5A Pending CN115712599A (en) 2022-11-30 2022-11-30 Method and device for detecting file codes, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115712599A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117391070A (en) * 2023-12-08 2024-01-12 和元达信息科技有限公司 Method and system for adjusting random character
CN117391070B (en) * 2023-12-08 2024-03-22 和元达信息科技有限公司 Method and system for adjusting random character

Similar Documents

Publication Publication Date Title
CN115712599A (en) Method and device for detecting file codes, storage medium and electronic equipment
CN112765324B (en) Concept drift detection method and device
CN112818387A (en) Method, apparatus, storage medium, and program product for model parameter adjustment
CN112822265A (en) Data encoding method, device, equipment end and storage medium
CN111898340A (en) File processing method and device and readable storage medium
WO2016127858A1 (en) Method and device for identifying webpage intrusion script features
CN113111200B (en) Method, device, electronic equipment and storage medium for auditing picture files
CN109214846B (en) Information storage method and device
CN113112472B (en) Image processing method and device
CN113365140B (en) MP4 online playing method, device, equipment, storage medium and program product
WO2022088381A1 (en) Safety monitoring method and apparatus for cast iron production, and server
CN114629707A (en) Method and device for detecting messy codes, electronic equipment and storage medium
CN116405210B (en) Network message label confusion method and device and electronic equipment
CN112487765A (en) Method and device for generating notification text
US9722631B2 (en) Method and apparatus for calculating estimated data compression ratio
CN113674246B (en) Method, device, electronic equipment and storage medium for auditing picture files
CN113283215B (en) Data confusion method and device based on UTF-32 coding
CN113591440B (en) Text processing method and device and electronic equipment
CN115328497A (en) File merging method and device, electronic equipment and readable storage medium
CN114489764A (en) Data processing method and device, electronic equipment and storage medium
CN114692592A (en) Word information processing method and device
CN115529346A (en) Service changing method, device, equipment and storage medium
CN113641885A (en) Document detection method, device, equipment and storage medium
CN113656836A (en) Document processing method, device, equipment, storage medium and computer program product
CN117318728A (en) Information compression and information decompression methods, devices, equipment, media and products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination