CN111459703B

CN111459703B - Coding detection method and system

Info

Publication number: CN111459703B
Application number: CN201910005269.1A
Authority: CN
Inventors: 徐佳宏; 朱吕亮; 梁达源
Original assignee: Shenzhen Ipanel TV Inc
Current assignee: Shenzhen Ipanel TV Inc
Priority date: 2019-01-03
Filing date: 2019-01-03
Publication date: 2024-03-19
Anticipated expiration: 2039-01-03
Also published as: CN111459703A

Abstract

The invention discloses a coding detection method and a coding detection system. In the process of downloading each segmented file of a page file to be downloaded, non-text format filtering and ASCII code data filtering are carried out in sequence on the downloaded segmented files to obtain second segmented files, which are in a text format and contain only non-ASCII code data. All the second segmented files are connected end to end in sequence to obtain a combined file, and when the total byte number of the non-ASCII code data contained in the combined file is not less than a preset threshold value, coding detection is carried out on the combined file. Because the filtered second segmented files are combined before coding detection, mixed Chinese and English text, and Chinese characters that were split across different segmented files, are brought back into the same file, and the accuracy of the coding detection result is greatly improved.

Description

Coding detection method and system

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and a system for detecting codes.

Background

Multi-threaded downloading means that several threads, implemented in software or hardware, execute concurrently so that a file can be downloaded quickly in segments. Because multi-threaded downloading divides a page file into a plurality of segmented files, the prior art performs coding detection on a page file as follows: first, coding detection is performed separately on each received segmented file; then the coding detection results of all the segmented files of the page file are counted, and the result that occurs most often is selected as the coding detection result of the page file.

However, when the downloaded page file mixes Chinese and English, dividing it into a plurality of segments is likely to separate the Chinese text from the English text, and a single Chinese character may even be split across two segmented files. The detection results of the individual segmented files then differ from one another, and the finally determined coding detection result can contain a large error.
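The problem described above can be illustrated with a minimal sketch. The sample string, the split offset, and the helper names `split_at` and `decodes_cleanly` are assumptions made for illustration and do not come from the patent.

```python
from typing import Tuple

# Illustration only: a mixed Chinese/English page encoded as UTF-8.
text = "Hello 世界, welcome 欢迎"
data = text.encode("utf-8")

def split_at(payload: bytes, offset: int) -> Tuple[bytes, bytes]:
    """Cut a payload into two segments at a fixed byte offset."""
    return payload[:offset], payload[offset:]

def decodes_cleanly(chunk: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the chunk decodes without error in the given encoding."""
    try:
        chunk.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# "Hello " occupies bytes 0..5, so the 3-byte sequence for "世" occupies
# bytes 6..8. Cutting at offset 7 splits that character across two segments.
head, tail = split_at(data, 7)

print(decodes_cleanly(head))         # the first segment ends mid-character
print(decodes_cleanly(tail))         # the second segment starts mid-character
print(decodes_cleanly(head + tail))  # the rejoined data is intact again
```

Neither segment is valid UTF-8 on its own, which is why per-segment detection can disagree with the encoding of the page as a whole.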

Disclosure of Invention

In view of this, the present invention discloses a coding detection method and system, which solve the problem in the prior art that, because Chinese and English text are separated and a single Chinese character may even be split across two segmented files, the detection results of the segmented files differ and the finally determined coding detection result has a large error.

A code detection method, comprising:

receiving file downloading requests, creating a multithreaded downloading class instance with the same number as the file downloading requests according to the number of the file downloading requests, and downloading the segmented files of the page files to be downloaded;

in the process of downloading each segmented file of the page file to be downloaded, filtering the segmented files with non-text formats in the downloaded segmented files to obtain segmented files with text formats, and marking the segmented files as first segmented files;

filtering ASCII data in the data of each first segmented file to obtain a first segmented file only containing non-ASCII data, and recording the first segmented file as a second segmented file;

connecting the second segmented files end to end according to the sequence to obtain a combined file;

counting the total byte number of all non-ASCII code data in the combined file;

judging whether the total byte number is not smaller than a preset threshold value;

and if so, carrying out coding detection on the combined file.

Optionally, in the process of downloading each segment file of the page file to be downloaded, filtering the segment file in a non-text format in the downloaded segment file to obtain a segment file in a text format, and recording the segment file as a first segment file, which specifically includes:

in the process of downloading each segmented file of the page file to be downloaded, judging whether the downloaded current segmented file contains header information and whether the header information indicates that the current segmented file belongs to data in a text format;

if yes, creating an instance of a code detection class to mark the current segmented file and recording the current segmented file as the first segmented file;

if not, carrying out data analysis on the content of the current segmented file according to the coding specification, and continuously judging whether the current segmented file belongs to the text format data;

if yes, marking the current segmented file as the first segmented file;

and if not, filtering the current segmented file.

Optionally, the counting the total number of bytes of all non-ASCII code data in the merged file specifically includes:

and adding the byte numbers of the non-ASCII data in each second segmented file contained in the combined file to obtain the total byte number of all the non-ASCII data in the combined file.

Optionally, the preset threshold is 256 bytes.

A code detection system, comprising:

the receiving unit is used for receiving file downloading requests, creating a multithreaded downloading class instance with the same number as the file downloading requests according to the number of the file downloading requests, and downloading the segmented files of the page files to be downloaded;

the first filtering unit is used for filtering the non-text format segmented files in the downloaded segmented files in the process of downloading the segmented files of the page file to be downloaded to obtain text format segmented files, and recording the text format segmented files as first segmented files;

the second filtering unit is used for filtering ASCII (American Standard Code for Information Interchange) code data in the data of each first segmented file to obtain a first segmented file only containing non-ASCII code data and recording the first segmented file as a second segmented file;

the merging unit is used for connecting the second segmented files end to end according to the sequence to obtain merged files;

the statistics unit is used for counting the total byte number of all non-ASCII code data in the combined file;

the judging unit is used for judging whether the total byte quantity is not smaller than a preset threshold value;

and the coding detection unit is used for carrying out coding detection on the combined file when the judging unit judges yes.

Optionally, the first filtering unit specifically includes:

a first judging subunit, configured to judge, in a process of downloading each segment file of the page file to be downloaded, whether a current segment file after the downloading is completed contains header information, where the header information indicates that the current segment file belongs to data in a text format;

a creating subunit, configured to create an instance of a code detection class to mark the current segment file and record the current segment file as the first segment file if the first judging subunit judges yes;

the second judging subunit is used for carrying out data analysis on the content of the current segmented file according to the coding specification under the condition that the first judging subunit judges no, and continuously judging whether the current segmented file belongs to the data in the text format;

a marking subunit, configured to mark the current segmented file as the first segmented file in a case where the second judging subunit judges yes;

and the filtering subunit is used for filtering the current segmented file under the condition that the second judging subunit judges no.

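Optionally, the statistics unit is specifically configured to add the byte numbers of the non-ASCII code data in each second segmented file contained in the combined file to obtain the total byte number of all the non-ASCII code data in the combined file.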

Optionally, the preset threshold is 256 bytes.

As can be seen from the above technical solution, the present invention discloses a method and a system for detecting encoding, in the process of downloading each segment file of a page file to be downloaded, non-text format filtering and ASCII code data filtering are sequentially performed on the downloaded segment file to obtain second segment files only containing non-ASCII code data and having a text format, the second segment files are connected end to end in sequence to obtain a combined file, and when the total number of bytes of the non-ASCII code data contained in the combined file is not less than a preset threshold value, encoding detection is performed on the combined file. Compared with the prior art, the method and the device not only sequentially perform non-text format filtering and ASCII code data filtering on the downloaded segmented files, greatly reduce data processing amount, but also combine the second segmented files which only contain non-ASCII code data and are in text format, so that Chinese and English and Chinese characters segmented to different segmented files are combined into the same file before coding detection, and the accuracy of coding detection results is greatly improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the disclosed drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a code detection method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for filtering a non-text format segmented file from a downloaded segmented file to obtain a text format segmented file according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a coding detection system according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a first filtering unit according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment of the invention discloses a coding detection method and a coding detection system, wherein in the process of downloading each segmented file of a page file to be downloaded, non-text format filtering and ASCII code data filtering are sequentially carried out on the downloaded segmented file to obtain a second segmented file which only contains non-ASCII code data and is in a text format, all the second segmented files are connected end to end according to a sequence to obtain a combined file, and when the total byte number of the non-ASCII code data contained in the combined file is not less than a preset threshold value, coding detection is carried out on the combined file. Compared with the prior art, the method and the device not only sequentially perform non-text format filtering and ASCII code data filtering on the downloaded segmented files, greatly reduce data processing amount, but also combine the second segmented files which only contain non-ASCII code data and are in text format, so that Chinese and English and Chinese characters segmented to different segmented files are combined into the same file before coding detection, and the accuracy of coding detection results is greatly improved.

Referring to fig. 1, a flowchart of a coding detection method according to an embodiment of the present invention is disclosed, and the method includes the steps of:

step S101, receiving file downloading requests, and creating multi-thread downloading class examples with the same number as the file downloading requests according to the number of the file downloading requests, so as to download the segmented files of the page files to be downloaded;

specifically, when the page file to be downloaded needs to be downloaded in segments, the user needs to send a file downloading request for each segment file.
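A rough sketch of step S101 follows, with one worker per download request. The in-memory `PAGE`, the stand-in `fetch_range`, the helper `plan_ranges`, and the 64-byte segment size are all hypothetical; a real implementation would issue network requests instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Hypothetical page content held in memory so the sketch runs offline.
PAGE = ("<html><body>" + "你好, world! " * 20 + "</body></html>").encode("utf-8")

def fetch_range(start: int, end: int) -> bytes:
    """Stand-in for one file downloading request covering bytes [start, end)."""
    return PAGE[start:end]

def plan_ranges(total: int, segment_size: int) -> List[Tuple[int, int]]:
    """Divide [0, total) into consecutive byte ranges, one per request."""
    return [(s, min(s + segment_size, total)) for s in range(0, total, segment_size)]

def download_segments(total: int, segment_size: int) -> List[bytes]:
    """Create as many workers as there are requests and download concurrently."""
    ranges = plan_ranges(total, segment_size)
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # pool.map returns results in request order, not completion order.
        return list(pool.map(lambda r: fetch_range(*r), ranges))

segments = download_segments(len(PAGE), 64)
```

Because the ranges are cut at fixed byte offsets, a segment boundary can fall inside a multi-byte character, which is the situation the later steps are designed to handle.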

Step S102, filtering the non-text format segmented files in the downloaded segmented files in the process of downloading the segmented files of the page file to be downloaded to obtain text format segmented files, and recording the text format segmented files as first segmented files;

HTML (HyperText Markup Language) is currently the most widely used language on networks and the main language of web documents. HTML text is descriptive text composed of HTML commands, which can specify words, graphics, animations, sounds, tables, links, etc. The structure of HTML includes two major parts: a header (Head), which describes information required by the browser, and a body (Body), which contains the specific content to be described.

In this embodiment, the page file to be downloaded is an HTTP file. In practical application, downloaded HTTP files come in a variety of formats, so the invention filters out the non-text format segmented files among the downloaded segmented files to obtain the text format segmented files; for convenience of subsequent description, each text format segmented file is recorded as a first segmented file.

Step S103, filtering ASCII code data in the data of each first segmented file to obtain a first segmented file only containing non-ASCII code data, and recording the first segmented file as a second segmented file;

in practical application, ASCII code data has no effect on the result of text encoding detection, so the encoding program does not count the ASCII code data; it counts only the non-ASCII code data and marks the non-ASCII code data with a flag.

In particular, non-ASCII code data may be marked with an identifier (flag_non_ascii), which is also defined in the code detection class (TextDecoderClass).
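Step S103 can be sketched as a byte-level filter. The function name `strip_ascii` and the sample segment are assumptions; the patent names only the identifier `flag_non_ascii` and the class `TextDecoderClass`, and does not specify how the filter is implemented.

```python
def strip_ascii(segment: bytes) -> bytes:
    """Keep only bytes >= 0x80; ASCII bytes (0x00..0x7F) are dropped."""
    return bytes(b for b in segment if b >= 0x80)

# A first segmented file (text format) mixing English and Chinese.
first_segment = "Hi 中文 ok".encode("utf-8")

# The second segmented file contains only the non-ASCII bytes.
second_segment = strip_ascii(first_segment)
print(len(first_segment), len(second_segment))
```

This simple filter suits UTF-8, where every byte of a multi-byte character is 0x80 or higher. In GBK the second byte of a character can fall in the ASCII range (0x40 to 0x7E), so a production filter would presumably need to be more careful than this sketch.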

Step S104, connecting the second segmented files end to end according to the sequence to obtain a combined file;

it should be noted that, in practical application, the segmented files adjacent to each segmented file may be determined according to the first byte and the last byte of each segmented file.

According to the embodiment, the second segmented files are connected end to end in sequence, so that the characters of the combined file obtained after connection cannot be increased or reduced.
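The end-to-end connection of step S104 might look like the sketch below. Keying each second segmented file by its starting offset is an assumption made for illustration; the patent states only that adjacency is determined from the first and last bytes of each segmented file.

```python
from typing import Dict

def merge_in_order(second_segments: Dict[int, bytes]) -> bytes:
    """Concatenate segments by ascending start offset (hypothetical key)."""
    return b"".join(second_segments[offset] for offset in sorted(second_segments))

data = "中文".encode("utf-8")  # 6 bytes, 3 per character

# Two segments arriving out of order, with "中" split after its second byte.
arrived = {2: data[2:], 0: data[:2]}

merged = merge_in_order(arrived)
print(merged.decode("utf-8"))
```

Concatenation neither adds nor removes bytes, so the character that was split across two segments is whole again in the combined file.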

Step S105, counting the total byte number of all non-ASCII code data in the combined file;

specifically, the byte number of the non-ASCII data in each second segment file included in the combined file is added to obtain the total byte number of the non-ASCII data included in the combined file.

Step S106, judging whether the total byte number is not less than a preset threshold value, if so, executing step S107;

the value of the preset threshold depends on the credibility of the encoding program, for example, the preset threshold k=256 bytes corresponds to the size of about 100 chinese characters.
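Steps S105 and S106 reduce to a sum and a comparison, sketched below. The function names are assumptions; the 256-byte threshold is the value given in the embodiment. As a rough check on the "about 100 Chinese characters" figure: common Chinese characters take 2 bytes in GBK and 3 bytes in UTF-8, so 256 bytes corresponds to 128 characters in GBK and about 85 in UTF-8.

```python
from typing import Iterable

PRESET_THRESHOLD = 256  # bytes, the value K given in the embodiment

def total_non_ascii_bytes(second_segments: Iterable[bytes]) -> int:
    """S105: add up the byte counts of all second segmented files."""
    return sum(len(segment) for segment in second_segments)

def ready_for_detection(second_segments: Iterable[bytes],
                        threshold: int = PRESET_THRESHOLD) -> bool:
    """S106: proceed only when the total is not less than the threshold."""
    return total_non_ascii_bytes(second_segments) >= threshold

utf8_85 = ["中".encode("utf-8") * 85]   # 85 * 3 = 255 bytes
utf8_86 = ["中".encode("utf-8") * 86]   # 86 * 3 = 258 bytes
gbk_128 = ["中".encode("gbk") * 128]    # 128 * 2 = 256 bytes

print(ready_for_detection(utf8_85))
print(ready_for_detection(utf8_86))
print(ready_for_detection(gbk_128))
```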

And step S107, performing coding detection on the combined file.

Specifically, when the total number of bytes of the non-ASCII code data included in the combined file is not less than the preset threshold, the encoding program has sufficient confidence to perform coding detection on the combined file. The encoding program then sends the combined file to the coding detection module, which enters the corresponding channels (GBK, UTF-8, BIG5, UNICODE, etc.) according to the common character set specifications and performs classified coding detection on the combined file, so as to screen out the valid encoding of the combined file.

When the total number of bytes of the non-ASCII code data included in the combined file is less than the preset threshold, the encoding program does not have sufficient confidence to perform coding detection on the combined file, and coding detection is not performed on the combined file at this time.
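One simple way to picture steps S106 and S107 together is a strict trial decode over a list of candidate character sets. This is a swapped-in simplification, not the patent's detection module: the function `detect_encoding`, the candidate order, and the rule of returning the first encoding that decodes cleanly are all assumptions. Real detectors typically rely on statistical models rather than a pass/fail decode.

```python
from typing import Optional, Sequence

PRESET_THRESHOLD = 256                   # bytes, as in the embodiment
CANDIDATES = ("utf-8", "gbk", "big5")    # illustrative subset of the channels

def detect_encoding(merged: bytes,
                    threshold: int = PRESET_THRESHOLD,
                    candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that decodes the combined file strictly.

    Returns None when there are too few non-ASCII bytes to be confident,
    or when no candidate decodes the data.
    """
    if len(merged) < threshold:
        return None
    for name in candidates:
        try:
            merged.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

utf8_merged = ("中文" * 64).encode("utf-8")  # 384 non-ASCII bytes
gbk_merged = ("中文" * 64).encode("gbk")     # 256 non-ASCII bytes
too_short = "中文".encode("utf-8")           # 6 bytes, below the threshold

print(detect_encoding(utf8_merged))
print(detect_encoding(gbk_merged))
print(detect_encoding(too_short))
```

UTF-8 is tried first because its byte patterns are the most restrictive, so data in other encodings rarely passes it by accident.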

In summary, in the method for detecting the encoding disclosed by the invention, in the process of downloading each segmented file of the page file to be downloaded, non-text format filtering and ASCII code data filtering are sequentially carried out on the downloaded segmented file to obtain a second segmented file which only contains non-ASCII code data and is in a text format, all the second segmented files are connected end to end according to the sequence to obtain a combined file, and when the total byte number of the non-ASCII code data contained in the combined file is not less than a preset threshold value, the combined file is subjected to encoding detection. Compared with the prior art, the method and the device not only sequentially perform non-text format filtering and ASCII code data filtering on the downloaded segmented files, greatly reduce data processing amount, but also combine the second segmented files which only contain non-ASCII code data and are in text format, so that Chinese and English and Chinese characters segmented to different segmented files are combined into the same file before coding detection, and the accuracy of coding detection results is greatly improved.

In addition, the invention can also detect the downloaded segmented file while downloading the page file, thereby providing convenience for the subsequent operation of the user and improving the user experience.

It should be noted that if the file to be downloaded is smaller, when the end mark is received after the first download, it indicates that the file to be downloaded is transmitted completely, and at this time, the downloaded file is not required to be connected.

In order to further optimize the above embodiment, referring to fig. 2, a method flowchart for filtering a non-text format segmented file in a downloaded segmented file to obtain a text format segmented file according to an embodiment of the present invention is disclosed, where the embodiment is a specific implementation process of step S102 in the above embodiment, and includes the following steps:

step S201, in the process of downloading each segmented file of the page file to be downloaded, judging whether the downloaded current segmented file contains header information or not, wherein the header information indicates that the current segmented file belongs to text format data, if so, executing step S202, and if not, executing step S203;

Data in a text format includes, for example, html, txt, js, css, xml, etc.

Step S202, creating an instance of a code detection class to mark the current segmented file and recording the current segmented file as the first segmented file;

it should be noted that the created instance of the code detection class may mark the storage address, the size, and the first and last bytes of data of the current segmented file.

Step S203, data analysis is carried out on the content of the current segmented file according to the coding specification, whether the current segmented file belongs to text format data is continuously judged, if yes, step S204 is executed, and otherwise, step S205 is executed;

the process of analyzing the content of the current segment file according to the coding specification to determine whether the current segment file belongs to the text format data can refer to the existing mature scheme, and will not be repeated here.

Specifically, if the downloaded current segmented file does not include header information or the format of the downloaded current segmented file is not explicitly contained in the header information, data analysis is performed on the content of the downloaded current segmented file according to the coding specification, whether the content of the downloaded current segmented file belongs to the text format data is judged, if so, the downloaded current segmented file is recorded as the first segmented file, otherwise, the downloaded current segmented file is filtered.

Step S204, marking the current segmented file as the first segmented file;

step S205, filtering the current segmented file.
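The decision flow of steps S201 to S205 can be sketched as follows. The header representation (an optional content-type string), the `TEXT_HINTS` prefixes, and the NUL-byte heuristic in `looks_like_text` are assumptions standing in for the header check and for the "existing mature scheme" of content analysis that the text does not detail.

```python
from typing import Optional

# Hypothetical header values treated as indicating a text format.
TEXT_HINTS = ("text/", "application/javascript", "application/xml")

def looks_like_text(content: bytes) -> bool:
    """Crude stand-in for S203: treat a NUL byte in the first 64 bytes as binary."""
    return b"\x00" not in content[:64]

def is_first_segment(header_type: Optional[str], content: bytes) -> bool:
    """Return True if the segment is kept as a first segmented file."""
    # S201 -> S202: header information is present and indicates a text format.
    if header_type is not None and header_type.startswith(TEXT_HINTS):
        return True
    # S203: otherwise analyse the content; S204 keeps it, S205 filters it out.
    return looks_like_text(content)

html_segment = "<p>你好</p>".encode("utf-8")
png_segment = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

print(is_first_segment("text/html", html_segment))
print(is_first_segment(None, html_segment))
print(is_first_segment("image/png", png_segment))
```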

To facilitate understanding of the present embodiment, the following will be exemplified:

for example, a file encoded in the UTF-8 character set may or may not carry BOM header information; when it does, the header begins with the 3 bytes EF BB BF.

For a file encoded in the UTF-16 (Big Endian) character set, the header begins with the bytes FE FF.

Such header information belongs to common encodings, and the file format can be determined from it. For a file without header information, the first 64 bytes of the file data may be extracted and analyzed to see whether the values of these bytes fall within the intervals of a common character set. For example, the UTF-8 character set is defined by the RFC 3629 standard, whose encoding table specifies the numerical ranges of UTF-8 codes. By comparing the extracted 64 bytes with the code tables of the common character sets, it can be preliminarily judged whether the data belongs to a certain character set, and hence whether the data is in a text format (text being encoded by a character set).
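The two checks in this example can be sketched with Python's standard `codecs` module. The function names `bom_encoding` and `sample_is_utf8` are assumptions, and using an incremental decoder for the 64-byte sample is one possible way to test the UTF-8 byte ranges, not necessarily the approach the patent has in mind.

```python
import codecs
from typing import Optional

def bom_encoding(data: bytes) -> Optional[str]:
    """Identify the encoding from BOM header bytes, if a known BOM is present."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None

def sample_is_utf8(data: bytes, sample_size: int = 64) -> bool:
    """Check whether the first sample_size bytes follow the UTF-8 byte ranges.

    An incremental decoder with final=False tolerates a character that is
    cut off at the end of the sample but still rejects invalid sequences.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data[:sample_size], final=False)
        return True
    except UnicodeDecodeError:
        return False

utf8_data = ("中" * 30).encode("utf-8")   # 90 bytes; byte 64 falls mid-character
gbk_data = ("中文" * 20).encode("gbk")    # not valid UTF-8

print(bom_encoding(b"\xef\xbb\xbf<html>"))
print(sample_is_utf8(utf8_data))
print(sample_is_utf8(gbk_data))
```

Tolerating a truncated final character matters here because a 64-byte window, like a download segment, can end in the middle of a multi-byte sequence.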

In summary, in the process of downloading each segmented file of the page file to be downloaded, when non-text format filtering is sequentially performed on the downloaded segmented files, whether the segmented files belong to text format data is determined based on header information of each segmented file and contents contained in the header information, the segmented files belonging to the text format are recorded as first segmented files, and therefore second segmented files which only contain non-ASCII data and are in the text format are obtained by performing ASCII data filtering on the first segmented files, the second segmented files are connected end to end according to the sequence, a combined file is obtained, and when the total number of bytes of the non-ASCII data contained in the combined file is not less than a preset threshold value, encoding detection is performed on the combined file. Compared with the prior art, the method and the device not only sequentially perform non-text format filtering and ASCII code data filtering on the downloaded segmented files, greatly reduce data processing amount, but also combine the second segmented files which only contain non-ASCII code data and are in text format, so that Chinese and English and Chinese characters segmented to different segmented files are combined into the same file before coding detection, and the accuracy of coding detection results is greatly improved.

Corresponding to the embodiment of the method, the invention also discloses a code detection system.

Referring to fig. 3, a schematic structural diagram of a coding detection system according to an embodiment of the present invention is disclosed, the system includes:

a receiving unit 301, configured to receive a file downloading request, and create, according to the number of file downloading requests, an instance of a multithreaded downloading class having the same number as the number of file downloading requests, and download a segmented file of a page file to be downloaded;

A first filtering unit 302, configured to filter, in the process of downloading each segment file of the page file to be downloaded, a segment file in a non-text format in the downloaded segment file, so as to obtain a segment file in a text format, and record the segment file as a first segment file;

A second filtering unit 303, configured to filter the ASCII code data in the data of each first segment file, obtain a first segment file that only includes non-ASCII code data, and record the first segment file as a second segment file;

A merging unit 304, configured to connect each of the second segment files end to end according to a sequence, so as to obtain a merged file;

A statistics unit 305, configured to count the total number of bytes of all non-ASCII code data in the merged file;

the statistics unit 305 is specifically configured to add the number of bytes of the non-ASCII data in each second segment file included in the combined file, so as to obtain the total number of bytes of the non-ASCII data included in the combined file.

A judging unit 306, configured to judge whether the total number of bytes is not less than a preset threshold;

And a code detection unit 307, configured to perform coding detection on the combined file when the judging unit 306 judges yes.

In summary, in the encoding detection system disclosed by the invention, in the process of downloading each segmented file of the page file to be downloaded, non-text format filtering and ASCII code data filtering are sequentially performed on the downloaded segmented file to obtain a second segmented file which only contains non-ASCII code data and is in a text format, all the second segmented files are connected end to end according to the sequence to obtain a combined file, and when the total byte number of the non-ASCII code data contained in the combined file is not less than a preset threshold value, encoding detection is performed on the combined file. Compared with the prior art, the method and the device not only sequentially perform non-text format filtering and ASCII code data filtering on the downloaded segmented files, greatly reduce data processing amount, but also combine the second segmented files which only contain non-ASCII code data and are in text format, so that Chinese and English and Chinese characters segmented to different segmented files are combined into the same file before coding detection, and the accuracy of coding detection results is greatly improved.

In order to further optimize the foregoing embodiments, referring to fig. 4, a schematic structural diagram of a first filtering unit according to an embodiment of the present invention is disclosed, including:

a first judging subunit 401, configured to judge, in a process of downloading each segment file of the page file to be downloaded, whether a current segment file after the downloading is completed contains header information, where the header information indicates that the current segment file belongs to data in a text format;

a creating subunit 402, configured to create an instance of a code detection class to mark the current segment file and record the current segment file as the first segment file, if the first judging subunit 401 judges yes;

a second judging subunit 403, configured to, if the first judging subunit 401 judges no, perform data analysis on the content of the current segmented file according to the coding specification, and continuously judge whether the current segmented file belongs to data in a text format;

A marking subunit 404, configured to mark the current segment file as the first segment file if the second judging subunit 403 judges yes;

and a filtering subunit 405, configured to filter the current segment file if the second judging subunit 403 judges no.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A code detection method, comprising:

counting the total byte number of all non-ASCII code data in the combined file;

if yes, carrying out coding detection on the combined file;

in the process of downloading each segmented file of the page file to be downloaded, filtering the segmented files in a non-text format in the downloaded segmented files to obtain segmented files in a text format, and recording the segmented files as first segmented files, wherein the method specifically comprises the following steps of:

if yes, marking the current segmented file as the first segmented file;

and if not, filtering the current segmented file.

2. The method for detecting codes according to claim 1, wherein said counting the total number of bytes of all non-ASCII code data in said merged file comprises:

3. The code detection method of claim 1, wherein the predetermined threshold is 256 bytes.

4. A code detection system, comprising:

the coding detection unit is used for carrying out coding detection on the combined file under the condition that the judging unit judges yes;

wherein, the first filtering unit specifically includes:

5. The code detection system of claim 4, wherein the statistics unit is specifically configured to:

6. The code detection system of claim 4, wherein the predetermined threshold is 256 bytes.