CN111459703B - Coding detection method and system - Google Patents



Publication number
CN111459703B
Authority
CN
China
Prior art keywords
file
segmented
files
downloaded
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910005269.1A
Other languages
Chinese (zh)
Other versions
CN111459703A (en
Inventor
徐佳宏
朱吕亮
梁达源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ipanel TV Inc
Original Assignee
Shenzhen Ipanel TV Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ipanel TV Inc
Priority to CN201910005269.1A
Publication of CN111459703A
Application granted
Publication of CN111459703B
Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08: Error detection or correction by redundancy in data representation, e.g. by using checking codes

Abstract

The invention discloses a coding detection method and a coding detection system. While the segmented files of a page file to be downloaded are being downloaded, non-text-format filtering and ASCII data filtering are applied in turn to each downloaded segmented file to obtain second segmented files that are in text format and contain only non-ASCII data. The second segmented files are connected end to end in sequence to obtain a combined file, and coding detection is performed on the combined file when the total number of bytes of non-ASCII data it contains is not less than a preset threshold. Because the filtered second segmented files are combined, Chinese text that was separated from English text, and Chinese characters that were split across different segmented files, are brought back into a single file before coding detection, which greatly improves the accuracy of the coding detection result.

Description

Coding detection method and system
Technical Field
The present invention relates to the field of computer technologies, and in particular to a coding detection method and system.
Background
Multi-threaded downloading runs a plurality of threads concurrently, in software or hardware, so that a file can be downloaded quickly in segments. Because multi-threaded downloading divides a page file into a plurality of segmented files, the prior art performs coding detection on a page file as follows: first, coding detection is performed separately on each received segmented file; then the coding detection results of all segmented files of the page file are counted, and the result that occurs most often is selected as the coding detection result of the page file.
However, when the downloaded page file mixes Chinese and English, dividing it into segments is likely to separate the Chinese from the English, and may even split a single Chinese character across two segmented files. The detection results of the individual segmented files then differ, and the finally determined coding detection result can have a large error.
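The splitting problem can be reproduced in a few lines. The sketch below is illustrative only: the sample string, the cut offset, and the helper `decodes_cleanly` are assumptions made for the demonstration and do not come from the patent.

```python
# Illustrative only: a UTF-8 page mixing English and Chinese, cut at an
# arbitrary byte offset the way a segmented download might cut it.
page = "Hello 中文 world".encode("utf-8")


def decodes_cleanly(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if data is a complete, valid byte sequence in encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# "Hello " is 6 bytes and the character after it occupies bytes 6..8, so a
# cut at offset 7 leaves one of its bytes in the first segment and two in
# the second.
cut = 7
first_segment, second_segment = page[:cut], page[cut:]

# Each segment on its own is malformed UTF-8; the rejoined bytes are not.
split_ok = decodes_cleanly(first_segment) and decodes_cleanly(second_segment)
joined_ok = decodes_cleanly(first_segment + second_segment)
```

A detector that looks at each segment in isolation sees two malformed inputs here, even though the page as a whole is valid UTF-8.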
Disclosure of Invention
In view of this, the present invention discloses a coding detection method and system, to solve the prior-art problem that, because Chinese and English are separated and a single Chinese character may even be split across two segmented files, the detection results of the segmented files differ and the finally determined coding detection result has a large error.
A coding detection method, comprising:
receiving file download requests, creating as many multithreaded download class instances as there are file download requests, and downloading the segmented files of the page file to be downloaded;
in the process of downloading the segmented files of the page file to be downloaded, filtering out the segmented files in non-text formats from the downloaded segmented files to obtain the segmented files in text format, and recording them as first segmented files;
filtering out the ASCII data from the data of each first segmented file to obtain a first segmented file containing only non-ASCII data, and recording it as a second segmented file;
connecting the second segmented files end to end in sequence to obtain a combined file;
counting the total number of bytes of all non-ASCII data in the combined file;
judging whether the total number of bytes is not less than a preset threshold;
and if so, performing coding detection on the combined file.
Optionally, in the process of downloading each segment file of the page file to be downloaded, filtering the segment file in a non-text format in the downloaded segment file to obtain a segment file in a text format, and recording the segment file as a first segment file, which specifically includes:
in the process of downloading the segmented files of the page file to be downloaded, judging whether the current segmented file, once downloaded, contains header information and whether that header information indicates that the current segmented file is data in a text format;
if yes, creating an instance of a code detection class to mark the current segmented file and recording the current segmented file as the first segmented file;
if not, carrying out data analysis on the content of the current segmented file according to the coding specification, and continuously judging whether the current segmented file belongs to the text format data;
if yes, marking the current segmented file as the first segmented file;
and if not, filtering the current segmented file.
Optionally, the counting the total number of bytes of all non-ASCII code data in the merged file specifically includes:
and adding the byte numbers of the non-ASCII data in each second segmented file contained in the combined file to obtain the total byte number of all the non-ASCII data in the combined file.
Optionally, the preset threshold is 256 bytes.
A code detection system, comprising:
the receiving unit is used for receiving file downloading requests, creating a multithreaded downloading class instance with the same number as the file downloading requests according to the number of the file downloading requests, and downloading the segmented files of the page files to be downloaded;
the first filtering unit is used for filtering the non-text format segmented files in the downloaded segmented files in the process of downloading the segmented files of the page file to be downloaded to obtain text format segmented files, and recording the text format segmented files as first segmented files;
the second filtering unit is used for filtering out the ASCII (American Standard Code for Information Interchange) data from the data of each first segmented file to obtain a first segmented file containing only non-ASCII data, and recording it as a second segmented file;
the merging unit is used for connecting the second segmented files end to end according to the sequence to obtain merged files;
the statistics unit is used for counting the total byte number of all non-ASCII code data in the combined file;
the judging unit is used for judging whether the total byte quantity is not smaller than a preset threshold value;
and the coding detection unit is used for performing coding detection on the combined file when the judging unit judges yes.
Optionally, the first filtering unit specifically includes:
a first judging subunit, configured to judge, in a process of downloading each segment file of the page file to be downloaded, whether a current segment file after the downloading is completed contains header information, where the header information indicates that the current segment file belongs to data in a text format;
a creating subunit, configured to create an instance of a code detection class to mark the current segment file and record the current segment file as the first segment file if the first judging subunit judges yes;
the second judging subunit is used for carrying out data analysis on the content of the current segmented file according to the coding specification under the condition that the first judging subunit judges no, and continuously judging whether the current segmented file belongs to the data in the text format;
a marking subunit, configured to mark the current segmented file as the first segmented file if the second judging subunit judges yes;
and the filtering subunit is used for filtering the current segmented file under the condition that the second judging subunit judges no.
Optionally, the statistics unit is specifically configured to:
and adding the byte numbers of the non-ASCII data in each second segmented file contained in the combined file to obtain the total byte number of all the non-ASCII data in the combined file.
Optionally, the preset threshold is 256 bytes.
As can be seen from the above technical solution, the present invention discloses a coding detection method and system. While the segmented files of a page file to be downloaded are being downloaded, non-text-format filtering and ASCII data filtering are applied in turn to each downloaded segmented file to obtain second segmented files that are in text format and contain only non-ASCII data. The second segmented files are connected end to end in sequence to obtain a combined file, and coding detection is performed on the combined file when the total number of bytes of non-ASCII data it contains is not less than a preset threshold. Compared with the prior art, the two filtering passes greatly reduce the amount of data to be processed, and combining the second segmented files brings Chinese text that was separated from English text, and Chinese characters that were split across different segmented files, back into a single file before coding detection, which greatly improves the accuracy of the coding detection result.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the disclosed drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a code detection method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for filtering a non-text format segmented file from a downloaded segmented file to obtain a text format segmented file according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a coding detection system according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a first filtering unit according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a coding detection method and a coding detection system. While the segmented files of a page file to be downloaded are being downloaded, non-text-format filtering and ASCII data filtering are applied in turn to each downloaded segmented file to obtain second segmented files that are in text format and contain only non-ASCII data. The second segmented files are connected end to end in sequence to obtain a combined file, and coding detection is performed on the combined file when the total number of bytes of non-ASCII data it contains is not less than a preset threshold. Compared with the prior art, the two filtering passes greatly reduce the amount of data to be processed, and combining the second segmented files brings Chinese text that was separated from English text, and Chinese characters that were split across different segmented files, back into a single file before coding detection, which greatly improves the accuracy of the coding detection result.
Referring to fig. 1, a flowchart of a coding detection method according to an embodiment of the present invention is disclosed, and the method includes the steps of:
Step S101, receiving file download requests, and creating as many multithreaded download class instances as there are file download requests, so as to download the segmented files of the page file to be downloaded;
Specifically, when the page file to be downloaded needs to be downloaded in segments, the user needs to send one file download request for each segmented file.
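As a rough illustration of step S101, the sketch below issues one request per segment and runs the requests concurrently. It assumes an in-memory stand-in for the remote page; `PAGE`, `fetch_range`, `download_segments`, and the segment size are hypothetical names and values chosen for the example, not part of the patent.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a remote page; a real downloader would issue one ranged
# request per segment instead of slicing a local bytes object.
PAGE = "Title 标题, body 正文".encode("utf-8")


def fetch_range(start: int, end: int) -> tuple[int, bytes]:
    """Hypothetical download of bytes [start, end); returns (offset, data)."""
    return start, PAGE[start:end]


def download_segments(total: int, segment_size: int) -> list[tuple[int, bytes]]:
    """Create one worker per download request and run them concurrently."""
    ranges = [
        (start, min(start + segment_size, total))
        for start in range(0, total, segment_size)
    ]
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        futures = [pool.submit(fetch_range, start, end) for start, end in ranges]
        return [future.result() for future in futures]


segments = download_segments(len(PAGE), segment_size=8)
```

Keeping the offset alongside each segment is a design choice of this sketch; it makes it possible to restore page order later regardless of which request finishes first.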
Step S102, filtering the non-text format segmented files in the downloaded segmented files in the process of downloading the segmented files of the page file to be downloaded to obtain text format segmented files, and recording the text format segmented files as first segmented files;
HTML (HyperText Markup Language) is currently the most widely used language on the web and the main language of web documents. HTML text is descriptive text composed of HTML commands, which can specify text, graphics, animations, sounds, tables, links, and so on. The structure of an HTML document has two major parts: a header (Head), which describes the information required by the browser, and a body (Body), which contains the specific content to be presented.
In this embodiment, the page file to be downloaded is an HTTP file. In practical applications, downloaded HTTP files come in various formats, so the invention filters out the non-text-format segmented files among the downloaded segmented files to obtain the text-format segmented files, which are recorded as first segmented files for ease of subsequent description.
Step S103, filtering ASCII code data in the data of each first segmented file to obtain a first segmented file only containing non-ASCII code data, and recording the first segmented file as a second segmented file;
In practical applications, ASCII data has no effect on the result of text encoding detection, so the encoding program does not count ASCII data; it counts only non-ASCII data and marks that data with a flag.
Specifically, non-ASCII data may be marked with an identifier (flag_non_ascii), which is defined in the code detection class (TextDecoderClass).
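A minimal sketch of the ASCII filtering in step S103, assuming that "non-ASCII data" means every byte with the high bit set (0x80 to 0xFF). The function name `strip_ascii` is hypothetical; the patent does not show how its code detection class implements this step.

```python
def strip_ascii(segment: bytes) -> bytes:
    """Keep only the bytes with the high bit set (0x80..0xFF)."""
    return bytes(b for b in segment if b >= 0x80)


# A first segmented file mixing English and Chinese, encoded as UTF-8.
first_segment = "Hello 中文".encode("utf-8")

# The second segmented file keeps only the bytes of the Chinese characters.
second_segment = strip_ascii(first_segment)
```

In UTF-8 every byte of a multi-byte character has the high bit set, so the filtered output here is exactly the encoded Chinese text.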
Step S104, connecting the second segmented files end to end according to the sequence to obtain a combined file;
It should be noted that, in practical applications, the segmented files adjacent to each segmented file can be determined from the first byte and the last byte of that segmented file.
In this embodiment, the second segmented files are connected end to end in sequence, so that no characters are added to or lost from the combined file obtained after connection.
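The merge of step S104 might look like the following sketch. It assumes each segment carries its byte offset in the page and orders segments by that offset; this is a simplification introduced here, since the text describes locating neighbours from the first and last bytes of each segment. `Segment` and `merge_segments` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    offset: int  # assumed: byte offset of the segment within the page
    data: bytes  # segment content after ASCII filtering


def merge_segments(segments: list[Segment]) -> bytes:
    """Join segments end to end in page order, whatever order they arrived in."""
    ordered = sorted(segments, key=lambda segment: segment.offset)
    return b"".join(segment.data for segment in ordered)


encoded = "中文".encode("utf-8")  # 6 bytes, 3 per character

# The cut at byte 4 falls inside the second character, and the later
# segment arrives first.
arrived = [Segment(4, encoded[4:]), Segment(0, encoded[:4])]
merged = merge_segments(arrived)
```

Because the bytes are only reordered and concatenated, the combined data contains exactly the original characters, with the split character made whole again.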
Step S105, counting the total number of bytes of all non-ASCII data in the combined file;
Specifically, the numbers of bytes of non-ASCII data in the second segmented files contained in the combined file are added together to obtain the total number of bytes of non-ASCII data in the combined file.
Step S106, judging whether the total number of bytes is not less than a preset threshold, and if so, executing step S107;
The value of the preset threshold depends on the confidence required of the encoding program; for example, a preset threshold K = 256 bytes corresponds to about 100 Chinese characters.
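Steps S105 and S106 amount to a sum and a comparison. The sketch below uses the 256-byte example value from the text; the function names are hypothetical.

```python
PRESET_THRESHOLD = 256  # bytes, the example value given in the text


def count_non_ascii(data: bytes) -> int:
    """Number of bytes in data with the high bit set."""
    return sum(1 for b in data if b >= 0x80)


def total_non_ascii_bytes(second_segments: list[bytes]) -> int:
    """Step S105: add up the non-ASCII byte counts of all second segments."""
    return sum(count_non_ascii(segment) for segment in second_segments)


def should_detect(second_segments: list[bytes],
                  threshold: int = PRESET_THRESHOLD) -> bool:
    """Step S106: detect only when the total is not less than the threshold."""
    return total_non_ascii_bytes(second_segments) >= threshold


segment = ("中" * 50).encode("utf-8")  # 50 characters, 150 non-ASCII bytes
```

With this sample, one segment alone stays below the threshold, while two segments together (300 bytes) reach it.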
And step S107, performing coding detection on the combined file.
Specifically, when the total number of bytes of non-ASCII data contained in the combined file is not less than the preset threshold, the encoding program has sufficient confidence to perform coding detection on the combined file. The encoding program then sends the combined file to the encoding detection module, which enters the corresponding channels (GBK, UTF-8, BIG5, UNICODE, etc.) according to the common character set specifications and performs classified coding detection on the combined file, so as to screen out the valid encoding of the combined file.
When the total number of bytes of non-ASCII data contained in the combined file is less than the preset threshold, the encoding program does not have sufficient confidence, and coding detection is not performed on the combined file.
In summary, in the coding detection method disclosed by the invention, while the segmented files of a page file to be downloaded are being downloaded, non-text-format filtering and ASCII data filtering are applied in turn to each downloaded segmented file to obtain second segmented files that are in text format and contain only non-ASCII data. The second segmented files are connected end to end in sequence to obtain a combined file, and coding detection is performed on the combined file when the total number of bytes of non-ASCII data it contains is not less than a preset threshold. Compared with the prior art, the two filtering passes greatly reduce the amount of data to be processed, and combining the second segmented files brings Chinese text that was separated from English text, and Chinese characters that were split across different segmented files, back into a single file before coding detection, which greatly improves the accuracy of the coding detection result.
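The patent does not describe how each channel scores the data. As a stand-in, the sketch below tries a strict decode with each candidate codec in a fixed order and returns the first that succeeds, or nothing when the input is below the threshold. The candidate list, its order, and `detect_encoding` are assumptions of this example; a production detector would typically use statistical scoring, because byte sequences valid in one double-byte encoding are often valid in another.

```python
from typing import Optional

# Assumed candidate order; UTF-8 goes first because its byte rules are the
# strictest, so it is the least likely to accept another encoding's data.
CANDIDATE_ENCODINGS = ("utf-8", "gbk", "big5")


def detect_encoding(combined: bytes, threshold: int = 256) -> Optional[str]:
    """Return the first candidate that decodes the combined data strictly.

    Returns None when there is too little data, or when no candidate fits.
    """
    if len(combined) < threshold:
        return None  # not enough non-ASCII bytes for a confident result
    for encoding in CANDIDATE_ENCODINGS:
        try:
            combined.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


text = "中文编码检测" * 30  # 180 characters
utf8_result = detect_encoding(text.encode("utf-8"))
gbk_result = detect_encoding(text.encode("gbk"))
short_result = detect_encoding("中文".encode("utf-8"))
```

The sample characters all encode to bytes of 0x80 or above in GBK, so ASCII filtering would leave them intact before detection.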
In addition, the invention can also detect the downloaded segmented file while downloading the page file, thereby providing convenience for the subsequent operation of the user and improving the user experience.
It should be noted that if the file to be downloaded is small and the end mark is received after the first download, the file has been transmitted completely, and the downloaded file does not need to be connected.
In order to further optimize the above embodiment, referring to fig. 2, a method flowchart for filtering a non-text format segmented file in a downloaded segmented file to obtain a text format segmented file according to an embodiment of the present invention is disclosed, where the embodiment is a specific implementation process of step S102 in the above embodiment, and includes the following steps:
Step S201, in the process of downloading the segmented files of the page file to be downloaded, judging whether the current segmented file, once downloaded, contains header information and whether that header information indicates that the current segmented file is text format data; if yes, executing step S202, and if not, executing step S203;
Here, text format data is, for example, html, txt, js, css, xml, etc.
Step S202, creating an instance of a code detection class to mark the current segmented file and recording the current segmented file as the first segmented file;
It should be noted that creating an instance of the code detection class makes it possible to mark the storage address, the size, and the first and last bytes of the data of the current segmented file.
Step S203, data analysis is carried out on the content of the current segmented file according to the coding specification, whether the current segmented file belongs to text format data is continuously judged, if yes, step S204 is executed, and otherwise, step S205 is executed;
The process of analyzing the content of the current segmented file according to the coding specification to determine whether it is text format data can follow existing mature schemes and is not repeated here.
Specifically, if the downloaded current segmented file does not include header information or the format of the downloaded current segmented file is not explicitly contained in the header information, data analysis is performed on the content of the downloaded current segmented file according to the coding specification, whether the content of the downloaded current segmented file belongs to the text format data is judged, if so, the downloaded current segmented file is recorded as the first segmented file, otherwise, the downloaded current segmented file is filtered.
Step S204, marking the current segmented file as the first segmented file;
Step S205, filtering out the current segmented file.
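The decision flow of steps S201 to S205 can be sketched as below. The sketch assumes the header information is an HTTP-style content type string; `TEXT_TYPE_PREFIXES`, `looks_like_text`, and `is_first_segment` are hypothetical, and the control-byte check is only a simple stand-in for the content analysis that the text leaves to existing schemes.

```python
from typing import Optional

# Assumed content-type prefixes treated as text formats.
TEXT_TYPE_PREFIXES = ("text/", "application/javascript", "application/xml")


def looks_like_text(content: bytes) -> bool:
    """Stand-in for step S203: reject data whose first 64 bytes contain
    control bytes other than tab, line feed, or carriage return."""
    allowed_controls = {0x09, 0x0A, 0x0D}
    return all(b >= 0x20 or b in allowed_controls for b in content[:64])


def is_first_segment(content_type: Optional[str], content: bytes) -> bool:
    """Return True to keep the segment as a first segmented file."""
    # S201/S202: header information is present and declares a text format.
    if content_type is not None and content_type.lower().startswith(TEXT_TYPE_PREFIXES):
        return True
    # S203: otherwise analyze the content itself.
    # S204 keeps the segment (True); S205 filters it out (False).
    return looks_like_text(content)
```

For example, a segment declared as `text/html` is kept on the strength of its header, while a headerless segment is kept or dropped according to its content.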
To facilitate understanding of the present embodiment, the following will be exemplified:
For example, a file encoded in the UTF-8 character set may or may not carry BOM header information; when it does, the header begins with the 3 bytes EF BB BF.
For a file encoded in the UTF-16 (Big Endian) character set, the header begins with the bytes FE FF.
A file with such header information uses a common encoding, and its format can be determined from the header. For a file without header information, the first 64 bytes of the file data may be extracted and analyzed to see whether their values fall within the ranges of a common character set. For example, the UTF-8 character set is defined by the RFC 3629 standard, whose conversion table specifies the numerical ranges of UTF-8 codes. By comparing the extracted 64 bytes with the code tables of the common character sets, it can be preliminarily judged whether the data belongs to a certain character set, and hence whether the data is in a text format (text being encoded with a character set).
In summary, while the segmented files of the page file to be downloaded are being downloaded and non-text-format filtering is applied to them, whether each segmented file is text format data is determined from its header information or, failing that, from analysis of its content, and the segmented files in text format are recorded as first segmented files. ASCII data filtering of the first segmented files then yields second segmented files that are in text format and contain only non-ASCII data. The second segmented files are connected end to end in sequence to obtain a combined file, and coding detection is performed on the combined file when the total number of bytes of non-ASCII data it contains is not less than a preset threshold. Compared with the prior art, the two filtering passes greatly reduce the amount of data to be processed, and combining the second segmented files brings Chinese text that was separated from English text, and Chinese characters that were split across different segmented files, back into a single file before coding detection, which greatly improves the accuracy of the coding detection result.
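Both checks described above can be sketched with Python's standard `codecs` module. `encoding_from_bom` and `head_is_valid_utf8` are hypothetical helpers; the second covers only the UTF-8 case and uses an incremental decoder so that a character cut off at the 64-byte boundary is not counted as an error.

```python
import codecs
from typing import Optional

# BOM byte sequences named in the text, paired with Python codec names.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
)


def encoding_from_bom(data: bytes) -> Optional[str]:
    """Return the encoding announced by a leading BOM, if there is one."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def head_is_valid_utf8(data: bytes, probe: int = 64) -> bool:
    """Check the first `probe` bytes against the UTF-8 byte rules.

    final=False makes the decoder buffer an incomplete trailing sequence
    instead of raising, so a character cut at the probe boundary passes.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data[:probe], final=False)
        return True
    except UnicodeDecodeError:
        return False
```

A headerless file would first go through `encoding_from_bom`; only if that returns nothing would the 64-byte probe be tried, once per candidate character set in a fuller implementation.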
Corresponding to the embodiment of the method, the invention also discloses a code detection system.
Referring to fig. 3, a schematic structural diagram of a coding detection system according to an embodiment of the present invention is disclosed, the system includes:
a receiving unit 301, configured to receive a file downloading request, and create, according to the number of file downloading requests, an instance of a multithreaded downloading class having the same number as the number of file downloading requests, and download a segmented file of a page file to be downloaded;
Specifically, when the page file to be downloaded needs to be downloaded in segments, the user needs to send one file download request for each segmented file.
A first filtering unit 302, configured to filter, in the process of downloading each segment file of the page file to be downloaded, a segment file in a non-text format in the downloaded segment file, so as to obtain a segment file in a text format, and record the segment file as a first segment file;
In this embodiment, the page file to be downloaded is an HTTP file. In practical applications, downloaded HTTP files come in various formats, so the invention filters out the non-text-format segmented files among the downloaded segmented files to obtain the text-format segmented files, which are recorded as first segmented files for ease of subsequent description.
A second filtering unit 303, configured to filter the ASCII code data in the data of each first segment file, obtain a first segment file that only includes non-ASCII code data, and record the first segment file as a second segment file;
Specifically, non-ASCII data may be marked with an identifier (flag_non_ascii), which is defined in the code detection class (TextDecoderClass).
A merging unit 304, configured to connect each of the second segment files end to end according to a sequence, so as to obtain a merged file;
It should be noted that, in practical applications, the segmented files adjacent to each segmented file can be determined from the first byte and the last byte of that segmented file.
In this embodiment, the second segmented files are connected end to end in sequence, so that no characters are added to or lost from the combined file obtained after connection.
A statistics unit 305, configured to count the total number of bytes of all non-ASCII code data in the merged file;
The statistics unit 305 is specifically configured to add together the numbers of bytes of non-ASCII data in the second segmented files contained in the combined file, so as to obtain the total number of bytes of non-ASCII data in the combined file.
A judging unit 306, configured to judge whether the total number of bytes is not less than a preset threshold;
The value of the preset threshold depends on the confidence required of the encoding program; for example, a preset threshold K = 256 bytes corresponds to about 100 Chinese characters.
A coding detection unit 307, configured to perform coding detection on the combined file when the judging unit 306 judges yes.
Specifically, when the total number of bytes of non-ASCII data contained in the combined file is not less than the preset threshold, the encoding program has sufficient confidence to perform coding detection on the combined file. The encoding program then sends the combined file to the encoding detection module, which enters the corresponding channels (GBK, UTF-8, BIG5, UNICODE, etc.) according to the common character set specifications and performs classified coding detection on the combined file, so as to screen out the valid encoding of the combined file.
When the total number of bytes of non-ASCII data contained in the combined file is less than the preset threshold, the encoding program does not have sufficient confidence, and coding detection is not performed on the combined file.
In summary, in the coding detection system disclosed by the invention, while the segmented files of a page file to be downloaded are being downloaded, non-text-format filtering and ASCII data filtering are applied in turn to each downloaded segmented file to obtain second segmented files that are in text format and contain only non-ASCII data. The second segmented files are connected end to end in sequence to obtain a combined file, and coding detection is performed on the combined file when the total number of bytes of non-ASCII data it contains is not less than a preset threshold. Compared with the prior art, the two filtering passes greatly reduce the amount of data to be processed, and combining the second segmented files brings Chinese text that was separated from English text, and Chinese characters that were split across different segmented files, back into a single file before coding detection, which greatly improves the accuracy of the coding detection result.
In addition, the invention can also detect the downloaded segmented file while downloading the page file, thereby providing convenience for the subsequent operation of the user and improving the user experience.
In order to further optimize the foregoing embodiments, referring to fig. 4, a schematic structural diagram of a first filtering unit according to an embodiment of the present invention is disclosed, including:
a first judging subunit 401, configured to judge, in a process of downloading each segment file of the page file to be downloaded, whether a current segment file after the downloading is completed contains header information, where the header information indicates that the current segment file belongs to data in a text format;
a creating subunit 402, configured to create an instance of a code detection class to mark the current segmented file and record the current segmented file as the first segmented file if the first judging subunit 401 judges yes;
a second judging subunit 403, configured to, if the first judging subunit 401 judges no, perform data analysis on the content of the current segmented file according to the coding specification, and continuously judge whether the current segmented file belongs to data in a text format;
the process of analyzing the content of the current segment file according to the coding specification to determine whether the current segment file belongs to the text format data can refer to the existing mature scheme, and will not be repeated here.
Specifically, if the downloaded current segmented file does not include header information or the format of the downloaded current segmented file is not explicitly contained in the header information, data analysis is performed on the content of the downloaded current segmented file according to the coding specification, whether the content of the downloaded current segmented file belongs to the text format data is judged, if so, the downloaded current segmented file is recorded as the first segmented file, otherwise, the downloaded current segmented file is filtered.
A marking subunit 404, configured to mark the current segment file as the first segment file if the second judging subunit 403 judges yes;
and a filtering subunit 405, configured to filter the current segment file if the second judging subunit 403 judges no.
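The two-stage judgment made by subunits 401 to 405 can be sketched as follows. This is only an illustration under stated assumptions: the specification does not say which header field is examined or how the content analysis works, so the sketch assumes an HTTP-style `Content-Type` field and a simple heuristic (no NUL bytes and few control characters) as a stand-in for the "existing mature schemes". All function names, the list of text media types, and the 5% cutoff are hypothetical.

```python
from typing import Mapping, Optional

# Media-type prefixes treated as text in this sketch (an assumption).
TEXT_PREFIXES = ("text/", "application/json", "application/xml",
                 "application/javascript")


def header_indicates_text(headers: Optional[Mapping[str, str]]) -> bool:
    """First judgment: header information exists and names a text format."""
    if not headers:
        return False
    for key, value in headers.items():
        if key.lower() == "content-type":
            media_type = value.split(";", 1)[0].strip().lower()
            return media_type.startswith(TEXT_PREFIXES)
    return False


def content_looks_like_text(data: bytes, sample_size: int = 1024) -> bool:
    """Second judgment: heuristic stand-in for content analysis."""
    sample = data[:sample_size]
    if not sample or b"\x00" in sample:
        return False
    # Count control bytes other than tab, line feed, and carriage return.
    control = sum(1 for b in sample if b < 0x20 and b not in (0x09, 0x0A, 0x0D))
    return control / len(sample) < 0.05


def is_first_segment(headers: Optional[Mapping[str, str]], data: bytes) -> bool:
    """Return True to keep the segment as a first segment file,
    False to filter it out."""
    if header_indicates_text(headers):
        return True
    return content_looks_like_text(data)
```

The content check runs only when the header check fails, which mirrors the order in which the first and second judging subunits are invoked.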
In summary, while the segmented files of the page file to be downloaded are being downloaded and non-text format filtering is performed on them in sequence, whether each segmented file belongs to data in a text format is determined from its header information and, where the header information is not conclusive, from its content. Segmented files in a text format are recorded as first segmented files. ASCII data filtering is then performed on the first segmented files to obtain second segmented files that are in a text format and contain only non-ASCII data, the second segmented files are connected end to end in order to obtain a combined file, and when the total number of bytes of non-ASCII data in the combined file is not less than a preset threshold, encoding detection is performed on the combined file. Compared with the prior art, the invention performs non-text format filtering and ASCII data filtering in sequence on the downloaded segmented files, which greatly reduces the amount of data to be processed. It also combines the second segmented files, so that mixed Chinese-English text and Chinese characters that were split across different segmented files are joined into the same file before encoding detection, which greatly improves the accuracy of the encoding detection result.
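The benefit of merging before detection can be seen in a small worked example. The sketch below assumes UTF-8 content and a segment boundary that falls in the middle of a multi-byte character; the sample page, the split position, and the helper names are hypothetical and chosen only to show the effect.

```python
def strip_ascii(data: bytes) -> bytes:
    """Keep only bytes outside the 7-bit ASCII range (values >= 0x80)."""
    return bytes(b for b in data if b >= 0x80)


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form a complete, valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A tiny page whose two Chinese characters take three bytes each in UTF-8.
page = "<p>编码</p>".encode("utf-8")

# Split after the first byte of the first Chinese character.
segment_a, segment_b = page[:4], page[4:]

second_a = strip_ascii(segment_a)  # lead byte only
second_b = strip_ascii(segment_b)  # starts with continuation bytes
merged = second_a + second_b       # end-to-end connection, in order

valid_alone = (decodes_as_utf8(second_a), decodes_as_utf8(second_b))
valid_merged = decodes_as_utf8(merged)
```

Here neither filtered segment is valid UTF-8 on its own, while the merged bytes are, so a detector that examines only individual segments could misjudge the encoding.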
Finally, it should also be noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A code detection method, comprising:
receiving file downloading requests, creating, according to the number of the file downloading requests, the same number of multithreaded download class instances, and downloading the segmented files of the page files to be downloaded;
in the process of downloading each segmented file of the page file to be downloaded, filtering the segmented files with non-text formats in the downloaded segmented files to obtain segmented files with text formats, and marking the segmented files as first segmented files;
filtering ASCII data in the data of each first segmented file to obtain a first segmented file only containing non-ASCII data, and recording the first segmented file as a second segmented file;
connecting the second segmented files end to end according to the sequence to obtain a combined file;
counting the total byte number of all non-ASCII code data in the combined file;
judging whether the total byte number is not smaller than a preset threshold value;
if yes, carrying out coding detection on the combined file;
wherein, in the process of downloading each segmented file of the page file to be downloaded, filtering out the segmented files in a non-text format among the downloaded segmented files to obtain segmented files in a text format, and recording them as first segmented files, specifically comprises:
in the process of downloading each segmented file of the page file to be downloaded, judging whether the downloaded current segmented file contains header information and whether the header information indicates that the current segmented file belongs to data in a text format;
if yes, creating an instance of a code detection class to mark the current segmented file and recording the current segmented file as the first segmented file;
if not, carrying out data analysis on the content of the current segmented file according to the coding specification, and further judging whether the current segmented file belongs to data in a text format;
if yes, marking the current segmented file as the first segmented file;
and if not, filtering out the current segmented file.
2. The code detection method according to claim 1, wherein said counting the total byte number of all non-ASCII code data in said combined file comprises:
and adding the byte numbers of the non-ASCII data in each second segmented file contained in the combined file to obtain the total byte number of all the non-ASCII data in the combined file.
3. The code detection method of claim 1, wherein the predetermined threshold is 256 bytes.
4. A code detection system, comprising:
the receiving unit is used for receiving file downloading requests, creating, according to the number of the file downloading requests, the same number of multithreaded download class instances, and downloading the segmented files of the page files to be downloaded;
the first filtering unit is used for filtering the non-text format segmented files in the downloaded segmented files in the process of downloading the segmented files of the page file to be downloaded to obtain text format segmented files, and recording the text format segmented files as first segmented files;
the second filtering unit is used for filtering ASCII code data in the data of each first segmented file to obtain a first segmented file only containing non-ASCII code data and recording the first segmented file as a second segmented file;
the merging unit is used for connecting the second segmented files end to end according to the sequence to obtain a combined file;
the statistics unit is used for counting the total byte number of all non-ASCII code data in the combined file;
the judging unit is used for judging whether the total byte quantity is not smaller than a preset threshold value;
the coding detection unit is used for carrying out coding detection on the combined file under the condition that the judging unit judges yes;
wherein, the first filtering unit specifically includes:
a first judging subunit, configured to judge, in a process of downloading each segment file of the page file to be downloaded, whether a current segment file after the downloading is completed contains header information, where the header information indicates that the current segment file belongs to data in a text format;
a creating subunit, configured to create an instance of a code detection class to mark the current segment file and record the current segment file as the first segment file if the first judging subunit judges yes;
a second judging subunit, configured to, if the first judging subunit judges no, perform data analysis on the content of the current segment file according to the coding specification, and further judge whether the current segment file belongs to data in a text format;
a marking subunit, configured to mark the current segment file as the first segment file if the second judging subunit judges yes;
and a filtering subunit, configured to filter out the current segment file if the second judging subunit judges no.
5. The code detection system of claim 4, wherein the statistics unit is specifically configured to:
and adding the byte numbers of the non-ASCII data in each second segmented file contained in the combined file to obtain the total byte number of all the non-ASCII data in the combined file.
6. The code detection system of claim 4, wherein the predetermined threshold is 256 bytes.
CN201910005269.1A 2019-01-03 2019-01-03 Coding detection method and system Active CN111459703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910005269.1A CN111459703B (en) 2019-01-03 2019-01-03 Coding detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910005269.1A CN111459703B (en) 2019-01-03 2019-01-03 Coding detection method and system

Publications (2)

Publication Number Publication Date
CN111459703A CN111459703A (en) 2020-07-28
CN111459703B true CN111459703B (en) 2024-03-19

Family

ID=71679735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910005269.1A Active CN111459703B (en) 2019-01-03 2019-01-03 Coding detection method and system

Country Status (1)

Country Link
CN (1) CN111459703B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011069455A1 (en) * 2009-12-09 2011-06-16 成都市华为赛门铁克科技有限公司 Method, apparatus and cache system for providing file downloading service
CN102289511A (en) * 2011-08-31 2011-12-21 深圳市茁壮网络股份有限公司 Word stock file downloading method, user terminal and server
JP2012065316A (en) * 2010-09-17 2012-03-29 Ntt Docomo Inc Data transmitting method and apparatus based on network encoding
CN103124275A (en) * 2011-11-18 2013-05-29 腾讯科技(深圳)有限公司 Method and device of obtaining files
WO2014032559A1 (en) * 2012-08-28 2014-03-06 腾讯科技(深圳)有限公司 Method and device for downloading file
CN104283955A (en) * 2014-09-30 2015-01-14 百度在线网络技术(北京)有限公司 Browser, server, downloading system and downloading method
CN105763886A (en) * 2016-03-01 2016-07-13 深圳市茁壮网络股份有限公司 Distributed transcoding method and apparatus
CN107943761A (en) * 2017-11-14 2018-04-20 北京思特奇信息技术股份有限公司 A kind of method of calibration and system of TXT document codings character set

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
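Research and implementation of an intelligent substation configuration file management and control system based on file verification and comparison; 陆承宇; 王松; 高秀荣; 丁希亮; 电工技术 (No. 07); 64-66 *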

Also Published As

Publication number Publication date
CN111459703A (en) 2020-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant