WO2016061930A1 - Webpage encoding recognition method and apparatus - Google Patents
- Publication number
- WO2016061930A1 (PCT/CN2015/071308)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- resource
- encoding
- webpage
- mode
- html
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
Definitions
- the present disclosure relates to the field of computer networks, and in particular, to a webpage code recognition method and apparatus.
- Since webpage data may be encoded with different encoding modes, the browser first needs to identify the encoding mode of the webpage data according to the "charset" field in the webpage data, then decode the webpage data using the decoding method corresponding to that encoding mode, and finally display the webpage data.
- However, the "charset" field is missing or miswritten in the webpage data developed by many technicians.
- In that case the browser decodes with its default decoding method, which may display garbled characters.
- To address the problem that the browser may display garbled characters, the embodiments of the present disclosure provide a webpage encoding recognition method and apparatus. The technical solution is as follows:
- a webpage encoding recognition method comprising:
- loading webpage data, the webpage data including at least one webpage resource;
- detecting whether the webpage resource is an HTML resource that declares an encoding mode;
- if the webpage resource is an HTML resource but does not declare an encoding mode, identifying the encoding mode of the HTML resource;
- decoding the HTML resource using the decoding method corresponding to the identified encoding mode.
- the method further includes:
- if the webpage resource is an HTML resource and an encoding mode is declared, detecting whether the declared encoding mode is one of the preset encoding modes;
- if the declared encoding mode is not one of the preset encoding modes, identifying the encoding mode of the HTML resource; or, performing automatic error correction on the declared encoding mode to obtain the corrected encoding mode.
- identifying the encoding mode of the HTML resource includes:
- calling a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
- performing automatic error correction on the declared encoding mode to obtain the corrected encoding mode includes:
- calculating the spelling similarity between the declared encoding mode and each of the preset encoding modes;
- when the highest spelling similarity is greater than a preset threshold, determining the preset encoding mode corresponding to the highest spelling similarity as the corrected encoding mode.
- the method further includes:
- if the webpage resource is a CSS resource, identifying the encoding mode adopted by the HTML resource in the webpage data as the encoding mode of the CSS resource, and decoding the CSS resource using the decoding method corresponding to that encoding mode.
- a webpage encoding recognition apparatus comprising:
- a data loading module configured to load webpage data, the webpage data including at least one webpage resource;
- a mode detection module configured to detect whether the webpage resource is an HTML resource that declares an encoding mode;
- a mode identification module configured to identify the encoding mode of the HTML resource when the webpage resource is an HTML resource but does not declare an encoding mode;
- a resource decoding module configured to decode the HTML resource using the decoding method corresponding to the identified encoding mode.
- the apparatus further includes:
- an encoding detection module configured to detect whether the declared encoding mode is one of the preset encoding modes when the webpage resource is an HTML resource and an encoding mode is declared;
- the mode identification module is configured to identify the encoding mode of the HTML resource when the declared encoding mode is not one of the preset encoding modes; or, an automatic error correction module is configured to perform automatic error correction on the declared encoding mode when it is not one of the preset encoding modes, obtaining the corrected encoding mode.
- the mode identification module is configured to invoke a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
- the automatic error correction module includes:
- a spelling calculation sub-module configured to calculate the spelling similarity between the declared encoding mode and each of the preset encoding modes;
- an automatic error correction sub-module configured to determine the preset encoding mode corresponding to the highest spelling similarity as the corrected encoding mode when the highest spelling similarity is greater than the preset threshold.
- the apparatus further comprises:
- an encoding reuse module configured to, when the webpage resource is a CSS resource, identify the encoding mode adopted by the HTML resource in the webpage data as the encoding mode of the CSS resource, and decode the CSS resource using the decoding method corresponding to that encoding mode.
- a webpage encoding recognition apparatus comprising:
- a processor;
- a memory for storing executable instructions of the processor;
- wherein the processor is configured to:
- load webpage data, the webpage data including at least one webpage resource;
- detect whether the webpage resource is an HTML resource that declares an encoding mode;
- if the webpage resource is an HTML resource but does not declare an encoding mode, identify the encoding mode of the HTML resource;
- decode the HTML resource using the decoding method corresponding to the identified encoding mode.
- When a webpage resource does not declare an encoding mode, the encoding mode of the webpage resource is identified and the resource is decoded using the corresponding decoding method. This solves the problem in the related art that the browser may display garbled characters when the "charset" field is omitted: the webpage resource can be decoded and displayed normally even if it declares no encoding mode.
- FIG. 1 is a flowchart of a webpage encoding and recognizing method according to an exemplary embodiment
- FIG. 2 is a flowchart of a webpage encoding and recognizing method according to another exemplary embodiment
- FIG. 3 is a block diagram of a webpage encoding apparatus according to an exemplary embodiment
- FIG. 4 is a block diagram of a webpage encoding and recognizing apparatus according to another exemplary embodiment
- FIG. 5 is a block diagram of a webpage encoding and recognizing apparatus according to an exemplary embodiment.
- the terminals involved in the embodiments of the present disclosure may be a mobile phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, and the like.
- FIG. 1 is a flowchart of a method for recognizing a webpage code according to an exemplary embodiment. This embodiment is illustrated by using the webpage encoding and recognizing method in a terminal.
- the webpage encoding and identifying method may include the following steps:
- In step 101, webpage data is loaded; the webpage data includes at least one webpage resource.
- Web resources are usually divided into two types: HTML (HyperText Markup Language) resources and CSS (Cascading Style Sheets) resources.
- In step 102, it is detected whether the webpage resource is an HTML resource that declares an encoding mode.
- In step 103, if the webpage resource is an HTML resource but no encoding mode is declared, the encoding mode of the HTML resource is identified.
- In step 104, the HTML resource is decoded using the decoding method corresponding to the identified encoding mode.
- In summary, the webpage encoding recognition method provided by this embodiment identifies the encoding mode of a webpage resource when the resource does not declare one, and decodes the resource using the corresponding decoding method.
- This solves the problem in the related art that the browser may display garbled characters when the "charset" field is omitted: the webpage resource can be decoded and displayed normally even if it declares no encoding mode.
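- Steps 101 to 104 can be sketched roughly as follows. This is an illustration only, not the patented implementation: the regular expression, the 1024-byte search window, the candidate list and the function names are all assumptions, and the identification step is a simple trial-decoding stand-in rather than a real recognition algorithm.

```python
import re
from typing import Optional

# Hypothetical candidate list, tried in order. ISO-8859-1 goes last because
# it accepts any byte sequence and would otherwise always win.
CANDIDATES = ["utf-8", "gb2312", "big5", "iso-8859-1"]

# Looks for charset=<name>, as found in an HTML <meta> declaration.
CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE)


def declared_encoding(raw: bytes) -> Optional[str]:
    """Step 102: return the declared encoding name, or None if absent."""
    match = CHARSET_RE.search(raw[:1024])  # assumed search window
    return match.group(1).decode("ascii") if match else None


def identify_encoding(raw: bytes) -> str:
    """Step 103 (stand-in): first candidate that decodes without error."""
    for name in CANDIDATES:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return "iso-8859-1"


def decode_html(raw: bytes) -> str:
    """Steps 102-104 for one HTML resource."""
    encoding = declared_encoding(raw) or identify_encoding(raw)
    # A misspelled declaration would raise LookupError here; this sketch
    # does not handle that case.
    return raw.decode(encoding, errors="replace")
```

- With these assumptions, a GB2312-encoded page that carries no "charset" field is still decoded correctly, because the strict UTF-8 attempt fails and the GB2312 attempt succeeds.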
- FIG. 2 is a flowchart of a webpage encoding and recognizing method according to another exemplary embodiment. This embodiment is exemplified by applying the webpage code recognition method to the terminal.
- the webpage encoding and identifying method may include the following steps:
- In step 201, webpage data is loaded; the webpage data includes at least one webpage resource.
- When the terminal needs to display a webpage, it first loads the webpage data of that webpage. The webpage data of each webpage includes at least one webpage resource.
- Web resources can be divided into two types: HTML resources and CSS resources.
- In step 202, it is detected whether the webpage resource is an HTML resource.
- Before decoding each web resource, the terminal first detects whether the web resource is an HTML resource.
- If the web resource is an HTML resource, proceed to step 203;
- If the web resource is a CSS resource, proceed to step 210.
- In step 203, it is detected whether the HTML resource declares an encoding mode.
- Common encoding modes include: UTF-8 (8-bit Unicode Transformation Format), Big5 (Big Five code), GB2312 (Chinese character coded character set for information exchange), GBK (Chinese character coded character set for information exchange), ISO-8859-1 and ISO-8859-2 (International Organization for Standardization), etc.
- HTML resources typically use the "charset" field to declare the encoding mode they use. However, because web developers vary in skill, the "charset" field in an HTML resource may be omitted or miswritten.
- If the HTML resource does not declare an encoding mode, proceed to step 204; if the HTML resource declares an encoding mode, proceed to step 206.
- In step 204, if the HTML resource does not declare an encoding mode, the encoding mode of the HTML resource is identified.
- The terminal can call a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
- The predetermined character encoding recognition algorithm may be the chardet character encoding recognition algorithm.
- For example, the terminal calls the chardet character encoding recognition algorithm and recognizes that the HTML resource is encoded in GB2312.
- The chardet character encoding recognition algorithm identifies the encoding format of a character string and is often used to identify the encoding format of text.
- The terminal may extract a character string of a predetermined length from the HTML resource and identify the encoding mode of that string with the predetermined character encoding recognition algorithm; there is no need to examine all the strings in the entire HTML resource.
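- The idea of examining only a prefix of predetermined length can be sketched as below. The sketch does not call chardet; it substitutes trial decoding so that it has no third-party dependency. `SAMPLE_BYTES`, `CANDIDATES` and the function names are hypothetical. An incremental decoder is used because cutting the resource at a fixed byte offset may split a multi-byte character.

```python
import codecs
from typing import Optional

SAMPLE_BYTES = 4096                       # assumed "predetermined length"
CANDIDATES = ["utf-8", "gb2312", "big5"]  # illustrative subset, tried in order


def decodes_cleanly(sample: bytes, encoding: str) -> bool:
    """True if the sample is valid in `encoding`, ignoring a cut-off tail."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        # final=False tolerates a multi-byte character split by the cut.
        decoder.decode(sample, final=False)
        return True
    except UnicodeDecodeError:
        return False


def identify_from_prefix(raw: bytes, length: int = SAMPLE_BYTES) -> Optional[str]:
    """Identify the encoding from the first `length` bytes only."""
    sample = raw[:length]
    for name in CANDIDATES:
        if decodes_cleanly(sample, name):
            return name
    return None
```

- If the third-party chardet package for Python is available, `chardet.detect(sample)["encoding"]` could take the place of the candidate loop; that substitution is not shown here.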
- In step 205, the HTML resource is decoded using the decoding method corresponding to the identified encoding mode.
- After identifying the encoding mode used by the HTML resource, the terminal decodes the HTML resource using the corresponding decoding method.
- In step 206, if the HTML resource has declared an encoding mode, it is detected whether the declared encoding mode is one of the preset encoding modes.
- When an encoding mode has been declared in the HTML resource, the terminal needs to check whether it is one of the preset encoding modes, because the declared encoding mode may be misspelled.
- The preset encoding modes include but are not limited to: UTF-8 (8-bit Unicode Transformation Format), Big5 (Big Five code), GB2312 (Chinese character coded character set for information exchange), GBK (Chinese character coded character set for information exchange), ISO-8859-1 and ISO-8859-2 (International Organization for Standardization), etc.
- If the declared encoding mode is one of the preset encoding modes, proceed to step 207;
- otherwise, proceed to step 208.
- In step 207, if the declared encoding mode is one of the preset encoding modes, the HTML resource is decoded using the decoding method corresponding to the declared encoding mode.
- That is, the terminal decodes the HTML resource using the decoding method corresponding to the declared encoding mode.
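- A minimal sketch of the check in steps 206 and 207, assuming the six preset encoding modes listed above and a case-insensitive comparison (the comparison rule is an assumption; the description does not specify it). The function names are hypothetical.

```python
from typing import Optional

PRESET_ENCODINGS = ("UTF-8", "BIG5", "GB2312", "GBK", "ISO-8859-1", "ISO-8859-2")


def is_preset(declared: str) -> bool:
    """Step 206: case-insensitive membership test against the preset list."""
    return declared.strip().upper() in PRESET_ENCODINGS


def decode_declared(raw: bytes, declared: str) -> Optional[str]:
    """Step 207: decode with the declared encoding if it is a preset.

    Returns None when the declaration is not a preset, signalling that
    step 208 (re-identification or error correction) should run instead.
    """
    if not is_preset(declared):
        return None
    return raw.decode(declared.strip(), errors="replace")
```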
- In step 208, if the declared encoding mode is not one of the preset encoding modes, the encoding mode of the HTML resource is identified; or, automatic error correction is performed on the declared encoding mode to obtain the corrected encoding mode.
- This embodiment provides two different processing methods:
- First processing method: the terminal identifies the encoding mode of the HTML resource.
- The identification is the same as in step 204: the terminal can call a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
- The predetermined character encoding recognition algorithm may be the chardet character encoding recognition algorithm.
- Second processing method: the terminal performs automatic error correction on the declared encoding mode to obtain the corrected encoding mode.
- The automatic error correction proceeds as follows: the terminal calculates the spelling similarity between the declared encoding mode and each of the preset encoding modes; with six preset encoding modes, six spelling similarities are calculated. When the highest spelling similarity is greater than a preset threshold, the terminal determines the preset encoding mode corresponding to the highest spelling similarity as the corrected encoding mode.
- For example, the declared encoding mode is "GB2812".
- There are six preset encoding modes, so six spelling similarities are calculated.
- The highest spelling similarity, with the preset encoding mode "GB2312", is 83%, which is greater than the preset threshold of 60%. The terminal therefore determines the preset encoding mode "GB2312" as the corrected encoding mode.
- The first processing method and the second processing method may be used alternatively or in combination.
- For example, the second processing method is used first; if the highest spelling similarity is less than the preset threshold, or two or more preset encoding modes share the highest spelling similarity, the terminal can re-identify the encoding mode of the HTML resource using the first processing method.
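- The automatic error correction can be sketched as follows. The description does not say how spelling similarity is computed; this sketch assumes the ratio from Python's standard `difflib.SequenceMatcher`, which happens to give about 83% for "GB2812" against "GB2312". The function names are hypothetical, and a return value of `None` stands for "fall back to the first processing method".

```python
from difflib import SequenceMatcher
from typing import Optional

PRESET_ENCODINGS = ("UTF-8", "BIG5", "GB2312", "GBK", "ISO-8859-1", "ISO-8859-2")
THRESHOLD = 0.60  # the 60% preset threshold from the example


def spelling_similarity(declared: str, preset: str) -> float:
    """Assumed similarity measure: difflib ratio on upper-cased names."""
    return SequenceMatcher(None, declared.upper(), preset.upper()).ratio()


def autocorrect(declared: str) -> Optional[str]:
    """Return the corrected preset, or None if correction is not reliable."""
    scores = {p: spelling_similarity(declared, p) for p in PRESET_ENCODINGS}
    best = max(scores.values())
    winners = [p for p, s in scores.items() if s == best]
    # Accept only a unique best match that exceeds the threshold.
    if best > THRESHOLD and len(winners) == 1:
        return winners[0]
    return None
```

- Under this measure, a declaration such as "ISO-8859-3" is equally close to two presets, so the sketch returns `None` and leaves the decision to re-identification.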
- In step 209, the HTML resource is decoded using the decoding method corresponding to the re-identified or automatically corrected encoding mode.
- In step 210, if the webpage resource is a CSS resource, the encoding mode adopted by the HTML resource in the webpage data is identified as the encoding mode of the CSS resource, and the CSS resource is decoded using the decoding method corresponding to that encoding mode.
- The terminal identifies the encoding mode adopted by the HTML resource in the webpage data as the encoding mode of the CSS resource; for the process of identifying the encoding mode of the HTML resource, refer to the foregoing steps 202 to 207.
- The terminal decodes the CSS resource using the decoding method corresponding to the encoding mode of the CSS resource.
- The terminal may then display the webpage according to the decoded webpage resources.
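- Step 210 can be sketched as below, assuming the encoding mode of the HTML resource has already been resolved by the earlier steps. The `Resource` type, its field names and the UTF-8 fallback used when no HTML encoding is available are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    kind: str                       # "html" or "css"
    raw: bytes
    encoding: Optional[str] = None  # resolved for HTML by the earlier steps


def decode_page(resources: List[Resource]) -> List[str]:
    """Decode every resource; CSS reuses the page's HTML encoding (step 210)."""
    html_encoding = next(
        (r.encoding for r in resources if r.kind == "html" and r.encoding),
        "utf-8",  # hypothetical fallback, not taken from the description
    )
    decoded = []
    for res in resources:
        if res.kind == "html":
            encoding = res.encoding or html_encoding
        else:
            encoding = html_encoding
        decoded.append(res.raw.decode(encoding, errors="replace"))
    return decoded
```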
- In summary, the webpage encoding recognition method provided by this embodiment identifies the encoding mode of a webpage resource when the resource does not declare one, and decodes the resource using the corresponding decoding method.
- This solves the problem in the related art that the browser may display garbled characters when the "charset" field is omitted: the webpage resource can be decoded and displayed normally even if it declares no encoding mode.
- When a webpage resource declares an encoding mode but the declaration is misspelled, the method provided by this embodiment also decodes the resource using the decoding method corresponding to the re-identified or automatically corrected encoding mode.
- This solves the problem in the related art that the browser displays garbled characters when the "charset" field is miswritten: the webpage resource can be decoded and displayed normally even if the declared encoding mode is miswritten.
- FIG. 3 is a block diagram of a webpage encoding and recognizing apparatus, which may be implemented as part or all of a terminal by software, hardware, or a combination of both, according to an exemplary embodiment.
- the webpage code recognition device may include:
- the data loading module 320 is configured to load webpage data, and the webpage data includes at least one webpage resource.
- the mode detection module 340 is configured to detect whether the webpage resource is an HTML resource that declares an encoding mode.
- the mode identification module 360 is configured to identify the encoding mode of the HTML resource when the webpage resource is an HTML resource but no encoding mode is declared.
- the resource decoding module 380 is configured to decode the HTML resource using the decoding method corresponding to the identified encoding mode.
- In summary, the webpage encoding recognition apparatus provided by this embodiment identifies the encoding mode of a webpage resource when the resource does not declare one, and decodes the resource using the corresponding decoding method.
- This solves the problem in the related art that the browser may display garbled characters when the "charset" field is omitted: the webpage resource can be decoded and displayed normally even if it declares no encoding mode.
- FIG. 4 is a block diagram of a web page encoding and recognizing apparatus according to another exemplary embodiment.
- the webpage code recognition means may be implemented as part or all of the terminal by software, hardware or a combination of both.
- the webpage code recognition device may include:
- the data loading module 320 is configured to load webpage data, and the webpage data includes at least one webpage resource.
- the mode detection module 340 is configured to detect whether the webpage resource is an HTML resource that declares an encoding mode.
- the mode identification module 360 is configured to identify an encoding mode of the HTML resource when the webpage resource is an HTML resource but the encoding mode is not declared.
- the resource decoding module 380 is configured to decode the HTML resource by using a decoding method corresponding to the identified encoding mode.
- the device further includes:
- the encoding detection module 352 is configured to detect whether the declared encoding mode is one of the preset encoding modes when the webpage resource is an HTML resource and an encoding mode is declared.
- the mode identification module 360 is configured to identify the encoding mode of the HTML resource when the declared encoding mode is not one of the preset encoding modes; or,
- the automatic error correction module 370 is configured to perform automatic error correction on the declared encoding mode when it is not one of the preset encoding modes, obtaining the corrected encoding mode.
- the mode identification module 360 is configured to invoke a predetermined character encoding recognition algorithm to identify an encoding mode of the HTML resource.
- the automatic error correction module 370 includes:
- the spelling calculation sub-module 372 is configured to calculate the spelling similarity between the declared encoding mode and each of the preset encoding modes;
- the automatic error correction sub-module 374 is configured to determine the preset encoding mode corresponding to the highest spelling similarity as the corrected encoding mode when the highest spelling similarity is greater than the preset threshold.
- the device further includes:
- the encoding reuse module 354 is configured to, when the webpage resource is a CSS resource, identify the encoding mode adopted by the HTML resource in the webpage data as the encoding mode of the CSS resource, and decode the CSS resource using the decoding method corresponding to that encoding mode.
- In summary, the webpage encoding recognition apparatus provided by this embodiment identifies the encoding mode of a webpage resource when the resource does not declare one, and decodes the resource using the corresponding decoding method.
- This solves the problem in the related art that the browser may display garbled characters when the "charset" field is omitted: the webpage resource can be decoded and displayed normally even if it declares no encoding mode.
- When a webpage resource declares an encoding mode but the declaration is misspelled, the apparatus also decodes the resource using the decoding method corresponding to the re-identified or automatically corrected encoding mode.
- This solves the problem in the related art that the browser displays garbled characters when the "charset" field is miswritten: the webpage resource can be decoded and displayed normally even if the declared encoding mode is miswritten.
- FIG. 5 is a block diagram of a web page encoding recognition apparatus 500, according to an exemplary embodiment.
- device 500 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
- apparatus 500 can include one or more of the following components: processing component 502, memory 504, power component 506, multimedia component 508, audio component 510, input/output (I/O) interface 512, sensor component 514, and communication component 516.
- Processing component 502 typically controls the overall operation of device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
- Processing component 502 can include one or more processors 520 to execute instructions to perform all or part of the steps of the above described methods.
- processing component 502 can include one or more modules to facilitate interaction between component 502 and other components.
- processing component 502 can include a multimedia module to facilitate interaction between multimedia component 508 and processing component 502.
- Memory 504 is configured to store various types of data to support operation at device 500. Examples of such data include instructions for any application or method operating on device 500, contact data, phone book data, messages, pictures, videos, and the like.
- the memory 504 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
- Power component 506 provides power to various components of device 500.
- Power component 506 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for device 500.
- the multimedia component 508 includes a screen that provides an output interface between the device 500 and the user.
- the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
- the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may sense not only the boundary of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
- the multimedia component 508 includes a front camera and/or a rear camera. When the device 500 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
- the audio component 510 is configured to output and/or input an audio signal.
- audio component 510 includes a microphone (MIC) that is configured to receive an external audio signal when device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode.
- the received audio signal may be further stored in memory 504 or transmitted via communication component 516.
- audio component 510 also includes a speaker for outputting an audio signal.
- the I/O interface 512 provides an interface between the processing component 502 and the peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
- Sensor assembly 514 includes one or more sensors for providing device 500 with various aspects of status assessment.
- For example, the sensor component 514 can detect the open/closed state of the device 500 and the relative positioning of components, such as the display and keypad of the device 500. The sensor component 514 can also detect a change in position of the device 500 or of one of its components, the presence or absence of user contact with the device 500, the orientation or acceleration/deceleration of the device 500, and temperature changes of the device 500.
- Sensor assembly 514 can include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
- Sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
- the sensor component 514 can also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
- Communication component 516 is configured to facilitate wired or wireless communication between device 500 and other devices.
- the device 500 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof.
- communication component 516 receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel.
- the communication component 516 also includes a near field communication (NFC) module to facilitate short range communication.
- the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
- apparatus 500 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
- A non-transitory computer readable storage medium comprising instructions is also provided, such as the memory 504 comprising instructions executable by the processor 520 of the apparatus 500 to perform the above method.
- The non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
- A non-transitory computer readable storage medium stores instructions that, when executed by a processor of the apparatus 500, enable the apparatus 500 to perform the webpage encoding recognition method illustrated in FIG. 1 or FIG.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Document Processing Apparatus (AREA)
- Digital Computer Display Output (AREA)
Claims (11)
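- A webpage encoding recognition method, characterized in that the method comprises: loading webpage data, the webpage data including at least one webpage resource; detecting whether the webpage resource is a HyperText Markup Language (HTML) resource that declares an encoding mode; if the webpage resource is an HTML resource but does not declare an encoding mode, identifying the encoding mode of the HTML resource; and decoding the HTML resource using the decoding mode corresponding to the identified encoding mode.
- The method according to claim 1, characterized in that the method further comprises: if the webpage resource is an HTML resource and an encoding mode has been declared, detecting whether the declared encoding mode is one of preset encoding modes; if the declared encoding mode is not one of the preset encoding modes, identifying the encoding mode of the HTML resource; or, performing automatic error correction on the declared encoding mode to obtain the encoding mode after automatic error correction.
- The method according to claim 1 or 2, characterized in that identifying the encoding mode of the HTML resource comprises: invoking a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
- The method according to claim 2, characterized in that performing automatic error correction on the declared encoding mode to obtain the encoding mode after automatic error correction comprises: calculating a spelling similarity between the declared encoding mode and each of the preset encoding modes; and, when the highest spelling similarity is greater than a preset threshold, determining the preset encoding mode corresponding to the highest spelling similarity as the encoding mode after automatic error correction.
- The method according to claim 1, characterized in that the method further comprises: if the webpage resource is a Cascading Style Sheets (CSS) resource, identifying the encoding mode adopted by the HTML resource in the webpage data as the encoding mode of the CSS resource, and decoding the CSS resource using the decoding mode corresponding to that encoding mode.
- A webpage encoding recognition apparatus, characterized in that the apparatus comprises: a data loading module configured to load webpage data, the webpage data including at least one webpage resource; a mode detection module configured to detect whether the webpage resource is a HyperText Markup Language (HTML) resource that declares an encoding mode; a mode identification module configured to identify the encoding mode of the HTML resource when the webpage resource is an HTML resource but does not declare an encoding mode; and a resource decoding module configured to decode the HTML resource using the decoding mode corresponding to the identified encoding mode.
- The apparatus according to claim 6, characterized in that the apparatus further comprises: an encoding detection module configured to detect, when the webpage resource is an HTML resource and an encoding mode has been declared, whether the declared encoding mode is one of preset encoding modes; the mode identification module being configured to identify the encoding mode of the HTML resource when the declared encoding mode is not one of the preset encoding modes; or, an automatic error correction module configured to perform automatic error correction on the declared encoding mode when the declared encoding mode is not one of the preset encoding modes, to obtain the encoding mode after automatic error correction.
- The apparatus according to claim 6 or 7, characterized in that the mode identification module is configured to invoke a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
- The apparatus according to claim 7, characterized in that the automatic error correction module comprises: a spelling calculation sub-module configured to calculate a spelling similarity between the declared encoding mode and each of the preset encoding modes; and an automatic error correction sub-module configured to determine, when the highest spelling similarity is greater than a preset threshold, the preset encoding mode corresponding to the highest spelling similarity as the encoding mode after automatic error correction.
- The apparatus according to claim 6, characterized in that the apparatus further comprises: an encoding reuse module configured to, when the webpage resource is a Cascading Style Sheets (CSS) resource, identify the encoding mode adopted by the HTML resource in the webpage data as the encoding mode of the CSS resource, and decode the CSS resource using the decoding mode corresponding to that encoding mode.
- A webpage encoding recognition apparatus, characterized in that the apparatus comprises: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to: load webpage data, the webpage data including at least one webpage resource; detect whether the webpage resource is a HyperText Markup Language (HTML) resource that declares an encoding mode; if the webpage resource is an HTML resource but does not declare an encoding mode, identify the encoding mode of the HTML resource; and decode the HTML resource using the decoding mode corresponding to the identified encoding mode.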
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2015110973A RU2610245C2 (ru) | 2014-10-21 | 2015-01-22 | Method and device for identifying encoding of web page |
BR112015006725A BR112015006725A2 (pt) | 2014-10-21 | 2015-01-22 | Method and device for identifying encoding of web page |
MX2015003807A MX361564B (es) | 2014-10-21 | 2015-01-22 | Method and device for identifying encoding of web page |
JP2016554794A JP6130976B2 (ja) | 2014-10-21 | 2015-01-22 | Webpage encoding identification method, webpage encoding identification device, program, and recording medium |
KR1020157007129A KR20160059455A (ko) | 2014-10-21 | 2015-01-22 | Webpage encoding recognition method, recognition device, program, and storage medium |
US14/684,855 US20160112491A1 (en) | 2014-10-21 | 2015-04-13 | Method and device for identifying encoding of web page |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410562477.9A CN104361021B (zh) | 2014-10-21 | 2014-10-21 | Method and device for identifying encoding of web page |
CN201410562477.9 | 2014-10-21 |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/684,855 Continuation US20160112491A1 (en) | 2014-10-21 | 2015-04-13 | Method and device for identifying encoding of web page |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016061930A1 (zh) | 2016-04-28 |
Family
ID=52528283
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/071308 WO2016061930A1 (zh) | 2014-10-21 | 2015-01-22 | Method and device for identifying encoding of web page |
Country Status (8)
Country | Link |
---|---|
EP (1) | EP3012750A1 (zh) |
JP (1) | JP6130976B2 (zh) |
KR (1) | KR20160059455A (zh) |
CN (1) | CN104361021B (zh) |
BR (1) | BR112015006725A2 (zh) |
MX (1) | MX361564B (zh) |
RU (1) | RU2610245C2 (zh) |
WO (1) | WO2016061930A1 (zh) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
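CN104994128B (zh) * | 2015-05-15 | 2019-04-26 | 北京网康科技有限公司 | Method and device for identifying data encoding type and transcoding |
CN105468753A (zh) * | 2015-11-27 | 2016-04-06 | 北京金和网络股份有限公司 | System and method for displaying data in multiple encoding formats |
CN106407438A (zh) * | 2016-09-28 | 2017-02-15 | 珠海迈越信息技术有限公司 | Data processing method and system |
CN110020343B (zh) * | 2017-09-01 | 2021-03-30 | 北京国双科技有限公司 | Method and device for determining web page encoding format |
CN110674377A (zh) * | 2019-09-24 | 2020-01-10 | 四川长虹电器股份有限公司 | Crawler-based method for acquiring trending news keywords |
CN114024651A (zh) * | 2020-07-16 | 2022-02-08 | 深信服科技股份有限公司 | Encoding type identification method, apparatus, device, and readable storage medium |
CN114415817B (zh) * | 2020-10-28 | 2024-05-07 | 北京小米移动软件有限公司 | Display control method, electronic device, and storage medium |
CN113595683A (zh) * | 2021-07-07 | 2021-11-02 | 西安震有信通科技有限公司 | Conversion processing method, apparatus, terminal, and medium for files in various encodings |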
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101526963A (zh) * | 2009-04-17 | 2009-09-09 | 深圳华为通信技术有限公司 | Web page encoding identification method, apparatus, and terminal device |
CN103207877A (zh) * | 2012-01-17 | 2013-07-17 | 阿里巴巴集团控股有限公司 | Decoding method and device |
US20140075344A1 (en) * | 2012-09-11 | 2014-03-13 | Ebay Inc. | Visual state comparator |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3203544B2 (ja) * | 1996-01-31 | 2001-08-27 | 日本電信電話株式会社 | Text maximum-likelihood decoding method, maximum-likelihood decoding device, and data communication network device |
JP2000132449A (ja) * | 1998-10-27 | 2000-05-12 | Nippon Telegr & Teleph Corp <Ntt> | Proxy access method, device, and recording medium storing a proxy access program |
US6701320B1 (en) * | 2002-04-24 | 2004-03-02 | Bmc Software, Inc. | System and method for determining a character encoding scheme |
US7148824B1 (en) * | 2005-08-05 | 2006-12-12 | Xerox Corporation | Automatic detection of character encoding format using statistical analysis of the text strings |
US7711673B1 (en) * | 2005-09-28 | 2010-05-04 | Trend Micro Incorporated | Automatic charset detection using SIM algorithm with charset grouping |
US8271263B2 (en) * | 2007-03-30 | 2012-09-18 | Symantec Corporation | Multi-language text fragment transcoding and featurization |
JP5565197B2 (ja) * | 2010-08-18 | 2014-08-06 | 富士通株式会社 | Web application linkage method, linkage device, and linkage program |
RU2500024C2 (ru) * | 2011-12-27 | 2013-11-27 | Общество С Ограниченной Ответственностью "Центр Инноваций Натальи Касперской" | Method for automated determination of the language and/or encoding of a text document |
TWI493365B (zh) * | 2013-08-16 | 2015-07-21 | Arphic Technology Co Ltd | Method, system, and device for multi-character-set code input and real-time display |
2014
- 2014-10-21 CN: CN201410562477.9A (CN104361021B, zh), Active
2015
- 2015-01-22 KR: KR1020157007129A (KR20160059455A, ko), Application Discontinuation (not active)
- 2015-01-22 RU: RU2015110973A (RU2610245C2, ru), active
- 2015-01-22 WO: PCT/CN2015/071308 (WO2016061930A1, zh), Application Filing (active)
- 2015-01-22 MX: MX2015003807A (MX361564B, es), IP Right Grant (active)
- 2015-01-22 BR: BR112015006725A (BR112015006725A2, pt), IP Right Cessation (not active)
- 2015-01-22 JP: JP2016554794A (JP6130976B2, ja), Active
- 2015-07-27 EP: EP15178533.4A (EP3012750A1, en), Withdrawn (not active)
Also Published As
Publication number | Publication date |
---|---|
EP3012750A1 (en) | 2016-04-27 |
KR20160059455A (ko) | 2016-05-26 |
MX2015003807A (es) | 2016-08-02 |
JP2016539450A (ja) | 2016-12-15 |
JP6130976B2 (ja) | 2017-05-17 |
RU2015110973A (ru) | 2016-10-20 |
CN104361021A (zh) | 2015-02-18 |
MX361564B (es) | 2018-12-11 |
BR112015006725A2 (pt) | 2017-07-04 |
CN104361021B (zh) | 2018-07-24 |
RU2610245C2 (ru) | 2017-02-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016061930A1 (zh) | | Method and device for identifying encoding of web page |
EP3300407B1 (en) | | Method and device for processing verification code |
US10949490B2 (en) | | Method and apparatus for displaying webpage content |
EP2924591A1 (en) | | Method and device for controlling page rollback |
JP6918181B2 (ja) | | Machine translation model training method, device, and system |
WO2016188060A1 (zh) | | Method and device for processing application installation package |
WO2016011717A1 (zh) | | Message push method, device, terminal, and server for light applications |
WO2019165832A1 (zh) | | Text information processing method, device, and terminal |
US20190012153A1 (en) | | Method and device for supporting multi-framework syntax |
WO2016119497A1 (zh) | | Firmware compression method, firmware decompression method, and device |
KR101944416B1 (ko) | | Method for providing video call analysis service and electronic device thereof |
US20180365200A1 (en) | | Method, device, electric device and computer-readable storage medium for updating page |
EP2921969A1 (en) | | Method and apparatus for centering and zooming webpage and electronic device |
WO2015176483A1 (zh) | | Tag creation method, device, and terminal |
WO2017092121A1 (zh) | | Information processing method and device |
US20160350584A1 (en) | | Method and apparatus for providing contact card |
US20230252778A1 (en) | | Formula recognition method and apparatus |
CN104951445B (zh) | | Web page processing method and device |
CN109977424B (zh) | | Method and device for training machine translation model |
EP2963561A1 (en) | | Method and device for updating user data |
CN109992754B (zh) | | Document processing method and device |
US20210157981A1 (en) | | Method and terminal for performing word segmentation on text information, and storage medium |
WO2023092975A1 (zh) | | Image processing method and device, electronic device, storage medium, and computer program product |
US20160112491A1 (en) | | Method and device for identifying encoding of web page |
US9679076B2 (en) | | Method and device for controlling page rollback |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2016554794; Country of ref document: JP; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 20157007129; Country of ref document: KR; Kind code of ref document: A |
| | WWE | WIPO information: entry into national phase | Ref document number: MX/A/2015/003807; Country of ref document: MX |
| | ENP | Entry into the national phase | Ref document number: 2015110973; Country of ref document: RU; Kind code of ref document: A |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15852729; Country of ref document: EP; Kind code of ref document: A1 |
| | REG | Reference to national code | Ref country code: BR; Ref legal event code: B01A; Ref document number: 112015006725; Country of ref document: BR |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 112015006725; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20150325 |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 15852729; Country of ref document: EP; Kind code of ref document: A1 |