WO2016061930A1 - Webpage encoding identification method and apparatus - Google Patents

Webpage encoding identification method and apparatus

Info

Publication number
WO2016061930A1
Authority
WO
WIPO (PCT)
Prior art keywords
resource
encoding
webpage
mode
html
Prior art date
Application number
PCT/CN2015/071308
Other languages
English (en)
French (fr)
Inventor
左景龙
范金松
田凡
Original Assignee
小米科技有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 小米科技有限责任公司
Priority to RU2015110973A (RU2610245C2, ru)
Priority to BR112015006725A (BR112015006725A2, pt)
Priority to MX2015003807A (MX361564B, es)
Priority to JP2016554794A (JP6130976B2, ja)
Priority to KR1020157007129A (KR20160059455A, ko)
Priority to US14/684,855 (US20160112491A1, en)
Publication of WO2016061930A1 (zh)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577: Optimising the visualization of content, e.g. distillation of HTML documents

Definitions

  • The present disclosure relates to the field of computer networks, and in particular to a webpage encoding identification method and apparatus.
  • Because webpage data may be encoded in different ways, a browser first identifies the encoding mode of the webpage data from the "charset" field in that data, then decodes the data using the decoding method corresponding to that encoding mode, and then displays it.
  • As website building and webpage editing have become more widespread, the "charset" field is missing or miswritten in the webpage data produced by many developers.
  • In that case the browser decodes with its default decoding method and may display garbled characters.
  • To address this problem, the embodiments of the present disclosure provide a webpage encoding identification method and apparatus. The technical solution is as follows:
  • A webpage encoding identification method, comprising:
  • loading webpage data, the webpage data including at least one webpage resource;
  • detecting whether the webpage resource is an HTML resource that declares an encoding mode;
  • if the webpage resource is an HTML resource but does not declare an encoding mode, identifying the encoding mode of the HTML resource;
  • decoding the HTML resource using the decoding method corresponding to the identified encoding mode.
  • In one embodiment, the method further includes:
  • if the webpage resource is an HTML resource and an encoding mode is declared, detecting whether the declared encoding mode is one of the preset encoding modes;
  • if the declared encoding mode is not one of the preset encoding modes, identifying the encoding mode of the HTML resource; or automatically correcting the declared encoding mode to obtain an automatically corrected encoding mode.
  • In one embodiment, identifying the encoding mode of the HTML resource includes:
  • calling a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
  • In one embodiment, automatically correcting the declared encoding mode to obtain an automatically corrected encoding mode includes:
  • calculating the spelling similarity between the declared encoding mode and each of the preset encoding modes;
  • when the highest spelling similarity is greater than a preset threshold, determining the preset encoding mode corresponding to the highest spelling similarity as the automatically corrected encoding mode.
  • In one embodiment, the method further includes:
  • if the webpage resource is a CSS resource, taking the encoding mode used by the HTML resource in the webpage data as the encoding mode of the CSS resource, and decoding the CSS resource using the decoding method corresponding to that encoding mode.
  • A webpage encoding identification apparatus, comprising:
  • a data loading module configured to load webpage data, the webpage data including at least one webpage resource;
  • a mode detection module configured to detect whether the webpage resource is an HTML resource that declares an encoding mode;
  • a mode identification module configured to identify the encoding mode of the HTML resource when the webpage resource is an HTML resource but does not declare an encoding mode;
  • a resource decoding module configured to decode the HTML resource using the decoding method corresponding to the identified encoding mode.
  • In one embodiment, the apparatus further includes:
  • an encoding detection module configured to detect whether the declared encoding mode is one of the preset encoding modes when the webpage resource is an HTML resource and an encoding mode is declared;
  • the mode identification module, configured to identify the encoding mode of the HTML resource when the declared encoding mode is not one of the preset encoding modes; or an automatic error correction module configured to automatically correct the declared encoding mode when it is not one of the preset encoding modes, obtaining an automatically corrected encoding mode.
  • In one embodiment, the mode identification module is configured to call a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
  • In one embodiment, the automatic error correction module includes:
  • a spelling calculation sub-module configured to calculate the spelling similarity between the declared encoding mode and each of the preset encoding modes;
  • an automatic error correction sub-module configured to determine the preset encoding mode corresponding to the highest spelling similarity as the automatically corrected encoding mode when the highest spelling similarity is greater than a preset threshold.
  • In one embodiment, the apparatus further includes:
  • an encoding reuse module configured to, when the webpage resource is a CSS resource, take the encoding mode used by the HTML resource in the webpage data as the encoding mode of the CSS resource, and decode the CSS resource using the decoding method corresponding to that encoding mode.
  • A webpage encoding identification apparatus, comprising:
  • a processor;
  • a memory for storing instructions executable by the processor;
  • wherein the processor is configured to:
  • load webpage data, the webpage data including at least one webpage resource;
  • detect whether the webpage resource is an HTML resource that declares an encoding mode;
  • if the webpage resource is an HTML resource but does not declare an encoding mode, identify the encoding mode of the HTML resource;
  • decode the HTML resource using the decoding method corresponding to the identified encoding mode.
  • When a webpage resource does not declare an encoding mode, the encoding mode of the resource is identified and the resource is decoded using the corresponding decoding method. This solves the problem in the related art that a browser may display garbled characters when the "charset" field is missing; the webpage resource can be decoded and displayed normally even if no encoding mode is declared in it.
  • FIG. 1 is a flowchart of a webpage encoding identification method according to an exemplary embodiment;
  • FIG. 2 is a flowchart of a webpage encoding identification method according to another exemplary embodiment;
  • FIG. 3 is a block diagram of a webpage encoding identification apparatus according to an exemplary embodiment;
  • FIG. 4 is a block diagram of a webpage encoding identification apparatus according to another exemplary embodiment;
  • FIG. 5 is a block diagram of a webpage encoding identification apparatus according to an exemplary embodiment.
  • The terminals involved in the embodiments of the present disclosure may be a mobile phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, and so on.
  • FIG. 1 is a flowchart of a webpage encoding identification method according to an exemplary embodiment. This embodiment is illustrated with the method applied in a terminal.
  • The webpage encoding identification method may include the following steps:
  • In step 101, webpage data is loaded; the webpage data includes at least one webpage resource.
  • Webpage resources are usually divided into two types: HTML (HyperText Mark-up Language) resources and CSS (Cascading Style Sheets) resources.
  • In step 102, it is detected whether the webpage resource is an HTML resource that declares an encoding mode.
  • In step 103, if the webpage resource is an HTML resource but does not declare an encoding mode, the encoding mode of the HTML resource is identified.
  • In step 104, the HTML resource is decoded using the decoding method corresponding to the identified encoding mode.
  • In summary, the method provided by this embodiment identifies the encoding mode of a webpage resource when the resource does not declare one, and decodes the resource using the corresponding decoding method. This solves the problem in the related art that a browser may display garbled characters when the "charset" field is missing; the webpage resource can be decoded and displayed normally even if no encoding mode is declared in it.
  • FIG. 2 is a flowchart of a webpage encoding identification method according to another exemplary embodiment. This embodiment is illustrated with the method applied in a terminal.
  • The webpage encoding identification method may include the following steps:
  • In step 201, webpage data is loaded; the webpage data includes at least one webpage resource.
  • When the terminal needs to display a webpage, it first loads the webpage data of that webpage. The webpage data of each webpage includes at least one webpage resource.
  • Webpage resources can be divided into two types: HTML resources and CSS resources.
  • In step 202, it is detected whether the webpage resource is an HTML resource.
  • Before decoding each webpage resource, the terminal first detects whether it is an HTML resource.
  • If the webpage resource is an HTML resource, the process proceeds to step 203;
  • if the webpage resource is a CSS resource, the process proceeds to step 210.
  • In step 203, it is detected whether the HTML resource declares an encoding mode.
  • Common encoding modes include: UTF-8 (8-bit Unicode Transformation Format), Big5 (big five code), GB2312 (Chinese character coded character set for information exchange), GBK (Chinese character coded character set for information exchange), ISO-8859-1 and ISO-8859-2 (ISO: International Organization for Standardization), etc.
  • HTML resources typically use the "charset" field to declare the encoding mode they use. However, because web developers differ in skill level, the "charset" field in an HTML resource may be missing or miswritten.
  • If the HTML resource does not declare an encoding mode, the process proceeds to step 204; if it does declare one, the process proceeds to step 206.
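The check in step 203 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `declared_charset`, the regular expression, and the 1024-byte scan window are all assumptions introduced here. It looks for a "charset" declaration near the start of the raw HTML bytes and returns the declared label, or `None` when the declaration is missing.

```python
import re
from typing import Optional

# Hypothetical pattern: matches <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_CHARSET_RE = re.compile(
    rb"""charset\s*=\s*["']?\s*([A-Za-z0-9._:\-]+)""", re.IGNORECASE
)


def declared_charset(html_bytes: bytes, scan_limit: int = 1024) -> Optional[str]:
    """Return the declared charset label, or None if none is declared.

    Only the first `scan_limit` bytes are scanned (an assumed window size).
    """
    match = _CHARSET_RE.search(html_bytes[:scan_limit])
    if match is None:
        return None  # step 204 would handle this case
    return match.group(1).decode("ascii")  # step 206 would validate this label
```

The returned label is deliberately not validated here; a misspelled value such as "GB2812" is returned as-is so that a later step can check it against the preset encoding modes.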
  • In step 204, if the HTML resource does not declare an encoding mode, the encoding mode of the HTML resource is identified.
  • The terminal can call a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
  • The predetermined character encoding recognition algorithm may be the chardet character encoding recognition algorithm.
  • For example, the terminal calls the chardet character encoding recognition algorithm and recognizes that the HTML resource is encoded in GB2312.
  • The chardet character encoding recognition algorithm identifies the encoding format of a character string and is often used to identify the encoding format of text.
  • The terminal may extract a character string of a predetermined length from the HTML resource and identify the encoding mode of that string with the predetermined character encoding recognition algorithm; it does not need to examine every string in the entire HTML resource.
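The idea of identifying the encoding from a fixed-length sample can be sketched with the standard library alone. Note that this is a crude stand-in and not chardet's statistical method: it simply tries candidate encodings in order and returns the first that decodes the sample without error. The function name `identify_encoding`, the candidate list and its order, and the 4096-byte sample length are assumptions for illustration.

```python
import codecs
from typing import Optional, Sequence

# Assumed candidate list; ISO-8859-1 accepts any byte sequence, so it acts as
# a last resort and must come last.
CANDIDATES = ("utf-8", "gb2312", "big5", "iso-8859-1")


def identify_encoding(
    html_bytes: bytes,
    sample_len: int = 4096,
    candidates: Sequence[str] = CANDIDATES,
) -> Optional[str]:
    """Guess the encoding from the first `sample_len` bytes by trial decoding."""
    sample = html_bytes[:sample_len]
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)(errors="strict")
        try:
            # final=False tolerates a multi-byte character that was cut off
            # at the sample boundary.
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

Trial decoding cannot reliably tell apart encodings whose valid byte ranges overlap (for example GB2312 and Big5), which is why a statistical detector is the more realistic choice in practice.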
  • In step 205, the HTML resource is decoded using the decoding method corresponding to the identified encoding mode.
  • After identifying the encoding mode used by the HTML resource, the terminal decodes the HTML resource using the decoding method corresponding to that encoding mode.
  • In step 206, if the HTML resource has declared an encoding mode, it is detected whether the declared encoding mode is one of the preset encoding modes.
  • Because a declared encoding mode may be misspelled, the terminal needs to detect whether the encoding mode declared in the HTML resource is one of the preset encoding modes.
  • The preset encoding modes include but are not limited to: UTF-8 (8-bit Unicode Transformation Format), Big5 (big five code), GB2312 (Chinese character coded character set for information exchange), GBK (Chinese character coded character set for information exchange), ISO-8859-1 and ISO-8859-2 (ISO: International Organization for Standardization), etc.
  • If the declared encoding mode is one of the preset encoding modes, the process proceeds to step 207;
  • otherwise, the process proceeds to step 208.
  • In step 207, if the declared encoding mode is one of the preset encoding modes, the terminal decodes the HTML resource using the decoding method corresponding to the declared encoding mode.
  • In step 208, if the declared encoding mode is not one of the preset encoding modes, the encoding mode of the HTML resource is identified; or the declared encoding mode is automatically corrected to obtain an automatically corrected encoding mode.
  • This embodiment provides two different processing methods:
  • In the first processing method, the terminal identifies the encoding mode of the HTML resource.
  • The identification is the same as in step 204: the terminal can call a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
  • The predetermined character encoding recognition algorithm may be the chardet character encoding recognition algorithm.
  • In the second processing method, the terminal automatically corrects the declared encoding mode and obtains an automatically corrected encoding mode.
  • The automatic error correction proceeds as follows: the terminal calculates the spelling similarity between the declared encoding mode and each of the preset encoding modes; if there are six preset encoding modes, six spelling similarities are calculated. When the highest spelling similarity is greater than the preset threshold, the terminal determines the preset encoding mode corresponding to the highest spelling similarity as the automatically corrected encoding mode.
  • For example, the declared encoding mode is "GB2812" and there are six preset encoding modes, so six spelling similarities are calculated.
  • The highest spelling similarity, with the preset encoding mode "GB2312", is 83%, which is greater than the preset threshold of 60%. The terminal therefore determines the preset encoding mode "GB2312" as the automatically corrected encoding mode.
  • The first processing method and the second processing method may be used alternatively or in combination.
  • For example, the second processing method is used first; if the highest spelling similarity is less than the preset threshold, or two or more preset encoding modes share the highest spelling similarity, the terminal re-identifies the encoding mode of the HTML resource using the first processing method.
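The second processing method can be sketched as below. The text does not say how spelling similarity is computed; this sketch assumes the ratio from Python's `difflib.SequenceMatcher`, which happens to give about 83% for "GB2812" against "GB2312". The function names, the case-insensitive comparison, and the choice to return `None` on a tie or a below-threshold score (so that the caller can fall back to the first processing method) are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence

# The six preset encoding modes named in the text.
PRESET_ENCODINGS = ("UTF-8", "Big5", "GB2312", "GBK", "ISO-8859-1", "ISO-8859-2")


def spelling_similarity(declared: str, preset: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed metric."""
    return SequenceMatcher(None, declared.upper(), preset.upper()).ratio()


def correct_declared_encoding(
    declared: str,
    presets: Sequence[str] = PRESET_ENCODINGS,
    threshold: float = 0.6,
) -> Optional[str]:
    """Return the closest preset, or None if correction is not possible."""
    scores = [(spelling_similarity(declared, p), p) for p in presets]
    best = max(score for score, _ in scores)
    winners = [p for score, p in scores if score == best]
    # Below the threshold, or an ambiguous tie: signal the caller to
    # re-identify the encoding from the content instead.
    if best <= threshold or len(winners) != 1:
        return None
    return winners[0]
```

With the default presets, `correct_declared_encoding("GB2812")` selects "GB2312", while a label that resembles none of the presets yields `None`.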
  • In step 209, the HTML resource is decoded using the decoding method corresponding to the re-identified or automatically corrected encoding mode.
  • In step 210, if the webpage resource is a CSS resource, the encoding mode used by the HTML resource in the webpage data is taken as the encoding mode of the CSS resource, and the CSS resource is decoded using the decoding method corresponding to that encoding mode.
  • For the process of identifying the encoding mode of the HTML resource, refer to the foregoing steps 202 to 207.
  • The terminal decodes the CSS resource using the decoding method corresponding to the encoding mode of the CSS resource.
  • After decoding, the terminal may display the webpage according to the decoded webpage resources.
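The reuse of the HTML encoding for CSS resources in step 210 can be sketched as follows. This is only an illustration under stated assumptions: resources are modelled as `(kind, bytes)` pairs, the HTML encoding is supplied by a caller-provided function standing in for steps 203 to 209, and UTF-8 is an assumed fallback when a page contains no HTML resource. None of these names come from the patent.

```python
from typing import Callable, List, Tuple

DEFAULT_ENCODING = "utf-8"  # assumption: fallback when no HTML resource exists


def decode_page(
    resources: List[Tuple[str, bytes]],
    identify_html_encoding: Callable[[bytes], str],
) -> List[str]:
    """Decode every resource; CSS resources reuse the HTML resource's encoding."""
    # Find the encoding of the first HTML resource in the webpage data.
    html_encoding = DEFAULT_ENCODING
    for kind, data in resources:
        if kind == "html":
            html_encoding = identify_html_encoding(data)
            break

    decoded = []
    for kind, data in resources:
        if kind == "html":
            encoding = identify_html_encoding(data)
        else:
            # Step 210: a CSS resource takes the HTML resource's encoding.
            encoding = html_encoding
        decoded.append(data.decode(encoding, errors="replace"))
    return decoded
```

Passing the identification function in as a parameter keeps the sketch independent of how the HTML encoding is actually determined (declaration, content-based identification, or automatic correction).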
  • In summary, the method provided by this embodiment identifies the encoding mode of a webpage resource when the resource does not declare one, and decodes the resource using the corresponding decoding method. This solves the problem in the related art that a browser may display garbled characters when the "charset" field is missing; the webpage resource can be decoded and displayed normally even if no encoding mode is declared in it.
  • When a webpage resource declares an encoding mode but the declaration is misspelled, the method provided by this embodiment also decodes the resource using the decoding method corresponding to the re-identified or automatically corrected encoding mode. This solves the problem in the related art that a browser displays garbled characters when the "charset" field is miswritten; the webpage resource can be decoded and displayed normally even if the declared encoding mode is miswritten.
  • FIG. 3 is a block diagram of a webpage encoding identification apparatus according to an exemplary embodiment. The apparatus may be implemented as part or all of a terminal by software, hardware, or a combination of both.
  • The webpage encoding identification apparatus may include:
  • a data loading module 320 configured to load webpage data, the webpage data including at least one webpage resource;
  • a mode detection module 340 configured to detect whether the webpage resource is an HTML resource that declares an encoding mode;
  • a mode identification module 360 configured to identify the encoding mode of the HTML resource when the webpage resource is an HTML resource but does not declare an encoding mode;
  • a resource decoding module 380 configured to decode the HTML resource using the decoding method corresponding to the identified encoding mode.
  • In summary, the apparatus provided by this embodiment identifies the encoding mode of a webpage resource when the resource does not declare one, and decodes the resource using the corresponding decoding method. This solves the problem in the related art that a browser may display garbled characters when the "charset" field is missing; the webpage resource can be decoded and displayed normally even if no encoding mode is declared in it.
  • FIG. 4 is a block diagram of a webpage encoding identification apparatus according to another exemplary embodiment.
  • The apparatus may be implemented as part or all of a terminal by software, hardware, or a combination of both.
  • The webpage encoding identification apparatus may include:
  • a data loading module 320 configured to load webpage data, the webpage data including at least one webpage resource;
  • a mode detection module 340 configured to detect whether the webpage resource is an HTML resource that declares an encoding mode;
  • a mode identification module 360 configured to identify the encoding mode of the HTML resource when the webpage resource is an HTML resource but does not declare an encoding mode;
  • a resource decoding module 380 configured to decode the HTML resource using the decoding method corresponding to the identified encoding mode.
  • Optionally, the apparatus further includes:
  • an encoding detection module 352 configured to detect whether the declared encoding mode is one of the preset encoding modes when the webpage resource is an HTML resource and an encoding mode is declared;
  • the mode identification module 360, configured to identify the encoding mode of the HTML resource when the declared encoding mode is not one of the preset encoding modes; or
  • an automatic error correction module 370 configured to automatically correct the declared encoding mode when it is not one of the preset encoding modes, obtaining an automatically corrected encoding mode.
  • The mode identification module 360 is configured to call a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
  • The automatic error correction module 370 includes:
  • a spelling calculation sub-module 372 configured to calculate the spelling similarity between the declared encoding mode and each of the preset encoding modes;
  • an automatic error correction sub-module 374 configured to determine the preset encoding mode corresponding to the highest spelling similarity as the automatically corrected encoding mode when the highest spelling similarity is greater than the preset threshold.
  • Optionally, the apparatus further includes:
  • an encoding reuse module 354 configured to, when the webpage resource is a CSS resource, take the encoding mode used by the HTML resource in the webpage data as the encoding mode of the CSS resource, and decode the CSS resource using the decoding method corresponding to that encoding mode.
  • In summary, the apparatus provided by this embodiment identifies the encoding mode of a webpage resource when the resource does not declare one, and decodes the resource using the corresponding decoding method. This solves the problem in the related art that a browser may display garbled characters when the "charset" field is missing; the webpage resource can be decoded and displayed normally even if no encoding mode is declared in it.
  • When a webpage resource declares an encoding mode but the declaration is misspelled, the apparatus provided by this embodiment also decodes the resource using the decoding method corresponding to the re-identified or automatically corrected encoding mode. This solves the problem in the related art that a browser displays garbled characters when the "charset" field is miswritten; the webpage resource can be decoded and displayed normally even if the declared encoding mode is miswritten.
  • FIG. 5 is a block diagram of a webpage encoding identification apparatus 500 according to an exemplary embodiment.
  • For example, device 500 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • Device 500 can include one or more of the following components: processing component 502, memory 504, power component 506, multimedia component 508, audio component 510, input/output (I/O) interface 512, sensor component 514, and communication component 516.
  • Processing component 502 typically controls the overall operation of device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • Processing component 502 can include one or more processors 520 to execute instructions to perform all or part of the steps of the above-described methods.
  • Processing component 502 can also include one or more modules to facilitate interaction between processing component 502 and other components.
  • For example, processing component 502 can include a multimedia module to facilitate interaction between multimedia component 508 and processing component 502.
  • Memory 504 is configured to store various types of data to support operation of device 500. Examples of such data include instructions for any application or method operating on device 500, contact data, phone book data, messages, pictures, videos, and the like.
  • Memory 504 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • Power component 506 provides power to the various components of device 500.
  • Power component 506 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for device 500.
  • Multimedia component 508 includes a screen that provides an output interface between device 500 and the user.
  • The screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
  • The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with it.
  • Multimedia component 508 may include a front camera and/or a rear camera. When device 500 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • Audio component 510 is configured to output and/or input audio signals.
  • For example, audio component 510 includes a microphone (MIC) configured to receive an external audio signal when device 500 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode.
  • The received audio signal may be further stored in memory 504 or transmitted via communication component 516.
  • Audio component 510 may also include a speaker for outputting audio signals.
  • I/O interface 512 provides an interface between processing component 502 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. The buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
  • Sensor component 514 includes one or more sensors for providing status assessments of various aspects of device 500.
  • For example, sensor component 514 can detect the open/closed state of device 500 and the relative positioning of components, such as the display and keypad of device 500. Sensor component 514 can also detect a change in position of device 500 or of one of its components, the presence or absence of user contact with device 500, the orientation or acceleration/deceleration of device 500, and temperature changes of device 500.
  • Sensor component 514 can include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • Sensor component 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • Sensor component 514 can also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 516 is configured to facilitate wired or wireless communication between device 500 and other devices.
  • Device 500 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • Communication component 516 may receive broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • Communication component 516 may also include a near field communication (NFC) module to facilitate short-range communication.
  • For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • Device 500 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
  • A non-transitory computer readable storage medium comprising instructions is also provided, such as memory 504 comprising instructions; the instructions are executable by processor 520 of device 500 to perform the above method.
  • For example, the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
  • A non-transitory computer readable storage medium whose instructions, when executed by a processor of device 500, enable device 500 to perform the webpage encoding identification method illustrated in FIG. 1 or FIG. 2.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Document Processing Apparatus (AREA)
  • Digital Computer Display Output (AREA)

Abstract

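The present disclosure relates to a webpage encoding identification method and apparatus, and belongs to the field of computer networks. The method includes: loading webpage data, the webpage data including at least one webpage resource; detecting whether the webpage resource is a HyperText Markup Language (HTML) resource that declares an encoding mode; if the webpage resource is an HTML resource but does not declare an encoding mode, identifying the encoding mode of the HTML resource; and decoding the HTML resource using the decoding method corresponding to the identified encoding mode. The present disclosure solves the problem in the related art that a browser may display garbled characters when the "charset" field of a webpage is missing, and achieves the effect that a webpage resource can be decoded and displayed normally even if it declares no encoding mode.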

Description

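Webpage encoding identification method and apparatus
This application is based on and claims priority to Chinese Patent Application No. 201410562477.9, filed on October 21, 2014, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer networks, and in particular to a webpage encoding identification method and apparatus.
Background
With the development of network technology, browsing webpages with a browser on a terminal has become one of the most frequently used functions.
Because webpage data may be encoded in different ways, a browser first identifies the encoding mode of the webpage data from the "charset" field in that data, then decodes the data using the decoding method corresponding to that encoding mode, and then displays it. However, as website building and webpage editing have become more widespread, the "charset" field is missing or miswritten in the webpage data produced by many developers. In that case the browser decodes with its default decoding method and may display garbled characters.
Summary
To solve the problem in the related art that a browser displays garbled characters when the "charset" field of a webpage is missing or miswritten, the embodiments of the present disclosure provide a webpage encoding identification method and apparatus. The technical solution is as follows: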
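According to the embodiments of the present disclosure, a webpage encoding identification method is provided, the method including:
loading webpage data, the webpage data including at least one webpage resource;
detecting whether the webpage resource is an HTML resource that declares an encoding mode;
if the webpage resource is an HTML resource but does not declare an encoding mode, identifying the encoding mode of the HTML resource;
decoding the HTML resource using the decoding method corresponding to the identified encoding mode.
In one embodiment, the method further includes:
if the webpage resource is an HTML resource and an encoding mode is declared, detecting whether the declared encoding mode is one of the preset encoding modes;
if the declared encoding mode is not one of the preset encoding modes, identifying the encoding mode of the HTML resource; or automatically correcting the declared encoding mode to obtain an automatically corrected encoding mode.
In one embodiment, identifying the encoding mode of the HTML resource includes:
calling a predetermined character encoding recognition algorithm to identify the encoding mode of the HTML resource.
In one embodiment, automatically correcting the declared encoding mode to obtain an automatically corrected encoding mode includes:
calculating the spelling similarity between the declared encoding mode and each of the preset encoding modes;
when the highest spelling similarity is greater than a preset threshold, determining the preset encoding mode corresponding to the highest spelling similarity as the automatically corrected encoding mode.
In one embodiment, the method further includes:
if the webpage resource is a CSS resource, taking the encoding mode used by the HTML resource in the webpage data as the encoding mode of the CSS resource, and decoding the CSS resource using the decoding method corresponding to that encoding mode.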
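According to a second aspect of the embodiments of the present disclosure, a device for identifying the encoding of a web page is provided. The device includes:
a data loading module configured to load web page data, the web page data including at least one web page resource;
a scheme detection module configured to detect whether the web page resource is an HTML resource that declares an encoding;
a scheme identification module configured to identify the encoding of the HTML resource when the web page resource is an HTML resource but declares no encoding; and
a resource decoding module configured to decode the HTML resource using the decoding scheme that corresponds to the identified encoding.
In one embodiment, the device further includes:
an encoding check module configured to detect, when the web page resource is an HTML resource that declares an encoding, whether the declared encoding is one of a set of preset encodings; and
the scheme identification module, configured to identify the encoding of the HTML resource when the declared encoding is not one of the preset encodings; or an automatic correction module configured to automatically correct the declared encoding, when it is not one of the preset encodings, to obtain a corrected encoding.
In one embodiment, the scheme identification module is configured to invoke a predetermined character-encoding identification algorithm to identify the encoding of the HTML resource.
In one embodiment, the automatic correction module includes:
a spelling calculation submodule configured to calculate a spelling similarity between the declared encoding and each of the preset encodings; and
an automatic correction submodule configured to determine, when the highest spelling similarity is greater than a preset threshold, the preset encoding with the highest spelling similarity to be the corrected encoding.
In one embodiment, the device further includes:
an encoding reuse module configured to take, when the web page resource is a CSS resource, the encoding used by the HTML resource in the web page data as the encoding of the CSS resource, and to decode the CSS resource using the decoding scheme that corresponds to that encoding.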
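According to a third aspect of the present disclosure, a device for identifying the encoding of a web page is provided. The device includes:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
load web page data, the web page data including at least one web page resource;
detect whether the web page resource is a HyperText Markup Language (HTML) resource that declares an encoding;
if the web page resource is an HTML resource but declares no encoding, identify the encoding of the HTML resource; and
decode the HTML resource using the decoding scheme that corresponds to the identified encoding.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
When a web page resource declares no encoding, the encoding of the resource is identified and the resource is decoded with the decoding scheme that corresponds to that encoding. This addresses the problem in the related art that a browser may display garbled text when the "charset" field of a web page is omitted, so that the web page resource can be decoded and displayed correctly even when it declares no encoding.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.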
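Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its principles.
FIG. 1 is a flowchart of a method for identifying the encoding of a web page according to an exemplary embodiment;
FIG. 2 is a flowchart of a method for identifying the encoding of a web page according to another exemplary embodiment;
FIG. 3 is a block diagram of a device for identifying the encoding of a web page according to an exemplary embodiment;
FIG. 4 is a block diagram of a device for identifying the encoding of a web page according to another exemplary embodiment;
FIG. 5 is a block diagram of a device for identifying the encoding of a web page according to an exemplary embodiment.
The above drawings show specific embodiments of the present disclosure, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to explain the concept of the present disclosure to those skilled in the art by reference to particular embodiments.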
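Detailed Description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminal referred to in the embodiments of the present disclosure may be a mobile phone, a tablet computer, an e-book reader, an MP3 (MPEG-1 Audio Layer III) player, an MP4 (MPEG-4 Part 14) player, a laptop computer, a desktop computer, or the like.
FIG. 1 is a flowchart of a method for identifying the encoding of a web page according to an exemplary embodiment. This embodiment is described with the method applied in a terminal. The method may include the following steps: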
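In step 101, web page data is loaded, the web page data including at least one web page resource.
Web page resources are generally of two types: HTML (HyperText Markup Language) resources and CSS (Cascading Style Sheets) resources.
In step 102, it is detected whether the web page resource is an HTML resource that declares an encoding.
In step 103, if the web page resource is an HTML resource but declares no encoding, the encoding of the HTML resource is identified.
In step 104, the HTML resource is decoded using the decoding scheme that corresponds to the identified encoding.
In summary, the method provided by this embodiment identifies the encoding of a web page resource when the resource declares none, and decodes the resource with the decoding scheme that corresponds to that encoding. This addresses the problem in the related art that a browser displays garbled text when the "charset" field of a web page is omitted, so that the web page resource can be decoded and displayed correctly even when it declares no encoding.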
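FIG. 2 is a flowchart of a method for identifying the encoding of a web page according to another exemplary embodiment. This embodiment is described with the method applied in a terminal. The method may include the following steps:
In step 201, web page data is loaded, the web page data including at least one web page resource.
When the terminal needs to display a web page, it first loads the web page data of that page. The web page data of each page includes at least one web page resource.
Web page resources fall into two types: HTML resources and CSS resources.
In step 202, it is detected whether the web page resource is an HTML resource.
Before decoding each web page resource, the terminal first detects whether the resource is an HTML resource.
If the web page resource is an HTML resource, the method proceeds to step 203;
if the web page resource is a CSS resource, the method proceeds to step 210.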
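In step 203, it is detected whether the HTML resource declares an encoding.
Common encodings include UTF-8 (8-bit Unicode Transformation Format), Big5, GB2312 (the Chinese coded character set for information interchange), GBK (the Chinese internal code extension specification), ISO-8859-1 (a standard of the International Organization for Standardization), ISO-8859-2, and so on.
An HTML resource usually declares the encoding it uses through the "charset" field. However, because web developers differ in skill, the "charset" field in an HTML resource may be omitted or misspelled.
If the HTML resource declares no encoding, the method proceeds to step 204;
if the HTML resource declares an encoding, the method proceeds to step 206.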
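The check in step 203 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the declaration appears as a `charset=` attribute or parameter inside a `<meta>` tag near the start of the document, and the helper name `declared_charset` and the 1024-byte scan window are hypothetical choices.

```python
import re
from typing import Optional

# Matches charset=... inside a <meta> tag, quoted or unquoted (assumed form of the declaration).
_META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_charset(html: bytes, scan_limit: int = 1024) -> Optional[str]:
    """Return the declared charset label, or None when no declaration is found."""
    # Only the head of the document is scanned; the window size is an assumption.
    match = _META_CHARSET.search(html[:scan_limit])
    return match.group(1).decode("ascii") if match else None
```

A return value of `None` corresponds to the branch into step 204, and any other value to the branch into step 206.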
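In step 204, if the HTML resource declares no encoding, the encoding of the HTML resource is identified.
The terminal may invoke a predetermined character-encoding identification algorithm to identify the encoding of the HTML resource. The predetermined algorithm may be the chardet character-encoding identification algorithm.
For example, when the HTML resource declares no encoding, the terminal invokes the chardet algorithm and identifies that the HTML resource is encoded in GB2312.
The chardet character-encoding identification algorithm identifies the encoding format of a string and is commonly used to identify the encoding of text.
To speed up identification, the terminal may extract a string of predetermined length from the HTML resource and run the predetermined character-encoding identification algorithm on that string only, rather than on every string in the HTML resource.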
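The idea of identifying the encoding from a fixed-length prefix can be sketched as below. Note that this is a simple trial-decoding stand-in, not the chardet algorithm named in the text: it tries a short list of candidate codecs in order and accepts the first that decodes the sample without error. The function name, the candidate order, and the 4096-byte sample size are assumptions made for illustration.

```python
import codecs
from typing import Optional, Sequence

# Assumed trial order; UTF-8 first because it rejects most non-UTF-8 byte sequences.
CANDIDATES = ("utf-8", "gbk", "big5")


def guess_encoding(
    data: bytes,
    sample_size: int = 4096,
    candidates: Sequence[str] = CANDIDATES,
) -> Optional[str]:
    """Guess the encoding from the first sample_size bytes only."""
    sample = data[:sample_size]
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)(errors="strict")
        try:
            # final=False tolerates a multi-byte character cut off by the slice.
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

Trial decoding cannot separate encodings whose valid byte ranges overlap (GBK and Big5, for instance), which is why a statistical detector is the more realistic choice in practice.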
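In step 205, the HTML resource is decoded using the decoding scheme that corresponds to the identified encoding.
After identifying the encoding used by the HTML resource, the terminal decodes the HTML resource with the decoding scheme that corresponds to that encoding.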
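In step 206, if the HTML resource declares an encoding, it is detected whether the declared encoding is one of the preset encodings.
When the HTML resource declares an encoding, the declaration may contain a spelling error, so the terminal needs to detect whether the declared encoding is one of the preset encodings.
The preset encodings include, but are not limited to, UTF-8 (8-bit Unicode Transformation Format), Big5, GB2312 (the Chinese coded character set for information interchange), GBK (the Chinese internal code extension specification), ISO-8859-1 (a standard of the International Organization for Standardization), ISO-8859-2, and so on.
If the declared encoding is one of the preset encodings, the method proceeds to step 207;
if the declared encoding is not one of the preset encodings, the method proceeds to step 208.
In step 207, if the declared encoding is one of the preset encodings, the HTML resource is decoded using the decoding scheme that corresponds to the declared encoding.
When the declared encoding is one of the preset encodings, the declaration contains no spelling error, and the terminal decodes the HTML resource with the decoding scheme that corresponds to the declared encoding.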
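Steps 206 and 207 amount to a membership test against the preset list. The sketch below assumes that the comparison ignores case and surrounding whitespace, which the text does not state; `PRESET_ENCODINGS` and `match_preset` are hypothetical names.

```python
from typing import Optional

# The six encodings listed in the description; a real preset list may be longer.
PRESET_ENCODINGS = ("UTF-8", "Big5", "GB2312", "GBK", "ISO-8859-1", "ISO-8859-2")


def match_preset(declared: str) -> Optional[str]:
    """Return the canonical preset name for a declared label, or None if unknown."""
    wanted = declared.strip().upper()  # case-insensitive comparison is an assumption
    for name in PRESET_ENCODINGS:
        if name.upper() == wanted:
            return name
    return None
```

A non-`None` result leads to step 207 (decode with the declared encoding); `None` leads to step 208.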
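In step 208, if the declared encoding is not one of the preset encodings, the encoding of the HTML resource is identified; or the declared encoding is automatically corrected to obtain a corrected encoding.
When the declared encoding is not one of the preset encodings, the declaration contains a spelling error. This embodiment provides two ways of handling that case:
First approach: the terminal identifies the encoding of the HTML resource.
The identification works as in step 204: the terminal may invoke a predetermined character-encoding identification algorithm to identify the encoding of the HTML resource, and the predetermined algorithm may be the chardet character-encoding identification algorithm.
Second approach: the terminal automatically corrects the declared encoding to obtain a corrected encoding.
The automatic correction proceeds as follows: the terminal calculates a spelling similarity between the declared encoding and each of the preset encodings; with six preset encodings, six spelling similarities are obtained. When the highest spelling similarity is greater than a preset threshold, the terminal determines the preset encoding with the highest spelling similarity to be the corrected encoding.
For example, suppose the declared encoding is "GB2812" and there are six preset encodings, so that six spelling similarities are calculated. The highest similarity, 83%, is with the preset encoding "GB2312" and exceeds the preset threshold of 60%, so the terminal determines "GB2312" to be the corrected encoding.
Note that the first and second approaches may be used alternatively or in combination. One possible combination is to apply the second approach first; if the highest spelling similarity does not exceed the preset threshold, or if two or more preset encodings share the highest spelling similarity, the terminal may then fall back to the first approach and re-identify the encoding of the HTML resource.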
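The automatic correction can be sketched with a standard-library similarity measure. The text does not say how spelling similarity is computed, so the use of `difflib.SequenceMatcher.ratio()` here is an assumption, as are the function name and the case-insensitive comparison. Returning `None` signals the fallback to re-identification described above.

```python
from difflib import SequenceMatcher
from typing import Optional

# The six encodings listed in the description.
PRESET_ENCODINGS = ("UTF-8", "Big5", "GB2312", "GBK", "ISO-8859-1", "ISO-8859-2")


def autocorrect_charset(declared: str, threshold: float = 0.6) -> Optional[str]:
    """Return the preset encoding closest to a misspelled label, or None."""
    wanted = declared.strip().upper()
    # One similarity score per preset encoding.
    scored = [
        (SequenceMatcher(None, wanted, name.upper()).ratio(), name)
        for name in PRESET_ENCODINGS
    ]
    best = max(score for score, _ in scored)
    winners = [name for score, name in scored if score == best]
    # Accept only a unique winner strictly above the threshold.
    if best > threshold and len(winners) == 1:
        return winners[0]
    return None
```

With this particular measure, "GB2812" against "GB2312" scores 10/12, or about 0.83, which happens to agree with the 83% in the example; a label such as "ISO-8859-3" ties between two presets and therefore yields `None`.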
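In step 209, the HTML resource is decoded using the decoding scheme that corresponds to the re-identified or automatically corrected encoding.
In step 210, if the web page resource is a CSS resource, the encoding used by the HTML resource in the web page data is taken as the encoding of the CSS resource, and the CSS resource is decoded using the decoding scheme that corresponds to that encoding.
That is, if the current web page resource is a CSS resource rather than an HTML resource, the terminal takes the encoding used by the HTML resource in the same web page data as the encoding of the CSS resource, because the HTML and CSS resources of one page usually share the same encoding. The encoding of the HTML resource is identified as described in steps 202 to 207 above.
The terminal then decodes the CSS resource using the decoding scheme that corresponds to the encoding of the CSS resource.
Finally, after decoding each web page resource, the terminal can display the web page from the decoded resources.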
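The reuse of the HTML encoding for CSS resources in step 210 can be sketched as follows. The `Resource` type, the `decode_page` function, and the UTF-8 fallback for a page without any HTML resource are assumptions made for illustration; `resolve_html_encoding` stands for whatever procedure (steps 203 to 209) yields the HTML encoding.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Resource:
    kind: str    # "html" or "css"
    data: bytes  # raw, still-encoded bytes


def decode_page(
    resources: List[Resource],
    resolve_html_encoding: Callable[[bytes], str],
) -> List[str]:
    """Decode every resource of one page with the encoding of its HTML resource."""
    html_encoding: Optional[str] = None
    for res in resources:
        if res.kind == "html":
            html_encoding = resolve_html_encoding(res.data)
            break
    # Falling back to UTF-8 when no HTML resource exists is an assumption.
    encoding = html_encoding or "utf-8"
    # CSS resources reuse the HTML encoding instead of being identified separately.
    return [res.data.decode(encoding, errors="replace") for res in resources]
```

For instance, `decode_page(page, lambda raw: "gbk")` decodes both the HTML and the CSS resources of `page` as GBK.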
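In summary, the method provided by this embodiment identifies the encoding of a web page resource when the resource declares none, and decodes the resource with the decoding scheme that corresponds to that encoding. This addresses the problem in the related art that a browser displays garbled text when the "charset" field of a web page is omitted, so that the web page resource can be decoded and displayed correctly even when it declares no encoding.
In addition, when a web page resource declares an encoding but the declaration is misspelled, the method decodes the resource with the decoding scheme that corresponds to the re-identified or automatically corrected encoding. This addresses the problem in the related art that a browser displays garbled text when the "charset" field of a web page is misspelled, so that the web page resource can be decoded and displayed correctly even when its declared encoding is misspelled.
The following are device embodiments of the present disclosure, which may be used to carry out the method embodiments. For details not disclosed in the device embodiments, refer to the method embodiments.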
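FIG. 3 is a block diagram of a device for identifying the encoding of a web page according to an exemplary embodiment. The device may be implemented as part or all of a terminal in software, hardware, or a combination of the two. The device may include:
a data loading module 320 configured to load web page data, the web page data including at least one web page resource;
a scheme detection module 340 configured to detect whether the web page resource is an HTML resource that declares an encoding;
a scheme identification module 360 configured to identify the encoding of the HTML resource when the web page resource is an HTML resource but declares no encoding; and
a resource decoding module 380 configured to decode the HTML resource using the decoding scheme that corresponds to the identified encoding.
In summary, the device provided by this embodiment identifies the encoding of a web page resource when the resource declares none, and decodes the resource with the decoding scheme that corresponds to that encoding. This addresses the problem in the related art that a browser displays garbled text when the "charset" field of a web page is omitted, so that the web page resource can be decoded and displayed correctly even when it declares no encoding.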
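FIG. 4 is a block diagram of a device for identifying the encoding of a web page according to another exemplary embodiment. The device may be implemented as part or all of a terminal in software, hardware, or a combination of the two. The device may include:
a data loading module 320 configured to load web page data, the web page data including at least one web page resource;
a scheme detection module 340 configured to detect whether the web page resource is an HTML resource that declares an encoding;
a scheme identification module 360 configured to identify the encoding of the HTML resource when the web page resource is an HTML resource but declares no encoding; and
a resource decoding module 380 configured to decode the HTML resource using the decoding scheme that corresponds to the identified encoding.
Optionally, the device further includes:
an encoding check module 352 configured to detect, when the web page resource is an HTML resource that declares an encoding, whether the declared encoding is one of the preset encodings; and
the scheme identification module 360, configured to identify the encoding of the HTML resource when the declared encoding is not one of the preset encodings; or
an automatic correction module 370 configured to automatically correct the declared encoding, when it is not one of the preset encodings, to obtain a corrected encoding.
Optionally, the scheme identification module 360 is configured to invoke a predetermined character-encoding identification algorithm to identify the encoding of the HTML resource.
Optionally, the automatic correction module 370 includes:
a spelling calculation submodule 372 configured to calculate a spelling similarity between the declared encoding and each of the preset encodings; and
an automatic correction submodule 374 configured to determine, when the highest spelling similarity is greater than a preset threshold, the preset encoding with the highest spelling similarity to be the corrected encoding.
Optionally, the device further includes:
an encoding reuse module 354 configured to take, when the web page resource is a CSS resource, the encoding used by the HTML resource in the web page data as the encoding of the CSS resource, and to decode the CSS resource using the decoding scheme that corresponds to that encoding.
In summary, the device provided by this embodiment identifies the encoding of a web page resource when the resource declares none, and decodes the resource with the decoding scheme that corresponds to that encoding. This addresses the problem in the related art that a browser displays garbled text when the "charset" field of a web page is omitted, so that the web page resource can be decoded and displayed correctly even when it declares no encoding.
In addition, when a web page resource declares an encoding but the declaration is misspelled, the device decodes the resource with the decoding scheme that corresponds to the re-identified or automatically corrected encoding. This addresses the problem in the related art that a browser displays garbled text when the "charset" field of a web page is misspelled, so that the web page resource can be decoded and displayed correctly even when its declared encoding is misspelled.
As for the devices in the above embodiments, the specific manner in which each module operates has been described in detail in the method embodiments and is not elaborated here.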
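FIG. 5 is a block diagram of a device 500 for identifying the encoding of a web page according to an exemplary embodiment. For example, the device 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to FIG. 5, the device 500 may include one or more of the following components: a processing component 502, a memory 504, a power component 506, a multimedia component 508, an audio component 510, an input/output (I/O) interface 512, a sensor component 514, and a communication component 516.
The processing component 502 generally controls the overall operation of the device 500, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 502 may include one or more processors 520 that execute instructions to perform all or some of the steps of the methods described above. The processing component 502 may also include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 may include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support the operation of the device 500. Examples of such data include instructions for any application or method operated on the device 500, contact data, phone book data, messages, pictures, videos, and so on. The memory 504 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 506 supplies power to the various components of the device 500. The power component 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 500.
The multimedia component 508 includes a screen that provides an output interface between the device 500 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 508 includes a front camera and/or a rear camera. When the device 500 is in an operating mode, such as a photographing mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each of the front and rear cameras may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a microphone (MIC) that is configured to receive external audio signals when the device 500 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 504 or sent via the communication component 516. In some embodiments, the audio component 510 also includes a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 514 includes one or more sensors that provide status assessments of various aspects of the device 500. For example, the sensor component 514 can detect the open/closed state of the device 500 and the relative positioning of components, such as the display and keypad of the device 500. The sensor component 514 can also detect a change in position of the device 500 or of one of its components, the presence or absence of user contact with the device 500, the orientation or acceleration/deceleration of the device 500, and a change in its temperature. The sensor component 514 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate wired or wireless communication between the device 500 and other devices. The device 500 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 516 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 500 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 504 including instructions, which are executable by the processor 520 of the device 500 to perform the methods described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium is provided such that, when the instructions in the storage medium are executed by the processor of the device 500, the device 500 is able to perform the method for identifying the encoding of a web page shown in FIG. 1 or FIG. 2.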
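Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.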

Claims (11)

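  1. A method for identifying the encoding of a web page, characterized in that the method comprises:
    loading web page data, the web page data comprising at least one web page resource;
    detecting whether the web page resource is a HyperText Markup Language (HTML) resource that declares an encoding;
    if the web page resource is an HTML resource but declares no encoding, identifying the encoding of the HTML resource; and
    decoding the HTML resource using the decoding scheme that corresponds to the identified encoding.
  2. The method according to claim 1, characterized in that the method further comprises:
    if the web page resource is an HTML resource that declares an encoding, detecting whether the declared encoding is one of a set of preset encodings; and
    if the declared encoding is not one of the preset encodings, identifying the encoding of the HTML resource; or automatically correcting the declared encoding to obtain a corrected encoding.
  3. The method according to claim 1 or 2, characterized in that identifying the encoding of the HTML resource comprises:
    invoking a predetermined character-encoding identification algorithm to identify the encoding of the HTML resource.
  4. The method according to claim 2, characterized in that automatically correcting the declared encoding to obtain the corrected encoding comprises:
    calculating a spelling similarity between the declared encoding and each of the preset encodings; and
    when the highest spelling similarity is greater than a preset threshold, determining the preset encoding with the highest spelling similarity to be the corrected encoding.
  5. The method according to claim 1, characterized in that the method further comprises:
    if the web page resource is a Cascading Style Sheets (CSS) resource, taking the encoding used by the HTML resource in the web page data as the encoding of the CSS resource, and decoding the CSS resource using the decoding scheme that corresponds to that encoding.
  6. A device for identifying the encoding of a web page, characterized in that the device comprises:
    a data loading module configured to load web page data, the web page data comprising at least one web page resource;
    a scheme detection module configured to detect whether the web page resource is a HyperText Markup Language (HTML) resource that declares an encoding;
    a scheme identification module configured to identify the encoding of the HTML resource when the web page resource is an HTML resource but declares no encoding; and
    a resource decoding module configured to decode the HTML resource using the decoding scheme that corresponds to the identified encoding.
  7. The device according to claim 6, characterized in that the device further comprises:
    an encoding check module configured to detect, when the web page resource is an HTML resource that declares an encoding, whether the declared encoding is one of a set of preset encodings; and
    the scheme identification module, configured to identify the encoding of the HTML resource when the declared encoding is not one of the preset encodings; or an automatic correction module configured to automatically correct the declared encoding, when it is not one of the preset encodings, to obtain a corrected encoding.
  8. The device according to claim 6 or 7, characterized in that
    the scheme identification module is configured to invoke a predetermined character-encoding identification algorithm to identify the encoding of the HTML resource.
  9. The device according to claim 7, characterized in that the automatic correction module comprises:
    a spelling calculation submodule configured to calculate a spelling similarity between the declared encoding and each of the preset encodings; and
    an automatic correction submodule configured to determine, when the highest spelling similarity is greater than a preset threshold, the preset encoding with the highest spelling similarity to be the corrected encoding.
  10. The device according to claim 6, characterized in that the device further comprises:
    an encoding reuse module configured to take, when the web page resource is a Cascading Style Sheets (CSS) resource, the encoding used by the HTML resource in the web page data as the encoding of the CSS resource, and to decode the CSS resource using the decoding scheme that corresponds to that encoding.
  11. A device for identifying the encoding of a web page, characterized in that the device comprises:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to:
    load web page data, the web page data comprising at least one web page resource;
    detect whether the web page resource is a HyperText Markup Language (HTML) resource that declares an encoding;
    if the web page resource is an HTML resource but declares no encoding, identify the encoding of the HTML resource; and
    decode the HTML resource using the decoding scheme that corresponds to the identified encoding.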
PCT/CN2015/071308 2014-10-21 2015-01-22 Method and device for identifying encoding of web page WO2016061930A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
RU2015110973A RU2610245C2 (ru) 2014-10-21 2015-01-22 Method and device for identifying web page encoding
BR112015006725A BR112015006725A2 (pt) 2014-10-21 2015-01-22 Method and device for identifying web page encoding
MX2015003807A MX361564B (es) 2014-10-21 2015-01-22 Method and device for identifying web page encoding
JP2016554794A JP6130976B2 (ja) 2014-10-21 2015-01-22 Web page encoding identification method, web page encoding identification device, program, and recording medium
KR1020157007129A KR20160059455A (ko) 2014-10-21 2015-01-22 Web page encoding recognition method, recognition device, program, and storage medium
US14/684,855 US20160112491A1 (en) 2014-10-21 2015-04-13 Method and device for identifying encoding of web page

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410562477.9A CN104361021B (zh) 2014-10-21 2014-10-21 Method and device for identifying encoding of web page
CN201410562477.9 2014-10-21

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/684,855 Continuation US20160112491A1 (en) 2014-10-21 2015-04-13 Method and device for identifying encoding of web page

Publications (1)

Publication Number Publication Date
WO2016061930A1 true WO2016061930A1 (zh) 2016-04-28

Family

ID=52528283

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/071308 WO2016061930A1 (zh) 2014-10-21 2015-01-22 Method and device for identifying encoding of web page

Country Status (8)

Country Link
EP (1) EP3012750A1 (zh)
JP (1) JP6130976B2 (zh)
KR (1) KR20160059455A (zh)
CN (1) CN104361021B (zh)
BR (1) BR112015006725A2 (zh)
MX (1) MX361564B (zh)
RU (1) RU2610245C2 (zh)
WO (1) WO2016061930A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
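CN104994128B (zh) * 2015-05-15 2019-04-26 北京网康科技有限公司 Method and device for identifying data encoding type and transcoding
CN105468753A (zh) * 2015-11-27 2016-04-06 北京金和网络股份有限公司 System and method for displaying data in multiple encoding formats
CN106407438A (zh) * 2016-09-28 2017-02-15 珠海迈越信息技术有限公司 Data processing method and system
CN110020343B (zh) * 2017-09-01 2021-03-30 北京国双科技有限公司 Method and device for determining web page encoding format
CN110674377A (zh) * 2019-09-24 2020-01-10 四川长虹电器股份有限公司 Crawler-based method for acquiring trending news terms
CN114024651A (zh) * 2020-07-16 2022-02-08 深信服科技股份有限公司 Encoding type identification method, apparatus, device, and readable storage medium
CN114415817B (zh) * 2020-10-28 2024-05-07 北京小米移动软件有限公司 Display control method, electronic device, and storage medium
CN113595683A (zh) * 2021-07-07 2021-11-02 西安震有信通科技有限公司 Conversion processing method, apparatus, terminal, and medium based on files of various encodings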

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101526963A (zh) * 2009-04-17 2009-09-09 深圳华为通信技术有限公司 Web page encoding identification method, apparatus, and terminal device
CN103207877A (zh) * 2012-01-17 2013-07-17 阿里巴巴集团控股有限公司 Decoding method and device
US20140075344A1 (en) * 2012-09-11 2014-03-13 Ebay Inc. Visual state comparator

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3203544B2 (ja) * 1996-01-31 2001-08-27 日本電信電話株式会社 Text maximum-likelihood decoding method, maximum-likelihood decoding device, and data communication network device
JP2000132449A (ja) * 1998-10-27 2000-05-12 Nippon Telegr & Teleph Corp <Ntt> Proxy access method and device, and recording medium storing a proxy access program
US6701320B1 (en) * 2002-04-24 2004-03-02 Bmc Software, Inc. System and method for determining a character encoding scheme
US7148824B1 (en) * 2005-08-05 2006-12-12 Xerox Corporation Automatic detection of character encoding format using statistical analysis of the text strings
US7711673B1 (en) * 2005-09-28 2010-05-04 Trend Micro Incorporated Automatic charset detection using SIM algorithm with charset grouping
US8271263B2 (en) * 2007-03-30 2012-09-18 Symantec Corporation Multi-language text fragment transcoding and featurization
JP5565197B2 (ja) * 2010-08-18 2014-08-06 富士通株式会社 Web application cooperation method, cooperation device, and cooperation program
RU2500024C2 (ru) * 2011-12-27 2013-11-27 Общество С Ограниченной Ответственностью "Центр Инноваций Натальи Касперской" Method for automated determination of the language and/or encoding of a text document
TWI493365B (zh) * 2013-08-16 2015-07-21 Arphic Technology Co Ltd Method, system, and device for multi-character-set code input and real-time display

Also Published As

Publication number Publication date
EP3012750A1 (en) 2016-04-27
KR20160059455A (ko) 2016-05-26
MX2015003807A (es) 2016-08-02
JP2016539450A (ja) 2016-12-15
JP6130976B2 (ja) 2017-05-17
RU2015110973A (ru) 2016-10-20
CN104361021A (zh) 2015-02-18
MX361564B (es) 2018-12-11
BR112015006725A2 (pt) 2017-07-04
CN104361021B (zh) 2018-07-24
RU2610245C2 (ru) 2017-02-08

Similar Documents

Publication Publication Date Title
WO2016061930A1 (zh) Method and device for identifying encoding of web page
EP3300407B1 (en) Method and device for processing verification code
US10949490B2 (en) Method and apparatus for displaying webpage content
EP2924591A1 (en) Method and device for controlling page rollback
JP6918181B2 (ja) Machine translation model training method, device, and system
WO2016188060A1 (zh) Method and device for processing application installation package
WO2016011717A1 (zh) Message push method, device, terminal, and server for light applications
WO2019165832A1 (zh) Text information processing method, device, and terminal
US20190012153A1 (en) Method and device for supporting multi-framework syntax
WO2016119497A1 (zh) Firmware compression method, firmware decompression method, and device
KR101944416B1 (ko) Method for providing video call analysis service and electronic device therefor
US20180365200A1 (en) Method, device, electric device and computer-readable storage medium for updating page
EP2921969A1 (en) Method and apparatus for centering and zooming webpage and electronic device
WO2015176483A1 (zh) Label creation method, device, and terminal
WO2017092121A1 (zh) Information processing method and device
US20160350584A1 (en) Method and apparatus for providing contact card
US20230252778A1 (en) Formula recognition method and apparatus
CN104951445B (zh) Web page processing method and device
CN109977424B (zh) Method and device for training machine translation model
EP2963561A1 (en) Method and device for updating user data
CN109992754B (zh) Document processing method and device
US20210157981A1 (en) Method and terminal for performing word segmentation on text information, and storage medium
WO2023092975A1 (zh) Image processing method and device, electronic device, storage medium, and computer program product
US20160112491A1 (en) Method and device for identifying encoding of web page
US9679076B2 (en) Method and device for controlling page rollback

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2016554794

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20157007129

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: MX/A/2015/003807

Country of ref document: MX

ENP Entry into the national phase

Ref document number: 2015110973

Country of ref document: RU

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15852729

Country of ref document: EP

Kind code of ref document: A1

REG Reference to national code

Ref country code: BR

Ref legal event code: B01A

Ref document number: 112015006725

Country of ref document: BR

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 112015006725

Country of ref document: BR

Kind code of ref document: A2

Effective date: 20150325

122 Ep: pct application non-entry in european phase

Ref document number: 15852729

Country of ref document: EP

Kind code of ref document: A1