CN105468753A - Multi-coding-format data display system and method - Google Patents

Multi-coding-format data display system and method

Info

Publication number
CN105468753A
Authority
CN
China
Prior art keywords
data
unit
coded format
code form
display
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510848005.4A
Other languages
Chinese (zh)
Inventor
张天祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gold And Network Ltd Co
Original Assignee
Beijing Gold And Network Ltd Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gold And Network Ltd Co filed Critical Beijing Gold And Network Ltd Co
Priority to CN201510848005.4A priority Critical patent/CN105468753A/en
Publication of CN105468753A publication Critical patent/CN105468753A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention provides a display system and method for data sources captured from the network by a web crawler. The system comprises an acquisition unit, a parsing unit, a first comparison unit, a first conversion unit, a storage unit and a display unit, wherein the acquisition unit acquires data from a data source; the parsing unit determines the encoding format type of the data; the first comparison unit judges whether the encoding format of the data is consistent with a preset encoding format of the storage unit; the first conversion unit converts the encoding format of the data into the preset encoding format of the storage unit; the storage unit stores the data in the preset encoding format; and the display unit displays the data acquired from the storage unit. Even when the data sources captured from the network use diverse encoding formats, processing such as encoding conversion allows the system and method to effectively avoid garbled characters during data storage and display.

Description

Multi-coding-format data display system and method
Technical field
The present invention relates to a display system and method for multi-encoding-format data, and in particular to a display system and method for data sources captured from the network by a web crawler.
Background
With the rapid development of the network, the World Wide Web has become the carrier of a vast amount of information, and extracting and using that information effectively has become a major challenge. Tools such as Google, which help people retrieve information, have become the entry point and guide through which users access the World Wide Web. A web crawler is a program that automatically fetches web pages; it downloads pages from the World Wide Web on behalf of a search engine. It finds, browses, and downloads the resources available on websites in order to build a corpus, that is, a set of resources that other programs can use, and it is an important component of a search engine. Such programs are also called ants, robots, or web spiders. They are referred to below as "web crawlers", or more simply as "crawlers".
The data sources that web crawlers capture from the network come in many encoding formats. Common ones include GB2312, UTF-8, and ISO-8859-1, and the Japanese (JP-family), Western European, Russian, and other encodings all differ from one another. Some crawlers perform only simple encoding identification on a page before unifying the encoding; others make no determination about the source page at all and process everything uniformly as UTF-8, which produces garbled characters in the front-end display. A storage and display method for multi-encoding-format data is therefore needed to solve these problems.
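The garbling described above can be reproduced in a few lines. The following is a minimal sketch, not part of the patent text; the sample string and variable names are assumptions chosen for illustration. It shows what happens when bytes from a GBK-encoded page are decoded as UTF-8 without checking the source encoding.

```python
# Hypothetical page text; any Chinese string would behave the same way.
text = "网络爬虫"

# Bytes as a GBK-encoded website would send them.
gbk_bytes = text.encode("gbk")

# A crawler that assumes UTF-8 without checking the source encoding.
# Invalid byte sequences are replaced with U+FFFD instead of raising.
wrong = gbk_bytes.decode("utf-8", errors="replace")

# Decoding with the page's actual encoding recovers the text.
right = gbk_bytes.decode("gbk")

print(wrong == text)  # False: the display would be garbled
print(right == text)  # True
```

Once the bytes have been decoded with the wrong encoding and replacement characters have been stored, the original text can no longer be recovered, which is why the encoding has to be determined before storage.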
Summary of the invention
The object of the present invention is to provide a display system and method for data sources captured from the network by a web crawler, solving the problem of garbled characters appearing when data sources in multiple encoding formats are stored and displayed.
According to one aspect of the present invention, a multi-encoding-format data display system is provided, comprising: an acquisition unit, which obtains data from a data source; a parsing unit, which determines the encoding format type of the data; a first comparison unit, which judges whether the encoding format of the data is consistent with the preset encoding format of a storage unit; a first conversion unit, which converts the encoding format of the data into the preset encoding format of the storage unit; the storage unit, which stores the data in the preset encoding format; and a display unit, which displays the data obtained from the storage unit.
Preferably, the system further comprises: a second comparison unit, which compares the preset encoding format of the display unit with the encoding format of the data stored in the storage unit; and a second conversion unit, which converts the data into the preset encoding format of the display unit.
Preferably, the acquisition unit is a crawler engine.
Preferably, the encoding format of the data is GBK or UTF-8.
Preferably, the preset encoding format is GB2312 or UTF-8.
According to another aspect of the present invention, a multi-encoding-format data display method is provided, comprising: an acquisition step, in which the acquisition unit obtains data from a data source; an encoding format parsing step, in which the parsing unit parses the encoding format of the data and determines the encoding format type; a storing step, in which the first comparison unit compares the encoding format of the data with the preset encoding format, and the data is stored directly in the storage unit when its encoding format is the preset encoding format, or is first converted by the first conversion unit into the preset encoding format of the storage unit and then stored when its encoding format is not the preset encoding format; and a display step, in which the display unit obtains data from the storage unit and displays it.
Preferably, in the display step, the second comparison unit compares the encoding format of the data obtained from the storage unit with the preset encoding format of the display unit. When the encoding format of the data is the preset encoding format of the display unit, the data is displayed directly on the display unit; when it is not, the second conversion unit converts the encoding format of the data into the preset encoding format of the display unit, and the data is then displayed on the display unit.
Preferably, in the acquisition step, a web crawler engine crawls web pages from the network.
Preferably, the preset encoding format is GB2312 or UTF-8.
The encoding format parsing step examines the definition found in any one of three locations, namely the content of the HTML (Hypertext Markup Language) header of the web page, the meta charset of the web page, and the head file of the web page, to determine the encoding type of the web page.
With the multi-encoding-format data display system and method provided by the present invention, even when the data sources captured from the network use diverse encoding formats, processing such as encoding conversion effectively avoids garbled characters when the data is stored and displayed.
Brief description of the drawings
Fig. 1 is a functional block diagram of embodiment one of the multi-encoding-format data display system.
Fig. 2 is a functional block diagram of embodiment two of the multi-encoding-format data display system.
Fig. 3 is the overall flowchart of the multi-encoding-format data display method.
Fig. 4 is the flowchart of the storing step in the multi-encoding-format data display method.
Fig. 5 is the flowchart of the display step in embodiment two of the multi-encoding-format data display method.
Detailed description
To make the object, technical solution, and advantages of the present invention clearer, the technical solution in the embodiments of the present invention is described clearly and completely below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative work fall within the scope of protection of the present invention.
Fig. 1 is a functional block diagram of embodiment one of the multi-encoding-format data display system. As shown in Fig. 1, the system comprises: an acquisition unit 11, which obtains data from a data source; a parsing unit 12, which determines the encoding format type of the data; a comparison unit 13, which judges whether the encoding format of the data is consistent with the preset encoding format of a storage unit 15; a conversion unit 14, which converts the encoding format of the data into the preset encoding format of the storage unit 15; the storage unit 15, which stores the data in the preset encoding format; and a display unit 16, which displays the data obtained from the storage unit 15. The comparison unit 13 is connected to the storage unit 15 directly and also via the conversion unit 14.
The acquisition unit 11 is preferably a web crawler engine. The encoding format of the data is GBK, UTF-8, or another common encoding format. The preset encoding format of the storage unit 15 is preferably GB2312 or UTF-8. The encoding format supported by the display unit 16 is preferably GB2312 or UTF-8. In this embodiment, the encoding format supported by the display unit 16 must be set to match the preset encoding format of the storage unit 15.
In embodiment two of the multi-encoding-format data display system according to the present invention, shown in Fig. 2, the system comprises, in addition to the functional units above, a comparison unit 13′, which compares the encoding format supported by the display unit 16 with the encoding format of the data stored in the storage unit 15, and a conversion unit 14′, which converts the encoding format of the data into the encoding format supported by the display unit 16. The comparison unit 13′ is connected directly to the display unit 16 and is also connected to the storage unit 15 via the conversion unit 14′.
The acquisition unit 11 is preferably a web crawler engine. The encoding format of the data is GBK, UTF-8, or another common encoding format. The preset encoding format of the storage unit 15 is preferably GB2312 or UTF-8. The encoding format supported by the display unit 16 is preferably GB2312 or UTF-8. In this embodiment, the encoding format supported by the display unit 16 may differ from the preset encoding format of the storage unit 15.
According to another aspect of the present invention, a multi-encoding-format data display method is provided. It is described in detail below with reference to the drawings. Fig. 3 is the overall flowchart of the multi-encoding-format data display method. As shown in Fig. 3, the method comprises the following steps.
First, in acquisition step S1, the acquisition unit 11 obtains web pages from a website. The acquisition unit 11 is preferably a web crawler.
Next, in encoding format parsing step S2, the parsing unit 12 parses the encoding format of the data and determines the encoding format type. For example, it examines the definition found in any one of three locations, namely the content of the HTML (Hypertext Markup Language) header of the web page, the meta charset of the web page, and the head file of the web page, to determine the encoding type of the web page. The encoding format of the web page is, for example, UTF-8, GBK, or another common encoding format. In addition, the parsing unit 12 can parse the content of the web page, including, for example, the title, the body, and the publication time of the article.
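Parsing step S2 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the function name `detect_encoding`, the 4096-byte scan window, and the fallback default are all hypothetical, and the sketch interprets two of the declared locations as the charset parameter of the HTTP `Content-Type` header value and the charset declared in a `<meta>` tag. The third location named in the text (the head file of the web page) is not modeled here.

```python
import re


def detect_encoding(content_type, html_bytes, default="utf-8"):
    """Return the encoding a page declares, lower-cased.

    Checks, in order, the charset parameter of the Content-Type header
    value and a charset declared in a <meta> tag near the top of the
    document. Falls back to `default` (an assumption of this sketch)
    when neither location declares an encoding.
    """
    if content_type:
        m = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type, re.I)
        if m:
            return m.group(1).lower()

    # Only scan the start of the document; meta declarations appear early.
    head = html_bytes[:4096]
    m = re.search(rb"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", head, re.I)
    if m:
        return m.group(1).decode("ascii").lower()

    return default


# Usage with made-up inputs:
print(detect_encoding("text/html; charset=GBK", b""))
print(detect_encoding(None, b'<html><head><meta charset="gb2312"></head>'))
```

Declared encodings are not always truthful, so a production crawler would likely combine this with byte-level detection; that is outside what the text describes.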
Next, storing step S3 is described with reference to Fig. 4. In step S31, the comparison unit 13 compares the encoding format of the data with the preset encoding format of the storage unit 15. If the encoding format of the data is the preset encoding format, the judgment is "Yes" and the method proceeds directly to storing step S33, in which the data is stored in the storage unit 15. If the encoding format of the data is inconsistent with the preset encoding format of the storage unit 15, the judgment is "No" and the method enters encoding format conversion step S32, in which the conversion unit 14 converts the encoding format of the data into the preset encoding format of the storage unit 15; storing step S33 is then carried out and the data is stored in the storage unit 15. The preset encoding format of the storage unit 15 is preferably GB2312 or UTF-8. Acquisition step S1, encoding format parsing step S2, and storing step S3 are repeated in a continuous loop, so that a large amount of data accumulates in the storage unit 15.
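The compare, convert, and store flow of S31 to S33 can be sketched as below. The names `PRESET_ENCODING`, `same_encoding`, and `store` are hypothetical, the storage unit is modeled as a plain list, and UTF-8 is assumed as the preset format (the text allows GB2312 or UTF-8). Encoding names are compared through `codecs.lookup` so that aliases such as `UTF8` and `utf-8` count as the same format.

```python
import codecs

PRESET_ENCODING = "utf-8"  # assumed preset encoding of the storage unit


def same_encoding(a, b):
    """Compare two encoding names, treating aliases as equal."""
    return codecs.lookup(a).name == codecs.lookup(b).name


def store(raw, source_encoding, storage):
    """Sketch of S31-S33: compare, convert if needed, then store."""
    if same_encoding(source_encoding, PRESET_ENCODING):
        # S31 judged "Yes": store the bytes unchanged.
        data = raw
    else:
        # S31 judged "No": S32 converts to the preset encoding.
        data = raw.decode(source_encoding).encode(PRESET_ENCODING)
    storage.append(data)  # S33
    return data


# Usage: two pages with the same text in different encodings
# end up as identical bytes in storage.
storage = []
store("爬虫".encode("gbk"), "GBK", storage)
store("爬虫".encode("utf-8"), "UTF8", storage)
```

If the preset format were GB2312 instead, the `encode` call could fail for characters outside that character set, so a real implementation would need an error-handling policy; the text does not specify one.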
Next, in display step S4, the display unit 16 obtains the required data from the storage unit 15 and displays it. The encoding format supported by the display unit 16 is set to match the preset encoding format of the storage unit 15.
In embodiment two of the multi-encoding-format data display method, only the display step S4′ differs; the other steps are the same as in the embodiment above. Fig. 5 shows the flowchart of display step S4′. The required data is first obtained from the storage unit 15 (S4′1). Then, in encoding format comparison step S4′2, the comparison unit 13′ compares the encoding format of the data with the encoding format supported by the display unit 16. If the encoding format of the data is a supported format, the judgment is "Yes" and the method proceeds directly to display step S4′4, in which the data is displayed on the display unit 16. If the encoding format of the data is inconsistent with the preset encoding format of the display unit 16, the judgment is "No" and the method enters encoding format conversion step S4′3, in which the conversion unit 14′ converts the encoding format of the data into the preset encoding format of the display unit 16; display step S4′4 is then carried out and the data is displayed on the display unit 16. The preset encoding format of the display unit 16 is preferably GB2312 or UTF-8. In this case, the encoding format supported by the display unit 16 may differ from the preset encoding format of the storage unit 15.
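Display step S4′ of embodiment two can be sketched in the same style. The function name `prepare_for_display` and its parameters are hypothetical; the sketch assumes the stored bytes are in a known storage encoding and returns bytes in the encoding that the display unit expects.

```python
import codecs


def prepare_for_display(stored, storage_encoding, display_encoding):
    """Sketch of S4'2-S4'4: compare, convert if needed, return bytes to show."""
    same = (
        codecs.lookup(storage_encoding).name
        == codecs.lookup(display_encoding).name
    )
    if same:
        # S4'2 judged "Yes": pass the stored bytes straight to display.
        return stored
    # S4'2 judged "No": S4'3 converts to the display unit's encoding.
    return stored.decode(storage_encoding).encode(display_encoding)


# Usage: storage holds UTF-8, the display unit expects GB2312.
stored = "数据".encode("utf-8")           # S4'1: data read from storage
shown = prepare_for_display(stored, "utf-8", "gb2312")
unchanged = prepare_for_display(stored, "utf-8", "UTF-8")
```

Keeping this second comparison separate from the storing step is what lets the storage and display encodings differ, as the text notes for embodiment two.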
The above is only a specific embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A multi-encoding-format data display system, characterized in that
it comprises:
an acquisition unit, which obtains data from a data source;
a parsing unit, which determines the encoding format type of the data;
a first comparison unit, which judges whether the encoding format of the data is consistent with the preset encoding format of a storage unit;
a first conversion unit, which converts the encoding format of the data into the preset encoding format of the storage unit;
the storage unit, which stores the data in the preset encoding format; and
a display unit, which displays the data obtained from the storage unit.
2. The multi-encoding-format data display system according to claim 1, characterized in that
it further comprises:
a second comparison unit, which compares the preset encoding format of the display unit with the encoding format of the data stored in the storage unit; and
a second conversion unit, which converts the data into the preset encoding format of the display unit.
3. The multi-encoding-format data display system according to claim 1 or 2, characterized in that
the acquisition unit is a crawler engine.
4. The multi-encoding-format data display system according to claim 3, characterized in that
the encoding format is GBK or UTF-8.
5. The multi-encoding-format data display system according to claim 3, characterized in that
the preset encoding format is GB2312 or UTF-8.
6. A multi-encoding-format data display method, characterized in that
it comprises:
an acquisition step, in which an acquisition unit obtains data from a data source;
an encoding format parsing step, in which a parsing unit parses the encoding format of the data and determines the encoding format type;
a storing step, in which a first comparison unit compares the encoding format of the data with a preset encoding format; when the encoding format is the preset encoding format, the data is stored directly in a storage unit; when the encoding format of the data is not the preset encoding format, a first conversion unit converts the encoding format of the data into the preset encoding format of the storage unit and the data is then stored;
and
a display step, in which a display unit obtains data from the storage unit and displays it.
7. The multi-encoding-format data display method according to claim 6, characterized in that
in the display step, a second comparison unit compares the encoding format of the data obtained from the storage unit with the preset encoding format of the display unit; when the encoding format of the data is the preset encoding format of the display unit, the data is displayed directly on the display unit; when the encoding format of the data is not the preset encoding format of the display unit, a second conversion unit converts the encoding format of the data into the preset encoding format of the display unit and the data is then displayed on the display unit.
8. The multi-encoding-format data display method according to claim 6 or 7, characterized in that
in the acquisition step, a web crawler engine crawls web pages from the network.
9. The multi-encoding-format data display method according to claim 8, characterized in that
the preset encoding format is GB2312 or UTF-8.
10. The multi-encoding-format data display method according to claim 8, characterized in that
the encoding format parsing step examines the definition found in any one of three locations, namely the content of the HTML (Hypertext Markup Language) header of the web page, the meta charset of the web page, and the head file of the web page, to determine the encoding type of the web page.
CN201510848005.4A 2015-11-27 2015-11-27 Multi-coding-format data display system and method Pending CN105468753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510848005.4A CN105468753A (en) 2015-11-27 2015-11-27 Multi-coding-format data display system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510848005.4A CN105468753A (en) 2015-11-27 2015-11-27 Multi-coding-format data display system and method

Publications (1)

Publication Number Publication Date
CN105468753A true CN105468753A (en) 2016-04-06

Family

ID=55606454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510848005.4A Pending CN105468753A (en) 2015-11-27 2015-11-27 Multi-coding-format data display system and method

Country Status (1)

Country Link
CN (1) CN105468753A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7405677B2 (en) * 2006-08-08 2008-07-29 International Business Machines Corporation Apparatus, system, and method for incremental encoding conversion of XML data using Java
CN101551792A (en) * 2008-04-03 2009-10-07 鸿富锦精密工业(深圳)有限公司 Messy code recovery system and method
CN102262520A (en) * 2010-05-31 2011-11-30 北京创艺和弦科贸有限公司 Test display method based on built-in platform mobile phone and applied device thereof
CN104361021A (en) * 2014-10-21 2015-02-18 小米科技有限责任公司 Webpage encoding identifying method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107086942A (en) * 2017-04-25 2017-08-22 北京锐安科技有限公司 A kind of Web content service provider ICP reported datas inspection method and device
CN107086942B (en) * 2017-04-25 2019-12-03 北京锐安科技有限公司 A kind of Web content service provider ICP reported data inspection method and device
CN108256110A (en) * 2018-02-08 2018-07-06 平安科技(深圳)有限公司 Gathering method, device, computer equipment and the storage medium of information
WO2019153588A1 (en) * 2018-02-08 2019-08-15 平安科技(深圳)有限公司 Intelligence information collection method and apparatus, computer device and storage medium
CN109063091A (en) * 2018-07-26 2018-12-21 成都大学 Data migration method, data migration device and the storage medium of hybrid coding
CN111368508A (en) * 2020-03-03 2020-07-03 深信服科技股份有限公司 Data processing method, device, equipment and medium
CN111368508B (en) * 2020-03-03 2024-04-09 深信服科技股份有限公司 Data processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
US9639631B2 (en) Converting XML to JSON with configurable output
CN103577466B (en) Method and device for displaying webpage content in browser
CN103166981B (en) A kind of radio web page code-transferring method and device
CN108717437B (en) Search result display method and device and storage medium
CN106547749B (en) Webpage data acquisition method and device
CN105468753A (en) Multi-coding-format data display system and method
CN102306201B (en) Method and system for analyzing webpage title
CN101526953A (en) WWW transformation technology
CN102812456A (en) Method For Content Folding
CN108334508B (en) Webpage information extraction method and device
CN104063401A (en) Webpage style address merging method and device
CN103365877B (en) Method and server to establishing catalogue after webpage progress transcoding
US11403078B2 (en) Interface layout interference detection
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN105447198A (en) Convenient page script importing method and device
WO2013148351A1 (en) System and method for analyzing an electronic documents
CN103838862A (en) Video searching method, device and terminal
CN104331438A (en) Method and device for selectively extracting content of novel webpage
CN106294885A (en) A kind of data collection towards isomery webpage and mask method
CN103488560A (en) Test object processing method and test object processing device for webpage test
CN101539933B (en) Intelligent content direct technology
CN111381809B (en) Method and device for searching focus page
US10095791B2 (en) Information search method and apparatus
CN103853777A (en) Method and device for accessing websites through keywords
KR101175168B1 (en) Apparatus and method for searching a plurality of web-sites through a web-site in the terminal device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160406