CN105468753A - Multi-coding-format data display system and method - Google Patents
Multi-coding-format data display system and method
- Publication number
- CN105468753A (application number CN201510848005.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- unit
- coded format
- code form
- display
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention provides a display system and method for data sources captured from a network by a web crawler. The system comprises an acquiring unit, a parsing unit, a first comparing unit, a first converting unit, a storage unit and a display unit. The acquiring unit obtains data from a data source; the parsing unit determines the encoding format type of the data; the first comparing unit judges whether the encoding format of the data is consistent with the preset encoding format of the storage unit; the first converting unit converts the encoding format of the data into the preset encoding format of the storage unit; the storage unit stores the data in the preset encoding format; and the display unit displays the data obtained from the storage unit. Even when the encoding formats of the data sources captured from the network are diverse, processing such as encoding conversion effectively avoids garbled characters when the data is stored and displayed.
Description
Technical field
The present invention relates to a display system and method for data in multiple encoding formats, and in particular to a display system and method for data sources captured from a network by a web crawler.
Background art
With the rapid development of the network, the World Wide Web has become the carrier of a vast amount of information, and extracting and using this information effectively has become a major challenge. Search engines such as Google, as tools that help people retrieve information, have become the entry point and guide for users accessing the World Wide Web. A web crawler is a program that automatically extracts web pages; it downloads pages from the World Wide Web on behalf of a search engine and is an important component of the search engine. More specifically, a web crawler is a program for finding, browsing and downloading the resources available on websites in a network in order to form a corpus, that is, a set of resources that can be used by other programs. Such programs are also called ants, robots or web spiders. Below, they are called "web crawlers", or simply "crawlers".
The encoding formats of the data sources captured from the network by web crawlers vary widely. Common ones include GB2312, UTF-8 and ISO-8859-1, and the encodings used for Japanese, Western European languages, Russian and so on all differ. Some crawlers perform only simple encoding identification on a web page before unifying the encoding, and others process everything uniformly as UTF-8 without examining the source page at all, which leads to garbled characters in the front-end display. A storage and display method for data in multiple encoding formats is therefore needed to solve this problem.
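The garbling described above can be reproduced in a few lines. The following is a minimal sketch, not part of the patent text; the sample string and variable names are assumptions chosen for illustration.

```python
# Illustrative sample text; any Chinese string behaves similarly.
text = "中文网页"

# The same text is a different byte sequence in GBK and in UTF-8.
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# Treating GBK bytes as UTF-8 without checking the source encoding
# produces replacement characters, i.e. garbled output.
garbled = gbk_bytes.decode("utf-8", errors="replace")

# Decoding with the correct source encoding recovers the text, which can
# then be re-encoded into one preset encoding such as UTF-8.
recovered = gbk_bytes.decode("gbk")
unified = recovered.encode("utf-8")

print(repr(garbled))
print(recovered)
```

The point of the sketch is that the conversion is only safe once the source encoding is known, which is the role the invention assigns to a dedicated parsing step.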
Summary of the invention
The object of the present invention is to provide a display system and method for data sources captured from a network by a web crawler, solving the problem of garbled characters appearing when data sources in multiple encoding formats are stored and displayed.
According to one aspect of the present invention, a display system for data in multiple encoding formats is provided, comprising: an acquiring unit, which obtains data from a data source; a parsing unit, which determines the encoding format type of the data; a first comparing unit, which judges whether the encoding format of the data is consistent with the preset encoding format of a storage unit; a first converting unit, which converts the encoding format of the data into the preset encoding format of the storage unit; the storage unit, which stores the data in the preset encoding format; and a display unit, which displays the data obtained from the storage unit.
Preferably, the system further comprises: a second comparing unit, which compares the preset encoding format of the display unit with the encoding format of the data stored in the storage unit; and a second converting unit, which converts the data into the preset encoding format of the display unit.
Preferably, the acquiring unit is a crawler engine.
Preferably, the encoding format of the data is GBK or UTF-8.
Preferably, the preset encoding format is GB2312 or UTF-8.
According to another aspect of the present invention, a display method for data in multiple encoding formats is provided, comprising: an acquiring step, in which the acquiring unit obtains data from a data source; an encoding format parsing step, in which the parsing unit parses the encoding format of the data and determines the encoding format type; a storing step, in which the first comparing unit compares the encoding format of the data with the preset encoding format, and, when the encoding format is the preset encoding format, the data is stored directly in the storage unit, whereas when the encoding format of the data is not the preset encoding format, the first converting unit converts the encoding format of the data into the preset encoding format of the storage unit before the data is stored; and a display step, in which the display unit obtains data from the storage unit and displays it.
Preferably, in the display step, the second comparing unit compares the encoding format of the data obtained from the storage unit with the preset encoding format of the display unit. When the encoding format of the data is the preset encoding format of the display unit, the data is displayed directly on the display unit; when it is not, the second converting unit converts the encoding format of the data into the preset encoding format of the display unit, and the data is then displayed on the display unit.
Preferably, in the acquiring step, a web crawler engine crawls web pages from the network.
Preferably, the preset encoding format is GB2312 or UTF-8.
The encoding format parsing step determines the encoding type of the web page by examining the definition at any one of three positions: the content of the HTML (HyperText Markup Language) header in the web page, the meta charset of the web page, and the head file of the web page.
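A parsing step of this kind could be sketched as follows. This is a hedged illustration, not the patent's exact rule: the function name `detect_encoding`, the regular expressions, the 4096-byte scan window and the fallback default are all assumptions, and the three positions are read here as a `Content-Type` header value, a `<meta charset>` attribute, and a `<meta http-equiv="Content-Type">` declaration in the page head.

```python
import re
from typing import Optional

# Hypothetical patterns: a charset parameter in a header value, and a
# charset declared anywhere inside a <meta ...> tag of the page.
_HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def detect_encoding(content_type: Optional[str], body: bytes,
                    default: str = "utf-8") -> str:
    """Return the first encoding declared at any of the checked positions."""
    # Position 1: charset parameter of the Content-Type header value.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # Positions 2 and 3: <meta charset="..."> or
    # <meta http-equiv="Content-Type" content="...; charset=...">.
    match = _META_CHARSET.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    # Nothing declared: fall back to an assumed default.
    return default


print(detect_encoding("text/html; charset=GBK", b""))
print(detect_encoding(None, b'<html><head><meta charset="UTF-8"></head></html>'))
```

Checking the positions in a fixed order is a design choice of this sketch; the text only requires that any one of them be used.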
According to display system and the method for odd encoder formatted data provided by the present invention, even if the coded format of the data source captured from network is varied, through process such as code conversion, can effectively avoid data to store, display time there is the problem of mess code.
Brief description of the drawings
Fig. 1 is a functional block diagram of embodiment one of the multi-encoding-format data display system.
Fig. 2 is a functional block diagram of embodiment two of the multi-encoding-format data display system.
Fig. 3 is the overall flow chart of the multi-encoding-format data display method.
Fig. 4 is a flow chart of the storing step in the multi-encoding-format data display method.
Fig. 5 is a flow chart of the display step in embodiment two of the multi-encoding-format data display method.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a functional block diagram of embodiment one of the multi-encoding-format data display system. As shown in Fig. 1, the system comprises: an acquiring unit 11, which obtains data from a data source; a parsing unit 12, which determines the encoding format type of the data; a comparing unit 13, which judges whether the encoding format of the data is consistent with the preset encoding format of the storage unit 15; a converting unit 14, which converts the encoding format of the data into the preset encoding format of the storage unit 15; the storage unit 15, which stores the data in the preset encoding format; and a display unit 16, which displays the data obtained from the storage unit 15. The comparing unit 13 is connected to the storage unit 15 directly and is also connected to the storage unit 15 via the converting unit 14.
The acquiring unit 11 is preferably a web crawler engine. The encoding format of the data is GBK, UTF-8 or another common encoding format. The preset encoding format of the storage unit 15 is preferably GB2312 or UTF-8. The encoding format supported by the display unit 16 is preferably GB2312 or UTF-8. In this embodiment, the encoding format supported by the display unit 16 must be set to be consistent with the preset encoding format of the storage unit 15.
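The arrangement of units in embodiment one might be modelled as below. This is only a sketch under stated assumptions: the class `DisplaySystem`, its field names, the in-memory list standing in for the storage unit, and the `guess_encoding` stand-in for the parsing unit are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class DisplaySystem:
    acquire: Callable[[], Iterable[bytes]]   # acquiring unit 11 (e.g. a crawler engine)
    parse_encoding: Callable[[bytes], str]   # parsing unit 12
    storage_encoding: str = "utf-8"          # preset encoding of storage unit 15
    display_encoding: str = "utf-8"          # encoding supported by display unit 16
    storage: List[bytes] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Embodiment one requires display and storage encodings to agree.
        if self.display_encoding.lower() != self.storage_encoding.lower():
            raise ValueError("display encoding must match storage encoding")

    def ingest(self) -> None:
        for raw in self.acquire():
            source = self.parse_encoding(raw)
            # Comparing unit 13: does the source match the storage preset?
            if source.lower() != self.storage_encoding.lower():
                # Converting unit 14: re-encode into the storage preset.
                raw = raw.decode(source).encode(self.storage_encoding)
            self.storage.append(raw)  # storage unit 15

    def display(self) -> List[str]:
        # Display unit 16 reads stored data without further conversion.
        return [item.decode(self.display_encoding) for item in self.storage]


def guess_encoding(raw: bytes) -> str:
    """Stand-in parser: try UTF-8 first, otherwise assume GBK."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "gbk"


system = DisplaySystem(
    acquire=lambda: ["数据".encode("gbk"), "数据".encode("utf-8")],
    parse_encoding=guess_encoding,
)
system.ingest()
print(system.display())
```

Both inputs end up stored as UTF-8, so the display side needs no conversion logic of its own, which is the simplification embodiment one relies on.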
In embodiment two of the multi-encoding-format data display system according to the present invention, as shown in Fig. 2, the system comprises, in addition to the functional units described above, a comparing unit 13′, which compares the encoding format supported by the display unit 16 with the encoding format of the data stored in the storage unit 15, and a converting unit 14′, which converts the encoding format of the data into the encoding format supported by the display unit 16. The comparing unit 13′ is connected to the display unit 16 directly and is also connected to the storage unit 15 via the converting unit 14′.
The acquiring unit 11 is preferably a web crawler engine. The encoding format of the data is GBK, UTF-8 or another common encoding format. The preset encoding format of the storage unit 15 is preferably GB2312 or UTF-8. The encoding format supported by the display unit 16 is preferably GB2312 or UTF-8. In this embodiment, the encoding format supported by the display unit 16 may differ from the preset encoding format of the storage unit 15.
According to another aspect of the present invention, a display method for data in multiple encoding formats is provided. It is described in detail below with reference to the accompanying drawings. Fig. 3 is the overall flow chart of the multi-encoding-format data display method. As shown in Fig. 3, the method comprises the following steps:
First, in the acquiring step S1, the acquiring unit 11 obtains web pages from websites. The acquiring unit 11 is preferably a web crawler.
Next, in the encoding format parsing step S2, the parsing unit 12 parses the encoding format of the data and determines the encoding format type. For example, the encoding type of the web page is determined by examining the definition at any one of three positions: the content of the HTML (HyperText Markup Language) header in the web page, the meta charset of the web page, and the head file of the web page. The encoding format of the web page is, for example, UTF-8, GBK or another common encoding format. In addition, the parsing unit 12 can parse the content of the web page, for example the title, the body and the publication time of an article.
Next, the storing step S3 is described with reference to Fig. 4. In step S31, the comparing unit 13 compares the encoding format of the data with the preset encoding format of the storage unit 15. When the encoding format of the data is the preset encoding format, the judgment is "Yes", the flow proceeds directly to the storing step S33, and the data is stored in the storage unit 15. When the encoding format of the data is inconsistent with the preset encoding format of the storage unit 15, the judgment is "No" and the flow enters the encoding format converting step S32, in which the converting unit 14 converts the encoding format of the data into the preset encoding format of the storage unit 15; the storing step S33 is then carried out and the data is stored in the storage unit 15. The preset encoding format of the storage unit 15 is preferably GB2312 or UTF-8. The acquiring step S1, the encoding format parsing step S2 and the storing step S3 are repeated continuously, so that a large amount of data is stored in the storage unit 15.
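Steps S31 to S33 amount to a compare, convert and store sequence, which might be sketched as follows. The names `PRESET_STORAGE_ENCODING`, `same_encoding` and `store`, and the use of a plain list as the storage unit, are assumptions for illustration; `codecs.lookup` is used only to normalise encoding aliases.

```python
import codecs
from typing import List

# Assumed preset; the text prefers GB2312 or UTF-8 for the storage unit.
PRESET_STORAGE_ENCODING = "utf-8"


def same_encoding(a: str, b: str) -> bool:
    # codecs.lookup normalises aliases, so "UTF8" and "utf-8" compare equal.
    return codecs.lookup(a).name == codecs.lookup(b).name


def store(data: bytes, data_encoding: str, storage: List[bytes]) -> None:
    # S31: compare the data's encoding with the storage preset.
    if not same_encoding(data_encoding, PRESET_STORAGE_ENCODING):
        # S32: convert into the preset encoding before storing.
        data = data.decode(data_encoding).encode(PRESET_STORAGE_ENCODING)
    # S33: store the data, which is now in the preset encoding.
    storage.append(data)


storage: List[bytes] = []
store("编码".encode("gbk"), "GBK", storage)     # takes the "No" branch
store("编码".encode("utf-8"), "UTF8", storage)  # takes the "Yes" branch
print(storage[0] == storage[1])
```

Comparing normalised codec names rather than raw strings avoids an unnecessary conversion when the same encoding is merely spelled differently.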
Next, in the display step S4, the display unit 16 obtains the required data from the storage unit 15 and displays it. Here the encoding format supported by the display unit 16 is set to be consistent with the preset encoding format of the storage unit 15.
In embodiment two of the multi-encoding-format data display method, only the display step S4′ differs; all other steps are the same as in the embodiment described above. The flow chart of the display step S4′ is shown in Fig. 5. The required data is obtained from the storage unit 15 (S4′1). Then, in the encoding format comparing step S4′2, the comparing unit 13′ compares the encoding format of the data with the encoding format supported by the display unit 16. When the encoding format of the data is the supported encoding format, the judgment is "Yes", the flow proceeds directly to the display step S4′4, and the data is displayed on the display unit 16. When the encoding format of the data is inconsistent with the preset encoding format of the display unit 16, the judgment is "No" and the flow enters the encoding format converting step S4′3, in which the converting unit 14′ converts the encoding format of the data into the preset encoding format of the display unit 16; the display step S4′4 is then carried out and the data is displayed on the display unit 16. The preset encoding format of the display unit 16 is preferably GB2312 or UTF-8. In this embodiment, the encoding format supported by the display unit 16 may differ from the preset encoding format of the storage unit 15.
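The display-side comparison and conversion of embodiment two can be sketched in the same way. The function name `prepare_for_display` is hypothetical, and passing `errors="replace"` when encoding is an assumption of this sketch: the text does not say how characters missing from the display encoding (for example, characters outside GB2312) should be handled.

```python
import codecs


def prepare_for_display(stored: bytes, storage_encoding: str,
                        display_encoding: str) -> bytes:
    """Return bytes in the encoding the display unit supports."""
    # S4'2: compare the stored data's encoding with the display encoding.
    if codecs.lookup(storage_encoding).name == codecs.lookup(display_encoding).name:
        # "Yes" branch: pass the data straight to display (S4'4).
        return stored
    # S4'3: convert; unencodable characters are replaced (an assumption).
    text = stored.decode(storage_encoding)
    return text.encode(display_encoding, errors="replace")


stored = "显示".encode("utf-8")
for_display = prepare_for_display(stored, "utf-8", "gb2312")
print(for_display.decode("gb2312"))
```

Because the conversion happens on read, the storage unit can keep a single preset encoding while display units with different supported encodings share the same data.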
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A multi-encoding-format data display system, characterized in that
it comprises:
an acquiring unit, which obtains data from a data source;
a parsing unit, which determines the encoding format type of the data;
a first comparing unit, which judges whether the encoding format of the data is consistent with the preset encoding format of a storage unit;
a first converting unit, which converts the encoding format of the data into the preset encoding format of the storage unit;
the storage unit, which stores the data in the preset encoding format; and
a display unit, which displays the data obtained from the storage unit.
2. The multi-encoding-format data display system according to claim 1, characterized in that
it further comprises:
a second comparing unit, which compares the preset encoding format of the display unit with the encoding format of the data stored in the storage unit; and
a second converting unit, which converts the data into the preset encoding format of the display unit.
3. The multi-encoding-format data display system according to claim 1 or 2, characterized in that
the acquiring unit is a crawler engine.
4. The multi-encoding-format data display system according to claim 3, characterized in that
the encoding format is GBK or UTF-8.
5. The multi-encoding-format data display system according to claim 3, characterized in that
the preset encoding format is GB2312 or UTF-8.
6. A multi-encoding-format data display method, characterized in that
it comprises:
an acquiring step, in which an acquiring unit obtains data from a data source;
an encoding format parsing step, in which a parsing unit parses the encoding format of the data and determines the encoding format type;
a storing step, in which a first comparing unit compares the encoding format of the data with a preset encoding format, and, when the encoding format is the preset encoding format, the data is stored directly in a storage unit, whereas when the encoding format of the data is not the preset encoding format, a first converting unit converts the encoding format of the data into the preset encoding format of the storage unit before the data is stored;
and
a display step, in which a display unit obtains data from the storage unit and displays it.
7. The multi-encoding-format data display method according to claim 6, characterized in that
in the display step, a second comparing unit compares the encoding format of the data obtained from the storage unit with the preset encoding format of the display unit; when the encoding format of the data is the preset encoding format of the display unit, the data is displayed directly on the display unit; when the encoding format of the data is not the preset encoding format of the display unit, a second converting unit converts the encoding format of the data into the preset encoding format of the display unit, and the data is then displayed on the display unit.
8. The multi-encoding-format data display method according to claim 6 or 7, characterized in that
in the acquiring step, a web crawler engine crawls web pages from the network.
9. The multi-encoding-format data display method according to claim 8, characterized in that
the preset encoding format is GB2312 or UTF-8.
10. The multi-encoding-format data display method according to claim 8, characterized in that
the encoding format parsing step determines the encoding type of the web page by examining the definition at any one of three positions: the content of the HTML (HyperText Markup Language) header in the web page, the meta charset of the web page, and the head file of the web page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510848005.4A CN105468753A (en) | 2015-11-27 | 2015-11-27 | Multi-coding-format data display system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510848005.4A CN105468753A (en) | 2015-11-27 | 2015-11-27 | Multi-coding-format data display system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105468753A (en) | 2016-04-06 |
Family
ID=55606454
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510848005.4A Pending CN105468753A (en) | 2015-11-27 | 2015-11-27 | Multi-coding-format data display system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105468753A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107086942A (en) * | 2017-04-25 | 2017-08-22 | 北京锐安科技有限公司 | A kind of Web content service provider ICP reported datas inspection method and device |
CN108256110A (en) * | 2018-02-08 | 2018-07-06 | 平安科技(深圳)有限公司 | Gathering method, device, computer equipment and the storage medium of information |
CN109063091A (en) * | 2018-07-26 | 2018-12-21 | 成都大学 | Data migration method, data migration device and the storage medium of hybrid coding |
CN111368508A (en) * | 2020-03-03 | 2020-07-03 | 深信服科技股份有限公司 | Data processing method, device, equipment and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7405677B2 (en) * | 2006-08-08 | 2008-07-29 | International Business Machines Corporation | Apparatus, system, and method for incremental encoding conversion of XML data using Java |
CN101551792A (en) * | 2008-04-03 | 2009-10-07 | 鸿富锦精密工业(深圳)有限公司 | Messy code recovery system and method |
CN102262520A (en) * | 2010-05-31 | 2011-11-30 | 北京创艺和弦科贸有限公司 | Test display method based on built-in platform mobile phone and applied device thereof |
CN104361021A (en) * | 2014-10-21 | 2015-02-18 | 小米科技有限责任公司 | Webpage encoding identifying method and device |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107086942A (en) * | 2017-04-25 | 2017-08-22 | 北京锐安科技有限公司 | A kind of Web content service provider ICP reported datas inspection method and device |
CN107086942B (en) * | 2017-04-25 | 2019-12-03 | 北京锐安科技有限公司 | A kind of Web content service provider ICP reported data inspection method and device |
CN108256110A (en) * | 2018-02-08 | 2018-07-06 | 平安科技(深圳)有限公司 | Gathering method, device, computer equipment and the storage medium of information |
WO2019153588A1 (en) * | 2018-02-08 | 2019-08-15 | 平安科技(深圳)有限公司 | Intelligence information collection method and apparatus, computer device and storage medium |
CN109063091A (en) * | 2018-07-26 | 2018-12-21 | 成都大学 | Data migration method, data migration device and the storage medium of hybrid coding |
CN111368508A (en) * | 2020-03-03 | 2020-07-03 | 深信服科技股份有限公司 | Data processing method, device, equipment and medium |
CN111368508B (en) * | 2020-03-03 | 2024-04-09 | 深信服科技股份有限公司 | Data processing method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160406 |