CN101055593A

CN101055593A - Tibetan web page and its code identification method

Info

Publication number: CN101055593A
Application number: CN 200710111099
Authority: CN
Inventors: 吴健; 芮建武; 刘汇丹
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2007-06-15
Filing date: 2007-06-15
Publication date: 2007-10-17

Abstract

The invention relates to a method for identifying Tibetan web pages and their encodings, comprising the steps of: first, specifying the code of a feature string in a given Tibetan encoding, the feature string being the syllable point and/or a selected high-frequency syllable; scanning and searching the web page character stream using the feature string code as the keyword; counting, by means of a counter, the number of occurrences of characters matching the feature string code; and determining from the counter result whether the web page is a Tibetan web page and which Tibetan encoding it uses. The invention makes full use of the syllable structure of written Tibetan and the statistical characteristics of Tibetan word usage, and applies separate identification criteria to different encodings, so that Tibetan and non-Tibetan web pages can be distinguished correctly and efficiently, and the Tibetan encoding used by a web page can also be identified.

Description

Method for identifying Tibetan web pages and their encodings

Technical field

The invention belongs to the technical field of character encoding identification, and relates in particular to a method for identifying Tibetan web pages and their encodings.

Background

With the development of the Internet, more and more information is available online, which brings great convenience to people's lives. Finding the data we need in massive amounts of network data is, however, a very real problem, and the emergence of search engines has solved it. Over the last two years search engines have developed vigorously, and many Chinese search engines with their own characteristics have emerged, such as Baidu, Sogou, and Kuxun. By contrast, for Tibetan, a minority language, no related search products have yet appeared.

The functional modules of a search engine can generally be divided into a front end and a back end. The front end provides the interface for interacting with end users. The back end continuously crawls information from the network and, after a series of processing steps, stores the data in a database for use during searches. Back-end data processing includes normalization of web page encodings, that is, converting web pages in various encodings into a single encoding for storage. Before encodings can be converted, they must first be identified.

Compared with Chinese, Tibetan information processing lags behind. There are relatively few Tibetan web pages on the network, and Tibetan encoding is still in a state where many schemes compete side by side: the small total number of Tibetan web pages nevertheless involves dozens of different Tibetan encodings. In the back-end processing of a Tibetan search engine, Tibetan web pages must be picked out from the large number of web pages on the Internet in English, Chinese, and other languages, the Tibetan encoding each page uses must be identified, and encoding conversion is then carried out.

Over the past few decades, many computer and Tibetan-language specialists have done a great deal of work and have successfully developed a number of Tibetan word-processing packages. These packages all use custom encodings, which has produced a proliferation of Tibetan encodings. According to their coding structure, we divide these encodings into three classes: ASCII-based Tibetan encodings, GB2312-based Tibetan encodings, and Unicode-based Tibetan encodings.

ASCII-based Tibetan encodings encode Tibetan characters in a single byte. The available code space is 0x00-0xFF; after removing code points with special meanings (control characters and so on), 222 code points are actually available. Some encodings use only code points at or below 0x7F, which leaves only 94 available code points. Because so few code points are available, these encodings are generally implemented with several fonts, so that one code point represents several Tibetan characters.
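The two figures above can be checked with a short calculation. This is a minimal sketch that assumes the usable positions are exactly the ranges 0x21-0xFE and 0x21-0x7E listed for these encodings in Table 1; the variable names are illustrative only.

```python
# Single-byte positions usable by an ASCII-based Tibetan encoding,
# assuming the ranges listed in Table 1 (0x21-0xFE and 0x21-0x7E).
full_range = range(0x21, 0xFE + 1)   # encodings that also use the high half
low_range = range(0x21, 0x7E + 1)    # encodings restricted to values below 0x7F

print(len(full_range))  # 222
print(len(low_range))   # 94
```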

Encodings of this class are listed in Table 1:

Table 1. ASCII-based Tibetan encodings

Encoding name	Code point range	Syllable point code
LTibetan	0x21-0xFE	0x2D
TCRC	0x21-0xFE	0x2D
Old Sambhota	0x21-0xFE	0x2D
New Sambhota	0x21-0x7E	0x2D
TM	0x21-0xFE	0xCD
TMW	0x21-0x7E	0x2D
Tibword	0x21-0xFE	0x2D
TibKey	0x21-0xFE	0x2D
tsamkey	0x21-0x7E	0x2E
SUZTIB	0x21-0xFE	0x2D
UCHAN	0x21-0x7E	0x2D

GB2312-based Tibetan encodings encode Tibetan character stacks in two bytes. Either the most significant bit of the first byte is 1, or the most significant bits of both bytes are 1, so that the encoding can coexist with English text. Software developed in China mostly uses this class of encoding. Some encodings occupy the vacant code points in rows 10-15 or rows 88-94 of GB2312; some simply occupy a segment of the code points in rows 15 to 81 that GB2312 assigns to Chinese characters; some even occupy code points of the GBK Chinese character extension area. Because two bytes are used, the code space is larger and a single font is generally sufficient. These encodings are listed in Table 2:

Table 2. GB2312-based Tibetan encodings

Encoding name	First byte range	Trail byte range	Syllable point code
Founder DOS	0xC0-0xEE	0x21-0x7E	0xC032
Founder Windows	0xAA-0xAC, 0xB0-0xDE	0xA0-0xFE	0xAAAC
Huaguang DOS	0xB0-0xFB	0x21-0x7E	0xE162
Huaguang Windows	0xB0-0xFB	0xA1-0xFE	0xE1E2
Tongyuan	0x81-0xEE, 0xF5	0x21-0x3D, 0x40-0xFE	0xA6E6
Tibet University	0xAA-0xAF, 0xF8-0xFB	0xA1-0xFE	0xFABB
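The coexistence with English described before Table 2 can be sketched as follows. This is an illustrative splitter, not part of the described method: it assumes that any byte with its top bit set starts a two-byte unit and that every other byte is a single-byte character. The function name `split_units` and the sample bytes are hypothetical.

```python
from typing import List

def split_units(stream: bytes) -> List[bytes]:
    """Split a mixed stream into 1-byte ASCII units and 2-byte units."""
    units = []
    i = 0
    while i < len(stream):
        # Assumption: a byte with the most significant bit set is the
        # first byte of a double-byte code.
        if stream[i] & 0x80 and i + 1 < len(stream):
            units.append(stream[i:i + 2])
            i += 2
        else:
            units.append(stream[i:i + 1])
            i += 1
    return units

# Two ASCII letters, the code C0 32, another double-byte code, one ASCII letter.
sample = b"ab" + bytes.fromhex("C032 C141") + b"c"
print(split_units(sample))
```

Note that the trail byte 0x32 in this sample is below 0x80, which is why the splitter decides on the first byte alone.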

Unicode-based Tibetan encodings: besides the basic set of the international Unicode standard, some Tibetan software encodes Tibetan character stacks directly, using the Unicode private use area. Each character set listed in Table 3 can in turn be represented in different encoding forms such as UTF-16LE, UTF-16BE, and UTF-8, giving 9 encodings in all.

Table 3. Tibetan character sets based on the Unicode standard

Character set	Code point range	Syllable point code point
Unicode standard	U+0F00-U+0FCF	U+0F0B
Expansion set A	U+F300-U+F8FF	U+0F0B
satin drill Tibetan	U+E000-U+E3A6	U+E0DF
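To illustrate why one character set yields three encodings, the sketch below encodes the Unicode-standard syllable point U+0F0B in the three encoding forms named above. The codec names are those of Python's standard library; the variable names are illustrative.

```python
SYLLABLE_POINT = "\u0f0b"  # syllable point code point of the Unicode standard (Table 3)

# Byte sequence a scanner would have to search for in each encoding form.
encoded = {form: SYLLABLE_POINT.encode(form)
           for form in ("utf-8", "utf-16-le", "utf-16-be")}

for form, data in encoded.items():
    print(form, data.hex(" "))
# utf-8 e0 bc 8b
# utf-16-le 0b 0f
# utf-16-be 0f 0b
```

The same code point therefore has to be searched for as three different byte sequences, one per encoding form.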

A Tibetan information processing system must first identify whether each web page is Tibetan; if it is, the system must then identify which Tibetan encoding the page uses, and only then can subsequent processing be done.

No work on identifying the encoding of Tibetan web pages has been reported so far. For Chinese, encoding identification generally relies on the "encoding" and "charset" keywords of the HTML file. "charset" was the original keyword for indicating the character set a web page uses. As technology progressed, the expressive power of charset became insufficient, so the "encoding" keyword was later introduced to indicate the encoding a web page adopts. The values of both keywords are managed centrally by the corresponding international organization. For example, the head of a Chinese web page (the part between <head> and </head>) will typically contain an HTML declaration whose charset value is "gb2312".

This is because most Chinese web pages use the national standard encoding; "gb2312" here indicates that the page uses the character set specified by the national standard GB2312-80 of the People's Republic of China. Because most Tibetan encodings are custom encodings, the charset and encoding keywords have no values that correspond to Tibetan. Pages in the various encodings described above simply use values that belong to other scripts; for example, a Tibetan web page may contain information such as "charset=gb2312" or "charset=ascii". In that case this information cannot be used to judge whether a web page is encoded in Tibetan.
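The limitation of the charset keyword can be seen in a small sketch. The header below is a hypothetical example of a typical declaration, not the snippet from the original text, and `declared_charset` is an illustrative helper: it returns whatever label the page declares, which for a Tibetan page in a custom encoding would still be a label such as "gb2312".

```python
import re

# Hypothetical page header; a Tibetan page using a GB2312-based custom
# encoding could carry exactly the same declaration.
html = (b'<head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=gb2312"></head>')

def declared_charset(page: bytes):
    """Return the lower-cased charset label declared in the page, or None."""
    match = re.search(rb'charset\s*=\s*["\']?([\w-]+)', page, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None

print(declared_charset(html))  # gb2312
```

The label alone says nothing about whether the bytes are Chinese or Tibetan, which is why the method inspects the character stream itself.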

Summary of the invention

The object of the present invention is to provide a method that correctly distinguishes Tibetan web pages from non-Tibetan web pages and identifies which Tibetan encoding a Tibetan web page uses.

The method of the present invention for identifying Tibetan web pages and their encodings comprises the following steps:

1. specify the feature string code in a given Tibetan encoding, the feature string being the syllable point and/or a selected high-frequency syllable;

2. scan and search the web page character stream using this feature string code as the keyword;

3. count, by means of a counter, the number of occurrences of characters that match the feature string code;

4. according to the counter result, judge whether the web page is a Tibetan web page and which Tibetan encoding it uses.

The feature string code may be the syllable point code or the code of a selected high-frequency syllable in the given Tibetan encoding, or a combination of the syllable point code and the selected high-frequency syllable code. During scanning, the counter counts the occurrences of the feature string code in the web page character stream; when this count reaches a set threshold, the web page is judged to be a Tibetan web page and the Tibetan encoding it uses is the given encoding.

The feature string code may be the syllable point code or the code of a selected high-frequency syllable in the given Tibetan encoding. During scanning, the counter counts the occurrences of the syllable point code or the high-frequency syllable code in the web page character stream, and the proportion of the web page character stream taken up by the feature string is obtained from the counter value; when this proportion reaches a set threshold, the web page is judged to be a Tibetan web page and the Tibetan encoding it uses is the given encoding.
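A minimal sketch of the proportion-based variant follows. The text does not define how the proportion is computed, so this example assumes it is the number of bytes occupied by the feature string divided by the total number of bytes; the function name and the sample stream are hypothetical.

```python
def feature_ratio(stream: bytes, feature: bytes) -> float:
    """Share of the stream occupied by non-overlapping copies of feature."""
    if not stream or not feature:
        return 0.0
    # Assumption: proportion = bytes covered by the feature / total bytes.
    return stream.count(feature) * len(feature) / len(stream)

# Seven single-byte characters, three of which are the syllable point 2D.
stream = bytes.fromhex("2D 41 42 2D 43 44 2D")
ratio = feature_ratio(stream, bytes.fromhex("2D"))
print(round(ratio, 3))  # 0.429
```

A page would then be accepted when `ratio` reaches a threshold chosen for the encoding being tested.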

The feature string code may be the syllable point code in the given Tibetan encoding. During scanning, whenever the number of characters between two adjacent syllable point codes in the web page is between 1 and 7, the counter is incremented by one; when the counter value reaches the set threshold, the web page is judged to be a Tibetan web page and the Tibetan encoding it uses is the given encoding.

The feature string code may be the syllable point code in the given Tibetan encoding. During scanning, whenever three or more syllable point codes appear in succession, with between 1 and 7 characters between every two adjacent syllable point codes, the counter is incremented by one; when the counter value reaches the set threshold, the web page is judged to be a Tibetan web page and the Tibetan encoding it uses is the given encoding.
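The variant with three or more successive syllable points can be sketched as below. This is one possible reading, stated as an assumption: a "run" is a maximal sequence of syllable points in which every adjacent pair is separated by 1 to 7 characters, and each run with at least three points increments the counter once. The function name and parameters are illustrative, and the example works on a decoded string rather than raw bytes.

```python
def count_point_runs(text: str, point: str = "\u0f0b",
                     min_points: int = 3, max_gap: int = 7) -> int:
    """Count maximal runs of >= min_points syllable points spaced 1..max_gap apart."""
    runs = 0
    run_len = 0   # syllable points in the current run
    last = None   # index of the previous syllable point
    for i, ch in enumerate(text):
        if ch != point:
            continue
        if last is not None and 1 <= i - last - 1 <= max_gap:
            run_len += 1              # spacing fits: extend the current run
        else:
            if run_len >= min_points:
                runs += 1             # close the previous run
            run_len = 1               # start a new run at this point
        last = i
    if run_len >= min_points:
        runs += 1                     # close the final run
    return runs

# "." stands in for the syllable point to keep the example readable.
print(count_point_runs("ab.cd.ef.gh", point="."))  # 1
```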

A Tibetan syllable (also called a Tibetan "word"; in Tibetan, characters form syllables and syllables form words) consists of one or more basic characters, seven at most (see the accompanying drawing). Among these seven basic characters there is one base letter and one vowel sign; the other characters are attached above, below, before, after, and after again relative to the base letter. The simplest Tibetan syllable contains only a base letter and no other components. Tibetan is written horizontally from left to right, and syllables are separated by a dot, which is the syllable point. The syllable point is similar to the space in English, and its presence gives the character stream of a Tibetan web page a clear regularity: a syllable point appears every few characters.
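The regularity described above can be observed on a short Unicode-encoded phrase. This is an illustrative sketch: the phrase ("bkra shis bde legs", written with escapes) is chosen for the example, and "character" is assumed to mean one Unicode code point.

```python
TSHEG = "\u0f0b"  # Unicode syllable point

# "bkra shis bde legs", each syllable followed by a syllable point.
text = ("\u0f56\u0f40\u0fb2" + TSHEG +        # bkra
        "\u0f64\u0f72\u0f66" + TSHEG +        # shis
        "\u0f56\u0f51\u0f7a" + TSHEG +        # bde
        "\u0f63\u0f7a\u0f42\u0f66" + TSHEG)   # legs

syllables = [s for s in text.split(TSHEG) if s]
lengths = [len(s) for s in syllables]
print(lengths)  # [3, 3, 3, 4]
```

Every syllable here falls within the 1 to 7 character span that the method relies on.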

According to the available literature, three experts have so far carried out statistical studies of the linguistic features of Tibetan (see Table 4). A high-frequency syllable is defined by statistics taken with the syllable as the unit: syllables are sorted in descending order of frequency, and those ranked near the top, that is, those that occur with high frequency in Tibetan, are defined as high-frequency syllables. The cumulative frequencies of the top ten syllables in the three experts' statistics are 31.83%, 22.99%, and 18.97% respectively, which shows that Tibetan high-frequency syllables occur quite often.

The present invention can select a specific syllable from these high-frequency syllables and search the web page character stream with it as the search keyword. Which high-frequency syllable is selected can be decided according to actual needs.

Table 4 lists some of the high-frequency syllables that appear in these statistics; for the present invention, any syllable listed in the table may be selected as the high-frequency syllable.

Table 4. Statistics of Tibetan high-frequency syllables

In addition, regarding Tibetan sentence structure, statistics show that a Tibetan sentence contains 7 syllables on average. The present invention therefore also uses the number of characters between two adjacent syllable points as a search condition for identifying Tibetan web pages.

Based on these characteristics of Tibetan, with the Tibetan syllable point and high-frequency syllables as the feature character (string), the following specific criteria can be used to judge whether a web page is a Tibetan web page.

Criterion 1: if the feature string code described above appears in the web page content, the web page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.

Criterion 2: compute the proportion of the whole web page content taken up by the feature string; if it reaches the threshold, the web page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.

Criterion 3: count the number of occurrences of the feature string; if it reaches the threshold, the web page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.

Criterion 4: the feature is a pair of adjacent syllable points separated by 1 to 7 characters; if the number of occurrences of this feature reaches the threshold, the web page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.

Criterion 5: the feature is the successive appearance of several (three or more) syllable points, with 1 to 7 characters between every two adjacent ones; if the number of occurrences of this feature reaches the threshold, the web page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.

The method makes full use of the particular syllable structure of written Tibetan and the statistical characteristics of Tibetan word usage. By applying the above criteria to the different encodings as appropriate, it can distinguish Tibetan web pages from non-Tibetan web pages correctly and efficiently, and identify the Tibetan encoding a web page uses.

Description of the drawings

Schematic diagram of the Tibetan syllable structure

A Tibetan syllable consists of one or more basic characters, seven at most. Among these seven basic characters there is one base letter and one vowel sign; the other characters are attached above, below, before, after, and after again relative to the base letter. Within a syllable, every part except the base letter may be absent. The simplest Tibetan syllable contains only a base letter and no other components.

Embodiments

Embodiment 1: using criterion 4 to identify whether the Tibetan encoding of a web page is the Founder DOS encoding.

For the Founder DOS encoding, the code of the syllable point is "C0 32" (hexadecimal, here and below). The web page character stream is scanned; whenever the number of characters between two adjacent occurrences of "C0 32" is between 1 and 7, the counter is incremented by one. If the counter reaches the predetermined threshold (for example, 10) before the scan of the current web page ends, the current web page is considered a Tibetan web page that uses the Founder DOS encoding.
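Embodiment 1 can be sketched as follows. This is a simplified illustration under a stated assumption: every character between two syllable points is taken to be a two-byte code, so the gap in characters is the gap in bytes divided by two, and gaps of odd byte length are ignored. The function names, the sample page, and the handling of mixed single-byte text are not specified by the source.

```python
SYLLABLE_POINT = bytes.fromhex("C032")  # syllable point code from Embodiment 1
UNIT = 2  # assumption: each character between two points occupies two bytes

def count_spaced_points(stream: bytes, point: bytes = SYLLABLE_POINT,
                        unit: int = UNIT, max_gap: int = 7) -> int:
    """Count adjacent syllable point pairs separated by 1..max_gap characters."""
    counter = 0
    prev_end = None                      # byte offset just past the previous point
    pos = stream.find(point)
    while pos != -1:
        if prev_end is not None:
            gap_bytes = pos - prev_end
            if gap_bytes % unit == 0 and 1 <= gap_bytes // unit <= max_gap:
                counter += 1
        prev_end = pos + len(point)
        pos = stream.find(point, prev_end)
    return counter

def meets_criterion_4(stream: bytes, threshold: int = 10) -> bool:
    return count_spaced_points(stream) >= threshold

# Synthetic page: twelve two-character syllables, each followed by C0 32.
page = (bytes.fromhex("C141 C242") + SYLLABLE_POINT) * 12
print(count_spaced_points(page), meets_criterion_4(page))  # 11 True
```

In the ranges listed for this encoding, 0xC0 is never a trail byte and 0x32 is never a first byte, so a plain byte search for "C0 32" does not match across a character boundary inside text of this encoding.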

Embodiment 2: using criterion 4 to identify whether the encoding is the Founder Windows encoding.

The process is the same as in Embodiment 1, except that the syllable point code changes from "C0 32" to "AA AC".

Embodiment 3: using criterion 3 to identify whether the encoding is the TCRC encoding.

For the TCRC encoding, the syllable point code is "2D" (hexadecimal, here and below), and the TCRC code sequence of one high-frequency syllable is "7A F4 68". "2D 7A F4 68 2D" is therefore taken as the feature string, and the number of times it occurs in the current web page is counted. If the count is greater than the threshold (for example, 10), the web page is considered a Tibetan web page that uses the TCRC encoding.
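The counting in Embodiment 3 can be sketched as below. Because the feature string begins and ends with the syllable point, two consecutive copies of the syllable share one "2D" byte. The source does not say whether such overlapping matches are counted; this sketch counts them, which is an assumption, and contrasts the result with Python's non-overlapping `bytes.count`. The function name is illustrative.

```python
FEATURE = bytes.fromhex("2D 7A F4 68 2D")  # feature string from Embodiment 3

def count_overlapping(stream: bytes, feature: bytes = FEATURE) -> int:
    """Count occurrences of feature, allowing matches to share bytes."""
    n = 0
    pos = stream.find(feature)
    while pos != -1:
        n += 1
        pos = stream.find(feature, pos + 1)  # advance one byte, not len(feature)
    return n

# The same syllable twice in a row: the middle 2D belongs to both matches.
stream = bytes.fromhex("2D 7A F4 68 2D 7A F4 68 2D")
print(count_overlapping(stream))  # 2
print(stream.count(FEATURE))      # 1
```

Whichever counting rule is used, the threshold has to be chosen with that rule in mind.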

Embodiment 4: using criterion 3 to identify whether the encoding is the Tibetan Machine encoding.

For the Tibetan Machine encoding, the code of the syllable point is "CD". For the same high-frequency syllable as in Embodiment 3, the Tibetan Machine code sequence is "FD 37 DC". "CD FD 37 DC CD" is therefore taken as the feature string, and the number of times it occurs in the current web page is counted. If the count is greater than the threshold (for example, 10), the web page is considered a Tibetan web page that uses the Tibetan Machine encoding.

The above criteria can also be used in combination, which identifies Tibetan web pages and their encodings more accurately:

Embodiment 5: applying multiple criteria to identify Tibetan web pages and their encodings

1. Step 1: use criterion 4 to test in turn for each of the following encodings; if one matches, go to step 4, otherwise go to the next step: Founder DOS, Founder Windows, Huaguang DOS, Huaguang Windows, Tongyuan, the three encodings of expansion set A, Tibet University, and the three encodings of satin drill Tibetan;

2. Step 2: use criterion 3 to test in turn for each of the following encodings; if one matches, go to step 4, otherwise go to the next step: LTibetan, TCRC, Old Sambhota, New Sambhota, Tibetan Machine (TM), Tibetan Machine Web (TMW), TibKey, TibWord, tsamkey, SUZTIB, UCHAN;

3. Step 3: the web page is considered a non-Tibetan web page;

4. Step 4: the web page is considered a Tibetan web page, and the name of the encoding scheme is output.
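The four steps above can be sketched as a small pipeline. This is a minimal illustration with one candidate per step, using the byte values given in the earlier embodiments; the candidate lists, the two-byte character assumption for criterion 4, the counting of overlapping matches for criterion 3, and all function names are assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, Callable[[bytes], bool]]

def count_spaced_points(stream: bytes, point: bytes, unit: int, max_gap: int = 7) -> int:
    """Criterion 4 counter: adjacent points separated by 1..max_gap characters."""
    counter, prev_end = 0, None
    pos = stream.find(point)
    while pos != -1:
        if prev_end is not None:
            gap = pos - prev_end
            if gap % unit == 0 and 1 <= gap // unit <= max_gap:
                counter += 1
        prev_end = pos + len(point)
        pos = stream.find(point, prev_end)
    return counter

def count_feature(stream: bytes, feature: bytes) -> int:
    """Criterion 3 counter: occurrences of the feature string, overlaps included."""
    n, pos = 0, stream.find(feature)
    while pos != -1:
        n += 1
        pos = stream.find(feature, pos + 1)
    return n

def identify(stream: bytes, step1: List[Candidate], step2: List[Candidate]) -> Optional[str]:
    for name, detect in step1:   # step 1: criterion 4 candidates, tested in turn
        if detect(stream):
            return name          # step 4: Tibetan page, report the encoding
    for name, detect in step2:   # step 2: criterion 3 candidates, tested in turn
        if detect(stream):
            return name          # step 4
    return None                  # step 3: not a Tibetan page

# One illustrative candidate per step; a full table would list every encoding.
STEP1 = [("encoding with syllable point C0 32",
          lambda s: count_spaced_points(s, bytes.fromhex("C032"), 2) >= 10)]
STEP2 = [("TCRC",
          lambda s: count_feature(s, bytes.fromhex("2D7AF4682D")) > 10)]

tcrc_page = bytes.fromhex("2D") + bytes.fromhex("7AF4682D") * 11
print(identify(tcrc_page, STEP1, STEP2))               # TCRC
print(identify(b"plain English text", STEP1, STEP2))   # None
```

Testing the double-byte candidates first follows the order of the steps above; the ordering inside each list is a design choice that the source leaves open.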

Claims

1. A method for identifying Tibetan web pages and their encodings, comprising the steps of:

1) specifying the feature string code in a given Tibetan encoding, the feature string being the syllable point and/or a selected high-frequency syllable;

2) scanning and searching the web page character stream using this feature string code as the keyword;

3) counting, by means of a counter, the number of occurrences of characters that match the feature string code;

4) according to the counter result, judging whether the web page is a Tibetan web page and which Tibetan encoding it uses.

2. The method for identifying Tibetan web pages and their encodings as claimed in claim 1, characterized in that the high-frequency syllable is selected from

Or

Or

3. The method for identifying Tibetan web pages and their encodings as claimed in claim 1 or 2, characterized in that the feature string code is the syllable point code or the code of a selected high-frequency syllable in the given Tibetan encoding, or a combination of the syllable point code and the selected high-frequency syllable code; during scanning, the counter counts the occurrences of the feature string code in the web page character stream; when this count reaches a set threshold, the web page is judged to be a Tibetan web page and the Tibetan encoding it uses is the given encoding.

4. The method for identifying Tibetan web pages and their encodings as claimed in claim 1 or 2, characterized in that the feature string code is the syllable point code or the code of a selected high-frequency syllable in the given Tibetan encoding; during scanning, the counter counts the occurrences of the syllable point code or the high-frequency syllable code in the web page character stream, and the proportion of the web page character stream taken up by the feature string is obtained from the counter value; when this proportion reaches a set threshold, the web page is judged to be a Tibetan web page and the Tibetan encoding it uses is the given encoding.

5. The method for identifying Tibetan web pages and their encodings as claimed in claim 1 or 2, characterized in that the feature string code is the syllable point code in the given Tibetan encoding; during scanning, whenever the number of characters between two adjacent syllable point codes in the web page is between 1 and 7, the counter is incremented by one; when the counter value reaches the set threshold, the web page is judged to be a Tibetan web page and the Tibetan encoding it uses is the given encoding.

6. The method for identifying Tibetan web pages and their encodings as claimed in claim 1 or 2, characterized in that the feature string code is the syllable point code in the given Tibetan encoding; during scanning, whenever three or more syllable point codes appear in succession, with between 1 and 7 characters between every two adjacent syllable point codes, the counter is incremented by one; when the counter value reaches the set threshold, the web page is judged to be a Tibetan web page and the Tibetan encoding it uses is the given encoding.