CN101055593A - Tibetan web page and its code identification method - Google Patents

Tibetan web page and its code identification method

Info

Publication number
CN101055593A
Authority
CN
China
Prior art keywords
tibetan
coding
webpage
syllable
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200710111099
Other languages
Chinese (zh)
Inventor
吴健
芮建武
刘汇丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN 200710111099
Publication of CN101055593A

Abstract

The invention relates to a method for identifying Tibetan web pages and their encodings. The steps are: first, specify the encoding of a feature string in a given Tibetan encoding, where the feature string is the syllable point and/or a selected high-frequency syllable; scan the web page character stream, searching with the encoded feature string as the keyword; use a counter to count how often characters matching the encoded feature string appear; and decide from the counter result whether the page is a Tibetan web page and which Tibetan encoding it uses. The invention makes full use of the syllable structure of written Tibetan and the statistical properties of Tibetan word usage, and applies identification criteria suited to each encoding, so that Tibetan and non-Tibetan web pages can be distinguished correctly and efficiently, and the Tibetan encoding used by a page can also be identified.

Description

Method for identifying Tibetan web pages and their encodings
Technical field
The invention belongs to the field of character encoding recognition, and relates in particular to a method for identifying Tibetan web pages and their encodings.
Background
As the Internet develops, more and more information is available online, which brings great convenience to people's lives. Finding the data we need among massive amounts of network data is a very real problem, and the emergence of search engines has solved it. In the last two years search engines have developed rapidly, and many Chinese search engines with their own distinctive features have emerged, such as Baidu, Sogou and Kuxun. By contrast, no search product has yet appeared for Tibetan, a minority language.
The functional modules of a search engine can generally be divided into a front end and a back end. The front end provides the interface for interacting with the end user. The back end continuously fetches information from the network and, after a series of processing steps, stores the data in a database for use at search time. Back-end data processing includes normalizing web page encodings, that is, converting pages in various encodings into a single encoding for storage. Before an encoding can be converted, it must first be identified.
Compared with Chinese, Tibetan information processing has developed relatively slowly, and there are comparatively few Tibetan web pages on the network. Tibetan encoding is still fragmented, with every scheme going its own way: although the total number of Tibetan web pages is small, they use dozens of different Tibetan encodings. In its back-end processing, a Tibetan search engine must pick out the Tibetan pages from the large number of pages on the Internet in English, Chinese and other languages, identify the Tibetan encoding each one uses, and then convert the encoding.
Over the past few decades, many computer scientists and Tibetan language specialists have done a great deal of work and have successfully developed a number of Tibetan word-processing systems. These Tibetan software packages all use custom encodings, which has produced a situation in which a multitude of Tibetan encodings coexist. According to their structure, we divide these encodings into three classes: Tibetan encodings based on ASCII, Tibetan encodings based on GB2312, and Tibetan encodings based on Unicode.
Tibetan encodings based on ASCII encode Tibetan characters in a single byte. The available code space is 0x00-0xFF; after removing code points with special meanings (control characters and the like), 222 code points are actually available. Some encodings use only code points below 0x7F, in which case only 94 code points are actually available. Because so few code points are available, these encodings are generally implemented with several fonts, so that one code point represents several Tibetan characters.
This class of encodings is shown in Table 1:
Table 1: Tibetan encodings based on ASCII
Encoding name    Code-point range    Syllable-point code
LTibetan         0x21-0xFE           0x2D
TCRC             0x21-0xFE           0x2D
Old Sambhota     0x21-0xFE           0x2D
New Sambhota     0x21-0x7E           0x2D
TM               0x21-0xFE           0xCD
TMW              0x21-0x7E           0x2D
Tibword          0x21-0xFE           0x2D
TibKey           0x21-0xFE           0x2D
tsamkey          0x21-0x7E           0x2E
SUZTIB           0x21-0xFE           0x2D
UCHAN            0x21-0x7E           0x2D
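The code-point counts quoted before Table 1 (222 and 94) follow directly from the ranges in the table. The sketch below transcribes Table 1 into a dictionary and checks that arithmetic; the names `ASCII_BASED`, `usable_code_points`, `sizes` and `unusual` are illustrative and do not come from the patent.

```python
# Table 1 as data: encoding name -> (first code point, last code point, syllable-point byte).
ASCII_BASED = {
    "LTibetan":     (0x21, 0xFE, 0x2D),
    "TCRC":         (0x21, 0xFE, 0x2D),
    "Old Sambhota": (0x21, 0xFE, 0x2D),
    "New Sambhota": (0x21, 0x7E, 0x2D),
    "TM":           (0x21, 0xFE, 0xCD),
    "TMW":          (0x21, 0x7E, 0x2D),
    "Tibword":      (0x21, 0xFE, 0x2D),
    "TibKey":       (0x21, 0xFE, 0x2D),
    "tsamkey":      (0x21, 0x7E, 0x2E),
    "SUZTIB":       (0x21, 0xFE, 0x2D),
    "UCHAN":        (0x21, 0x7E, 0x2D),
}


def usable_code_points(first: int, last: int) -> int:
    """Size of the inclusive code-point range [first, last]."""
    return last - first + 1


# Range size per encoding: 0x21-0xFE and 0x21-0x7E are the only two ranges in the table.
sizes = {name: usable_code_points(lo, hi) for name, (lo, hi, _) in ASCII_BASED.items()}

# Encodings whose syllable point differs from the common value 0x2D.
unusual = {name: sp for name, (_, _, sp) in ASCII_BASED.items() if sp != 0x2D}
```

Only TM and tsamkey deviate from 0x2D, so a scan for the syllable point alone cannot tell most of these encodings apart, which is presumably why the later embodiments combine the syllable point with a high-frequency syllable for this class.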
Tibetan encodings based on GB2312 encode Tibetan stacked characters (the "ding" units) in two bytes. Either the most significant bit of the first byte is 1, or the most significant bits of both bytes are 1, so Tibetan text can coexist with English. Software developed in China mostly uses this class of encoding. Some schemes occupy the vacant code points in rows 10-15 or rows 88-94 of GB2312; some simply occupy a range of code points within rows 15 to 81 of the GB2312 Chinese character area; some even occupy code points in the GBK Chinese character extension area. Because two bytes are used, the code space is larger, and such an encoding can generally be implemented with a single font, as shown in Table 2:
Table 2: Tibetan encodings based on GB2312
Encoding name       First-byte range          Trail-byte range          Syllable-point code
Founder DOS         0xC0-0xEE                 0x21-0x7E                 0xC032
Founder Windows     0xAA-0xAC, 0xB0-0xDE      0xA0-0xFE                 0xAAAC
Huaguang DOS        0xB0-0xFB                 0x21-0x7E                 0xE162
Huaguang Windows    0xB0-0xFB                 0xA1-0xFE                 0xE1E2
Tongyuan            0x81-0xEE, 0xF5           0x21-0x3D, 0x40-0xFE      0xA6E6
Tibet University    0xAA-0xAF, 0xF8-0xFB      0xA1-0xFE                 0xFABB
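As a consistency check on Table 2, each syllable-point code should fall inside the first-byte and trail-byte ranges of its own encoding. The following sketch transcribes the table and tests that; the dictionary layout and the helper names `in_ranges` and `is_valid_pair` are assumptions made for illustration.

```python
# Table 2 as data: name -> (first-byte ranges, trail-byte ranges, syllable-point code).
# "Founder" = 方正, "Huaguang" = 华光, "Tongyuan" = 同元 (vendor names behind the table rows).
GB2312_BASED = {
    "Founder DOS":      ([(0xC0, 0xEE)], [(0x21, 0x7E)], bytes([0xC0, 0x32])),
    "Founder Windows":  ([(0xAA, 0xAC), (0xB0, 0xDE)], [(0xA0, 0xFE)], bytes([0xAA, 0xAC])),
    "Huaguang DOS":     ([(0xB0, 0xFB)], [(0x21, 0x7E)], bytes([0xE1, 0x62])),
    "Huaguang Windows": ([(0xB0, 0xFB)], [(0xA1, 0xFE)], bytes([0xE1, 0xE2])),
    "Tongyuan":         ([(0x81, 0xEE), (0xF5, 0xF5)], [(0x21, 0x3D), (0x40, 0xFE)],
                         bytes([0xA6, 0xE6])),
    "Tibet University": ([(0xAA, 0xAF), (0xF8, 0xFB)], [(0xA1, 0xFE)], bytes([0xFA, 0xBB])),
}


def in_ranges(value: int, ranges) -> bool:
    """True if value lies in any of the inclusive (low, high) ranges."""
    return any(low <= value <= high for low, high in ranges)


def is_valid_pair(name: str, pair: bytes) -> bool:
    """True if the two-byte sequence is a legal character of the named encoding."""
    first_ranges, trail_ranges, _ = GB2312_BASED[name]
    return (len(pair) == 2
            and in_ranges(pair[0], first_ranges)
            and in_ranges(pair[1], trail_ranges))


# Every syllable-point code should be a legal character in its own encoding.
all_consistent = all(is_valid_pair(name, sp) for name, (_, _, sp) in GB2312_BASED.items())
```

A check like `is_valid_pair` could also serve as a cheap pre-filter before the counting criteria are applied, although the patent itself does not describe one.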
Tibetan encodings based on Unicode: besides the basic set of the international Unicode standard, some Tibetan software encodes Tibetan stacked characters directly, using the Unicode private use area. In actual use, each character set listed in Table 3 can further be represented in the UTF-16LE, UTF-16BE or UTF-8 encoding form, so there are 9 encodings in total.
Table 3: Tibetan character sets based on the Unicode standard
Character set          Code-point range    Syllable-point code point
Unicode standard       U+0F00-U+0FCF       U+0F0B
Expansion set A        U+F300-U+F8FF       U+0F0B
Satin drill Tibetan    U+E000-U+E3A6       U+E0DF
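The nine Unicode-based encodings arise from combining the three character sets of Table 3 with three encoding forms. The sketch below derives the byte sequence a scanner would search for in each case, using Python's built-in codecs; the dictionary keys repeat the character-set names as they appear in the source, and the function name is illustrative.

```python
# Table 3 as data: character set -> syllable-point code point.
SYLLABLE_POINT = {
    "Unicode standard": 0x0F0B,
    "Expansion set A": 0x0F0B,
    "Satin drill Tibetan": 0xE0DF,   # private use area code point
}

# Python codec names for the three encoding forms named in the text.
ENCODING_FORMS = ("utf-8", "utf-16-le", "utf-16-be")


def syllable_point_patterns() -> dict:
    """Map (character set, encoding form) to the syllable point's byte sequence."""
    patterns = {}
    for charset, code_point in SYLLABLE_POINT.items():
        for form in ENCODING_FORMS:
            patterns[(charset, form)] = chr(code_point).encode(form)
    return patterns


patterns = syllable_point_patterns()
for (charset, form), raw in patterns.items():
    print(f"{charset:20} {form:10} {raw.hex(' ')}")
```

Because the Unicode standard and Expansion set A share U+0F0B as the syllable point, the nine combinations yield only six distinct byte patterns, so the syllable point by itself cannot separate those two character sets.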
A Tibetan information processing system must first determine whether each web page is Tibetan and, if it is, which Tibetan encoding it uses; only then can subsequent processing be carried out.
No work on identifying the encoding of Tibetan web pages has been reported so far. Encoding identification for Chinese generally relies on the "encoding" and "charset" keywords of the HTML file. "charset" denotes the character set used by the page and was the first keyword introduced for this purpose. As technology advanced, the expressive power of charset proved insufficient, so the "encoding" keyword was later introduced to denote the encoding that a page uses. The values of both keywords are centrally managed by the relevant international organization. For example, the head of a Chinese web page (the part between <head> and </head>) will probably contain the following HTML code:
<meta http-equiv="content-type" content="text/html; charset=gb2312">
This is because most Chinese web pages use the national standard encoding; "gb2312" here indicates that the page uses the character set specified by the national standard GB2312-80 of the People's Republic of China. Because most Tibetan encodings are custom encodings, the charset and encoding keywords have no values that correspond to Tibetan, and pages in the encodings described above simply reuse values that belong to other scripts. A Tibetan web page may, for example, carry information such as "charset=gb2312" or "charset=ascii", in which case this information cannot be used to judge whether the page is in a Tibetan encoding.
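To make the limitation concrete, the sketch below extracts the declared charset from a page's meta tag with a regular expression. The pattern and the function name `declared_charset` are illustrative assumptions, not a full HTML parser; the point is only that the returned label (for example "gb2312") says nothing about whether the bytes are actually Tibetan.

```python
import re

# Loose pattern for a charset declaration inside a <meta ...> tag (an assumption for
# illustration; real pages may need a proper HTML parser).
META_CHARSET = re.compile(rb"""<meta[^>]*charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)


def declared_charset(html: bytes):
    """Return the lower-cased charset label declared in the page, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).decode("ascii").lower() if match else None


page = (b'<head><meta http-equiv="content-type" '
        b'content="text/html; charset=gb2312"></head>')
label = declared_charset(page)   # the label alone cannot reveal a custom Tibetan encoding
```

A page in a GB2312-based Tibetan encoding would yield the same label as an ordinary Chinese page, which is why the method described next inspects the byte stream itself.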
Summary of the invention
The object of the invention is to provide a method that correctly distinguishes Tibetan web pages from non-Tibetan web pages and identifies which Tibetan encoding a Tibetan web page uses.
The method of the invention for identifying Tibetan web pages and their encodings comprises the following steps:
1. Specify the encoding of a feature string in a given Tibetan encoding; the feature string is the syllable point and/or a selected high-frequency syllable.
2. Scan and search the web page character stream using the encoded feature string as the keyword.
3. Use a counter to count the number of times characters matching the encoded feature string appear.
4. From the counter result, judge whether the page is a Tibetan web page and which Tibetan encoding it uses.
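The four steps above can be sketched as a single scan over the raw bytes of a page. This is a minimal sketch under stated assumptions: the page is treated as a byte string, matches are counted at every position, and the function names and the threshold value are hypothetical.

```python
def count_feature(stream: bytes, feature: bytes) -> int:
    """Steps 2 and 3: scan the stream and count every position where the feature occurs."""
    if not feature:
        return 0
    count = 0
    position = stream.find(feature)
    while position != -1:
        count += 1
        # Resume one byte later so that overlapping occurrences are also counted.
        position = stream.find(feature, position + 1)
    return count


def identify(stream: bytes, encoding_name: str, feature: bytes, threshold: int):
    """Step 4: return the encoding name if the count reaches the threshold, else None."""
    return encoding_name if count_feature(stream, feature) >= threshold else None


# Step 1: the feature string here is the TM syllable point 0xCD from Table 1.
result = identify(b"ka\xcdkha\xcdga\xcd", "TM", b"\xcd", threshold=3)
```

Returning the encoding name rather than a bare boolean mirrors step 4, which decides both whether the page is Tibetan and which encoding it uses.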
The feature string encoding may be the syllable-point code or a selected high-frequency syllable code of the Tibetan encoding, or the concatenation of the syllable-point code and a selected high-frequency syllable code. During the scan, the counter counts the number of times the encoded feature string appears in the web page character stream. When this number reaches the preset threshold, the page is judged to be a Tibetan web page, and the Tibetan encoding it uses is the given Tibetan encoding.
Alternatively, the feature string encoding is the syllable-point code or a selected high-frequency syllable code of the Tibetan encoding. During the scan, the counter counts the number of times the syllable-point code or the high-frequency syllable code appears in the web page character stream, and the proportion of the page's character stream taken up by the feature string is computed from the counter value. When this proportion reaches the preset threshold, the page is judged to be a Tibetan web page, and the Tibetan encoding it uses is the given Tibetan encoding.
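The proportion-based variant can be sketched as follows. The text does not say exactly how the proportion is measured, so this sketch assumes it is the number of bytes covered by feature-string matches divided by the total number of bytes; the function names and the default threshold of 0.1 are also assumptions.

```python
def feature_ratio(stream: bytes, feature: bytes) -> float:
    """Fraction of the stream's bytes covered by non-overlapping feature matches."""
    if not stream or not feature:
        return 0.0
    hits = stream.count(feature)          # bytes.count returns non-overlapping matches
    return hits * len(feature) / len(stream)


def is_tibetan_by_ratio(stream: bytes, feature: bytes, threshold: float = 0.1) -> bool:
    """Judge the page Tibetan (in the feature's encoding) when the ratio reaches the threshold."""
    return feature_ratio(stream, feature) >= threshold


ratio = feature_ratio(b"ab-cd-ef-", b"-")   # 3 matching bytes out of 9
```

Compared with an absolute count, a ratio is less sensitive to page length, at the cost of being diluted by markup and other non-text bytes in the stream.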
Alternatively, the feature string encoding is the syllable-point code of the Tibetan encoding. During the scan, whenever the number of characters between two adjacent syllable-point codes in the page is between 1 and 7, the counter is incremented by one. When the counter value reaches the preset threshold, the page is judged to be a Tibetan web page, and the Tibetan encoding it uses is the given Tibetan encoding.
Alternatively, the feature string encoding is the syllable-point code of the Tibetan encoding. During the scan, whenever three or more adjacent syllable-point codes appear in succession and the number of characters between every two adjacent syllable-point codes is between 1 and 7, the counter is incremented by one. When the counter value reaches the preset threshold, the page is judged to be a Tibetan web page, and the Tibetan encoding it uses is the given Tibetan encoding.
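The last variant counts runs of closely spaced syllable points. The sketch below is one possible reading, under two assumptions that the text does not settle: the encoding is single-byte (so one byte is one character), and each maximal run of at least three syllable points increments the counter once. The function name and parameters are illustrative.

```python
def count_runs(stream: bytes, point: int, min_points: int = 3, max_gap: int = 7) -> int:
    """Count maximal runs of >= min_points syllable points with 1..max_gap bytes between neighbours."""
    positions = [index for index, byte in enumerate(stream) if byte == point]
    runs = 0
    run_length = 1 if positions else 0
    for previous, current in zip(positions, positions[1:]):
        gap = current - previous - 1          # characters strictly between the two points
        if 1 <= gap <= max_gap:
            run_length += 1
        else:
            if run_length >= min_points:      # the run just ended; count it if long enough
                runs += 1
            run_length = 1
    if run_length >= min_points:              # account for a run that reaches the end
        runs += 1
    return runs


# Three closely spaced points, a long gap, then only two points: exactly one qualifying run.
sample = b"ab-cde-f-" + b"x" * 20 + b"-gh-"
run_count = count_runs(sample, 0x2D)
```

Requiring a run rather than a single pair makes accidental matches in non-Tibetan text less likely, which matters for a byte as common as 0x2D (the ASCII hyphen).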
A Tibetan syllable (also called a Tibetan word; in Tibetan, characters form syllables and syllables form words) is made up of one or more basic characters, at most seven (see the accompanying drawing). Among these seven basic characters there are one root letter and one vowel sign; the other characters are added above, below, before, after and further after the root letter. The simplest Tibetan syllable contains only a root letter and no other component. Tibetan is written horizontally from left to right, and adjacent syllables are separated by a dot, which is the syllable point. The syllable point is similar to the space in English, and its presence gives the character stream of a Tibetan web page a clear regularity: a syllable point appears every few characters.
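The regularity described above is easy to observe in Unicode text, where the syllable point is U+0F0B (named TIBETAN MARK INTERSYLLABIC TSHEG in the Unicode standard). The sketch below splits a short sample on that character and measures syllable lengths; counting Unicode code points as "characters" is an assumption, and the sample string and function name are illustrative.

```python
TSHEG = "\u0f0b"   # Unicode syllable point, as listed in Table 3


def syllable_lengths(text: str) -> list:
    """Length in code points of each non-empty syllable delimited by the syllable point."""
    return [len(syllable) for syllable in text.split(TSHEG) if syllable]


# Two three-code-point syllables, each followed by a syllable point.
sample = "\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51\u0f0b"
lengths = syllable_lengths(sample)
within_limit = all(1 <= n <= 7 for n in lengths)
```

In the custom encodings of Tables 1 and 2 the same idea applies to bytes or byte pairs rather than code points, which is what the gap-based criteria exploit.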
According to the available literature, three experts have so far carried out statistical studies of the linguistic features of Tibetan (see Table 4). High-frequency syllables are defined from statistics taken at the syllable level: syllables are sorted in descending order of frequency, and those ranked near the top are the high-frequency syllables. The cumulative frequency of the ten most frequent syllables in the three experts' statistics is 31.83%, 22.99% and 18.97% respectively, which shows that Tibetan high-frequency syllables occur quite often.
The invention may select one specific syllable from these high-frequency syllables and search the web page character stream with that feature string as the search keyword. Which high-frequency syllable is chosen can be decided according to actual needs.
Table 4 lists some of the high-frequency syllables that appear in the statistics; for the invention, any syllable listed in the table can be selected as the high-frequency syllable.
Table 4: Statistics of Tibetan high-frequency syllables
[Table 4 appears as an image in the original publication: Figure A20071011109900071]
In addition, regarding Tibetan sentence structure, statistics show that a Tibetan sentence contains 7 syllables on average. The invention therefore also uses the spacing between two adjacent syllable points as a search condition for identifying Tibetan web pages.
Based on these characteristics of Tibetan, and taking the Tibetan syllable point and high-frequency syllables as the feature character (string), the following concrete criteria can be used to judge whether a web page is a Tibetan web page.
Criterion 1: if the encoded feature string described above appears in the page content, the page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.
Criterion 2: compute the proportion of the whole page content taken up by the feature string; if it reaches the threshold, the page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.
Criterion 3: count the number of times the feature string appears; if it reaches the threshold, the page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.
Criterion 4: take as the feature a pair of adjacent syllable points separated by 1 to 7 characters; if the number of times this feature appears reaches the threshold, the page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.
Criterion 5: take as the feature the consecutive appearance of several (3 or more) syllable points, with 1 to 7 characters between every two adjacent ones; if the number of times this feature appears reaches the threshold, the page is judged to be a Tibetan web page, and its Tibetan text is judged to use the same encoding as the feature string.
The method makes full use of the distinctive syllable structure of written Tibetan and the statistical properties of Tibetan word usage. By applying the above criteria as appropriate to each encoding, it can correctly and effectively distinguish Tibetan web pages from non-Tibetan web pages and identify the Tibetan encoding a page uses.
Description of drawings
Schematic diagram of Tibetan syllable structure
A Tibetan syllable is made up of one or more basic characters, at most seven. Among these seven basic characters there are one root letter and one vowel sign; the other characters are added above, below, before, after and further after the root letter. Within a syllable, every part except the root letter may be absent. The simplest Tibetan syllable contains only a root letter and no other component.
Embodiments
Embodiment 1: use criterion 4 to identify whether a web page's Tibetan encoding is the Founder DOS encoding.
For the Founder DOS encoding, the syllable-point code is "C0 32" (hexadecimal, here and below). Scan the web page character stream; whenever the number of characters between two adjacent occurrences of "C0 32" is between 1 and 7, increment the counter by one. If the counter reaches the predetermined threshold (for example 10) before the scan of the current page ends, the page is considered a Tibetan web page that uses the Founder DOS encoding.
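Embodiment 1 can be sketched for a double-byte stream as follows. The sketch assumes that any byte with its high bit set starts a two-byte character and that every other byte is a one-byte character, which matches the GB2312-style layout described earlier but is not spelled out in the embodiment; the names `iter_characters` and `matches_criterion_4` are hypothetical.

```python
def iter_characters(stream: bytes):
    """Split a mixed stream into characters: two bytes if the first has its high bit set."""
    index = 0
    while index < len(stream):
        width = 2 if stream[index] >= 0x80 and index + 1 < len(stream) else 1
        yield stream[index:index + width]
        index += width


def matches_criterion_4(stream: bytes, point: bytes = b"\xC0\x32",
                        threshold: int = 10, max_gap: int = 7) -> bool:
    """True once `threshold` adjacent syllable-point pairs with 1..max_gap characters between them are seen."""
    counter = 0
    last_index = None
    for index, character in enumerate(iter_characters(stream)):
        if character != point:
            continue
        if last_index is not None and 1 <= index - last_index - 1 <= max_gap:
            counter += 1
            if counter >= threshold:      # stop early, before the scan ends
                return True
        last_index = index
    return False


# A made-up syllable of two Founder DOS characters followed by the syllable point.
syllable = b"\xC1\x41\xC2\x42" + b"\xC0\x32"
is_founder_dos = matches_criterion_4(syllable * 11)
```

Walking the stream character by character, rather than searching for the two bytes directly, avoids counting a "C0 32" that straddles two neighbouring characters.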
Embodiment 2: use criterion 4 to identify whether the encoding is Founder Windows.
The process is the same as in Embodiment 1, except that the syllable-point code changes from "C0 32" to "AA AC".
Embodiment 3: use criterion 3 to identify whether the encoding is TCRC.
For the TCRC encoding, the syllable-point code is "2D" (hexadecimal, here and below), and the TCRC code sequence of one high-frequency syllable is "7A F4 68", so "2D 7A F4 68 2D" is taken as the feature string. Count the number of times it appears in the current page; if the count exceeds the threshold (for example 10), the page is considered a Tibetan web page that uses the TCRC encoding.
Embodiment 4: use criterion 3 to identify whether the encoding is Tibetan Machine.
For the Tibetan Machine encoding, the syllable-point code is "CD". For the same high-frequency syllable as in Embodiment 3, the Tibetan Machine code sequence is "FD 37 DC", so "CD FD 37 DC CD" is taken as the feature string. Count the number of times it appears in the current page; if the count exceeds the threshold (for example 10), the page is considered a Tibetan web page that uses the Tibetan Machine encoding.
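Embodiments 3 and 4 build the feature string by wrapping a high-frequency syllable in syllable points. One detail worth showing is that two consecutive occurrences share a syllable point, so they overlap by one byte. The sketch below counts overlapping matches with a regular-expression lookahead; whether the patent intends overlapping matches to count is not stated, so that choice, like the function names, is an assumption.

```python
import re


def feature_string(syllable_point: bytes, syllable: bytes) -> bytes:
    """Wrap the encoded syllable in syllable points, as in Embodiments 3 and 4."""
    return syllable_point + syllable + syllable_point


def count_overlapping(stream: bytes, feature: bytes) -> int:
    """Count matches at every start position, including overlapping ones."""
    return len(re.findall(b"(?=" + re.escape(feature) + b")", stream))


def uses_encoding(stream: bytes, feature: bytes, threshold: int = 10) -> bool:
    """Criterion 3 as worded in the embodiments: the count must exceed the threshold."""
    return count_overlapping(stream, feature) > threshold


TCRC_FEATURE = feature_string(b"\x2D", b"\x7A\xF4\x68")   # "2D 7A F4 68 2D"
TM_FEATURE = feature_string(b"\xCD", b"\xFD\x37\xDC")     # "CD FD 37 DC CD"

# The same syllable repeated 11 times, with a single syllable point between repeats.
page = b"\x2D" + b"\x7A\xF4\x68\x2D" * 11
```

On this sample, `count_overlapping(page, TCRC_FEATURE)` finds all 11 repeats, whereas `page.count(TCRC_FEATURE)` would skip every second one because `bytes.count` only reports non-overlapping matches.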
The criteria above can also be used in combination to identify Tibetan web pages and their encodings more accurately:
Embodiment 5: apply multiple criteria to identify Tibetan web pages and their encodings
1. Step 1: use criterion 4 to test in turn whether the encoding is one of the following; if so, go to step 4, otherwise continue to the next step: Founder DOS, Founder Windows, Huaguang DOS, Huaguang Windows, Tongyuan encoding, the three encodings of Expansion set A, Tibet University encoding, the three encodings of satin drill Tibetan.
2. Step 2: use criterion 3 to test in turn whether the encoding is one of the following; if so, go to step 4, otherwise continue to the next step: LTibetan, TCRC, Old Sambhota, New Sambhota, Tibetan Machine (TM), Tibetan Machine Web (TMW), TibKey, TibWord, tsamkey, SUZTIB, UCHAN.
3. Step 3: the page is considered a non-Tibetan web page.
4. Step 4: the page is considered a Tibetan web page; output the name of the encoding scheme.
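The control flow of the four steps above can be sketched as an ordered cascade of detectors. To keep the example short, the detectors here are placeholders that use a plain occurrence count in place of the real criteria 3 and 4, and only a few encodings are registered; `identify_encoding`, `at_least`, `STEP1` and `STEP2` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

Detector = Callable[[bytes], bool]


def identify_encoding(stream: bytes,
                      step1: List[Tuple[str, Detector]],
                      step2: List[Tuple[str, Detector]]) -> Optional[str]:
    """Try step-1 detectors, then step-2 detectors; return the first matching encoding name."""
    for name, detect in step1:        # step 1: encodings tested with criterion 4
        if detect(stream):
            return name               # step 4: Tibetan page, report the scheme
    for name, detect in step2:        # step 2: encodings tested with criterion 3
        if detect(stream):
            return name               # step 4
    return None                       # step 3: not a Tibetan page


def at_least(pattern: bytes, threshold: int) -> Detector:
    """Placeholder detector: a plain count standing in for the real criterion."""
    return lambda stream: stream.count(pattern) >= threshold


STEP1 = [("Founder DOS", at_least(b"\xC0\x32", 10)),
         ("Founder Windows", at_least(b"\xAA\xAC", 10))]
STEP2 = [("TCRC", at_least(b"\x2D\x7A\xF4\x68\x2D", 10)),
         ("Tibetan Machine", at_least(b"\xCD\xFD\x37\xDC\xCD", 10))]

detected = identify_encoding(b"\xC1\x41\xC0\x32" * 10, STEP1, STEP2)
```

Keeping the detectors in ordered lists makes the test order explicit, which matters because the first encoding that passes is the one reported.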

Claims (6)

1. A method for identifying Tibetan web pages and their encodings, comprising the steps of:
1) specifying the encoding of a feature string in a given Tibetan encoding, the feature string being the syllable point and/or a selected high-frequency syllable;
2) scanning and searching the web page character stream using the encoded feature string as the keyword;
3) using a counter to count the number of times characters matching the encoded feature string appear;
4) judging, from the counter result, whether the page is a Tibetan web page and which Tibetan encoding it uses.
2. The method for identifying Tibetan web pages and their encodings as claimed in claim 1, characterized in that the high-frequency syllable is selected from
Figure A2007101110990002C1
Or
Figure A2007101110990002C2
Or
Figure A2007101110990002C3
Or
Figure A2007101110990002C4
Or
Figure A2007101110990002C5
Or
3. The method for identifying Tibetan web pages and their encodings as claimed in claim 1 or 2, characterized in that the feature string encoding is the syllable-point code or a selected high-frequency syllable code of the Tibetan encoding, or the concatenation of the syllable-point code and a selected high-frequency syllable code; during the scan the counter counts the number of times the encoded feature string appears in the web page character stream; when this number reaches the preset threshold, the page is judged to be a Tibetan web page, and the Tibetan encoding it uses is the given Tibetan encoding.
4. The method for identifying Tibetan web pages and their encodings as claimed in claim 1 or 2, characterized in that the feature string encoding is the syllable-point code or a selected high-frequency syllable code of the Tibetan encoding; during the scan the counter counts the number of times the syllable-point code or the high-frequency syllable code appears in the web page character stream, and the proportion of the page's character stream taken up by the feature string is obtained from the counter value; when this proportion reaches the preset threshold, the page is judged to be a Tibetan web page, and the Tibetan encoding it uses is the given Tibetan encoding.
5. The method for identifying Tibetan web pages and their encodings as claimed in claim 1 or 2, characterized in that the feature string encoding is the syllable-point code of the Tibetan encoding; during the scan, whenever the number of characters between two adjacent syllable-point codes in the page is between 1 and 7, the counter is incremented by one; when the counter value reaches the preset threshold, the page is judged to be a Tibetan web page, and the Tibetan encoding it uses is the given Tibetan encoding.
6. The method for identifying Tibetan web pages and their encodings as claimed in claim 1 or 2, characterized in that the feature string encoding is the syllable-point code of the Tibetan encoding; during the scan, whenever three or more adjacent syllable-point codes appear in succession and the number of characters between every two adjacent syllable-point codes is between 1 and 7, the counter is incremented by one; when the counter value reaches the preset threshold, the page is judged to be a Tibetan web page, and the Tibetan encoding it uses is the given Tibetan encoding.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200710111099 CN101055593A (en) 2007-06-15 2007-06-15 Tibetan web page and its code identification method

Publications (1)

Publication Number Publication Date
CN101055593A true CN101055593A (en) 2007-10-17

Family

ID=38795428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710111099 Pending CN101055593A (en) 2007-06-15 2007-06-15 Tibetan web page and its code identification method

Country Status (1)

Country Link
CN (1) CN101055593A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication