CN104598593A - Traditional Mongolian webpage recognition method and traditional Mongolian webpage recognition system - Google Patents

Info

Publication number
CN104598593A
Authority
CN
China
Prior art keywords
web pages
word
traditional mongolian
webpage
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510033629.0A
Other languages
Chinese (zh)
Other versions
CN104598593B (en)
Inventor
王志娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minzu University of China
Original Assignee
Minzu University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minzu University of China
Priority to CN201510033629.0A
Publication of CN104598593A
Application granted
Publication of CN104598593B
Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention relates to a traditional Mongolian webpage recognition method and a traditional Mongolian webpage recognition system. The method includes the following steps: the word frequency and document frequency of each word in a traditional Mongolian webpage corpus are obtained and counted, and the harmonic mean of each word is calculated; according to the harmonic means in descending order, a first previous number of words are chosen, and the harmonic means of the first previous number of words are accumulated, so that a first accumulated sum is obtained; the word frequencies of the first previous number of words in a webpage to be recognized are obtained and counted, and are accumulated, so that a second accumulated sum is obtained; when the difference between the first accumulated sum and the second accumulated sum is less than or equal to a first threshold, the webpage to be recognized is determined to be a traditional Mongolian webpage. The traditional Mongolian webpage recognition method provided by the invention can carry out the recognition of traditional Mongolian webpages with high accuracy and high efficiency, and thereby can help to collect traditional Mongolian webpages and implement a traditional Mongolian full-text search engine.

Description

Traditional Mongolian webpage recognition method and device
Technical field
The present invention relates to the field of network technology, and in particular to a traditional Mongolian webpage recognition method and device.
Background
Traditional Mongolian is the official Mongolian writing system of the Inner Mongolia Autonomous Region (that is, the standard orthography for writing Mongolian in Mongolian letters). Traditional Mongolian web resources are an important channel through which the Mongolian people transmit information and share resources in their own script, and a major platform for passing on traditional Mongolian culture. They are significant for research on the Mongolian language and Mongolian culture and for building a traditional Mongolian full-text search engine. Compared with Chinese and English web resources, traditional Mongolian web resources in China are few in number and use complex encodings, so collecting them accurately and efficiently is essential. Earlier research found that the key to collecting traditional Mongolian web resources accurately and efficiently is the accurate recognition of traditional Mongolian webpages.
At present, webpage language identification methods include the following. 1) Determining the language of the webpage text from the LANG attribute of the HyperText Markup Language (HTML). The HTML LANG attribute declares the language used by the webpage, and it lets search engines and browsers read the page content correctly. 2) Determining the language of the webpage text from the "font-family" and "charset" attributes of HTML. HTML gives the character encoding of the webpage, and different character encodings may use different fonts, so the script of a webpage can be judged from its "font-family" attribute. For example, if the "charset" of a webpage is GB2312 and its "font-family" is "BZDBT" or "TIBETBT", or if the "charset" is UTF8 and the "font-family" is "Microsoft Himalaya", the webpage can be judged to be Tibetan. 3) Identifying the language of the webpage text from the high-frequency words of a specific language. Every language has its own high-frequency linguistic units, so the language of a webpage can be judged from how often those high-frequency words occur in the page under analysis. For example, whether a webpage is Tibetan can be judged from the frequency of the Tibetan syllable delimiter and of Tibetan high-frequency words.
Consider first the method based on the HTML LANG attribute. According to the World Wide Web Consortium (W3C) standard, every webpage should declare the LANG attribute, but many traditional Mongolian webpages do not, so the LANG attribute alone cannot determine whether the language of a webpage is traditional Mongolian. Consider next the method based on the "font-family" and "charset" attributes. Many traditional Mongolian webpages carry only "charset" information and no "font-family" information, so "charset" and "font-family" cannot determine whether the webpage text is traditional Mongolian. Consider finally the method based on the high-frequency words of a specific language. Different languages have their own linguistic features, so their high-frequency words differ. For example, "" and "" are high-frequency words in Chinese, "it" and "the" are high-frequency words in English, and the words glossed as (he, she, it) and (with) are high-frequency words in Uyghur. Even for the same language, the high-frequency linguistic units counted from different data sets differ considerably. Of the three existing techniques for identifying webpage language, the technique based on high-frequency words is more effective than the other two. However, it considers only the absolute frequency of linguistic units and ignores how wording varies across texts from different domains, so its recognition accuracy varies widely.
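Methods 1) and 2) above both rely on declarations in the page markup. As a minimal sketch of what reading those declarations could look like, the following uses Python's standard `html.parser` module to collect the `lang` attribute of the `<html>` element and the declared `charset`. The class and function names are hypothetical and are not part of the patent; the sketch also shows why the approach fails when a page declares neither value.

```python
from html.parser import HTMLParser


class DeclaredLanguageParser(HTMLParser):
    """Collects the lang attribute of <html> and any charset declared in <meta>."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "html" and self.lang is None:
            self.lang = attr_map.get("lang")
        elif tag == "meta" and self.charset is None:
            if "charset" in attr_map:
                # Modern form: <meta charset="utf-8">
                self.charset = attr_map["charset"]
            else:
                # Older form: <meta http-equiv="Content-Type"
                #             content="text/html; charset=GB2312">
                content = (attr_map.get("content") or "").lower()
                if "charset=" in content:
                    value = content.split("charset=", 1)[1]
                    self.charset = value.split(";")[0].strip()


def declared_language(html_text):
    """Return (lang, charset) as declared in the markup; None where undeclared."""
    parser = DeclaredLanguageParser()
    parser.feed(html_text)
    parser.close()
    return parser.lang, parser.charset
```

A page that declares nothing yields `(None, None)`, which is the situation the background describes for many traditional Mongolian webpages.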
Summary of the invention
The object of the present invention is to address the defects of the prior art by providing a traditional Mongolian webpage recognition method that recognizes traditional Mongolian webpages with relatively high accuracy and efficiency.
To achieve this object, the present invention provides a traditional Mongolian webpage recognition method, the method comprising:
obtaining and counting the word frequency $TF_i$ and the document frequency $DF_i$ of each word in a traditional Mongolian webpage corpus, where $i \ge 0$;
obtaining the harmonic mean $F_i$ of each word in the traditional Mongolian webpage corpus according to $F_i = \frac{2\,TF_i \cdot DF_i}{TF_i + DF_i}$;
sorting the words of the traditional Mongolian webpage corpus by $F_i$ in descending order, choosing the top first number of words, and accumulating the $F_i$ values of those words to obtain a first accumulated sum;
obtaining and counting the word frequency $TF_j$ of each of the top first number of words in the webpage to be recognized, where $j \ge 0$;
accumulating the $TF_j$ values of the top first number of words in the webpage to be recognized to obtain a second accumulated sum;
when the difference between the first accumulated sum and the second accumulated sum is less than or equal to a first threshold, determining that the webpage to be recognized is a traditional Mongolian webpage.
In another aspect, the present invention also provides a traditional Mongolian webpage recognition device, the device comprising:
a first acquiring unit, configured to obtain and count the word frequency $TF_i$ and the document frequency $DF_i$ of each word in a traditional Mongolian webpage corpus, where $i \ge 0$;
a first computing unit, configured to obtain the harmonic mean $F_i$ of each word in the traditional Mongolian webpage corpus according to $F_i = \frac{2\,TF_i \cdot DF_i}{TF_i + DF_i}$;
a second computing unit, configured to sort the words of the traditional Mongolian webpage corpus by $F_i$ in descending order, choose the top first number of words, and accumulate the $F_i$ values of those words to obtain a first accumulated sum;
a second acquiring unit, configured to obtain and count the word frequency $TF_j$ of each of the top first number of words in the webpage to be recognized, where $j \ge 0$;
a third computing unit, configured to accumulate the $TF_j$ values of the top first number of words in the webpage to be recognized to obtain a second accumulated sum;
a decision unit, configured to determine that the webpage to be recognized is a traditional Mongolian webpage when the difference between the first accumulated sum and the second accumulated sum is less than or equal to a first threshold.
The traditional Mongolian webpage recognition method and device provided by the present invention judge whether the language of a webpage is traditional Mongolian based on the harmonic mean of word frequency and document frequency in a traditional Mongolian webpage corpus. They recognize traditional Mongolian webpages with relatively high accuracy and efficiency, and can thereby help with collecting traditional Mongolian webpages and building a traditional Mongolian full-text search engine.
Brief description of the drawings
Fig. 1 is a flowchart of the traditional Mongolian webpage recognition method provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the traditional Mongolian webpage recognition device provided by Embodiment 2 of the present invention.
Detailed description
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Fig. 1 is a flowchart of the traditional Mongolian webpage recognition method provided by Embodiment 1. As shown in Fig. 1, the method comprises:
Step S101: obtain and count the word frequency and document frequency of each word in the traditional Mongolian webpage corpus.
Specifically, obtain each word in the traditional Mongolian webpage corpus and count its word frequency $TF_i$ and document frequency $DF_i$, where $i \ge 0$.
Here, for a given file, the word frequency (term frequency, TF) is the number of times a given word occurs in that file.
For a given set of files, the document frequency (DF) is the number of files in the set in which a given word occurs.
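The two definitions above can be sketched in a few lines of Python. This is an illustrative sketch, not the patent's implementation: it assumes each page has already been split into a list of word tokens, and that the corpus-level $TF_i$ of a word is its occurrence count summed over all pages. The function name is hypothetical.

```python
from collections import Counter


def term_and_document_frequencies(documents):
    """Count TF and DF for every word in a corpus.

    documents: iterable of token lists, one list per page (assumed pre-tokenized).
    Returns (tf, df), where tf[w] is the total number of occurrences of w
    and df[w] is the number of pages that contain w at least once.
    """
    tf = Counter()
    df = Counter()
    for tokens in documents:
        tf.update(tokens)        # every occurrence counts toward TF
        df.update(set(tokens))   # each page counts at most once toward DF
    return tf, df
```

For example, with the two pages `["a", "b", "a"]` and `["a", "c"]`, the word `"a"` has a word frequency of 3 and a document frequency of 2.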
Optionally, before obtaining and counting the word frequency and document frequency of each word in the traditional Mongolian webpage corpus, the method further comprises:
downloading traditional Mongolian webpages and preprocessing them;
building the traditional Mongolian webpage corpus.
Note that the following points require attention when building the traditional Mongolian corpus:
(1) The corpus should be large
The corpus should contain on the order of at least one million words, and its time span should cover the webpages of a given website over a given year.
(2) The corpus should cover a complete range of types
The corpus should include webpages of the following types: news, education, culture (especially ethnic culture), science and technology, entertainment, forums, business, and other.
(3) The corpus composition should be reasonable
Based on the linguistic features of traditional Mongolian and the state of its web resources, the approximate proportions of these types are: news, culture, and forums 20% each; education, entertainment, business, and other 10% each.
(4) The website encodings should be complete
Because the encodings of traditional Mongolian webpages are relatively complex, recognizing webpages in all traditional Mongolian encodings requires downloading webpages in each of the traditional Mongolian encodings currently in use, such as the Menksoft, Unicode, Saiyin, and Mingantu encodings.
Building a large, multi-domain traditional Mongolian webpage corpus requires downloading a batch of webpages that balances encoding type, website type, and corpus proportions, and then preprocessing the downloaded Mongolian webpages: filtering out junk information, converting the format to Extensible Markup Language (XML), and converting the encoding (other encodings are converted to Unicode).
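The preprocessing described above can be partly illustrated for pages that are already in Unicode. The sketch below strips markup and script or style content with Python's standard `html.parser`, then extracts word tokens. It rests on a simplifying assumption that is not stated in the patent: a word is taken to be a maximal run of characters in the ranges U+180B to U+180E and U+1820 to U+18AA of the Unicode Mongolian block. Conversion from the proprietary encodings is not shown, because it needs mapping tables that the source does not give. All names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: one word = a maximal run of Mongolian letters plus the
# variation selectors and vowel separator (U+180B..U+180E).
MONGOLIAN_WORD = re.compile(r"[\u180B-\u180E\u1820-\u18AA]+")


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def mongolian_tokens(html_text):
    """Return the Mongolian-script word tokens found in a Unicode HTML string."""
    extractor = _TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    return MONGOLIAN_WORD.findall(" ".join(extractor.chunks))
```

Text in other scripts and leftover markup noise simply produce no tokens, which serves as a crude form of the junk filtering mentioned above.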
Step S102: calculate the harmonic mean of each word in the traditional Mongolian webpage corpus according to the harmonic mean formula.
Specifically, calculate the harmonic mean $F_i$ of each word in the traditional Mongolian webpage corpus according to the harmonic mean formula $F_i = \frac{2\,TF_i \cdot DF_i}{TF_i + DF_i}$, where $i \ge 0$.
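Step S102 combines the two counts of a word into a single score, $F = 2 \cdot TF \cdot DF / (TF + DF)$. A minimal Python sketch follows; the function name is hypothetical, and returning 0 when both counts are zero is an assumption made here to avoid division by zero, since the patent does not discuss that case.

```python
def harmonic_mean(tf, df):
    """Harmonic mean of a word's term frequency and document frequency."""
    if tf + df == 0:
        return 0.0  # assumption: a word that never occurs scores zero
    return 2.0 * tf * df / (tf + df)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a word that occurs many times but only in a few pages scores lower than a word that occurs just as often and is spread across many pages.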
Step S103: sort the words of the traditional Mongolian webpage corpus by harmonic mean in descending order, choose the top first number of words, and accumulate their harmonic means to obtain a first accumulated sum.
Specifically, take the harmonic means $F_i$ calculated in step S102, sort the words in descending order of harmonic mean, choose the top first number of words, and accumulate the harmonic means of those words to obtain the first accumulated sum.
For example, sort by harmonic mean $F_i$ in descending order, choose the $F_i$ values ranked in the top 5%, and accumulate them to obtain the first accumulated sum $A$, calculated as follows:
$$A = \sum_{i=1}^{n} F_i = \sum_{i=1}^{n} \frac{2\,TF_i \cdot DF_i}{TF_i + DF_i},$$ where $i \ge 0$ and $n$ is the number of selected words.
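The selection and accumulation in step S103 can be sketched as follows. The sketch assumes the harmonic means are held in a dictionary keyed by word; the function name, the `top_percent` parameter, the rule of always keeping at least one word, and the alphabetical tie-break are illustrative choices, not details taken from the patent.

```python
import math


def top_words_and_sum(f_values, top_percent=5):
    """Select the top-ranked words by harmonic mean and sum their scores.

    f_values: dict mapping word -> harmonic mean F.
    Returns (selected_words, first_accumulated_sum).
    """
    if not f_values:
        return [], 0.0
    # Descending by score; ties are broken alphabetically for reproducibility.
    ranked = sorted(f_values.items(), key=lambda item: (-item[1], item[0]))
    n = max(1, math.ceil(len(ranked) * top_percent / 100))
    selected = ranked[:n]
    return [word for word, _ in selected], sum(score for _, score in selected)
```

The returned word list is what the later steps look up in the webpage to be recognized, and the returned sum plays the role of $A$.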
Step S104: obtain and count the word frequency of each of the top first number of words in the webpage to be recognized.
Specifically, look up the top first number of words obtained in step S103 in the webpage to be recognized, and obtain from that webpage the word frequency $TF_j$ of each of those words, where $j \ge 0$.
Optionally, before obtaining and counting the word frequency of the top first number of words in the webpage to be recognized, the method further comprises: performing junk information filtering, format conversion, and encoding conversion on the webpage to be recognized to obtain the processed webpage to be recognized.
Step S105: accumulate the word frequencies of the top first number of words to obtain a second accumulated sum.
For example, accumulate the word frequencies $TF_j$, obtained from the webpage to be recognized, of the words ranked in the top 5% to obtain the second accumulated sum $B$, calculated as follows:
$$B = \sum_{j=1}^{n} TF_j, \quad j \ge 0.$$
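Steps S104 and S105 amount to counting, in the webpage to be recognized, only the words that were selected from the corpus, and summing those counts. A minimal sketch, assuming the page is already tokenized and using hypothetical names:

```python
from collections import Counter


def second_accumulated_sum(page_tokens, top_words):
    """Sum the in-page word frequencies of the words selected from the corpus.

    page_tokens: list of word tokens from the webpage to be recognized.
    top_words: the words chosen in step S103.
    """
    counts = Counter(page_tokens)
    # Counter returns 0 for words that do not occur in the page.
    return sum(counts[word] for word in top_words)
```

A page written in another language contains few or none of the selected words, so its sum stays close to zero.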
Step S106: when the difference between the first accumulated sum and the second accumulated sum is less than or equal to a first threshold, determine that the webpage to be recognized is a traditional Mongolian webpage.
For example, let the first threshold be $\alpha$ and judge whether $|A - B|$ is less than or equal to $\alpha$. If so, the webpage to be recognized is a traditional Mongolian webpage; if not, it is not. Here $\alpha$ is a constant, determined by experiment, that characterizes the difference between the two sums.
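The decision rule of step S106 is a single comparison. The sketch below uses hypothetical names, and the numbers in the usage note are made up purely for illustration; the patent states only that $\alpha$ is determined by experiment and gives no value.

```python
def is_traditional_mongolian(first_sum, second_sum, alpha):
    """Decide by comparing |A - B| with the experimentally chosen threshold."""
    return abs(first_sum - second_sum) <= alpha
```

With illustrative values $A = 12.5$, $B = 11$, and $\alpha = 2$, the difference is 1.5 and the page is accepted; with $B = 3$ the difference is 9.5 and the page is rejected.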
The traditional Mongolian webpage recognition method provided by the present invention judges whether the language of a webpage is traditional Mongolian based on the harmonic mean of word frequency and document frequency in a traditional Mongolian webpage corpus. It recognizes traditional Mongolian webpages with relatively high accuracy and efficiency, and can thereby help with collecting traditional Mongolian webpages and building a traditional Mongolian full-text search engine.
The above is a detailed description of the traditional Mongolian webpage recognition method provided by the present invention. The traditional Mongolian webpage recognition device provided by the present invention is described in detail below.
Fig. 2 is a schematic diagram of the traditional Mongolian webpage recognition device provided by Embodiment 2 of the present invention. As shown in Fig. 2, the device comprises: a first acquiring unit 201, a first computing unit 202, a second computing unit 203, a second acquiring unit 204, a third computing unit 205, and a decision unit 206.
The first acquiring unit 201 is configured to obtain and count the word frequency $TF_i$ and the document frequency $DF_i$ of each word in the traditional Mongolian webpage corpus, where $i \ge 0$;
the first computing unit 202 is configured to obtain the harmonic mean $F_i$ of each word in the traditional Mongolian webpage corpus according to $F_i = \frac{2\,TF_i \cdot DF_i}{TF_i + DF_i}$;
the second computing unit 203 is configured to sort the words of the traditional Mongolian webpage corpus by $F_i$ in descending order, choose the top first number of words, and accumulate the $F_i$ values of those words to obtain a first accumulated sum;
the second acquiring unit 204 is configured to obtain and count the word frequency $TF_j$ of each of the top first number of words in the webpage to be recognized, where $j \ge 0$;
the third computing unit 205 is configured to accumulate the $TF_j$ values of the top first number of words in the webpage to be recognized to obtain a second accumulated sum;
the decision unit 206 is configured to determine that the webpage to be recognized is a traditional Mongolian webpage when the difference between the first accumulated sum and the second accumulated sum is less than or equal to a first threshold.
Optionally, the device further comprises:
a first processing unit 207, configured to download traditional Mongolian webpages and preprocess them;
a creating unit 208, configured to build the traditional Mongolian webpage corpus.
Optionally, the device further comprises:
a second processing unit 209, configured to perform junk information filtering, format conversion, and encoding conversion on the webpage to be recognized to obtain the processed webpage to be recognized.
Optionally, the traditional Mongolian webpage corpus contains at least one million traditional Mongolian words.
The device provided by Embodiment 2 of the present application implements the method provided by Embodiment 1 of the present application, so the specific working process of the device is not repeated here.
The traditional Mongolian webpage recognition device provided by the present invention judges whether the language of a webpage is traditional Mongolian based on the harmonic mean of word frequency and document frequency in a traditional Mongolian webpage corpus. It recognizes traditional Mongolian webpages with relatively high accuracy and efficiency, and can thereby help with collecting traditional Mongolian webpages and building a traditional Mongolian full-text search engine.
Those skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to function. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The specific embodiments described above further explain the object, technical solution, and beneficial effects of the present invention. It should be understood that the foregoing is only a specific embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (8)

1. A traditional Mongolian webpage recognition method, characterized in that the method comprises:
obtaining and counting the word frequency $TF_i$ and the document frequency $DF_i$ of each word in a traditional Mongolian webpage corpus, where $i \ge 0$;
obtaining the harmonic mean $F_i$ of each word in the traditional Mongolian webpage corpus according to $F_i = \frac{2\,TF_i \cdot DF_i}{TF_i + DF_i}$;
sorting the words of the traditional Mongolian webpage corpus by $F_i$ in descending order, choosing the top first number of words, and accumulating the $F_i$ values of those words to obtain a first accumulated sum;
obtaining and counting the word frequency $TF_j$ of each of the top first number of words in the webpage to be recognized, where $j \ge 0$;
accumulating the $TF_j$ values of the top first number of words in the webpage to be recognized to obtain a second accumulated sum;
when the difference between the first accumulated sum and the second accumulated sum is less than or equal to a first threshold, determining that the webpage to be recognized is a traditional Mongolian webpage.
2. The traditional Mongolian webpage recognition method according to claim 1, characterized in that, before obtaining and counting the word frequency $TF_i$ and the document frequency $DF_i$ of each word in the traditional Mongolian webpage corpus, the method further comprises:
downloading traditional Mongolian webpages and preprocessing them;
building the traditional Mongolian webpage corpus.
3. The traditional Mongolian webpage recognition method according to claim 1, characterized in that, before obtaining and counting the word frequency $TF_j$ of each of the top first number of words in the webpage to be recognized, the method further comprises:
performing junk information filtering, format conversion, and encoding conversion on the webpage to be recognized to obtain the processed webpage to be recognized.
4. The traditional Mongolian webpage recognition method according to any one of claims 1-3, characterized in that the traditional Mongolian webpage corpus contains at least one million traditional Mongolian words.
5. A traditional Mongolian webpage recognition device, characterized in that the device comprises:
a first acquiring unit, configured to obtain and count the word frequency $TF_i$ and the document frequency $DF_i$ of each word in a traditional Mongolian webpage corpus, where $i \ge 0$;
a first computing unit, configured to obtain the harmonic mean $F_i$ of each word in the traditional Mongolian webpage corpus according to $F_i = \frac{2\,TF_i \cdot DF_i}{TF_i + DF_i}$;
a second computing unit, configured to sort the words of the traditional Mongolian webpage corpus by $F_i$ in descending order, choose the top first number of words, and accumulate the $F_i$ values of those words to obtain a first accumulated sum;
a second acquiring unit, configured to obtain and count the word frequency $TF_j$ of each of the top first number of words in the webpage to be recognized, where $j \ge 0$;
a third computing unit, configured to accumulate the $TF_j$ values of the top first number of words in the webpage to be recognized to obtain a second accumulated sum;
a decision unit, configured to determine that the webpage to be recognized is a traditional Mongolian webpage when the difference between the first accumulated sum and the second accumulated sum is less than or equal to a first threshold.
6. The traditional Mongolian webpage recognition device according to claim 5, characterized in that the device further comprises:
a first processing unit, configured to download traditional Mongolian webpages and preprocess them;
a creating unit, configured to build the traditional Mongolian webpage corpus.
7. The traditional Mongolian webpage recognition device according to claim 5, characterized in that the device further comprises:
a second processing unit, configured to perform junk information filtering, format conversion, and encoding conversion on the webpage to be recognized to obtain the processed webpage to be recognized.
8. The traditional Mongolian webpage recognition device according to any one of claims 5-7, characterized in that the traditional Mongolian webpage corpus contains at least one million traditional Mongolian words.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510033629.0A CN104598593B (en) 2015-01-22 2015-01-22 Traditional Mongolian webpage recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510033629.0A CN104598593B (en) 2015-01-22 2015-01-22 Traditional Mongolian webpage recognition method and device

Publications (2)

Publication Number Publication Date
CN104598593A 2015-05-06
CN104598593B 2017-12-22

Family

ID=53124378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510033629.0A Expired - Fee Related CN104598593B (en) 2015-01-22 2015-01-22 Traditional Mongolian webpage recognition method and device

Country Status (1)

Country Link
CN (1) CN104598593B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020156760A1 (en) * 1998-01-05 2002-10-24 Nec Research Institute, Inc. Autonomous citation indexing and literature browsing using citation context
CN102129479A (en) * 2011-04-29 2011-07-20 南京邮电大学 World wide web service discovery method based on probabilistic latent semantic analysis model
CN103942188A (en) * 2013-01-22 2014-07-23 腾讯科技(深圳)有限公司 Method and device for identifying corpus languages

Also Published As

Publication number Publication date
CN104598593B (en) 2017-12-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171222

Termination date: 20210122
