CN104598593A

CN104598593A - Traditional Mongolian webpage recognition method and traditional Mongolian webpage recognition system

Info

Publication number: CN104598593A
Application number: CN201510033629.0A
Authority: CN
Inventors: 王志娟
Original assignee: Minzu University of China
Current assignee: Minzu University of China
Priority date: 2015-01-22
Filing date: 2015-01-22
Publication date: 2015-05-06
Anticipated expiration: 2035-01-22
Also published as: CN104598593B

Abstract

The invention relates to a traditional Mongolian webpage recognition method and a traditional Mongolian webpage recognition system. The method includes the following steps: the word frequency and document frequency of each word in a traditional Mongolian webpage corpus are counted, and the harmonic mean of the two is calculated for each word; the words are sorted by harmonic mean in descending order, a first number of top-ranked words is chosen, and the harmonic means of those words are summed to obtain a first cumulative sum; the word frequencies of the same top-ranked words in a webpage to be recognized are counted and summed to obtain a second cumulative sum; when the difference between the first cumulative sum and the second cumulative sum is less than or equal to a first threshold, the webpage to be recognized is determined to be a traditional Mongolian webpage. The method recognizes traditional Mongolian webpages with high accuracy and high efficiency, and thereby helps to collect traditional Mongolian webpages and to implement a traditional Mongolian full-text search engine.

Description

Traditional Mongolian webpage recognition method and device

Technical field

The present invention relates to the field of network technology, and in particular to a traditional Mongolian webpage recognition method and device.

Background art

Traditional Mongolian is the official writing system for the Mongolian language in the Inner Mongolia Autonomous Region (that is, the orthography for writing Mongolian in the Mongolian script). Traditional Mongolian Internet resources are an important channel through which the Mongolian people transmit information and share resources in their own script, and they are also a main platform for passing on traditional Mongolian culture. These resources are significant for studying the Mongolian language and Mongolian culture and for implementing a traditional Mongolian full-text search engine. Compared with Chinese and English Internet resources, traditional Mongolian Internet resources in China are few in number and complex in encoding, so collecting them accurately and efficiently is essential. Preliminary research found that the key to collecting traditional Mongolian Internet resources accurately and efficiently is the accurate recognition of traditional Mongolian webpages.

At present, webpage language recognition methods include the following. 1) Determining the language of the webpage text from the LANG attribute of the HyperText Markup Language (HTML). The LANG attribute declares the language used by the webpage, which lets search engines and browsers read the page content correctly. 2) Determining the language of the webpage text from the "font-family" and "charset" attributes of HTML. HTML provides the character encoding of the webpage, and different character encodings may use different fonts, so the script of the webpage can be judged from the "font-family" attribute. For example, if the "charset" of a webpage is GB2312 and its "font-family" is "BZDBT" or "TIBETBT", or if the "charset" is UTF8 and the "font-family" is "Microsoft Himalaya", the webpage can be judged to be Tibetan. 3) Recognizing the language of the webpage text from the high-frequency words of a specific language. Each language has its own high-frequency linguistic units, so the language of a webpage can be judged from how often such high-frequency words occur in the webpage under analysis. For example, whether a webpage is Tibetan can be judged from the frequency of the Tibetan syllable dot and of high-frequency Tibetan words.
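The first two methods both read declarations from the page markup. As a minimal sketch of what that involves, the following Python code uses the standard library's `html.parser` to pull out the `lang` attribute of `<html>` and the declared `charset`. The class name `DeclaredLanguageParser` and the function `declared_language` are hypothetical names chosen for this illustration, and inspecting "font-family" would additionally require parsing CSS, which is not attempted here.

```python
from html.parser import HTMLParser


class DeclaredLanguageParser(HTMLParser):
    """Collects the lang attribute of <html> and the charset declared in <meta>."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and self.lang is None:
            self.lang = attributes.get("lang")
        if tag == "meta" and self.charset is None:
            if "charset" in attributes:
                # Short form: <meta charset="UTF-8">
                self.charset = attributes["charset"]
            else:
                # Older form: <meta http-equiv="Content-Type"
                #              content="text/html; charset=GB2312">
                content = attributes.get("content") or ""
                marker = "charset="
                position = content.lower().find(marker)
                if position != -1:
                    self.charset = content[position + len(marker):].strip()


def declared_language(html_text):
    """Return (lang, charset) as declared in the markup; either may be None."""
    parser = DeclaredLanguageParser()
    parser.feed(html_text)
    return parser.lang, parser.charset
```

A page that omits both declarations yields `(None, None)`, which is exactly the situation in which declaration-based methods give no answer.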

For the method for the LANG determined property webpage word according to HTML, according to World Wide Web Consortium (WorldWide Web Consortium, W3C) standard, each webpage should declare LANG attribute, owing to there is no the LANG attribute of html language in a lot of traditional Mongolian Characters in Web Pages, therefore, can not whether be only traditional Mongolian according to the LANG determined property homepages language of webpage.For the method for language belonging to " font-family " and " charset " determined property webpage word of HTML, a lot of traditional Mongolian Characters in Web Pages only has " charset " information, does not have " font-family " information, therefore can not judge whether webpage word is traditional Mongolian according to " charset " and " font-family ".For language belonging to the high frequency words identification webpage word based on specific languages, different language has oneself language feature, therefore the high frequency words of various language is not identical, such as: " ", " " be the word that Chinese frequency of utilization is higher, " it ", " the " are the words that in English, frequency of utilization is higher (he, she, it), (with) be the word that in Uighur, frequency of utilization is higher, the high frequency syntactic units come out towards same language, different pieces of information also has a great difference.Existing three kinds identify in the technology of homepages language, homepages language recognition technology based on high frequency words is comparatively effective relative to other two kinds of methods, but this technology only considers the absolute frequency of linguistic unit, the wording characteristics do not considered in different field text, and therefore the accuracy of identification of homepages language differs greatly.

Summary of the invention

The object of the present invention is to address the defects of the prior art by providing a traditional Mongolian webpage recognition method that recognizes traditional Mongolian webpages with high accuracy and high efficiency.

To achieve the above object, the present invention provides a traditional Mongolian webpage recognition method, the method comprising:

Obtaining and counting the word frequency TF_i and the document frequency DF_i of each word in a traditional Mongolian webpage corpus, where i ≥ 0;

Obtaining the harmonic mean F_i of each word in the traditional Mongolian webpage corpus according to $F_i = \frac{2 \cdot \mathrm{TF}_i \cdot \mathrm{DF}_i}{\mathrm{TF}_i + \mathrm{DF}_i}$;

Sorting the words of the traditional Mongolian webpage corpus by F_i value in descending order, choosing a first number of top-ranked words, and summing the F_i values of the first number of top-ranked words to obtain a first cumulative sum;

Obtaining and counting the word frequency TF_j of the first number of top-ranked words in a webpage to be recognized, where j ≥ 0;

Summing the TF_j values of the first number of top-ranked words in the webpage to be recognized to obtain a second cumulative sum;

When the difference between the first cumulative sum and the second cumulative sum is less than or equal to a first threshold, determining that the webpage to be recognized is a traditional Mongolian webpage.

In another aspect, the present invention also provides a traditional Mongolian webpage recognition device, the device comprising:

A first acquiring unit, configured to obtain and count the word frequency TF_i and the document frequency DF_i of each word in a traditional Mongolian webpage corpus, where i ≥ 0;

A first computing unit, configured to obtain the harmonic mean F_i of each word in the traditional Mongolian webpage corpus according to $F_i = \frac{2 \cdot \mathrm{TF}_i \cdot \mathrm{DF}_i}{\mathrm{TF}_i + \mathrm{DF}_i}$;

A second computing unit, configured to sort the words of the traditional Mongolian webpage corpus by F_i value in descending order, choose a first number of top-ranked words, and sum the F_i values of the first number of top-ranked words to obtain a first cumulative sum;

A second acquiring unit, configured to obtain and count the word frequency TF_j of the first number of top-ranked words in a webpage to be recognized, where j ≥ 0;

A third computing unit, configured to sum the TF_j values of the first number of top-ranked words in the webpage to be recognized to obtain a second cumulative sum;

A decision unit, configured to determine that the webpage to be recognized is a traditional Mongolian webpage when the difference between the first cumulative sum and the second cumulative sum is less than or equal to a first threshold.

The traditional Mongolian webpage recognition method and device provided by the present invention judge whether the language of a webpage is traditional Mongolian based on the harmonic mean of word frequency and document frequency in a traditional Mongolian webpage corpus. They recognize traditional Mongolian webpages with high accuracy and high efficiency, and thereby help to collect traditional Mongolian webpages and to implement a traditional Mongolian full-text search engine.

Brief description of the drawings

Fig. 1 is a flowchart of the traditional Mongolian webpage recognition method provided in Embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of the traditional Mongolian webpage recognition device provided in Embodiment 2 of the present invention.

Detailed description of the embodiments

The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.

Fig. 1 is a flowchart of the traditional Mongolian webpage recognition method provided in Embodiment 1. As shown in Fig. 1, the method comprises:

Step S101: obtaining and counting the word frequency and the document frequency of each word in a traditional Mongolian webpage corpus.

Specifically, each word in the traditional Mongolian webpage corpus is obtained, and the word frequency TF_i and the document frequency DF_i of each word are counted, where i ≥ 0.

Here, for a given document, the word frequency (term frequency, TF) is the number of times a given word occurs in that document.

For a given document set, the document frequency (DF) is the number of documents in the set in which a given word occurs.
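The two counts defined above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the function name `count_tf_df` is hypothetical, each document is assumed to be already split into words, and the corpus-level TF of a word is assumed to be its total number of occurrences across all documents.

```python
from collections import Counter


def count_tf_df(documents):
    """Return (tf, df) Counters for a corpus given as a list of word lists."""
    tf = Counter()  # total occurrences of each word across the corpus (assumed aggregation)
    df = Counter()  # number of documents in which each word occurs
    for tokens in documents:
        tf.update(tokens)       # every occurrence counts toward TF
        df.update(set(tokens))  # each document counts at most once toward DF
    return tf, df
```

For a toy corpus `[["a", "b", "a"], ["a", "c"]]`, the word "a" has TF 3 and DF 2, while "b" and "c" each have TF 1 and DF 1.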

Optionally, before obtaining and counting the word frequency and the document frequency of each word in the traditional Mongolian webpage corpus, the method further comprises:

Downloading traditional Mongolian webpages and preprocessing the traditional Mongolian webpages;

Building the traditional Mongolian webpage corpus.

It should be noted that the following points require attention when building the traditional Mongolian corpus:

(1) Large corpus scale

The corpus should contain at least one million words, and its time span should cover the webpages of a given website in a given year.

(2) Complete coverage of corpus types

The corpus should include webpages of the following types: news, education, culture (especially ethnic culture), science and technology, entertainment, forum, business, and other.

(3) Reasonable corpus composition

According to the linguistic features of traditional Mongolian and the state of its network resources, the approximate proportions of these types are: news, culture, and forum 20% each; education, entertainment, business, and other 10% each.

(4) Complete coverage of website encoding types

Because the encoding of traditional Mongolian webpages is complex, recognizing webpages in all traditional Mongolian encodings requires downloading webpages in every traditional Mongolian encoding currently in use, such as Menksoft (Meng Keli) encoding, Unicode encoding, Saiyin encoding, and Mingantu encoding.
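As a small illustration of item (3), the proportions can be turned into per-category word targets for a corpus of a chosen size. The category keys, the constant `COMPOSITION_PERCENT`, and the function `category_targets` are hypothetical names introduced only for this sketch; the percentages are the ones stated in item (3).

```python
# Shares in percent, taken from item (3); the category keys are illustrative names.
COMPOSITION_PERCENT = {
    "news": 20, "culture": 20, "forum": 20,
    "education": 10, "entertainment": 10, "business": 10, "other": 10,
}


def category_targets(total_words, composition=COMPOSITION_PERCENT):
    """Split a target corpus size (in words) across categories by percentage."""
    if sum(composition.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    # Integer arithmetic avoids floating-point rounding in the targets.
    return {name: total_words * share // 100 for name, share in composition.items()}
```

With a target of one million words, this gives 200,000 words each for news, culture, and forum pages, and 100,000 words for each of the remaining categories.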

Building a large-scale, multi-domain traditional Mongolian webpage corpus requires downloading a batch of webpages that takes encoding type, website type, and corpus proportions into account, and then preprocessing the downloaded Mongolian webpages: filtering junk information, converting the format to Extensible Markup Language (XML), and converting the encoding (other encodings are converted to Unicode).
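The three preprocessing operations can be sketched with the Python standard library. This is only an illustration under assumptions: "junk information" is taken to mean markup plus script and style content, the XML layout (a single `page` element with a `url` attribute) is invented for the example, and the names `TextExtractor` and `preprocess` are hypothetical. Python ships no codecs for the legacy Mongolian encodings, so the sketch decodes only encodings Python already knows; converting the others to Unicode would need separate mapping tables.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Keeps visible text and drops the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def preprocess(raw_bytes, encoding="utf-8", url=""):
    """Decode a page, strip markup and scripts, and wrap the text in XML."""
    text = raw_bytes.decode(encoding, errors="replace")  # encoding conversion
    extractor = TextExtractor()
    extractor.feed(text)                                 # junk filtering
    extractor.close()
    page = ET.Element("page", {"url": url})              # format conversion
    page.text = " ".join(extractor.parts)
    return ET.tostring(page, encoding="unicode")
```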

Step S102: calculating the harmonic mean of each word in the traditional Mongolian webpage corpus according to the harmonic mean formula.

Specifically, the harmonic mean F_i of each word in the traditional Mongolian webpage corpus is calculated according to the harmonic mean formula $F_i = \frac{2 \cdot \mathrm{TF}_i \cdot \mathrm{DF}_i}{\mathrm{TF}_i + \mathrm{DF}_i}$, where i ≥ 0.
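A short Python sketch of this calculation follows. The function name `harmonic_mean` and the handling of the case where both inputs are zero are assumptions made for the illustration. The harmonic mean is dominated by the smaller of its two inputs: a word that occurs 100 times but in only 1 document gets F = 200/101 ≈ 1.98, while a word that occurs 100 times spread over 50 documents gets F = 10000/150 ≈ 66.67, so words concentrated in a few documents rank low.

```python
def harmonic_mean(tf, df):
    """F = 2 * TF * DF / (TF + DF); returns 0.0 when both inputs are zero."""
    if tf + df == 0:
        return 0.0  # assumed convention; a word absent from the corpus has no score
    return 2.0 * tf * df / (tf + df)
```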

Step S103: sorting the words of the traditional Mongolian webpage corpus by harmonic mean in descending order, choosing a first number of top-ranked words, and summing the harmonic means of the first number of top-ranked words to obtain a first cumulative sum.

Specifically, the harmonic means F_i calculated in step S102 are sorted in descending order, a first number of top-ranked words is chosen, and the harmonic means of those words are summed to obtain the first cumulative sum.

For example, the words are ranked by harmonic mean F_i in descending order, and the F_i values of the top 5% are summed to obtain the first cumulative sum A, calculated as follows:

$$A = \sum_{i=1}^{n} F_i = \sum_{i=1}^{n} \frac{2 \cdot \mathrm{TF}_i \cdot \mathrm{DF}_i}{\mathrm{TF}_i + \mathrm{DF}_i},$$

where i ≥ 0.

Step S104: obtaining and counting the word frequency of the first number of top-ranked words in the webpage to be recognized.

Specifically, the first number of top-ranked words obtained in step S103 are looked up in the webpage to be recognized, and the word frequency TF_j of each of those words is obtained from the webpage to be recognized, where j ≥ 0.

Optionally, before obtaining and counting the word frequency of the first number of top-ranked words in the webpage to be recognized, the method further comprises: performing junk information filtering, format conversion, and encoding conversion on the webpage to be recognized to obtain a processed webpage to be recognized.

Step S105: summing the word frequencies of the first number of top-ranked words to obtain a second cumulative sum.

For example, the word frequencies TF_j of the top 5% of words, as obtained from the webpage to be recognized, are summed to obtain the second cumulative sum B, calculated as follows:

$$B = \sum_{j=1}^{n} \mathrm{TF}_j, \quad j \geq 0.$$
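Steps S104 and S105 amount to counting the selected words in the page and adding the counts. A minimal sketch, assuming the page has already been preprocessed and split into words, and using the hypothetical name `second_cumulative_sum`:

```python
from collections import Counter


def second_cumulative_sum(page_tokens, top_words):
    """Sum the page's word frequencies over the selected top-ranked words."""
    page_tf = Counter(page_tokens)
    # A Counter returns 0 for words that do not occur in the page.
    return sum(page_tf[word] for word in top_words)
```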

Step S106: when the difference between the first cumulative sum and the second cumulative sum is less than or equal to a first threshold, determining that the webpage to be recognized is a traditional Mongolian webpage.

For example, let the first threshold be α and determine whether |A - B| is less than or equal to α. If so, the webpage to be recognized is a traditional Mongolian webpage; if not, it is not a traditional Mongolian webpage. Here α is a constant, determined by experiment, that characterizes the difference between the two sums.
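Steps S101 to S106 can be put together in one short Python sketch. Everything beyond the formulas is an assumption made for illustration: the names `build_model` and `recognize`, whitespace-style tokenization into word lists, raw counts for TF and DF, and rounding the top 5% up to at least one word. The text does not say whether A and B are normalized to a common scale before comparison, so the sketch compares them as computed and leaves the choice of α to experiment.

```python
from collections import Counter
import math


def build_model(corpus_documents, fraction=0.05):
    """Steps S101-S103: return (top_words, A) for a corpus of word lists."""
    tf, df = Counter(), Counter()
    for tokens in corpus_documents:
        tf.update(tokens)        # occurrences across the corpus
        df.update(set(tokens))   # documents containing the word
    f = {w: 2.0 * tf[w] * df[w] / (tf[w] + df[w]) for w in tf}
    ranked = sorted(f, key=f.get, reverse=True)
    n = max(1, math.ceil(len(ranked) * fraction))  # rounding up is an assumption
    top_words = ranked[:n]
    first_sum = sum(f[w] for w in top_words)
    return top_words, first_sum


def recognize(page_tokens, top_words, first_sum, alpha):
    """Steps S104-S106: True if |A - B| <= alpha for the given page."""
    page_tf = Counter(page_tokens)
    second_sum = sum(page_tf[w] for w in top_words)
    return abs(first_sum - second_sum) <= alpha
```

The model is built once from the corpus, after which each candidate page costs only one pass over its words, which is where the efficiency of the approach comes from.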

The traditional Mongolian webpage recognition method provided by the present invention judges whether the language of a webpage is traditional Mongolian based on the harmonic mean of word frequency and document frequency in a traditional Mongolian webpage corpus. It recognizes traditional Mongolian webpages with high accuracy and high efficiency, and thereby helps to collect traditional Mongolian webpages and to implement a traditional Mongolian full-text search engine.

The above is a detailed description of the traditional Mongolian webpage recognition method provided by the present invention. The traditional Mongolian webpage recognition device provided by the present invention is described in detail below.

Fig. 2 is a schematic diagram of the traditional Mongolian webpage recognition device provided in Embodiment 2 of the present invention. As shown in Fig. 2, the device comprises: a first acquiring unit 201, a first computing unit 202, a second computing unit 203, a second acquiring unit 204, a third computing unit 205, and a decision unit 206.

The first acquiring unit 201 is configured to obtain and count the word frequency TF_i and the document frequency DF_i of each word in a traditional Mongolian webpage corpus, where i ≥ 0;

The first computing unit 202 is configured to obtain the harmonic mean F_i of each word in the traditional Mongolian webpage corpus according to $F_i = \frac{2 \cdot \mathrm{TF}_i \cdot \mathrm{DF}_i}{\mathrm{TF}_i + \mathrm{DF}_i}$;

The second computing unit 203 is configured to sort the words of the traditional Mongolian webpage corpus by F_i value in descending order, choose a first number of top-ranked words, and sum the F_i values of the first number of top-ranked words to obtain a first cumulative sum;

The second acquiring unit 204 is configured to obtain and count the word frequency TF_j of the first number of top-ranked words in a webpage to be recognized, where j ≥ 0;

The third computing unit 205 is configured to sum the TF_j values of the first number of top-ranked words in the webpage to be recognized to obtain a second cumulative sum;

The decision unit 206 is configured to determine that the webpage to be recognized is a traditional Mongolian webpage when the difference between the first cumulative sum and the second cumulative sum is less than or equal to a first threshold.

Optionally, the device further comprises:

A first processing unit 207, configured to download traditional Mongolian webpages and preprocess the traditional Mongolian webpages;

A creating unit 208, configured to build the traditional Mongolian webpage corpus.

Optionally, the device further comprises:

A second processing unit 209, configured to perform junk information filtering, format conversion, and encoding conversion on the webpage to be recognized to obtain a processed webpage to be recognized.

Optionally, the traditional Mongolian webpage corpus comprises at least one million traditional Mongolian words.

The device provided in Embodiment 2 of the present application implements the method provided in Embodiment 1 of the present application; therefore, the specific working process of the device is not repeated here.

The traditional Mongolian webpage recognition device provided by the present invention judges whether the language of a webpage is traditional Mongolian based on the harmonic mean of word frequency and document frequency in a traditional Mongolian webpage corpus. It recognizes traditional Mongolian webpages with high accuracy and high efficiency, and thereby helps to collect traditional Mongolian webpages and to implement a traditional Mongolian full-text search engine.

Those skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to function. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.

The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The specific embodiments described above further explain the object, technical solution, and beneficial effects of the present invention. It should be understood that the foregoing is only a specific embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A traditional Mongolian webpage recognition method, characterized in that the method comprises:

2. The traditional Mongolian webpage recognition method according to claim 1, characterized in that, before obtaining and counting the word frequency TF_i and the document frequency DF_i of each word in the traditional Mongolian webpage corpus, the method further comprises:

Building the traditional Mongolian webpage corpus.

3. The traditional Mongolian webpage recognition method according to claim 1, characterized in that, before obtaining and counting the word frequency TF_j of the first number of top-ranked words in the webpage to be recognized, the method further comprises:

Performing junk information filtering, format conversion, and encoding conversion on the webpage to be recognized to obtain a processed webpage to be recognized.

4. The traditional Mongolian webpage recognition method according to any one of claims 1-3, characterized in that the traditional Mongolian webpage corpus comprises at least one million traditional Mongolian words.

5. A traditional Mongolian webpage recognition device, characterized in that the device comprises:

6. The traditional Mongolian webpage recognition device according to claim 5, characterized in that the device further comprises:

A first processing unit, configured to download traditional Mongolian webpages and preprocess the traditional Mongolian webpages;

A creating unit, configured to build the traditional Mongolian webpage corpus.

7. The traditional Mongolian webpage recognition device according to claim 5, characterized in that the device further comprises:

A second processing unit, configured to perform junk information filtering, format conversion, and encoding conversion on the webpage to be recognized to obtain a processed webpage to be recognized.

8. The traditional Mongolian webpage recognition device according to any one of claims 5-7, characterized in that the traditional Mongolian webpage corpus comprises at least one million traditional Mongolian words.