CN103729466A

CN103729466A - Name country identification method based on WEB and GBBoosting algorithms

Info

Publication number: CN103729466A
Application number: CN201410019885.XA
Authority: CN
Inventors: 苏畅; 贾文强; 王裕坤; 余跃; 吴琪
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2014-01-16
Filing date: 2014-01-16
Publication date: 2014-04-16
Anticipated expiration: 2034-01-16
Also published as: CN103729466B

Abstract

The invention discloses a method for identifying the country of a person's name based on the WEB and the GBBoosting algorithm, belonging to the technical field of WEB data mining. The method includes the following steps. Step 1: extract the names of university scholars through WEB data extraction technology. Step 2: construct the GBBoosting algorithm: construct weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fuse the weights of all weak classifiers into a strong classifier. Step 3: identify the country to which each name belongs through the GBBoosting algorithm. The method effectively solves the problem that names from two countries with similar spelling conventions cannot be classified. It is also easier to implement than other existing classification methods and can be better applied in engineering practice, such as semantic labeling of person names or cities by country.

Description

Method for Identifying the Country of a Person's Name Based on WEB and the GBBoosting Algorithm

Technical Field

The invention belongs to the technical field of WEB data mining, and in particular relates to a method for identifying the country of a person's name based on the WEB and the GBBoosting algorithm.

Background Art

With the rapid development of the Internet and the growing abundance of WEB resources, WEB semantic analysis and text classification have in recent years been widely applied in WEB data mining in order to extract needed and meaningful data from massive volumes of information quickly and accurately. WEB-based applications have changed users' living habits and working styles to some extent, and are welcomed by more and more users.

Classification methods such as KNN and Bayesian classifiers have achieved good results in many fields. For example, Xie Mei et al. applied KNN to image processing and proposed a KNN-based method for correcting and segmenting gray-level inhomogeneity in MR images (Patent No. 201010583560.6, publication date 2011.07.27); Yang Liu et al. applied Bayesian classification to computer software and proposed an intelligent SMS classification and search method based on improved Bayesian classification (Patent No. 201310356056.6, publication date 2013.12.04). However, the accuracy of these methods when classifying person names by country needs further improvement. In particular, when the names of two countries are spelled in similar ways, their classification accuracy is barely higher than random guessing. These classification algorithms therefore have severe limitations when applied to classifying person names by country.

In view of these shortcomings, the present invention proposes GBBoosting, a Boosting-based algorithm intended to solve the problems found in classifying person names by country. Compared with other classification algorithms, its classification accuracy and recall are considerably higher, and it performs especially well when the names of the two countries are spelled in similar ways. Applying the GBBoosting algorithm to recognition scenarios such as the country of a person's name or the country of a city, in order to label names or cities semantically by country and then use those labels in the fast-growing social networking field, has important practical significance and broad application prospects.

Summary of the Invention

In view of this, the object of the present invention is to provide a method for identifying the country of a person's name based on the WEB and the GBBoosting algorithm. The method extracts the names of university scholars through WEB data extraction technology, constructs weak classifiers, each of which outputs a weak classification hypothesis for an input sample, fuses the weights of all weak classifiers into a strong classifier, and finally identifies the country to which a name belongs through the GBBoosting algorithm.

To achieve the above object, the present invention provides the following technical solution:

A method for identifying the country of a person's name based on the WEB and the GBBoosting algorithm, including the following steps. Step 1: extract the names of university scholars through WEB data extraction technology. Step 2: construct the GBBoosting algorithm: construct weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fuse the weights of all weak classifiers into a strong classifier. Step 3: identify the country to which each name belongs through the GBBoosting algorithm.

Further, in Step 1, the university's school (department) page is obtained through the GOOGLE search engine interface; semantic analysis of that page then yields the page listing the school's scholars; finally, named entity recognition and semantic analysis are used to extract the scholar information from that page.

Further, in Step 2, the construction of a weak classifier specifically includes:

1) Represent the two types of training text as the vectors $\vec{V}_{1} = (x_{1}, x_{2}, \ldots, x_{i}, \ldots, x_{n})$ and $\vec{V}_{2} = (y_{1}, y_{2}, \ldots, y_{i}, \ldots, y_{n})$;

2) According to the formula (formula image not reproduced in this text), compute the intermediate vector of the two training texts, $\vec{V}_{3} = (z_{1}, z_{2}, \ldots, z_{i}, \ldots, z_{n})$;

3) According to the formula (formula image not reproduced in this text), compute the vector perpendicular to the intermediate vector $\vec{V}_{3}$. For any test vector $a_{i}$, if $(w_{i} \cdot a_{i}) > 0$, label $a_{i}$ as +1; if $(w_{i} \cdot a_{i}) < 0$, label $a_{i}$ as -1.
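Because the formula images for the intermediate and perpendicular vectors are missing from this text, the following worked equation is only one construction consistent with the description, not necessarily the formula used in the patent. It assumes both class vectors are first scaled to unit length, written here as $\hat{V}_{1}$ and $\hat{V}_{2}$ (notation introduced for this illustration), takes their sum as the intermediate vector and their difference as the perpendicular vector $\vec{w}$:

$$
\vec{V}_{3} = \hat{V}_{1} + \hat{V}_{2}, \qquad
\vec{w} = \hat{V}_{1} - \hat{V}_{2}, \qquad
\vec{V}_{3} \cdot \vec{w} = \lVert \hat{V}_{1} \rVert^{2} - \lVert \hat{V}_{2} \rVert^{2} = 1 - 1 = 0 .
$$

Under the same assumption, with $\theta$ the angle between the two class vectors, $\vec{w} \cdot \hat{V}_{1} = 1 - \cos\theta$ and $\vec{w} \cdot \hat{V}_{2} = \cos\theta - 1$. Whenever the two vectors are not parallel, the first class therefore falls on the positive side of the threshold 0 and the second on the negative side, which matches the +1 / -1 labeling rule.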

The weak classifiers are then iterated and their weights fused to form a strong classifier. The specific steps are as follows:

First, given two training sets $D_{1} = (x_{1}, x_{2}, \ldots, x_{i}, \ldots, x_{n})$ and $D_{2} = (y_{1}, y_{2}, \ldots, y_{i}, \ldots, y_{n})$ and a test set $D_{Test} = (z_{1}, z_{2}, \ldots, z_{i}, \ldots, z_{n})$, represent the training sets $D_{1}$, $D_{2}$ and the test set $D_{Test}$ in vector form, $D_{1} = (\vec{x}_{1}, \vec{x}_{2}, \ldots, \vec{x}_{i}, \ldots, \vec{x}_{n})$, $D_{2} = (\vec{y}_{1}, \vec{y}_{2}, \ldots, \vec{y}_{i}, \ldots, \vec{y}_{n})$, $D_{Test} = (\vec{z}_{1}, \vec{z}_{2}, \ldots, \vec{z}_{i}, \ldots, \vec{z}_{n})$, and initialize the sample weights in $D_{1}$, $D_{2}$ and $D_{Test}$ respectively;

Secondly, 1) randomly select M (N/5 < M < N) samples from each of $D_{1}$ and $D_{2}$ to form the subsets $D_{11}$ and $D_{21}$; within each subset, add the vectors component-wise and normalize the sum to unit length, giving two vectors (symbols not reproduced in this text); 2) following the construction of the linear classifier, obtain the vector perpendicular to the intermediate vector of these two vectors, and thereby generate the weak classifier $h(x)_{1}$. After p such rounds, p different perpendicular vectors and hence p weak classifiers $h(x)_{1}, h(x)_{2}, \ldots, h(x)_{p}$ are obtained. Finally, $H(x) = h(x)_{1} + h(x)_{2} + \cdots + h(x)_{p}$ (the equivalent expression that follows in the original is not reproduced in this text).

Further, in Step 3, the GBBoosting algorithm is applied to the names of the university scholars to identify the country to which each scholar belongs.

The beneficial effects of the present invention are as follows: the invention provides a method for identifying the country of a person's name based on the WEB and the GBBoosting algorithm, which effectively solves the problem that names from two countries with similar spelling conventions cannot be classified. The method is also easier to implement than other existing classification methods and can be better applied in engineering practice, such as semantic labeling of person names or cities by country.

Description of the Drawings

To make the object, technical solution and beneficial effects of the present invention clearer, the following drawings are provided:

Fig. 1 is the high-level flow chart of the method of the present invention;

Fig. 2 is a diagram of the vector similarity calculation;

Fig. 3 is a diagram of the construction of the weak classifier;

Fig. 4 is the detailed flow chart of the method.

Detailed Description

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 is the high-level flow chart of the method of the present invention. As shown in the figure, the method includes the following steps. Step 1: extract the names of university scholars through WEB data extraction technology. Step 2: construct the GBBoosting algorithm: construct weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fuse the weights of all weak classifiers into a strong classifier. Step 3: identify the country to which each name belongs through the GBBoosting algorithm.

Fig. 4 is the detailed flow chart of the method. The specific implementation steps are described below with reference to Fig. 4.

1. Extract the names of university scholars through WEB data extraction technology

1) Search for "university+computerscience" through the GOOGLE search engine to find the home page of the school (department). 2) From the school's home page, find the page that lists all scholars in that school. Scholars' names generally appear on the pages of their own school (department), so once the URL of the school (department) is found, the names and home-page addresses of all its scholars can be obtained. In sub-step 2), starting from the URL of the university's computer science school (department), comparing the URL of the school (department) page with the URL of its scholars page yields two rules:

① The latter address contains the former address.

② The latter address also contains one of the features "people", "faculty" or "faculty&Advisors".

It is only necessary to traverse all links on the computer science school (department) page and keep the URLs that satisfy both rules and whose link text is "faculty" or "people". Experiments show that this generally leaves two URLs. Two URLs appear because a school page usually has a "people" menu, with "faculty" as a submenu link under it, and the second URL is the one needed. Therefore, when two URLs are found the second is selected; otherwise the first is selected. Opening the selected URL gives the names of all scholars and their personal home pages. 3) Extract the names and home pages of all scholars from the faculty page of the computer science school (department): extract all links on the faculty page, find the text of each link, and use named entity recognition to decide whether that text is a person's name.
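The link-filtering rules above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `pick_faculty_url`, the `(href, anchor_text)` input format and the example URLs are assumptions, and rule ① ("contains") is simplified to a prefix check on the resolved URL.

```python
from urllib.parse import urljoin

# Keywords from rule 2; "faculty&Advisors" is covered by the "faculty" substring.
URL_FEATURES = ("people", "faculty")
# Link texts accepted by the filter described in the text.
ANCHOR_TEXTS = {"faculty", "people"}


def pick_faculty_url(dept_url, links):
    """Return the URL most likely to list a department's scholars, or None.

    dept_url: URL of the school (department) home page.
    links: iterable of (href, anchor_text) pairs found on that page
           (hypothetical input format, e.g. produced by an HTML parser).
    """
    base = dept_url.rstrip("/")
    candidates = []
    for href, text in links:
        url = urljoin(dept_url, href)  # resolve relative links
        # Rule 1 (simplified): the scholars-page URL extends the department URL.
        if not url.startswith(base):
            continue
        # Rule 2: the rest of the URL contains "people" or "faculty".
        tail = url[len(base):].lower()
        if not any(keyword in tail for keyword in URL_FEATURES):
            continue
        # The link text itself must be "faculty" or "people".
        if text.strip().lower() not in ANCHOR_TEXTS:
            continue
        if url not in candidates:
            candidates.append(url)
    if not candidates:
        return None
    # With two hits, the second is usually the "faculty" submenu under "people".
    return candidates[1] if len(candidates) >= 2 else candidates[0]
```

For a page whose links include both a "People" menu entry and a "Faculty" submenu entry, the function returns the second match, following the selection rule in the text; with a single match it returns that match.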

2. Implement the GBBoosting algorithm: construct weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fuse the weights of all weak classifiers into a strong classifier.

The weak classifier builds on simple space-vector similarity, which compares two classes of text by the inner product of their vectors, that is, by the angle between the two vectors. As shown in Fig. 2, the more similar two texts are, the smaller the angle between their vectors and the larger the cosine of that angle. As shown in Fig. 3, the weak classifier improves on simple space-vector similarity by constructing a simple linear classifier. The specific steps are as follows:

Step 1: Given two types of training text, represent them as the vectors $\vec{V}_{1} = (x_{1}, x_{2}, \ldots, x_{i}, \ldots, x_{n})$ and $\vec{V}_{2} = (y_{1}, y_{2}, \ldots, y_{i}, \ldots, y_{n})$.

Step 2: 1) According to the formula (formula image not reproduced in this text), compute the intermediate vector of the two training texts; 2) according to the formula (formula image not reproduced in this text), compute the vector perpendicular to the intermediate vector, $\vec{V} = (m_{1}, m_{2}, \ldots, m_{i}, \ldots, m_{n})$.

Step 3: There is a d-dimensional vector (symbol not reproduced in this text) and a threshold of 0. For any test vector $a_{i}$, if $(w_{i} \cdot a_{i}) > 0$, label $a_{i}$ as +1; if $(w_{i} \cdot a_{i}) < 0$, label $a_{i}$ as -1.
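The three steps above can be sketched in code. The formula images are missing from this text, so the sketch rests on a stated assumption rather than on the patent's own formulas: both class vectors are scaled to unit length, their sum is taken as the intermediate vector, and their difference as the perpendicular vector. All function names are hypothetical, and returning 0 for a dot product of exactly 0 is a choice made here because the text does not define that case.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; larger means more similar (Fig. 2)."""
    return dot(unit(u), unit(v))


def build_weak_classifier(v1, v2):
    """Return (intermediate vector, perpendicular vector) for two class vectors.

    Assumption: intermediate = u1 + u2 and perpendicular = u1 - u2 for the
    unit vectors u1, u2; their dot product is |u1|^2 - |u2|^2 = 0.
    """
    u1, u2 = unit(v1), unit(v2)
    middle = [a + b for a, b in zip(u1, u2)]
    w = [a - b for a, b in zip(u1, u2)]
    return middle, w


def classify(w, a):
    """Label test vector a as +1 or -1 by the sign of w . a (threshold 0)."""
    score = dot(w, a)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0  # exactly on the boundary: not defined in the text
```

With this orientation, vectors closer in angle to the first class vector receive the label +1 and vectors closer to the second receive -1.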

The weak classifier is the basis of the GBBoosting algorithm: each weak classifier outputs a weak classification hypothesis for an input sample, and the weights of all weak classifiers are fused into a strong classifier. Given two training sets $D_{1} = (x_{1}, x_{2}, \ldots, x_{i}, \ldots, x_{n})$ and $D_{2} = (y_{1}, y_{2}, \ldots, y_{i}, \ldots, y_{n})$, randomly select M samples from each of $D_{1}$ and $D_{2}$ and generate two vectors; compute the vector $\vec{V}$ perpendicular to the intermediate vector of these two vectors; then take the dot product of each sample in the test set $D_{Test} = (z_{1}, z_{2}, \ldots, z_{i}, \ldots, z_{n})$ with $\vec{V}$ and decide the sample's class from the sign of the result. The specific steps are as follows:

Step 1: Given two training sets $D_{1} = (x_{1}, x_{2}, \ldots, x_{i}, \ldots, x_{n})$ and $D_{2} = (y_{1}, y_{2}, \ldots, y_{i}, \ldots, y_{n})$ and a test set $D_{Test} = (z_{1}, z_{2}, \ldots, z_{i}, \ldots, z_{n})$, represent the training sets $D_{1}$, $D_{2}$ and the test set $D_{Test}$ in vector form, $D_{1} = (\vec{x}_{1}, \vec{x}_{2}, \ldots, \vec{x}_{i}, \ldots, \vec{x}_{n})$, $D_{2} = (\vec{y}_{1}, \vec{y}_{2}, \ldots, \vec{y}_{i}, \ldots, \vec{y}_{n})$, $D_{Test} = (\vec{z}_{1}, \vec{z}_{2}, \ldots, \vec{z}_{i}, \ldots, \vec{z}_{n})$, and initialize the sample weights in $D_{1}$, $D_{2}$ and $D_{Test}$ respectively.

Step 2: 1) Randomly select M (N/5 < M < N) samples from each of $D_{1}$ and $D_{2}$ to form the subsets $D_{11}$ and $D_{21}$; within each subset, add the vectors component-wise and normalize the sum to unit length, giving two vectors (symbols not reproduced in this text). 2) Following the construction of the linear classifier, obtain the vector perpendicular to the intermediate vector of these two vectors, and thereby generate the weak classifier $h(x)_{1}$. After p such rounds, p different perpendicular vectors and hence p weak classifiers $h(x)_{1}, h(x)_{2}, \ldots, h(x)_{p}$ are obtained.

Step 3: $H(x) = h(x)_{1} + h(x)_{2} + \cdots + h(x)_{p}$ (the equivalent expression that follows in the original is not reproduced in this text).
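Steps 1 to 3 can be sketched as a small ensemble. This is an illustration under stated assumptions, not the patent's implementation: the perpendicular vector of each round is taken as the difference of the two normalized subset sums (the original formulas are missing from this text), N is taken as the size of the smaller training set, and the sample weights mentioned in Step 1 are left out because the text does not say how they are used or updated. The names `train_gbboosting` and `predict` are hypothetical.

```python
import math
import random


def _unit_sum(vectors):
    """Add vectors component-wise and scale the sum to unit length."""
    total = [sum(column) for column in zip(*vectors)]
    norm = math.sqrt(sum(t * t for t in total))
    if norm == 0:
        raise ValueError("subset sums to the zero vector")
    return [t / norm for t in total]


def train_gbboosting(d1, d2, p, m, seed=None):
    """Build p weak classifiers, each represented by one perpendicular vector.

    d1, d2: lists of equal-length numeric vectors for the two classes.
    p: number of rounds; m: subset size with N/5 < m < N.
    """
    n = min(len(d1), len(d2))  # assumption: N is the smaller set size
    if not n / 5 < m < n:
        raise ValueError("m must satisfy N/5 < M < N")
    rng = random.Random(seed)
    weak = []
    for _ in range(p):
        u1 = _unit_sum(rng.sample(d1, m))  # subset D11
        u2 = _unit_sum(rng.sample(d2, m))  # subset D21
        # Assumed perpendicular vector: u1 - u2 is orthogonal to u1 + u2.
        weak.append([a - b for a, b in zip(u1, u2)])
    return weak


def predict(weak, z):
    """H(x) = h(x)_1 + ... + h(x)_p, each h being +1 or -1; return the sign."""
    total = 0
    for w in weak:
        score = sum(wi * zi for wi, zi in zip(w, z))
        total += (score > 0) - (score < 0)
    return 1 if total > 0 else -1 if total < 0 else 0
```

A call such as `train_gbboosting(d1, d2, p=7, m=3, seed=0)` followed by `predict(weak, z)` labels `z` as +1 when most rounds place it on the side of the first training set and -1 when most place it on the side of the second. An odd `p` makes a tied vote less likely.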

Finally, it should be noted that the above preferred embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail through the above preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made without departing from the scope defined by the claims of the present invention.

Claims

1. A method for identifying the country of a person's name based on the WEB and the GBBoosting algorithm, characterized by comprising the following steps: Step 1: extract the names of university scholars through WEB data extraction technology;

Step 2: construct the GBBoosting algorithm: construct weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fuse the weights of all weak classifiers into a strong classifier;

Step 3: identify the country to which each name belongs through the GBBoosting algorithm.

2. The method for identifying the country of a person's name based on the WEB and the GBBoosting algorithm according to claim 1, characterized in that: in Step 1, the university's school (department) page is obtained through the GOOGLE search engine interface, semantic analysis of that page then yields the page listing the school's scholars, and finally named entity recognition and semantic analysis are used to extract the scholar information from that page.

3. The method for identifying the country of a person's name based on the WEB and the GBBoosting algorithm according to claim 1, characterized in that: in Step 2, the construction of a weak classifier specifically comprises:

1) representing the two types of training text as the vectors $\vec{V}_{1} = (x_{1}, x_{2}, \ldots, x_{i}, \ldots, x_{n})$ and $\vec{V}_{2} = (y_{1}, y_{2}, \ldots, y_{i}, \ldots, y_{n})$;

2) according to the formula (formula image not reproduced in this text), computing the intermediate vector of the two training texts, $\vec{V}_{3} = (z_{1}, z_{2}, \ldots, z_{i}, \ldots, z_{n})$;

3) according to the formula (formula image not reproduced in this text), computing the vector perpendicular to the intermediate vector; for any test vector $a_{i}$, if $(w_{i} \cdot a_{i}) > 0$, labeling $a_{i}$ as +1, and if $(w_{i} \cdot a_{i}) < 0$, labeling $a_{i}$ as -1;

the weak classifiers are iterated and their weights fused to form a strong classifier, with the following specific steps:

first, given two training sets $D_{1} = (x_{1}, x_{2}, \ldots, x_{i}, \ldots, x_{n})$ and $D_{2} = (y_{1}, y_{2}, \ldots, y_{i}, \ldots, y_{n})$ and a test set $D_{Test} = (z_{1}, z_{2}, \ldots, z_{i}, \ldots, z_{n})$, representing the training sets $D_{1}$, $D_{2}$ and the test set $D_{Test}$ in vector form, $D_{1} = (\vec{x}_{1}, \vec{x}_{2}, \ldots, \vec{x}_{i}, \ldots, \vec{x}_{n})$, $D_{2} = (\vec{y}_{1}, \vec{y}_{2}, \ldots, \vec{y}_{i}, \ldots, \vec{y}_{n})$, $D_{Test} = (\vec{z}_{1}, \vec{z}_{2}, \ldots, \vec{z}_{i}, \ldots, \vec{z}_{n})$, and initializing the sample weights in $D_{1}$, $D_{2}$ and $D_{Test}$ respectively;

secondly, 1) randomly selecting M (N/5 < M < N) samples from each of $D_{1}$ and $D_{2}$ to form the subsets $D_{11}$ and $D_{21}$, and, within each subset, adding the vectors component-wise and normalizing the sum to unit length to obtain two vectors (symbols not reproduced in this text); 2) following the construction of the linear classifier, obtaining the vector perpendicular to the intermediate vector of these two vectors and thereby generating the weak classifier $h(x)_{1}$; after p rounds, obtaining p different perpendicular vectors and p weak classifiers $h(x)_{1}, h(x)_{2}, \ldots, h(x)_{p}$; finally, $H(x) = h(x)_{1} + h(x)_{2} + \cdots + h(x)_{p}$ (the equivalent expression that follows in the original is not reproduced in this text).

4. The method for identifying the country of a person's name based on the WEB and the GBBoosting algorithm according to claim 1, characterized in that: in Step 3, the GBBoosting algorithm is applied to the names of the university scholars to identify the country to which each scholar belongs.