CN103729466A - Name country identification method based on WEB and GBBoosting algorithms - Google Patents

Name country identification method based on WEB and GBBoosting algorithms

Info

Publication number
CN103729466A
Authority
CN
China
Prior art keywords
rightarrow
gbboosting
algorithm
web
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410019885.XA
Other languages
Chinese (zh)
Other versions
CN103729466B (en)
Inventor
苏畅
贾文强
王裕坤
余跃
吴琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN201410019885.XA
Publication of CN103729466A
Application granted
Publication of CN103729466B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for identifying the country of a personal name based on WEB and the GBBoosting algorithm, belonging to the technical field of WEB data mining. The method comprises the following steps. Step 1: extract the names of university scholars using WEB data extraction technology. Step 2: construct the GBBoosting algorithm: build weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fuse the weights of all weak classifiers into a strong classifier. Step 3: identify the country to which each name belongs using the GBBoosting algorithm. The method effectively addresses the problem that names cannot be classified when two countries spell personal names similarly. It is also easier to implement than other existing classification methods and is better suited to engineering practice such as semantic annotation of the country of a personal name or a city.

Description

Method for identifying the country of a personal name based on WEB and the GBBoosting algorithm

Technical Field

The invention belongs to the technical field of WEB data mining, and in particular relates to a method for identifying the country of a personal name based on WEB and the GBBoosting algorithm.

Background

With the rapid development of the Internet and the growing abundance of WEB resources, WEB semantic analysis and text classification have in recent years been widely applied in WEB data mining to extract needed, meaningful data quickly and accurately from massive amounts of information. WEB-based applications have to some extent changed users' habits of life and work, and are welcomed by a growing number of users.

Classification methods such as KNN and Bayesian classifiers have achieved good results in many fields. For example, Xie Mei et al. applied KNN to image processing and proposed a KNN-based method for gray-level inhomogeneity correction and segmentation of MR images (Patent No. 201010583560.6, published 2011.07.27); Yang Liu et al. applied Bayesian classification to computer software and proposed an intelligent SMS classification and search method based on improved Bayesian classification (Patent No. 201310356056.6, published 2013.12.04). However, the accuracy of these methods in classifying personal names by country leaves room for improvement. In particular, when two countries spell personal names similarly, their accuracy is only slightly better than random guessing. These algorithms therefore have significant limitations when applied to classifying names by country.

To address these shortcomings, the invention proposes GBBoosting, a Boosting-based algorithm aimed at the name-country classification problem. Compared with other classification algorithms, its classification accuracy and recall are considerably higher, and it performs particularly well when the names of two countries are spelled similarly. Applying the GBBoosting algorithm to recognition scenarios such as the country of a personal name or of a city, in order to annotate names or cities semantically with their country, and in turn to popular social networking applications, has significant practical value and broad application prospects.

Summary of the Invention

In view of this, the object of the invention is to provide a method for identifying the country of a personal name based on WEB and the GBBoosting algorithm. The method extracts the names of university scholars using WEB data extraction technology, constructs weak classifiers, each of which outputs a weak classification hypothesis for an input sample, fuses the weights of all weak classifiers into a strong classifier, and finally uses the GBBoosting algorithm to identify the country to which each name belongs.

To achieve the above object, the invention provides the following technical solution:

A method for identifying the country of a personal name based on WEB and the GBBoosting algorithm, comprising the following steps. Step 1: extract the names of university scholars using WEB data extraction technology. Step 2: construct the GBBoosting algorithm: build weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fuse the weights of all weak classifiers into a strong classifier. Step 3: identify the country to which each name belongs using the GBBoosting algorithm.

Further, in Step 1, university college pages are obtained through the GOOGLE search engine interface; semantic analysis of a college page then locates the page listing that college's scholars; finally, named entity recognition and semantic analysis are used to extract the scholar information from that page.

Further, in Step 2, constructing a weak classifier specifically comprises:

1) Represent the two types of training text as vectors $\vec{V}_1 = (x_1, x_2, \ldots, x_i, \ldots, x_n)$ and $\vec{V}_2 = (y_1, y_2, \ldots, y_i, \ldots, y_n)$;

2) Compute the intermediate vector $\vec{V}_3 = (z_1, z_2, \ldots, z_i, \ldots, z_n)$ of the two training-text vectors $\vec{V}_1$ and $\vec{V}_2$ according to the formula [formula image not reproduced];

3) Compute the vector perpendicular to the intermediate vector $\vec{V}_3$ according to the formula [formula image not reproduced]. For any test vector $a_i$, if $(w_i \cdot a_i) > 0$, label $a_i$ as $+1$; if $(w_i \cdot a_i) < 0$, label $a_i$ as $-1$;
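The two formulas in steps 2) and 3) survive only as images, so the exact expressions are not known from this text. As an illustration only, one construction consistent with the description (an assumption, not the patent's stated formula) normalizes both vectors, takes their sum as the intermediate vector and their difference as the perpendicular vector. The symbols $\hat{u}_1$, $\hat{u}_2$, $\vec{w}$ and $\theta$ (the angle between $\vec{V}_1$ and $\vec{V}_2$) are introduced here for the sketch:

```latex
\begin{aligned}
\hat{u}_1 &= \frac{\vec{V}_1}{\lVert \vec{V}_1 \rVert}, \qquad
\hat{u}_2 = \frac{\vec{V}_2}{\lVert \vec{V}_2 \rVert} \\
\vec{V}_3 &= \hat{u}_1 + \hat{u}_2, \qquad
\vec{w} = \hat{u}_1 - \hat{u}_2 \\
\vec{w} \cdot \vec{V}_3 &= \lVert \hat{u}_1 \rVert^2 - \lVert \hat{u}_2 \rVert^2 = 1 - 1 = 0 \\
\vec{w} \cdot \hat{u}_1 &= 1 - \cos\theta \ge 0, \qquad
\vec{w} \cdot \hat{u}_2 = \cos\theta - 1 \le 0
\end{aligned}
```

Under this assumption the first class falls on the positive side of the hyperplane through the origin with normal $\vec{w}$ and the second class on the negative side, which matches the $+1$ / $-1$ labelling rule in step 3).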

The weak classifiers are then iterated and their weights fused to form a strong classifier, as follows:

First, given two training sets $D_1 = (x_1, x_2, \ldots, x_i, \ldots, x_n)$ and $D_2 = (y_1, y_2, \ldots, y_i, \ldots, y_n)$ and a test set $D_{Test} = (z_1, z_2, \ldots, z_i, \ldots, z_n)$, represent $D_1$, $D_2$ and $D_{Test}$ in vector form: $D_1 = (\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_i, \ldots, \vec{x}_n)$, $D_2 = (\vec{y}_1, \vec{y}_2, \ldots, \vec{y}_i, \ldots, \vec{y}_n)$, $D_{Test} = (\vec{z}_1, \vec{z}_2, \ldots, \vec{z}_i, \ldots, \vec{z}_n)$, and initialize the sample weights in $D_1$, $D_2$ and $D_{Test}$;

Second, 1) randomly select $M$ ($N/5 < M < N$) samples from each of $D_1$ and $D_2$ to form subsets $D_{11}$ and $D_{21}$, then sum the vectors in each subset component-wise and normalize the result to unit length to obtain two vectors [symbols given as an image; not reproduced]; 2) following the construction of the linear classifier, obtain the vector perpendicular to the intermediate vector of these two vectors, which yields the weak classifier $h(x)_1$. After $p$ iterations, $p$ different perpendicular vectors and thus $p$ weak classifiers $h(x)_1, h(x)_2, \ldots, h(x)_p$ are obtained. Finally, $H(x) = h(x)_1 + h(x)_2 + \cdots + h(x)_p$ [equivalent expression given as an image; not reproduced].
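The fusion step can be written compactly as follows. The summation restates $H(x) = h(x)_1 + \cdots + h(x)_p$ from the text; writing each weak classifier as the sign of a dot product with its perpendicular vector $\vec{w}_k$, and taking the sign of the sum as the final label $\hat{y}$, are assumptions made for this sketch, since the source's final expression is only available as an image:

```latex
h_k(\vec{x}) = \operatorname{sign}\bigl(\vec{w}_k \cdot \vec{x}\bigr) \in \{-1, +1\},
\qquad
H(\vec{x}) = \sum_{k=1}^{p} h_k(\vec{x}),
\qquad
\hat{y}(\vec{x}) = \operatorname{sign}\bigl(H(\vec{x})\bigr)
```

With this reading every weak classifier carries equal weight, so the strong classifier is a majority vote. Choosing an odd $p$ would avoid ties in $H(\vec{x})$ whenever no individual dot product is exactly zero.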

Further, in Step 3, the GBBoosting algorithm is applied to the names of the university scholars to identify the country to which each scholar belongs.

The beneficial effects of the invention are as follows: the invention provides a method for identifying the country of a personal name based on WEB and the GBBoosting algorithm, which effectively addresses the problem that names cannot be classified when two countries spell personal names similarly. The method is also easier to implement than other existing classification methods and is better suited to engineering practice such as semantic annotation of the country of a personal name or a city.

Description of the Drawings

To make the purpose, technical solution and beneficial effects of the invention clearer, the following drawings are provided:

Fig. 1 is the high-level flow chart of the method of the invention;

Fig. 2 is a diagram of the vector similarity calculation;

Fig. 3 is a diagram of the construction of the weak classifier;

Fig. 4 is the detailed flow chart of the method.

Detailed Description

Preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

Fig. 1 is the high-level flow chart of the method of the invention. As shown, the method comprises the following steps. Step 1: extract the names of university scholars using WEB data extraction technology. Step 2: construct the GBBoosting algorithm: build weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fuse the weights of all weak classifiers into a strong classifier. Step 3: identify the country to which each name belongs using the GBBoosting algorithm.

Fig. 4 is the detailed flow chart of the method; the specific implementation steps are described below with reference to Fig. 4.

1. Extracting the names of university scholars using WEB data extraction technology

1) Search for "university+computerscience" with the GOOGLE search engine to find the college home page. 2) From the college home page, find the page that lists all scholars of that college. The names of a university's scholars are generally listed under the corresponding college (department), so once the URL of the college (department) is found, the names and home page addresses of all scholars of the university can be obtained. In this step the URL of the computer science college (department) of the university is located; comparing the URLs of the college (department) page and the college scholars page yields two rules:

① The latter URL contains the former URL.

② The latter URL also contains one of the features "people", "faculty" or "faculty&Advisors".

It is then only necessary to traverse all links on the computer science college (department) page and filter out the URLs that satisfy both rules and whose link text is "faculty" or "people". Experiments show that this generally filters out two URLs, because a college site usually has a "people" menu and "faculty" is a submenu link under "people". The second URL is the one needed, so when two URLs are found the second is selected; otherwise the first is selected. Opening the filtered URL gives the names and corresponding personal home pages of all scholars. 3) Extract the names and home pages of all scholars from the faculty page of the computer science college (department): extract all links on the faculty page, find the text of each link, and use named entity recognition to determine whether the text is a person's name.
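The link-filtering rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: it assumes the (href, link text) pairs have already been extracted from the college page, it compares feature strings case-insensitively, and the names `pick_scholars_url`, `FEATURES` and `LINK_TEXTS` are hypothetical.

```python
from urllib.parse import urljoin

# Feature strings from rule 2, lower-cased (case-insensitive matching is an assumption).
FEATURES = ("people", "faculty", "faculty&advisors")
# Link texts accepted by the filter, as named in the description.
LINK_TEXTS = ("faculty", "people")


def pick_scholars_url(college_url, links):
    """Return the scholars-page URL chosen from (href, link_text) pairs, or None.

    `links` is assumed to be the already-extracted links of the college page.
    """
    base = college_url.rstrip("/")
    matches = []
    for href, text in links:
        url = urljoin(college_url, href)  # resolve relative links
        # Rule 1: the scholars URL contains the college URL (and is not the same page).
        rule1 = base in url and url.rstrip("/") != base
        # Rule 2: the scholars URL contains one of the feature strings.
        rule2 = any(feature in url.lower() for feature in FEATURES)
        if rule1 and rule2 and text.strip().lower() in LINK_TEXTS:
            matches.append(url)
    if not matches:
        return None
    # Two hits usually mean a "people" menu plus its "faculty" submenu: take the second.
    return matches[1] if len(matches) >= 2 else matches[0]
```

Fetching pages, extracting links and the named entity recognition pass of sub-step 3) are left out of the sketch; any HTML parser and NER tool could supply them.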

2. Implementing the GBBoosting algorithm: construct weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fuse the weights of all weak classifiers into a strong classifier.

The weak classifier builds on simple space-vector similarity, which compares two types of text by the inner product of their vectors, that is, by the angle between the two vectors. As shown in Fig. 2, the more similar two texts are, the smaller the angle between their vectors and the larger the cosine of that angle. As shown in Fig. 3, the weak classifier improves on simple space-vector similarity by constructing a simple linear classifier. The specific steps are as follows:

Step 1: Given the vector representations of the two types of training text, $\vec{V}_1 = (x_1, x_2, \ldots, x_i, \ldots, x_n)$ and $\vec{V}_2 = (y_1, y_2, \ldots, y_i, \ldots, y_n)$.

Step 2: 1) Compute the intermediate vector of the two training-text vectors according to the formula [formula image not reproduced]; 2) compute the vector perpendicular to the intermediate vector, $\vec{V} = (m_1, m_2, \ldots, m_i, \ldots, m_n)$, according to the formula [formula image not reproduced].

Step 3: Given a $d$-dimensional vector [symbol given as an image; not reproduced] and a threshold of 0, for any test vector $a_i$: if $(w_i \cdot a_i) > 0$, label $a_i$ as $+1$; if $(w_i \cdot a_i) < 0$, label $a_i$ as $-1$.
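The three steps can be illustrated with a short Python sketch. Because the patent's formulas for the intermediate and perpendicular vectors are available only as images, the sketch assumes one construction that fits the description: the intermediate vector is the sum of the two unit vectors, and the perpendicular vector is their difference. The function names are hypothetical, and `cosine` is included only to show the similarity measure the weak classifier builds on.

```python
import math


def unit(v):
    """Scale v to unit length (raises ZeroDivisionError for a zero vector)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine of the angle between a and b; larger means more similar."""
    return dot(unit(a), unit(b))


def weak_classifier(v1, v2):
    """Return w, a vector perpendicular to the intermediate vector of v1 and v2.

    Assumption: intermediate vector = unit(v1) + unit(v2), w = unit(v1) - unit(v2).
    """
    u1, u2 = unit(v1), unit(v2)
    return [a - b for a, b in zip(u1, u2)]


def label(w, a):
    """Threshold-0 rule: +1 if w.a > 0, -1 if w.a < 0, 0 exactly on the boundary."""
    s = dot(w, a)
    return (s > 0) - (s < 0)
```

With this choice a test vector receives the label of whichever training vector it forms the smaller angle with, so the rule is equivalent to comparing two cosine similarities while needing only one dot product per test vector.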

The weak classifier is the basis of the GBBoosting algorithm: each weak classifier outputs a weak classification hypothesis for an input sample, and the weights of all weak classifiers are fused into a strong classifier. Given two training sets $D_1 = (x_1, x_2, \ldots, x_i, \ldots, x_n)$ and $D_2 = (y_1, y_2, \ldots, y_i, \ldots, y_n)$, randomly select $M$ samples from each of $D_1$ and $D_2$ and generate two vectors [symbols given as an image; not reproduced]. Compute the vector $\vec{V}$ perpendicular to the intermediate vector of these two vectors, take the dot product of each sample in the test set $D_{Test} = (z_1, z_2, \ldots, z_i, \ldots, z_n)$ with $\vec{V}$, and classify the sample by the sign of the result. The specific steps are as follows:

Step 1: Given two training sets $D_1 = (x_1, x_2, \ldots, x_i, \ldots, x_n)$ and $D_2 = (y_1, y_2, \ldots, y_i, \ldots, y_n)$ and a test set $D_{Test} = (z_1, z_2, \ldots, z_i, \ldots, z_n)$, represent $D_1$, $D_2$ and $D_{Test}$ in vector form: $D_1 = (\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_i, \ldots, \vec{x}_n)$, $D_2 = (\vec{y}_1, \vec{y}_2, \ldots, \vec{y}_i, \ldots, \vec{y}_n)$, $D_{Test} = (\vec{z}_1, \vec{z}_2, \ldots, \vec{z}_i, \ldots, \vec{z}_n)$, and initialize the sample weights in $D_1$, $D_2$ and $D_{Test}$.

Step 2: 1) Randomly select $M$ ($N/5 < M < N$) samples from each of $D_1$ and $D_2$ to form subsets $D_{11}$ and $D_{21}$, then sum the vectors in each subset component-wise and normalize the result to unit length to obtain two vectors [symbols given as an image; not reproduced]. 2) Following the construction of the linear classifier, obtain the vector perpendicular to the intermediate vector of these two vectors, which yields the weak classifier $h(x)_1$. After $p$ iterations, $p$ different perpendicular vectors and thus $p$ weak classifiers $h(x)_1, h(x)_2, \ldots, h(x)_p$ are obtained.

Step 3: $H(x) = h(x)_1 + h(x)_2 + \cdots + h(x)_p$ [equivalent expression given as an image; not reproduced].
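The iteration in Steps 1 to 3 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: names are assumed to be already encoded as equal-length numeric vectors (the text does not say how), the perpendicular vector is taken as the difference of the two unit vectors (the actual formula is an image), every weak classifier votes with equal weight, and the sample weights mentioned in Step 1 are omitted because the text does not describe how they are updated. The names `train` and `predict` are hypothetical.

```python
import math
import random


def unit(v):
    """Scale v to unit length (raises ZeroDivisionError for a zero vector)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def train(d1, d2, p, rng):
    """Build p weak classifiers, each from fresh random subsets of d1 and d2.

    Each training set needs at least 2 vectors so that N/5 < M < N has a solution.
    """
    ws = []
    for _ in range(p):
        units = []
        for d in (d1, d2):
            n = len(d)
            m = rng.randint(n // 5 + 1, n - 1)  # integer M with N/5 < M < N
            subset = rng.sample(d, m)
            summed = [sum(col) for col in zip(*subset)]  # component-wise sum
            units.append(unit(summed))
        # Assumed perpendicular vector: difference of the two unit vectors.
        ws.append([a - b for a, b in zip(units[0], units[1])])
    return ws


def predict(ws, x):
    """Sum the +1/-1 votes of all weak classifiers; return the sign of the total."""
    votes = 0
    for w in ws:
        s = dot(w, x)
        votes += (s > 0) - (s < 0)
    return (votes > 0) - (votes < 0)


# Toy data: class +1 clusters near the first axis, class -1 near the second.
d1 = [[5, 1], [4, 1], [6, 2], [5, 0], [4, 2]]
d2 = [[1, 5], [1, 4], [2, 6], [0, 5], [2, 4]]
ws = train(d1, d2, p=7, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the subset selection reproducible. For name classification, `d1` and `d2` would hold the vectors of names from the two countries, and `predict` would return +1 for the first country and -1 for the second.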

Finally, the above preferred embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail through these preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made without departing from the scope defined by the claims of the invention.

Claims (4)

1. A method for identifying the country of a personal name based on WEB and the GBBoosting algorithm, characterized by comprising the following steps: Step 1: extracting the names of university scholars using WEB data extraction technology;
Step 2: constructing the GBBoosting algorithm: constructing weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and fusing the weights of all weak classifiers into a strong classifier;
Step 3: identifying the country to which each name belongs using the GBBoosting algorithm.
2. The method for identifying the country of a personal name based on WEB and the GBBoosting algorithm according to claim 1, characterized in that: in Step 1, university college pages are obtained through the GOOGLE search engine interface, semantic analysis of a college page then locates the page listing that college's scholars, and finally named entity recognition and semantic analysis are used to extract the scholar information from that page.
3. The method for identifying the country of a personal name based on WEB and the GBBoosting algorithm according to claim 1, characterized in that: in Step 2, constructing a weak classifier specifically comprises:
1) representing the two types of training text as vectors $\vec{V}_1 = (x_1, x_2, \ldots, x_i, \ldots, x_n)$ and $\vec{V}_2 = (y_1, y_2, \ldots, y_i, \ldots, y_n)$;
2) computing the intermediate vector $\vec{V}_3 = (z_1, z_2, \ldots, z_i, \ldots, z_n)$ of the two training-text vectors according to the formula [formula image not reproduced];
3) computing the vector perpendicular to the intermediate vector according to the formula [formula image not reproduced]; for any test vector $a_i$, if $(w_i \cdot a_i) > 0$, labelling $a_i$ as $+1$, and if $(w_i \cdot a_i) < 0$, labelling $a_i$ as $-1$;
the weak classifiers are iterated and their weights fused to form a strong classifier, specifically as follows:
first, given two training sets $D_1 = (x_1, x_2, \ldots, x_i, \ldots, x_n)$ and $D_2 = (y_1, y_2, \ldots, y_i, \ldots, y_n)$ and a test set $D_{Test} = (z_1, z_2, \ldots, z_i, \ldots, z_n)$, representing $D_1$, $D_2$ and $D_{Test}$ in vector form: $D_1 = (\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_i, \ldots, \vec{x}_n)$, $D_2 = (\vec{y}_1, \vec{y}_2, \ldots, \vec{y}_i, \ldots, \vec{y}_n)$, $D_{Test} = (\vec{z}_1, \vec{z}_2, \ldots, \vec{z}_i, \ldots, \vec{z}_n)$, and initializing the sample weights in $D_1$, $D_2$ and $D_{Test}$;
second, 1) randomly selecting $M$ ($N/5 < M < N$) samples from each of $D_1$ and $D_2$ to form subsets $D_{11}$ and $D_{21}$, then summing the vectors in each subset component-wise and normalizing the result to unit length to obtain two vectors [symbols given as an image; not reproduced]; 2) following the construction of the linear classifier, obtaining the vector perpendicular to the intermediate vector of these two vectors, which yields the weak classifier $h(x)_1$; after $p$ iterations, obtaining $p$ different perpendicular vectors and thus $p$ weak classifiers $h(x)_1, h(x)_2, \ldots, h(x)_p$; finally, $H(x) = h(x)_1 + h(x)_2 + \cdots + h(x)_p$ [equivalent expression given as an image; not reproduced].
4. The method for identifying the country of a personal name based on WEB and the GBBoosting algorithm according to claim 1, characterized in that: in Step 3, the GBBoosting algorithm is applied to the names of the university scholars to identify the country to which each scholar belongs.
CN201410019885.XA 2014-01-16 2014-01-16 Name country origin recognition methods based on WEB and GBBoosting algorithms Expired - Fee Related CN103729466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410019885.XA CN103729466B (en) 2014-01-16 2014-01-16 Name country origin recognition methods based on WEB and GBBoosting algorithms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410019885.XA CN103729466B (en) 2014-01-16 2014-01-16 Name country origin recognition methods based on WEB and GBBoosting algorithms

Publications (2)

Publication Number Publication Date
CN103729466A (en) 2014-04-16
CN103729466B (en) 2017-07-04

Family

ID=50453540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410019885.XA Expired - Fee Related CN103729466B (en) 2014-01-16 2014-01-16 Name country origin recognition methods based on WEB and GBBoosting algorithms

Country Status (1)

Country Link
CN (1) CN103729466B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080168070A1 (en) * 2007-01-08 2008-07-10 Naphade Milind R Method and apparatus for classifying multimedia artifacts using ontology selection and semantic classification
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set
CN102142078A (en) * 2010-02-03 2011-08-03 中国科学院自动化研究所 Method for detecting and identifying targets based on component structure model
US20130218872A1 (en) * 2012-02-16 2013-08-22 Benzion Jair Jehuda Dynamic filters for data extraction plan
CN103400471A (en) * 2013-08-12 2013-11-20 电子科技大学 Detecting system and detecting method for fatigue driving of driver

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
肖江,张亚非: "Boosting算法在文本自动分类中的应用", 《解放军理工大学学报(自然科学版)》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484412A (en) * 2014-12-16 2015-04-01 芜湖乐锐思信息咨询有限公司 Big data analysis system based on multiform processing
CN108108371A (en) * 2016-11-24 2018-06-01 北京国双科技有限公司 A kind of file classification method and device
CN108108371B (en) * 2016-11-24 2021-06-29 北京国双科技有限公司 Text classification method and device

Also Published As

Publication number Publication date
CN103729466B (en) 2017-07-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170704