CN103729466A - Name country identification method based on WEB and GBBoosting algorithms - Google Patents
Name country identification method based on WEB and GBBoosting algorithms
- Publication number: CN103729466A
- Application number: CN201410019885.XA
- Authority: CN (China)
- Prior art keywords: gbboosting, name, web, vector
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a name country identification method based on WEB and the GBBoosting algorithm, belonging to the technical field of WEB data mining. The method comprises the steps of: I. extracting the names of university scholars through a WEB data extraction technology; II. constructing the GBBoosting algorithm: constructing weak classifiers, wherein each weak classifier outputs a weak classification hypothesis for an input sample, and a strong classifier is formed by weighted fusion of all the weak classifiers; and III. identifying the country to which each name belongs through the GBBoosting algorithm. The method effectively addresses the problem of classifying names from two countries whose spelling conventions are similar; compared with other existing classification methods it is also easier to implement, and it is well suited to engineering practice such as semantic annotation of the country of a name or a city.
Description
Technical field
The invention belongs to the technical field of WEB data mining, and specifically relates to a name country identification method based on WEB and the GBBoosting algorithm.
Background technology
With the rapid development of the Internet and the growing abundance of WEB resources, WEB semantic analysis and text classification techniques have been widely applied in WEB data mining in recent years in order to extract needed and meaningful data from massive amounts of information quickly and accurately. WEB-based applications have to some extent changed users' habits and ways of working, and are welcomed by a growing number of users.
Classification methods such as KNN and Bayes have achieved good results in many classification fields. For example, Xie Mei et al. applied KNN to image processing and proposed an MR image intensity non-uniformity correction and segmentation method based on the KNN classification algorithm (patent No. 201010583560.6, publication date 2011.07.27); another group applied Bayes to computer software and proposed an intelligent short message classification and search method based on improved Bayesian classification (patent No. 201310356056.6, publication date 2013.12.04). However, the classification accuracy of these methods in the name country classification scenario still needs improvement; in particular, when the spelling conventions of names from two countries are similar, their accuracy is only slightly better than random guessing. These classification algorithms are therefore severely limited in name country classification applications.
To address the shortcomings of the above classification methods in the name country classification problem, the present invention proposes GBBoosting, an algorithm based on Boosting. Compared with other classification algorithms it improves both classification accuracy and recall, and it performs especially well when the names of the two countries being classified are spelled similarly. Applying the GBBoosting algorithm to identification scenarios such as the country of a name or a city allows names and cities to be semantically annotated with their country, which can in turn be applied in the currently popular social networking field; the method therefore has significant practical value and broad application prospects.
Summary of the invention
In view of this, the object of the present invention is to provide a name country identification method based on WEB and the GBBoosting algorithm. The method extracts the names of university scholars through a WEB data extraction technology and constructs weak classifiers, each of which outputs a weak classification hypothesis for an input sample; a strong classifier is formed by weighted fusion of all the weak classifiers, and the country to which each name belongs is finally identified by the GBBoosting algorithm.
To achieve the above object, the invention provides the following technical solution:
A name country identification method based on WEB and the GBBoosting algorithm comprises the following steps. Step 1: extract the names of university scholars through a WEB data extraction technology. Step 2: construct the GBBoosting algorithm: construct weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and form a strong classifier by weighted fusion of all the weak classifiers. Step 3: identify the country to which each name belongs through the GBBoosting algorithm.
Further, in step 1, the university school (department) page is obtained through the GOOGLE search engine interface; semantic analysis is then performed on that page to find the page listing the school's scholars; finally, the scholar information in that page is extracted using named entity recognition and semantic analysis techniques.
Further, in step 2, the construction of a weak classifier specifically comprises:
1) representing the training texts of the two classes as vectors [formula missing];
3) according to the formula [formula missing], calculating the vector perpendicular to the intermediate vector; for any test vector $a_i$, if $(w_i a_i) > 0$, labeling $a_i$ as $+1$, and if $(w_i a_i) < 0$, labeling $a_i$ as $-1$;
The weak classifiers are generated iteratively and fused by weight into a strong classifier. The specific steps are as follows:
First, given two training sets $D_1 = (x_1, x_2, \dots, x_i, \dots, x_n)$ and $D_2 = (y_1, y_2, \dots, y_i, \dots, y_n)$ and a test set $D_{test} = (z_1, z_2, \dots, z_i, \dots, z_n)$, express $D_1$, $D_2$, and $D_{test}$ in vector form [formula missing], and initialize the sample weights in $D_1$, $D_2$, and $D_{test}$ respectively;
Secondly, 1) randomly choose $M$ samples ($N/5 < M < N$) from each of $D_1$ and $D_2$ to form the subsets $D_{11}$ and $D_{21}$; add the vectors in each subset and normalize the sum to unit length, giving two vectors; 2) following the construction process of the linear classifier, obtain the vector perpendicular to the intermediate vector of the two vectors, which yields the weak classifier $h(x)_1$; after $p$ iterations, $p$ different perpendicular vectors and $p$ weak classifiers $h(x)_1, h(x)_2, \dots, h(x)_p$ are obtained; finally, $H(x) = h(x)_1 + h(x)_2 + \dots + h(x)_p$.
Further, in step 3, the country to which each scholar belongs is identified from the university scholar's name by the GBBoosting algorithm.
The beneficial effects of the present invention are as follows: the invention provides a name country identification method based on WEB and the GBBoosting algorithm that effectively solves the problem of names being hard to classify when two countries' spelling conventions are similar. The method is also easier to implement than other existing classification methods and is well suited to engineering practice such as semantic annotation of the country of a name or a city.
Brief description of the drawings
To make the object, technical solution, and beneficial effects of the present invention clearer, the following drawings are provided:
Fig. 1 is a high-level flowchart of the method of the invention;
Fig. 2 is a diagram of the vector similarity calculation;
Fig. 3 is a diagram of the weak classifier construction;
Fig. 4 is a detailed flowchart of the method.
Embodiment
The preferred embodiments of the present invention are described in detail below with reference to the drawings.
Fig. 1 is a high-level flowchart of the method of the invention. As shown in the figure, the method comprises the following steps. Step 1: extract the names of university scholars through a WEB data extraction technology. Step 2: construct the GBBoosting algorithm: construct weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and form a strong classifier by weighted fusion of all the weak classifiers. Step 3: identify the country to which each name belongs through the GBBoosting algorithm.
Fig. 4 is a detailed flowchart of the method; the concrete implementation steps are described below with reference to Fig. 4.
1. Extracting the names of university scholars through a WEB data extraction technology
1) Search for "university+computerscience" with the GOOGLE search engine to find the school's homepage; 2) from the school homepage, find the page that contains the information of all scholars in the school. Within a university, scholars' names are generally listed under the corresponding school (department), so once the URL of the school (department) is found, the names and homepage addresses of all its scholars can be obtained. In sub-step 2), the URL of the university's computer science school (department) is located; comparing the URL of the school (department) page with that of the school's scholar page yields two rules:
1. The latter address contains the former address.
2. The latter address also contains a feature such as "people", "faculty", or "faculty & Advisors".
It therefore suffices to traverse all links on the computer science school (department) page and keep those whose URL satisfies the two rules above and whose link text is "faculty" or "people". Experiments show that this generally yields two URL addresses: schools usually have a "people" menu, and "faculty" is a submenu link under "people", so the second URL is the one needed. Accordingly, the second URL is selected when two URLs are found; otherwise the first is selected. Opening the selected URL gives the names and corresponding personal homepages of all scholars. 3) Extract the names and homepages of all scholars from the faculty page of the computer science school (department): extract all links on the faculty page, find the text corresponding to each link, and use named entity recognition to determine whether the text is a person's name.
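The link-filtering rules described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `pick_faculty_url`, the `(href, anchor_text)` input format, and the `cs.example.edu` addresses are hypothetical and do not come from the patent, and fetching and parsing the page is left out.

```python
from urllib.parse import urljoin


def pick_faculty_url(department_url, links):
    """Select the scholar-list URL from (href, anchor_text) pairs.

    Rule 1: the candidate address contains the department address.
    Rule 2: the candidate address contains a "people"/"faculty" feature.
    The anchor text must also be "faculty" or "people".
    """
    base = department_url.rstrip("/")
    candidates = []
    for href, text in links:
        url = urljoin(department_url, href)  # resolve relative links
        if not url.startswith(base):
            continue  # rule 1 failed
        rest = url[len(base):].lower()
        if not any(key in rest for key in ("people", "faculty")):
            continue  # rule 2 failed
        if text.strip().lower() not in ("faculty", "people"):
            continue  # link text is neither "faculty" nor "people"
        if url not in candidates:
            candidates.append(url)
    if not candidates:
        return None
    # With two hits, "faculty" is usually a submenu of "people",
    # so the second URL is taken; otherwise the first.
    return candidates[1] if len(candidates) >= 2 else candidates[0]


if __name__ == "__main__":
    demo_links = [("/people/", "People"), ("/people/faculty/", "Faculty")]
    print(pick_faculty_url("https://cs.example.edu/", demo_links))
```

When more than two links pass the filter, this sketch still returns the second one; the source only describes the one-hit and two-hit cases, so that choice is an assumption.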
2. Implementing the GBBoosting algorithm: construct weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and form a strong classifier by weighted fusion of all the weak classifiers.
The weak classifier builds on simple space-vector similarity, which compares two classes of text through the inner product of their vectors, that is, through the angle between the two vectors. As shown in Fig. 2, the more similar two texts are, the smaller the angle between their vectors and the larger the cosine of that angle. As shown in Fig. 3, the weak classifier improves on simple space-vector similarity by constructing a simple linear classifier. The specific steps are as follows:
Step 1: Given the vector representations of the two classes of training text [formula missing].
Step 2: 1) According to the formula [formula missing], calculate the intermediate vector of the two classes of training text vectors; 2) according to the formula [formula missing], calculate the vector perpendicular to the intermediate vector.
Step 3: Given a $d$-dimensional vector [symbol missing] and a threshold of 0, for any test vector $a_i$: if $(w_i a_i) > 0$, label $a_i$ as $+1$; if $(w_i a_i) < 0$, label $a_i$ as $-1$.
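The three steps can be illustrated with a small Python sketch. The formulas for the intermediate and perpendicular vectors are not preserved in this text, so the sketch makes an explicit assumption: the intermediate vector is taken to be the bisector $u_1 + u_2$ of the two unit-normalized class vectors, and the perpendicular vector is taken to be $w = u_1 - u_2$, which is orthogonal to the bisector because $(u_1 + u_2) \cdot (u_1 - u_2) = |u_1|^2 - |u_2|^2 = 0$. All function names are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two vectors (the similarity of Fig. 2)."""
    return dot(unit(u), unit(v))


def build_weak_classifier(v1, v2):
    """Return (w, middle) for two class vectors.

    Assumption (not from the source): middle = u1 + u2 is the intermediate
    vector and w = u1 - u2 is the vector perpendicular to it.
    """
    u1, u2 = unit(v1), unit(v2)
    middle = [a + b for a, b in zip(u1, u2)]
    w = [a - b for a, b in zip(u1, u2)]
    return w, middle


def classify(w, a):
    """Label a test vector +1 or -1 by the sign of w . a (threshold 0)."""
    score = dot(w, a)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0  # exactly on the boundary; the source does not define this case
```

Under this assumption a test vector receives the label $+1$ when it lies on the same side of the bisector as the first class vector, and $-1$ when it lies on the side of the second.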
The weak classifier is the basis of the GBBoosting algorithm: each weak classifier outputs a weak classification hypothesis for an input sample, and a strong classifier is formed by weighted fusion of all the weak classifiers. Given two training sets $D_1 = (x_1, x_2, \dots, x_i, \dots, x_n)$ and $D_2 = (y_1, y_2, \dots, y_i, \dots, y_n)$, $M$ samples are chosen at random from each of $D_1$ and $D_2$ to generate two vectors, and the vector $V$ perpendicular to the intermediate vector of these two vectors is calculated. Each sample in the test set $D_{test} = (z_1, z_2, \dots, z_i, \dots, z_n)$ is then dot-multiplied with $V$, and the class of the sample is decided by the sign of the dot product. The specific steps are as follows:
Step 1: Given two training sets $D_1 = (x_1, x_2, \dots, x_i, \dots, x_n)$ and $D_2 = (y_1, y_2, \dots, y_i, \dots, y_n)$ and a test set $D_{test} = (z_1, z_2, \dots, z_i, \dots, z_n)$, express $D_1$, $D_2$, and $D_{test}$ in vector form [formula missing], and initialize the sample weights in $D_1$, $D_2$, and $D_{test}$ respectively.
Step 2: 1) Randomly choose $M$ samples ($N/5 < M < N$) from each of $D_1$ and $D_2$ to form the subsets $D_{11}$ and $D_{21}$; add the vectors in each subset and normalize the sum to unit length, giving two vectors. 2) Following the construction process of the linear classifier, obtain the vector perpendicular to the intermediate vector of the two vectors, which yields the weak classifier $h(x)_1$. After $p$ iterations, $p$ different perpendicular vectors and $p$ weak classifiers $h(x)_1, h(x)_2, \dots, h(x)_p$ are obtained.
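The sampling loop above might look roughly like the following Python sketch. Several points are assumptions rather than the patent's stated method: the perpendicular vector is taken as the difference of the two unit vectors, the weak classifiers are combined by an unweighted sum of their $\pm 1$ outputs, ties are assigned to $-1$, and the sample weights mentioned in Step 1 are not used because this text does not say how they enter the computation. The names `train` and `predict` are hypothetical.

```python
import math
import random


def unit(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def subset_direction(samples, m, rng):
    """Sum m randomly chosen sample vectors and scale the sum to unit length."""
    chosen = rng.sample(samples, m)
    summed = [sum(column) for column in zip(*chosen)]
    return unit(summed)


def train(d1, d2, p, m, seed=0):
    """Build p weak classifiers, each from fresh random subsets of d1 and d2."""
    n = min(len(d1), len(d2))
    if not n / 5 < m < n:
        raise ValueError("m must satisfy N/5 < M < N")
    rng = random.Random(seed)
    ws = []
    for _ in range(p):
        u1 = subset_direction(d1, m, rng)
        u2 = subset_direction(d2, m, rng)
        # Assumed perpendicular vector: difference of the two unit vectors.
        ws.append([a - b for a, b in zip(u1, u2)])
    return ws


def predict(ws, z):
    """Sum the +1/-1 votes of all weak classifiers and return the sign.

    Ties and zero dot products fall to -1 here, which is an arbitrary choice.
    """
    total = sum(1 if dot(w, z) > 0 else -1 for w in ws)
    return 1 if total > 0 else -1
```

A fixed `seed` keeps the random subsets reproducible; the vectors passed to `train` stand in for whatever name features are used, which this text does not specify.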
Finally, it should be noted that the above preferred embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail through the above preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made without departing from the scope defined by the claims of the present invention.
Claims (4)
1. A name country identification method based on WEB and the GBBoosting algorithm, characterized by comprising the following steps: Step 1: extracting the names of university scholars through a WEB data extraction technology;
Step 2: constructing the GBBoosting algorithm: constructing weak classifiers, each of which outputs a weak classification hypothesis for an input sample, and forming a strong classifier by weighted fusion of all the weak classifiers;
Step 3: identifying the country to which each name belongs through the GBBoosting algorithm.
2. The name country identification method based on WEB and the GBBoosting algorithm according to claim 1, characterized in that: in step 1, the university school (department) page is obtained through the GOOGLE search engine interface; semantic analysis is then performed on that page to find the page listing the school's scholars; finally, the scholar information in that page is extracted using named entity recognition and semantic analysis techniques.
3. The name country identification method based on WEB and the GBBoosting algorithm according to claim 1, characterized in that: in step 2, the construction of a weak classifier specifically comprises:
1) representing the training texts of the two classes as vectors [formula missing];
3) according to the formula [formula missing], calculating the vector perpendicular to the intermediate vector; for any test vector $a_i$, if $(w_i a_i) > 0$, labeling $a_i$ as $+1$, and if $(w_i a_i) < 0$, labeling $a_i$ as $-1$;
The weak classifiers are generated iteratively and fused by weight into a strong classifier. The specific steps are as follows:
First, given two training sets $D_1 = (x_1, x_2, \dots, x_i, \dots, x_n)$ and $D_2 = (y_1, y_2, \dots, y_i, \dots, y_n)$ and a test set $D_{test} = (z_1, z_2, \dots, z_i, \dots, z_n)$, express $D_1$, $D_2$, and $D_{test}$ in vector form [formula missing], and initialize the sample weights in $D_1$, $D_2$, and $D_{test}$ respectively;
Secondly, 1) randomly choose $M$ samples ($N/5 < M < N$) from each of $D_1$ and $D_2$ to form the subsets $D_{11}$ and $D_{21}$; add the vectors in each subset and normalize the sum to unit length, giving two vectors; 2) following the construction process of the linear classifier, obtain the vector perpendicular to the intermediate vector of the two vectors, which yields the weak classifier $h(x)_1$; after $p$ iterations, $p$ different perpendicular vectors and $p$ weak classifiers $h(x)_1, h(x)_2, \dots, h(x)_p$ are obtained; finally, $H(x) = h(x)_1 + h(x)_2 + \dots + h(x)_p$.
4. The name country identification method based on WEB and the GBBoosting algorithm according to claim 1, characterized in that: in step 3, the country to which each scholar belongs is identified from the university scholar's name by the GBBoosting algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410019885.XA CN103729466B (en) | 2014-01-16 | 2014-01-16 | Name country origin recognition methods based on WEB and GBBoosting algorithms |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410019885.XA CN103729466B (en) | 2014-01-16 | 2014-01-16 | Name country origin recognition methods based on WEB and GBBoosting algorithms |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103729466A true CN103729466A (en) | 2014-04-16 |
CN103729466B CN103729466B (en) | 2017-07-04 |
Family
ID=50453540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410019885.XA Active CN103729466B (en) | 2014-01-16 | 2014-01-16 | Name country origin recognition methods based on WEB and GBBoosting algorithms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103729466B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104484412A (en) * | 2014-12-16 | 2015-04-01 | 芜湖乐锐思信息咨询有限公司 | Big data analysis system based on multiform processing |
CN108108371A (en) * | 2016-11-24 | 2018-06-01 | 北京国双科技有限公司 | A kind of file classification method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080168070A1 (en) * | 2007-01-08 | 2008-07-10 | Naphade Milind R | Method and apparatus for classifying multimedia artifacts using ontology selection and semantic classification |
CN101609450A (en) * | 2009-04-10 | 2009-12-23 | 南京邮电大学 | Web page classification method based on training set |
CN102142078A (en) * | 2010-02-03 | 2011-08-03 | 中国科学院自动化研究所 | Method for detecting and identifying targets based on component structure model |
US20130218872A1 (en) * | 2012-02-16 | 2013-08-22 | Benzion Jair Jehuda | Dynamic filters for data extraction plan |
CN103400471A (en) * | 2013-08-12 | 2013-11-20 | 电子科技大学 | Detecting system and detecting method for fatigue driving of driver |
Non-Patent Citations (1)
Title |
---|
肖江,张亚非: "Boosting算法在文本自动分类中的应用", 《解放军理工大学学报(自然科学版)》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104484412A (en) * | 2014-12-16 | 2015-04-01 | 芜湖乐锐思信息咨询有限公司 | Big data analysis system based on multiform processing |
CN108108371A (en) * | 2016-11-24 | 2018-06-01 | 北京国双科技有限公司 | A kind of file classification method and device |
CN108108371B (en) * | 2016-11-24 | 2021-06-29 | 北京国双科技有限公司 | Text classification method and device |
Also Published As
Publication number | Publication date |
---|---|
CN103729466B (en) | 2017-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||