CN107908749A

CN107908749A - A person search system and method based on a search engine

Info

Publication number: CN107908749A
Application number: CN201711147336.0A
Authority: CN
Inventors: 周奇; 刘扬; 王佰玲; 辛国栋; 孙云霄; 王巍
Original assignee: Harbin Institute of Technology Weihai
Current assignee: Harbin Institute of Technology Weihai
Priority date: 2017-11-17
Filing date: 2017-11-17
Publication date: 2018-04-13
Anticipated expiration: 2037-11-17
Also published as: CN107908749B

Abstract

The present invention relates to a person search system and method based on a search engine. The system comprises a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module connected in sequence. The data acquisition module crawls the web page information returned when a search engine is queried with a person's name. The data preprocessing module filters out web pages unrelated to the name, segments the remaining pages into blocks, and filters out the visual blocks within each page that are unrelated to the queried name. The feature extraction module extracts attributes and entities related to the queried person, counts word frequencies in the visual blocks, constructs a vector representation of each web page, and appropriately increases the value of the dimension in the vector space that corresponds to each extracted feature word. The clustering module takes the vector representation of each web page as input, clusters the web page texts, and outputs a list of web page class labels. The invention effectively addresses the name ambiguity and information clutter in the web pages returned when searching for a person, and constructs a person summary from the extracted person attributes and person relationships, making name searches more convenient for users.

Description

A person search system and method based on a search engine

Technical field

The present invention relates to a person search system and method based on a search engine, and belongs to the field of Internet and search technology.

Background art

At present, the main difficulty in person search is that the web pages returned for a queried name suffer from name ambiguity and information clutter. Name disambiguation means distinguishing among multiple individuals who share the same name. The prevalence of name ambiguity hinders the dissemination of information and the acquisition of resources. The name search results provided by today's mainstream search engines often mix together the web pages of all the people who share the name along with irrelevant pages. These pages are sorted according to certain rules, and information about people who attract more attention is more likely to be ranked near the top. For example, when "Li Na" is searched in the Baidu search engine, the top-ranked pages in the results concern the Li Na who is a "tennis player", the one who is a "singer", the one known as the "most beautiful cancer-fighting girl", and so on. Information about a "Li Na" who is an ordinary teacher is submerged in this ocean of information, so the user has to spend a great deal of time checking and screening results.

Three classes of solutions currently address the above problems. 1. Supervised classification algorithms: a corpus is manually annotated and a suitable classifier model is selected to classify web page texts. The number of classes in such methods is fixed, so they cannot adapt to dynamically growing data, and the quality of the classifier depends to some extent on the size of the annotated corpus. 2. Unsupervised clustering algorithms: these are mainly divided into traditional clustering algorithms, clustering algorithms based on graph partitioning, and clustering algorithms based on Internet resources. Traditional clustering algorithms construct a vector space model of the web page texts and perform name disambiguation using K-Means or hierarchical clustering. Clustering algorithms based on graph partitioning take documents or features as nodes and the relations between documents or features as edges to construct a social network, and then cluster by graph partitioning. Clustering algorithms based on Internet resources first use resources such as a Chinese thesaurus, the Yahoo web document taxonomy, and Wikipedia to alleviate data scarcity and sparsity, and then apply a clustering algorithm to disambiguate names. 3. Hybrid models: a multi-step strategy combines several classification or clustering algorithms to achieve name disambiguation. Because network information is diverse and uncertain, large-scale manually annotated corpora are lacking, and manual annotation is very time-consuming and laborious, unsupervised name disambiguation methods are in this sense preferable to supervised ones.

At present, research on name disambiguation relies mainly on text modeling. Preprocessing includes extracting person attributes and named entities, and the mapping between names and individuals is studied in combination with the context of the name. Observation shows, however, that web pages contain many pieces of text far from the name, as well as some abstract information, that are of great help to name disambiguation. For example, if two web pages both belong to the music topic, or both belong to the computer field, the two pages probably correspond to the same person. The entire web page is therefore modeled here. In addition, current solutions cannot automatically identify the number of classes in a web page set and require manual intervention.

Chinese patent document 102054029A discloses a person information disambiguation method based on social networks and name context, which relates to disambiguating person information on the Internet. It addresses the problem that, in the prior art, a search engine's results for a given name are often a mixture of web pages related to different people who share that name, and it is used for retrieving person information on the network. It comprises the following steps. 1. The user inputs the name to be retrieved, a search engine completes the retrieval, and download software downloads the retrieved pages to a local computer. 2. Text extraction, word segmentation, and part-of-speech tagging are applied to these web pages to form documents. 3. The documents are first classified using person domain information; the social network and contextual information are then used to cluster the person domain information; finally, the correspondence between each piece of person domain information and each real-world person is displayed, together with the social network of each person. However, that patent directly extracts the body text of a web page and performs word segmentation and part-of-speech tagging to form a document. The web pages returned by search engines today are of many kinds and varied structure, and sidebars and multi-level headings often contain most of the information about the queried name. The method of that patent cannot extract name-related information from the non-body text of a web page, which seriously degrades the clustering result. Its clustering algorithm also needs to extract person domain information from the text, so the amount of information extracted strongly affects the clustering result, and the clustering threshold must be specified manually, so manual intervention influences the clustering result.

Summary of the invention

In view of the deficiencies of the prior art, the present invention provides a person search system based on a search engine.

The present invention also provides a person search method based on a search engine.

First, according to the actual layout of the web page, the Vision-based Page Segmentation (VIPS) algorithm is used to segment the page into blocks, and the text, position, and link features of each visual block are extracted. An SVM algorithm then filters out the visual blocks in the page that are unrelated to the name. Next, a text clustering method based on a Dirichlet process mixture is used. Based on the word frequency statistics of a text, this method can automatically judge whether the document belongs to an existing class or to a newly generated class, and thus automatically identifies the number of classes in the web page text set. This reduces the influence of manual intervention on the clustering result and effectively resolves the name ambiguity in the web pages returned for a queried name. Finally, a person summary is generated from the extracted attributes and person relationships, making name searches more convenient for users.

Explanation of terms:

1. TF-IDF value (term frequency-inverse document frequency): a statistical measure commonly used in natural language processing to assess how important a word is to one text within a text set or corpus. The importance of a word increases in proportion to the number of times it appears in the text, but decreases with the frequency with which it appears across the corpus.

2. VIPS visual segmentation algorithm: Vision-based Page Segmentation.

3. SVM classification algorithm: Support Vector Machine.
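The TF-IDF value in term 1 can be illustrated with a minimal sketch. The text does not state which TF-IDF variant is used, so the formula below (term frequency normalized by document length, multiplied by log(N / document frequency)) is an assumption, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: tf-idf} dict per tokenized document in corpus.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Toy corpus: three already-tokenized texts about different people named Li Na.
corpus = [
    ["li", "na", "tennis", "player"],
    ["li", "na", "singer"],
    ["li", "na", "teacher", "teacher"],
]
weights = tf_idf(corpus)
```

Words that occur in every text, such as the name itself, receive a weight of zero under this variant, while words that are frequent in one text and rare elsewhere receive the highest weights.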

The technical solution of the present invention is as follows:

A person search system based on a search engine comprises a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module connected in sequence.

The name to be searched is input, and the data acquisition module uses a crawler system to crawl the web page information returned by multiple search engines for that name, forming a web page set. The web page information refers to the web pages returned when the search engines are queried with the name; each web page includes a title (title), a url, a summary (content), and the complete web page.

The crawler engine first crawls the url in each result returned by the different search engines for the name, and the page download tool httrack then downloads the complete web page at each url. Observation shows that only the first 10 pages of results returned by a search engine for a name are strongly related to the name, so only the first 10 pages of results returned by each search engine are crawled.

The data preprocessing module filters out the web pages in the web page set that are unrelated to the name, segments the remaining web pages into blocks to obtain multiple visual blocks for each page, and uses a supervised classification algorithm to filter out the visual block information that is unrelated to the name.

A visual block is one of the blocks into which a web page is divided by the VIPS algorithm.

A visual block includes pictures and six-tuple information <distance from the top edge of the web page, distance from the left edge of the web page, length of the visual block, width of the visual block, number of the visual block, text in the visual block>. Visual block information unrelated to the name includes advertisements, navigation, pop-up boxes, copyright information, and other visual blocks unrelated to the name.
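The six-tuple above can be pictured as a small record type. The following is only an illustrative sketch: the class name `VisualBlock` and its field names are hypothetical, and the text does not specify the units or data types that the VIPS output uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VisualBlock:
    """Hypothetical container for the six-tuple of one visual block."""
    top: float      # distance from the top edge of the web page
    left: float     # distance from the left edge of the web page
    length: float   # length of the visual block
    width: float    # width of the visual block
    block_id: int   # number of the visual block
    text: str       # text in the visual block

    def as_six_tuple(self):
        # Same field order as the six-tuple listed in the text.
        return (self.top, self.left, self.length, self.width,
                self.block_id, self.text)

block = VisualBlock(top=120.0, left=40.0, length=600.0, width=200.0,
                    block_id=3, text="Li Na, tennis player")
```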

The feature extraction module extracts, from the visual blocks, the attributes and entities related to the queried person; an entity is a person name that appears in the web page. It counts the word frequencies in the name-related visual blocks of each web page and constructs a vector representation of each web page. The vector representation has the form (x, y), where x is a word in the web page text remaining after the name-unrelated visual blocks are filtered out and y is the number of times that word appears in the web page. Based on the extracted person-related attributes and entities, the value of the dimension in the vector space corresponding to each extracted feature word is appropriately increased.

The clustering module takes the vector representation of each web page as input, clusters the web page texts using a Dirichlet process mixture model, and outputs a list of web page class labels. The Dirichlet process mixture model can automatically identify the number of classes in the web page text set without manual intervention.

Preferably, according to the present invention, the data preprocessing module comprises a data cleaning module, a web page segmentation module, and a person-related visual block extraction module connected in sequence; the data acquisition module is connected to the data cleaning module, and the person-related visual block extraction module is connected to the feature extraction module.

The data cleaning module uses a named entity recognizer to identify whether each web page crawled by the crawler system contains the queried name. If a web page does not contain the queried name, or contains more than 5 names different from the queried name, the page is directly marked as unrelated to the name; otherwise, it is marked as related to the name.
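The cleaning rule can be sketched as a small predicate. This is a hedged illustration: the function name `is_name_related` is hypothetical, the list of names is assumed to come from some named entity recognizer that is not shown, and "more than 5 different names" is interpreted here as more than 5 distinct names, which the text does not state explicitly.

```python
def is_name_related(page_names, query_name, max_other_names=5):
    """Apply the data cleaning rule to one web page.

    page_names: person names found in the page by a named entity recognizer.
    query_name: the name that was searched for.
    """
    if query_name not in page_names:
        return False  # the page never mentions the queried name
    # Assumption: count distinct names other than the queried one.
    other_names = set(page_names) - {query_name}
    return len(other_names) <= max_other_names

related = is_name_related(["Li Na", "Zhang Wei", "Li Na"], "Li Na")
crowded = is_name_related(["Li Na", "A", "B", "C", "D", "E", "F"], "Li Na")
missing = is_name_related(["Zhang Wei"], "Li Na")
```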

The web page segmentation module performs visual segmentation on the name-related web pages obtained after data cleaning by the data cleaning module. Segmentation is carried out by the VIPS visual segmentation algorithm, which outputs the six-tuple information of each visual block in the web page. The six-tuple information includes: distance from the top edge of the web page, distance from the left edge of the web page, length of the visual block, width of the visual block, number of the visual block, and text in the visual block.

Because the visual blocks of a web page include advertisements, navigation, pop-up boxes, copyright information, and other visual blocks unrelated to the name, the person-related visual block extraction module filters out the name-unrelated visual blocks with an SVM classification algorithm. The input is a vector representation of the visual block formed from the TF-IDF values of the text in the block, the size of the block, the position of the block, and the in-link/out-link ratio feature. The size of a visual block comprises its length and width, and its position is represented by its distance from the top edge and its distance from the left edge of the web page. The output is 0 or 1: 0 indicates that the visual block is unrelated to the queried name, and 1 indicates that it is related.
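How the classifier input might be assembled is sketched below. The ordering of the features, the function name `block_features`, and the handling of a block with no out-links are all assumptions, and the text does not define exactly how the in-link/out-link ratio is computed; here it is simply taken as the number of in-links divided by the number of out-links.

```python
def block_features(tfidf_values, length, width, top, left, links_in, links_out):
    """Build one feature vector for a visual block (hypothetical layout).

    tfidf_values: TF-IDF values of the block text over a fixed vocabulary.
    length, width: size of the block.
    top, left: distances from the top and left edges of the page.
    links_in, links_out: link counts used for the ratio feature (assumed).
    """
    # Assumption: a block with no out-links gets a ratio of 0.0.
    link_ratio = links_in / links_out if links_out else 0.0
    return list(tfidf_values) + [length, width, top, left, link_ratio]

features = block_features([0.1, 0.0], 600, 200, 120, 40, links_in=3, links_out=6)
```

Vectors of this kind, together with 0/1 labels for manually annotated blocks, could then be passed to any SVM implementation for training and prediction.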

Preferably, according to the present invention, the feature extraction module comprises a person-related attribute extraction module, a person relationship extraction module, and a text vectorization module. The data preprocessing module is connected to the person-related attribute extraction module and to the person relationship extraction module, and both of these modules are connected to the text vectorization module. The person-related attribute extraction module extracts person attributes of several dimensions from each web page using rule matching and template matching.

The person relationship extraction module uses a named entity recognizer to identify the name entities in each web page, counts the number of times each name entity occurs and its distance from the queried name, and judges the importance of the entity from the number of occurrences and the distance from the queried name. The name entity is the entity referred to above.

The distance between a name entity and the queried name is calculated as follows: if the queried name and the extracted name entity appear in the same visual block, the distance is 0; otherwise, the distance is 1.

The importance of an entity, judged from the number of times the name entity occurs in the web page and its distance from the queried name, is calculated as: number of occurrences of the name entity + (1 - distance between the name entity and the queried name).
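The distance and importance calculations can be written directly from the two rules above. The function names and the representation of "which visual blocks a name appears in" as collections of block numbers are illustrative assumptions.

```python
def name_distance(query_blocks, entity_blocks):
    """Return 0 if the queried name and the entity share a visual block, else 1.

    query_blocks / entity_blocks: numbers of the visual blocks in which the
    queried name and the name entity appear, respectively.
    """
    return 0 if set(query_blocks) & set(entity_blocks) else 1

def entity_importance(occurrences, distance):
    """Importance = number of occurrences + (1 - distance)."""
    return occurrences + (1 - distance)

# An entity seen 4 times that shares block 2 with the queried name.
near = entity_importance(4, name_distance({1, 2}, {2, 5}))
# An entity seen 4 times that never shares a block with the queried name.
far = entity_importance(4, name_distance({1, 2}, {7}))
```

With a binary distance, co-occurrence in a visual block is worth exactly one extra occurrence.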

The text vectorization module first performs word segmentation on the person attributes extracted from the web page and collects the nouns among them. It then segments the web page text, removes stop words, counts the word frequencies of each web page, and constructs the vector representation of the web page text, i.e. the word frequency statistics of the text in the web page: {(word₁,count₁), (word₂,count₂), ..., (word_n,count_n)}, where word_i denotes the i-th word in the web page and count_i denotes the number of times the i-th word occurs in the web page. Finally, the words in the vector representation of the web page text that correspond to person attribute values and entities are looked up one by one, and their weights are appropriately increased according to the importance of the person attribute values and entities.

The importance of a person attribute reflects the fact that different attributes distinguish between people to different degrees. The attributes with higher discriminating power are gender, alma mater, names of works, education, height, weight, email, phone, and date of birth, and the weight added for each of them is 5; the weight added for each of the other 11 attributes is 3. The weights of the name entities and person attributes are added to the values of the corresponding words in the web page text vector to obtain the final vector representation of the web page text.
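The weighting step can be sketched as follows. The weights 5 and 3 come from the text; the function name `boost_features`, the sample words, and the choice to use an entity's importance score as its added weight are assumptions made for illustration.

```python
from collections import Counter

HIGH_WEIGHT = 5  # added for words from the more discriminative attributes
LOW_WEIGHT = 3   # added for words from the other attributes

def boost_features(word_counts, high_attr_words, other_attr_words, entity_weights):
    """Return a copy of word_counts with attribute and entity weights added.

    word_counts: {word: frequency} for one web page.
    high_attr_words / other_attr_words: nouns from extracted attribute values.
    entity_weights: {name entity: weight}; assumed to be importance scores.
    """
    boosted = Counter(word_counts)
    # Only words already present in the page vector are looked up and boosted.
    for word in list(boosted):
        if word in high_attr_words:
            boosted[word] += HIGH_WEIGHT
        elif word in other_attr_words:
            boosted[word] += LOW_WEIGHT
        boosted[word] += entity_weights.get(word, 0)
    return boosted

page = {"female": 2, "Wuhan": 1, "Zhang Wei": 2, "news": 7}
vector = boost_features(
    page,
    high_attr_words={"female"},      # value of the gender attribute
    other_attr_words={"Wuhan"},      # value of the birthplace attribute
    entity_weights={"Zhang Wei": 3},
)
```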

Preferably, according to the present invention, 20 dimensions of person attributes are extracted from each web page: birthplace, occupation, alma mater, date of birth, nationality, gender, names of works, personal experience, political affiliation, education, religious belief, height, weight, email, marital status, ethnicity, achievements, blood type, hobbies, and phone.

Preferably, according to the present invention, the crawler system is a distributed crawling system based on Scrapy-redis.

A person search method based on a search engine comprises:

(1) using a crawler system to crawl the web page information returned by multiple search engines for the queried name, forming a web page set;

(2) filtering out the web pages in the web page set that are unrelated to the name, segmenting the remaining web pages into blocks to obtain multiple visual blocks for each page, and filtering out the visual blocks unrelated to the name with a supervised classification algorithm;

(3) extracting, from the visual blocks, the attributes and entities related to the queried person, an entity being a person name that appears in the web page; counting the word frequencies in the name-related visual blocks of each web page and constructing a vector representation of each web page, the vector representation having the form (x, y), where x is a word in the web page text remaining after the name-unrelated visual blocks are filtered out and y is the number of times that word appears in the web page; and, based on the extracted person attributes and entities, appropriately increasing the value of the dimension in the vector space corresponding to each extracted feature word;

(4) taking the vector representation of each web page as input, clustering the web page texts using a Dirichlet process mixture model, and outputting a list of web page class labels. The Dirichlet process mixture model can automatically identify the number of classes in the web page text set without manual intervention.

Preferably, according to the present invention, step (2) comprises:

A. using a named entity recognizer to identify whether each web page crawled by the crawler system contains the queried name: if a web page does not contain the queried name, or contains more than 5 names different from the queried name, the page is directly marked as unrelated to the name; otherwise, it is marked as related to the name;

B. performing visual segmentation on the name-related web pages obtained after the data cleaning of step A: segmentation is carried out by the VIPS visual segmentation algorithm, which outputs the six-tuple information of each visual block in the web page, the six-tuple information including: distance from the top edge of the web page, distance from the left edge of the web page, length of the visual block, width of the visual block, number of the visual block, and text in the visual block;

C. filtering out the name-unrelated visual blocks with an SVM classification algorithm: the input is the vector representation of the visual block formed from the TF-IDF values of the text in the block, the size and position of the block, and the in-link/out-link ratio feature; the output is 0 or 1, where 0 indicates that the visual block is unrelated to the queried name and 1 indicates that it is related; the name-unrelated visual blocks are then removed.

Preferably, according to the present invention, step (3) comprises:

a. extracting 20 dimensions of person attributes from each web page using rule matching and template matching;

b. using a named entity recognizer to identify the name entities in each web page, counting the number of times each name entity occurs and its distance from the queried name, and judging the importance of the name entity from the number of occurrences and the distance from the queried name;

c. performing word segmentation on the person attributes extracted from the web page and collecting the nouns among them;

d. segmenting the web page text, removing stop words, counting the frequency of each word in each web page, and constructing the vector representation of the web page text;

e. looking up, one by one, the words in the vector representation of the web page text that correspond to person attribute values and name entities, and appropriately increasing their weights according to the importance of the person attributes and name entities.

Preferably, according to the present invention, in step a, 20 dimensions of person attributes are extracted from each web page: birthplace, occupation, alma mater, date of birth, nationality, gender, names of works, personal experience, political affiliation, education, religious belief, height, weight, email, marital status, ethnicity, achievements, blood type, hobbies, and phone.

The beneficial effects of the present invention are:

1. The present invention provides a person search method based on Dirichlet process mixture text clustering. The method effectively addresses the name ambiguity and information clutter in the web pages returned when searching for a person, and constructs a person summary from the extracted person attributes and person relationships, making name searches more convenient for users.

2. The present invention provides a method for extracting person-related information from heterogeneous web pages (heterogeneous web pages are web pages of different types, such as mhkc, news, forums, school and government websites, blogs, and finance). The web page is first segmented into blocks with the VIPS algorithm; a vector representation of each visual block is then constructed from the block's six-tuple and link information; and an SVM algorithm classifies each visual block as name-related or name-unrelated, effectively preventing name-unrelated information in the web page from affecting name disambiguation.

Brief description of the drawings

Fig. 1 is a structural diagram of the person search system based on a search engine according to the present invention;

Fig. 2 is a flow diagram of the person search method based on a search engine according to the present invention.

Detailed description

The present invention is further described below with reference to the drawings and embodiments, but is not limited thereto.

Embodiment 1

A person search system based on a search engine, as shown in Fig. 1, comprises a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module connected in sequence.

The name to be searched is input, and the data acquisition module uses a distributed crawling system based on Scrapy-redis to crawl the web page information returned by multiple search engines for that name, forming a web page set. The web page information refers to the web pages returned when the search engines are queried with the name; each web page includes a title (title), a url, a summary (content), and the complete web page.

The crawler engine first crawls the url in each result returned by the different search engines for the name, and the page download tool httrack then downloads the complete web page at each url. Observation shows that only the first 10 pages of results returned by a search engine for a name are strongly related to the name, so only the first 10 pages of results returned by each search engine are crawled.

The data preprocessing module filters out the web pages in the web page set that are unrelated to the name, segments the remaining web pages into blocks to obtain multiple visual blocks for each page, and uses a supervised classification algorithm to filter out the visual block information that is unrelated to the name.

A visual block includes pictures and six-tuple information <distance from the top edge of the web page, distance from the left edge of the web page, length of the visual block, width of the visual block, number of the visual block, text in the visual block>. Visual blocks unrelated to the name include advertisements, navigation, pop-up boxes, copyright information, and other visual blocks unrelated to the name.

The feature extraction module extracts, from the visual blocks, the attributes and entities related to the queried person; an entity is a person name that appears in the web page. It counts the word frequencies in the name-related visual blocks of each web page and constructs a vector representation of each web page. The vector representation has the form (x, y), where x is a word in the web page text remaining after the name-unrelated visual blocks are filtered out and y is the number of times that word appears in the web page. Based on the extracted person attribute values and name entities, the value of the dimension in the vector space corresponding to each extracted feature word is appropriately increased.

The clustering module takes the vector representation of each web page as input, clusters the web page texts using a Dirichlet process mixture model, and outputs a list of web page class labels. The Dirichlet process mixture model can automatically identify the number of classes in the web page text set without manual intervention.

The Dirichlet process mixture model can be understood as an infinite mixture model with an infinite number of components; it is the limiting form of a finite mixture model under a Dirichlet process prior. Assume that the sample set of the model (each element being the vector representation of one web page) is X = {x₁, x₂, …, x_n}, whose elements are independent and identically distributed variables obeying the following distributions:

G ~ DP(α, H)    (1)

θ_i | G ~ G    (2)

x_i | θ_i ~ F(θ_i)    (3)

The observed variable x_i obeys the distribution F(θ_i) with parameter θ_i. G is the prior distribution of the parameter θ_i, and G is a probability measure drawn from a Dirichlet process with parameter α and base distribution H. If samples x_i and x_j have the same parameter, the two samples are clustered into the same class.
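Why such a model does not need a preset number of classes can be seen from the Chinese restaurant process, a standard equivalent view of the Dirichlet process prior: a new sample joins an existing class with probability proportional to the size of that class, and opens a new class with probability proportional to α. The sketch below computes only these prior probabilities. It is not the inference procedure of the invention, which the text does not describe; a full sampler would also multiply each term by the likelihood under F. The function name is hypothetical.

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Prior class-assignment probabilities under a Chinese restaurant process.

    cluster_sizes: number of samples currently in each existing class.
    alpha: concentration parameter of the Dirichlet process.
    Returns (probabilities of joining each existing class,
             probability of opening a new class).
    """
    denom = sum(cluster_sizes) + alpha
    existing = [size / denom for size in cluster_sizes]
    new = alpha / denom
    return existing, new

# Four web pages already clustered into classes of size 3 and 1, with alpha = 1.
existing, new = crp_assignment_probs([3, 1], alpha=1.0)
```

A larger α makes new classes more likely, so α influences how many distinct people the clustering tends to find.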

Embodiment 2

A person search system based on a search engine according to embodiment 1, differing in that:

The data preprocessing module comprises a data cleaning module, a web page segmentation module, and a person-related visual block extraction module connected in sequence; the data acquisition module is connected to the data cleaning module, and the person-related visual block extraction module is connected to the feature extraction module.

The data cleaning module uses a named entity recognizer to identify whether each web page crawled by the crawler system contains the queried name. If a web page does not contain the queried name, or contains more than 5 names different from the queried name, the page is directly marked as unrelated to the name; otherwise, it is marked as related to the name.

The web page segmentation module performs visual segmentation on the name-related web pages obtained after data cleaning by the data cleaning module. Segmentation is carried out by the VIPS visual segmentation algorithm, which outputs the six-tuple information of each visual block in the web page. The six-tuple information includes: distance from the top edge of the web page, distance from the left edge of the web page, length of the visual block, width of the visual block, number of the visual block, and text in the visual block.

Because the visual blocks of a web page include advertisements, navigation, pop-up boxes, copyright information, and other visual blocks unrelated to the name, the person-related visual block extraction module filters out the name-unrelated visual blocks with an SVM classification algorithm. The input is a vector representation of the visual block formed from the TF-IDF values of the text in the block, the size of the block, the position of the block, and the in-link/out-link ratio feature. The size of a visual block comprises its length and width, and its position is represented by its distance from the top edge and its distance from the left edge of the web page. The output is 0 or 1: 0 indicates that the visual block is unrelated to the queried name, and 1 indicates that it is related.

Embodiment 3

A person search system based on a search engine according to embodiment 1 or 2, differing in that:

The feature extraction module comprises a person-related attribute extraction module, a person relationship extraction module, and a text vectorization module. The data preprocessing module is connected to the person-related attribute extraction module and to the person relationship extraction module, and both of these modules are connected to the text vectorization module. The person-related attribute extraction module uses rule matching and template matching to extract 20 dimensions of person attributes from each web page: birthplace, occupation, alma mater, date of birth, nationality, gender, names of works, personal experience, political affiliation, education, religious belief, height, weight, email, marital status, ethnicity, achievements, blood type, hobbies, and phone.

Rule matching extracts person-related attributes from the web page text using regular expressions. For example, if the text before a colon ":" is one of the 20 specified attributes, the text after the colon is the value of that attribute; an 11-digit number is used to match a telephone number.
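The two rules mentioned above (attribute name before a colon, and an 11-digit number as a phone number) can be sketched with Python's `re` module. The attribute subset in `ATTRIBUTE_NAMES`, the exact patterns, and the function name are hypothetical; the text does not give the regular expressions that are actually used.

```python
import re

# Hypothetical subset of the 20 attribute names, lower-cased for lookup.
ATTRIBUTE_NAMES = {"birthplace", "occupation", "gender"}

# "name: value" with either an ASCII or a full-width colon.
PAIR_RE = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")
# Exactly 11 consecutive digits, not embedded in a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)\d{11}(?!\d)")

def extract_by_rules(lines):
    """Extract attribute values from text lines using the two rules."""
    attrs = {}
    for line in lines:
        pair = PAIR_RE.match(line)
        if pair and pair.group(1).lower() in ATTRIBUTE_NAMES:
            attrs[pair.group(1).lower()] = pair.group(2)
        phone = PHONE_RE.search(line)
        if phone:
            attrs.setdefault("phone", phone.group())
    return attrs

found = extract_by_rules([
    "Birthplace: Wuhan",
    "Gender：female",
    "Contact 13800000000 today",
    "Note: nothing",
])
```

Lines whose text before the colon is not a known attribute name, such as the last one, are ignored.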

Template matching extracts the 20 dimensions of person attributes by matching each sentence in the web page text against manually formulated templates, for example: <Name> was born on <Date of birth>; <Name> works as <Occupation>.
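One simple way to apply such templates is to turn each one into a regular expression in which every <slot> becomes a capture group. This is only a sketch under that assumption: the function name `match_template` and the English template wording are hypothetical, and real templates would be written for the language of the web pages.

```python
import re

# A slot is any text enclosed in angle brackets, e.g. <Name>.
SLOT_RE = re.compile(r"<([^<>]+)>")

def match_template(template, sentence):
    """Match one sentence against one template.

    Returns {slot name: matched text}, or None if the sentence does not fit.
    """
    # With a capturing group, split() alternates literal text and slot names.
    parts = SLOT_RE.split(template)
    literals, slots = parts[0::2], parts[1::2]
    # Literal text is escaped; each slot becomes a non-greedy capture group.
    pattern = "(.+?)".join(re.escape(literal) for literal in literals)
    match = re.fullmatch(pattern, sentence)
    return dict(zip(slots, match.groups())) if match else None

born = match_template("<Name> was born on <Date of birth>",
                      "Li Na was born on 26 February 1982")
job = match_template("<Name> works as <Occupation>", "Li Na sings")
```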

The person relation extraction module identifies the person-name entities in each web page with a named entity recognizer, counts the number of occurrences of each name entity and its distance from the queried name, and judges the importance of each name entity from its occurrence count and its distance from the queried name.

The distance between a name entity and the queried name is computed as follows: if the queried name and the name entity appear in the same visual block, the distance is 0; otherwise, the distance is 1.

The importance of an entity is computed from its occurrence count and its distance from the queried name as: importance = (number of occurrences of the name entity) + (1 - distance between the name entity and the queried name).
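The distance and importance rules can be written down directly. In the sketch below each visual block is represented by the list of person names recognized in it; that input format is an assumption, as is treating the distance as 0 when the entity shares at least one block with the queried name.

```python
from collections import Counter
from typing import Dict, List


def entity_importance(blocks: List[List[str]], query: str) -> Dict[str, int]:
    """importance = occurrence count + (1 - distance), distance being 0 or 1."""
    counts = Counter()
    shares_block_with_query = set()
    for names in blocks:
        others = [name for name in names if name != query]
        counts.update(others)
        if query in names:
            # Assumption: one shared block is enough to set the distance to 0.
            shares_block_with_query.update(others)
    importance = {}
    for name, count in counts.items():
        distance = 0 if name in shares_block_with_query else 1
        importance[name] = count + (1 - distance)
    return importance
```

For example, an entity that occurs twice and shares a block with the queried name scores 2 + (1 - 0) = 3, while one that occurs once in a different block scores 1 + (1 - 1) = 1.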

The text vectorization module first segments the person attribute values extracted from the web page into words and collects the nouns among them. It then segments the web page text, removes stop words, and counts the word frequencies of each page to construct the vector representation of the page text, that is, the word-frequency statistics of the page: {(word_1, count_1), (word_2, count_2), ..., (word_n, count_n)}, where word_i is the i-th word in the page and count_i is the number of times the i-th word occurs in the page. Finally, it looks up, one by one, the entries of the page vector that correspond to the person attribute values and entities, and appropriately increases their weights according to the attribute values and the importance of the entities.
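A minimal sketch of the vectorization step, assuming the page has already been segmented into tokens. The stop-word list is a toy example, and adding a fixed bonus to the count is only one possible reading of "appropriately increases their weights"; the text does not specify the adjustment.

```python
from collections import Counter
from typing import Dict, Iterable

# Tiny illustrative stop-word list (assumed).
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "at"}


def page_vector(tokens: Iterable[str], feature_bonus: Dict[str, float]) -> Dict[str, float]:
    """Word-frequency vector of one page, with extracted feature words boosted."""
    counts = Counter(token for token in tokens if token not in STOP_WORDS)
    vector = {word: float(count) for word, count in counts.items()}
    # Assumption: the boost is additive, e.g. derived from the entity importance.
    for word, bonus in feature_bonus.items():
        if word in vector:
            vector[word] += bonus
    return vector
```

Feature words that do not occur in the page are ignored here, so the boost only changes dimensions that already exist in the page vector.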

Embodiment 4

A person search method based on a search engine, as shown in Fig. 2, comprising:

(2) Filter out of the web page set the pages unrelated to the name, segment the pages remaining in the filtered set into blocks to obtain multiple visual blocks for each page, and filter out the visual blocks unrelated to the name with a supervised classification algorithm; comprising:

C. Filter out the name-irrelevant visual blocks with the SVM classification algorithm: the input is the vector representation of each visual block, composed of the TF-IDF values of the text in the block, the size and position of the block, and the in-link/out-link ratio feature; the output is 0 or 1, where 0 indicates that the visual block is unrelated to the queried name and 1 indicates that it is related; the visual blocks unrelated to the queried name are removed.

(3) Extract from the visual blocks the attributes and entities related to the queried person, an entity being a person name appearing in the web page; count the word frequencies in the name-related visual blocks of each page and construct the vector representation of each page in the form (x, y), where x is a word in the page text after the name-irrelevant visual blocks have been filtered out and y is the number of times that word occurs in the page; according to the extracted person attribute values and name entities, appropriately increase the values of the dimensions of the vector space that correspond to the extracted feature words; comprising:

A. Use rule matching and template matching to extract 20 person attributes from each web page: birthplace, occupation, alma mater, date of birth, ethnicity, gender, works, personal experience, political affiliation, education, religious belief, height, weight, e-mail, marital status, nationality, achievements, blood type, hobbies, and phone number.

B. Identify the person-name entities in each web page with a named entity recognizer, count the number of occurrences of each name entity and its distance from the queried name, and judge the importance of the entity from its occurrence count and its distance from the queried name;

C. Segment the 20 person attributes extracted from the web page into words and collect the nouns among them;

Embodiment 5

A person search method based on a search engine according to Embodiment 1, differing in that the parameter of the Dirichlet process is set to α, and the method comprises the following steps:

(1) Input the name to be searched;

(2) According to the input name, the crawler system crawls the web page data set returned by different search engines for the name;

(3) For the web page data set crawled by the crawler system, identify the person-name entities in each page with a named entity recognizer; if a page does not contain the queried name, or contains more than 5 person names different from the queried name, label the page directly as irrelevant; label all other pages as related to the name;
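The page-level filter of step (3) reduces to a simple predicate. The sketch assumes that the named entity recognizer has already produced the list of person names found on the page; the function and parameter names are hypothetical.

```python
from typing import Iterable


def is_name_related(names_on_page: Iterable[str], query: str, max_other_names: int = 5) -> bool:
    """Keep a page only if it mentions the queried name and at most 5 other names."""
    distinct = set(names_on_page)
    if query not in distinct:
        return False
    # Pages listing many different people (e.g. directories) are treated as irrelevant.
    return len(distinct - {query}) <= max_other_names
```

Repeated mentions of the same person are counted once, because the rule refers to the number of different names rather than the number of mentions.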

(4) For the name-related pages obtained in step (3), segment each page into blocks with the VIPS visual block algorithm;

(5) For the pages segmented in step (4), extract the visual blocks; form the vector representation of each visual block from the TF-IDF values of the text in the block, the size of the block (its length and width), the position of the block (its distance from the top edge of the page and its distance from the left edge of the page), and the in-link/out-link ratio feature; then filter out the name-irrelevant visual blocks in the page with the SVM classification algorithm;

(6) For the name-related visual blocks obtained in step (5), extract the 20 person attributes, the name entities, and the text from those blocks, construct the vector representation of the page text, and appropriately adjust the values of the vector according to the extracted person attributes and name entities;

(7) For the set of page text vectors constructed in step (6), perform text clustering with a Dirichlet process mixture model; the output is the list of category labels of the web page data set: [label_1, label_2, …, label_n], where label_i ∈ (1, n) and label_i ∈ N, with N denoting the final number of categories;
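The role of the parameter α can be illustrated with the Chinese restaurant process, a standard way of describing the prior over cluster assignments in a Dirichlet process mixture model. The sketch below computes only that prior; a full mixture model would also weight each option by the likelihood of the page vector under the cluster, which is omitted here. The function name is hypothetical and this is not presented as the inference procedure used by the invention.

```python
from typing import List


def crp_assignment_probs(cluster_sizes: List[int], alpha: float) -> List[float]:
    """Prior probabilities for the next page: one entry per existing cluster,
    followed by the probability of opening a new cluster."""
    denominator = sum(cluster_sizes) + alpha
    # An existing cluster is chosen in proportion to its current size.
    probs = [size / denominator for size in cluster_sizes]
    # A new cluster is opened with probability proportional to alpha.
    probs.append(alpha / denominator)
    return probs
```

A larger α makes new clusters more likely, so the number of distinct persons does not have to be fixed in advance.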

(8) According to the category labels obtained by clustering and the person attributes in each category, fuse the person attributes and construct a triple for each category: <[queried name of the i-th category, related names], [fused attribute list], [web page set of category i]>; then display the triples of all categories visually according to the importance of each real-world individual.
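Constructing the per-category triples might look like the sketch below. The page dictionary layout (`url`, `attributes`, `related`) is hypothetical, and fusing attributes by taking the union of the values seen in a category is an assumption, since the fusion rule is not spelled out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# ([queried name, related names...], {attribute: fused values}, [page URLs])
Triple = Tuple[List[str], Dict[str, List[str]], List[str]]


def build_triples(query: str, labels: List[int], pages: List[dict]) -> Dict[int, Triple]:
    """Group pages by cluster label and build one summary triple per category."""
    grouped = defaultdict(list)
    for label, page in zip(labels, pages):
        grouped[label].append(page)

    triples: Dict[int, Triple] = {}
    for label, members in grouped.items():
        related = sorted({name for page in members for name in page["related"]})
        fused = defaultdict(set)
        for page in members:
            for attribute, value in page["attributes"].items():
                fused[attribute].add(value)  # assumed fusion rule: union of values
        triples[label] = (
            [query] + related,
            {attribute: sorted(values) for attribute, values in fused.items()},
            [page["url"] for page in members],
        )
    return triples
```

Each triple then serves as the summary of one real-world person who shares the queried name.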

Claims

1. A person search system based on a search engine, characterized by comprising a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module connected in sequence;

A name to be searched is input, and the data acquisition module uses a crawler system to crawl the web page information returned by multiple search engines for the name, forming a web page set; the web page information refers to the web pages returned by the search engines for the name, each web page including a title, a URL, a summary, and the full page content;

The data preprocessing module filters out of the web page set the pages unrelated to the name, segments the pages remaining in the filtered set into blocks to obtain multiple visual blocks for each page, and uses a supervised classification algorithm to filter out the visual blocks unrelated to the name;

The feature extraction module extracts from the visual blocks the attributes and entities related to the queried person, an entity being a person name appearing in the web page; counts the word frequencies in the name-related visual blocks of each page; and constructs the vector representation of each page in the form (x, y), where x is a word in the page text after the name-irrelevant visual blocks have been filtered out and y is the number of times the word occurs in the page; according to the extracted person attribute values and name entities, the values of the dimensions of the vector space that correspond to the extracted feature words are appropriately increased;

The clustering module takes the vector representation of each page as input, clusters the page texts with a Dirichlet process mixture model, and outputs a list of page category labels.

2. The person search system based on a search engine according to claim 1, characterized in that the data preprocessing module comprises a data cleaning module, a web page segmentation module, and a person-related visual block extraction module connected in sequence; the data acquisition module is connected to the data cleaning module, and the person-related visual block extraction module is connected to the feature extraction module;

The data cleaning module uses a named entity recognizer to determine whether each page crawled by the crawler system contains the queried name: if a page does not contain the queried name, or contains more than 5 person names different from the queried name, the page is labeled directly as unrelated to the name; otherwise, the page is labeled as related to the name;

The web page segmentation module performs visual segmentation on the name-related pages obtained after cleaning by the data cleaning module: segmentation is implemented with the VIPS visual block algorithm, which outputs a six-tuple for each visual block segmented from the page, consisting of the distance from the top edge of the page, the distance from the left edge of the page, the length of the block, the width of the block, the number of the block, and the text in the block;

The person-related visual block extraction module filters out the name-irrelevant visual blocks with an SVM classification algorithm: the input for each visual block is a vector representation composed of the TF-IDF values of the text in the block, the size of the block (its length and width), the position of the block (its distance from the top edge of the page and its distance from the left edge of the page), and the in-link/out-link ratio feature; the output is 0 or 1, where 0 indicates that the visual block is unrelated to the queried name and 1 indicates that it is related.

3. The person search system based on a search engine according to claim 1 or 2, characterized in that the feature extraction module comprises a person attribute extraction module, a person relation extraction module, and a text vectorization module; the data preprocessing module is connected to the person attribute extraction module and to the person relation extraction module, and both of these modules are connected to the text vectorization module; the person attribute extraction module uses rule matching and template matching to extract several person attributes from each web page;

The person relation extraction module identifies the person-name entities in each web page with a named entity recognizer, counts the number of occurrences of each name entity and its distance from the queried name, and judges the importance of the entity from its occurrence count and its distance from the queried name;

The distance between a name entity and the queried name is computed as follows: if the queried name and the extracted name entity appear in the same visual block, the distance is 0; otherwise, the distance is 1;

The importance of an entity is computed from the occurrence count of the name entity in the page and its distance from the queried name as: (number of occurrences of the name entity) + (1 - distance between the name entity and the queried name);

The text vectorization module first segments the person attributes extracted from the web page into words and collects the nouns among them; it then segments the web page text, removes stop words, counts the word frequencies of each page, and constructs the vector representation of the page text; it looks up, one by one, the entries of the page vector that correspond to the person attribute values and entities, and appropriately increases their weights according to the attribute values and the importance of the entities.

4. The person search system based on a search engine according to claim 3, characterized in that 20 person attributes are extracted from each web page, the 20 attributes being birthplace, occupation, alma mater, date of birth, ethnicity, gender, works, personal experience, political affiliation, education, religious belief, height, weight, e-mail, marital status, nationality, achievements, blood type, hobbies, and phone number.

5. The person search system based on a search engine according to claim 3, characterized in that the crawler system is a distributed crawling system based on Scrapy-redis.

6. A person search method based on a search engine, characterized by comprising:

(2) Filtering out of the web page set the pages unrelated to the name, segmenting the pages remaining in the filtered set into blocks to obtain multiple visual blocks for each page, and filtering out the visual blocks unrelated to the name with a supervised classification algorithm;

(3) Extracting from the visual blocks the attributes and entities related to the queried person, an entity being a person name appearing in the web page; counting the word frequencies in the name-related visual blocks of each page and constructing the vector representation of each page in the form (x, y), where x is a word in the page text after the name-irrelevant visual blocks have been filtered out and y is the number of times the word occurs in the page; according to the extracted person attribute values and name entities, appropriately increasing the values of the dimensions of the vector space that correspond to the extracted feature words;

(4) Taking the vector representation of each page as input, clustering the page texts with a Dirichlet process mixture model, and outputting a list of page category labels.

7. The person search method based on a search engine according to claim 6, characterized in that step (2) comprises:

A. Using a named entity recognizer to determine whether each page crawled by the crawler system contains the queried name: if a page does not contain the queried name, or contains more than 5 person names different from the queried name, the page is labeled directly as unrelated to the name; otherwise, the page is labeled as related to the name;

B. Performing visual segmentation on the name-related pages obtained after the data cleaning of step A: segmentation is implemented with the VIPS visual block algorithm, which outputs a six-tuple for each visual block segmented from the page, consisting of the distance from the top edge of the page, the distance from the left edge of the page, the length of the block, the width of the block, the number of the block, and the text in the block;

C. Filtering out the name-irrelevant visual blocks with the SVM classification algorithm: the input is the vector representation of each visual block, composed of the TF-IDF values of the text in the block, the size of the block, the position of the block, and the in-link/out-link ratio feature; the output is 0 or 1, where 0 indicates that the visual block is unrelated to the queried name and 1 indicates that it is related; the visual blocks unrelated to the queried name are removed.

8. The person search method based on a search engine according to claim 6 or 7, characterized in that step (3) comprises:

A. Extracting several person attributes from each web page using rule matching and template matching;

B. Identifying the person-name entities in each web page with a named entity recognizer, counting the number of occurrences of each name entity and its distance from the queried name, and judging the importance of the entity from its occurrence count and its distance from the queried name;

D. Segmenting the web page text, removing stop words, counting the frequency of each word in each page, and constructing the vector representation of the page text;

E. Looking up, one by one, the entries of the page vector that correspond to the person attribute values and name entities, and appropriately increasing their weights according to the importance of the person attributes and name entities.

9. The person search method based on a search engine according to claim 8, characterized in that, in step A, 20 person attributes are extracted from each web page, the 20 attributes being birthplace, occupation, alma mater, date of birth, ethnicity, gender, works, personal experience, political affiliation, education, religious belief, height, weight, e-mail, marital status, nationality, achievements, blood type, hobbies, and phone number.