A person retrieval system and method based on a search engine
Technical field
The present invention relates to a person retrieval system and method based on a search engine, and belongs to the technical field of the Internet and search.
Background art
At present, the main difficulty in person retrieval is that the webpages returned for a queried name suffer from name ambiguity and cluttered information. Name disambiguation means distinguishing between multiple individuals who share the same name. Name ambiguity is widespread and hinders the dissemination of information and the acquisition of resources. The name search results provided by today's mainstream search engines usually mix the webpages of all people sharing the name with irrelevant webpages. These webpages are sorted according to certain rules, and information about people who attract more attention is more likely to rank near the top. For example, when "Li Na" is searched in the Baidu search engine, the top-ranked pages in the results concern people named Li Na whose identities include "tennis player", "singer", and "most beautiful cancer girl", while information about a "Li Na" who is an ordinary tutor is submerged in this sea of information, so the user has to spend a great deal of time checking and screening the results.
There are currently three classes of solutions to the above problem. First, supervised classification algorithms: a corpus is labeled manually and a suitable classifier model is selected to classify the webpage text. The number of classes in such methods is fixed, so they cannot adapt to dynamically growing data, and the quality of the classifier depends to some extent on the size of the labeled corpus. Second, unsupervised clustering algorithms, which are broadly divided into traditional clustering algorithms, clustering algorithms based on graph partitioning, and clustering algorithms based on network resources. Traditional clustering algorithms construct a vector space model of the webpage text and perform name disambiguation with K-Means or hierarchical clustering. Clustering algorithms based on graph partitioning take documents or features as nodes and the relations between documents or features as edges to construct a social relationship network, and then cluster it by graph partitioning. Clustering algorithms based on network resources first use network resources such as a Chinese thesaurus, the Yahoo web document classification hierarchy, and Wikipedia to alleviate data shortage and sparsity, and then apply a clustering algorithm to disambiguate the names. Third, hybrid models: a multi-step strategy combines several classification or clustering algorithms to perform name disambiguation. Because network information is diverse and uncertain, large-scale manually labeled corpora are lacking, and manual labeling is very time-consuming and laborious, unsupervised name disambiguation methods are in this sense preferable to supervised ones.
At present, research on name disambiguation relies mainly on text modeling. Preprocessing includes extracting person attributes and named entities, and the contextual information around the name is used to study the mapping between names and individuals. Observation shows, however, that much of the text in a webpage that lies far from the name, as well as some abstract information, is of great help to name disambiguation. For example, if two webpages both belong to a music topic, or both belong to the computer field, the two pages probably correspond to the same person. The present invention therefore models the whole webpage. In addition, current solutions cannot automatically identify the number of classes in a webpage set and require manual intervention.
Chinese patent document 102054029A discloses a method for disambiguating person information based on social networks and name context, and relates to the disambiguation of person information on the Internet. It addresses the problem in the prior art that a search engine's results for a specific name are often a mixture of webpages about the different people who share that name, and it is used for retrieving person information on the network. It comprises the following steps: first, the user inputs the name to be retrieved, the retrieval is completed with a search engine, and the retrieved pages are downloaded to the local computer with download software; second, text extraction, word segmentation, and part-of-speech tagging are applied to these webpages to form documents; third, the documents are first classified using person domain information, social networks and contextual information are then used to cluster the person domain information, and finally the correspondence between each piece of person domain information and the actual person is displayed, together with the social network of each actual person. However, that patent extracts, segments, and tags only the body text of the webpage to form documents. The webpages currently returned by search engines are complex in type and varied in structure, and sidebars and multi-level headings usually contain most of the information about the queried name. The method of that patent cannot extract name-related information outside the body text, which seriously degrades clustering. Its clustering algorithm needs to extract person domain information from the text, and the amount of information extracted strongly affects the clustering result. It also requires the clustering threshold to be specified manually, so manual intervention influences the clustering result.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a person retrieval system based on a search engine;
The present invention also provides a person retrieval method based on a search engine;
First, according to the actual layout of the webpage, the Vision-based Page Segmentation (VIPS) algorithm is used to divide the webpage into blocks, the text, position, and link features of each vision block are extracted, and an SVM algorithm filters out the vision blocks that are unrelated to the name. Then, a text clustering method based on the Dirichlet process mixture model is applied. According to word frequency statistics of the text, this method can automatically decide whether a document belongs to an existing class or to a newly generated class, so it identifies the number of classes in the webpage text set automatically, reduces the influence of manual intervention on clustering, and effectively resolves name ambiguity in the webpages returned for a queried name. Finally, a person summary is generated from the extracted attributes and person relations, which makes it more convenient for the user to retrieve a name.
Explanation of terms:
1. TF-IDF value: term frequency-inverse document frequency, a statistical method commonly used in natural language processing to assess how important a word is to one text in a text set or corpus. The importance of a word increases in proportion to the number of times it appears in the text, but decreases with the frequency at which it appears across the corpus.
2. VIPS vision block algorithm: Vision-based Page Segmentation.
3. SVM classification algorithm: Support Vector Machine.
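The TF-IDF value in term 1 can be illustrated with a minimal sketch. It assumes the common formulation tf × log(N / df); the specification does not state which variant is used, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Term frequency times inverse document frequency (assumed variant).
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical tokenized texts for three pages about different people.
docs = [
    ["li", "na", "tennis", "player", "tennis"],
    ["li", "na", "singer"],
    ["li", "na", "teacher", "university"],
]
scores = tf_idf(docs)
```

A word that occurs in every document, such as the queried name itself, receives a score of zero under this variant, while words that are frequent in one page and rare elsewhere score highest.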
The technical solution of the present invention is:
A person retrieval system based on a search engine, comprising a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module connected in sequence;
The name to be retrieved is input, and the data acquisition module uses a crawler system to crawl the webpage information returned by multiple search engines for the name, forming a webpage set. The webpage information refers to the webpages returned by a search engine for the name, each of which includes a title (title), url, summary (content), and the entire webpage;
The crawler engine first crawls the url of every result returned by the different search engines for the name, and the page download tool httrack then downloads the entire webpage at each url. Observation shows that only the first 10 pages of results returned by a search engine are strongly related to the name, so only the first 10 pages of results returned by each search engine are crawled.
The data preprocessing module filters out the webpages in the webpage set that are unrelated to the name, divides the remaining webpages into blocks to obtain multiple vision blocks for each webpage, and uses a supervised classification algorithm to filter out the vision blocks that are unrelated to the name;
A vision block is a block produced by applying the VIPS algorithm to a webpage;
A vision block includes pictures and six-tuple information <distance from the top edge of the webpage, distance from the left edge of the webpage, length of the vision block, width of the vision block, number of the vision block, text in the vision block>. Vision blocks unrelated to the name include advertisements, navigation, pop-up boxes, copyright information, and other vision blocks unrelated to the name.
The feature extraction module extracts from the vision blocks the attributes and entities related to the retrieved person, where an entity is a person name appearing in the webpage. It counts the word frequencies in the name-related vision blocks of each webpage and constructs the vector representation of each webpage. The vector representation consists of pairs (x, y), where x is a word in the webpage text after the name-unrelated vision blocks are filtered out and y is the number of times the word appears in the webpage. According to the extracted person attributes and entities, the values of the corresponding dimensions of the extracted feature words in the vector space are increased appropriately;
The clustering module takes the vector representation of each webpage as input, clusters the webpage texts with a Dirichlet process mixture model, and outputs a list of webpage class labels. The Dirichlet process mixture model can automatically identify the number of classes in the webpage text set without manual intervention.
Preferably according to the present invention, the data preprocessing module comprises a data cleaning module, a webpage blocking module, and a person-related vision block extraction module connected in sequence; the data acquisition module is connected to the data cleaning module, and the person-related vision block extraction module is connected to the feature extraction module;
The data cleaning module uses a named entity recognizer to identify whether each webpage crawled by the crawler system contains the retrieved name: if a webpage does not contain the retrieved name, or contains more than 5 names different from the retrieved name, the webpage is directly marked as unrelated to the name; otherwise, it is marked as related to the name;
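The cleaning rule above can be sketched as a small predicate. This is an illustration under stated assumptions, not the specification's implementation: the names are assumed to come from some named entity recognizer that is not shown, "more than 5 names" is read as more than 5 distinct other names, and `is_name_related` and `max_other_names` are hypothetical identifiers.

```python
def is_name_related(person_names, query_name, max_other_names=5):
    """Apply the cleaning rule to the person names recognized in one webpage.

    person_names: names found by a named entity recognizer (assumed input).
    Returns True if the page should be kept as name-related.
    """
    # Rule 1: the page must mention the retrieved name at all.
    if query_name not in person_names:
        return False
    # Rule 2 (assumed reading): more than 5 distinct other names means unrelated.
    other_names = set(person_names) - {query_name}
    return len(other_names) <= max_other_names
```

For example, `is_name_related(["Li Na", "Zhang Wei", "Li Na"], "Li Na")` keeps the page, while a page listing the retrieved name among six or more other people, such as a staff directory, is discarded.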
The webpage blocking module performs vision-based blocking on the name-related webpages obtained after cleaning by the data cleaning module: the VIPS vision block algorithm divides each webpage into blocks and outputs the six-tuple information of every vision block in the webpage, which includes the distance from the top edge of the webpage, the distance from the left edge of the webpage, the length of the vision block, the width of the vision block, the number of the vision block, and the text in the vision block;
Because the vision blocks of a webpage include advertisements, navigation, pop-up boxes, copyright information, and other vision blocks unrelated to the name, the person-related vision block extraction module filters out the name-unrelated vision blocks with an SVM classification algorithm. The input is the vector representation of a vision block, formed from the TF-IDF values of the text in the block, the size of the block, the position of the block, and the in-link to out-link ratio feature; the size of a vision block comprises its length and width, and its position is represented by its distances from the top edge and the left edge of the webpage. The output is 0 or 1, where 0 indicates that the vision block is unrelated to the retrieved name and 1 indicates that it is related.
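As a rough illustration of the classifier input described above, the sketch below assembles one feature vector per vision block from its six-tuple. It covers only feature construction; training and applying the SVM are not shown. The class `VisionBlock`, the function `block_features`, the feature order, and the smoothed definition of the link ratio are all assumptions, since the specification does not define them.

```python
from dataclasses import dataclass

@dataclass
class VisionBlock:
    """Six-tuple produced for each vision block (field names are hypothetical)."""
    top: float      # distance from the top edge of the webpage
    left: float     # distance from the left edge of the webpage
    length: float   # length of the vision block
    width: float    # width of the vision block
    number: int     # number of the vision block
    text: str       # text in the vision block

def block_features(block, tfidf_values, links_in, links_out):
    """Concatenate text, size, position, and link features into one vector."""
    # Assumption: the link feature is in-links over out-links, with +1
    # smoothing so that blocks without out-links do not divide by zero.
    link_ratio = (links_in + 1) / (links_out + 1)
    # tfidf_values is assumed to follow a fixed vocabulary order.
    return list(tfidf_values) + [
        block.length, block.width, block.top, block.left, link_ratio,
    ]

block = VisionBlock(120.0, 40.0, 600.0, 200.0, 3, "Li Na, professor")
vector = block_features(block, [0.0, 0.44], links_in=2, links_out=4)
```

Vectors of this shape, paired with 0/1 labels, could then be passed to any SVM implementation.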
Preferably according to the present invention, the feature extraction module comprises a person attribute extraction module, a person relation extraction module, and a text vectorization module; the data preprocessing module is connected to the person attribute extraction module and the person relation extraction module respectively, and both of these modules are connected to the text vectorization module. The person attribute extraction module extracts person attributes of several dimensions from each webpage by rule matching and template matching.
The person relation extraction module uses a named entity recognizer to identify the name entities in each webpage, counts the number of times each name entity appears and its distance from the retrieved name, and judges the importance of the entity from the number of occurrences and the distance from the retrieved name; the name entity is the entity referred to above;
The distance between a name entity and the retrieved name is calculated as follows: if the retrieved name and the extracted name entity appear in the same vision block, the distance is 0; otherwise, the distance is 1;
The importance of a name entity is calculated from its number of occurrences in the webpage and its distance from the retrieved name as: number of occurrences of the name entity + (1 - distance between the name entity and the retrieved name);
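The distance and importance calculations above amount to the following sketch. Representing each name by the numbers of the vision blocks it occurs in is an assumption made for illustration, and the function names are hypothetical.

```python
def name_distance(entity_blocks, query_blocks):
    """Return 0 if the entity shares a vision block with the retrieved name, else 1.

    Both arguments are collections of vision block numbers (assumed representation).
    """
    return 0 if set(entity_blocks) & set(query_blocks) else 1

def entity_importance(count, distance):
    """Importance = number of occurrences + (1 - distance)."""
    return count + (1 - distance)
```

A name entity that occurs 4 times and shares a block with the retrieved name therefore scores 5, while one that occurs 4 times in other blocks only scores 4.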
The text vectorization module first performs word segmentation on the person attributes extracted from the webpage and collects the nouns among them. It then segments the webpage text, removes stop words, counts the word frequencies of each webpage, and constructs the vector representation of the webpage text, that is, the word frequency statistics of the words in the webpage text: {(word_1, count_1), (word_2, count_2), ..., (word_n, count_n)}, where word_i denotes the i-th word in the webpage and count_i denotes the number of times the i-th word appears in the webpage. Finally, the values of the words corresponding to the person attributes and entities are looked up one by one in the vector representation of the webpage text, and their weights are increased appropriately according to the importance of the person attribute values and entities.
The importance of a person attribute reflects the fact that different attributes distinguish people to different degrees. For the more discriminative attributes (gender, school of graduation, title of works, educational background, height, weight, mailbox, phone, date of birth) the added weight is 5; for the other 11 attributes the added weight is 3. The weights of the name entities and person attributes are added to the values of the corresponding words in the webpage text vector to obtain the final vector representation of the webpage text.
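The weighting step can be sketched as follows, assuming the webpage text has already been segmented and stripped of stop words. The attribute keys, the function name `weighted_vector`, and the choice to add the entity importance score directly as the entity weight are assumptions made for illustration.

```python
from collections import Counter

HIGH_WEIGHT = 5  # added for the more discriminative attributes
LOW_WEIGHT = 3   # added for the other 11 attributes
# Hypothetical keys for the nine more discriminative attributes.
HIGH_ATTRS = {"gender", "school", "works", "education", "height",
              "weight", "mailbox", "phone", "birth_date"}

def weighted_vector(tokens, attributes, entity_scores):
    """Build the word-frequency vector of one webpage and boost key words.

    tokens: words of the page after segmentation and stop-word removal.
    attributes: {attribute name: [nouns from the attribute value]}.
    entity_scores: {name entity: importance score}.
    """
    vector = Counter(tokens)
    for attr, words in attributes.items():
        boost = HIGH_WEIGHT if attr in HIGH_ATTRS else LOW_WEIGHT
        for word in words:
            if word in vector:  # only words found in the page vector are boosted
                vector[word] += boost
    for name, score in entity_scores.items():
        if name in vector:
            vector[name] += score
    return dict(vector)

page = ["li", "na", "tennis", "female", "beijing", "tennis"]
vec = weighted_vector(
    page,
    {"gender": ["female"], "birthplace": ["beijing"]},
    {"jiang": 3, "na": 2},
)
```

Here "female" rises from 1 to 6 and "beijing" from 1 to 4, so pages that agree on discriminative attribute values end up closer together in the vector space.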
Preferably according to the present invention, 20-dimensional person attributes are extracted from each webpage, comprising birthplace, occupation, school of graduation, date of birth, nationality, gender, title of works, personal experience, political affiliation, educational background, religious belief, height, weight, mailbox, marital status, ethnicity, achievements, blood type, hobbies, and phone.
Preferably according to the present invention, the crawler system is a distributed crawling system based on Scrapy-redis.
A person retrieval method based on a search engine, comprising:
(1) crawling, with a crawler system, the webpage information returned by multiple search engines for the name, forming a webpage set;
(2) filtering out the webpages in the webpage set that are unrelated to the name, dividing the remaining webpages into blocks to obtain multiple vision blocks for each webpage, and filtering out the vision blocks in each webpage that are unrelated to the name with a supervised classification algorithm;
(3) extracting from the vision blocks the attributes and entities related to the retrieved person, where an entity is a person name appearing in the webpage; counting the word frequencies in the name-related vision blocks of each webpage and constructing the vector representation of each webpage, which consists of pairs (x, y), where x is a word in the webpage text after the name-unrelated vision blocks are filtered out and y is the number of times the word appears in the webpage; and, according to the extracted person attributes and entities, appropriately increasing the values of the corresponding dimensions of the extracted feature words in the vector space;
(4) taking the vector representation of each webpage as input, clustering the webpage texts with a Dirichlet process mixture model, and outputting a list of webpage class labels. The Dirichlet process mixture model can automatically identify the number of classes in the webpage text set without manual intervention.
Preferably according to the present invention, step (2) comprises:
A. using a named entity recognizer to identify whether each webpage crawled by the crawler system contains the retrieved name: if a webpage does not contain the retrieved name, or contains more than 5 names different from the retrieved name, the webpage is directly marked as unrelated to the name; otherwise, it is marked as related to the name;
B. performing vision-based blocking on the name-related webpages obtained after the data cleaning of step A: the VIPS vision block algorithm divides each webpage into blocks and outputs the six-tuple information of every vision block in the webpage, which includes the distance from the top edge of the webpage, the distance from the left edge of the webpage, the length of the vision block, the width of the vision block, the number of the vision block, and the text in the vision block;
C. filtering out the name-unrelated vision blocks with an SVM classification algorithm: the input is the vector representation of a vision block, formed from the TF-IDF values of the text in the block, the size and position of the block, and the in-link to out-link ratio feature; the output is 0 or 1, where 0 indicates that the vision block is unrelated to the retrieved name and 1 indicates that it is related; the name-unrelated vision blocks are then removed.
Preferably according to the present invention, step (3) comprises:
a. extracting 20-dimensional person attributes from each webpage by rule matching and template matching;
b. identifying the name entities in each webpage with a named entity recognizer, counting the number of times each name entity appears and its distance from the retrieved name, and judging the importance of the name entity from the number of occurrences and the distance from the retrieved name;
c. performing word segmentation on the person attributes extracted from the webpage and collecting the nouns among them;
d. segmenting the webpage text, removing stop words, counting the word frequencies of the words in each webpage, and constructing the vector representation of the webpage text;
e. looking up one by one, in the vector representation of the webpage text, the values of the words corresponding to the person attribute values and name entities, and appropriately increasing their weights according to the importance of the person attributes and name entities.
Preferably according to the present invention, in step a, 20-dimensional person attributes are extracted from each webpage, comprising birthplace, occupation, school of graduation, date of birth, nationality, gender, title of works, personal experience, political affiliation, educational background, religious belief, height, weight, mailbox, marital status, ethnicity, achievements, blood type, hobbies, and phone.
The beneficial effects of the present invention are:
1. The present invention provides a person retrieval method based on Dirichlet process mixture text clustering. The method effectively resolves the name ambiguity and cluttered information in the webpages returned when a person is retrieved, and constructs a person summary from the extracted person attributes and person relations, making it more convenient for the user to retrieve a name.
2. The present invention provides a method for extracting person-related information from heterogeneous webpages (heterogeneous webpages are webpages of different types, such as mhkc, news, forums, school and government websites, blogs, and finance). The webpage is first divided into blocks with the VIPS algorithm; the vector representation of each vision block is then built from its six-tuple and link information; and an SVM algorithm classifies each vision block as name-related or name-unrelated, which effectively prevents name-unrelated information in the webpage from affecting name disambiguation.
Brief description of the drawings
Fig. 1 is a structural diagram of the person retrieval system based on a search engine according to the present invention;
Fig. 2 is a flow diagram of the person retrieval method based on a search engine according to the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and embodiments, but is not limited thereto.
Embodiment 1
A person retrieval system based on a search engine, as shown in Fig. 1, comprising a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module connected in sequence;
The name to be retrieved is input, and the data acquisition module uses a distributed crawling system based on Scrapy-redis to crawl the webpage information returned by multiple search engines for the name, forming a webpage set. The webpage information refers to the webpages returned by a search engine for the name, each of which includes a title (title), url, summary (content), and the entire webpage;
The crawler engine first crawls the url of every result returned by the different search engines for the name, and the page download tool httrack then downloads the entire webpage at each url. Observation shows that only the first 10 pages of results returned by a search engine are strongly related to the name, so only the first 10 pages of results returned by each search engine are crawled.
The data preprocessing module filters out the webpages in the webpage set that are unrelated to the name, divides the remaining webpages into blocks to obtain multiple vision blocks for each webpage, and uses a supervised classification algorithm to filter out the vision blocks that are unrelated to the name;
A vision block is a block produced by applying the VIPS algorithm to a webpage;
A vision block includes pictures and six-tuple information <distance from the top edge of the webpage, distance from the left edge of the webpage, length of the vision block, width of the vision block, number of the vision block, text in the vision block>. Vision blocks unrelated to the name include advertisements, navigation, pop-up boxes, copyright information, and other vision blocks unrelated to the name.
The feature extraction module extracts from the vision blocks the attributes and entities related to the retrieved person, where an entity is a person name appearing in the webpage. It counts the word frequencies in the name-related vision blocks of each webpage and constructs the vector representation of each webpage. The vector representation consists of pairs (x, y), where x is a word in the webpage text after the name-unrelated vision blocks are filtered out and y is the number of times the word appears in the webpage. According to the extracted person attribute values and name entities, the values of the corresponding dimensions of the extracted feature words in the vector space are increased appropriately;
The clustering module takes the vector representation of each webpage as input, clusters the webpage texts with a Dirichlet process mixture model, and outputs a list of webpage class labels. The Dirichlet process mixture model can automatically identify the number of classes in the webpage text set without manual intervention.
The Dirichlet process mixture model can be understood as an infinite mixture model with infinitely many components, and is the limiting form of a finite mixture model under a Dirichlet process prior assumption. Assume that the sample set of the model (each element being the vector representation of one webpage) is $X = \{x_1, x_2, \ldots, x_n\}$, whose elements are independent and identically distributed variables obeying the following distributions:
$G \sim \mathrm{DP}(\alpha, H)$ (1)
$\theta_i \mid G \sim G$ (2)
$x_i \mid \theta_i \sim F(\theta_i)$ (3)
The observed variable $x_i$ follows the distribution $F(\theta_i)$ with parameter $\theta_i$; $G$ is the prior distribution of the parameter $\theta_i$, and $G$ is a probability measure drawn from a Dirichlet process with parameter $\alpha$ and base distribution $H$. If samples $x_i$ and $x_j$ have the same parameter, the two samples are clustered into the same class;
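Why such a model does not need the number of classes in advance can be seen from the Chinese restaurant process, a standard equivalent view of the cluster assignments implied by a Dirichlet process prior. The sketch below samples assignments from that prior only; it ignores the likelihood F and is not the inference procedure of this specification, which is not described here. The function name `crp_assignments` and its parameters are hypothetical.

```python
import random

def crp_assignments(n, alpha, seed=0):
    """Sample class labels for n items from a Chinese restaurant process prior.

    alpha must be positive; a larger alpha makes new classes more likely.
    """
    rng = random.Random(seed)
    labels, sizes = [], []
    for _ in range(n):
        # An existing class k is chosen with weight sizes[k],
        # a brand-new class with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new class
        else:
            sizes[k] += 1     # join an existing class
        labels.append(k)
    return labels

labels = crp_assignments(50, alpha=1.0)
num_classes = len(set(labels))
```

The number of classes is an outcome of sampling rather than an input. In a full mixture model these prior weights would be multiplied by how well each class explains the word-frequency vector of the webpage.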
Embodiment 2
A person retrieval system based on a search engine according to Embodiment 1, with the following differences:
The data preprocessing module comprises a data cleaning module, a webpage blocking module, and a person-related vision block extraction module connected in sequence; the data acquisition module is connected to the data cleaning module, and the person-related vision block extraction module is connected to the feature extraction module;
The data cleaning module uses a named entity recognizer to identify whether each webpage crawled by the crawler system contains the retrieved name: if a webpage does not contain the retrieved name, or contains more than 5 names different from the retrieved name, the webpage is directly marked as unrelated to the name; otherwise, it is marked as related to the name;
The webpage blocking module performs vision-based blocking on the name-related webpages obtained after cleaning by the data cleaning module: the VIPS vision block algorithm divides each webpage into blocks and outputs the six-tuple information of every vision block in the webpage, which includes the distance from the top edge of the webpage, the distance from the left edge of the webpage, the length of the vision block, the width of the vision block, the number of the vision block, and the text in the vision block;
Because the vision blocks of a webpage include advertisements, navigation, pop-up boxes, copyright information, and other vision blocks unrelated to the name, the person-related vision block extraction module filters out the name-unrelated vision blocks with an SVM classification algorithm. The input is the vector representation of a vision block, formed from the TF-IDF values of the text in the block, the size of the block, the position of the block, and the in-link to out-link ratio feature; the size of a vision block comprises its length and width, and its position is represented by its distances from the top edge and the left edge of the webpage. The output is 0 or 1, where 0 indicates that the vision block is unrelated to the retrieved name and 1 indicates that it is related.
Embodiment 3
A person retrieval system based on a search engine according to Embodiment 1 or 2, with the following differences:
The feature extraction module comprises a person attribute extraction module, a person relation extraction module, and a text vectorization module; the data preprocessing module is connected to the person attribute extraction module and the person relation extraction module respectively, and both of these modules are connected to the text vectorization module. The person attribute extraction module extracts 20-dimensional person attributes from each webpage by rule matching and template matching. The 20-dimensional person attributes comprise birthplace, occupation, school of graduation, date of birth, nationality, gender, title of works, personal experience, political affiliation, educational background, religious belief, height, weight, mailbox, marital status, ethnicity, achievements, blood type, hobbies, and phone.
Rule matching uses regular expressions to extract person attributes from the webpage text. For example, if the text before a colon ":" is one of the 20 specified attributes, the text after the colon is the value of that attribute; an 11-digit number is matched as a telephone number.
Template matching matches the 20-dimensional person attributes in each sentence of the webpage text against manually formulated templates, for example: <Name> was born on <Date of birth>; <Name> works as <Occupation>.
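The rule and template matching described above might look like the following sketch, written for English text for readability. The attribute subset, the exact regular expressions, the English template wording, and the function name `extract_attributes` are all assumptions; the specification gives only the colon rule, the 11-digit phone rule, and two example templates.

```python
import re

# Hypothetical subset of the 20 specified attribute names.
ATTRIBUTES = {"birthplace", "occupation", "gender"}

# Rule: "<attribute>: <value>" (ASCII or full-width colon).
KEY_VALUE = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")
# Rule: a run of exactly 11 digits is treated as a telephone number.
PHONE = re.compile(r"(?<!\d)\d{11}(?!\d)")
# Template: "<Name> was born on <Date of birth>".
BORN = re.compile(r"(?P<name>.+?) was born on (?P<date>[^.]+)")

def extract_attributes(lines):
    """Return {attribute: value} found in a list of text lines."""
    found = {}
    for line in lines:
        kv = KEY_VALUE.match(line)
        if kv and kv.group(1).lower() in ATTRIBUTES:
            found[kv.group(1).lower()] = kv.group(2)
        phone = PHONE.search(line)
        if phone:
            found["phone"] = phone.group(0)
        born = BORN.search(line)
        if born:
            found["birth_date"] = born.group("date")
    return found

sample = [
    "Gender: female",
    "Motto: never give up",
    "Contact 13812345678",
    "Li Na was born on 26 February 1982.",
]
attrs = extract_attributes(sample)
```

The "Motto" line is ignored because its key is not a specified attribute, which is what keeps the colon rule from collecting arbitrary key-value text.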
The person relation extraction module uses a named entity recognizer to identify the name entities in each webpage, counts the number of times each name entity appears and its distance from the retrieved name, and judges the importance of the name entity from the number of occurrences and the distance from the retrieved name;
The distance between a name entity and the retrieved name is calculated as follows: if the retrieved name and the name entity appear in the same vision block, the distance is 0; otherwise, the distance is 1;
The importance of a name entity is calculated from its number of occurrences and its distance from the retrieved name as: number of occurrences of the name entity + (1 - distance between the name entity and the retrieved name);
The character attribute extracted in webpage is first carried out word segmentation processing by text vector module, counts noun therein;Again
Web page text is segmented, removes stop words, and counts the word frequency of each webpage, constructs the vector representation form of web page text;I.e.:Net
In page in text word word frequency statistics:{(word1,count1),(word2,count2) ..., (wordn,countn), wordi
Represent i-th of word in webpage, countiRepresent the frequency that i-th of word occurs in webpage;Finally, web page text is searched one by one
Vector representation form and character attribute and the corresponding word of entity value, and according to character attribute value and the significance level of entity
Appropriate increase weights.
The importance of a person attribute reflects the fact that different attributes discriminate between persons to different degrees. For the attributes with higher discriminating power, namely gender, school graduated from, title of works, educational background, height, weight, e-mail address, phone number and date of birth, the added weight is 5; for the other 11 attributes the added weight is 3. The weights of the name entities and person attributes are added to the values of the corresponding words in the webpage text vector to obtain the final vector representation of the webpage text.
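As a rough illustration of this weighting scheme, the following Python sketch builds a word-frequency vector and adds the attribute and entity weights to it. The attribute keys, the function name `weighted_vector` and the choice to boost only words already present in the vector are assumptions for illustration; word segmentation and stop-word removal are taken to have been done beforehand.

```python
from collections import Counter

# Attributes named in the description as having higher discriminating power.
# The key spellings are hypothetical identifiers, not taken from the source.
HIGH_WEIGHT_ATTRS = {
    "gender", "graduated_school", "work_title", "education", "height",
    "weight", "email", "phone", "date_of_birth",
}
HIGH_WEIGHT = 5      # added weight for the attributes listed above
DEFAULT_WEIGHT = 3   # added weight for the remaining 11 attributes


def weighted_vector(tokens, attributes, entity_scores):
    """Return {word: value} for one webpage.

    tokens:        segmented webpage text with stop words already removed.
    attributes:    {attribute_name: [tokens of the extracted attribute value]}.
    entity_scores: {name_entity: importance score}.
    """
    vector = Counter(tokens)  # plain word frequencies
    for attr, value_tokens in attributes.items():
        bonus = HIGH_WEIGHT if attr in HIGH_WEIGHT_ATTRS else DEFAULT_WEIGHT
        for token in value_tokens:
            if token in vector:  # assumption: only existing words are boosted
                vector[token] += bonus
    for name, score in entity_scores.items():
        if name in vector:
            vector[name] += score
    return dict(vector)


if __name__ == "__main__":
    tokens = ["tennis", "tennis", "Huazhong", "Jiang"]
    attrs = {"occupation": ["tennis"], "graduated_school": ["Huazhong"]}
    print(weighted_vector(tokens, attrs, {"Jiang": 2}))
```

In this example the occupation word receives the lower bonus of 3 and the school word the higher bonus of 5, so the school contributes more to the similarity between two webpages.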
Embodiment 4
A person search method based on a search engine, as shown in Fig. 2, comprising:
(1) Using a crawler system, crawl the webpage information returned by multiple search engines for the query name to form a webpage set;
(2) Filter out the webpages in the webpage set that are unrelated to the name, divide each remaining webpage into blocks to obtain multiple vision blocks per webpage, and filter out the vision blocks unrelated to the name using a supervised classification algorithm; including:
A. Use a named entity recognizer to determine whether each webpage crawled by the crawler system contains the query name: if a webpage does not contain the query name, or contains more than 5 names different from the query name, the webpage is directly labeled as unrelated to the name; otherwise, the webpage is labeled as related to the name;
B. Perform visual segmentation on the name-related webpages obtained after the data cleaning of step A: the webpage is segmented with the VIPS visual segmentation algorithm, which outputs a six-tuple for each vision block of the webpage, consisting of: the distance from the top edge of the webpage, the distance from the left edge of the webpage, the length of the vision block, the width of the vision block, the vision block number, and the text in the vision block;
C. Filter out the vision blocks unrelated to the name using the SVM classification algorithm, i.e.: the input is the vector representation of a vision block, formed from the TF-IDF values of the text in the block, the size and position of the block, and the ratio of its inbound links to its outbound links; the output is 0 or 1, where 0 means the vision block is unrelated to the query name and 1 means the vision block is related to the query name; the vision blocks unrelated to the query name are removed.
(3) Extract from the vision blocks the attributes and entities related to the queried person, where an entity is a person name appearing in the webpage; count the word frequencies in the name-related vision blocks of each webpage and construct the vector representation of each webpage, in the form (x, y), where x is a word in the webpage text after the name-unrelated vision blocks have been filtered out and y is the number of times that word occurs in the webpage; according to the extracted person attribute values and name entities, appropriately increase the values of the dimensions of the vector space that correspond to the extracted feature words; including:
A. Using rule-based and template-matching methods, extract the 20-dimension person attributes from each webpage; the 20-dimension person attributes include birthplace, occupation, school graduated from, date of birth, nationality, gender, title of works, personal experience, political affiliation, educational background, religious belief, height, weight, e-mail address, marital status, ethnicity, achievements, blood type, hobbies and phone number;
B. Identify the name entities in each webpage using a named entity recognizer, count the number of times each name entity occurs and its distance from the query name, and judge the importance of the entity from its occurrence count and its distance from the query name;
C. Perform word segmentation on the 20-dimension person attributes extracted from the webpage and count the nouns therein;
D. Segment the webpage text, remove stop words, count the frequency of each word in each webpage, and construct the vector representation of the webpage text;
E. Look up one by one the values of the words in the vector representation of the webpage text that correspond to person attribute values and name entities, and appropriately increase their weights according to the importance of the person attributes and name entities.
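For the method of this embodiment, the page-level rule of step (2) A and the classifier input of step (2) C can be sketched in Python as follows. The names `is_name_related`, `VisionBlock` and `block_features` are hypothetical, as is the reading that the threshold of 5 applies to distinct names; the TF-IDF values and the link ratio are assumed to be computed elsewhere, and the SVM itself is not shown.

```python
from dataclasses import dataclass

MAX_OTHER_NAMES = 5  # threshold stated in step (2) A


def is_name_related(query_name, names_on_page):
    """Step (2) A: keep a page only if it contains the query name and
    no more than 5 distinct other names (distinct counting is an assumption)."""
    others = {name for name in names_on_page if name != query_name}
    return query_name in names_on_page and len(others) <= MAX_OTHER_NAMES


@dataclass
class VisionBlock:
    """Six-tuple output for one block, in the order listed in step (2) B."""
    top: float      # distance from the top edge of the webpage
    left: float     # distance from the left edge of the webpage
    length: float   # length of the vision block
    width: float    # width of the vision block
    block_id: int   # vision block number
    text: str       # text in the vision block


def block_features(block, tfidf_values, link_ratio):
    """Step (2) C: concatenate TF-IDF values, size, position and the
    inbound/outbound link ratio into one classifier input vector."""
    return list(tfidf_values) + [
        block.length, block.width,  # size
        block.top, block.left,      # position
        link_ratio,
    ]


if __name__ == "__main__":
    print(is_name_related("Li Na", ["Li Na", "Jiang Shan"]))
    block = VisionBlock(10.0, 20.0, 300.0, 80.0, 1, "Li Na tennis")
    print(block_features(block, [0.5, 0.1], 0.25))
```

A binary classifier trained on such vectors, with label 1 for name-related blocks and 0 otherwise, would then play the role of the SVM in step (2) C.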
Embodiment 5
A person search method based on a search engine according to Embodiment 1, differing in that: let the parameter of the Dirichlet process be α; the method comprises the following steps:
(1) name of input retrieval;
(2) According to the input name, the crawler system crawls the webpage data set returned by different search engines for that name;
(3) For the webpage data set crawled by the crawler system, identify the name entities in each webpage using a named entity recognizer; if a webpage does not contain the query name, or contains more than 5 names different from the query name, directly label the webpage as belonging to the unrelated category; label the other webpages as related to the name;
(4) For the name-related webpages obtained from step (3), segment each webpage into blocks using the VIPS visual segmentation algorithm;
(5) For the webpages segmented in step (4), extract the vision blocks; form the vector representation of each vision block from the TF-IDF values of the text in the block, the size and position of the block (size: the length and width of the vision block; position: the distance from the top edge of the webpage and the distance from the left edge of the webpage), and the ratio of its inbound links to its outbound links; then use the SVM classification algorithm to filter out the vision blocks in the webpage that are unrelated to the name;
(6) For the name-related vision blocks obtained from step (5), extract the 20-dimension person attributes, name entities and text information from these vision blocks, construct the vector representation of the webpage text, and appropriately adjust the values of the vector according to the extracted person attributes and name entities;
(7) For the vector representations of the webpage text set constructed in step (6), perform text clustering using a Dirichlet process mixture model; the output is the list of category labels of the webpage data set: [label_1, label_2, ..., label_n], where label_i ∈ (1, n) and label_i ∈ N, and N denotes the final number of categories;
(8) According to the category labels obtained from clustering and the person attributes in each category, fuse the person attributes and construct a triple for each category: <[query name of the i-th category, related names], [fused attribute list], [webpage set of category i]>; then display the triples of all categories visually according to the importance of each real person.
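The triple construction of step (8) can be sketched in Python as follows, assuming the category labels of step (7) are already available. The page dictionary keys (`url`, `attributes`, `related_names`), the function name `build_triples` and the fusion rule of keeping the first value seen for each attribute are assumptions for illustration; the description does not specify how conflicting attribute values are fused, and ordering by person importance is not shown.

```python
from collections import defaultdict


def build_triples(query_name, labels, pages):
    """Group pages by cluster label and build one triple per category.

    labels: cluster label of each page, in the same order as `pages`.
    pages:  list of dicts with the hypothetical keys 'url',
            'attributes' ({attribute: value}) and 'related_names' (list).
    Returns a list of ([query_name, related_names], fused_attributes, urls).
    """
    groups = defaultdict(list)
    for label, page in zip(labels, pages):
        groups[label].append(page)

    triples = []
    for label in sorted(groups):
        fused, names = {}, []
        for page in groups[label]:
            for attr, value in page["attributes"].items():
                # Assumed fusion rule: the first value seen for an attribute wins.
                fused.setdefault(attr, value)
            for name in page["related_names"]:
                if name not in names:
                    names.append(name)
        urls = [page["url"] for page in groups[label]]
        triples.append(([query_name, names], fused, urls))
    return triples


if __name__ == "__main__":
    pages = [
        {"url": "a", "attributes": {"occupation": "tennis player"},
         "related_names": ["Jiang Shan"]},
        {"url": "b", "attributes": {"occupation": "singer"},
         "related_names": []},
    ]
    for triple in build_triples("Li Na", [1, 2], pages):
        print(triple)
```

Each returned triple corresponds to one real person sharing the query name, so two people named "Li Na" yield two separate entries with their own attribute lists and webpage sets.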