CN107908749A - Person search system and method based on a search engine - Google Patents

Person search system and method based on a search engine

Info

Publication number
CN107908749A
Authority
CN
China
Prior art keywords
name
webpage
block
vision
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711147336.0A
Other languages
Chinese (zh)
Other versions
CN107908749B (en)
Inventor
周奇
刘扬
王佰玲
辛国栋
孙云霄
王巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai
Priority to CN201711147336.0A
Publication of CN107908749A
Application granted
Publication of CN107908749B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a person search system and method based on a search engine. The system comprises a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module, connected in sequence. The data acquisition module crawls the web pages that search engines return for a queried name. The data preprocessing module filters out web pages unrelated to the name, segments the remaining pages into visual blocks, and filters out the visual blocks unrelated to the queried name. The feature extraction module extracts attributes and entities related to the queried person from the visual blocks, counts word frequencies, constructs a vector representation of each web page, and moderately increases the values of the dimensions that correspond to the extracted feature words. The clustering module takes the vector representation of each web page as input, clusters the web page texts, and outputs a list of web page class labels. The invention effectively addresses the name ambiguity and information clutter in the pages returned when searching for a person, and builds a person summary from the extracted person attributes and person relationships, making name searches more convenient for users.

Description

Person search system and method based on a search engine
Technical field
The present invention relates to a person search system and method based on a search engine, and belongs to the field of Internet and search technology.
Background
At present, the main difficulty of person search is that the web pages returned for a queried name suffer from name ambiguity and information clutter. Name disambiguation means distinguishing between multiple individuals who share the same name. The prevalence of name ambiguity hinders information dissemination and resource acquisition. The name search results provided by today's mainstream search engines usually mix the pages of all people sharing the name with irrelevant pages. These pages are sorted according to certain rules, and information about people who attract more attention is more likely to be ranked near the top. For example, when searching for "Li Na" on the Baidu search engine, the top-ranked results are about Li Na the "tennis player", the "singer", the "most beautiful cancer girl", and so on, while information about a "Li Na" who is an ordinary tutor is submerged in this ocean of information, so users have to spend a great deal of time checking and screening results.
For the above problem, there are currently three classes of solutions. First, supervised classification algorithms: a corpus is annotated manually and a suitable classifier model is selected to classify the web page texts. The number of classes in such methods is fixed, so they cannot adapt to dynamically growing data, and the quality of the classifier depends to some extent on the size of the annotated corpus. Second, unsupervised clustering algorithms, which are mainly divided into traditional clustering algorithms, clustering algorithms based on graph partitioning, and clustering algorithms based on network resources. Traditional clustering algorithms construct a vector space model of the web page texts and perform name disambiguation using K-Means or hierarchical clustering. Clustering algorithms based on graph partitioning take documents or features as nodes and the relations between documents or features as edges to construct a social network, and then cluster by graph partitioning. Clustering algorithms based on network resources first use network resources such as a Chinese thesaurus, the Yahoo web document taxonomy, and Wikipedia to alleviate missing and sparse data, and then apply a clustering algorithm to disambiguate names. Third, hybrid models: a multi-step clustering strategy that combines several classification or clustering algorithms to achieve name disambiguation. Because of the diversity and uncertainty of network information, the lack of large-scale manually annotated corpora, and the time and effort that manual annotation requires, unsupervised name disambiguation methods are in this sense preferable to supervised ones.
At present, research on name disambiguation relies mainly on text modeling. Preprocessing includes extracting person attributes and named entities, and the mapping between names and individuals is studied in combination with the context of the name. Observation shows, however, that web pages contain much text that is far from the name, as well as some abstract information, that greatly helps name disambiguation. For example, if two web pages both belong to the music topic, or both belong to the computer field, the two pages probably correspond to the same person. We therefore model the entire web page. In addition, current solutions cannot automatically identify the number of classes in a web page set and require manual intervention.
Chinese patent document 102054029A discloses a person information disambiguation method based on social networks and name context, which relates to a disambiguation method for Internet person information. It addresses the problem that the results a prior-art search engine returns for a specific name are often a mixture of pages about different people who share that name, and it is used for retrieving person information on the network. It comprises the following steps: 1. the user inputs the name to be searched, the search is completed using a search engine, and the retrieved pages are downloaded to the local computer using download software; 2. text extraction, word segmentation, and part-of-speech tagging are performed on these pages to form documents; 3. the documents are first classified using person domain information, social network and context information are then used to cluster the person domain information, and finally the correspondence between each piece of person domain information and the entity person is displayed, along with the social network of each entity person. However, that patent directly extracts, segments, and part-of-speech tags the body text of the web page to form documents. The pages returned by current search engines are of many kinds and varied structure, and sidebars and multi-level headings usually contain much of the information about the queried name. The method of that patent cannot extract name-related information outside the body text, which seriously degrades the clustering result. Its clustering algorithm needs to extract person domain information from the text, the amount of information extracted greatly affects the clustering result, and the clustering threshold must be specified manually, so manual intervention influences the clustering result.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a person search system based on a search engine.
The present invention also provides a person search method based on a search engine.
First, according to the actual layout of the web page, the Vision-based Page Segmentation (VIPS) algorithm is used to segment the page into blocks, the text, position, and link features of each visual block are extracted, and an SVM algorithm is used to filter out the visual blocks unrelated to the name. Then a text clustering method based on a Dirichlet process mixture is applied. Based on word-frequency statistics of the text, this method automatically decides whether a document belongs to an existing class or to a newly generated one, and so automatically identifies the number of classes in the web page text set. This reduces the influence of manual intervention on the clustering result and effectively resolves the name ambiguity in the pages returned for a queried name. Finally, a person summary is generated from the extracted attributes and person relationships, making name searches more convenient for users.
Explanation of terms:
1. TF-IDF value (term frequency-inverse document frequency): a statistical measure commonly used in natural language processing to assess how important a word is to one text in a text set or corpus. The importance of a word increases in proportion to the number of times it appears in the text, but decreases with the frequency at which it appears across the corpus.
2. VIPS visual block algorithm: Vision-based Page Segmentation.
3. SVM classification algorithm: Support Vector Machine.
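As an illustration of the TF-IDF term above, the following is a minimal sketch of one common TF-IDF variant. The patent does not state which variant it uses, so the formula (relative term frequency times the log of the inverse document frequency) and the function name `tf_idf` are assumptions made for this example only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenised document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    This variant is an assumption; the source does not specify one.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

With this variant, a word that appears in every document scores 0, which matches the idea that words common across the corpus carry little weight.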
The technical solution of the present invention is:
A person search system based on a search engine, comprising a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module, connected in sequence;
The name to be searched is input, and the data acquisition module uses a crawler system to crawl the web page information that multiple search engines return for the queried name, forming a web page set. The web page information refers to the web pages that a search engine returns for the queried name; each web page includes a title (title), a url, a summary (content), and the complete web page;
The crawler engine first crawls the url in each result that the different search engines return for the queried name, and the page download tool httrack is then used to download the complete web page at each url. Observation shows that, among the results a search engine returns for a name, only the first 10 pages are highly relevant to the name, so only the first 10 pages of results returned by each search engine are crawled.
The data preprocessing module filters out the web pages in the web page set that are unrelated to the name, segments the remaining web pages into blocks to obtain multiple visual blocks for each web page, and uses a supervised classification algorithm to filter out the visual blocks that are unrelated to the name;
A visual block is a block produced by segmenting a web page with the VIPS algorithm;
A visual block comprises an image and six-tuple information <distance from the top edge of the web page, distance from the left edge of the web page, length of the visual block, width of the visual block, number of the visual block, text in the visual block>. The visual blocks unrelated to the name include advertisements, navigation, pop-up boxes, copyright information, and other visual blocks unrelated to the name.
The feature extraction module extracts attributes and entities related to the queried person from the visual blocks, where an entity is a person name that appears in the web page. It counts the word frequencies in the name-related visual blocks of each web page and constructs the vector representation of each web page. The vector representation has the form (x, y), where x is a word in the web page text after the name-unrelated visual blocks have been filtered out, and y is the number of times the word appears in the web page. According to the extracted person-related attributes and entities, the values of the dimensions corresponding to the extracted feature words in the vector space are moderately increased;
The clustering module takes the vector representation of each web page as input, clusters the web page texts using a Dirichlet process mixture model, and outputs a list of web page class labels. The Dirichlet process mixture model can automatically identify the number of classes in the web page text set without manual intervention.
Preferably according to the present invention, the data preprocessing module comprises a data cleaning module, a web page segmentation module, and a person-related visual block extraction module, connected in sequence; the data acquisition module is connected to the data cleaning module, and the person-related visual block extraction module is connected to the feature extraction module;
The data cleaning module uses a named entity recognizer to identify whether each web page crawled by the crawler system contains the queried name: if a web page does not contain the queried name, or contains more than 5 names that differ from the queried name, the page is directly marked as unrelated to the name; otherwise, the page is marked as related to the name;
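The cleaning rule above can be sketched as follows. This is only an illustration: the function name `is_name_relevant` is hypothetical, the list of recognized names is assumed to come from some named entity recognizer that is not shown, and "more than 5 different names" is interpreted here as more than 5 distinct names, which the source does not state explicitly.

```python
def is_name_relevant(person_names, query_name, max_other_names=5):
    """Decide whether a page is related to the queried name.

    person_names: person names recognised on the page by an NER tool
                  (hypothetical input; the recogniser itself is not shown).
    Returns False if the queried name is absent, or if the page contains
    more than `max_other_names` distinct names other than the queried one.
    """
    if query_name not in person_names:
        return False
    other_names = set(person_names) - {query_name}
    return len(other_names) <= max_other_names
```

A page that mentions the queried name but also lists many other people (for example, a roster page) is discarded by this rule.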
The web page segmentation module performs visual segmentation on the name-related web pages obtained after cleaning by the data cleaning module: the segmentation is carried out by the VIPS visual block algorithm, which outputs the six-tuple information of each visual block segmented from the web page. The six-tuple information comprises: the distance from the top edge of the web page, the distance from the left edge of the web page, the length of the visual block, the width of the visual block, the number of the visual block, and the text in the visual block;
Because the visual blocks of a web page include advertisements, navigation, pop-up boxes, copyright information, and other visual blocks unrelated to the name, the person-related visual block extraction module filters out the name-unrelated visual blocks using the SVM classification algorithm. The input is the vector representation of a visual block, formed from the TF-IDF values of the text in the block, the size of the block, the position of the block, and the in-link to out-link ratio feature. The size of a visual block comprises its length and width; the position of a visual block is represented by its distance from the top edge of the web page and its distance from the left edge of the web page. The output is 0 or 1: 0 indicates that the visual block is unrelated to the queried name, and 1 indicates that the visual block is related to the queried name.
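The feature vector described above could be assembled roughly as in the following sketch. The class `VisualBlock`, the function `block_feature_vector`, the ordering of the features, and the reading of the link feature as "number of in-links divided by number of out-links" are all assumptions made for illustration; the source only names the features. Training and applying the SVM itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class VisualBlock:
    """Six-tuple output for one visual block (field names are hypothetical)."""
    top: float       # distance from the top edge of the web page
    left: float      # distance from the left edge of the web page
    length: float    # length of the visual block
    width: float     # width of the visual block
    block_id: int    # number of the visual block
    text: str        # text in the visual block


def block_feature_vector(block, tfidf_values, in_links, out_links):
    """Concatenate text, size, position and link features into one vector.

    tfidf_values: TF-IDF values of the block text over a fixed vocabulary.
    The in-link/out-link ratio is an assumed reading of the link feature;
    when there are no out-links the in-link count is used to avoid dividing
    by zero (also an assumption).
    """
    link_ratio = in_links / out_links if out_links else float(in_links)
    return list(tfidf_values) + [
        block.length, block.width,   # size
        block.top, block.left,       # position
        link_ratio,                  # link feature
    ]
```

Vectors built this way, paired with 0/1 labels, are the kind of input a binary SVM classifier would take.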
Preferably according to the present invention, the feature extraction module comprises a person-related attribute extraction module, a person relationship extraction module, and a text vectorization module; the data preprocessing module is connected to the person-related attribute extraction module and to the person relationship extraction module, and both of these modules are connected to the text vectorization module. The person-related attribute extraction module extracts person attributes of several dimensions from each web page using rules and template matching.
The person relationship extraction module uses a named entity recognizer to identify the name entities in each web page, counts the number of occurrences of each name entity and its distance to the queried name, and judges the importance of the entity from the number of occurrences and the distance to the queried name. The name entity is the entity referred to above;
The distance between a name entity and the queried name is calculated as follows: if the queried name and the extracted name entity appear in the same visual block, the distance between the name entity and the queried name is 0; otherwise, the distance is 1;
The importance of a name entity is calculated from its number of occurrences in the web page and its distance to the queried name as: number of occurrences of the name entity + (1 - distance between the name entity and the queried name);
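The distance and importance rules above amount to very little arithmetic, as the following sketch shows. The function names and the representation of "which blocks a name appears in" as sets of block numbers are assumptions for this example.

```python
def name_distance(entity_block_ids, query_block_ids):
    """Return 0 if the entity and the queried name share a visual block, else 1.

    Both arguments are collections of visual block numbers in which the
    respective name appears (an assumed representation).
    """
    return 0 if set(entity_block_ids) & set(query_block_ids) else 1


def entity_importance(occurrences, distance):
    """Importance = number of occurrences + (1 - distance)."""
    return occurrences + (1 - distance)
```

For instance, a name that occurs 4 times and shares a block with the queried name has importance 5; the same name appearing only in other blocks has importance 4.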
The text vectorization module first segments the person attributes extracted from the web page into words and collects the nouns among them. It then segments the web page text into words, removes stop words, counts the word frequencies of each web page, and constructs the vector representation of the web page text, that is, the word-frequency statistics of the words in the page text: {(word1, count1), (word2, count2), ..., (wordn, countn)}, where wordi denotes the i-th word in the web page and counti denotes the frequency with which the i-th word appears in the web page. Finally, it looks up, one by one, the values in the vector representation of the web page text that belong to the words corresponding to the person attribute values and entities, and moderately increases their weights according to the importance of the person attribute values and entities.
The importance of a person attribute reflects the fact that different attributes distinguish people to different degrees. The attributes with higher discriminating power are gender, alma mater, title of works, educational background, height, weight, email, phone, and date of birth, for which the added weight is 5; for the other 11 attributes the added weight is 3. The weights of the name entities and person attributes are added to the values of the corresponding words in the text vector of the web page, which gives the final vector representation of the web page text.
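The weighting step above can be sketched as follows. The function name `boost_vector` and its arguments are hypothetical, and two details are assumptions not confirmed by the source: only words already present in the page vector are updated, and the weight added for a name entity is its importance value.

```python
from collections import Counter

HIGH_WEIGHT = 5  # added for words of the more discriminating attributes
LOW_WEIGHT = 3   # added for words of the other attributes


def boost_vector(term_counts, high_attr_words, other_attr_words, entity_importance):
    """Add attribute and entity weights to a page's word-frequency vector.

    term_counts:       {word: frequency} for the page text
    high_attr_words:   nouns from highly discriminating attribute values
    other_attr_words:  nouns from the remaining attribute values
    entity_importance: {name entity: importance}
    Words absent from term_counts are skipped (an assumption).
    """
    vector = Counter(term_counts)
    for word in high_attr_words:
        if word in vector:
            vector[word] += HIGH_WEIGHT
    for word in other_attr_words:
        if word in vector:
            vector[word] += LOW_WEIGHT
    for name, importance in entity_importance.items():
        if name in vector:
            vector[name] += importance
    return dict(vector)
```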
Preferably according to the present invention, 20 dimensions of person attributes are extracted from each web page. The 20 person attributes are birthplace, occupation, alma mater, date of birth, nationality, gender, title of works, personal experience, political affiliation, educational background, religious belief, height, weight, email, marital status, ethnicity, achievements, blood type, hobbies, and phone.
Preferably according to the present invention, the crawler system is a distributed crawling system based on Scrapy-redis.
A person search method based on a search engine, comprising:
(1) crawling, with a crawler system, the web page information that multiple search engines return for the queried name, forming a web page set;
(2) filtering out the web pages in the web page set that are unrelated to the name, segmenting the remaining web pages into blocks to obtain multiple visual blocks for each web page, and filtering out the visual blocks unrelated to the name using a supervised classification algorithm;
(3) extracting attributes and entities related to the queried person from the visual blocks, where an entity is a person name that appears in the web page; counting the word frequencies in the name-related visual blocks of each web page and constructing the vector representation of each web page, the vector representation having the form (x, y), where x is a word in the web page text after the name-unrelated visual blocks have been filtered out and y is the number of times the word appears in the web page; and, according to the extracted person attributes and entities, moderately increasing the values of the dimensions corresponding to the extracted feature words in the vector space;
(4) taking the vector representation of each web page as input, clustering the web page texts using a Dirichlet process mixture model, and outputting a list of web page class labels. The Dirichlet process mixture model can automatically identify the number of classes in the web page text set without manual intervention.
Preferably according to the present invention, step (2) comprises:
A. using a named entity recognizer to identify whether each web page crawled by the crawler system contains the queried name: if a web page does not contain the queried name, or contains more than 5 names that differ from the queried name, the page is directly marked as unrelated to the name; otherwise, the page is marked as related to the name;
B. performing visual segmentation on the name-related web pages obtained after the data cleaning of step A: the segmentation is carried out by the VIPS visual block algorithm, which outputs the six-tuple information of each visual block segmented from the web page, the six-tuple information comprising the distance from the top edge of the web page, the distance from the left edge of the web page, the length of the visual block, the width of the visual block, the number of the visual block, and the text in the visual block;
C. filtering out the name-unrelated visual blocks using the SVM classification algorithm: the input is the vector representation of a visual block, formed from the TF-IDF values of the text in the block, the size and position of the block, and the in-link to out-link ratio feature; the output is 0 or 1, where 0 indicates that the visual block is unrelated to the queried name and 1 indicates that it is related to the queried name; the name-unrelated visual blocks are then removed.
Preferably according to the present invention, step (3) comprises:
a. extracting 20 dimensions of person attributes from each web page using rules and template matching;
b. identifying the name entities in each web page using a named entity recognizer, counting the number of occurrences of each name entity and its distance to the queried name, and judging the importance of the name entity from the number of occurrences and the distance to the queried name;
c. segmenting the person attributes of several dimensions extracted from the web page into words and collecting the nouns among them;
d. segmenting the web page text into words, removing stop words, counting the word frequency of each word in each web page, and constructing the vector representation of the web page text;
e. looking up, one by one, the values in the vector representation of the web page text that belong to the words corresponding to the person attribute values and name entities, and moderately increasing their weights according to the importance of the person attributes and name entities.
Preferably according to the present invention, in step a, 20 dimensions of person attributes are extracted from each web page. The 20 person attributes are birthplace, occupation, alma mater, date of birth, nationality, gender, title of works, personal experience, political affiliation, educational background, religious belief, height, weight, email, marital status, ethnicity, achievements, blood type, hobbies, and phone.
The beneficial effects of the present invention are:
1. The present invention provides a person search method based on Dirichlet process mixture text clustering. The method effectively addresses the name ambiguity and information clutter in the pages returned when searching for a person, and builds a person summary from the extracted person attributes and person relationships, making name searches more convenient for users.
2. The present invention provides a method for extracting person-related information from heterogeneous web pages (heterogeneous web pages are pages of different types: mhkc, news, forums, school and government websites, blogs, finance, and so on). The web pages are first segmented using the VIPS algorithm; a vector representation of each visual block is then built from the block's six-tuple and link information; and an SVM algorithm classifies each visual block as name-related or name-unrelated, which effectively prevents name-unrelated information in the web page from affecting name disambiguation.
Brief description of the drawings
Fig. 1 is a structural diagram of the person search system based on a search engine according to the present invention;
Fig. 2 is a flow diagram of the person search method based on a search engine according to the present invention;
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and embodiments, but is not limited thereto.
Embodiment 1
A person search system based on a search engine, as shown in Fig. 1, comprising a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module, connected in sequence;
The name to be searched is input, and the data acquisition module uses a distributed crawling system based on Scrapy-redis to crawl the web page information that multiple search engines return for the queried name, forming a web page set. The web page information refers to the web pages that a search engine returns for the queried name; each web page includes a title (title), a url, a summary (content), and the complete web page;
The crawler engine first crawls the url in each result that the different search engines return for the queried name, and the page download tool httrack is then used to download the complete web page at each url. Observation shows that, among the results a search engine returns for a name, only the first 10 pages are highly relevant to the name, so only the first 10 pages of results returned by each search engine are crawled.
The data preprocessing module filters out the web pages in the web page set that are unrelated to the name, segments the remaining web pages into blocks to obtain multiple visual blocks for each web page, and uses a supervised classification algorithm to filter out the visual blocks that are unrelated to the name;
A visual block is a block produced by segmenting a web page with the VIPS algorithm;
A visual block comprises an image and six-tuple information <distance from the top edge of the web page, distance from the left edge of the web page, length of the visual block, width of the visual block, number of the visual block, text in the visual block>. The visual blocks unrelated to the name include advertisements, navigation, pop-up boxes, copyright information, and other visual blocks unrelated to the name.
The feature extraction module extracts attributes and entities related to the queried person from the visual blocks, where an entity is a person name that appears in the web page. It counts the word frequencies in the name-related visual blocks of each web page and constructs the vector representation of each web page. The vector representation has the form (x, y), where x is a word in the web page text after the name-unrelated visual blocks have been filtered out, and y is the number of times the word appears in the web page. According to the extracted person attribute values and name entities, the values of the dimensions corresponding to the extracted feature words in the vector space are moderately increased;
The clustering module takes the vector representation of each web page as input, clusters the web page texts using a Dirichlet process mixture model, and outputs a list of web page class labels. The Dirichlet process mixture model can automatically identify the number of classes in the web page text set without manual intervention.
A Dirichlet process mixture model can be understood as an infinite mixture model with infinitely many components, that is, the limiting form of a finite mixture model with a Dirichlet process prior. Assume the sample set of the model (each sample is the vector representation of one web page) is $X = \{x_1, x_2, \ldots, x_n\}$, whose elements are independent and identically distributed variables obeying the following distributions:
$$G \sim \mathrm{DP}(\alpha, H) \tag{1}$$
$$\theta_i \mid G \sim G \tag{2}$$
$$x_i \mid \theta_i \sim F(\theta_i) \tag{3}$$
The observed variable $x_i$ obeys the distribution $F(\theta_i)$ with parameter $\theta_i$; $G$ is the prior distribution of the parameter $\theta_i$, and $G$ is a probability measure drawn from a Dirichlet process with parameter $\alpha$ and base distribution $H$. If samples $x_i$ and $x_j$ have the same parameter, the two samples are clustered into the same class.
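To illustrate why a Dirichlet process prior does not need a preset number of classes, the sketch below computes the prior assignment probabilities under the Chinese restaurant process, a standard equivalent view of the Dirichlet process. The patent does not describe its inference procedure, so this is not the patent's algorithm: a full clustering step would also multiply each prior probability by the likelihood of the web page vector under that cluster, which is omitted here. The function name `crp_probabilities` is hypothetical.

```python
def crp_probabilities(cluster_sizes, alpha):
    """Prior probabilities of joining each existing cluster or a new one.

    Under the Chinese restaurant process with concentration alpha, a new
    sample joins existing cluster k with probability n_k / (n + alpha) and
    opens a new cluster with probability alpha / (n + alpha), where n_k is
    the size of cluster k and n is the number of samples seen so far.
    The last entry of the returned list is the new-cluster probability.
    """
    total = sum(cluster_sizes) + alpha
    probs = [size / total for size in cluster_sizes]
    probs.append(alpha / total)
    return probs
```

Because the new-cluster probability is always positive, the number of clusters can grow with the data rather than being fixed in advance; a larger alpha makes new clusters more likely.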
Embodiment 2
A person search system based on a search engine according to Embodiment 1, differing in that:
The data preprocessing module comprises a data cleaning module, a webpage segmentation module, and a person-related visual block extraction module connected in sequence; the data acquisition module is connected to the data cleaning module, and the person-related visual block extraction module is connected to the feature extraction module;
The data cleaning module uses a named entity recognizer to determine whether each webpage crawled by the crawler system contains the searched name: if a webpage does not contain the searched name, or contains more than 5 names that differ from the searched name, the webpage is directly labeled as unrelated to the name; otherwise, the webpage is labeled as related to the name;
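The cleaning rule above can be sketched as a small predicate. This is a hedged illustration: it assumes the named entity recognizer has already produced a list of names for the page, and it interprets "more than 5 different names" as more than 5 distinct names other than the searched one, which the text does not spell out. The function and parameter names are hypothetical.

```python
def is_relevant_page(names_in_page, query_name, max_other_names=5):
    """Return True if the page should be kept as related to query_name.

    names_in_page: names found by a named entity recognizer (assumed input).
    """
    distinct = set(names_in_page)
    if query_name not in distinct:
        # The searched name never appears on the page.
        return False
    other_names = distinct - {query_name}
    # Pages listing too many other people are treated as unrelated.
    return len(other_names) <= max_other_names
```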
The webpage segmentation module performs visual segmentation on the name-related webpages obtained after data cleaning: segmentation is carried out with the VIPS visual block algorithm, which outputs a six-tuple for each visual block of a webpage. The six-tuple comprises: the distance from the top edge of the webpage, the distance from the left edge of the webpage, the length of the visual block, the width of the visual block, the number of the visual block, and the text in the visual block;
Because the visual blocks of a webpage include advertisements, navigation bars, pop-up boxes, copyright information, and other blocks unrelated to the name, the person-related visual block extraction module filters out the name-irrelevant visual blocks with an SVM classification algorithm. The input is the vector representation of each visual block, formed from the TF-IDF values of the text in the block, the size of the block (its length and width), the position of the block (its distance from the top edge and from the left edge of the webpage), and the in-link/out-link ratio feature. The output is 0 or 1, where 0 indicates that the visual block is unrelated to the searched name and 1 indicates that it is related.
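As a minimal sketch of the data involved, the six-tuple and the classifier input can be represented as follows. The class `VisualBlock`, the function `block_features`, and the ordering of the features are assumptions made for illustration; the TF-IDF values and the link ratio are taken as precomputed inputs, and the SVM itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class VisualBlock:
    """Six-tuple describing one visual block (field names are hypothetical)."""
    top: float       # distance from the top edge of the webpage
    left: float      # distance from the left edge of the webpage
    length: float    # length of the visual block
    width: float     # width of the visual block
    block_id: int    # number of the visual block
    text: str        # text inside the visual block


def block_features(block, tfidf_values, link_ratio):
    """Concatenate text, size, position and link features into one vector."""
    return list(tfidf_values) + [
        block.length, block.width,   # size of the block
        block.top, block.left,       # position of the block
        link_ratio,                  # in-link/out-link ratio feature
    ]
```

A vector built this way would be passed to a binary classifier that returns 0 (unrelated) or 1 (related).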
Embodiment 3
A person search system based on a search engine according to Embodiment 1 or 2, differing in that:
The feature extraction module comprises a person attribute extraction module, a person relation extraction module, and a text vectorization module. The data preprocessing module is connected to the person attribute extraction module and to the person relation extraction module, and both of these are connected to the text vectorization module. The person attribute extraction module uses rule matching and template matching to extract 20 person attributes from each webpage: birthplace, occupation, alma mater, date of birth, ethnicity, gender, works, personal experience, political affiliation, education, religious belief, height, weight, email address, marital status, nationality, achievements, blood type, hobbies, and telephone number.
Rule matching extracts person attributes from the webpage text with regular expressions. For example, if the text before a colon ":" is one of the 20 specified attributes, the text after the colon is the value of that attribute; an 11-digit number is matched as a telephone number.
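The two rules given as examples can be sketched with Python's `re` module. This is an illustration under stated assumptions: the attribute labels are shown in English and only a subset is listed, both the ASCII colon and the full-width colon are accepted, and the names `ATTRIBUTES`, `PAIR`, `PHONE`, and `extract_by_rules` are hypothetical.

```python
import re

# Subset of the 20 attribute labels, in English for illustration only.
ATTRIBUTES = {"birthplace", "occupation", "gender", "education", "phone"}

# "<label>: <value>" on one line; accepts ':' and the full-width ':'.
PAIR = re.compile(r"^[ \t]*([^::\n]+?)[ \t]*[::][ \t]*(.+?)[ \t]*$", re.MULTILINE)
# Exactly 11 consecutive digits, not embedded in a longer digit run.
PHONE = re.compile(r"(?<!\d)\d{11}(?!\d)")


def extract_by_rules(text):
    """Return a dict of attribute values found by the two example rules."""
    found = {}
    for label, value in PAIR.findall(text):
        label = label.lower()
        if label in ATTRIBUTES:
            found[label] = value
    phones = PHONE.findall(text)
    if phones:
        # Keep a phone number given via a label, otherwise use the first match.
        found.setdefault("phone", phones[0])
    return found
```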
Template matching matches each sentence of the webpage text against manually written templates for the 20 person attributes, for example: <Name> was born on <date of birth>; <Name> works as <occupation>.
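The two example templates can be turned into regular expressions in a few lines. The sketch below is an assumption-laden illustration: the templates are written in English, a value is taken to end at the next period, comma, or semicolon, and the names `TEMPLATES` and `extract_by_templates` are hypothetical.

```python
import re

# Manually written templates; {name} and {value} are placeholders.
TEMPLATES = {
    "date_of_birth": "{name} was born on {value}",
    "occupation": "{name} works as {value}",
}


def extract_by_templates(text, query_name):
    """Match each template against the text for the given person name."""
    found = {}
    for attribute, template in TEMPLATES.items():
        pattern = template.format(
            name=re.escape(query_name),
            value=r"(?P<value>[^.;,]+)",  # value runs up to the next delimiter
        )
        match = re.search(pattern, text)
        if match:
            found[attribute] = match.group("value").strip()
    return found
```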
The person relation extraction module uses a named entity recognizer to identify the name entities in each webpage, counts the number of occurrences of each name entity and its distance from the searched name, and judges the importance of each name entity from its occurrence count and its distance from the searched name;
The distance between a name entity and the searched name is computed as follows: if the searched name and the name entity appear in the same visual block, the distance is 0; otherwise, the distance is 1;
The importance of an entity is computed from its occurrence count and its distance from the searched name as: number of occurrences of the name entity + (1 − distance between the name entity and the searched name);
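The distance and importance rules translate directly into code. In this minimal sketch the sets of visual block numbers in which each name appears are assumed to be available already; the function names are hypothetical.

```python
def name_distance(entity_block_ids, query_block_ids):
    """0 if the entity shares at least one visual block with the searched name, else 1."""
    return 0 if set(entity_block_ids) & set(query_block_ids) else 1


def entity_importance(occurrences, distance):
    """Importance = occurrence count + (1 - distance)."""
    return occurrences + (1 - distance)
```

For instance, a name that occurs 3 times and shares a block with the searched name scores 3 + (1 − 0) = 4, while the same count without a shared block scores 3.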
The text vectorization module first segments the person attributes extracted from the webpage into words and collects the nouns among them. It then segments the webpage text, removes stop words, and counts the word frequencies of each webpage to construct the vector representation of the webpage text, i.e. the word-frequency statistics of the webpage text: {(word_1, count_1), (word_2, count_2), …, (word_n, count_n)}, where word_i is the i-th word in the webpage and count_i is the number of times the i-th word occurs in the webpage. Finally, it looks up, one by one, the entries of the vector representation whose words correspond to person attribute values and entities, and moderately increases their weights according to the importance of the attribute value and of the entity.
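The word-frequency statistics can be sketched with `collections.Counter`. Note the simplifications: whitespace splitting stands in for a Chinese word segmenter, and the stop-word list is a tiny illustrative set; both are assumptions, as is the function name.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much larger.
STOP_WORDS = {"the", "a", "of", "and", "in", "is"}


def term_frequencies(text):
    """Return {word: count} for the page text after stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


tf = term_frequencies("The professor of physics and the professor of chemistry")
```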
The importance of a person attribute reflects the fact that different attributes discriminate between persons to different degrees. For the attributes with higher discriminating power (gender, alma mater, works, education, height, weight, email address, telephone number, and date of birth), the added weight is 5; for the other 11 attributes, the added weight is 3. The weights of the name entities and person attributes are added to the values of the corresponding words in the webpage text vector to obtain the final vector representation of the webpage text.
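The weighting step can be sketched as follows. The attribute keys, the function name `boost_vector`, and the choice to add the entity importance score directly as the entity weight are assumptions for illustration; only words already present in the vector are adjusted, following the look-up described above.

```python
# Attributes with higher discriminating power receive +5, all others +3.
HIGH_WEIGHT_ATTRIBUTES = {
    "gender", "alma_mater", "works", "education", "height",
    "weight", "email", "phone", "date_of_birth",
}


def boost_vector(tf, attribute_words, entity_scores):
    """Return a copy of tf with attribute and entity weights added.

    attribute_words: {attribute: [words of its value]} (assumed structure)
    entity_scores:   {name: importance score}          (assumed structure)
    """
    boosted = dict(tf)
    for attribute, words in attribute_words.items():
        bonus = 5 if attribute in HIGH_WEIGHT_ATTRIBUTES else 3
        for word in words:
            if word in boosted:
                boosted[word] += bonus
    for name, score in entity_scores.items():
        if name in boosted:
            boosted[name] += score
    return boosted
```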
Embodiment 4
A person search method based on a search engine, as shown in Fig. 2, comprising:
(1) A crawler system crawls the webpage information returned by multiple search engines for the searched name, forming a webpage set;
(2) The webpages unrelated to the name are filtered out of the webpage set, the remaining webpages are segmented to obtain multiple visual blocks for each webpage, and a supervised classification algorithm filters out the visual blocks unrelated to the name. This step includes:
A. A named entity recognizer determines whether each webpage crawled by the crawler system contains the searched name: if a webpage does not contain the searched name, or contains more than 5 names that differ from the searched name, the webpage is directly labeled as unrelated to the name; otherwise, the webpage is labeled as related to the name;
B. Visual segmentation is performed on the name-related webpages obtained after the data cleaning of step A: segmentation is carried out with the VIPS visual block algorithm, which outputs a six-tuple for each visual block of a webpage. The six-tuple comprises: the distance from the top edge of the webpage, the distance from the left edge of the webpage, the length of the visual block, the width of the visual block, the number of the visual block, and the text in the visual block;
C. The name-irrelevant visual blocks are filtered out with an SVM classification algorithm: the input is the vector representation of each visual block, formed from the TF-IDF values of the text in the block, the size and position of the block, and the in-link/out-link ratio feature; the output is 0 or 1, where 0 indicates that the visual block is unrelated to the searched name and 1 indicates that it is related; the visual blocks unrelated to the searched name are removed.
(3) The attributes and entities related to the person being searched are extracted from the visual blocks; an entity here is a person name that appears in the webpage. The word frequencies over the name-related visual blocks of each webpage are counted, and a vector representation of each webpage is constructed of the form (x, y), where x is a word in the webpage text that remains after the name-irrelevant visual blocks are filtered out and y is the number of times that word occurs in the webpage. Based on the extracted person attribute values and name entities, the value of the dimension that each extracted feature word occupies in the vector space is moderately increased. This step includes:
A. Rule matching and template matching are used to extract 20 person attributes from each webpage: birthplace, occupation, alma mater, date of birth, ethnicity, gender, works, personal experience, political affiliation, education, religious belief, height, weight, email address, marital status, nationality, achievements, blood type, hobbies, and telephone number.
B. A named entity recognizer identifies the name entities in each webpage; the number of occurrences of each name entity and its distance from the searched name are counted, and the importance of each entity is judged from its occurrence count and its distance from the searched name;
C. The 20 person attributes extracted from the webpage are segmented into words, and the nouns among them are collected;
D. The webpage text is segmented, stop words are removed, and the frequency of each word in each webpage is counted to construct the vector representation of the webpage text;
E. The entries of the vector representation of the webpage text whose words correspond to person attribute values and name entities are looked up one by one, and their weights are moderately increased according to the importance of the person attribute and of the name entity.
Embodiment 5
A person search method based on a search engine according to Embodiment 1, differing in that the parameter of the Dirichlet process is set to α, and the method comprises the following steps:
(1) The name to be searched is input;
(2) Based on the input name, the crawler system crawls the webpage data set returned by different search engines for that name;
(3) For the webpage data set crawled by the crawler system, a named entity recognizer identifies the name entities in each webpage; if a webpage does not contain the searched name, or contains more than 5 names that differ from it, the webpage is directly labeled as belonging to the irrelevant class, and the other webpages are labeled as related to the name;
(4) The name-related webpages obtained in step (3) are segmented with the VIPS visual block algorithm;
(5) For the webpages segmented in step (4), the visual blocks are extracted. The TF-IDF values of the text in each visual block, the size of the block (its length and width), the position of the block (its distance from the top edge and from the left edge of the webpage), and the in-link/out-link ratio feature form the vector representation of the visual block, and an SVM classification algorithm filters out the name-irrelevant visual blocks in the webpage;
(6) For the name-related visual blocks obtained in step (5), the 20 person attributes, the name entities, and the text information in the person-related visual blocks are extracted, the vector representation of the webpage text is constructed, and the values of the vector are moderately adjusted according to the extracted person attributes and name entities;
(7) For the vector representations of the webpage text collection constructed in step (6), text clustering is performed with a Dirichlet process mixture model; the output is the class list of the webpage data set: [label_1, label_2, …, label_n], where label_i ∈ (1, n) and label_i ∈ N, with N denoting the final number of classes;
(8) Based on the class labels obtained from clustering and the person attributes within each class, the person attributes are fused and a triple is constructed for each class: <[searched name of the i-th class, related names], [fused attribute list], [webpage set of class i]>; the triples of all classes are then displayed visually according to the importance of each real-world individual.
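The triple construction of step (8) can be sketched as a grouping operation over the cluster labels. The text does not say how attributes are fused, so this sketch simply takes the union of the values seen in a class; that choice, the per-page dictionary layout, the placeholder URLs, and the name `build_triples` are all assumptions.

```python
def build_triples(labels, pages):
    """Group pages by cluster label and return {label: (names, attributes, urls)}.

    pages: list of dicts with keys "url", "names", "attributes" (assumed layout).
    """
    groups = {}
    for label, page in zip(labels, pages):
        group = groups.setdefault(label, {"names": set(), "attributes": {}, "urls": []})
        group["names"].update(page["names"])
        for attribute, value in page["attributes"].items():
            # Fusion here is a plain union of the values seen in the class.
            group["attributes"].setdefault(attribute, set()).add(value)
        group["urls"].append(page["url"])
    return {
        label: (
            sorted(group["names"]),
            {a: sorted(v) for a, v in group["attributes"].items()},
            group["urls"],
        )
        for label, group in groups.items()
    }
```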

Claims (9)

1. A person search system based on a search engine, characterized in that it comprises a data acquisition module, a data preprocessing module, a feature extraction module, and a cluster module connected in sequence;
The name to be searched is input, and the data acquisition module uses a crawler system to crawl the webpage information returned by multiple search engines for that name, forming a webpage set; the webpage information refers to the webpages returned by a search engine for the name, each webpage including a title, a URL, a summary, and the full webpage content;
The data preprocessing module filters out of the webpage set the webpages unrelated to the name, segments the remaining webpages to obtain multiple visual blocks for each webpage, and uses a supervised classification algorithm to filter out the visual blocks unrelated to the name;
The feature extraction module extracts, from the visual blocks, the attributes and entities related to the person being searched, an entity being a person name that appears in the webpage; it counts the word frequencies over the name-related visual blocks of each webpage and constructs a vector representation of each webpage of the form (x, y), where x is a word in the webpage text that remains after the name-irrelevant visual blocks are filtered out and y is the number of times that word occurs in the webpage; based on the extracted person attribute values and name entities, it moderately increases the value of the dimension that each extracted feature word occupies in the vector space;
The cluster module takes the vector representation of each webpage as input, clusters the webpage texts with a Dirichlet process mixture model, and outputs a list of webpage class labels.
2. The person search system based on a search engine according to claim 1, characterized in that the data preprocessing module comprises a data cleaning module, a webpage segmentation module, and a person-related visual block extraction module connected in sequence; the data acquisition module is connected to the data cleaning module, and the person-related visual block extraction module is connected to the feature extraction module;
The data cleaning module uses a named entity recognizer to determine whether each webpage crawled by the crawler system contains the searched name: if a webpage does not contain the searched name, or contains more than 5 names that differ from the searched name, the webpage is directly labeled as unrelated to the name; otherwise, the webpage is labeled as related to the name;
The webpage segmentation module performs visual segmentation on the name-related webpages obtained after data cleaning by the data cleaning module: segmentation is carried out with the VIPS visual block algorithm, which outputs a six-tuple for each visual block of a webpage, the six-tuple comprising: the distance from the top edge of the webpage, the distance from the left edge of the webpage, the length of the visual block, the width of the visual block, the number of the visual block, and the text in the visual block;
The person-related visual block extraction module filters out the name-irrelevant visual blocks with an SVM classification algorithm: the input is the vector representation of each visual block, formed from the TF-IDF values of the text in the block, the size of the block (its length and width), the position of the block (its distance from the top edge and from the left edge of the webpage), and the in-link/out-link ratio feature; the output is 0 or 1, where 0 indicates that the visual block is unrelated to the searched name and 1 indicates that it is related.
3. The person search system based on a search engine according to claim 1 or 2, characterized in that the feature extraction module comprises a person attribute extraction module, a person relation extraction module, and a text vectorization module; the data preprocessing module is connected to the person attribute extraction module and to the person relation extraction module, and both of these are connected to the text vectorization module; the person attribute extraction module uses rule matching and template matching to extract several person attributes from each webpage;
The person relation extraction module uses a named entity recognizer to identify the name entities in each webpage, counts the number of occurrences of each name entity and its distance from the searched name, and judges the importance of each entity from its occurrence count and its distance from the searched name;
The distance between a name entity and the searched name is computed as follows: if the searched name and the extracted name entity appear in the same visual block, the distance is 0; otherwise, the distance is 1;
The importance of an entity is computed from the number of occurrences of the name entity in the webpage and its distance from the searched name as: number of occurrences of the name entity + (1 − distance between the name entity and the searched name);
The text vectorization module first segments the person attributes extracted from the webpage into words and collects the nouns among them; it then segments the webpage text, removes stop words, and counts the word frequencies of each webpage to construct the vector representation of the webpage text; it looks up, one by one, the entries of the vector representation whose words correspond to person attributes and entities, and moderately increases their weights according to the importance of the attribute value and of the entity.
4. The person search system based on a search engine according to claim 3, characterized in that 20 person attributes are extracted from each webpage, the 20 person attributes being: birthplace, occupation, alma mater, date of birth, ethnicity, gender, works, personal experience, political affiliation, education, religious belief, height, weight, email address, marital status, nationality, achievements, blood type, hobbies, and telephone number.
5. The person search system based on a search engine according to claim 3, characterized in that the crawler system is a distributed crawling system based on Scrapy-redis.
6. A person search method based on a search engine, characterized in that it comprises:
(1) using a crawler system to crawl the webpage information returned by multiple search engines for the searched name, forming a webpage set;
(2) filtering out of the webpage set the webpages unrelated to the name, segmenting the remaining webpages to obtain multiple visual blocks for each webpage, and using a supervised classification algorithm to filter out the visual blocks unrelated to the name;
(3) extracting, from the visual blocks, the attributes and entities related to the person being searched, an entity being a person name that appears in the webpage; counting the word frequencies over the name-related visual blocks of each webpage and constructing a vector representation of each webpage of the form (x, y), where x is a word in the webpage text that remains after the name-irrelevant visual blocks are filtered out and y is the number of times that word occurs in the webpage; based on the extracted person attribute values and name entities, moderately increasing the value of the dimension that each extracted feature word occupies in the vector space;
(4) taking the vector representation of each webpage as input, clustering the webpage texts with a Dirichlet process mixture model, and outputting a list of webpage class labels.
7. The person search method based on a search engine according to claim 6, characterized in that step (2) comprises:
A. using a named entity recognizer to determine whether each webpage crawled by the crawler system contains the searched name: if a webpage does not contain the searched name, or contains more than 5 names that differ from the searched name, the webpage is directly labeled as unrelated to the name; otherwise, the webpage is labeled as related to the name;
B. performing visual segmentation on the name-related webpages obtained after the data cleaning of step A: segmentation is carried out with the VIPS visual block algorithm, which outputs a six-tuple for each visual block of a webpage, the six-tuple comprising: the distance from the top edge of the webpage, the distance from the left edge of the webpage, the length of the visual block, the width of the visual block, the number of the visual block, and the text in the visual block;
C. filtering out the name-irrelevant visual blocks with an SVM classification algorithm: the input is the vector representation of each visual block, formed from the TF-IDF values of the text in the block, the size of the block, the position of the block, and the in-link/out-link ratio feature; the output is 0 or 1, where 0 indicates that the visual block is unrelated to the searched name and 1 indicates that it is related; the visual blocks unrelated to the searched name are removed.
8. The person search method based on a search engine according to claim 6 or 7, characterized in that step (3) comprises:
A. using rule matching and template matching to extract several person attributes from each webpage;
B. using a named entity recognizer to identify the name entities in each webpage, counting the number of occurrences of each name entity and its distance from the searched name, and judging the importance of each entity from its occurrence count and its distance from the searched name;
C. segmenting the 20 person attributes extracted from the webpage into words and collecting the nouns among them;
D. segmenting the webpage text, removing stop words, and counting the frequency of each word in each webpage to construct the vector representation of the webpage text;
E. looking up, one by one, the entries of the vector representation of the webpage text whose words correspond to person attribute values and name entities, and moderately increasing their weights according to the importance of the person attribute and of the name entity.
9. The person search method based on a search engine according to claim 8, characterized in that in step A, 20 person attributes are extracted from each webpage, the 20 person attributes being: birthplace, occupation, alma mater, date of birth, ethnicity, gender, works, personal experience, political affiliation, education, religious belief, height, weight, email address, marital status, nationality, achievements, blood type, hobbies, and telephone number.
CN201711147336.0A 2017-11-17 2017-11-17 Character retrieval system and method based on search engine Active CN107908749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711147336.0A CN107908749B (en) 2017-11-17 2017-11-17 Character retrieval system and method based on search engine

Publications (2)

Publication Number Publication Date
CN107908749A true CN107908749A (en) 2018-04-13
CN107908749B CN107908749B (en) 2020-04-10

Family

ID=61846123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711147336.0A Active CN107908749B (en) 2017-11-17 2017-11-17 Character retrieval system and method based on search engine

Country Status (1)

Country Link
CN (1) CN107908749B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1687924A (en) * 2005-04-28 2005-10-26 中国科学院计算技术研究所 Method for producing internet personage information search engine
CN102054029A (en) * 2010-12-17 2011-05-11 哈尔滨工业大学 Figure information disambiguation treatment method based on social network and name context
CN102831128A (en) * 2011-06-15 2012-12-19 富士通株式会社 Method and device for sorting information of namesake persons on Internet
CN102880623A (en) * 2011-07-13 2013-01-16 富士通株式会社 Method and device for searching people with same name
CN104182420A (en) * 2013-05-27 2014-12-03 华东师范大学 Ontology-based Chinese name disambiguation method
CN104376116A (en) * 2014-12-01 2015-02-25 国家电网公司 Search method and device for figure information
US20160314130A1 (en) * 2015-04-24 2016-10-27 Tribune Broadcasting Company, Llc Computing device with spell-check feature
CN106484675A (en) * 2016-09-29 2017-03-08 北京理工大学 Fusion distributed semantic and the character relation abstracting method of sentence justice feature

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
沈剑平: ""面向网络人物搜索的中文人名消歧"", 《中国优秀硕士学位论文全文数据库•信息科技辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359301A (en) * 2018-10-19 2019-02-19 国家计算机网络与信息安全管理中心 A kind of the various dimensions mask method and device of web page contents
CN109948154A (en) * 2019-03-12 2019-06-28 南京邮电大学 A kind of personage's acquisition and relationship recommender system and method based on name
CN111241283A (en) * 2020-01-15 2020-06-05 电子科技大学 Rapid characterization method for portrait of scientific research student
CN111241283B (en) * 2020-01-15 2023-04-07 电子科技大学 Rapid characterization method for portrait of scientific research student

Also Published As

Publication number Publication date
CN107908749B (en) 2020-04-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
    Inventor after: Liu Yang, Wang Bailing, Zhou Qi, Xin Guodong, Sun Yunxiao, Wang Wei
    Inventor before: Zhou Qi, Liu Yang, Wang Bailing, Xin Guodong, Sun Yunxiao, Wang Wei
GR01 Patent grant
CB03 Change of inventor or designer information
    Inventor after: Sun Yunxiao, Liu Yang, Wang Bailing, Zhou Qi, Xin Guodong, Wang Wei
    Inventor before: Liu Yang, Wang Bailing, Zhou Qi, Xin Guodong, Sun Yunxiao, Wang Wei