CN105138708A - Method and device for identifying names of points of interest (POI) - Google Patents

Method and device for identifying names of points of interest (POI)

Info

Publication number
CN105138708A
Authority
CN
China
Prior art keywords
interest point
point name
interest
keyword
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510643119.5A
Other languages
Chinese (zh)
Inventor
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201510643119.5A
Publication of CN105138708A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29Geographical information databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a method and a device for identifying point of interest (POI) names. The method comprises: extracting POI data from web pages, the POI data comprising POI names; grouping the POI names that identify the same object into a POI name set; and identifying the correct second target POI names from the POI name set. Applying the correct POI data in subsequent operations reduces the error rate of those operations and reduces wasted resources.

Description

Method and device for identifying point of interest names
Technical field
The present invention relates to the technical field of computer processing, and in particular to a method for identifying point of interest names and a device for identifying point of interest names.
Background
A point of interest (Point of Interest, POI), also rendered as an "information point", carries several kinds of information, such as a name, a category, a latitude and a longitude.
In a geographic information system, a POI may be a house, a shop, a mailbox, a bus stop, and so on.
Traditional geographic information collection requires surveyors to measure the latitude and longitude of a point of interest with precision surveying instruments and then annotate it.
Precisely because collecting POI data is so time-consuming and laborious, the number of POIs in a geographic information system to some extent represents the value of the whole system.
To enlarge the amount of POI data in a geographic information system, POI data is currently mined from web pages, mostly by configuring a suitable template according to the structure of the web page and extracting the data through that template.
However, users do not necessarily publish information according to the rules of the web page, so websites that contain POIs are flooded with dirty data, that is, erroneous POI data.
For example, a website may reserve a region of its web page for publishing company names, yet some users publish data such as "World Top 500 enterprise" there, which is not a real POI name.
If such erroneous POI data is later used for operations such as navigation, the error rate of those operations is high and resources are wasted.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for identifying point of interest names, and a corresponding device, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for identifying point of interest names is provided, comprising:
Extracting point of interest data from web pages, the point of interest data comprising point of interest names;
Grouping the point of interest names that identify the same object into a point of interest name set;
Identifying correct second target point of interest names from the point of interest name set.
Optionally, the step of extracting point of interest data from web pages comprises:
Looking up the template configured for the web page;
Extracting point of interest data from the web page at the positions indicated by the template.
Optionally, the point of interest data further comprises point of interest addresses;
The step of grouping the point of interest names that identify the same object into a point of interest name set comprises:
Judging whether the point of interest addresses are identical or similar; if so, grouping the point of interest names associated with those point of interest addresses into a point of interest name set.
Optionally, the step of identifying correct second target point of interest names from the point of interest name set comprises:
Choosing keywords from the point of interest names in the point of interest name set;
Identifying the correct second target point of interest names among the point of interest names according to the keywords.
Optionally, the step of choosing keywords from the point of interest names in the point of interest name set comprises:
Performing word segmentation on the point of interest names in the point of interest name set to obtain one or more segmented words;
Looking up the first word frequency of each segmented word in a preset point of interest collection;
Taking the X segmented words with the lowest first word frequency within the same point of interest name as the keywords of that point of interest name, where X is a positive integer.
Optionally, the step of choosing keywords from the point of interest names in the point of interest name set further comprises:
Removing a segmented word when it matches preset address data.
Optionally, the step of identifying the correct second target point of interest names among the point of interest names according to the keywords comprises:
Calculating the second word frequency of each keyword in the point of interest name set;
Determining the point of interest names to which the Z keywords with the highest second word frequency belong to be the correct target point of interest names, where Z is a positive integer.
According to another aspect of the present invention, a device for identifying point of interest names is provided, comprising:
A point of interest data extraction module, adapted to extract point of interest data from web pages, the point of interest data comprising point of interest names;
A point of interest name set setting module, adapted to group the point of interest names that identify the same object into a point of interest name set;
A correct point of interest name identification module, adapted to identify correct second target point of interest names from the point of interest name set.
Optionally, the point of interest data extraction module is further adapted to:
Look up the template configured for the web page;
Extract point of interest data from the web page at the positions indicated by the template.
Optionally, the point of interest data further comprises point of interest addresses;
The point of interest name set setting module is further adapted to:
Judge whether the point of interest addresses are identical or similar; if so, group the point of interest names associated with those point of interest addresses into a point of interest name set.
Optionally, the correct point of interest name identification module is further adapted to:
Choose keywords from the point of interest names in the point of interest name set;
Identify the correct second target point of interest names among the point of interest names according to the keywords.
Optionally, the correct point of interest name identification module is further adapted to:
Perform word segmentation on the point of interest names in the point of interest name set to obtain one or more segmented words;
Look up the first word frequency of each segmented word in a preset point of interest collection;
Take the X segmented words with the lowest first word frequency within the same point of interest name as the keywords of that point of interest name, where X is a positive integer.
Optionally, the correct point of interest name identification module is further adapted to:
Remove a segmented word when it matches preset address data.
Optionally, the correct point of interest name identification module is further adapted to:
Calculate the second word frequency of each keyword in the point of interest name set;
Determine the point of interest names to which the Z keywords with the highest second word frequency belong to be the correct target point of interest names, where Z is a positive integer.
The embodiments of the present invention extract point of interest data from web pages and collect the point of interest names that identify the same object, so as to identify the correct second target point of interest names. Applying this correct POI data in subsequent operations reduces the error rate of those operations and reduces wasted resources.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the contents of the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flow chart of the steps of embodiment 1 of a method for identifying point of interest names according to an embodiment of the present invention;
Fig. 2 shows a flow chart of the steps of embodiment 2 of a method for identifying point of interest names according to an embodiment of the present invention;
Fig. 3 shows a flow chart of the steps of embodiment 3 of a method for identifying point of interest names according to an embodiment of the present invention;
Fig. 4 shows a structural block diagram of embodiment 1 of a device for identifying point of interest names according to an embodiment of the present invention;
Fig. 5 shows a structural block diagram of embodiment 2 of a device for identifying point of interest names according to an embodiment of the present invention; and
Fig. 6 shows a structural block diagram of embodiment 3 of a device for identifying point of interest names according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Referring to Fig. 1, a flow chart of the steps of embodiment 1 of a method for identifying point of interest names according to an embodiment of the present invention is shown. The method may specifically comprise the following steps:
Step 101, extracting point of interest data from web pages;
In embodiments of the present invention, a crawler may crawl the web pages of the internet in advance by following the link relationships between web pages and save them; the crawled web pages are stored in a web page database and form a large body of search resources.
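The crawling step described above can be sketched as a breadth-first traversal over page links. This is only an illustration under stated assumptions: the patent does not specify the traversal order, and the `crawl` function, the `fetch_links` callback and the toy link graph are hypothetical names introduced here, with no network access involved.

```python
from collections import deque


def crawl(seed_url, fetch_links, max_pages=1000):
    """Visit pages breadth-first by following links; return URLs in visit order.

    fetch_links is a hypothetical callback that returns the outgoing links
    of a page; a real crawler would download and parse the page here.
    """
    seen = {seed_url}
    queue = deque([seed_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real crawler would also store the page body
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for the internet (made-up page identifiers).
LINKS = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
pages = crawl("a", lambda url: LINKS.get(url, []))
```

The `seen` set keeps pages that link back to each other from being fetched twice.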
For web pages that contain a large amount of POI data distributed in a regular pattern, such as pages of websites where users review restaurants or travel destinations, or pages of map websites, the template configured for the web page can be looked up and point of interest data extracted from the web page at the positions indicated by the template. A large amount of POI data is thereby obtained, including the associated point of interest name, point of interest address, URL (Uniform Resource Locator), and so on.
For example, part of the web page structure of a certain website is as follows:
Where "* * *" is the domain name.
In the template for this website, the point of interest name can be extracted from the first line and the point of interest address from the last line.
Using templates, the following point of interest data is extracted from the web pages of different websites:
Where "* * * A" and "* * * B" are different domain names.
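A minimal sketch of template-based extraction follows, assuming a template simply records which line of a page block holds the name and which holds the address (first and last line, as in the example above). The `Template` class, the `TEMPLATES` registry, the `extract_poi` function and the domain `example-a.test` are hypothetical; the patent does not describe the template format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    name_line: int     # index of the line that holds the POI name
    address_line: int  # index of the line that holds the POI address


# Hypothetical per-domain registry: name on the first line, address on the last.
TEMPLATES = {"example-a.test": Template(name_line=0, address_line=-1)}


def extract_poi(domain, block_lines, url):
    """Return a POI record for a page block, or None if no template applies."""
    template = TEMPLATES.get(domain)
    if template is None:
        return None
    lines = [line.strip() for line in block_lines if line.strip()]
    if not lines:
        return None
    return {
        "name": lines[template.name_line],
        "address": lines[template.address_line],
        "url": url,
    }


poi = extract_poi(
    "example-a.test",
    ["China Ping'an Insurance company", "Tel: (omitted)", "3rd floor, Dongheng department store"],
    "http://example-a.test/item/1",
)
```

Pages from a domain without a configured template are skipped rather than guessed at.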
Step 102, grouping the point of interest names that identify the same object into a point of interest name set;
POI data generally identifies an object, such as a house, a shop, a mailbox, a bus stop, and so on.
Because the address information of such an object is generally accurate, in embodiments of the present invention the point of interest addresses can be normalized in order to judge whether they are identical or similar; if so, the point of interest names associated with those point of interest addresses are grouped into a point of interest name set.
For example, the four point of interest addresses "3rd floor, the permanent general merchandise in east, tide today hotel next door, Yu Yangfushi road, Yulin", "Fu Shi road, Yuyang District, Yulin tidal rip today wall east permanent general merchandise 3rd floors the first sales departments", "3rd floors, Kou Dongheng department store, Yu Yang south gate, Yulin" and "wholesale 3rd floors of the permanent general merchandise in south gate, Yulin mouth east" are not identical in form, but after normalization it can be determined that they all refer to the address "3rd floors, Dong Heng department store, rate in Yuyang county".
The names associated with these addresses, "World Top 500 enterprise", "China Ping'an Insurance company", "China Ping'an Yulin branch office" and "Yulin branch office of China Ping'an Insurance Co., Ltd. Branch", then form the point of interest name set.
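The grouping step can be sketched as follows. The patent does not say how addresses are normalized or how similarity is measured, so this sketch makes two assumptions: normalization strips punctuation, whitespace and case, and two addresses count as similar when the `difflib.SequenceMatcher` ratio of their normalized forms reaches a hypothetical threshold of 0.8. The function names and sample records are illustrative only.

```python
import re
from difflib import SequenceMatcher


def normalize_address(address):
    """Assumed normalization: drop punctuation and whitespace, lower-case."""
    return re.sub(r"[\W_]+", "", address).lower()


def group_names_by_address(records, threshold=0.8):
    """Group POI names whose normalized addresses are identical or similar.

    records is an iterable of (name, address) pairs. Each new address is
    compared with the first address of every existing group.
    """
    groups = []  # list of (representative normalized address, [names])
    for name, address in records:
        key = normalize_address(address)
        for representative, names in groups:
            if SequenceMatcher(None, representative, key).ratio() >= threshold:
                names.append(name)
                break
        else:
            groups.append((key, [name]))
    return [names for _, names in groups]


name_sets = group_names_by_address([
    ("World Top 500 enterprise", "3F, Dongheng Dept Store, Yulin"),
    ("China Ping'an Insurance company", "3f dongheng dept store yulin"),
    ("Some other shop", "1 Other Road, Beijing"),
])
```

A production system would more likely parse addresses into province, city, road and building fields before comparing them; plain string similarity is used here only to keep the sketch short.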
Step 103, identifying erroneous first target point of interest names from the point of interest name set.
In embodiments of the present invention, erroneous POI names, i.e. the first target point of interest names, can be screened out by mining the keywords of the point of interest names.
In an optional embodiment of the present invention, step 103 may comprise the following sub-steps:
Sub-step S11, choosing keywords from the point of interest names in the point of interest name set;
In embodiments of the present invention, a keyword may be the word that carries the most information and reflects the distinguishing feature of the point of interest name.
In a specific implementation, word segmentation may be performed on the point of interest names in the point of interest name set to obtain one or more segmented words;
In embodiments of the present invention, one or more of the following word segmentation methods may be used:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified).
2. Segmentation based on feature scanning or mark-based cutting: words with obvious features are identified and cut out of the string to be analyzed first, and these words serve as breakpoints that divide the original string into smaller strings, which are then segmented mechanically, thereby reducing the matching error rate; alternatively, segmentation is combined with part-of-speech tagging, so that the rich part-of-speech information assists the segmentation decisions and the segmentation result is in turn checked and adjusted during tagging, thereby improving segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a human's understanding of the sentence in order to identify words. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. Such a system generally comprises three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words, sentences and so on to resolve segmentation ambiguity; that is, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects well how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can therefore be counted, their mutual information computed, and the adjacent co-occurrence probability of two Chinese characters X and Y calculated. The mutual information reflects how tightly the characters are bound together. When the tightness exceeds a certain threshold, the character group can be considered to possibly form a word.
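As an illustration of the first method in the list above, the sketch below uses forward maximum matching, one common strategy for dictionary-based segmentation. The patent only says "a certain strategy", so the choice of forward maximum matching is an assumption, and the function name, the toy dictionary and the unspaced Latin string (a stand-in for unspaced Chinese text) are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text by repeatedly taking the longest dictionary entry
    that starts at the current position; unknown characters become
    single-character tokens."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


DICTIONARY = {"china", "pingan", "insurance", "company"}
tokens = forward_max_match("chinapinganinsurancecompany", DICTIONARY)
```

Because the longest entry always wins, this strategy can mis-segment ambiguous strings, which is one reason the other methods in the list exist.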
For example, the above point of interest names can be segmented into the following words:
The first word frequency of each segmented word is looked up in a preset point of interest collection. This collection is the set of POI data in the crawled web pages, which may number in the tens of millions, and the first word frequency is counted over the names of these tens of millions of POI records.
The lower the first word frequency of a segmented word, the more information it generally carries; the X segmented words with the lowest first word frequency within the same point of interest name can therefore be taken as the keywords of that point of interest name, where X is a positive integer.
For example, the following keywords can be extracted from the above point of interest names:
Point of interest name                                            Keyword
World Top 500 enterprise                                          World
China Ping'an Insurance company                                   Ping'an
China Ping'an Yulin branch office                                 Ping'an
Yulin branch office of China Ping'an Insurance Co., Ltd. Branch   Ping'an
Here, words such as "enterprise", "company" and "branch office" have a high first word frequency and carry little information; they only indicate that the entity is a business or company, point to nothing specific, and are unsuitable as keywords. Words such as "Ping'an" have a lower first word frequency and carry more information; they are the commonly used short name of the enterprise and are suitable as keywords.
It should be noted that address data for the whole country, such as provinces, cities, counties (districts), townships and roads, can be obtained in advance to build an address database.
When a segmented word matches the preset address data, for example "China" or "Yulin", it is an invalid keyword and can be removed.
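Keyword selection as described above (drop address words, then keep the X segmented words with the lowest first word frequency) can be sketched as below. The function name `choose_keywords` is hypothetical, the frequency values are invented purely for illustration, and treating a word missing from the frequency table as frequency 0 is an assumption of this sketch.

```python
def choose_keywords(tokens, first_word_freq, address_terms, x=1):
    """Return the x rarest non-address tokens of one POI name.

    first_word_freq maps a token to its count over the whole POI corpus.
    Tokens absent from the table are treated as frequency 0 (assumption).
    """
    candidates = [token for token in tokens if token not in address_terms]
    # sorted() is stable, so ties keep their order of appearance in the name.
    candidates = sorted(candidates, key=lambda token: first_word_freq.get(token, 0))
    return candidates[:x]


# Invented corpus counts, for illustration only.
FIRST_WORD_FREQ = {"China": 900000, "Ping'an": 1200, "Insurance": 50000, "company": 3000000}
ADDRESS_TERMS = {"China", "Yulin"}

keywords = choose_keywords(
    ["China", "Ping'an", "Insurance", "company"], FIRST_WORD_FREQ, ADDRESS_TERMS, x=1
)
```

Filtering address words before ranking matters: a rare place name would otherwise win the ranking even though it says nothing about which business the name refers to.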
Sub-step S12, identifying the erroneous first target point of interest names among the point of interest names according to the keywords.
In a specific implementation, the second word frequency of each keyword in the point of interest name set can be calculated, and the point of interest names to which the Y keywords with the lowest second word frequency belong are determined to be the erroneous first target point of interest names, where Y is a positive integer.
For example, for the keywords of the above point of interest names, the second word frequency of "World" is 1 and that of "Ping'an" is 3. Since the second word frequency of "World" is lower, the "World Top 500 enterprise" to which it belongs can be confirmed as an erroneous first target point of interest name.
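The example above can be reproduced with a short sketch. It assumes one keyword per name (X = 1) and takes the second word frequency to be the number of names in the set that share a keyword, which matches the counts 1 and 3 in the example. The function name `find_wrong_names` and the English renderings of the names are illustrative.

```python
from collections import Counter


def find_wrong_names(name_to_keyword, y=1):
    """Return the names whose keyword is among the y least frequent
    keywords within the POI name set."""
    second_word_freq = Counter(name_to_keyword.values())
    ranked = sorted(second_word_freq.items(), key=lambda item: item[1])
    rare_keywords = {keyword for keyword, _ in ranked[:y]}
    return [name for name, keyword in name_to_keyword.items() if keyword in rare_keywords]


NAME_TO_KEYWORD = {
    "World Top 500 enterprise": "World",
    "China Ping'an Insurance company": "Ping'an",
    "China Ping'an Yulin branch office": "Ping'an",
    "Yulin branch office of China Ping'an Insurance Co., Ltd. Branch": "Ping'an",
}
wrong_names = find_wrong_names(NAME_TO_KEYWORD, y=1)
```

Note that when every name in a set shares the same keyword, this sketch would still flag all of them for y = 1; a real system would presumably need a guard for that case, which the text does not discuss.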
The embodiments of the present invention extract point of interest data from web pages and collect the point of interest names that identify the same object, so as to identify the erroneous first target point of interest names. Rejecting this erroneous POI data in subsequent operations reduces the error rate of those operations and reduces wasted resources.
Referring to Fig. 2, a flow chart of the steps of embodiment 2 of a method for identifying point of interest names according to an embodiment of the present invention is shown. The method may specifically comprise the following steps:
Step 201, extracting point of interest data from web pages;
In embodiments of the present invention, a crawler may crawl the web pages of the internet in advance by following the link relationships between web pages and save them; the crawled web pages are stored in a web page database and form a large body of search resources.
For web pages that contain a large amount of POI data distributed in a regular pattern, such as pages of websites where users review restaurants or travel destinations, or pages of map websites, the template configured for the web page can be looked up and point of interest data extracted from the web page at the positions indicated by the template. A large amount of POI data is thereby obtained, including the associated point of interest name, point of interest address, URL (Uniform Resource Locator), and so on.
For example, part of the web page structure of a certain website is as follows:
Where "* * *" is the domain name.
In the template for this website, the point of interest name can be extracted from the first line and the point of interest address from the last line.
Using templates, the following point of interest data is extracted from the web pages of different websites:
Where "* * * A" and "* * * B" are different domain names.
Step 202, grouping the point of interest names that identify the same object into a point of interest name set;
POI data generally identifies an object, such as a house, a shop, a mailbox, a bus stop, and so on.
Because the address information of such an object is generally accurate, in embodiments of the present invention the point of interest addresses can be normalized in order to judge whether they are identical or similar; if so, the point of interest names associated with those point of interest addresses are grouped into a point of interest name set.
For example, the four point of interest addresses "3rd floor, the permanent general merchandise in east, tide today hotel next door, Yu Yangfushi road, Yulin", "Fu Shi road, Yuyang District, Yulin tidal rip today wall east permanent general merchandise 3rd floors the first sales departments", "3rd floors, Kou Dongheng department store, Yu Yang south gate, Yulin" and "wholesale 3rd floors of the permanent general merchandise in south gate, Yulin mouth east" are not identical in form, but after normalization it can be determined that they all refer to the address "3rd floors, Dong Heng department store, rate in Yuyang county".
The names associated with these addresses, "World Top 500 enterprise", "China Ping'an Insurance company", "China Ping'an Yulin branch office" and "Yulin branch office of China Ping'an Insurance Co., Ltd. Branch", then form the point of interest name set.
Step 203, identifying correct second target point of interest names from the point of interest name set.
In embodiments of the present invention, correct POI names, i.e. the second target point of interest names, can be screened out by mining the keywords of the point of interest names.
In an optional embodiment of the present invention, step 203 may comprise the following sub-steps:
Sub-step S21, choosing keywords from the point of interest names in the point of interest name set;
In embodiments of the present invention, a keyword may be the word that carries the most information and reflects the distinguishing feature of the point of interest name.
In a specific implementation, word segmentation may be performed on the point of interest names in the point of interest name set to obtain one or more segmented words;
In embodiments of the present invention, one or more of the following word segmentation methods may be used:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified).
2. Segmentation based on feature scanning or mark-based cutting: words with obvious features are identified and cut out of the string to be analyzed first, and these words serve as breakpoints that divide the original string into smaller strings, which are then segmented mechanically, thereby reducing the matching error rate; alternatively, segmentation is combined with part-of-speech tagging, so that the rich part-of-speech information assists the segmentation decisions and the segmentation result is in turn checked and adjusted during tagging, thereby improving segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a human's understanding of the sentence in order to identify words. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. Such a system generally comprises three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words, sentences and so on to resolve segmentation ambiguity; that is, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects well how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can therefore be counted, their mutual information computed, and the adjacent co-occurrence probability of two Chinese characters X and Y calculated. The mutual information reflects how tightly the characters are bound together. When the tightness exceeds a certain threshold, the character group can be considered to possibly form a word.
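To make the fourth method in the list above concrete, the sketch below scores adjacent character pairs by pointwise mutual information, log(P(XY) / (P(X) P(Y))). The patent does not give a formula for the mutual information, so this particular definition is an assumption, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Compute pointwise mutual information for every adjacent character pair.

    P(X) is estimated from character counts and P(XY) from adjacent-pair
    counts over all strings in the corpus.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for text in corpus:
        char_counts.update(text)
        pair_counts.update(zip(text, text[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores


# Toy corpus: "ab" co-occurs far more often than "ba".
pmi = adjacent_pmi(["abab", "abcd"])
```

A segmenter built on this would treat a pair as a candidate word when its score exceeds a threshold, as the text describes; choosing that threshold is corpus-dependent.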
For example, the above point of interest names can be segmented into the following words:
The first word frequency of each segmented word is looked up in a preset point of interest collection. This collection is the set of POI data in the crawled web pages, which may number in the tens of millions, and the first word frequency is counted over the names of these tens of millions of POI records.
The lower the first word frequency of a segmented word, the more information it generally carries; the X segmented words with the lowest first word frequency within the same point of interest name can therefore be taken as the keywords of that point of interest name, where X is a positive integer.
For example, the following keywords can be extracted from the above point of interest names:
Point of interest name                                            Keyword
World Top 500 enterprise                                          World
China Ping'an Insurance company                                   Ping'an
China Ping'an Yulin branch office                                 Ping'an
Yulin branch office of China Ping'an Insurance Co., Ltd. Branch   Ping'an
Here, words such as "enterprise", "company" and "branch office" have a high first word frequency and carry little information; they only indicate that the entity is a business or company, point to nothing specific, and are unsuitable as keywords. Words such as "Ping'an" have a lower first word frequency and carry more information; they are the commonly used short name of the enterprise and are suitable as keywords.
It should be noted that address data for the whole country, such as provinces, cities, counties (districts), townships and roads, can be obtained in advance to build an address database.
When a segmented word matches the preset address data, for example "China" or "Yulin", it is an invalid keyword and can be removed.
Sub-step S22, identifying the correct second target point of interest names among the point of interest names according to the keywords.
In a specific implementation, the second word frequency of each keyword in the point of interest name set can be calculated, and the point of interest names to which the Z keywords with the highest second word frequency belong are determined to be the correct target point of interest names, where Z is a positive integer.
For example, for the keywords of the above point of interest names, the second word frequency of "World" is 1 and that of "Ping'an" is 3. Since the second word frequency of "Ping'an" is higher, the names to which it belongs, "China Ping'an Insurance company", "China Ping'an Yulin branch office" and "Yulin branch office of China Ping'an Insurance Co., Ltd. Branch", can be confirmed as correct second target point of interest names.
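The selection of correct names in the example above can be sketched in the same way, this time keeping the names whose keyword is among the Z most frequent in the set. As before, one keyword per name is assumed, the second word frequency is taken to be the number of names sharing a keyword, and the function name `find_correct_names` and the English renderings of the names are illustrative.

```python
from collections import Counter


def find_correct_names(name_to_keyword, z=1):
    """Return the names whose keyword is among the z most frequent
    keywords within the POI name set."""
    second_word_freq = Counter(name_to_keyword.values())
    top_keywords = {keyword for keyword, _ in second_word_freq.most_common(z)}
    return [name for name, keyword in name_to_keyword.items() if keyword in top_keywords]


NAME_TO_KEYWORD = {
    "World Top 500 enterprise": "World",
    "China Ping'an Insurance company": "Ping'an",
    "China Ping'an Yulin branch office": "Ping'an",
    "Yulin branch office of China Ping'an Insurance Co., Ltd. Branch": "Ping'an",
}
correct_names = find_correct_names(NAME_TO_KEYWORD, z=1)
```

The underlying idea is a majority vote: names published independently for the same address tend to agree on the real business, so the keyword most of them share is taken as the trustworthy one.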
The embodiments of the present invention extract point of interest data from web pages and collect the point of interest names that identify the same object, so as to identify the correct second target point of interest names. Applying this correct POI data in subsequent operations reduces the error rate of those operations and reduces wasted resources.
Referring to Fig. 3, a flow chart of the steps of embodiment 3 of a method for identifying POI names according to an embodiment of the present invention is shown, which may specifically include the following steps:
Step 301: extract POI data from web pages, the POI data including POI names;
Step 302: set the POI names that identify the same target as a POI name set;
Step 303: identify the erroneous first target POI name and the correct second target POI name from the POI name set.
In an optional embodiment of the present invention, step 301 may include the following sub-steps:
Sub-step S31: look up the template configured for the web page;
Sub-step S32: extract POI data from the web page at the position indicated by the template.
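Sub-steps S31 and S32 can be sketched as below. The text does not specify the template format, so this sketch assumes one regular expression per website whose named groups mark the positions of the POI name and address; the host name, HTML class names, and function name are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical templates, one per website. Each gives the position of the
# POI name and address as a regular expression with named groups.
TEMPLATES = {
    "example.com": re.compile(
        r'<h1 class="poi-name">(?P<name>.*?)</h1>.*?'
        r'<span class="poi-address">(?P<address>.*?)</span>',
        re.S,
    ),
}


def extract_poi(url, html):
    """Look up the template configured for the page, then extract POI data."""
    template = TEMPLATES.get(urlparse(url).hostname)
    if template is None:
        return None  # no template configured for this website
    match = template.search(html)
    return match.groupdict() if match else None


page = (
    '<h1 class="poi-name">China Ping\'an Insurance company</h1>'
    '<p>Opening hours</p>'
    '<span class="poi-address">Yulin</span>'
)
print(extract_poi("http://example.com/poi/1", page))
```

A production extractor would more likely use an HTML parser with per-site selectors; the regular expression keeps the sketch dependency-free.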
In one embodiment of the present invention, the POI data further includes a POI address; in that case, step 302 may include the following sub-steps:
Sub-step S41: determine whether the POI addresses are the same or similar; if so, perform sub-step S42;
Sub-step S42: set the POI names corresponding to those POI addresses as a POI name set.
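Sub-steps S41 and S42 can be sketched as follows. The text does not say how address similarity is measured, so this sketch assumes a character-level similarity ratio from Python's `difflib` with an arbitrary threshold of 0.8; the sample records are hypothetical.

```python
from difflib import SequenceMatcher


def is_same_or_similar(a, b, threshold=0.8):
    """Treat two addresses as matching if equal or sufficiently similar."""
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold


def group_names_by_address(pois, threshold=0.8):
    """Group POI names whose addresses are the same or similar."""
    groups = []  # list of (representative address, names)
    for poi in pois:
        for rep_address, names in groups:
            if is_same_or_similar(rep_address, poi["address"], threshold):
                names.append(poi["name"])
                break
        else:
            # No existing group matched, so start a new name set.
            groups.append((poi["address"], [poi["name"]]))
    return [names for _, names in groups]


pois = [
    {"name": "A", "address": "12 Renmin Road, Yulin"},
    {"name": "B", "address": "12 Renmin Road, Yulin City"},
    {"name": "C", "address": "88 Harbour Street"},
]
print(group_names_by_address(pois))
```

Each record is compared only against the first address of every existing group, which keeps the sketch simple but makes the grouping depend on input order.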
In an optional embodiment of the present invention, step 303 may include the following sub-steps:
Sub-step S51: select keywords from the POI names in the POI name set;
Sub-step S52: identify the erroneous first target POI name and the correct second target POI name from the POI names according to the keywords.
In an optional embodiment of the present invention, sub-step S51 may further include the following sub-steps:
Sub-step S511: perform word segmentation on the POI names in the POI name set to obtain one or more word segments;
Sub-step S512: look up the first word frequency of each word segment in a preset POI set;
Sub-step S513: take the X word segments with the lowest first word frequency within the same POI name as the keywords of that POI name, where X is a positive integer.
In one embodiment of the present invention, sub-step S51 may also include the following sub-step:
Sub-step S514: when a word segment matches the preset address data, remove that word segment.
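Sub-steps S511 to S514 can be sketched together as below. Several points are assumptions: whitespace splitting stands in for a real word segmenter, the first word frequencies are invented numbers in a plain dictionary, address words are removed before ranking, and a segment missing from the frequency table is treated as having frequency 0.

```python
def choose_keywords(name, first_freq, address_words, x=1, segment=str.split):
    """Pick the x rarest non-address word segments of one POI name."""
    # S511 + S514: segment the name, then drop segments matching address data.
    segments = [s for s in segment(name) if s not in address_words]
    # S512: look up each segment's first word frequency (0 if unknown).
    # sorted() is stable, so ties keep their order of appearance.
    ranked = sorted(segments, key=lambda s: first_freq.get(s, 0))
    # S513: the x segments with the lowest first word frequency.
    return ranked[:x]


# Hypothetical first word frequencies from a preset POI set.
FIRST_FREQ = {"Ping'an": 120, "Insurance": 5000, "company": 90000}
ADDRESS_WORDS = {"China", "Yulin"}

keywords = choose_keywords(
    "China Ping'an Insurance company", FIRST_FREQ, ADDRESS_WORDS, x=1
)
print(keywords)
```

Choosing the rarest segments favors the distinguishing part of a name over generic words such as "company", which appear in a large share of POI names.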
In an optional embodiment of the present invention, sub-step S52 may further include the following sub-steps:
Sub-step S521: calculate the second word frequency of each keyword in the POI name set;
Sub-step S522: determine the POI names to which the Y keywords with the lowest second word frequency belong to be erroneous first target POI names;
Sub-step S523: determine the POI names to which the Z keywords with the highest second word frequency belong to be correct target POI names, where Y and Z are positive integers.
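Sub-steps S521 to S523 can be sketched as below. The sketch assumes the second word frequency is the number of names in the set that carry a keyword, and it adds one rule that the text does not state: a keyword already among the Z most frequent is never also counted among the Y least frequent. The fourth sample name is hypothetical.

```python
from collections import Counter


def classify_names(keywords_by_name, y=1, z=1):
    """Split one POI name set into (erroneous names, correct names)."""
    # S521: second word frequency of each keyword within the name set.
    freq = Counter(
        kw for kws in keywords_by_name.values() for kw in set(kws)
    )
    ranked = [kw for kw, _ in freq.most_common()]  # highest frequency first
    top = set(ranked[:z])
    bottom = set(ranked[-y:]) - top  # assumption: top keywords take priority
    # S522: names owning one of the y least frequent keywords.
    wrong = [n for n, kws in keywords_by_name.items() if bottom & set(kws)]
    # S523: names owning one of the z most frequent keywords.
    correct = [n for n, kws in keywords_by_name.items() if top & set(kws)]
    return wrong, correct


name_set = {
    "China Ping'an Insurance company": ["Ping'an"],
    "China Ping'an Yulin branch office": ["Ping'an"],
    "Yulin branch office of China Ping'an Insurance Co., Ltd.": ["Ping'an"],
    "World Insurance": ["world"],  # hypothetical entry
}
wrong, correct = classify_names(name_set, y=1, z=1)
print(wrong, correct)
```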
The embodiment of the present invention extracts POI data from web pages and identifies the POI names that refer to the same target, thereby identifying the erroneous first target POI name and the correct second target POI name, so that the erroneous POI data can be rejected from subsequent operations and the correct POI data applied, reducing the error rate of those operations and the waste of resources.
Because this embodiment is substantially similar to method embodiments 1 and 2, it is described only briefly; for relevant details, refer to the descriptions of method embodiments 1 and 2, which are not repeated here.
For simplicity of description, the method embodiments are each expressed as a series of combined actions, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 4, a structural block diagram of embodiment 1 of a device for identifying POI names according to an embodiment of the present invention is shown, which may specifically include the following modules:
POI data extraction module 401, adapted to extract POI data from web pages, the POI data including POI names;
POI name set setting module 402, adapted to set the POI names that identify the same target as a POI name set;
Erroneous POI name identification module 403, adapted to identify the erroneous first target POI name from the POI name set.
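When the device is realized in software, the three modules can be wired together roughly as below. This is only a structural sketch: the class name, field names, and the stub callables in the usage example are hypothetical, and each field stands in for whatever implementation the corresponding module uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PoiNameRecognizer:
    # Module 401: web page -> list of POI records.
    extract: Callable[[str], List[Dict[str, str]]]
    # Module 402: POI records -> list of POI name sets.
    group: Callable[[List[Dict[str, str]]], List[List[str]]]
    # Module 403: one POI name set -> erroneous names in it.
    identify_wrong: Callable[[List[str]], List[str]]

    def run(self, pages):
        """Run extraction, grouping, and identification in sequence."""
        records = [rec for page in pages for rec in self.extract(page)]
        return [self.identify_wrong(names) for names in self.group(records)]


# Stub modules, for illustration only.
recognizer = PoiNameRecognizer(
    extract=lambda page: [{"name": page, "address": "same"}],
    group=lambda records: [[rec["name"] for rec in records]],
    identify_wrong=lambda names: names[-1:],
)
print(recognizer.run(["A", "B"]))
```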
In an optional embodiment of the present invention, the POI data extraction module 401 may also be adapted to:
Look up the template configured for the web page;
Extract POI data from the web page at the position indicated by the template.
In an optional embodiment of the present invention, the POI data further includes a POI address;
The POI name set setting module 402 may also be adapted to:
Determine whether the POI addresses are the same or similar; if so, set the POI names corresponding to those POI addresses as a POI name set.
In an optional embodiment of the present invention, the erroneous POI name identification module 403 may also be adapted to:
Select keywords from the POI names in the POI name set;
Identify the erroneous first target POI name from the POI names according to the keywords.
In an optional embodiment of the present invention, the erroneous POI name identification module 403 may also be adapted to:
Perform word segmentation on the POI names in the POI name set to obtain one or more word segments;
Look up the first word frequency of each word segment in a preset POI set;
Take the X word segments with the lowest first word frequency within the same POI name as the keywords of that POI name, where X is a positive integer.
In an optional embodiment of the present invention, the erroneous POI name identification module 403 may also be adapted to:
When a word segment matches the preset address data, remove that word segment.
In an optional embodiment of the present invention, the erroneous POI name identification module 403 may also be adapted to:
Calculate the second word frequency of each keyword in the POI name set;
Determine the POI names to which the Y keywords with the lowest second word frequency belong to be erroneous first target POI names, where Y is a positive integer.
Referring to Fig. 5, a structural block diagram of embodiment 2 of a device for identifying POI names according to an embodiment of the present invention is shown, which may specifically include the following modules:
POI data extraction module 501, adapted to extract POI data from web pages, the POI data including POI names;
POI name set setting module 502, adapted to set the POI names that identify the same target as a POI name set;
Correct POI name identification module 503, adapted to identify the correct second target POI name from the POI name set.
In an optional embodiment of the present invention, the POI data extraction module 501 may also be adapted to:
Look up the template configured for the web page;
Extract POI data from the web page at the position indicated by the template.
In an optional embodiment of the present invention, the POI data further includes a POI address;
The POI name set setting module 502 may also be adapted to:
Determine whether the POI addresses are the same or similar; if so, set the POI names corresponding to those POI addresses as a POI name set.
In an optional embodiment of the present invention, the correct POI name identification module 503 may also be adapted to:
Select keywords from the POI names in the POI name set;
Identify the correct second target POI name from the POI names according to the keywords.
In an optional embodiment of the present invention, the correct POI name identification module 503 may also be adapted to:
Perform word segmentation on the POI names in the POI name set to obtain one or more word segments;
Look up the first word frequency of each word segment in a preset POI set;
Take the X word segments with the lowest first word frequency within the same POI name as the keywords of that POI name, where X is a positive integer.
In an optional embodiment of the present invention, the correct POI name identification module 503 may also be adapted to:
When a word segment matches the preset address data, remove that word segment.
In an optional embodiment of the present invention, the correct POI name identification module 503 may also be adapted to:
Calculate the second word frequency of each keyword in the POI name set;
Determine the POI names to which the Z keywords with the highest second word frequency belong to be correct target POI names, where Z is a positive integer.
Referring to Fig. 6, a structural block diagram of embodiment 3 of a device for identifying POI names according to an embodiment of the present invention is shown, which may specifically include the following modules:
POI data extraction module 601, adapted to extract POI data from web pages, the POI data including POI names;
POI name set setting module 602, adapted to set the POI names that identify the same target as a POI name set;
POI name identification module 603, adapted to identify the erroneous first target POI name and the correct second target POI name from the POI name set.
In an optional embodiment of the present invention, the POI data extraction module 601 may also be adapted to:
Look up the template configured for the web page;
Extract POI data from the web page at the position indicated by the template.
In an optional embodiment of the present invention, the POI data further includes a POI address;
The POI name set setting module 602 may also be adapted to:
Determine whether the POI addresses are the same or similar; if so, set the POI names corresponding to those POI addresses as a POI name set.
In an optional embodiment of the present invention, the POI name identification module 603 may also be adapted to:
Select keywords from the POI names in the POI name set;
Identify the erroneous first target POI name and the correct second target POI name from the POI names according to the keywords.
In an optional embodiment of the present invention, the POI name identification module 603 may also be adapted to:
Perform word segmentation on the POI names in the POI name set to obtain one or more word segments;
Look up the first word frequency of each word segment in a preset POI set;
Take the X word segments with the lowest first word frequency within the same POI name as the keywords of that POI name, where X is a positive integer.
In an optional embodiment of the present invention, the POI name identification module 603 may also be adapted to:
When a word segment matches the preset address data, remove that word segment.
In an optional embodiment of the present invention, the POI name identification module 603 may also be adapted to:
Calculate the second word frequency of each keyword in the POI name set;
Determine the POI names to which the Y keywords with the lowest second word frequency belong to be erroneous first target POI names;
Determine the POI names to which the Z keywords with the highest second word frequency belong to be correct target POI names, where Y and Z are positive integers.
Because the device embodiments are substantially similar to the method embodiments, they are described only briefly; for relevant details, refer to the descriptions of the method embodiments.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the content of the invention described herein, and that the above description of a specific language is given to disclose the best mode of the invention.
In the specification provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in an embodiment can be combined into one module or unit or component, and they can furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the POI name identification device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words can be interpreted as names.

Claims (10)

1. A method for identifying point of interest (POI) names, comprising:
Extracting POI data from web pages, the POI data comprising POI names;
Setting the POI names that identify the same target as a POI name set;
Identifying the correct second target POI name from the POI name set.
2. The method of claim 1, characterized in that the step of extracting POI data from web pages comprises:
Looking up the template configured for the web page;
Extracting POI data from the web page at the position indicated by the template.
3. The method of claim 1 or 2, characterized in that the POI data further comprises a POI address;
The step of setting the POI names that identify the same target as a POI name set comprises:
Determining whether the POI addresses are the same or similar; if so, setting the POI names corresponding to those POI addresses as a POI name set.
4. The method of claim 1, 2, or 3, characterized in that the step of identifying the correct second target POI name from the POI name set comprises:
Selecting keywords from the POI names in the POI name set;
Identifying the correct second target POI name from the POI names according to the keywords.
5. The method of claim 1, 2, 3, or 4, characterized in that the step of selecting keywords from the POI names in the POI name set comprises:
Performing word segmentation on the POI names in the POI name set to obtain one or more word segments;
Looking up the first word frequency of each word segment in a preset POI set;
Taking the X word segments with the lowest first word frequency within the same POI name as the keywords of that POI name, where X is a positive integer.
6. The method of claim 1, 2, 3, 4, or 5, characterized in that the step of selecting keywords from the POI names in the POI name set further comprises:
When a word segment matches the preset address data, removing that word segment.
7. The method of claim 1, 2, 3, 4, 5, or 6, characterized in that the step of identifying the correct second target POI name from the POI names according to the keywords comprises:
Calculating the second word frequency of each keyword in the POI name set;
Determining the POI names to which the Z keywords with the highest second word frequency belong to be correct target POI names, where Z is a positive integer.
8. A device for identifying point of interest (POI) names, comprising:
A POI data extraction module, adapted to extract POI data from web pages, the POI data comprising POI names;
A POI name set setting module, adapted to set the POI names that identify the same target as a POI name set;
A correct POI name identification module, adapted to identify the correct second target POI name from the POI name set.
9. The device of claim 8, characterized in that the POI data extraction module is further adapted to:
Look up the template configured for the web page;
Extract POI data from the web page at the position indicated by the template.
10. The device of claim 8 or 9, characterized in that the POI data further comprises a POI address;
The POI name set setting module is further adapted to:
Determine whether the POI addresses are the same or similar; if so, set the POI names corresponding to those POI addresses as a POI name set.
CN201510643119.5A 2015-09-30 2015-09-30 Method and device for identifying names of points of interest (POI) Pending CN105138708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510643119.5A CN105138708A (en) 2015-09-30 2015-09-30 Method and device for identifying names of points of interest (POI)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510643119.5A CN105138708A (en) 2015-09-30 2015-09-30 Method and device for identifying names of points of interest (POI)

Publications (1)

Publication Number Publication Date
CN105138708A true CN105138708A (en) 2015-12-09

Family

ID=54724055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510643119.5A Pending CN105138708A (en) 2015-09-30 2015-09-30 Method and device for identifying names of points of interest (POI)

Country Status (1)

Country Link
CN (1) CN105138708A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550169A (en) * 2015-12-11 2016-05-04 北京奇虎科技有限公司 Method and device for identifying point of interest names based on character length
CN110334349A (en) * 2019-06-28 2019-10-15 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that commercial circle is named automatically
CN110457706A (en) * 2019-08-15 2019-11-15 腾讯科技(深圳)有限公司 Interest point name preference pattern training method, application method, device and storage medium
CN111881225A (en) * 2020-04-01 2020-11-03 北京嘀嘀无限科技发展有限公司 Method and system for correcting name of boarding point

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030097429A1 (en) * 2001-11-20 2003-05-22 Wen-Che Wu Method of forming a website server cluster and structure thereof
CN102340536A (en) * 2011-07-13 2012-02-01 北京世纪高通科技有限公司 Method and device for searching points of interest
CN104572957A (en) * 2014-12-29 2015-04-29 北京奇虎科技有限公司 POI name determination system based on clustering and method thereof
CN104699835A (en) * 2015-03-31 2015-06-10 北京奇虎科技有限公司 Method and device used for determining webpages including POI (point of interest) data

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550169A (en) * 2015-12-11 2016-05-04 北京奇虎科技有限公司 Method and device for identifying point of interest names based on character length
CN110334349A (en) * 2019-06-28 2019-10-15 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that commercial circle is named automatically
CN110457706A (en) * 2019-08-15 2019-11-15 腾讯科技(深圳)有限公司 Interest point name preference pattern training method, application method, device and storage medium
CN110457706B (en) * 2019-08-15 2023-08-22 腾讯科技(深圳)有限公司 Point-of-interest name selection model training method, using method, device and storage medium
CN111881225A (en) * 2020-04-01 2020-11-03 北京嘀嘀无限科技发展有限公司 Method and system for correcting name of boarding point


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151209
