WO2016107352A1 - System and method for determining poi name and for determining validity of poi information - Google Patents

Info

Publication number
WO2016107352A1
Authority
WO
WIPO (PCT)
Prior art keywords
name
poi
frequency
information
determining
Prior art date
Application number
PCT/CN2015/095857
Other languages
French (fr)
Chinese (zh)
Inventor
王智广
魏少俊
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201410849382.5A external-priority patent/CN104572957B/en
Priority claimed from CN201410849380.6A external-priority patent/CN104572956B/en
Priority claimed from CN201410849123.2A external-priority patent/CN104572955B/en
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司 filed Critical 北京奇虎科技有限公司
Publication of WO2016107352A1 publication Critical patent/WO2016107352A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present invention relates to the field of electronic map technology, and in particular to a system and method for determining a POI name based on clustering, a cluster-based POI name determining system and method, and a system and method for determining the validity of POI information based on address data in a network.
  • A Point of Interest (POI) is generally a geographic information point marked in an electronic map, and usually includes information such as a POI identifier, a POI name, a POI type, a longitude, and a latitude.
  • the POI can be marked on the map with latitude and longitude information, which can be used to find and calculate navigation landmarks or buildings, such as shopping malls, parking lots, schools, hospitals, hotels, restaurants, supermarkets, parks, tourist attractions, etc.
  • the POI data stored in the database provides data support for the POI query.
  • The POI data in the database is updated mainly through data mining: either the stored POI data is updated according to data obtained by actual collection, or POI data is obtained from various life information websites on the Internet. As long as the acquired data includes the name and address of a POI, it can be treated as a piece of POI data. Because of the way POI data is collected and updated, the POI data found on the Internet is inevitably varied, so the POI data obtained from different source websites may contain repetitive data.
  • In view of the above problems, the present invention is proposed to provide a system for determining a POI name based on clustering and a corresponding method for determining a POI name based on clustering, which overcome the above problems or at least partially solve or alleviate them.
  • a system for determining a POI name based on clustering comprising:
  • An address data grabber for fetching address data from network data
  • An address data parser configured to separately extract a name field and address information from the captured one or more address data
  • a keyword determiner for determining one or more keywords based on the name field
  • a keyword clusterer for clustering the keywords corresponding to the same address information to generate at least one class
  • the POI name generator is configured to determine a POI name corresponding to the address information according to the clustered keywords.
  • a method for determining a POI name based on clustering including:
  • the POI name corresponding to the address information is determined according to the clustered keywords.
  • a cluster-based POI name determination system comprising:
  • An address data grabber for extracting address data from network data based on a search engine, the address data including a name field and address information;
  • a name field clusterer for clustering name fields corresponding to the same address information according to keywords
  • a second frequency statistics unit for counting the frequency of occurrence of each name field in each class after clustering, as the second frequency
  • the POI name determining unit is configured to determine, according to the second frequency, a POI name corresponding to the address information of the category.
  • a cluster-based POI name determining method including:
  • Obtaining address data from network data, the address data including a name field and address information
  • the name fields corresponding to the same address information are clustered according to keywords
  • the POI name corresponding to the address information of the category is determined according to the second frequency.
  • a system for determining validity of POI information based on address data in a network comprising:
  • a POI information acquiring unit configured to acquire, according to the search engine, a plurality of related POI information corresponding to the same POI name by using address data in the network;
  • a statistical unit configured to count the number of occurrences of the POI information in the address data in the network
  • a POI information determining unit configured to determine, according to the number of occurrences of the POI information in the address data in the network, the valid POI information corresponding to the same POI name.
  • a method for determining validity of POI information based on address data in a network including:
  • the valid POI information corresponding to the same POI name is determined according to the number of occurrences of the POI information in the address data in the network.
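  • The counting step above can be sketched as follows. The text only says that validity is decided "according to the number of occurrences"; this sketch assumes the simplest reading, namely that the most frequently seen piece of POI information is treated as valid. The function name `valid_poi_info` and the sample addresses are hypothetical.

```python
from collections import Counter

def valid_poi_info(candidates):
    """Return the most frequently observed POI information, or None if empty.

    `candidates` holds every piece of POI information (for example an
    address string) found in the network address data for one POI name.
    Assumption: the highest occurrence count marks the valid entry.
    """
    if not candidates:
        return None
    info, _count = Counter(candidates).most_common(1)[0]
    return info

# Hypothetical observations for a single POI name.
observed = ["12 Main St", "12 Main Street", "12 Main St", "14 Main St"]
chosen = valid_poi_info(observed)
```

  • A real system would probably normalize the strings before counting, so that spelling variants of the same address are not counted separately.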
  • a computer program comprising computer readable code that, when executed on a computing device, causes the computing device to perform the method for determining a POI name based on clustering as described above, the cluster-based POI name determining method as described above, or the method for determining validity of POI information based on address data in a network as described above.
  • a computer readable medium wherein the computer program described above is stored.
  • The invention fetches address data from network data, extracts the name field and the address information, determines one or more keywords based on the name field, clusters the keywords corresponding to the same address information, and determines the POI name corresponding to the address information based on the clustered keywords, so that the user can quickly and accurately search for the POI name corresponding to the POI address of the same latitude and longitude, thereby improving the user experience.
  • FIG. 1 is a block diagram schematically showing a system for determining a POI name based on clustering according to an embodiment of the present invention
  • FIG. 2 is a block diagram schematically showing a keyword determiner in a system for determining a POI name based on clustering according to another embodiment of the present invention
  • FIG. 3 is a block diagram schematically showing a POI name generator in a system for determining a POI name based on clustering according to another embodiment of the present invention
  • FIG. 4 is a block diagram schematically showing a POI name generator in a system for determining a POI name based on clustering according to another embodiment of the present invention
  • FIG. 5 is a flow chart schematically showing a method for determining a POI name based on clustering according to an embodiment of the present invention
  • FIG. 6 is a schematic diagram showing a subdivided flowchart of step S13 of a method for determining a POI name based on clustering according to another embodiment of the present invention
  • FIG. 7 is a view schematically showing a subdivision flowchart of step S15 of a method for determining a POI name based on clustering according to another embodiment of the present invention.
  • FIG. 8 is a schematic diagram showing a subdivided flowchart of step S15 of the method for determining a POI name based on clustering according to another embodiment of the present invention.
  • FIG. 9 is a block diagram schematically showing a cluster-based POI name determining system according to an embodiment of the present invention.
  • FIG. 10 is a block diagram schematically showing a name field clusterer in a cluster-based POI name determination system according to another embodiment of the present invention.
  • FIG. 11 is a block diagram schematically showing a second frequency statistic in a cluster-based POI name determining system according to another embodiment of the present invention.
  • FIG. 12 is a flow chart schematically showing a cluster-based POI name determining method according to an embodiment of the present invention.
  • FIG. 13 is a schematic flowchart showing a subdivision of step S122 of the cluster-based POI name determining method according to another embodiment of the present invention.
  • Fig. 14 is a view schematically showing a subdivision flow chart of step S123 of the cluster-based POI name determining method of another embodiment of the present invention.
  • FIG. 15 is a block diagram schematically showing a system for determining validity of POI information based on address data in a network according to an embodiment of the present invention
  • FIG. 16 is a block diagram schematically showing a statistical unit in a system for determining validity of POI information based on address data in a network according to another embodiment of the present invention
  • FIG. 17 is a block diagram schematically showing a POI information determining unit in a system for determining validity of POI information based on address data in a network according to another embodiment of the present invention.
  • FIG. 18 is a flow chart schematically showing a method for determining validity of POI information based on address data in a network according to an embodiment of the present invention
  • FIG. 19 is a schematic flowchart showing a subdivision of step S812 of a method for determining validity of POI information based on address data in a network according to another embodiment of the present invention.
  • FIG. 20 is a schematic diagram showing a subdivided flow chart of step S813 of a method for determining validity of POI information based on address data in a network according to another embodiment of the present invention.
  • Figure 21 schematically shows a block diagram of a computing device for performing a method in accordance with the present invention
  • Fig. 22 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • FIG. 1 shows a block diagram of a system for determining a POI name based on clustering in accordance with one embodiment of the present invention.
  • a system for determining a POI name based on clustering includes:
  • An address data grabber 11 for fetching address data from network data
  • the address data parser 12 is configured to separately extract a name field and address information from the captured one or more address data;
  • a keyword determiner 13 for determining one or more keywords based on the name field
  • a keyword clusterer 14 configured to cluster the keywords corresponding to the same address information to generate at least one class
  • the POI name generator 15 is configured to determine a POI name corresponding to the address information according to the clustered keywords.
  • The address data is obtained by means of the search engine, and includes a name field, address information, and a plurality of pieces of related POI information.
  • The related POI information is at least one piece of preset attribute information of the corresponding POI.
  • The preset attribute information includes a latitude and longitude, an address, a building name, or a unit name.
  • The address data is captured from the network data based on the search engine and includes a name field and address information. It is map address data mined from the Internet by the search engine, for example, name: Some Real Estate Group ** Branch Company; address: ** City ** District 8* Fortune Center Building A, 14th Floor. Here "Some Real Estate Group ** Branch Company" is the name of the POI, and "** City ** District 8* Fortune Center Building A, 14th Floor" is the address of the POI. The latitude and longitude of the address can be obtained by resolving the address; for this address, the result is east longitude 102.733445, north latitude 25.08108.
  • Table 1 Format Table of POI Information from Different Information Sources
  • the address data is fetched from the network data, and the name field and the address information are respectively extracted from the captured one or more address data, and one or more keywords are determined based on the name field;
  • the keywords corresponding to the same address information are clustered to generate at least one class, and the POI name corresponding to the address information is determined according to the clustered keywords, thereby obtaining the best poi name.
  • In another embodiment, the internal structure of the keyword determiner 13 in the system for determining a POI name based on clustering is further disclosed as follows. Referring to FIG. 2, the keyword determiner 13 further includes a word segmentation unit 131 and a keyword acquisition unit 132:
  • the word segmentation unit 131 is configured to segment the name in the name field into word segments
  • the keyword acquisition unit 132 is configured to acquire keywords of the address data according to the word segments.
  • the keyword acquisition unit 132 further includes:
  • a first frequency statistics module configured to count the frequency of occurrence of each word segment corresponding to the same address information, as the first frequency
  • a keyword generating module configured to generate a keyword of the address data according to the first frequency.
  • the keyword generating module selects, as the keyword of the address data, the word segment that has the smallest first frequency and is not a place name.
  • The POI names in the mined address data are segmented into words, and the number of occurrences of each resulting word is counted. Within a given POI name, the word that occurs least frequently, and therefore carries the largest amount of information, and that is not a place name is recorded as the keyword of that POI name.
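  • As a rough illustration of the keyword rule above, the sketch below picks, for each segmented name, the least frequent word segment that is not a place name. Everything concrete here is an assumption made for illustration: the frequency table `WORD_FREQUENCY`, the `PLACE_NAMES` set, the sample names, and the helper `pick_keyword`. The names are also assumed to be segmented already; a real system would run a word segmenter first.

```python
# Hypothetical corpus-wide word frequencies (the "first frequency").
WORD_FREQUENCY = {
    "Kunming": 50_000,
    "Evergrande": 1_200,
    "Real Estate": 80_000,
    "Group": 150_000,
    "Company": 900_000,
    "Lotus": 3_000,
    "Restaurant": 400_000,
}

# Hypothetical list of known place names to exclude from keywords.
PLACE_NAMES = {"Kunming"}

def pick_keyword(segments, frequency, place_names):
    """Return the least frequent non-place-name segment, or None if none qualify."""
    candidates = [s for s in segments if s not in place_names]
    if not candidates:
        return None
    # Unknown words count as frequency 0; ties keep the earliest segment.
    return min(candidates, key=lambda s: frequency.get(s, 0))

segmented_names = [
    ["Kunming", "Evergrande", "Real Estate"],
    ["Evergrande", "Real Estate", "Group", "Kunming", "Company"],
    ["Kunming", "Lotus", "Restaurant"],
]
keywords = [pick_keyword(n, WORD_FREQUENCY, PLACE_NAMES) for n in segmented_names]
```

  • With these made-up frequencies, the first two name variants share the keyword "Evergrande", which is what allows them to fall into the same class in the later clustering step.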
  • For the address data appearing in Table 1, the POI names in the related POI information are as shown in Table 2 (the word frequencies are computed over the names of about 90 million POIs)
  • the second column in Table 2 is the obtained keywords, as follows:
  • the POI names corresponding to the same keyword are recorded as the same class.
  • the above POI names can be classified into five classes, that is, there are five different POI names at this POI address.
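  • The grouping rule "POI names with the same keyword form one class" can be sketched as a dictionary keyed by keyword. The sample names and keywords below are hypothetical and stand in for the contents of Table 2, which is not reproduced here; `cluster_by_keyword` is an illustrative helper, not a name from the source.

```python
from collections import defaultdict

# Hypothetical (POI name, keyword) pairs for one address; the keywords are
# assumed to have been chosen already.
named_keywords = [
    ("Evergrande Real Estate Kunming", "Evergrande"),
    ("Evergrande Real Estate Group Kunming Company", "Evergrande"),
    ("Lotus Restaurant", "Lotus"),
]

def cluster_by_keyword(pairs):
    """Group POI names so that names sharing a keyword land in the same class."""
    classes = defaultdict(list)
    for name, keyword in pairs:
        classes[keyword].append(name)
    return dict(classes)

classes = cluster_by_keyword(named_keywords)
# The number of classes is the number of distinct POIs assumed at this address.
```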
  • In another embodiment, the internal structure of the POI name generator 15 in the system for determining a POI name based on clustering is further disclosed as follows. Referring to FIG. 3, the POI name generator 15 further includes a frequency statistics unit 151, a class identifier name determining unit 152, and a POI name determining unit 153:
  • the frequency statistics unit 151 is configured to count the frequency of occurrence of each name field in each class
  • the class identifier name determining unit 152 is configured to use the name field with the highest frequency of occurrence in each class as the class identifier name;
  • the POI name determining unit 153 is configured to use each class identifier name as a POI name.
  • Each class identifier name is used as a POI name. Specifically, clustering is performed according to keywords: the POI names corresponding to the same keyword are recorded as the same class, so the above POI names can be classified into five classes. That is, there are five different POI names at this POI address, which are:
  • In yet another embodiment, the internal structure of the POI name generator 15 in the system for determining a POI name based on clustering is further disclosed as follows. Referring to FIG. 4, the POI name generator 15 further includes a frequency statistics unit 151', a class identifier name determining unit 152', and a POI name determining unit 153':
  • a frequency statistics unit 151' for counting the frequency of occurrence of each name field in each class
  • a class identifier name determining unit 152' configured to use the name field having the highest frequency of occurrence in each class as the class identifier name;
  • the POI name determining unit 153' is configured to select the class identifier name having the highest frequency of occurrence as the POI name.
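  • The two POI name generator variants (FIG. 3 and FIG. 4) differ only in the last step, which the following sketch tries to make concrete. The `classes` dictionary and its counts are hypothetical sample data, and `class_identifiers` is an illustrative helper; tie-breaking by first occurrence is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

# Hypothetical clustering result for one address: keyword -> observed name fields.
classes = {
    "Evergrande": [
        "Evergrande Real Estate",
        "Evergrande Real Estate",
        "Evergrande Group Kunming",
    ],
    "Lotus": ["Lotus Restaurant", "Lotus Restaurant", "Lotus Restaurant"],
}

def class_identifiers(classes):
    """Map each class to (its most frequent name field, that name's count)."""
    return {
        keyword: Counter(names).most_common(1)[0]
        for keyword, names in classes.items()
        if names
    }

identifiers = class_identifiers(classes)

# FIG. 3 variant: every class identifier name is kept as a POI name.
all_poi_names = [name for name, _count in identifiers.values()]

# FIG. 4 variant: only the class identifier name with the highest count is kept.
best_poi_name = max(identifiers.values(), key=lambda pair: pair[1])[0]
```

  • The first variant suits an address that really hosts several POIs; the second suits the case where a single best name is wanted for the location.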
  • the best POI name is selected according to the "voting" on the Internet.
  • The so-called "voting" is mainly based on the frequency with which the POI name appears on the Internet and the reliability of its source: the name with the highest frequency and the most trusted source on the Internet is chosen as the best name. For example:
  • Class D and class E each contain only one name, and are handled in the same way as class A.
  • the reliable source is a source having a predetermined degree of confidence.
  • the source is a website or a webpage.
  • Reliable websites or web pages include, but are not limited to, large websites such as Sina and Phoenix, officially certified websites, and websites with high access frequency, large data traffic, no malicious links or virus links, and high customer satisfaction.
  • The credibility of a website or web page is quantifiable: it can be quantified according to the number of visits by users and the customer evaluation. Moreover, the credibility of each website or web page changes dynamically. If a website becomes infected with viruses, carries fraudulent advertisements, or is used by other malicious fraudulent websites, its credibility is reduced. The present invention quantifies and dynamically adjusts the credibility of websites to further ensure that the acquired POI information is reliable and effective.
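  • The text says credibility is quantified from visit counts and customer evaluation and is lowered when a site turns malicious, but gives no formula. The sketch below is therefore purely illustrative: the `Source` fields, the equal weights, the visit cap, the penalty factor, and the threshold are all assumptions chosen only to show how a dynamic score and a "predetermined degree of confidence" could fit together.

```python
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    monthly_visits: int
    rating: float                 # customer evaluation on a 0 to 5 scale
    flagged_malicious: bool = False

def credibility(source, visits_cap=1_000_000):
    """Hypothetical credibility score in [0, 1]; weights are illustrative."""
    visit_score = min(source.monthly_visits, visits_cap) / visits_cap
    rating_score = source.rating / 5.0
    score = 0.5 * visit_score + 0.5 * rating_score
    if source.flagged_malicious:
        # Dynamic adjustment: a flagged site loses most of its credibility.
        score *= 0.1
    return score

def is_reliable(source, threshold=0.6):
    """A source is reliable when its score reaches the preset threshold."""
    return credibility(source) >= threshold

site = Source("https://example.com", monthly_visits=2_000_000, rating=4.0)
reliable_before = is_reliable(site)
site.flagged_malicious = True     # e.g. a virus link was detected later
reliable_after = is_reliable(site)
```

  • Because the score is recomputed from the current state of the source, the same website can move in or out of the reliable set over time.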
  • The system for determining a POI name based on clustering finds the keyword of each POI name according to the word frequencies after word segmentation and clusters by keyword, so that differently worded variants of the same POI name are grouped into one class. This solves the problem that the same latitude and longitude corresponds to multiple POI names, and the Internet "voting" mechanism is used to select the best POI name.
  • FIG. 5 shows a flow chart of a method for determining a POI name based on clustering according to an embodiment of the present invention.
  • a method for determining a POI name based on clustering includes the following steps:
  • The address data is captured from the network data based on the search engine and includes a name field and address information. It is map address data mined from the Internet by the search engine, for example, name: Evergrande Real Estate Group Kunming Company; address: 14th Floor, Office Building, Block A, Beichen Fortune Center, Panlong District, Kunming City. Here "Evergrande Real Estate Group Kunming Company" is the name of the POI, and "14th Floor, Office Building, Block A, Beichen Fortune Center, Panlong District, Kunming City" is the address of the POI. The latitude and longitude of the address can be obtained by resolving the address; for this address, the result is east longitude 102.733445, north latitude 25.08108.
  • Among the POI data obtained from different source websites for the same geographical location, there may be repetitive data; that is, the same address (latitude and longitude) may have multiple POI names. For example, when multiple companies share one latitude and longitude, the actual POI latitude and longitude are the same, but the POI names and POI addresses are described in different ways.
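  • To make the "same latitude and longitude, several names" situation concrete, the sketch below groups records by coordinates. The record layout, the sample names, and the choice to round coordinates to six decimal places are assumptions for illustration; the coordinates themselves are the ones quoted in the example above.

```python
from collections import defaultdict

# Hypothetical records mined from different source websites.
records = [
    {"name": "Evergrande Real Estate Group Kunming Company",
     "lon": 102.733445, "lat": 25.08108},
    {"name": "Evergrande Real Estate Kunming",
     "lon": 102.733445, "lat": 25.08108},
    {"name": "Lotus Restaurant",
     "lon": 102.700000, "lat": 25.05000},
]

def group_by_location(records, precision=6):
    """Collect the distinct names seen at each (longitude, latitude) pair."""
    groups = defaultdict(set)
    for record in records:
        key = (round(record["lon"], precision), round(record["lat"], precision))
        groups[key].add(record["name"])
    return groups

groups = group_by_location(records)
# Locations with more than one name are the ones that need clustering.
ambiguous = {loc: names for loc, names in groups.items() if len(names) > 1}
```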
  • the address data is fetched from the network data, and the name field and the address information are respectively extracted from the captured one or more address data, and one or more keywords are determined based on the name field;
  • the keywords corresponding to the same address information are clustered to generate at least one class, and the POI name corresponding to the address information is determined according to the clustered keywords, thereby obtaining the best POI name.
  • step S13 in the method for determining the POI name based on clustering according to the present invention is further disclosed as follows to embody another embodiment implemented according to this step.
  • the subdivision steps of this step include:
  • Step S132: acquiring the keyword of the address data according to the word segments, which further includes:
  • the step of generating the keyword of the address data according to the first frequency, which is specifically:
  • the word segment that has the smallest first frequency and is not a place name is selected as the keyword of the address data.
  • In another embodiment, step S15 in the method for determining a POI name based on clustering according to the present invention is further disclosed as follows. Referring to FIG. 7, the subdivision steps of this step include:
  • the name field with the highest frequency of occurrence in each class is used as a class identifier name.
  • Each class identifier name is taken as a POI name.
  • Each class identifier name is used as a POI name corresponding to the address information. Clustering is performed according to keywords: the POI names corresponding to the same keyword are recorded as the same class, so the above POI names can be classified into five classes. That is, there are five different POI names at this POI address, which are:
  • step S15 in the method for determining the POI name based on clustering according to the present invention is further disclosed as follows to embody another embodiment implemented according to this step.
  • the subdivision steps of this step include:
  • the best POI name is selected according to the "voting" on the Internet.
  • The so-called "voting" is mainly based on the frequency with which the POI name appears on the Internet and the reliability of its source: the name with the highest frequency and the most trusted source on the Internet is chosen as the best name. For example:
  • Class D and class E each contain only one name, and are handled in the same way as class A.
  • the reliable source is a source having a predetermined degree of confidence.
  • the source is a website or a webpage.
  • Reliable websites or web pages include, but are not limited to, large websites such as Sina and Phoenix, officially certified websites, and websites with high access frequency, large data traffic, no malicious links or virus links, and high customer satisfaction.
  • The credibility of a website or web page is quantifiable: it can be quantified according to the number of visits by users and the customer evaluation. Moreover, the credibility of each website or web page changes dynamically. If a website becomes infected with viruses, carries fraudulent advertisements, or is used by other malicious fraudulent websites, its credibility is reduced. The present invention quantifies and dynamically adjusts the credibility of websites to further ensure that the acquired POI information is reliable and effective.
  • The method for determining a POI name based on clustering finds the keyword of each POI name according to the word frequencies after word segmentation and clusters by keyword, so that differently worded variants of the same POI name are grouped into one class. This solves the problem that the same latitude and longitude corresponds to multiple POI names, and the Internet "voting" mechanism is used to select the best POI name.
  • the name field and the address information are extracted by fetching the address data from the network data, the keyword is determined based on the name field, and the keywords corresponding to the same address information are clustered.
  • the POI name corresponding to the address information is determined based on the clustered keywords, so that the user can quickly and accurately search for the POI name corresponding to the POI address of the same latitude and longitude, thereby improving the user experience.
  • Figure 9 is a block diagram showing a cluster-based POI name determination system in accordance with one embodiment of the present invention.
  • a cluster-based POI name determining system includes:
  • An address data grabber 91 configured to fetch address data from network data based on a search engine, where the address data includes a name field and address information;
  • a name field clusterer 92 configured to cluster the name fields corresponding to the same address information according to keywords
  • the second frequency statistic unit 93 is configured to count the frequency of occurrence of the name field in each category after clustering, as the second frequency;
  • the POI name determining unit 94 is configured to determine, according to the second frequency, a POI name corresponding to the address information of the category.
  • The address data is obtained by means of the search engine, and includes a name field, address information, and a plurality of pieces of related POI information.
  • The related POI information is at least one piece of preset attribute information of the corresponding POI.
  • Further, the preset attribute information includes a latitude and longitude, an address, a building name, or a unit name.
  • The address data is captured from the network data based on the search engine and includes a name field and address information. The name fields corresponding to the same address information are clustered according to keywords, the frequency with which each name field appears in each class after clustering is counted as the second frequency, and the POI name corresponding to the address information of the class is determined according to the second frequency, thereby obtaining the best POI name.
  • In another embodiment, the internal structure of the name field clusterer 92 in the cluster-based POI name determining system of the present invention is further disclosed as follows. Referring to FIG. 10, the name field clusterer 92 further includes a keyword determining unit 921, a keyword clustering unit 922, and a name field cluster determining unit 923:
  • the keyword determining unit 921 is configured to determine one or more keywords based on the name field
  • the keyword clustering unit 922 is configured to cluster the keywords corresponding to the same address information
  • the name field cluster determining unit 923 is configured to determine the clustered name field according to the clustered keywords.
  • the keyword determining unit 921 further includes a word segmentation module and a keyword obtaining module: the word segmentation module is configured to segment the name in the name field into word segments; the keyword obtaining module is configured to acquire a keyword of the name field according to the word segments.
  • the keyword obtaining module further includes a first frequency statistics sub-module and a keyword generation sub-module: the first frequency statistics sub-module is configured to count the frequency of occurrence of each word segment corresponding to the same address information, as the first frequency; the keyword generation sub-module is configured to generate a keyword of the name field according to the first frequency.
  • the keyword generation sub-module selects, as the keyword of the name field, the word segment that has the smallest first frequency and is not a place name.
  • The POI names in the mined address data are segmented into words, and the number of occurrences of each resulting word is counted. Within a given POI name, the word that occurs least frequently, and therefore carries the largest amount of information, and that is not a place name is recorded as the keyword of that POI name.
  • For the address data appearing in Table 1 above, the POI names of the related POI information are as shown in Table 2 above (the word frequencies are computed over the names of about 90 million POIs).
  • In another embodiment, the internal structure of the second frequency statistics unit 93 in the cluster-based POI name determining system of the present invention is further disclosed as follows. Referring to FIG. 11, the second frequency statistics unit 93 further includes a name field source obtaining unit 931, a source reliability determining unit 932, and a second frequency counting unit 933:
  • the name field source obtaining unit 931 is configured to obtain a source of the name field
  • the source reliability determining unit 932 is configured to determine whether the source is a reliable source
  • the second frequency counting unit 933 is configured to count the occurrence of the name field toward the second frequency when the source is determined to be reliable; otherwise the occurrence is not counted.
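  • Units 931 to 933 together amount to a filtered count: an occurrence of a name field contributes to the second frequency only if its source is judged reliable. A minimal sketch follows; the whitelist `RELIABLE_SOURCES`, the host names, and the sample observations are hypothetical, and a set lookup stands in for whatever reliability check a real system would use.

```python
from collections import Counter

# Hypothetical set of sources already judged reliable.
RELIABLE_SOURCES = {"news.example.com", "maps.example.org"}

# Hypothetical (name field, source) observations within one class.
observations = [
    ("Evergrande Real Estate", "news.example.com"),
    ("Evergrande Real Estate", "maps.example.org"),
    ("Evergrande Realty", "spam.example.net"),
    ("Evergrande Realty", "spam.example.net"),
    ("Evergrande Realty", "spam.example.net"),
]

def second_frequency(observations, reliable_sources):
    """Count name-field occurrences, ignoring those from unreliable sources."""
    counts = Counter()
    for name, source in observations:
        if source in reliable_sources:
            counts[name] += 1
    return counts

freq = second_frequency(observations, RELIABLE_SOURCES)
top_name = freq.most_common(1)[0][0]
```

  • In this made-up data the variant repeated three times by an unreliable site is ignored, so the name backed by two reliable sources wins the count.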
  • the reliable source is a source having a predetermined degree of confidence.
  • the source is a website or a webpage.
  • Reliable websites or web pages include, but are not limited to, large websites such as Sina and Phoenix, officially certified websites, and websites with high access frequency, large data traffic, no malicious links or virus links, and high customer satisfaction.
  • The credibility of a website or web page is quantifiable: it can be quantified according to the number of visits by users and the customer evaluation. Moreover, the credibility of each website or web page changes dynamically. If a website becomes infected with viruses, carries fraudulent advertisements, or is used by other malicious fraudulent websites, its credibility is reduced. The present invention quantifies and dynamically adjusts the credibility of websites to further ensure that the acquired POI information is reliable and effective.
  • the internal structure of the POI name determining unit 94 in the cluster-based POI name determining system of the present invention is further described below, to set out the details of another embodiment of the POI name determining unit 94.
  • the POI name determining unit 94 further includes a first class identification name determining module and a first POI name determining module:
  • the first class identification name determining module is configured to use the name field with the highest frequency in each class as the class identification name;
  • the first POI name determining module is configured to use each class identification name as a POI name corresponding to the address information.
  • each class identification name is used as a POI name corresponding to the address information. Clustering is performed by keyword: POI names that share the same keyword are placed in the same class. Referring to Table 1 and Table 2, the POI names above fall into 5 classes, which means that there are 5 different POI names at this POI address, which are:
  • the internal structure of the POI name determining unit 94 in the cluster-based POI name determining system of the present invention is further described below, to set out the details of a further embodiment of the POI name determining unit 94.
  • the POI name determining unit 94 further includes a second class identification name determining module and a second POI name determining module:
  • the second class identification name determining module is configured to use the name field with the highest frequency in each class as the class identification name;
  • the second POI name determining module is configured to use the class identification name that occurs most often on the network as the POI name corresponding to the address information.
  • the best POI name is selected according to "voting" on the Internet.
  • the so-called "voting" is based mainly on how frequently a POI name appears on the Internet and on the reliability of its sources: the name that appears most frequently and comes from the most trusted sources is selected as the best name. For example:
  • class D and class E each contain only one name, similar to class A.
  • the cluster-based POI name determination system mines the keywords of POI names according to word frequencies after word segmentation, and clusters by keyword so that differently worded variants of the same POI name are aggregated into one class. This solves the problem that the same latitude and longitude corresponds to multiple POI names, and the Internet "voting" mechanism is used to select the best POI name.
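The two name-selection embodiments above (one POI name per class, or a single name chosen by "voting") can be sketched as follows. This is a minimal sketch, assuming the clustering result is already available as a dictionary from keyword to per-name second frequencies; the function names, the data layout, and the example counts are hypothetical and do not come from the source.

```python
def class_identification_names(classes):
    """Pick, for each class, the name field with the highest second frequency.

    classes -- dict: keyword -> {name_field: second_frequency} (assumed layout)
    """
    return {keyword: max(names, key=names.get) for keyword, names in classes.items()}

def vote_best_name(classes):
    """Among the class identification names, pick the one seen most often."""
    ids = class_identification_names(classes)
    best_keyword = max(ids, key=lambda kw: classes[kw][ids[kw]])
    return ids[best_keyword]

# Illustrative counts only; they are not taken from the patent's tables.
classes = {
    "Evergrande": {
        "Evergrande Real Estate Group Kunming Company": 12,
        "Kunming Evergrande Real Estate": 3,
    },
    "Sunrise": {"Sunrise Cafe": 7},
}
per_class_names = class_identification_names(classes)  # first embodiment
best_name = vote_best_name(classes)                    # second embodiment
```

The first function corresponds to keeping every class identification name as a POI name for the address; the second keeps only the most frequent one. Source reliability could be folded in by feeding it counts taken from reliable sources only.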
  • FIG. 12 is a flow chart showing a cluster-based POI name determination method according to an embodiment of the present invention.
  • a cluster-based POI name determining method includes the following steps:
  • the address data is captured based on the search engine, and the address data includes a name field, address information, and a plurality of pieces of related POI information.
  • the plurality of related POI information is information corresponding to at least one preset attribute of the POI.
  • the preset attribute is a latitude and longitude, an address, a building name, or an included unit name.
  • the address data is captured from network data based on the search engine and includes a name field and address information. It is map address data mined by the search engine from the Internet, for example: name: Evergrande Real Estate Group Kunming Company; address: 14th Floor, Office Building, Block A, Beichen Fortune Center, Panlong District, Kunming City. Here "Evergrande Real Estate Group Kunming Company" is the name of the POI, and "14th Floor, Office Building, Block A, Beichen Fortune Center, Panlong District, Kunming City" is the address of the POI. The latitude and longitude of the address can be obtained by resolving the address.
  • for the address "14th Floor, Office Building, Block A, Beichen Fortune Center, Panlong District, Kunming City", the resolved coordinates are: east longitude 102.733445, north latitude 25.08108.
  • step S122 of the cluster-based POI name determining method of the present invention is further described below, to set out the details of another embodiment of this step.
  • the sub-steps of this step include:
  • S1221: determining one or more keywords based on the name field;
  • step S1221, determining one or more keywords based on the name field, further comprises: performing word segmentation on the name in the name field to generate segmented words; and acquiring the keyword of the name field according to the segmented words.
  • the step of acquiring the keyword of the name field according to the segmented words further comprises: counting the frequency of occurrence of each segmented word corresponding to the same address information as the first frequency; and determining the keyword of the name field according to the first frequency.
  • the step of determining the keyword of the name field according to the first frequency is specifically: selecting the segmented word that has the lowest first frequency and is not a place name as the keyword of the name field.
  • the names in the POI information of the mined address data are segmented into words, and the number of occurrences of each resulting word is counted. Within a given POI name, the word that occurs least frequently (that is, carries the most information) and is not a place name is recorded as the keyword of that POI name. Clustering is then performed by keyword: POI names that share the same keyword are placed in the same class.
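The keyword selection and keyword-based clustering described above can be sketched as follows. The sketch assumes that word segmentation has already been done (each name arrives as a list of tokens) and that a word-frequency table and a set of place names are available. The function names, the frequency values, and the "Sunrise Cafe" entry are hypothetical illustrations rather than data from the source.

```python
from collections import defaultdict

def pick_keyword(tokens, first_frequency, place_names):
    """Return the non-place-name token with the lowest first frequency."""
    candidates = [t for t in tokens if t not in place_names]
    if not candidates:
        return None  # the name consists of place names only
    # Assumption: a token missing from the table is treated as the rarest.
    return min(candidates, key=lambda t: first_frequency.get(t, 0))

def cluster_by_keyword(segmented_names, first_frequency, place_names):
    """Group names (given as token lists) that share the same keyword."""
    clusters = defaultdict(list)
    for tokens in segmented_names:
        keyword = pick_keyword(tokens, first_frequency, place_names)
        clusters[keyword].append(" ".join(tokens))
    return dict(clusters)

# Made-up frequencies, standing in for counts over a large corpus of POI names.
first_frequency = {
    "Evergrande": 1200, "Real Estate": 90000, "Group": 400000,
    "Company": 900000, "Sunrise": 5000, "Cafe": 70000,
}
place_names = {"Kunming"}
segmented_names = [
    ["Evergrande", "Real Estate", "Group", "Kunming", "Company"],
    ["Kunming", "Evergrande", "Real Estate"],
    ["Sunrise", "Cafe"],
]
clusters = cluster_by_keyword(segmented_names, first_frequency, place_names)
```

With these inputs, both Evergrande variants share the keyword "Evergrande" and land in one class, while "Sunrise Cafe" forms its own class.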
  • step S123 of the cluster-based POI name determining method of the present invention is further described below, to set out the details of another embodiment of this step.
  • the sub-steps of this step include:
  • the reliable source is a source having a predetermined degree of confidence.
  • the source is a website or a webpage.
  • reliable websites or web pages include, but are not limited to, large websites such as Sina and Phoenix, officially certified websites, and websites with a high access frequency, large data traffic, no malicious or virus links, and high customer satisfaction.
  • the credibility of a reliable-source website or web page is quantifiable: the credibility of each website or web page can be quantified according to the number of user visits and customer evaluations. Moreover, the credibility of each website or web page changes dynamically. If a website is infected with viruses, carries fraudulent advertisements, or is exploited by malicious fraudulent websites, its credibility is reduced. The present invention quantifies and dynamically adjusts website credibility to further ensure that the acquired POI information is reliable and valid.
  • step S124 of the cluster-based POI name determining method of the present invention is further described below, to set out the details of another embodiment of this step.
  • the sub-steps of this step include:
  • the name field with the highest second frequency in each class is used as the class identification name; each class identification name is taken as a POI name corresponding to the address information.
  • step S124 of the cluster-based POI name determining method of the present invention is further described below, to set out the details of a further embodiment of this step.
  • the sub-steps of this step include:
  • the name field with the highest second frequency in each class is used as the class identification name; the class identification name that occurs most often on the network is used as the POI name corresponding to the address information.
  • the cluster-based POI name determining method mines the keywords of POI names according to word frequencies after word segmentation, and clusters by keyword so that differently worded variants of the same POI name are aggregated into one class. This solves the problem that the same latitude and longitude corresponds to multiple POI names, and the Internet "voting" mechanism is used to select the best POI name.
  • the above embodiment of the present invention captures address data from network data, extracts a name field and address information from it, determines keywords based on the name field, and clusters the keywords corresponding to the same address information.
  • the POI name corresponding to the address information is determined based on the clustered keywords, so that the user can quickly and accurately search for the POI name corresponding to the POI address of the same latitude and longitude, thereby improving the user experience.
  • Figure 15 is a block diagram schematically showing a system for determining the validity of POI information based on address data in a network, in accordance with one embodiment of the present invention.
  • a system for determining validity of POI information based on address data in a network includes:
  • the POI information obtaining unit 511 is configured to acquire, based on a search engine and using address data in the network, a plurality of pieces of related POI information corresponding to the same POI name;
  • the plurality of pieces of related POI information is information corresponding to at least one preset attribute of the POI.
  • the preset attribute is a latitude and longitude, an address, a building name, or an included unit name.
  • the statistics unit 512 is configured to count the number of occurrences of the POI information in the address data in the network;
  • the POI information determining unit 513 is configured to determine valid POI information corresponding to the same POI name according to the number of occurrences of the POI information in the address data in the network.
  • a plurality of pieces of related POI information corresponding to the same POI name is obtained, based on the search engine, from the address data in the network, where the related POI information is information corresponding to at least one preset attribute of the POI, and the preset attribute is a latitude and longitude, an address, a building name, or an included unit name. Valid POI information corresponding to the same POI name is then determined according to the number of occurrences of the POI information in the address data in the network.
  • determining valid POI information corresponding to the same POI name according to the number of occurrences of the POI information in the address data in the network includes: according to the information corresponding to the preset attribute of the related POI information, clustering the name fields corresponding to the same address information by keyword; counting the frequency with which name fields occur in each class after clustering as the second frequency; and determining, according to the second frequency, the POI name of that class for the address information. In other words, the POI name of the class for the address information is determined according to the second frequency, and the Internet "voting" mechanism is used to select trusted POI information for the same POI name.
  • one or more keywords are determined based on the name field, the keywords corresponding to the same address information are clustered, and the clustered name fields are determined according to the clustered keywords.
  • word segmentation is performed on the name in the name field to generate segmented words, and the keyword of the name field is obtained according to the segmented words.
  • the frequency of occurrence of each segmented word corresponding to the same address information is counted as the first frequency, and the keyword of the name field is determined according to the first frequency; specifically, the segmented word that has the lowest first frequency and is not a place name is selected as the keyword of the name field.
  • the present invention may use the name field with the highest frequency in each class as the class identification name and take each class identification name as a POI name corresponding to the address information; or it may use the name field with the highest frequency in each class as the class identification name and take the class identification name that occurs most often on the network as the POI name corresponding to the address information.
  • the names in the POI information of the mined address data are segmented into words, and the number of occurrences of each resulting word is counted. Within a given POI name, the word that occurs least frequently (that is, carries the most information) and is not a place name is taken as the keyword of that POI name. For example, the POI names of the related POI information corresponding to the address data in Table 1 are as shown in Table 2 above (the word frequencies are counted over about 90 million POI names).
  • the internal structure of the statistics unit 512 in the system of the present invention for determining the validity of POI information based on address data in a network is further described below, to set out the details of another embodiment of the statistics unit 512. Referring to FIG. 16, the statistics unit 512 further includes a POI information source obtaining module 5121, a POI information source reliability determining module 5122, and a statistics module 5123:
  • the POI information source obtaining module 5121 is configured to obtain the source of the POI information;
  • the POI information source reliability determining module 5122 is configured to determine whether the source is a reliable source;
  • the statistics module 5123 is configured to count the number of occurrences of the POI information in the address data in the network if the source is a reliable source, and otherwise not to count it.
  • the best POI name is selected according to "voting" on the Internet.
  • the so-called "voting" is based mainly on how frequently a POI name appears on the Internet and on the reliability of its sources: the name that appears most frequently and comes from the most trusted sources is selected as the best name. For example:
  • class D and class E each contain only one name, similar to class A.
  • the internal structure of the POI information determining unit 513 in the system for determining the validity of POI information based on address data in a network is further described below, to set out the details of another embodiment of the POI information determining unit 513. Referring to FIG. 17, the POI information determining unit 513 further includes a judging subunit 5131 and an information point information determining subunit 5132:
  • the judging subunit 5131 is configured to determine whether the number of occurrences of the POI information in the address data in the network is higher than a predetermined threshold;
  • the information point information determining subunit 5132 is configured to determine that the acquired POI information is valid if the result of the judging subunit 5131 is yes.
  • the selected best POI name is filtered according to the frequency and sources of its occurrences on the Internet; POI information whose occurrence count is above a certain threshold is taken as the final POI information.
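The counting and threshold steps described above can be sketched as follows. This is a hedged illustration, not the patented implementation: the `(info, source)` observation layout, the representation of POI information as a `(name, address)` tuple, the function names, the threshold value, the example domains, and the second address are all assumptions made for the example.

```python
from collections import Counter

def count_reliable_occurrences(observations, reliable_sources):
    """Count occurrences of each piece of POI information from reliable sources.

    observations -- iterable of (info, source) pairs, where info is a hashable
                    (poi_name, address) tuple (assumed layout)
    """
    return Counter(info for info, source in observations if source in reliable_sources)

def valid_poi_info(counts, threshold):
    """Keep the POI information whose count is strictly above the threshold."""
    return [info for info, n in counts.items() if n > threshold]

NAME = "Evergrande Real Estate Group Kunming Company"
ADDRESS_A = "14th Floor, Office Building, Block A, Beichen Fortune Center"
ADDRESS_B = "3rd Floor, Some Other Building"  # made-up conflicting address

observations = [
    ((NAME, ADDRESS_A), "portal.example"),
    ((NAME, ADDRESS_A), "certified.example"),
    ((NAME, ADDRESS_A), "portal.example"),
    ((NAME, ADDRESS_B), "spam.example"),    # unreliable source, ignored
    ((NAME, ADDRESS_B), "portal.example"),
]
reliable_sources = {"portal.example", "certified.example"}

counts = count_reliable_occurrences(observations, reliable_sources)
valid = valid_poi_info(counts, threshold=2)
```

The strict comparison follows the text's "higher than a predetermined threshold"; how that threshold is chosen is not specified in the source.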
  • the reliable source is a source having a predetermined degree of confidence.
  • the source is a website or a webpage.
  • websites or web pages from sources of the predetermined credibility include, but are not limited to, large websites such as Sina and Fenghuang.com, officially certified websites, and websites with a relatively high access frequency, large data traffic, no malicious or virus links, and high customer satisfaction.
  • the credibility is quantifiable: the credibility of each website or web page can be quantified according to the number of user visits and customer evaluations. Moreover, the credibility of each website or web page changes dynamically. If a website is infected with viruses, carries fraudulent advertisements, or is exploited by malicious fraudulent websites, its credibility is reduced. The present invention quantifies and dynamically adjusts website credibility to further ensure that the acquired POI information is reliable and valid.
  • a plurality of pieces of related POI information corresponding to the same POI name is obtained by using address data in the network, and valid POI information corresponding to the same POI name is determined according to the number of occurrences of the POI information in the address data in the network. The user can thereby quickly and accurately find the one or more POI names corresponding to a POI address at the same latitude and longitude. The network voting mechanism is then used to filter those POI names according to their information sources and their frequency of occurrence on the Internet, and a highly credible POI name is selected as the POI name corresponding to the current POI address, improving the validity of the POI information.
  • FIG. 18 is a flow chart schematically showing a method of determining validity of POI information based on address data in a network according to an embodiment of the present invention.
  • a method for determining validity of POI information based on address data in a network includes the following steps:
  • the multiple related POI information is information corresponding to at least one preset attribute of the POI.
  • the preset attribute is a latitude and longitude, an address, a building name, or a unit name.
  • the names in the POI information of the mined address data are segmented into words, and the number of occurrences of each resulting word is counted. Within a given POI name, the word that occurs least frequently (that is, carries the most information) and is not a place name is recorded as the keyword of that POI name.
  • step S812 of the method for determining the validity of POI information based on address data in a network is further described below, to set out the details of another embodiment of this step.
  • the sub-steps of this step include:
  • step S8122 determining whether the source is a reliable source, and if so, executing step S123;
  • the best POI name is selected according to "voting" on the Internet.
  • the so-called "voting" is based mainly on how frequently a POI name appears on the Internet and on the reliability of its sources: the name that appears most frequently and comes from the most trusted sources is selected as the best name.
  • step S813 of the method for determining the validity of POI information based on address data in a network is further described below, to set out the details of another embodiment of this step.
  • the sub-steps of this step include:
  • the selected best POI name is filtered according to the frequency and sources of its occurrences on the Internet; POI information whose occurrence count is above a certain threshold is taken as the final POI information.
  • the reliable source is a source having a predetermined degree of confidence.
  • the source is a website or a webpage.
  • websites or web pages from sources of the predetermined credibility include, but are not limited to, large websites such as Sina and Fenghuang.com, officially certified websites, and websites with a relatively high access frequency, large data traffic, no malicious or virus links, and high customer satisfaction.
  • the credibility is quantifiable: the credibility of each website or web page can be quantified according to the number of user visits and customer evaluations. Moreover, the credibility of each website or web page changes dynamically. If a website is infected with viruses, carries fraudulent advertisements, or is exploited by malicious fraudulent websites, its credibility is reduced. The present invention quantifies and dynamically adjusts website credibility to further ensure that the acquired POI information is reliable and valid.
  • the keywords of POI names are mined according to word frequencies after word segmentation, and clustering is performed by keyword so that variants of the same POI name are grouped into one class, solving the problem that one latitude and longitude corresponds to multiple POI names. The Internet "voting" mechanism is used to select the best POI name and to select trusted POI information.
  • the foregoing embodiment of the present invention acquires a plurality of pieces of related POI information corresponding to the same POI name by using address data in the network, and determines valid POI information corresponding to the same POI name according to the number of occurrences of the POI information in the address data in the network. The user can thereby quickly and accurately find the one or more POI names corresponding to a POI address at the same latitude and longitude. The network voting mechanism is then used to filter those POI names according to their information sources and their frequency of occurrence on the Internet, and a highly reliable POI name is selected as the POI name corresponding to the current POI address, improving the validity of the POI information.
  • modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from the embodiment.
  • the modules or units or components of the embodiments may be combined into one module or unit or component, and further they may be divided into a plurality of sub-modules or sub-units or sub-components.
  • all features disclosed in this specification (including the accompanying claims, the abstract and the drawings), and all processes or units of any method or device so disclosed, may be combined in any combination.
  • Each feature disclosed in this specification (including the accompanying claims, the abstract and the drawings) may be replaced by alternative features that provide the same, equivalent or similar purpose.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the cluster-based POI name determining system in accordance with embodiments of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • Figure 21 schematically illustrates a block diagram of a computing device for performing the method in accordance with the present invention.
  • the computing device conventionally includes a processor 2110 and a computer program product or computer readable medium in the form of a memory 2120.
  • the memory 2120 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • Memory 2120 has a storage space 2130 for program code 2131 for performing any of the method steps described above.
  • storage space 2130 for program code may include various program code 2131 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
  • the storage unit may have a storage segment, a storage space, and the like that are similarly arranged to the storage 2120 in the computing device of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit comprises computer readable code 2131' for performing the steps of the method according to the invention, that is, code that can be read by a processor such as 2110; when executed by the computing device, this code causes the computing device to perform the various steps in the methods described above.
  • the present invention is applicable to computer systems/servers that can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known computing systems, environments, and/or configurations suitable for use with computer systems/servers include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments including any of the above, and the like.
  • the computer system/server can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • "one embodiment," "an embodiment," or "one or more embodiments" as used in the context of the present invention means that the particular features, structures, or characteristics described in connection with the embodiments are included in at least one embodiment of the invention.
  • the phrase "in one embodiment" does not necessarily refer to the same embodiment each time it is used.

Abstract

A system and method for determining a POI name and for determining validity of POI information. The method comprises: capturing address data from network data (S11); respectively extracting a name field and address information from one or a plurality of items of captured address data (S12); on the basis of the name field, determining one or more key words (S13); clustering the key words corresponding to the same address information to generate at least one cluster (S14); and according to the clustered key word, determining a POI name corresponding to the address information (S15). The method enables a user to rapidly and accurately search for a POI name corresponding to the POI addresses at the same latitude and longitude, thereby improving the user experience.

Description

System and method for determining POI name and determining validity of POI information
Technical field
The present invention relates to the field of electronic map technology, and in particular to a system and method for determining a POI name based on clustering, a cluster-based POI name determining system and method, and a system and method for determining the validity of POI information based on address data in a network.
Background art
A point of interest (POI) is generally a geographic information point marked on an electronic map, and usually includes information such as a POI identifier, a POI name, a POI type, a longitude, and a latitude. A POI can be marked on the map with latitude and longitude information and can be used to find and compute landmarks or buildings for navigation, such as shopping malls, parking lots, schools, hospitals, hotels, restaurants, supermarkets, parks, and tourist attractions.
More and more users query POIs in electronic maps, and the POI data stored in a database provides the data support for POI queries. At present, POI data in the database is updated mainly by field data collection, with the stored POI data updated according to the collected data, or by obtaining POI data from various lifestyle information websites on the Internet: as long as an item of obtained data includes the name and address of a POI, it can be treated as an item of POI data. Because of the way POI data is obtained and updated, a wide variety of POI data inevitably exists on the Internet. POI data obtained from different source websites may therefore contain duplicates, that is, multiple items of POI data that actually describe the same POI, with the same actual longitude and latitude but differently worded POI names and POI addresses. Such duplicate POI data prevents users from quickly and accurately finding the POI name corresponding to a POI address at a given POI geographic location (latitude and longitude), which degrades the user experience.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a system for determining a POI name based on clustering and a corresponding method for determining a POI name based on clustering, a cluster-based POI name determining system and method, and a system and method for determining the validity of POI information based on address data in a network, which overcome the above problems or at least partially solve or alleviate them.
According to one aspect of the present invention, a system for determining a POI name based on clustering is provided, the system comprising:
an address data grabber, configured to capture address data from network data;
an address data parser, configured to extract a name field and address information from each of the one or more captured items of address data;
a keyword determiner, configured to determine one or more keywords based on the name field;
a keyword clusterer, configured to cluster the keywords corresponding to the same address information to generate at least one class;
a POI name generator, configured to determine the POI name corresponding to the address information according to the clustered keywords.
According to another aspect of the present invention, a method for determining a POI name based on clustering is provided, comprising:
capturing address data from network data;
extracting a name field and address information from each of the one or more captured items of address data;
determining one or more keywords based on the name field;
clustering the keywords corresponding to the same address information to generate at least one class;
determining the POI name corresponding to the address information according to the clustered keywords.
According to yet another aspect of the present invention, a cluster-based POI name determination system is provided, the system comprising:
an address data grabber, configured to grab address data from network data by means of a search engine, the address data including a name field and address information;
a name field clusterer, configured to cluster, by keyword, the name fields that correspond to the same address information;
a second frequency counter, configured to count how often each name field occurs in each class after clustering, as a second frequency;
a POI name determining unit, configured to determine, according to the second frequency, the POI name of that class for the address information.
According to a further aspect of the present invention, a cluster-based POI name determination method is provided, comprising:
grabbing address data from network data, the address data including a name field and address information;
clustering, by keyword, the name fields that correspond to the same address information;
counting how often each name field occurs in each class after clustering, as a second frequency;
determining, according to the second frequency, the POI name of that class for the address information.
According to a further aspect of the present invention, a system for determining the validity of POI information based on address data in a network is provided, the system comprising:
a POI information acquiring unit, configured to acquire, by means of a search engine and using address data in the network, multiple pieces of related POI information corresponding to the same POI name;
a statistics unit, configured to count the number of occurrences of the POI information in the address data in the network;
a POI information determining unit, configured to determine the valid POI information corresponding to the same POI name according to the number of occurrences of the POI information in the address data in the network.
According to a further aspect of the present invention, a method for determining the validity of POI information based on address data in a network is provided, comprising:
acquiring, using address data in the network, multiple pieces of related POI information corresponding to the same POI name;
counting the number of occurrences of the POI information in the address data in the network;
determining the valid POI information corresponding to the same POI name according to the number of occurrences of the POI information in the address data in the network.
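As a rough illustration of the counting step, the sketch below tallies how often each candidate value of one attribute was observed for a single POI name and keeps the most frequent one. Treating the most frequent value as the valid one is an assumption made here for illustration only; the function name `most_supported_info` and the sample values are hypothetical.

```python
from collections import Counter


def most_supported_info(observations):
    """Return (value, count) for the most frequently observed value, or None if empty."""
    counts = Counter(observations)
    if not counts:
        return None
    return counts.most_common(1)[0]


# Hypothetical address strings observed in the network for one POI name.
observed = [
    "14F, Tower A, Example Fortune Center",
    "14F, Tower A, Example Fortune Center",
    "Tower A, Example Fortune Center",
]
best = most_supported_info(observed)
```

A threshold on the count, rather than a plain maximum, would be an equally plausible reading of the claim.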
According to a further aspect of the present invention, a computer program is provided, comprising computer-readable code which, when run on a computing device, causes the computing device to perform the above method for determining a POI name based on clustering, the above cluster-based POI name determination method, or the above method for determining the validity of POI information based on address data in a network.
According to a further aspect of the present invention, a computer-readable medium is provided, in which the above computer program is stored.
The beneficial effects of the present invention are as follows:
The present invention extracts the name field and the address information from address data grabbed from network data, determines one or more keywords based on the name field, clusters the keywords that correspond to the same address information, and determines the POI name corresponding to the address information based on the clustered keywords, so that users can quickly and accurately find the POI name corresponding to a POI address at a given longitude and latitude, improving the user experience.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the contents of the specification, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments of the invention are set forth below.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
FIG. 1 schematically shows a block diagram of a system for determining a POI name based on clustering according to one embodiment of the present invention;
FIG. 2 schematically shows a block diagram of the keyword determiner in a system for determining a POI name based on clustering according to another embodiment of the present invention;
FIG. 3 schematically shows a block diagram of the POI name generator in a system for determining a POI name based on clustering according to another embodiment of the present invention;
FIG. 4 schematically shows a block diagram of the POI name generator in a system for determining a POI name based on clustering according to another embodiment of the present invention;
FIG. 5 schematically shows a flowchart of a method for determining a POI name based on clustering according to one embodiment of the present invention;
FIG. 6 schematically shows a detailed flowchart of step S13 of a method for determining a POI name based on clustering according to another embodiment of the present invention;
FIG. 7 schematically shows a detailed flowchart of step S15 of a method for determining a POI name based on clustering according to another embodiment of the present invention; and
FIG. 8 schematically shows a detailed flowchart of step S15 of a method for determining a POI name based on clustering according to another embodiment of the present invention;
FIG. 9 schematically shows a block diagram of a cluster-based POI name determination system according to one embodiment of the present invention;
FIG. 10 schematically shows a block diagram of the name field clusterer in a cluster-based POI name determination system according to another embodiment of the present invention;
FIG. 11 schematically shows a block diagram of the second frequency counter in a cluster-based POI name determination system according to another embodiment of the present invention;
FIG. 12 schematically shows a flowchart of a cluster-based POI name determination method according to one embodiment of the present invention;
FIG. 13 schematically shows a detailed flowchart of step S122 of a cluster-based POI name determination method according to another embodiment of the present invention; and
FIG. 14 schematically shows a detailed flowchart of step S123 of a cluster-based POI name determination method according to another embodiment of the present invention.
FIG. 15 schematically shows a block diagram of a system for determining the validity of POI information based on address data in a network according to one embodiment of the present invention;
FIG. 16 schematically shows a block diagram of the statistics unit in a system for determining the validity of POI information based on address data in a network according to another embodiment of the present invention;
FIG. 17 schematically shows a block diagram of the POI information determining unit in a system for determining the validity of POI information based on address data in a network according to another embodiment of the present invention;
FIG. 18 schematically shows a flowchart of a method for determining the validity of POI information based on address data in a network according to one embodiment of the present invention;
FIG. 19 schematically shows a detailed flowchart of step S812 of a method for determining the validity of POI information based on address data in a network according to another embodiment of the present invention; and
FIG. 20 schematically shows a detailed flowchart of step S813 of a method for determining the validity of POI information based on address data in a network according to another embodiment of the present invention.
FIG. 21 schematically shows a block diagram of a computing device for performing the method according to the present invention; and
FIG. 22 schematically shows a storage unit for holding or carrying program code that implements the method according to the present invention.
Specific Embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the present invention refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general-purpose dictionaries are to be understood as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined, are not to be interpreted in an idealized or overly formal sense.
FIG. 1 shows a block diagram of a system for determining a POI name based on clustering according to one embodiment of the present invention.
Referring to FIG. 1, the system for determining a POI name based on clustering according to this embodiment includes:
an address data grabber 11, configured to grab address data from network data;
an address data parser 12, configured to extract a name field and address information from each of the one or more pieces of grabbed address data;
a keyword determiner 13, configured to determine one or more keywords based on the name field;
a keyword clusterer 14, configured to cluster the keywords that correspond to the same address information, generating at least one class;
a POI name generator 15, configured to determine the POI name corresponding to that address information according to the clustered keywords.
In this embodiment of the present invention, address data in the network is obtained by means of a search engine. The address data includes a name field, address information, and multiple pieces of related POI information. In this embodiment, the related POI information is information on at least one preset attribute of the POI. Further, the preset attribute is the longitude and latitude, the address, the building name, or the names of the organizations located there.
In this embodiment, address data is grabbed from network data by means of a search engine, and the address data includes a name field and address information. Consider a map address record mined from the Internet by the search engine, for example name: So-and-so Real Estate Group ** Branch; address: 14th Floor, Office Building A, 8* Fortune Center, ** District, ** City. Here "So-and-so Real Estate Group ** Branch" is the name of the POI and "14th Floor, Office Building A, 8* Fortune Center, ** District, ** City" is the address of this POI. Resolving the address to coordinates yields the longitude and latitude of its location; for this address the result is longitude 102.733445 E, latitude 25.08108 N. In addition, the number of times the POI information appears on the Internet must be counted and its sources recorded.
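The extraction of the name field and the address information from one record can be sketched as follows. The `name: ...; address: ...` layout is taken from the example above, but treating it as a strict `key: value; key: value` format is an assumption, and the names `AddressRecord` and `parse_record` are hypothetical. Resolving the address to longitude and latitude is outside this sketch.

```python
from typing import NamedTuple


class AddressRecord(NamedTuple):
    name: str      # the name field (POI name candidate)
    address: str   # the address information


def parse_record(raw: str) -> AddressRecord:
    """Split a 'name: ...; address: ...' string into its two fields."""
    fields = {}
    for part in raw.split(";"):
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return AddressRecord(fields["name"], fields["address"])


record = parse_record(
    "name: Example Real Estate Group Branch; "
    "address: 14F, Office Building A, Example Fortune Center"
)
```

A record that lacks either field raises `KeyError` here; a real parser would more likely skip or log such records.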
The POI information from the different sources, corresponding to the address data finally mined from the Internet, has the format shown in Table 1:
Table 1: Format of POI information from different information sources
Figure PCTCN2015095857-appb-000001
Figure PCTCN2015095857-appb-000002
As Table 1 shows, POI data obtained from different source websites for the same geographic location (the same longitude and latitude) may contain duplicates. One address (longitude and latitude) may carry multiple POI names: in Table 1, several companies share one longitude and latitude, so the actual POI longitude and latitude are the same while the POI names and POI addresses are worded differently. The same POI name may also appear in several variants, such as "保山明志汽车销售有限公司" (Baoshan Mingzhi Automobile Sales Co., Ltd.) and "保山明志汽车销售服务有限公司" (Baoshan Mingzhi Automobile Sales and Service Co., Ltd.). Such duplicate POI data prevents users from quickly and accurately finding the POI name that corresponds to a POI address at a given geographic location (longitude and latitude).
To address this, this embodiment grabs address data from network data, extracts a name field and address information from each of the one or more pieces of grabbed address data, determines one or more keywords based on the name field, clusters the keywords that correspond to the same address information to generate at least one class, and determines the POI name corresponding to that address information according to the clustered keywords, thereby obtaining the best POI name.
To further illustrate the advantages of the invention, the internal structure of the keyword determiner 13 of the system for determining a POI name based on clustering is disclosed below for another embodiment. Referring to FIG. 2, the keyword determiner 13 further includes a word segmentation unit 131 and a keyword acquisition unit 132:
the word segmentation unit 131 is configured to segment the name in the name field into words;
the keyword acquisition unit 132 is configured to obtain the keyword of the address data from the segmented words.
The keyword acquisition unit further includes:
a first frequency counting module, configured to count how often each segmented word corresponding to the same address information occurs, as a first frequency;
a keyword generation module, configured to generate the keyword of the address data according to the first frequency.
The keyword generation module selects the segmented word that has the lowest frequency and is not a place name as the keyword of the address data.
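The selection rule of the keyword generation module can be sketched as below. The sketch assumes the name has already been segmented into words and that a word-frequency table and a list of place names are available; the function name `pick_keyword`, the frequency values, and the segmentation shown are all hypothetical.

```python
def pick_keyword(tokens, token_freq, place_names):
    """Return the lowest-frequency token that is not a place name, or None."""
    candidates = [t for t in tokens if t not in place_names]
    if not candidates:
        return None
    # Tokens missing from the table count as frequency 0, i.e. as the rarest.
    return min(candidates, key=lambda t: token_freq.get(t, 0))


# Hypothetical segmentation of "保山明志汽车销售有限公司" and made-up frequencies.
tokens = ["保山", "明志", "汽车", "销售", "有限公司"]
token_freq = {"保山": 5000, "明志": 12, "汽车": 90000, "销售": 80000, "有限公司": 400000}
place_names = {"保山"}

keyword = pick_keyword(tokens, token_freq, place_names)
```

Excluding place names matters because a place name can be rare in the frequency table yet is shared by every POI at that location, so it cannot tell two POIs apart.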
In this embodiment, the POI names in the mined address data are segmented into words, and the number of occurrences of each word is counted. Within one POI name, the word that occurs least frequently (and therefore carries the most information) and is not a place name is taken as the keyword of that POI name. For example, Table 2 shows the POI names from the related POI information of the address data in Table 1 after word segmentation (the word frequencies were computed over about 90 million POI names); the second column of Table 2 lists the keywords obtained:
Table 2: POI names after word segmentation
Figure PCTCN2015095857-appb-000003
Figure PCTCN2015095857-appb-000004
Clustering by keyword: POI names that share the same keyword are assigned to the same class. The POI names above fall into five classes, which means that five different POI names exist at this POI address.
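Grouping names by shared keyword amounts to a dictionary keyed by keyword. The sketch below assumes each name already has its keyword attached; the keywords "明志" and "长城" are illustrative guesses, and `cluster_by_keyword` is a hypothetical name.

```python
from collections import defaultdict


def cluster_by_keyword(name_keyword_pairs):
    """Group POI names observed at one address by their keyword."""
    clusters = defaultdict(list)
    for name, keyword in name_keyword_pairs:
        clusters[keyword].append(name)
    return dict(clusters)


# Names observed at one address, each paired with an assumed keyword.
pairs = [
    ("保山明志汽车销售有限公司", "明志"),
    ("保山明志汽车销售服务有限公司", "明志"),
    ("保山长城汽车4S店", "长城"),
]
clusters = cluster_by_keyword(pairs)
```

The number of keys in `clusters` is then the number of distinct POIs assumed to exist at that address.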
To further illustrate the advantages of the invention, the internal structure of the POI name generator 15 of the system for determining a POI name based on clustering is disclosed below for another embodiment. Referring to FIG. 3, the POI name generator 15 further includes a frequency statistics unit 151, a class identifier name determining unit 152, and a POI name determining unit 153:
the frequency statistics unit 151 is configured to calculate how frequently each name field occurs in each class;
the class identifier name determining unit 152 is configured to take the most frequent name field in each class as the class identifier name;
the POI name determining unit 153 is configured to use every class identifier name as a POI name.
In this embodiment, every class identifier name is used as a POI name. Specifically, clustering by keyword assigns POI names with the same keyword to the same class; the POI names above fall into five classes, that is, five different POI names exist at this POI address:
A: 保山博鑫源汽车贸易有限公司 (Baoshan Boxinyuan Automobile Trading Co., Ltd.);
B: 云南省澜沧江啤酒集团保山有限公司 (Yunnan Lancang River Beer Group Baoshan Co., Ltd.) and 云南省澜沧江啤酒集团保山有限公司(地图标注) (the same name followed by "(map annotation)");
C: 保山明志汽车销售有限公司 (Baoshan Mingzhi Automobile Sales Co., Ltd.) and 保山明志汽车销售服务有限公司 (Baoshan Mingzhi Automobile Sales and Service Co., Ltd.);
D: 保山长城汽车4S店 (Baoshan Great Wall Motor 4S Store);
E: 保山融易通汽车销售有限公司(雪佛兰4S店) (Baoshan Rongyitong Automobile Sales Co., Ltd. (Chevrolet 4S Store)).
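The three units of this embodiment can be sketched in a few lines: count how often each name occurs in a class, take the most frequent one as the class identifier name, and return one such name per class. The occurrence counts and class labels in the example are made up, and `class_identifier_names` is a hypothetical name.

```python
from collections import Counter


def class_identifier_names(classes):
    """Map each class to its most frequent name; every result is a POI name."""
    return {
        label: Counter(occurrences).most_common(1)[0][0]
        for label, occurrences in classes.items()
        if occurrences
    }


# Each list holds one entry per observed occurrence of a name (hypothetical counts).
classes = {
    "C": ["保山明志汽车销售有限公司"]
    + ["保山明志汽车销售服务有限公司"] * 3,
    "D": ["保山长城汽车4S店"],
}
poi_names = class_identifier_names(classes)
```

With these counts, class C is represented by the longer "销售服务" variant because it was seen three times against once.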
To further illustrate the advantages of the invention, the internal structure of the POI name generator 15 of the system for determining a POI name based on clustering is disclosed below for yet another embodiment. Referring to FIG. 4, the POI name generator 15 further includes a frequency statistics unit 151′, a class identifier name determining unit 152′, and a POI name determining unit 153′:
the frequency statistics unit 151′ is configured to calculate how frequently each name field occurs in each class;
the class identifier name determining unit 152′ is configured to take the most frequent name field in each class as the class identifier name;
the POI name determining unit 153′ is configured to select the most frequent class identifier name as the POI name.
In this embodiment, the best POI name within a class is chosen by "voting" on the Internet. The "vote" is based mainly on how often the POI name appears on the Internet and on the credibility of its sources: the name that appears most often and comes from the most credible sources is selected as the best name. For example:
Class A contains only one name, so that name is the best one.
Class B contains two names, of which "云南省澜沧江啤酒集团保山有限公司" appears most frequently and is taken as the best name.
Class C contains two names, of which "保山明志汽车销售服务有限公司" appears most frequently and is taken as the best name.
Classes D and E likewise contain only one name each, as with class A.
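One way to combine occurrence frequency and source credibility into a single "vote" is to let each sighting of a name contribute the credibility weight of its source. This weighted sum is an assumption made for illustration; the text does not specify how the two factors are combined. The name `vote_best_name`, the source labels, and the weights are hypothetical.

```python
from collections import defaultdict


def vote_best_name(sightings, credibility, default_weight=1.0):
    """Pick the name with the highest credibility-weighted number of sightings.

    sightings: iterable of (name, source) pairs, one per occurrence.
    credibility: mapping from source to a non-negative weight.
    """
    scores = defaultdict(float)
    for name, source in sightings:
        scores[name] += credibility.get(source, default_weight)
    if not scores:
        return None
    return max(scores, key=scores.get)


sightings = [
    ("Name variant 1", "site-a"),
    ("Name variant 2", "site-b"),
    ("Name variant 2", "site-c"),
]
credibility = {"site-a": 0.75, "site-b": 0.5, "site-c": 0.5}
best_name = vote_best_name(sightings, credibility)
```

With all weights equal, the rule reduces to choosing the most frequent name, which matches the class B and class C examples above.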
In this embodiment, a reliable source is a source having a predetermined credibility, where the source is a website or a web page.
Reliable websites or web pages include, but are not limited to, large websites such as Sina and Phoenix Net (凤凰网), officially certified websites, websites with a high visit frequency and large data traffic, and websites that carry no malicious or virus links and have high customer satisfaction.
In this embodiment, the credibility of a reliable website or web page is quantifiable: the credibility of each website or web page can be quantified according to, for example, the number of user visits and customer ratings. The credibility of each website or web page also changes dynamically. If a website currently hosts viruses or fraudulent advertisements, or is being exploited by other malicious or fraudulent websites, its credibility decreases accordingly. By quantifying website credibility and adjusting it dynamically, the invention further ensures that the acquired POI information is reliable and valid.
The system for determining a POI name based on clustering provided by this embodiment mines the keyword of a POI name from the word frequencies obtained after segmentation and clusters by that keyword, so that differently worded variants of the same POI name fall into one class. This resolves the problem of one longitude and latitude corresponding to multiple POI names, and the Internet "voting" mechanism is used to select the best POI name.
FIG. 5 shows a flowchart of a method for determining a POI name based on clustering according to one embodiment of the present invention.
Referring to FIG. 5, the method for determining a POI name based on clustering according to this embodiment includes the following steps:
S11: grab address data from network data;
S12: extract a name field and address information from each of the one or more pieces of grabbed address data;
S13: determine one or more keywords based on the name field;
S14: cluster the keywords that correspond to the same address information, generating at least one class;
S15: determine the POI name corresponding to that address information according to the clustered keywords.
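Steps S13 to S15 can be tied together in one short sketch. It assumes that S11 and S12 have already produced (name, address) pairs and that a segmentation function, a word-frequency table, and a place-name list are supplied by the caller; `determine_poi_names` and all sample data are hypothetical, and whitespace splitting stands in for real word segmentation.

```python
from collections import Counter, defaultdict


def determine_poi_names(records, segment, token_freq, place_names):
    """Map each address to the POI names found there (steps S13 to S15)."""
    # address -> keyword -> Counter of the names that produced that keyword
    classes = defaultdict(lambda: defaultdict(Counter))
    for name, address in records:
        # S13: the keyword is the rarest token that is not a place name.
        candidates = [t for t in segment(name) if t not in place_names]
        if not candidates:
            continue
        keyword = min(candidates, key=lambda t: token_freq.get(t, 0))
        # S14: names at the same address that share a keyword form one class.
        classes[address][keyword][name] += 1
    # S15: the most frequent name of each class becomes a POI name.
    return {
        address: sorted(names.most_common(1)[0][0] for names in by_keyword.values())
        for address, by_keyword in classes.items()
    }


records = [
    ("Baoshan Mingzhi Auto Sales", "addr-1"),
    ("Baoshan Mingzhi Auto Sales Service", "addr-1"),
    ("Baoshan Mingzhi Auto Sales Service", "addr-1"),
    ("Baoshan Great-Wall Auto 4S", "addr-1"),
]
token_freq = {"Baoshan": 500, "Mingzhi": 3, "Great-Wall": 40,
              "Auto": 900, "Sales": 800, "Service": 700, "4S": 300}
result = determine_poi_names(records, str.split, token_freq, {"Baoshan"})
```

Here the two "Mingzhi" variants collapse into one class represented by the more frequent variant, while the "Great-Wall" name forms a class of its own.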
In this embodiment, address data is grabbed from network data by means of a search engine, and the address data includes a name field and address information. Consider a map address record mined from the Internet by the search engine, for example name: 恒大地产集团昆明公司 (Evergrande Real Estate Group Kunming Company); address: 昆明市盘龙区北辰财富中心A座写字楼14楼 (14th Floor, Office Building A, Beichen Fortune Center, Panlong District, Kunming). Here "恒大地产集团昆明公司" is the name of the POI and "昆明市盘龙区北辰财富中心A座写字楼14楼" is the address of this POI. Resolving the address to coordinates yields the longitude and latitude of its location; for this address the result is longitude 102.733445 E, latitude 25.08108 N. In addition, the number of times the POI information appears on the Internet must be counted and its sources recorded.
POI data obtained from different source websites for the same geographic location (the same longitude and latitude) may contain duplicates. One address (longitude and latitude) may carry multiple POI names: for example, several companies may share one longitude and latitude, so the actual POI longitude and latitude are the same while the POI names and POI addresses are worded differently. The same POI name may also appear in several variants, such as "保山明志汽车销售有限公司" (Baoshan Mingzhi Automobile Sales Co., Ltd.) and "保山明志汽车销售服务有限公司" (Baoshan Mingzhi Automobile Sales and Service Co., Ltd.). Such duplicate POI data prevents users from quickly and accurately finding the POI name that corresponds to a POI address at a given geographic location (longitude and latitude).
To address this, this embodiment grabs address data from network data, extracts a name field and address information from each of the one or more pieces of grabbed address data, determines one or more keywords based on the name field, clusters the keywords that correspond to the same address information to generate at least one class, and determines the POI name corresponding to that address information according to the clustered keywords, thereby obtaining the best POI name.
为了进一步体现发明的优越性,如下进一步揭示本发明基于聚类确定POI名称的方法中步骤S13的细分步骤,来体现依据本步骤实现的另一实施例。参照图6,本步骤的细分步骤包括:In order to further embody the superiority of the invention, the subdivision step of step S13 in the method for determining the POI name based on clustering according to the present invention is further disclosed as follows to embody another embodiment implemented according to this step. Referring to Figure 6, the subdivision steps of this step include:
S131. Segment the name in the name field into words to generate word segments;
S132. Obtain the keyword of the address data from the word segments.
Step S132, obtaining the keyword of the address data from the word segments, further includes:
counting the number of occurrences of each word segment corresponding to the same address information, as a first frequency;
generating the keyword of the address data according to the first frequency.
The step of generating the keyword of the address data according to the first frequency is specifically:
selecting the word segment that has the lowest frequency and is not a place name as the keyword of the address data.
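The keyword rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the names have already been segmented by some tokenizer, and the function name `pick_keyword`, the frequency table, and the place-name set are all hypothetical. The frequency values are invented for the example.

```python
def pick_keyword(tokens, frequency, place_names):
    """Return the lowest-frequency token that is not a place name, or None."""
    candidates = [t for t in tokens if t not in place_names]
    if not candidates:
        return None
    # Tokens missing from the table are treated as frequency 0 (rarest).
    return min(candidates, key=lambda t: frequency.get(t, 0))


# Hypothetical token counts (the "first frequency"); numbers are invented.
FREQUENCY = {
    "保山": 5000, "明志": 12, "汽车": 90000,
    "销售": 120000, "服务": 150000, "有限公司": 900000,
}
PLACE_NAMES = {"保山"}  # assumed place-name dictionary

# Two wordings of the same POI name, already segmented (assumed segmentation).
name_a = ["保山", "明志", "汽车", "销售", "有限公司"]
name_b = ["保山", "明志", "汽车", "销售", "服务", "有限公司"]
keywords = [pick_keyword(n, FREQUENCY, PLACE_NAMES) for n in (name_a, name_b)]
```

With these assumed counts, both wordings yield the same keyword, which is what allows them to be clustered together later.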
To further illustrate the advantages of the invention, the sub-steps of step S15 of the clustering-based method for determining a POI name are disclosed below as another embodiment. Referring to FIG. 7, the sub-steps include:
S151. Calculate the frequency of occurrence of each name field in each class;
S152. Take the name field with the highest frequency of occurrence in each class as the class identifier name;
S153. Take every class identifier name as a POI name.
In this embodiment, every class identifier name is taken as a POI name corresponding to the address information. Clustering is by keyword: POI names that share the same keyword are placed in the same class. The POI names above can be grouped into five classes, meaning that five different POIs exist at this POI address:
A: 保山博鑫源汽车贸易有限公司 (Baoshan Boxinyuan Automobile Trading Co., Ltd.);
B: 云南省澜沧江啤酒集团保山有限公司 (Yunnan Lancang River Beer Group Baoshan Co., Ltd.) and 云南省澜沧江啤酒集团保山有限公司(地图标注) (the same name with the suffix “map annotation”);
C: 保山明志汽车销售有限公司 (Baoshan Mingzhi Automobile Sales Co., Ltd.) and 保山明志汽车销售服务有限公司 (Baoshan Mingzhi Automobile Sales Service Co., Ltd.);
D: 保山长城汽车4S店 (Baoshan Great Wall Motor 4S Store);
E: 保山融易通汽车销售有限公司(雪佛兰4S店) (Baoshan Rongyitong Automobile Sales Co., Ltd. (Chevrolet 4S Store)).
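Clustering by keyword and picking a class identifier name (steps S151 and S152) might look like the sketch below. The function names are hypothetical, each observation is assumed to already carry its keyword, and the occurrence counts in the sample data are invented for illustration.

```python
from collections import Counter, defaultdict


def cluster_by_keyword(observations):
    """Group observed names by keyword, counting how often each name occurs.

    observations: iterable of (name, keyword) pairs for one address.
    """
    classes = defaultdict(Counter)
    for name, keyword in observations:
        classes[keyword][name] += 1
    return classes


def class_identifier_names(classes):
    """For each class, take its most frequent name as the class identifier name."""
    return {kw: names.most_common(1)[0][0] for kw, names in classes.items()}


# Invented sightings for one address: two wordings in class "明志", one name in "长城".
observations = [
    ("保山明志汽车销售服务有限公司", "明志"),
    ("保山明志汽车销售服务有限公司", "明志"),
    ("保山明志汽车销售有限公司", "明志"),
    ("保山长城汽车4S店", "长城"),
]
labels = class_identifier_names(cluster_by_keyword(observations))
```

Under step S153, every value in `labels` would then be reported as a POI name for the address.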
To further illustrate the advantages of the invention, an alternative set of sub-steps of step S15 of the clustering-based method for determining a POI name is disclosed below as another embodiment. Referring to FIG. 8, the sub-steps include:
S151′. Calculate the frequency of occurrence of each name field in each class;
S152′. Take the name field with the highest frequency of occurrence in each class as the class identifier name;
S153′. Select the class identifier name with the highest frequency of occurrence as the POI name.
In this embodiment, the best POI name within a class is chosen by a “vote” on the Internet. The “vote” is based mainly on how often the POI name appears on the Internet and on the credibility of its sources: the name that appears most frequently on the Internet and comes from the most credible sources is selected as the best name. For example:
Class A contains only one name, so that name is the best one.
Class B contains two names; “云南省澜沧江啤酒集团保山有限公司” appears most frequently and is taken as the best name.
Class C contains two names; “保山明志汽车销售服务有限公司” appears most frequently and is taken as the best name.
Classes D and E likewise each contain only one name, as with class A.
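The text does not say how occurrence frequency and source credibility are combined in the “vote”. One possible reading, shown here purely as an assumption, is to weight each sighting of a name by the credibility of the source it came from and pick the name with the highest total. The function `vote`, the source identifiers, and the credibility scores are hypothetical.

```python
from collections import defaultdict


def vote(sightings, credibility, default=0.0):
    """Pick the name with the highest credibility-weighted sighting count.

    sightings: iterable of (name, source) pairs, one per occurrence on the web.
    credibility: dict mapping source -> weight (assumed to lie in [0, 1]).
    Returns None when there are no sightings.
    """
    scores = defaultdict(float)
    for name, source in sightings:
        scores[name] += credibility.get(source, default)
    if not scores:
        return None
    return max(scores, key=scores.get)


# Invented credibility scores and sightings for class C.
credibility = {"site-a": 0.9, "site-b": 0.4}
sightings = [
    ("保山明志汽车销售服务有限公司", "site-a"),
    ("保山明志汽车销售服务有限公司", "site-a"),
    ("保山明志汽车销售有限公司", "site-b"),
]
best = vote(sightings, credibility)
```

A plain majority count is the special case in which every source has weight 1.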
In this embodiment of the present invention, a reliable source is a source having a predetermined credibility, where the source is a website or a web page.
Websites or web pages that are reliable sources include, but are not limited to, large websites such as Sina (新浪) and Phoenix (凤凰网), officially certified websites, websites with high visit frequency and heavy data traffic, and websites that carry no malicious or virus links and have high customer satisfaction.
In this embodiment of the present invention, the credibility of a website or web page is quantifiable: it can be quantified according to, for example, the number of user visits and customer ratings. The credibility of each website or web page also changes dynamically. If a website is found to contain viruses or fraudulent advertisements, or is exploited by other malicious or fraudulent websites, its credibility is reduced accordingly. By quantifying and dynamically adjusting website credibility, the present invention further ensures that the acquired POI information is reliable and valid.
The clustering-based method for determining a POI name provided by this embodiment mines the keyword of each POI name according to word frequency after segmentation and clusters by that keyword, so that different wordings of the same POI name fall into one class. This solves the problem of one latitude and longitude corresponding to multiple POI names, and an Internet “voting” mechanism is then used to select the best POI name.
In summary, the above embodiments extract the name field and address information from address data crawled from network data, determine keywords based on the name field, cluster the keywords corresponding to the same address information, and determine the POI name corresponding to the address information from the clustered keywords. Users can therefore quickly and accurately find the POI name corresponding to a POI address at a given latitude and longitude, which improves the user experience.
FIG. 9 shows a block diagram of a clustering-based POI name determining system according to one embodiment of the present invention.
Referring to FIG. 9, the clustering-based POI name determining system of this embodiment includes:
an address data crawler 91, configured to crawl address data from network data based on a search engine, the address data including a name field and address information;
a name field clusterer 92, configured to cluster the name fields corresponding to the same address information by keyword;
a second frequency counter 93, configured to count the occurrences of the name fields in each class after clustering, as a second frequency;
a POI name determining unit 94, configured to determine, according to the second frequency, the POI name of the class corresponding to the address information.
In this embodiment of the present invention, address data in the network is used based on a search engine; the address data includes a name field, address information, and a plurality of related POI information items. The related POI information items are information on at least one preset attribute of the corresponding POI. Further, the preset attribute is latitude and longitude, address, building name, or the name of an organization located there.
As can be seen from Table 1 above, POI data obtained from different source websites for the same geographic location (same latitude and longitude) may contain duplicates; it can also be seen that the same POI name may be written in several different ways.
To address this, in this embodiment of the present invention, address data is crawled from network data based on a search engine, the address data including a name field and address information. The name fields corresponding to the same address information are clustered by keyword; the occurrences of the name fields in each class after clustering are counted as a second frequency; and the POI name of the class corresponding to the address information is determined according to the second frequency, yielding the best POI name.
To further illustrate the advantages of the invention, the internal structure of the name field clusterer 92 of the clustering-based POI name determining system in another embodiment is disclosed below. Referring to FIG. 10, the name field clusterer 92 further includes a keyword determining unit 921, a keyword clustering unit 922, and a name field cluster determining unit 923:
the keyword determining unit 921 is configured to determine one or more keywords based on the name field;
the keyword clustering unit 922 is configured to cluster the keywords corresponding to the same address information;
the name field cluster determining unit 923 is configured to determine the clustered name fields according to the clustered keywords.
Further, the keyword determining unit 921 includes a word segmentation module and a keyword obtaining module: the word segmentation module is configured to segment the name in the name field into words to generate word segments; the keyword obtaining module is configured to obtain the keyword of the name field from the word segments.
Further, the keyword obtaining module includes a first frequency counting sub-module and a keyword generating sub-module: the first frequency counting sub-module is configured to count the occurrences of each word segment corresponding to the same address information, as a first frequency; the keyword generating sub-module is configured to generate the keyword of the name field according to the first frequency.
The keyword generating sub-module selects the word segment that has the lowest first frequency and is not a place name as the keyword of the name field.
In this embodiment of the present invention, the names in the POI information of the mined address data are segmented into words, and the number of occurrences of each word is counted. Within a POI name, the word that occurs least frequently (and therefore carries the most information) and is not a place name is recorded as the keyword of that POI name. For example, the segmented POI names in the related POI information for the address data in Table 1 above are shown in Table 2 above (the word frequencies are computed from about 90 million POI names).
To further illustrate the advantages of the invention, the internal structure of the second frequency counter 93 of the clustering-based POI name determining system in another embodiment is disclosed below. Referring to FIG. 11, the second frequency counter 93 further includes a name field source obtaining unit 931, a source reliability judging unit 932, and a second frequency counting unit 933:
the name field source obtaining unit 931 is configured to obtain the source of the name field;
the source reliability judging unit 932 is configured to judge whether the source is a reliable source;
the second frequency counting unit 933 is configured to count the occurrences of the name field as the second frequency when the source is judged to be reliable, and not to count them otherwise.
In this embodiment of the present invention, a reliable source is a source having a predetermined credibility, where the source is a website or a web page.
Websites or web pages that are reliable sources include, but are not limited to, large websites such as Sina (新浪) and Phoenix (凤凰网), officially certified websites, websites with high visit frequency and heavy data traffic, and websites that carry no malicious or virus links and have high customer satisfaction.
In this embodiment of the present invention, the credibility of a website or web page is quantifiable: it can be quantified according to, for example, the number of user visits and customer ratings. The credibility of each website or web page also changes dynamically. If a website is found to contain viruses or fraudulent advertisements, or is exploited by other malicious or fraudulent websites, its credibility is reduced accordingly. By quantifying and dynamically adjusting website credibility, the present invention further ensures that the acquired POI information is reliable and valid.
To further illustrate the advantages of the invention, the internal structure of the POI name determining unit 94 of the clustering-based POI name determining system in another embodiment is disclosed below. The POI name determining unit 94 further includes a first class identifier name determining module and a first POI name determining module:
the first class identifier name determining module is configured to take the name field with the highest second frequency in each class as the class identifier name;
the first POI name determining module is configured to take every class identifier name as a POI name corresponding to the address information.
In this embodiment, every class identifier name is taken as a POI name corresponding to the address information. Clustering is by keyword: POI names that share the same keyword are placed in the same class. Referring to Table 1 and Table 2, the POI names above can be grouped into five classes, meaning that five different POIs exist at this POI address:
A: 保山博鑫源汽车贸易有限公司 (Baoshan Boxinyuan Automobile Trading Co., Ltd.);
B: 云南省澜沧江啤酒集团保山有限公司 (Yunnan Lancang River Beer Group Baoshan Co., Ltd.) and 云南省澜沧江啤酒集团保山有限公司(地图标注) (the same name with the suffix “map annotation”);
C: 保山明志汽车销售有限公司 (Baoshan Mingzhi Automobile Sales Co., Ltd.) and 保山明志汽车销售服务有限公司 (Baoshan Mingzhi Automobile Sales Service Co., Ltd.);
D: 保山长城汽车4S店 (Baoshan Great Wall Motor 4S Store);
E: 保山融易通汽车销售有限公司(雪佛兰4S店) (Baoshan Rongyitong Automobile Sales Co., Ltd. (Chevrolet 4S Store)).
To further illustrate the advantages of the invention, an alternative internal structure of the POI name determining unit 94 of the clustering-based POI name determining system is disclosed below as another embodiment. The POI name determining unit 94 further includes a second class identifier name determining module and a second POI name determining module:
the second class identifier name determining module is configured to take the name field with the highest second frequency in each class as the class identifier name;
the second POI name determining module is configured to take the class identifier name that appears most often on the network as the POI name corresponding to the address information.
In this embodiment, the best POI name within a class is chosen by a “vote” on the Internet. The “vote” is based mainly on how often the POI name appears on the Internet and on the credibility of its sources: the name that appears most frequently on the Internet and comes from the most credible sources is selected as the best name. For example:
Class A contains only one name, so that name is the best one.
Class B contains two names; “云南省澜沧江啤酒集团保山有限公司” appears most frequently and is taken as the best name.
Class C contains two names; “保山明志汽车销售服务有限公司” appears most frequently and is taken as the best name.
Classes D and E likewise each contain only one name, as with class A.
The clustering-based POI name determining system provided by this embodiment mines the keyword of each POI name according to word frequency after segmentation and clusters by that keyword, so that different wordings of the same POI name fall into one class. This solves the problem of one latitude and longitude corresponding to multiple POI names, and an Internet “voting” mechanism is then used to select the best POI name.
FIG. 12 shows a flowchart of a clustering-based POI name determining method according to one embodiment of the present invention.
Referring to FIG. 12, the clustering-based POI name determining method of this embodiment includes the following steps:
S121. Crawl address data from network data, the address data including a name field and address information;
S122. Cluster the name fields corresponding to the same address information by keyword;
S123. Count the occurrences of the name fields in each class after clustering, as a second frequency;
S124. Determine, according to the second frequency, the POI name of the class corresponding to the address information.
Address data in the network is used based on a search engine; the address data includes a name field, address information, and a plurality of related POI information items. In this embodiment of the present invention, the related POI information items are information on at least one preset attribute of the corresponding POI. Further, the preset attribute is latitude and longitude, address, building name, or the name of an organization located there.
In this embodiment of the present invention, address data is crawled from network data based on a search engine, the address data including a name field and address information. Take map address data mined from the Internet by a search engine as an example: name: 恒大地产集团昆明公司 (Evergrande Real Estate Group Kunming Company); address: 昆明市盘龙区北辰财富中心A座写字楼14楼 (14th Floor, Office Tower A, Beichen Fortune Center, Panlong District, Kunming). Here “恒大地产集团昆明公司” is the POI name and “昆明市盘龙区北辰财富中心A座写字楼14楼” is the POI address. Geocoding the address yields its latitude and longitude; for this address the result is longitude 102.733445° E, latitude 25.08108° N. In addition, the number of times the POI information appears on the Internet must be counted and its sources recorded.
However, POI data obtained from different source websites for the same geographic location (same latitude and longitude) may contain duplicates. The same address (latitude and longitude) may carry multiple POI names, for example when several companies share one latitude and longitude: the actual POI longitude and latitude are the same, but the POI name and the POI address are described differently. The same POI name may also be written in different ways, such as “保山明志汽车销售有限公司” (Baoshan Mingzhi Automobile Sales Co., Ltd.) and “保山明志汽车销售服务有限公司” (Baoshan Mingzhi Automobile Sales Service Co., Ltd.). Such duplicate POI data prevents users from quickly and accurately finding the POI name that corresponds to a POI address at a given geographic location (latitude and longitude).
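To make the notion of “same address information” concrete, the sketch below models one crawled item as a record and groups records by their geocoded coordinates. The class `AddressRecord`, the function `group_by_location`, the rounding precision, the source identifiers, and the second record's name are all assumptions made for illustration; only the first record's name, address, and coordinates come from the example above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressRecord:
    name: str      # POI name field
    address: str   # POI address text
    lon: float     # longitude from geocoding the address
    lat: float     # latitude from geocoding the address
    source: str    # website or page the record was crawled from


def group_by_location(records, ndigits=5):
    """Group records whose coordinates agree after rounding (assumed tolerance)."""
    groups = defaultdict(list)
    for record in records:
        key = (round(record.lon, ndigits), round(record.lat, ndigits))
        groups[key].append(record)
    return groups


records = [
    AddressRecord("恒大地产集团昆明公司", "昆明市盘龙区北辰财富中心A座写字楼14楼",
                  102.733445, 25.08108, "site-a"),
    # Hypothetical second name crawled for the same coordinates.
    AddressRecord("示例公司", "昆明市盘龙区北辰财富中心A座",
                  102.733445, 25.08108, "site-b"),
]
groups = group_by_location(records)
names_at_location = [r.name for r in next(iter(groups.values()))]
```

Each group then becomes the input to keyword clustering, and the `source` field is kept so that occurrences can later be filtered or weighted by source.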
To further illustrate the advantages of the invention, the sub-steps of step S122 of the clustering-based POI name determining method are disclosed below as another embodiment. Referring to FIG. 13, the sub-steps include:
S1221. Determine one or more keywords based on the name field;
S1222. Cluster the keywords corresponding to the same address information;
S1223. Determine the clustered name fields according to the clustered keywords.
Further, step S1221, determining one or more keywords based on the name field, includes: segmenting the name field into words to generate word segments; and obtaining the keyword of the name field from the word segments.
Further, the step of obtaining the keyword of the name field from the word segments includes: counting the occurrences of each word segment corresponding to the same address information, as a first frequency; and determining the keyword of the name field according to the first frequency.
Further, the step of determining the keyword of the name field according to the first frequency is specifically: selecting the word segment that has the lowest first frequency and is not a place name as the keyword of the name field.
In this embodiment of the present invention, the names in the POI information of the mined address data are segmented into words, and the number of occurrences of each word is counted. Within a POI name, the word that occurs least frequently (and therefore carries the most information) and is not a place name is recorded as the keyword of that POI name. Clustering is by keyword: POI names that share the same keyword are placed in the same class.
To further illustrate the advantages of the invention, the sub-steps of step S123 of the clustering-based POI name determining method are disclosed below as another embodiment. Referring to FIG. 14, the sub-steps include:
S1231. Obtain the source of the name field;
S1232. Determine whether the source is a reliable source; if so, perform S1233;
S1233. Count the occurrences of the name field as the second frequency.
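Steps S1231 to S1233 can be sketched as a filtered count: only sightings from sources judged reliable contribute to the second frequency. The function `second_frequency`, the threshold standing in for the “predetermined credibility”, and the scores are hypothetical.

```python
from collections import Counter


def second_frequency(sightings, credibility, threshold=0.5):
    """Count name occurrences, ignoring sources below the credibility threshold.

    sightings: iterable of (name, source) pairs.
    credibility: dict mapping source -> score; unknown sources score 0.0.
    """
    counts = Counter()
    for name, source in sightings:
        if credibility.get(source, 0.0) >= threshold:  # S1232: reliable source?
            counts[name] += 1                          # S1233: count it
    return counts


# Invented scores: "spam-site" falls below the assumed threshold.
credibility = {"site-a": 0.9, "spam-site": 0.1}
sightings = [
    ("保山明志汽车销售服务有限公司", "site-a"),
    ("保山明志汽车销售有限公司", "spam-site"),
    ("保山明志汽车销售有限公司", "spam-site"),
]
counts = second_frequency(sightings, credibility)
```

In this example the name seen twice on the unreliable source is not counted at all, so the name from the reliable source wins despite fewer raw occurrences.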
In this embodiment of the present invention, a reliable source is a source having a predetermined credibility, where the source is a website or a web page.
Websites or web pages that are reliable sources include, but are not limited to, large websites such as Sina (新浪) and Phoenix (凤凰网), officially certified websites, websites with high visit frequency and heavy data traffic, and websites that carry no malicious or virus links and have high customer satisfaction.
In this embodiment of the present invention, the credibility of a website or web page is quantifiable: it can be quantified according to, for example, the number of user visits and customer ratings. The credibility of each website or web page also changes dynamically. If a website is found to contain viruses or fraudulent advertisements, or is exploited by other malicious or fraudulent websites, its credibility is reduced accordingly. By quantifying and dynamically adjusting website credibility, the present invention further ensures that the acquired POI information is reliable and valid.
To further illustrate the advantages of the invention, the sub-steps of step S124 of the clustering-based POI name determining method are disclosed below as another embodiment. The sub-steps include:
taking the name field with the highest second frequency in each class as the class identifier name; and taking every class identifier name as a POI name corresponding to the address information.
To further illustrate the advantages of the invention, an alternative set of sub-steps of step S124 of the clustering-based POI name determining method is disclosed below as a further embodiment. The sub-steps include:
taking the name field with the highest second frequency in each class as the class identifier name; and taking the class identifier name that appears most often on the network as the POI name corresponding to the address information.
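The two variants of step S124 differ only in the last step, which the sketch below expresses with a flag. It assumes, as a simplification not stated in the text, that the second frequency of a class identifier name can stand in for “number of occurrences on the network”. The function `poi_names` and the counts are hypothetical.

```python
def poi_names(class_counts, single_best=False):
    """Return the POI name(s) for one address.

    class_counts: dict mapping keyword -> {name: second frequency}.
    single_best=False: every class identifier name is a POI name (first variant).
    single_best=True: only the most frequent class identifier name (second variant).
    """
    labels = {}
    for counts in class_counts.values():
        if counts:
            name = max(counts, key=counts.get)  # class identifier name
            labels[name] = counts[name]
    if not single_best:
        return list(labels)
    return [max(labels, key=labels.get)] if labels else []


# Invented second frequencies for two classes at one address.
class_counts = {
    "明志": {"保山明志汽车销售服务有限公司": 2, "保山明志汽车销售有限公司": 1},
    "长城": {"保山长城汽车4S店": 1},
}
all_names = poi_names(class_counts)
best_name = poi_names(class_counts, single_best=True)
```

The first variant suits an address that really hosts several POIs; the second suits a lookup that must return exactly one name.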
The clustering-based POI name determining method provided by this embodiment mines the keyword of each POI name according to word frequency after segmentation and clusters by that keyword, so that different wordings of the same POI name fall into one class. This solves the problem of one latitude and longitude corresponding to multiple POI names, and an Internet “voting” mechanism is then used to select the best POI name.
In summary, the above embodiments extract the name field and address information from address data crawled from network data, determine keywords based on the name field, cluster the keywords corresponding to the same address information, and determine the POI name corresponding to the address information from the clustered keywords. Users can therefore quickly and accurately find the POI name corresponding to a POI address at a given latitude and longitude, which improves the user experience.
FIG. 15 schematically shows a block diagram of a system for determining the validity of POI information based on address data in a network according to one embodiment of the present invention.
Referring to FIG. 15, the system of this embodiment for determining the validity of POI information based on address data in a network includes:
a POI information obtaining unit 511, configured to obtain, based on a search engine and using address data in the network, a plurality of related POI information items corresponding to the same POI name;
In this embodiment of the present invention, the related POI information items are information on at least one preset attribute of the corresponding POI. Further, the preset attribute is latitude and longitude, address, building name, or the name of an organization located there.
a counting unit 512, configured to count the number of occurrences of the POI information in the address data in the network;
a POI information determining unit 513, configured to determine valid POI information corresponding to the same POI name according to the number of occurrences of the POI information in the address data in the network.
As Table 1 above shows, POI data obtained from different source websites for the same geographic location (same longitude and latitude) may contain duplicates: one address (longitude/latitude pair) may have multiple POI names, and the same POI name may be written in several different ways.
In this embodiment, a search engine is used to acquire, from address data in the network, multiple pieces of related POI information corresponding to the same POI name. The related POI information is information on at least one preset attribute of the POI, the preset attribute being the longitude and latitude, the address, the building name, or the names of the organizations housed at the POI. The valid POI information corresponding to the same POI name is then determined according to the number of occurrences of the POI information in the address data in the network.
Further, determining the valid POI information corresponding to the same POI name according to the number of occurrences includes: according to the preset-attribute information of the related POI information, clustering the name fields that correspond to the same address information by keyword; counting how often each name field occurs within each class after clustering, as the second frequency; determining from the second frequency the POI name that the class provides for that address information; and using the Internet "voting" mechanism to select trustworthy, valid POI information for the same POI name.
Further, one or more keywords are determined from the name field, the keywords corresponding to the same address information are clustered, and the clustered name fields are determined from the clustered keywords.
Further, the name in the name field is segmented into tokens, and the keyword of the name field is obtained from those tokens.
Further, the frequency of each token corresponding to the same address information is counted as the first frequency, and the keyword of the name field is generated from the first frequency. Specifically, the token that has the lowest first frequency and is not a place name is selected as the keyword of the name field.
Further, the invention may take the name field with the highest second frequency in each class as the class identifier name and use every class identifier name as a POI name for the address information; or it may take the name field with the highest second frequency in each class as the class identifier name and use the class identifier name that occurs most often on the network as the POI name for the address information.
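The keyword selection, clustering, and class-identifier steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names are assumed to be segmented already (the source does not name a word segmenter), and `PLACE_NAMES`, `token_freq`, the function names, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical place-name list; the source does not say how place names are detected.
PLACE_NAMES = {"Baoshan", "Yunnan"}


def pick_keyword(tokens, token_freq, place_names=PLACE_NAMES):
    """Return the non-place-name token with the lowest first frequency."""
    candidates = [t for t in tokens if t not in place_names]
    if not candidates:
        return None
    # Unseen tokens are treated as frequency 0 (an assumption of this sketch).
    return min(candidates, key=lambda t: token_freq.get(t, 0))


def cluster_names(segmented_names, token_freq):
    """Group the name fields of one address by keyword.

    The per-class Counter holds the second frequency of each name field.
    """
    clusters = defaultdict(Counter)
    for name, tokens in segmented_names:
        clusters[pick_keyword(tokens, token_freq)][name] += 1
    return clusters


def class_identifier_names(clusters):
    """Option 1: the most frequent name field of each class labels that class."""
    return {kw: names.most_common(1)[0][0] for kw, names in clusters.items()}


def best_overall_name(clusters):
    """Option 2: keep only the class identifier name with the highest count."""
    label, _count = max(
        (names.most_common(1)[0] for names in clusters.values()),
        key=lambda pair: pair[1],
    )
    return label


# Made-up observations for one longitude/latitude: (name field, tokens).
observations = [
    ("Lancang River Beer Baoshan Co.", ["Baoshan", "Lancang", "River", "Beer", "Co."]),
    ("Lancang River Beer Baoshan Co.", ["Baoshan", "Lancang", "River", "Beer", "Co."]),
    ("Lancang Beer Group", ["Lancang", "Beer", "Group"]),
    ("Mingzhi Auto Sales", ["Baoshan", "Mingzhi", "Auto", "Sales"]),
]
# Made-up corpus-wide token counts standing in for the first frequency.
token_freq = {
    "Baoshan": 9000, "Lancang": 120, "River": 40000, "Beer": 5000,
    "Co.": 900000, "Group": 300000, "Mingzhi": 40, "Auto": 80000, "Sales": 200000,
}

clusters = cluster_names(observations, token_freq)
labels = class_identifier_names(clusters)
```

Rare tokens act as keywords because they discriminate between names, which is why the common tokens ("Co.", "Group") and the place name never become cluster keys in this sketch.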
Specifically, the names in the POI information of the mined address data are segmented, and the number of occurrences of each resulting token is counted. Within one POI name, the token that occurs least often carries the most information; that token, provided it is not a place name, is recorded as the keyword of the POI name. For example, the segmented POI names in the related POI information for the address data in Table 1 are shown in Table 2 above (the token frequencies were computed over about 90 million POI names).
To further demonstrate the advantages of the invention, the internal structure of the statistics unit 512 of the system in another embodiment is disclosed below. Referring to FIG. 16, the statistics unit 512 further includes a POI information source acquisition module 5121, a POI information source reliability judgment module 5122, and a statistics module 5123:
the POI information source acquisition module 5121 is configured to obtain the source of the POI information;
the POI information source reliability judgment module 5122 is configured to judge whether the source is a reliable source;
the statistics module 5123 is configured to count the number of occurrences of the POI information in the address data in the network when the source is a reliable source, and not to count it otherwise.
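The behavior of modules 5121 to 5123 can be sketched as a filtered count. This is only an illustration: the record layout `(poi_info, source)`, the allow-list `RELIABLE_SOURCES`, and the example domains are assumptions, since the source does not specify how reliable sources are stored.

```python
from collections import Counter

# Hypothetical allow-list of reliable sources (websites or web pages).
RELIABLE_SOURCES = {"trusted-portal.example", "official-site.example"}


def count_from_reliable_sources(records, reliable=RELIABLE_SOURCES):
    """Count occurrences of each piece of POI information, skipping unreliable sources.

    `records` is an iterable of (poi_info, source) pairs.
    """
    counts = Counter()
    for poi_info, source in records:
        if source in reliable:  # role of module 5122: reliability judgment
            counts[poi_info] += 1  # role of module 5123: counting
    return counts


records = [
    ("Mingzhi Auto Sales", "trusted-portal.example"),
    ("Mingzhi Auto Sales", "official-site.example"),
    ("Mingzhi Auto Sales", "spam-blog.example"),  # ignored: not a reliable source
    ("Mingzhi Motors", "spam-blog.example"),      # ignored: not a reliable source
]
reliable_counts = count_from_reliable_sources(records)
```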
In this embodiment, the best POI name within a class of POI names is chosen by "voting" on the Internet. The "vote" is based mainly on how often the POI name appears on the Internet and on the credibility of its sources: the name that appears most often and comes from the most credible sources is selected as the best name. For example:
Class A contains only one name, so that name is the best one.
Class B contains two names; "云南省澜沧江啤酒集团保山有限公司" (Yunnan Lancang River Beer Group Baoshan Co., Ltd.) appears most frequently and is taken as the best name.
Class C contains two names; "保山明志汽车销售服务有限公司" (Baoshan Mingzhi Automobile Sales and Service Co., Ltd.) appears most frequently and is taken as the best name.
Classes D and E each contain only one name and are handled like class A.
To further demonstrate the advantages of the invention, the internal structure of the POI information determination unit 513 of the system in another embodiment is disclosed below. Referring to FIG. 17, the POI information determination unit 513 further includes a judgment subunit 5131 and a POI information determination subunit 5132:
the judgment subunit 5131 is configured to judge whether the number of occurrences of the POI information in the address data in the network is higher than a predetermined threshold;
the POI information determination subunit 5132 is configured to determine that the acquired POI information is valid when the judgment subunit's result is yes.
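The threshold check performed by subunits 5131 and 5132 can be sketched as below. The function name, the dictionary input, and the threshold value are illustrative assumptions; the source only states that the count must be higher than a predetermined threshold.

```python
def valid_poi_info(counts, threshold):
    """Return the POI information whose occurrence count is strictly above the threshold."""
    return {info for info, n in counts.items() if n > threshold}


# Made-up occurrence counts for three candidate names.
occurrence_counts = {"Mingzhi Auto Sales": 12, "Mingzhi Motors": 3, "Mingzhi": 5}
valid = valid_poi_info(occurrence_counts, threshold=5)
```

A strict comparison is used because the text says "higher than" the threshold, so a count equal to the threshold is not treated as valid in this sketch.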
In this embodiment, the more frequently POI information appears on the Internet and the more credible its sources, the more trustworthy the POI information is. The finally selected best POI name is filtered by its frequency of occurrence on the Internet and by its sources; names above a certain threshold are the trustworthy POI information that is finally mined.
In this embodiment, a reliable source is a source with a predetermined credibility, where the source is a website or a web page.
In this embodiment, websites or web pages with the predetermined credibility include, but are not limited to: large websites such as Sina and Ifeng (凤凰网); officially certified websites; websites with high visit frequency and heavy data traffic; and websites that carry no malicious or virus links and have relatively high customer satisfaction.
In this embodiment, credibility is quantifiable: the credibility of each website or web page can be quantified from factors such as the number of user visits and customer ratings. The credibility of each website or web page also changes dynamically. If a website currently hosts viruses or fraudulent advertisements, or is exploited by malicious or fraudulent websites, its credibility decreases accordingly. By quantifying and dynamically adjusting website credibility, the invention further ensures that the acquired POI information is reliable and valid.
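The source gives no formula for credibility, so the following is only one possible sketch of a quantified, dynamically adjusted score. Every name, weight, and cutoff here (`SiteStats`, `visit_weight`, `flag_penalty`, `min_credibility`) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    visits: int            # number of user visits
    rating: float          # customer rating, normalized to [0, 1]
    flagged: bool = False  # True if viruses or fraudulent ads were detected


def credibility(stats, max_visits=1_000_000, visit_weight=0.5, flag_penalty=0.5):
    """Combine visit volume and rating into a score in [0, 1].

    The weighted sum and the penalty factor are assumptions of this sketch.
    """
    visit_score = min(stats.visits, max_visits) / max_visits
    score = visit_weight * visit_score + (1 - visit_weight) * stats.rating
    if stats.flagged:
        score *= flag_penalty  # dynamic adjustment: a flagged site loses credibility
    return score


def is_reliable(stats, min_credibility=0.6):
    """A source is treated as reliable once its score reaches the predetermined level."""
    return credibility(stats) >= min_credibility


site = SiteStats(visits=500_000, rating=0.8)
was_reliable = is_reliable(site)
site.flagged = True  # the site is later found to host fraudulent ads
still_reliable = is_reliable(site)
```

Recomputing the score whenever the site's statistics change is what makes the reliability decision dynamic in this sketch.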
This embodiment uses address data in the network to acquire multiple pieces of related POI information corresponding to the same POI name, and determines the valid POI information for that POI name according to the number of occurrences of the POI information in the address data in the network. Users can thus quickly and accurately find the one or more POI names corresponding to a POI address at a given longitude and latitude. The network voting mechanism then filters those POI names by information source and by frequency of occurrence on the Internet, and the POI name with high credibility is selected as the POI name for the current POI address, which improves the validity of the POI information.
FIG. 18 schematically shows a flowchart of a method for determining the validity of POI information based on address data in a network, according to one embodiment of the invention.
Referring to FIG. 18, the method for determining the validity of POI information based on address data in a network according to this embodiment includes the following steps:
S811: using address data in the network, acquire multiple pieces of related POI information corresponding to the same POI name;
S812: count the number of occurrences of the POI information in the address data in the network;
S813: determine the valid POI information corresponding to the same POI name according to the number of occurrences of the POI information in the address data in the network.
In this embodiment, the multiple pieces of related POI information are information on at least one preset attribute of the POI, where the preset attribute is the longitude and latitude, the address, the building name, or the names of the organizations housed at the POI.
POI data obtained from different source websites for the same geographic location (same longitude and latitude) may contain duplicates, and the same POI name may be written in several different ways.
To address this, this embodiment segments the names in the POI information of the mined address data and counts the number of occurrences of each resulting token. Within one POI name, the token that occurs least often carries the most information; that token, provided it is not a place name, is recorded as the keyword of the POI name.
To further demonstrate the advantages of the invention, the sub-steps of step S812 of the method in another embodiment are disclosed below. Referring to FIG. 19, the sub-steps of this step include:
S8121: obtain the source of the POI information;
S8122: judge whether the source is a reliable source, and if so, perform step S8123;
S8123: when the source is a reliable source, count the number of occurrences of the POI information in the address data in the network; otherwise do not count it.
In this embodiment, the best POI name within a class of POI names is chosen by "voting" on the Internet. The "vote" is based mainly on how often the POI name appears on the Internet and on the credibility of its sources: the name that appears most often and comes from the most credible sources is selected as the best name.
To further demonstrate the advantages of the invention, the sub-steps of step S813 of the method in another embodiment are disclosed below. Referring to FIG. 20, the sub-steps of this step include:
S8131: judge whether the number of occurrences of the POI information in the address data in the network is higher than a predetermined threshold, and if so, perform step S8132;
S8132: determine that the POI information is valid.
In this embodiment, the more frequently POI information appears on the Internet and the more credible its sources, the more trustworthy the POI information is. The finally selected best POI name is filtered by its frequency of occurrence on the Internet and by its sources; names above a certain threshold are the trustworthy POI information that is finally mined.
In this embodiment, a reliable source is a source with a predetermined credibility, where the source is a website or a web page.
In this embodiment, websites or web pages with the predetermined credibility include, but are not limited to: large websites such as Sina and Ifeng (凤凰网); officially certified websites; websites with high visit frequency and heavy data traffic; and websites that carry no malicious or virus links and have relatively high customer satisfaction.
In this embodiment, credibility is quantifiable: the credibility of each website or web page can be quantified from factors such as the number of user visits and customer ratings. The credibility of each website or web page also changes dynamically. If a website currently hosts viruses or fraudulent advertisements, or is exploited by malicious or fraudulent websites, its credibility decreases accordingly. By quantifying and dynamically adjusting website credibility, the invention further ensures that the acquired POI information is reliable and valid.
The method for determining the validity of POI information based on address data in a network provided by this embodiment mines the keyword of each POI name from the frequencies of the tokens produced by word segmentation, then clusters on those keywords so that differently worded variants of the same POI name fall into one class. This resolves the problem of one longitude/latitude pair corresponding to multiple POI names. The Internet "voting" mechanism is used both to select the best POI name and to select trustworthy POI information.
In summary, the above embodiments of the invention use address data in the network to acquire multiple pieces of related POI information corresponding to the same POI name, and determine the valid POI information for that POI name according to the number of occurrences of the POI information in the address data in the network. Users can thus quickly and accurately find the one or more POI names corresponding to a POI address at a given longitude and latitude. The network voting mechanism then filters those POI names by information source and by frequency of occurrence on the Internet, and the POI name with high credibility is selected as the POI name for the current POI address, which improves the validity of the POI information.
It should be noted that the algorithms and formulas provided herein are not inherently tied to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the invention is not directed to any particular programming language. It should be understood that the invention described herein can be implemented in a variety of programming languages, and that any description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided here. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. The disclosed method and apparatus should not, however, be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules of the devices in an embodiment can be adaptively changed and placed in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may also be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to fall within the scope of the invention and to form different embodiments.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the system for determining a POI name based on clustering according to embodiments of the invention. The invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 21 schematically shows a block diagram of a computing device for performing the method according to the invention. The computing device conventionally includes a processor 2110 and a computer program product or computer-readable medium in the form of a memory 2120. The memory 2120 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 2120 has a storage space 2130 for program code 2131 for performing any of the method steps described above. For example, the storage space 2130 for program code may include individual program codes 2131, each implementing one of the steps of the above methods. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 22. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 2120 in the computing device of FIG. 21. The program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer-readable code 2131', that is, code readable by a processor such as the processor 2110, which, when run by a computing device, causes the computing device to perform the steps of the methods described above.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been chosen mainly for readability and instructional purposes, not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made here is illustrative and not restrictive; the scope of the invention is defined by the appended claims.
The invention can be applied to computer systems/servers that are operable with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with a computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
A computer system/server may be described in the general context of computer-system-executable instructions, such as program modules, executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and so on, which perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media, including storage devices.
References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Note also that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.
The above describes only some embodiments of the invention. It should be pointed out that those of ordinary skill in the art can make further improvements and refinements without departing from the principles of the invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the invention.

Claims (44)

  1. A system for determining a POI name based on clustering, the system comprising:
    an address data grabber, configured to grab address data from network data;
    an address data parser, configured to extract a name field and address information from each of the one or more pieces of grabbed address data;
    a keyword determiner, configured to determine one or more keywords based on the name field;
    a keyword clusterer, configured to cluster the keywords corresponding to the same address information to generate at least one class;
    a POI name generator, configured to determine, according to the clustered keywords, the POI name corresponding to the address information.
  2. The system of claim 1, wherein the keyword determiner further comprises:
    a word segmentation unit, configured to perform word segmentation on the name in the name field to generate segmented words;
    a keyword acquisition unit, configured to acquire the keyword of the address data according to the segmented words.
  3. The system of any one of claims 1-2, wherein the keyword acquisition unit further comprises:
    a first frequency statistics module, configured to count the frequency of occurrence of each segmented word corresponding to the same address information as a first frequency;
    a keyword generation module, configured to generate the keyword of the address data according to the first frequency.
  4. The system of any one of claims 1-3, wherein the keyword generation module selects the segmented word that has the lowest frequency and is not a place name as the keyword of the address data.
  5. The system of any one of claims 1-4, wherein the POI name generator further comprises:
    a frequency statistics unit, configured to calculate the frequency of occurrence of the name fields in each class;
    a class identification name determining unit, configured to take the name field with the highest frequency of occurrence in each class as the class identification name;
    a POI name determining unit, configured to take every class identification name as a POI name.
  6. The system of any one of claims 1-4, wherein the POI name generator further comprises:
    a frequency statistics unit, configured to calculate the frequency of occurrence of the name fields in each class;
    a class identification name determining unit, configured to take the name field with the highest frequency of occurrence in each class as the class identification name;
    a POI name determining unit, configured to select the class identification name with the highest frequency of occurrence as the POI name.
  7. A method for determining a POI name based on clustering, comprising:
    grabbing address data from network data;
    extracting a name field and address information from each of the one or more pieces of grabbed address data;
    determining one or more keywords based on the name field;
    clustering the keywords corresponding to the same address information to generate at least one class;
    determining, according to the clustered keywords, the POI name corresponding to the address information.
  8. The method of claim 7, wherein the step of determining one or more keywords based on the name field further comprises:
    performing word segmentation on the name in the name field to generate segmented words;
    acquiring the keyword of the address data according to the segmented words.
  9. The method of any one of claims 7-8, wherein the step of acquiring the keyword of the address data according to the segmented words further comprises:
    counting the frequency of occurrence of each segmented word corresponding to the same address information as a first frequency;
    generating the keyword of the address data according to the first frequency.
  10. The method of any one of claims 7-9, wherein the step of generating the keyword of the address data according to the first frequency is specifically:
    selecting the segmented word that has the lowest frequency and is not a place name as the keyword of the address data.
  11. The method of any one of claims 7-10, wherein the step of determining, according to the clustered keywords, the POI name corresponding to the address information further comprises:
    calculating the frequency of occurrence of the name fields in each class;
    taking the name field with the highest frequency of occurrence in each class as the class identification name;
    taking every class identification name as a POI name.
  12. The method of any one of claims 7-11, wherein the step of determining, according to the clustered keywords, the POI name corresponding to the address information further comprises:
    calculating the frequency of occurrence of the name fields in each class;
    taking the name field with the highest frequency of occurrence in each class as the class identification name;
    selecting the class identification name with the highest frequency of occurrence as the POI name.
  13. A cluster-based POI name determination system, comprising:
    an address data grabber, configured to grab address data from network data based on a search engine, the address data comprising a name field and address information;
    a name field clusterer, configured to cluster, by keyword, the name fields corresponding to the same address information;
    a second frequency counter, configured to count the frequency of occurrence of the name fields in each category after clustering as a second frequency;
    a POI name determining unit, configured to determine, according to the second frequency, the POI name of the category corresponding to the address information.
  14. The system of claim 13, wherein the name field clusterer further comprises:
    a keyword determining unit, configured to determine one or more keywords based on the name field;
    a keyword clustering unit, configured to cluster the keywords corresponding to the same address information;
    a name field cluster determining unit, configured to determine the clustered name fields according to the clustered keywords.
  15. The system of any one of claims 13-14, wherein the keyword determining unit further comprises:
    a word segmentation module, configured to perform word segmentation on the name in the name field to generate segmented words;
    a keyword acquisition module, configured to acquire the keyword of the name field according to the segmented words.
  16. The system of any one of claims 13-15, wherein the keyword acquisition module further comprises:
    a first frequency statistics submodule, configured to count the frequency of occurrence of each segmented word corresponding to the same address information as a first frequency;
    a keyword generation submodule, configured to generate the keyword of the name field according to the first frequency.
  17. The system of any one of claims 13-16, wherein the keyword generation submodule selects the segmented word that has the lowest first frequency and is not a place name as the keyword of the name field.
  18. The system of any one of claims 13-17, wherein the second frequency counter further comprises:
    a name field source acquisition unit, configured to acquire the source of the name field;
    a source reliability judging unit, configured to judge whether the source is a reliable source;
    a second frequency statistics unit, configured to count, when the judgment is yes, the frequency of occurrence of the name field as the second frequency, and otherwise not to count it.
  19. The system of any one of claims 13-18, wherein the POI name determining unit further comprises:
    a class identification name determining module, configured to take the name field with the highest second frequency in each class as the class identification name;
    a first POI name determining module, configured to take every class identification name as a POI name corresponding to the address information.
  20. The system of any one of claims 13-19, wherein the POI name determining unit further comprises:
    a class identification name determining module, configured to take the name field with the highest second frequency in each class as the class identification name;
    a second POI name determining module, configured to take the class identification name that occurs most often on the network as the POI name corresponding to the address information.
  21. A cluster-based POI name determination method, comprising:
    grabbing address data from network data, the address data comprising a name field and address information;
    clustering, by keyword, the name fields corresponding to the same address information;
    counting the frequency of occurrence of the name fields in each category after clustering as a second frequency;
    determining, according to the second frequency, the POI name of the category corresponding to the address information.
  22. The method of claim 21, wherein clustering, by keyword, the name fields corresponding to the same address information further comprises:
    determining one or more keywords based on the name field;
    clustering the keywords corresponding to the same address information;
    determining the clustered name fields according to the clustered keywords.
  23. The method of any one of claims 21-22, wherein determining one or more keywords based on the name field further comprises:
    performing word segmentation on the name field to generate segmented words;
    acquiring the keyword of the name field according to the segmented words.
  24. The method of any one of claims 21-23, wherein acquiring the keyword of the name field according to the segmented words further comprises:
    counting the frequency of occurrence of each segmented word corresponding to the same address information as a first frequency;
    determining the keyword of the name field according to the first frequency.
  25. The method of any one of claims 21-24, wherein determining the keyword of the name field according to the first frequency is specifically:
    selecting the segmented word that has the lowest first frequency and is not a place name as the keyword of the name.
  26. The method of any one of claims 21-25, wherein counting the frequency of occurrence of the name fields in each category after clustering as the second frequency further comprises:
    acquiring the source of the name field;
    judging whether the source is a reliable source and, if so, counting the frequency of occurrence of the name field as the second frequency.
  27. The method of any one of claims 21-26, wherein determining, according to the second frequency, the POI name of the category corresponding to the address information further comprises:
    taking the name field with the highest second frequency in each class as the class identification name;
    taking every class identification name as a POI name corresponding to the address information.
  28. The method of any one of claims 21-27, wherein determining, according to the second frequency, the POI name of the category corresponding to the address information further comprises:
    taking the name field with the highest second frequency in each class as the class identification name;
    taking the class identification name that occurs most often on the network as the POI name corresponding to the address information.
  29. A system for determining the validity of POI information based on address data in a network, the system comprising:
    a POI information acquisition unit, configured to acquire, based on a search engine and using address data in the network, a plurality of pieces of related POI information corresponding to the same POI name;
    a statistics unit, configured to count the number of occurrences of the POI information in the address data in the network;
    a POI information determining unit, configured to determine, according to the number of occurrences of the POI information in the address data in the network, the valid POI information corresponding to the same POI name.
  30. The system of claim 29, wherein the plurality of pieces of related POI information are information corresponding to at least one preset attribute of the POI.
  31. The system of any one of claims 29-30, wherein the preset attribute is a longitude and latitude, an address, a building name, or the name of a unit contained in the POI.
  32. The system of any one of claims 29-31, wherein the statistics unit further comprises:
    a POI information source acquisition module, configured to acquire the source of the POI information;
    a POI information source reliability judging module, configured to judge whether the source is a reliable source;
    a statistics module, configured to count the number of occurrences of the POI information in the address data in the network when the source is a reliable source, and otherwise not to count it.
  33. The system of any one of claims 29-32, wherein the POI information determining unit further comprises:
    a judging subunit, configured to judge whether the number of occurrences of the POI information in the address data in the network is higher than a predetermined threshold;
    an information point information determining subunit, configured to determine that the acquired POI information is valid when the judging subunit's judgment is yes.
  34. The system of any one of claims 29-33, wherein the reliable source is a source having a predetermined degree of credibility.
  35. The system of any one of claims 29-34, wherein the source is a website or a web page.
  36. A method for determining the validity of POI information based on address data in a network, the method comprising:
    acquiring, using address data in the network, a plurality of pieces of related POI information corresponding to the same POI name;
    counting the number of occurrences of the POI information in the address data in the network;
    determining, according to the number of occurrences of the POI information in the address data in the network, the valid POI information corresponding to the same POI name.
  37. The method of claim 36, wherein the plurality of pieces of related POI information are information corresponding to at least one preset attribute of the POI.
  38. The method of any one of claims 36-37, wherein the preset attribute is a longitude and latitude, an address, a building name, or the name of a unit contained in the POI.
  39. The method of any one of claims 36-38, wherein the step of counting the number of occurrences of the POI information in the address data in the network further comprises:
    acquiring the source of the POI information;
    judging whether the source is a reliable source and, if so, counting the number of occurrences of the POI information in the address data in the network, and otherwise not counting it.
  40. The method of any one of claims 36-39, wherein the step of determining, according to the number of occurrences of the POI information in the address data in the network, the valid POI information corresponding to the same POI name further comprises:
    judging whether the number of occurrences of the POI information in the address data in the network is higher than a predetermined threshold;
    if so, determining that the POI information is valid.
  41. The method of any one of claims 36-40, wherein the reliable source is a source having a predetermined degree of credibility.
  42. The method of any one of claims 36-41, wherein the source is a website or a web page.
  43. A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for determining a POI name based on clustering according to any one of claims 7-12, or the cluster-based POI name determination method according to any one of claims 21-28, or the method for determining the validity of POI information based on address data in a network according to any one of claims 36-42.
  44. A computer-readable medium storing the computer program of claim 43.
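
By way of non-limiting illustration only, the method of claims 7-12 and the method of claims 36-40 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names `segment`, `determine_poi_names`, and `valid_poi_info`, the `PLACE_NAMES` set, the whitespace-based word segmentation, and the representation of reliable sources as a set are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins: a real system would use a proper (e.g. Chinese)
# word segmenter and a gazetteer of place names.
PLACE_NAMES = {"beijing", "chaoyang"}


def segment(name):
    """Toy word segmentation: lowercase and split on whitespace."""
    return name.lower().split()


def determine_poi_names(records, single=False):
    """records: iterable of (name_field, address_info) pairs.

    Returns {address_info: [class identification names]}; with single=True,
    only the most frequent class identification name is kept per address.
    """
    names_by_address = defaultdict(list)
    for name, address in records:
        names_by_address[address].append(name)

    result = {}
    for address, names in names_by_address.items():
        # First frequency: occurrences of each token among names at this address.
        first_freq = Counter(tok for n in names for tok in segment(n))
        # keyword -> Counter of the name fields that share this keyword
        clusters = defaultdict(Counter)
        for n in names:
            candidates = [t for t in segment(n) if t not in PLACE_NAMES]
            if not candidates:
                continue  # name consists only of place names: no keyword
            # Keyword: the non-place-name token with the lowest first frequency.
            keyword = min(candidates, key=lambda t: first_freq[t])
            clusters[keyword][n] += 1
        # Class identification name: most frequent name field in each class.
        labels = [c.most_common(1)[0] for c in clusters.values()]
        if single and labels:
            labels = [max(labels, key=lambda pair: pair[1])]
        result[address] = [name for name, _ in labels]
    return result


def valid_poi_info(observations, reliable_sources, threshold):
    """observations: iterable of (poi_name, info, source) triples.

    Counts each piece of POI information only when its source is reliable and
    keeps the information whose count is higher than the threshold.
    """
    counts = defaultdict(Counter)
    for poi_name, info, source in observations:
        if source in reliable_sources:
            counts[poi_name][info] += 1
    return {
        poi_name: [info for info, c in counter.items() if c > threshold]
        for poi_name, counter in counts.items()
    }
```

In this sketch, passing `single=False` corresponds to taking every class identification name as a POI name (claim 11), while `single=True` corresponds to selecting only the most frequent one (claim 12). The source check and the threshold comparison in `valid_poi_info` correspond to claims 39 and 40 respectively.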
PCT/CN2015/095857 2014-12-29 2015-11-27 System and method for determining poi name and for determining validity of poi information WO2016107352A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN201410849382.5A CN104572957B (en) 2014-12-29 2014-12-29 Cluster-based POI name determination system and method
CN201410849380.6 2014-12-29
CN201410849380.6A CN104572956B (en) 2014-12-29 2014-12-29 System and method for determining the validity of POI information
CN201410849123.2 2014-12-29
CN201410849382.5 2014-12-29
CN201410849123.2A CN104572955B (en) 2014-12-29 2014-12-29 System and method for determining a POI name based on clustering

Publications (1)

Publication Number Publication Date
WO2016107352A1 (en) 2016-07-07

Family

ID=56284188

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/095857 WO2016107352A1 (en) 2014-12-29 2015-11-27 System and method for determining poi name and for determining validity of poi information

Country Status (1)

Country Link
WO (1) WO2016107352A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090110302A1 (en) * 2007-10-31 2009-04-30 Microsoft Corporation Declustering Point-of-Interest Icons
CN102063460A (en) * 2010-10-19 2011-05-18 蔡亮华 Information processing method and device
CN102479229A (en) * 2010-11-29 2012-05-30 北京四维图新科技股份有限公司 Method and system for generating point of interest (POI) data
JP2014086045A (en) * 2012-10-26 2014-05-12 Kddi Corp Server, system, program, and method for estimating poi on the basis of position and direction information of terminal
CN104077295A (en) * 2013-03-27 2014-10-01 百度在线网络技术(北京)有限公司 Data label mining method and data label mining system
CN104572955A (en) * 2014-12-29 2015-04-29 北京奇虎科技有限公司 System and method for determining POI name based on clustering
CN104572956A (en) * 2014-12-29 2015-04-29 北京奇虎科技有限公司 System and method for confirming POI information effectiveness
CN104572957A (en) * 2014-12-29 2015-04-29 北京奇虎科技有限公司 POI name determination system based on clustering and method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, RUISHAN: "Multi-source Poi Information Fusion Based on Natural Language Processing", MASTER DEGREE THESES OF MASTER OF CHINA EXCELLENT FULL-TEXT DATABASE, INFORMATION TECHNOLOGY SERIES, vol. 17, no. 3, 15 March 2014 (2014-03-15), pages 1 - 4, ISSN: 1674-0246 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111198912A (en) * 2018-11-19 2020-05-26 阿里巴巴集团控股有限公司 Address data processing method and device
CN111460055A (en) * 2019-01-21 2020-07-28 阿里巴巴集团控股有限公司 POI address supplementing method and device
CN111460055B (en) * 2019-01-21 2023-06-20 阿里巴巴集团控股有限公司 POI address supplementing method and device
CN111782741A (en) * 2020-06-04 2020-10-16 汉海信息技术(上海)有限公司 Interest point mining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15875031; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 15875031; Country of ref document: EP; Kind code of ref document: A1)