CN104462143A - Method and device for establishing chain brand word bank and category word bank - Google Patents


Info

Publication number: CN104462143A
Authority: CN (China)
Legal status (an assumption by Google Patents, not a legal conclusion): Granted
Application number: CN201310439450.6A
Other languages: Chinese (zh)
Other versions: CN104462143B (en)
Inventor: 刘广权
Current Assignee (as listed by Google Patents; may be inaccurate): Alibaba China Co Ltd
Original Assignee: Autonavi Software Co Ltd
Application filed by: Autonavi Software Co Ltd
Priority: CN201310439450.6A
Granted publication: CN104462143B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

Embodiments of the invention disclose a method and device for establishing a chain brand word bank and a category word bank. In one case, a chain brand word recognizer is trained on the POI data in the POI database of a single city; the recognizer is then used to classify the name trunks of all POI data in that database, and the name trunks recognized as chain brand words are stored in the chain brand word bank. In another case, a recognizer is trained on the query words recorded in user query logs and the clicked POI data corresponding to those query words; the recognizer is then used to classify all query words recorded in the logs, and the query words recognized as chain brand words and category words are stored in the chain brand word bank and the category word bank respectively. The embodiments improve working efficiency, and periodic mining keeps the word banks up to date.

Description

Method and device for establishing a chain brand word dictionary and a category word dictionary
Technical field
The present invention relates to the technical field of geographic information, and in particular to methods and devices for establishing a chain brand word dictionary and a category word dictionary.
Background
Before a navigation engine performs route navigation, the destination usually has to be searched for first. During the search, the user first enters a query word into the navigation engine; the navigation engine searches the POI (Point of Interest) database for POI data matching the query word; after the user selects one POI from the results, the navigation engine plans a route to the selected POI and navigates.
In some cases the query word entered by the user is a category word, that is, a word denoting a category. For example, "restaurant" is a category word. Along different dimensions, "restaurant" can be divided into "Chinese restaurant" and "Western restaurant", or into "upscale restaurant" and "street-corner snack stall"; these subcategories of "restaurant" are themselves category words. In other cases the query word is a chain brand word, that is, a word denoting a chain brand organization; for example, "Industrial and Commercial Bank", "KFC" and "Suning Appliance" are all chain brand words.
At present, to improve search accuracy and make the results better match what the user is looking for, the navigation engine uses different search and ranking methods depending on whether the query word is a category word, a chain brand word, or a generic word (a word that is neither a category word nor a chain brand word, such as "Fangheng International Center"). When the navigation engine determines that the query word is a category word, the user presumably wants POIs of a certain category, so the engine filters the POI database for POIs matching the category word and displays the results in order of increasing distance from the user's position. When the engine determines that the query word is a chain brand word, the user presumably wants a nearby outlet of the chain, because the outlets of a chain brand are distributed fairly evenly geographically; the engine therefore searches for matching POIs within a certain range around the user's position and likewise displays them in order of increasing distance.
In the prior art, the navigation engine determines whether a query word is a category word or a chain brand word by matching it against a category word dictionary and a chain brand word dictionary: if the query word is found in the category word dictionary it is judged to be a category word, and if it is found in the chain brand word dictionary it is judged to be a chain brand word. These dictionaries are currently built mainly by hand: people analyze POI data, summarize commonly used category words and chain brand words, and compile them into the two dictionaries. Building the dictionaries by manual summarization is inefficient, and the dictionaries cannot be updated promptly when new words appear.
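The dictionary lookup described above can be sketched as follows. This is a minimal illustration that assumes exact string matching against two in-memory sets; the function name, the sample entries, and the check order are hypothetical and not taken from the patent.

```python
def classify_query(query, category_words, chain_brand_words):
    """Label a query word by exact lookup in the two dictionaries."""
    if query in category_words:
        return "category"
    if query in chain_brand_words:
        return "chain_brand"
    # Neither dictionary contains the word, so treat it as a generic word.
    return "generic"


# Hypothetical sample dictionaries for illustration only.
category_words = {"restaurant"}
chain_brand_words = {"KFC"}

print(classify_query("KFC", category_words, chain_brand_words))
```

A lookup like this is only as good as the dictionaries behind it, which is why the rest of the document concentrates on building them automatically.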
Summary of the invention
To solve the above technical problem, embodiments of the invention provide methods and devices for establishing a chain brand word dictionary and a category word dictionary. Chain brand words can be mined automatically from a POI database, and chain brand words and category words can be mined automatically from user query logs. This improves working efficiency, and periodic mining allows the dictionaries to be updated promptly.
Embodiments of the invention disclose the following technical solutions:
A method for establishing a chain brand word dictionary, comprising:
aggregating POI data that share the same name trunk in the point of interest (POI) database of a single city into a POI data group, the POI data group corresponding to that name trunk;
extracting recognition features of the POI data group from each POI data group;
extracting, from all POI data groups, the POI data groups whose name trunks have been labeled as chain brand words or non-chain brand words as training data, and training a chain brand word recognizer on the recognition features of the training data;
using the trained chain brand word recognizer to classify the not-yet-recognized name trunks among the name trunks corresponding to all POI data groups, thereby identifying the name trunks that are chain brand words;
storing the name trunks that are chain brand words in a preset chain brand word dictionary.
A method for establishing a chain brand word dictionary and a category word dictionary, comprising:
obtaining, from user query logs, the POI data that different users obtained in the same city with the same query word, and aggregating the obtained POI data into a POI data group, the POI data group corresponding to that query word;
extracting recognition features of the POI data group from each POI data group;
extracting, from all POI data groups, the POI data groups whose query words have been labeled as chain brand words, category words or generic words as training data, and training a recognizer on the recognition features of the training data;
using the trained recognizer to classify the not-yet-recognized query words among the query words corresponding to all POI data groups, thereby identifying the query words that are chain brand words and those that are category words;
storing the query words that are chain brand words in a preset chain brand word dictionary, and storing the query words that are category words in a preset category word dictionary.
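The first step of this second method, aggregating POI data by query word, might look like the sketch below. The log record layout (`city`, `query`, `poi_id` fields) is an assumption made for illustration, as is the choice to keep each POI only once per group; the patent does not specify either.

```python
from collections import defaultdict


def group_clicks_by_query(log_records):
    """Aggregate POIs from query logs into one group per (city, query word)."""
    groups = defaultdict(dict)
    for record in log_records:
        key = (record["city"], record["query"])
        # Assumption: the same POI reached by several users is counted once.
        groups[key][record["poi_id"]] = record
    return {key: list(by_id.values()) for key, by_id in groups.items()}


# Hypothetical log records from three different users.
logs = [
    {"city": "Beijing", "query": "KFC", "poi_id": 1},
    {"city": "Beijing", "query": "KFC", "poi_id": 1},
    {"city": "Beijing", "query": "KFC", "poi_id": 2},
    {"city": "Shanghai", "query": "KFC", "poi_id": 9},
]
query_groups = group_clicks_by_query(logs)
```

Each resulting group plays the same role as a name-trunk group in the first method: recognition features are extracted from it and fed to the recognizer.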
A device for establishing a chain brand word dictionary, comprising:
a first aggregation unit, configured to aggregate POI data that share the same name trunk in the POI database of a single city into a POI data group, the POI data group corresponding to that name trunk;
a first feature extraction unit, configured to extract recognition features of the POI data group from each POI data group;
a first training unit, configured to extract, from all POI data groups, the POI data groups whose name trunks have been labeled as chain brand words or non-chain brand words as training data, and to train a chain brand word recognizer on the recognition features of the training data;
a first recognition unit, configured to use the trained chain brand word recognizer to classify the not-yet-recognized name trunks among the name trunks corresponding to all POI data groups, thereby identifying the name trunks that are chain brand words;
a first dictionary establishing unit, configured to store the name trunks that are chain brand words in a preset chain brand word dictionary.
A device for establishing a chain brand word dictionary and a category word dictionary, comprising:
a second aggregation unit, configured to obtain, from user query logs, the POI data that different users obtained in the same city with the same query word, and to aggregate the obtained POI data into a POI data group, the POI data group corresponding to that query word;
a second feature extraction unit, configured to extract recognition features of the POI data group from each POI data group;
a second training unit, configured to extract, from all POI data groups, the POI data groups whose query words have been labeled as chain brand words, category words or generic words as training data, and to train a recognizer on the recognition features of the training data;
a third recognition unit, configured to use the trained recognizer to classify the not-yet-recognized query words among the query words corresponding to all POI data groups, thereby identifying the query words that are chain brand words and those that are category words;
a second dictionary establishing unit, configured to store the query words that are chain brand words in a preset chain brand word dictionary, and to store the query words that are category words in a preset category word dictionary.
As can be seen from the above embodiments, the advantages of the invention over the prior art are as follows:
In the chain brand word dictionary establishing method provided by the invention, a chain brand word recognizer is trained on the POI data in the POI database of a single city. The recognizer classifies the name trunks of all POI data in the database, and the name trunks recognized as chain brand words are stored in the chain brand word dictionary. In the other case, a recognizer is trained on the query words recorded in user query logs and the clicked POI data corresponding to those query words. The recognizer classifies all query words recorded in the logs, and the query words recognized as chain brand words and category words are stored in the chain brand word dictionary and the category word dictionary respectively. Compared with the prior art, in which chain brand words are obtained by manually analyzing the POI data in the POI database, this improves the efficiency of obtaining chain brand words and therefore the efficiency and speed of building the chain brand word dictionary.
Brief description of the drawings
To describe the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for establishing a chain brand word dictionary according to the invention;
Fig. 2 is a flowchart of another method for establishing a chain brand word dictionary according to the invention;
Fig. 3 is a flowchart of a method for establishing a chain brand word dictionary and a category word dictionary according to the invention;
Fig. 4 is a flowchart of another method for establishing a chain brand word dictionary and a category word dictionary according to the invention;
Fig. 5 is a structural diagram of an embodiment of a device for establishing a chain brand word dictionary according to the invention;
Fig. 6 is a structural diagram of an embodiment of another device for establishing a chain brand word dictionary according to the invention;
Fig. 7 is a structural diagram of an embodiment of a device for establishing a chain brand word dictionary and a category word dictionary according to the invention;
Fig. 8 is a structural diagram of an embodiment of another device for establishing a chain brand word dictionary and a category word dictionary according to the invention.
Detailed description of the embodiments
Embodiments of the invention provide methods and devices for establishing a chain brand word dictionary and a category word dictionary. In one case, a chain brand word recognizer is trained on the POI data in the POI database of a single city. The recognizer classifies the name trunks of all POI data in the database, and the name trunks recognized as chain brand words are stored in the chain brand word dictionary. In the other case, a recognizer is trained on the query words recorded in user query logs and the clicked POI data corresponding to those query words. The recognizer classifies all query words recorded in the logs, and the query words recognized as chain brand words and category words are stored in the chain brand word dictionary and the category word dictionary respectively.
To make the above objects, features and advantages of the invention easier to understand, the embodiments of the invention are described in detail below with reference to the drawings.
Embodiment 1
This embodiment trains a chain brand word recognizer on POI data in a POI database. The recognizer classifies name trunks derived from the POI data as chain brand words or non-chain brand words; the name trunks classified as chain brand words are selected from the classification results and stored in the chain brand word dictionary. Referring to Fig. 1, a flowchart of a method for establishing a chain brand word dictionary according to the invention, the method includes the following steps:
Step 101: POI data that share the same name trunk in the POI database of a single city are aggregated into a POI data group, where the POI data group corresponds to that name trunk;
The "name trunk" is the part of a POI name that remains after auxiliary information such as the branch and the address is removed. How the name trunk is separated from the auxiliary information depends on the POI data format. In some common formats the auxiliary information is placed in parentheses; in others it follows the symbol "-". For example, "Industrial and Commercial Bank (Wangjing Subbranch)" is the name of a POI, and "Industrial and Commercial Bank" is its name trunk. Likewise, "Industrial and Commercial Bank-Wangjing Subbranch" is the name of a POI, and "Industrial and Commercial Bank" is its name trunk.
The POI data in the POI database that have the same name trunk are gathered together to form a POI data group. A POI database can thus yield multiple POI data groups; each group contains one or more POI data, and all POI data in a group have the same name trunk.
Note that in the technical solution of the invention, a "POI database" is a database containing all POI data of a single city, for example the Beijing POI database.
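Step 101 can be sketched in a few lines. The sketch assumes that auxiliary information always follows the first "(", "（" or "-" in the name, which matches the two formats mentioned above but would mis-handle brand names that themselves contain a hyphen; the function names and the POI record layout are hypothetical.

```python
from collections import defaultdict


def name_trunk(name):
    """Strip branch/address info that follows "(", "（" or "-" in a POI name."""
    for separator in ("(", "（", "-"):
        cut = name.find(separator)
        if cut > 0:  # keep the name intact if it starts with the separator
            name = name[:cut]
    return name.strip()


def group_by_trunk(pois):
    """Aggregate one city's POI records into groups keyed by name trunk."""
    groups = defaultdict(list)
    for poi in pois:
        groups[name_trunk(poi["name"])].append(poi)
    return dict(groups)


# Hypothetical records from a single city's POI database.
beijing_pois = [
    {"name": "Industrial and Commercial Bank (Wangjing Subbranch)"},
    {"name": "Industrial and Commercial Bank-Wangjing Subbranch"},
    {"name": "KFC"},
]
groups = group_by_trunk(beijing_pois)
```

Keeping the separator list in one place makes it easy to adapt the trunk extraction to whatever convention a particular POI data format uses.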
Step 102: recognition features of the POI data group are extracted from each POI data group;
The recognition features are the parameters used to decide whether the name trunk corresponding to the POI data group is a chain brand word.
The recognition features comprise any one or any combination of the following: the spatial distribution distance; the spatial distribution entropy; the proportion of POI data in the group whose names carry a branch marker; the proportion of POI data in the group whose names carry a door marker; and the category score of the POI data group.
The category score is determined as follows. POI data of the same category within the POI data group are aggregated into a data subgroup; the category score is the preset score corresponding to the category of the subgroup containing the most POI data. The preset score is derived from the prior probability that a chain brand organization appears in that category. The prior probability equals N/M, where M is the number of POI data corresponding to the name trunks labeled as chain brand words in the training data, and N is the number of those M POI data whose category is the same as the category of the largest subgroup.
Step 103: the POI data groups whose name trunks have been labeled as chain brand words or non-chain brand words are extracted from all POI data groups as training data, and a chain brand word recognizer is trained on the recognition features of the training data;
Suppose that 1000 POI data groups have been aggregated from one POI database, and that the name trunks of 100 of them have been labeled as chain brand words or non-chain brand words; these 100 POI data groups are extracted from the 1000 as training data. When the dictionary is built for the first time, the name trunks of these 100 groups can be identified and labeled manually beforehand. When the dictionary is being updated, the name trunks of these 100 groups can either be identified and labeled manually beforehand or be ones that the chain brand word recognizer identified and labeled when the dictionary was built previously.
Note that the technical solution of the invention places no limit on the number of POI data groups in the training data. The more training data is extracted, the more accurate the trained recognizer. In practice, an appropriate amount of training data can be extracted according to the accuracy required of the recognizer.
Take the extraction of 100 POI data groups from 1000 as an example. The name trunks of these 100 groups have been labeled as chain brand words or non-chain brand words, for example with the label 2 for a chain brand word and 0 for a non-chain brand word, giving 100 labels in total (each 2 or 0); each label is assumed to be accurate. The same recognition features are then extracted from each of the 100 groups, giving 100 sets of recognition features; every set contains the same features, for example the spatial distribution distance and the spatial distribution entropy. Finally, the recognizer model is trained on the 100 labels and the 100 sets of recognition features to obtain a chain brand word recognizer that distinguishes chain brand words from non-chain brand words.
The following describes in detail how recognition features are extracted from a POI data group, taking as an example a POI data group from the Shanghai POI database whose name trunk is "Suning Appliance". This group contains 87 POI data, such as Suning Appliance (Jiangqiao store) and Suning Appliance (Silver Road store).
(1) Spatial distribution distance
First, determine the minimum bounding rectangle of the 87 POI data of the group on the navigation map. Using the longitude and latitude coordinates of the 87 POI data, find the POI with the largest longitude (easternmost), the POI with the smallest longitude (westernmost), the POI with the largest latitude (northernmost) and the POI with the smallest latitude (southernmost). This gives:
The POI with the largest longitude is Suning Appliance (Nanhui East Gate Street store), with longitude 121.7629;
The POI with the smallest longitude is Suning Appliance (Park Road store), with longitude 121.1173;
The POI with the largest latitude is Suning Appliance (North Gate Road store), with latitude 31.6278;
The POI with the smallest latitude is Suning Appliance (Weiling Road store), with latitude 30.7155.
These four coordinate values determine a rectangle, namely the minimum bounding rectangle of the 87 POI data in the group.
Next, take the longest side of the minimum bounding rectangle and normalize it to obtain the spatial distribution distance. For example, the two sides of the rectangle above are computed as 101.1 km and 61.2 km; the longest side, 101.1 km, is normalized as 101.1/200 = 0.505, so the spatial distribution distance is 0.505.
The outlets of a chain brand organization are spread over a wide area, so their spatial distribution distance is large. Conversely, a non-chain organization is spread over a narrow area, so its spatial distribution distance is small. The spatial distribution distance therefore helps to distinguish whether a POI belongs to a chain brand organization, and hence whether its name trunk is a chain brand word.
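The bounding-rectangle computation can be sketched as below. The conversion from degrees to kilometres uses a spherical-Earth approximation, which is an assumption (the patent does not say how the side lengths were computed), so the result only approximately reproduces the 0.505 of the worked example. The 200 km divisor is the one used in that example; the function name and the second coordinate of each corner point are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth
NORMALIZER_KM = 200.0     # divisor used in the worked example


def spatial_distribution_distance(coords):
    """coords: list of (longitude, latitude) pairs in degrees."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    km_per_degree = math.radians(1.0) * EARTH_RADIUS_KM
    mid_lat = math.radians((min(lats) + max(lats)) / 2.0)
    north_south_km = (max(lats) - min(lats)) * km_per_degree
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west_km = (max(lons) - min(lons)) * km_per_degree * math.cos(mid_lat)
    return max(north_south_km, east_west_km) / NORMALIZER_KM


# The four extreme coordinates from the example; the other coordinate of
# each point is made up, since the text does not give it.
extremes = [(121.7629, 31.0), (121.1173, 31.0), (121.4, 31.6278), (121.4, 30.7155)]
distance = spatial_distribution_distance(extremes)
```

Only the extreme points matter for this feature, so it is cheap to compute even for large groups.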
(2) Spatial distribution entropy
First, determine the minimum bounding rectangle of the 87 POI data as described above.
Next, split the minimum bounding rectangle into several regions and compute the distribution probability of the 87 POI data over the regions. For example, split the rectangle into 3*3 = 9 regions and count the POIs falling into each region, giving {3, 5, 0, 6, 54, 7, 3, 7, 2}. Dividing each count by the total number of POI data in the group gives the distribution probabilities: {0.034482759, 0.057471264, 0, 0.068965517, 0.620689655, 0.08045977, 0.034482759, 0.08045977, 0.022988506}.
Finally, compute the entropy of the distribution probabilities and normalize it to obtain the spatial distribution entropy. For example, using the entropy formula $\frac{\sum_i \left[-P_i \log_2 P_i\right]}{\log_2 N}$, the result is 1.976/3.170 = 0.623, where the sum runs over the regions, $P_i$ is the distribution probability of the POI data in region i, and N is the number of regions.
Other entropy formulas can of course be used; the technical solution of the invention places no limit on the formula used to compute the entropy.
The outlets of a chain brand organization are distributed more evenly in space, so their spatial distribution entropy is larger; conversely, a non-chain organization is distributed unevenly, so its spatial distribution entropy is smaller. The spatial distribution entropy therefore also helps to distinguish whether a POI belongs to a chain brand organization, and hence whether its name trunk is a chain brand word.
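The entropy computation above can be reproduced with the following sketch. `spatial_distribution_entropy` implements the normalized entropy formula directly; `grid_counts` is a hypothetical helper showing one way to bin coordinates into a rows-by-columns grid over the bounding rectangle, with empty regions contributing zero to the sum.

```python
import math


def grid_counts(coords, rows=3, cols=3):
    """Count (lon, lat) points per cell of a grid over their bounding box."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    min_lon, min_lat = min(lons), min(lats)
    width = (max(lons) - min_lon) or 1.0   # avoid dividing by zero
    height = (max(lats) - min_lat) or 1.0
    counts = [0] * (rows * cols)
    for lon, lat in coords:
        # Points on the far edge are clamped into the last row/column.
        col = min(int((lon - min_lon) / width * cols), cols - 1)
        row = min(int((lat - min_lat) / height * rows), rows - 1)
        counts[row * cols + col] += 1
    return counts


def spatial_distribution_entropy(counts):
    """Entropy of the per-region distribution, normalized by log2(N)."""
    total = sum(counts)
    probabilities = [count / total for count in counts]
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return entropy / math.log2(len(counts))


# Region counts from the worked example (87 POIs over 9 regions).
example_counts = [3, 5, 0, 6, 54, 7, 3, 7, 2]
entropy = spatial_distribution_entropy(example_counts)
```

Dividing by log2(N) keeps the feature in the range 0 to 1 regardless of how many regions the rectangle is split into.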
(3) ratio of the POI data that title indicates with branch in POI data
Such as, indicated by the printed words such as " shop " and " business hall " as branch, in 87 POI data, the POI data with branch mark in title has 79, and the ratio calculating the POI data that title indicates with branch in POI data is 79/87=0.908.
Because the ratio of chain brand mechanism band branch mark is higher, therefore, in POI data, the ratio of the POI data that title indicates with branch is also just larger, otherwise, the ratio of non-chain brand mechanism band branch mark is lower, and in POI data, the ratio of the POI data that title indicates with branch is also just less.Whether this evident characteristics of ratio of the POI data indicated with branch according to title in POI data also can distinguish a POI data is a chain brand mechanism, and then whether the title trunk distinguishing this POI data is a chain brand word.
(4) in POI data title with the ratio of the POI data of door mark
Such as, by " door printed words " as door mark, in 87 POI data, title has 2 with the POI data of door mark, and in calculating POI data, title is 2/87=0.023 with the ratio of the POI data of door mark.
Because the ratio of the POI data of chain brand mechanism band door mark is lower, therefore, in POI data, title is also just less with the ratio of the POI data of door mark, otherwise, the ratio of the POI data of non-chain brand mechanism band door mark is higher, and in POI data, the ratio of the POI data of title band door mark is also just larger.Whether be a chain brand mechanism, and then whether the title trunk distinguishing this POI data is a chain brand word if also can distinguish a POI data according to title in POI data with this evident characteristics of ratio of the POI data of door mark.
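Both proportion features reduce to the same computation with a different marker list, as the sketch below shows. Treating a marker as "present" when it occurs anywhere in the full POI name is an assumption, and the sample names are synthetic stand-ins built to match the 79/87 and 2/87 counts from the example.

```python
def marker_ratio(names, markers):
    """Fraction of POI names that contain at least one of the markers."""
    if not names:
        return 0.0
    hits = sum(1 for name in names if any(marker in name for marker in markers))
    return hits / len(names)


# Synthetic names: 79 with a branch marker, 2 with a door marker, 6 with neither.
names = (
    ["Suning Appliance (store %d)" % i for i in range(79)]
    + ["Suning Appliance (east door)", "Suning Appliance (west door)"]
    + ["Suning Appliance"] * 6
)
branch_ratio = marker_ratio(names, ["store", "business hall"])
door_ratio = marker_ratio(names, ["door"])
```

Passing the markers as a parameter lets the same function serve feature (3) and feature (4), and makes the marker vocabulary easy to extend.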
(5) Category score of the POI data group
As defined in step 102, POI data of the same category within the POI data group are aggregated into a data subgroup, and the category score is the preset score corresponding to the category of the subgroup containing the most POI data. The preset score is derived from the prior probability that a chain brand organization appears in that category. The prior probability equals N/M, where M is the number of POI data corresponding to the name trunks labeled as chain brand words in the training data, and N is the number of those M POI data whose category is the same as the category of the largest subgroup.
POI data in a POI database are generally classified, usually into two or three levels: a second-level category is a subcategory of a first-level category, and a third-level category is a subcategory of a second-level category. Most chain brand organizations fall into the first-level categories "food and drink", "shopping" and "life", whereas most non-chain organizations fall into the first-level categories "house", "landscape" and "government organs". The scores of "food and drink", "shopping" and "life" are therefore higher than those of "house", "landscape" and "government organs". For example, the score of "food and drink", "shopping" and "life" is set to 2, the score of "house", "landscape" and "government organs" is set to 0, and the score of all other first-level categories is set to 1.
For example, of the 87 POI data, 40 have the category "shopping" and 37 have the category "house". The 40 "shopping" POI data are aggregated into one data subgroup and the 37 "house" POI data into another. The first subgroup contains the most POI data and its category is "shopping", so the category score of the POI data group is 2.
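The category score lookup can be sketched as follows, using the example scores of 2, 1 and 0 from the text. The function names are hypothetical, and the patent does not state how a prior probability N/M is turned into a preset score, so `category_prior` only computes the ratio itself.

```python
from collections import Counter

# Example preset scores from the text; any category not listed scores 1.
CATEGORY_SCORES = {
    "food and drink": 2, "shopping": 2, "life": 2,
    "house": 0, "landscape": 0, "government organs": 0,
}
DEFAULT_SCORE = 1


def category_score(categories):
    """Preset score of the most frequent category in a POI data group."""
    if not categories:
        return DEFAULT_SCORE
    top_category, _ = Counter(categories).most_common(1)[0]
    return CATEGORY_SCORES.get(top_category, DEFAULT_SCORE)


def category_prior(chain_brand_poi_categories, category):
    """N/M: share of the M chain-brand training POIs that have this category."""
    return chain_brand_poi_categories.count(category) / len(chain_brand_poi_categories)


# 40 "shopping" and 37 "house" POIs as in the example; the rest are made up.
group_categories = ["shopping"] * 40 + ["house"] * 37 + ["other"] * 10
score = category_score(group_categories)
```

Using only the majority category keeps the feature robust to a handful of mis-categorized POIs inside a group.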
The 100 labels and the 100 sets of recognition features are fed into a training module, and the chain brand word recognizer is obtained by training. In one preferred scheme, the chain brand word recognizer is a linear classifier of the form:
$$y = \sum_i (W_i \times X_i) + b \tag{1}$$
In formula (1), $W_i$ is the weight coefficient of the i-th recognition feature, $X_i$ is the value of the i-th recognition feature, and $b$ is the constant term. When $y$ is greater than or equal to a preset threshold, the name trunk corresponding to the POI data group is recognized as a chain brand word; when $y$ is less than the preset threshold, it is recognized as a non-chain brand word.
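Formula (1) and the threshold rule translate directly into code. In the sketch below the weights and constant term are invented purely for illustration, and the default threshold of 1.5 is the example decision boundary that the description mentions for step 104.

```python
def recognizer_score(weights, features, bias):
    """Formula (1): y = sum(W_i * X_i) + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def is_chain_brand(weights, features, bias, threshold=1.5):
    """A name trunk is a chain brand word when y reaches the threshold."""
    return recognizer_score(weights, features, bias) >= threshold


# Hypothetical weights and bias; the feature values are the spatial
# distribution distance and entropy from the worked example.
example_weights = [1.0, 1.0]
example_bias = 0.5
example_features = [0.505, 0.623]
y = recognizer_score(example_weights, example_features, example_bias)
```

With these made-up parameters the example group scores about 1.63, which is above the 1.5 boundary.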
Training yields the weight coefficient corresponding to each recognition feature and the constant term (the specific values are missing from this text).
At this point the chain brand word recognizer has been trained. When the recognizer classifies the name trunk corresponding to a POI data group, it outputs a number: a value close to 0 indicates a high probability that the name trunk is a non-chain brand word, and a value close to 2 indicates a high probability that it is a chain brand word.
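The text does not say which training algorithm produces the weights. As one possible illustration, the sketch below fits the linear model by least squares with plain batch gradient descent, using the labels 2 (chain brand word) and 0 (non-chain brand word); the algorithm, the learning rate, the epoch count and the toy data are all assumptions.

```python
def train_linear_recognizer(features, labels, learning_rate=0.1, epochs=5000):
    """Fit y = sum(W_i * X_i) + b by least-squares gradient descent."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for i in range(d):
                grad_w[i] += error * x[i]
            grad_b += error
        # Step against the mean gradient of the squared error.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy training data: two feature values per group, label 2 or 0.
train_x = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
train_y = [2, 2, 0, 0]
weights, bias = train_linear_recognizer(train_x, train_y)


def score(x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias
```

Because the targets are 0 and 2, the fitted model naturally outputs values near 0 for non-chain groups and near 2 for chain brand groups, matching the behaviour described above.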
Step 104: the trained chain brand word recognizer is used to classify the not-yet-recognized name trunks among the name trunks corresponding to all POI data groups, thereby identifying the name trunks that are chain brand words;
Once the chain brand word recognizer has been trained, the not-yet-recognized name trunks among the name trunks of all aggregated POI data groups are classified on the basis of the recognition features extracted from the POI data groups. That is, the recognition features of a POI data group are fed into the recognizer, which outputs a number: a value close to 0 indicates that the name trunk corresponding to the group is a non-chain brand word, and a value close to 2 indicates that it is a chain brand word. In practice a decision boundary can be set, for example 1.5: when the recognizer output is greater than or equal to 1.5 the name trunk is a chain brand word, and when it is less than 1.5 the name trunk is a non-chain brand word. Finally, the name trunks that are chain brand words are selected from all the results, giving all chain brand words in the POI database.
Step 105: by described be that the title trunk of chain brand word is stored in preset chain brand word dictionary.
Once the chain brand word dictionary has been established, the navigation engine can use different search methods for a destination search depending on whether the query word is a chain brand word. The navigation engine checks whether the query word appears in the chain brand word dictionary. If it does, the query word is determined to be a chain brand word: POI data are extracted from the POI database in order of increasing distance from the current location or a specified location, and the POI data matching the query word are searched for within the extracted POI data. Otherwise, the query word is determined to be a non-chain brand word, and the POI data matching the query word are searched for among all POI data in the POI database.
When sorting search results, the navigation engine can likewise use different sorting methods depending on whether the query word is a chain brand word. The navigation engine checks whether the query word appears in the chain brand word dictionary. If it does, the query word is determined to be a chain brand word, and the POI data found are sorted with distance as the main factor (for example, search results are shown in order of increasing distance from the user's current location). Otherwise, the query word is determined to be a non-chain brand word, and the POI data found are sorted with text similarity as the main factor.
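The dictionary lookup that drives both the search method and the sort order can be sketched as follows. This is only an illustration: the function name, the strategy labels, and the example dictionary entry are hypothetical, and a real navigation engine would attach actual search and ranking routines to each branch.

```python
from typing import Dict, Set


def choose_strategy(query: str, chain_brand_words: Set[str]) -> Dict[str, str]:
    """Pick the search scope and ranking factor from dictionary membership."""
    if query in chain_brand_words:
        # Chain brand word: search nearby POI data first and rank by distance.
        return {"search_scope": "nearest_first", "rank_by": "distance"}
    # Non-chain brand word: search all POI data and rank by text similarity.
    return {"search_scope": "all_poi_data", "rank_by": "text_similarity"}


# Hypothetical dictionary content for illustration.
example_dictionary = {"KFC"}
```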
As the above embodiment shows, compared with the prior art, the present invention has the following advantages:
A method is provided for training a chain brand word recognizer based on all POI data in a POI database and on recognition features extracted from the POI data. The chain brand word recognizer automatically identifies, from all POI data in the POI database, the title trunks that are chain brand words, and a chain brand word dictionary is built from the title trunks so identified. This automatic identification not only improves work efficiency; the title trunks of all POI data in the POI database can also be re-identified periodically, so that the dictionary is updated in a timely manner.
Embodiment two
Embodiment two differs from embodiment one in that, after the chain brand word recognizer is obtained, its identification accuracy is further checked. If the check shows that the identification accuracy does not meet the requirement, the recognizer is adjusted and then checked again; checking and adjustment are repeated until the identification accuracy meets the requirement. Refer to Fig. 2, which is a flowchart of another method of the present invention for establishing a chain brand word dictionary. The method comprises the following steps:
Step 201: aggregate the POI data with the same title trunk in the point of interest (POI) database of the same city into one POI data group, the POI data group corresponding to that title trunk;
Step 202: extract the recognition features of the POI data group from each POI data group;
The recognition features are parameters used to identify whether the title trunk corresponding to the POI data group is a chain brand word.
Step 203: extract, from all POI data groups, the POI data groups whose title trunks have been labeled as chain brand words or non-chain brand words as training data, and train the chain brand word recognizer based on the recognition features of the training data;
For the specific implementation of steps 201-203, see embodiment one; it is not repeated here. The checking process follows:
Step 204: extract, from all POI data groups, POI data groups whose title trunks have been labeled as chain brand words or non-chain brand words as check data, the check data being different from the training data;
For example, 1000 POI data groups are aggregated from a POI database, and the title trunks of groups 1-200 have been labeled as chain brand words or non-chain brand words. Groups 1-100 are extracted from the 1000 POI data groups as training data, and groups 101-200 are extracted as check data.
It should be noted that the technical solution of the present invention does not limit the number of POI data groups in the check data. Naturally, the more check data are extracted, the more reliable the computed identification accuracy and identification recall for chain brand words, and hence the more reliable the check result. In practice, an appropriate amount of check data can be extracted according to the accuracy required of the chain brand word recognizer.
Step 205: use the chain brand word recognizer to identify the title trunks of the check data, and determine which title trunks are chain brand words;
For the identification method, see step 104 in embodiment one; it is not described in detail here. In the end, 100 identification results are obtained in total.
Step 206: based on the identification results of the chain brand word recognizer on the check data, calculate the recognizer's identification accuracy and/or identification recall for chain brand words;
The identification accuracy equals the number of title trunks in the identification results that are correctly identified chain brand words, divided by the number of title trunks identified as chain brand words in the identification results. The identification recall equals the number of title trunks in the identification results that are correctly identified chain brand words, divided by the number of title trunks labeled as chain brand words in the check data. A correctly identified chain brand word is a title trunk that is both labeled as a chain brand word and identified as a chain brand word.
For example, the check data contain 400 title trunks, of which 100 are labeled as chain brand words and 300 as non-chain brand words. The chain brand word recognizer identifies the check data and 90 title trunks are identified as chain brand words, but only 60 of them are correctly identified chain brand words (that is, title trunks both labeled and identified as chain brand words); the remaining 30 are misjudgments that are in fact non-chain brand words. The identification accuracy of the recognizer for chain brand words is therefore 60/90=66.67%, and its identification recall is 60/100=60%. When adjusting the recognizer, one may choose to consider and calculate only the identification accuracy, or only the identification recall. Both may of course be considered and calculated together.
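The two ratios defined above can be computed as in the following sketch, which reproduces the 400-trunk example. The function name and the boolean-list representation of labels and identification results are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def accuracy_and_recall(labels: Sequence[bool], identified: Sequence[bool]) -> Tuple[float, float]:
    """Return (identification accuracy, identification recall) for chain brand words.

    labels[i] is True when title trunk i is labeled as a chain brand word;
    identified[i] is True when the recognizer identifies it as one.
    """
    correct = sum(1 for lab, ident in zip(labels, identified) if lab and ident)
    n_identified = sum(1 for ident in identified if ident)
    n_labeled = sum(1 for lab in labels if lab)
    accuracy = correct / n_identified if n_identified else 0.0
    recall = correct / n_labeled if n_labeled else 0.0
    return accuracy, recall


# The example from the text: 100 labeled chain brand words and 300 non-chain
# brand words; 90 identified as chain brand words, of which 60 are correct.
example_labels = [True] * 100 + [False] * 300
example_identified = [True] * 60 + [False] * 40 + [True] * 30 + [False] * 270
```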
Step 207: determine whether the identification accuracy and/or identification recall is greater than or equal to its corresponding threshold; if not, go to step 208; if so, go to step 209;
If only the identification accuracy was calculated in the previous step, this step only needs to determine whether the identification accuracy reaches the accuracy threshold. Likewise, if only the identification recall was calculated, this step only needs to determine whether the identification recall reaches the recall threshold. If both parameters were calculated, this step needs to determine both whether the identification accuracy reaches the accuracy threshold and whether the identification recall reaches the recall threshold.
For example, suppose the accuracy threshold is 0.8 and the calculated identification accuracy reaches it (that is, the recognizer's identification accuracy for chain brand words is greater than or equal to the accuracy threshold). The chain brand word recognizer can then be used directly to identify the title trunks corresponding to the POI data groups. Suppose instead that the accuracy threshold is 0.9 and the calculated identification accuracy does not reach it (that is, it is less than the accuracy threshold). The chain brand word recognizer then needs to be adjusted.
Step 208: readjust the chain brand word recognizer and return to step 205;
For example, the recognition features extracted when training the recognizer can be modified. As another example, some of the coefficients used when extracting the recognition features can be modified, such as the normalization coefficient used when calculating the spatial distribution distance, or the number of regions used when calculating the spatial distribution entropy (for example, changing 3*3 to 4*4).
In addition, the weight coefficient of each recognition feature in the recognizer can be modified, or the classification cutoff used to distinguish chain brand words from non-chain brand words can be modified, for example from 1.5 to 1.6 or 1.7.
After returning to step 205, the adjusted chain brand word recognizer is used to identify the title trunks of the check data and determine which title trunks are chain brand words.
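One of the adjustments mentioned above, raising the classification cutoff and rechecking, can be sketched as a simple loop. This is only one possible reading of the check-and-adjust cycle of steps 205 to 208; the function name, the candidate cutoffs, and the example scores and labels are hypothetical.

```python
from typing import Optional, Sequence


def tune_cutoff(scores: Sequence[float], labels: Sequence[bool],
                candidates: Sequence[float], accuracy_threshold: float) -> Optional[float]:
    """Try each candidate cutoff in order and return the first one whose
    identification accuracy on the check data reaches the threshold."""
    for cutoff in candidates:
        identified = [s >= cutoff for s in scores]
        n_identified = sum(identified)
        correct = sum(1 for lab, ident in zip(labels, identified) if lab and ident)
        if n_identified and correct / n_identified >= accuracy_threshold:
            return cutoff
    return None  # no candidate met the requirement; adjust something else


# Hypothetical recognizer outputs and labels for four check-data title trunks.
example_scores = [1.55, 1.65, 1.8, 1.9]
example_labels = [False, False, True, True]
```

With these illustrative values, cutoffs of 1.5 and 1.6 give accuracies of 0.5 and about 0.67, and a cutoff of 1.7 is the first to reach an accuracy threshold of 0.8.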
Step 209: use the chain brand word recognizer to identify the not-yet-identified title trunks among the title trunks corresponding to all POI data groups;
Step 210: store the title trunks identified as chain brand words in a preset chain brand word dictionary.
For the specific implementation of steps 209-210, see embodiment one; it is not repeated here.
As the above embodiment shows, compared with the prior art, the present invention has the following advantages:
A method is provided for training a chain brand word recognizer based on all POI data in a POI database and on recognition features extracted from the POI data. The chain brand word recognizer automatically identifies, from all POI data in the POI database, the title trunks that are chain brand words, and a chain brand word dictionary is built from the title trunks so identified. This automatic identification not only improves work efficiency; the title trunks of all POI data in the POI database can also be re-identified periodically, so that the dictionary is updated in a timely manner. Moreover, after the chain brand word recognizer has been trained, check data whose title trunks are labeled as chain brand words or non-chain brand words are extracted to check the recognizer further. When the check fails, the recognizer is adjusted, which ensures the accuracy and validity of its identification of the title trunks of POI data groups.
Embodiment three
The first two embodiments both mine chain brand words from a POI database. Because the data collected in a POI database all use standard terms, the chain brand words mined are essentially standardized names, which may not match users' habits. For example, the standardized name of a chain pharmacy may be "** Large Pharmacy" while users habitually call it "** Pharmacy". If the user enters "** Pharmacy" as the query word, the erroneous conclusion is drawn that the query word is not a chain brand word. In addition, because the name of a POI data item is seldom a category word, it is also difficult to mine category words from a POI database.
This embodiment trains a recognizer capable of identifying chain brand words, category words, and generic words, based on the query words recorded in the user query log and the clicked POI data corresponding to those query words. The recognizer is used to identify all query words recorded in the user query log, the query words that are chain brand words and category words are filtered out of the identification results, and a chain brand word dictionary and a category word dictionary are established respectively. Refer to Fig. 3, which is a flowchart of a method of the present invention for establishing a chain brand word dictionary and a category word dictionary. The method comprises the following steps:
Step 301: obtain, from the user query log, the POI data that different users obtained by querying with the same query word in the same city (that is, the POI data clicked by users), and aggregate the obtained POI data into one POI data group, the POI data group corresponding to the query word;
During route navigation, the navigation engine records the query words that users enter into it and, for the search results it returns, the POI data that users clicked, and saves them in the user query log. The navigation engine can obtain from the user query log each query word and all clicked POI data corresponding to it, and aggregate the POI data corresponding to each query word into a POI data group. Evidently, a POI data group comprises one or more POI data items, all of which have been clicked, and each POI data group corresponds to one query word.
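The aggregation in step 301 can be sketched as follows. The log record layout (city, query word, clicked POI identifier) and the function name are assumptions made for illustration; the actual format of the user query log is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def group_clicked_pois(records: Iterable[Tuple[str, str, str]]) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Aggregate click records into POI data groups.

    Each record is a hypothetical (city, query_word, poi_id) tuple, one per click.
    The result maps (city, query_word) to {poi_id: number_of_clicks}.
    """
    groups: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for city, query_word, poi_id in records:
        groups[(city, query_word)][poi_id] += 1
    # Convert nested defaultdicts to plain dicts for easier downstream use.
    return {key: dict(clicks) for key, clicks in groups.items()}
```

Keeping the per-POI click counts in each group makes the later click-based recognition features straightforward to compute.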
Step 302: extract the recognition features of the POI data group from each POI data group;
The recognition features are parameters used to identify whether the query word corresponding to the POI data group is a chain brand word, a category word, or a generic word.
Step 303: extract, from all POI data groups, the POI data groups whose query words have been labeled as chain brand words, category words, or generic words as training data, and train the recognizer based on the recognition features of the training data;
Suppose 1000 POI data groups are aggregated from the user query log, and 100 POI data groups whose query words have been labeled as chain brand words, category words, or generic words are extracted from them as training data. When the dictionary is established for the first time, the query words corresponding to these 100 POI data groups can be identified and labeled manually in advance. When the dictionary is updated, the query words corresponding to these 100 POI data groups can either be identified and labeled manually in advance, or have been identified and labeled by the recognizer when the dictionary was previously established.
It should be noted that the technical solution of the present invention does not limit the number of POI data groups in the training data. Naturally, the more training data are extracted, the more accurate the trained recognizer. In practice, an appropriate amount of training data can be extracted according to the accuracy required of the recognizer.
Take the extraction of 100 POI data groups from 1000 POI data groups as training data as an example. The query word corresponding to each of these 100 POI data groups has been labeled as a chain brand word, a category word, or a generic word (a "generic word" is any word other than a chain brand word or a category word): a category word is labeled 1, a chain brand word is labeled 2, and a generic word is labeled 0. This produces 100 labeling results (1, 2, or 0), and each labeling result is evidently accurate. The same recognition features are then extracted from each of the 100 POI data groups, producing 100 sets of recognition features. For example, the features extracted from every group may be the click number of the POI data group and the click distribution entropy of the POI data group. Evidently, the more recognition features are extracted from each POI data group, the more accurate the trained recognizer.
The recognition features of a POI data group are any one of the following, or any combination of them:
the number of POI data in the POI data group; the click distribution entropy of the POI data group; the number of categories of the POI data in the POI data group; the per-category click distribution entropy of the POI data group; the spatial distribution distance; the spatial distribution entropy; the number of cities in which the same query word occurs; the proportion of POI data in the POI data group whose names carry a branch indication; the proportion of POI data in the POI data group whose names carry a door number; and the clicked ratio of POI data, which equals M divided by N, where M is the number of title trunks obtained by extracting the title trunk from the names of the POI data in the POI data group and N is the number of POI data in the POI data group.
In the following, a user query for "furniture building materials" in Beijing is taken as an example: the user query log records 421 clicked POI data corresponding to the query word "furniture building materials". The extraction of recognition features from the POI data group formed by these 421 clicked POI data is described in detail.
(1) Click number of the POI data group
The number of clicked POI data is 421. It is normalized as log(421)/6=1.007, which is capped at 1.
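The normalization of the click number can be sketched as follows. The sketch assumes that "log" denotes the natural logarithm, since ln(421)/6 is approximately 1.007 and matches the figure in the text; the function name and the treatment of the divisor 6 as a parameter are illustrative.

```python
import math


def normalized_click_number(clicked_poi_count: int, divisor: float = 6.0) -> float:
    """Normalize the number of clicked POI data as ln(count) / divisor, capped at 1.

    Assumption: the logarithm is natural, which reproduces log(421)/6 = 1.007.
    """
    if clicked_poi_count < 1:
        raise ValueError("clicked_poi_count must be at least 1")
    return min(math.log(clicked_poi_count) / divisor, 1.0)
```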
(2) Click distribution entropy of the POI data group
First, the click probability of each clicked POI data item is computed. The total number of clicks on the 421 clicked POI data is 985, and the click counts of individual items are, for example: "Love Home Furnishings (West 4th Ring Road store)": 3 clicks; "Chaoyang District Brilliance Building Materials General Market, Beijing": 5 clicks.
The click probability of each POI data item is then: "Love Home Furnishings (West 4th Ring Road store)": 3/985; "Chaoyang District Brilliance Building Materials General Market, Beijing": 5/985.
Next, the entropy of these click probabilities is calculated, for example with the entropy formula Sum[-P*log2(P)]/log2(N). The result is 0.924. Here "Sum" denotes summation, "P" is the click probability of a clicked POI data item, and "N" is the number of clicked POI data.
(3) Number of categories of the POI data in the POI data group
The 421 clicked POI data belong to 7 different categories; for example, the category may be "furniture and building materials general market" among the third-level categories, or "building materials and hardware market". This number is normalized: 7/20=0.35.
(4) Per-category click distribution entropy of the POI data group
First, the click probability of each category to which the clicked POI data belong is computed. For example, the numbers of clicks on POI data in the 7 categories above are {42, 108, 136, 634, 22, 17, 26}, giving the probabilities {42/985, 108/985, 136/985, 634/985, 22/985, 17/985, 26/985}.
Next, the entropy of these per-category click probabilities is calculated and normalized to give the per-category click distribution entropy of the POI data group, for example with the entropy formula Sum[-P*log2(P)]/log2(N). The result is 0.609. Here "Sum" denotes summation, "P" is the click probability of the clicked POI data in each lowest-level category, and "N" is the number of categories (7 in this example, which is the value that yields 0.609).
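The normalized entropy formula Sum[-P*log2(P)]/log2(N) can be checked against the per-category example with a short sketch. The function name is hypothetical. The sketch takes N as the number of entries being summed over (7 categories here), because that choice reproduces the stated value of 0.609; the same function applies to per-POI click counts, where N is the number of clicked POI data.

```python
import math
from typing import Sequence


def normalized_entropy(click_counts: Sequence[int]) -> float:
    """Compute Sum[-P*log2(P)] / log2(N), with P = count / total and N = len(click_counts)."""
    total = sum(click_counts)
    n = len(click_counts)
    if total <= 0 or n < 2:
        return 0.0  # entropy is zero when there is nothing to distribute over
    entropy = 0.0
    for count in click_counts:
        if count > 0:  # a zero count contributes nothing to the sum
            p = count / total
            entropy -= p * math.log2(p)
    return entropy / math.log2(n)


# Click counts of the 7 categories from the text (they sum to 985).
category_clicks = [42, 108, 136, 634, 22, 17, 26]
```

A value near 1 means clicks are spread evenly across the entries, while a value near 0 means they are concentrated on a few of them.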
(5) Spatial distribution distance
First, the minimum distribution rectangle formed by the POI data in a POI data group is determined. The longest edge of this rectangle is then selected and normalized, giving the spatial distribution distance of the clicked POI data.
The minimum distribution rectangle is determined in the same way as when extracting the spatial distribution distance in embodiment one. For the specific method, see embodiment one; it is not repeated here.
(6) Spatial distribution entropy
First, the minimum distribution rectangle formed by all POI data in a POI data group is determined, and this rectangle is split into multiple regions. The distribution probability of the POI data in each region is then computed. Finally, the entropy of these distribution probabilities is calculated and normalized, giving the spatial distribution entropy. For the method of calculating the spatial distribution entropy, see embodiment one; it is not repeated here.
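The spatial distribution entropy described above can be sketched as follows. The details are defined in embodiment one and are not reproduced in this section, so the sketch makes several assumptions: POI locations are given as planar (x, y) pairs, the minimum distribution rectangle is the axis-aligned bounding box, the rectangle is split into a 3*3 grid, and the entropy is normalized by log2 of the number of grid cells. The function name is hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def spatial_distribution_entropy(points: Sequence[Tuple[float, float]], grid: int = 3) -> float:
    """Entropy of POI counts over a grid x grid split of the bounding rectangle."""
    if not points:
        raise ValueError("points must not be empty")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y

    counts: Dict[Tuple[int, int], int] = {}
    for x, y in points:
        # Points on the far edge are clamped into the last row or column.
        col = min(int((x - min_x) / width * grid), grid - 1) if width > 0 else 0
        row = min(int((y - min_y) / height * grid), grid - 1) if height > 0 else 0
        counts[(row, col)] = counts.get((row, col), 0) + 1

    total = len(points)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(grid * grid)  # assumed normalization
```

Under these assumptions, POI data spread evenly over the rectangle give a value near 1, and POI data concentrated in a single region give a value near 0.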
(7) Number of cities in which the same query word occurs
According to the user query log, users have queried "furniture building materials" in 326 cities; that is, this query word has occurred in 326 cities. This number is normalized: 326/360=0.905.
It should be noted that the cities referred to here are cities at or above the county level.
(8) Proportion of POI data in the POI data group whose names carry a branch indication
This recognition feature is extracted in the same way as in embodiment one. For the specific method, see embodiment one; it is not repeated here.
(9) Proportion of POI data in the POI data group whose names carry a door number
This recognition feature is extracted in the same way as in embodiment one. For the specific method, see embodiment one; it is not repeated here.
(10) Clicked ratio of POI data
The clicked ratio of POI data equals M divided by N, where M is the number of title trunks obtained by extracting the title trunk from the names of the POI data in the POI data group, and N is the number of POI data in the POI data group. For example, a POI data group comprises 421 POI data. Among them, some POI data have different names but the same title trunk: if 3 of the 421 POI data are named "KFC-Wangjing store", "KFC-Madian store", and "KFC (Anzhen store)", extracting the title trunk from these 3 names yields a single title trunk, "KFC". Extracting the title trunk from the names of all 421 POI data in this way yields 389 title trunks, so the clicked ratio of POI data is 389/421≈0.924.
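The ratio M/N can be sketched as follows, taking the already-extracted title trunks as input. The title trunk extraction itself is described in embodiment one and is not reproduced here; the function name is hypothetical.

```python
from typing import Sequence


def clicked_ratio(title_trunks: Sequence[str]) -> float:
    """Return M / N, where N is the number of POI data in the group (one title
    trunk per POI data item) and M is the number of distinct title trunks."""
    if not title_trunks:
        raise ValueError("title_trunks must not be empty")
    return len(set(title_trunks)) / len(title_trunks)


# The three KFC stores from the text share one title trunk.
kfc_trunks = ["KFC", "KFC", "KFC"]
```

A low ratio indicates that many clicked POI data share the same title trunk, as is typical when the query word names a chain brand.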
In addition to the recognition features above, either or both of the following two recognition features can be extracted: the proportion of clicks in the POI data group that hit chain brand words, and the number of distinct chain brand words hit in the POI data group.
For example, the total number of clicks on the 421 POI data is 985, of which 201 clicks are on POI data whose title trunk is a chain brand word in the chain brand word dictionary. That is, 201 clicks hit chain brand words, so the proportion of clicks in the POI data group that hit chain brand words is 201/985=0.204.
Here, the chain brand word dictionary is the one established in the manner of embodiment one. Therefore, if this recognition feature is to be extracted when training the recognizer, the scheme of embodiment one must be carried out before the scheme of this embodiment, so that a chain brand word dictionary is obtained first.
Among the chain brand words hit by those 201 clicks, the number of distinct chain brand words is 64. This number is normalized: 64/50=1.28, which is capped at 1.
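These two dictionary-based features can be sketched together. The input layout (click counts keyed by title trunk), the function name, and the example data are assumptions for illustration; the divisor 50 and the cap at 1 follow the example in the text.

```python
from typing import Dict, Set, Tuple


def chain_brand_click_features(clicks_by_trunk: Dict[str, int],
                               chain_brand_words: Set[str],
                               count_divisor: float = 50.0) -> Tuple[float, float]:
    """Return (proportion of clicks hitting chain brand words,
    normalized number of distinct chain brand words hit, capped at 1)."""
    total_clicks = sum(clicks_by_trunk.values())
    hits = {trunk: clicks for trunk, clicks in clicks_by_trunk.items()
            if trunk in chain_brand_words and clicks > 0}
    hit_ratio = sum(hits.values()) / total_clicks if total_clicks else 0.0
    distinct_hit = min(len(hits) / count_divisor, 1.0)
    return hit_ratio, distinct_hit


# Hypothetical data: 201 of 985 clicks fall on one dictionary word.
example_clicks = {"brand A": 201, "other store": 784}
example_dictionary = {"brand A"}
```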
Likewise, the labeling results and the extracted recognition features are input to the training module, and the recognizer is obtained through training.
The recognizer comprises a first recognizer, a second recognizer, and a third recognizer. Training the recognizer based on the recognition features of the training data specifically comprises:
1) training the first recognizer based on the recognition features of the POI data groups in the training data whose query words are labeled as chain brand words and of those whose query words are labeled as category words, to obtain a first recognizer that identifies, from the recognition features of a POI data group, whether its query word is a suspected category word or a suspected chain brand word;
2) training the second recognizer based on the recognition features of the POI data groups in the training data whose query words are labeled as chain brand words and of those whose query words are labeled as generic words, to obtain a second recognizer that identifies, from the recognition features of a POI data group, whether its query word is a chain brand word or a generic word;
3) training the third recognizer based on the recognition features of the POI data groups in the training data whose query words are labeled as category words and of those whose query words are labeled as generic words, to obtain a third recognizer that identifies, from the recognition features of a POI data group, whether its query word is a category word or a generic word.
At this point, training of the recognizer is complete. Each recognizer outputs a numerical value after identifying a query word: a value close to 0 indicates a high probability that the query word is a generic word, a value close to 1 indicates a high probability that it is a category word, and a value close to 2 indicates a high probability that it is a chain brand word.
Step 304: use the trained recognizer to identify the not-yet-identified query words among the query words corresponding to all POI data groups, and determine which query words are chain brand words and which are category words;
This specifically comprises:
1) inputting the recognition features of the POI data group corresponding to a not-yet-identified query word into the first recognizer, which outputs a first identification result indicating whether the query word corresponding to the POI data group is a suspected chain brand word or a suspected category word;
2) inputting the recognition features of the POI data groups whose query words are suspected chain brand words in the first identification result into the second recognizer, which outputs a second identification result indicating whether the query word corresponding to the POI data group is a generic word or a chain brand word;
3) inputting the recognition features of the POI data groups whose query words are suspected category words in the first identification result into the third recognizer, which outputs a third identification result indicating whether the query word of the POI data group is a generic word or a category word;
4) extracting the chain brand words and category words from the second identification result and the third identification result.
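The cascade of the three recognizers can be sketched as follows. Each recognizer is represented as a hypothetical callable that maps a feature vector to a string label; the label strings and the function name are assumptions for illustration, and in practice each callable would wrap a trained classifier.

```python
from typing import Callable, Sequence

Recognizer = Callable[[Sequence[float]], str]


def classify_query_word(features: Sequence[float],
                        first: Recognizer,
                        second: Recognizer,
                        third: Recognizer) -> str:
    """Route a POI data group's features through the three recognizers.

    first  -> "suspected_chain_brand" or "suspected_category"
    second -> "chain_brand" or "generic"
    third  -> "category" or "generic"
    """
    suspected = first(features)
    if suspected == "suspected_chain_brand":
        return second(features)
    if suspected == "suspected_category":
        return third(features)
    raise ValueError("unexpected first-recognizer output: %r" % (suspected,))
```

Because the first recognizer is trained only on chain brand words and category words, the second and third recognizers are what separate out generic words in this arrangement.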
In practice, the first, second, and third recognizers can all be linear classifiers of the form given in formula (1) above. They differ in that the weight coefficients of the recognition features and the constant term may be different for each recognizer; these values are obtained from the training data used to train that recognizer.
Step 305: store the query words identified as chain brand words in a preset chain brand word dictionary, and store the query words identified as category words in a preset category word dictionary.
After the category word dictionary and the chain brand word dictionary have been established, the chain brand word dictionary established in this embodiment can be merged with the chain brand word dictionary established in embodiment one to form an overall chain brand word dictionary.
In addition, the category word dictionary and the chain brand word dictionary are likewise applied to the search strategy and the sorting of search results in the navigation engine. When the navigation engine determines that a query word is a category word or a chain brand word, the method it uses to search for POI data matching the query word and the method it uses to sort the search results differ from those used when the query word is a generic word. When the navigation engine determines that the query word entered by the user is a category word (that is, the query word matches an entry in the category word dictionary), the user presumably wants to search for POIs of a certain category. The navigation engine therefore filters the POIs matching the category word out of the POI database and displays the query results in order of increasing distance between the POI and the user's location. When the navigation engine determines that the query word entered by the user is a chain brand word (that is, the query word matches an entry in the chain brand word dictionary), the user presumably wants to find a nearby outlet of the chain brand, since the outlets of a chain brand are distributed fairly evenly geographically. The navigation engine therefore searches for POIs matching the chain brand word within a certain range around the user's location and displays the search results in order of increasing distance between the POI and the user's location.
As the above embodiment shows, compared with the prior art, the present invention has the following advantages:
A method is provided for training a recognizer based on the POI data corresponding to query words obtained from the user query log and on recognition features extracted from the POI data. The recognizer automatically identifies, from the query words corresponding to all the POI data obtained, the query words that are chain brand words, and a chain brand word dictionary is built from the query words so identified. This automatic identification not only improves work efficiency; the query words corresponding to all POI data obtained from the user query log can also be re-identified periodically, so that the dictionary is updated in a timely manner.
Embodiment four
Embodiment four differs from embodiment three in that, after the recognizer is obtained, its identification accuracy can be checked further. If the check shows that the accuracy does not meet the requirement, the recognizer is adjusted and checked again, and the check and adjustment are repeated until the identification accuracy meets the requirement. As shown in Figure 4, which is a flowchart of another method of the present invention for establishing a chain brand word dictionary and a classifier dictionary, the method comprises the following steps:
Step 401: from the user query logs, obtain the POI data that different users obtained by querying with the same query word in the same city, and aggregate the obtained POI data into a POI data group, the POI data group corresponding to the query word;
Step 402: extract the recognition features of the POI data group from each POI data group;
The recognition features are the parameters used to identify whether the query word corresponding to the POI data group is a chain brand word, a classifier or a generic word.
Step 403: from all POI data groups, extract the POI data groups whose query words are marked as chain brand word, classifier or generic word as training data, and train the recognizer on the recognition features of the training data;
For the specific implementation of steps 401 to 403, see embodiment three; it is not repeated here. The checking process follows:
Step 404: from all POI data groups, extract POI data groups whose query words are marked as chain brand word, classifier or generic word as check data, the check data being different from the training data;
For the specific extraction method, see the description in embodiment two; it is not repeated here.
Step 405: use the recognizer to identify the query words of the check data, identifying the query words that are chain brand words and the query words that are classifiers;
For the specific identification method, see the description in embodiment two; it is not repeated here.
Step 406: from the recognizer's recognition result on the check data, calculate the recognizer's recognition accuracy and/or recall rate for chain brand words, and its recognition accuracy and/or recall rate for classifiers;
Where: the recognition accuracy for chain brand words (or classifiers) equals the number of query words in the recognition result that are correctly chain brand words (or classifiers) divided by the number of query words identified as chain brand words (or classifiers) in the recognition result; the recall rate for chain brand words (or classifiers) equals the number of query words in the recognition result that are correctly chain brand words (or classifiers) divided by the number of query words marked as chain brand words (or classifiers); a query word is correctly a chain brand word (or classifier) when it is both marked as and identified as a chain brand word (or classifier).
For chain brand words there are three possibilities: calculating only the recognition accuracy, calculating only the recall rate, or calculating both. Likewise, for classifiers there are three possibilities: calculating only the recognition accuracy, calculating only the recall rate, or calculating both.
Step 407: judge whether the recognition accuracy and/or recall rate for chain brand words is greater than or equal to its corresponding threshold, and whether the recognition accuracy and/or recall rate for classifiers is greater than or equal to its corresponding threshold; if the recognition accuracy and/or recall rate for chain brand words is below its corresponding threshold, or the recognition accuracy and/or recall rate for classifiers is below its corresponding threshold, go to step 408; otherwise, go to step 409;
Step 408: adjust the recognizer and return to step 405;
For example, the recognition features extracted when training the recognizer can be modified. As another example, the coefficients used when extracting the recognition features can be modified, such as the normalization coefficient used to normalize the number of POI data in a POI data group, or the number of regions into which the area is divided when calculating the spatial distribution entropy, for instance changing a 3*3 division into 4*4.
In addition, the weight coefficient of each recognition feature in the recognizer can be modified, or the classification boundary values used to distinguish generic words, chain brand words and classifiers can be modified.
Step 409: use the trained recognizer to identify the not-yet-identified query words among the query words corresponding to all POI data groups, identifying the query words that are chain brand words and those that are classifiers;
Step 410: store the query words that are chain brand words in a preset chain brand word dictionary, and store the query words that are classifiers in a preset classifier dictionary.
For the specific implementation of steps 409 and 410, see embodiment three; it is not repeated here.
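The calculation in step 406 and the judgment in step 407 can be sketched as below. The function names, the label strings and the dictionary-based data layout are assumptions for illustration. The sketch checks both the recognition accuracy and the recall rate for each word type, which is only one of the three options the text allows.

```python
def precision_recall(labeled, predicted, target):
    """Recognition accuracy and recall rate for one word type.

    labeled:   query word -> label marked in the check data
    predicted: query word -> label output by the recognizer
    """
    # Query words that are both marked as and identified as the target type.
    correct = sum(1 for q, lab in labeled.items()
                  if lab == target and predicted.get(q) == target)
    identified = sum(1 for q in labeled if predicted.get(q) == target)
    marked = sum(1 for lab in labeled.values() if lab == target)
    precision = correct / identified if identified else 0.0
    recall = correct / marked if marked else 0.0
    return precision, recall


def passes_check(labeled, predicted, thresholds):
    """Step 407: every checked word type must reach its own thresholds.

    thresholds: word type -> (minimum accuracy, minimum recall)
    """
    for target, (min_precision, min_recall) in thresholds.items():
        precision, recall = precision_recall(labeled, predicted, target)
        if precision < min_precision or recall < min_recall:
            return False  # go to step 408 and adjust the recognizer
    return True  # go to step 409
```

Returning 0.0 when nothing was identified or marked is a design choice of this sketch; it makes an empty result fail any positive threshold instead of raising a division error.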
As can be seen from the above-described embodiment, compared with prior art, the invention has the advantages that:
A method is provided in which a recognizer is trained on the POI data corresponding to query words, obtained from user query logs, and on the recognition features extracted from that POI data. The recognizer automatically identifies, among the query words corresponding to all the obtained POI data, those that are chain brand words, and a chain brand word dictionary is built from the query words so identified. This automatic identification improves work efficiency; moreover, by periodically running the identification on the query words corresponding to all POI data obtained from the user query logs, the dictionary can be kept up to date. In addition, after the recognizer is trained, check data whose title trunks are marked as chain brand word, classifier or generic word is extracted to check the recognizer further, and the recognizer is adjusted when the check fails, which ensures the accuracy and validity of the recognizer's identification of the title trunks of POI data groups.
Embodiment five
Corresponding to the chain brand word dictionary establishing method described above, an embodiment of the present invention also provides a chain brand word dictionary establishing device. Fig. 5 is a structural diagram of an embodiment of a chain brand word dictionary establishing device of the present invention. The device comprises a first aggregation unit 501, a first feature extraction unit 502, a first training unit 503, a first recognition unit 504 and a first dictionary establishing unit 505. The internal structure and connections of the device are described further below in conjunction with its operating principle.
The first aggregation unit 501 is configured to aggregate POI data with the same title trunk in the POI database of the same city into a POI data group, the POI data group corresponding to the title trunk;
The first feature extraction unit 502 is configured to extract the recognition features of the POI data group from each POI data group;
The first training unit 503 is configured to extract, from all POI data groups, POI data groups whose title trunks are marked as chain brand word or non-chain brand word as training data, and to train the chain brand word recognizer on the recognition features of the training data;
The first recognition unit 504 is configured to use the trained chain brand word recognizer to identify the not-yet-identified title trunks among the title trunks corresponding to all POI data groups, identifying the title trunks that are chain brand words;
The first dictionary establishing unit 505 is configured to store the title trunks that are chain brand words in a preset chain brand word dictionary.
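The data flow through units 501, 504 and 505 can be sketched as follows. This is an illustration under assumptions: each POI record is taken to be a dict that already carries a `city` and an extracted `trunk` field, and `extract_features` and `recognizer` are hypothetical callables standing in for units 502 and 503, whose internals this passage does not specify.

```python
from collections import defaultdict


def group_by_trunk(pois):
    """Unit 501: aggregate POIs of the same city sharing a title trunk."""
    groups = defaultdict(list)
    for poi in pois:
        groups[(poi["city"], poi["trunk"])].append(poi)
    return dict(groups)


def build_dictionary(groups, extract_features, recognizer, labeled_trunks):
    """Units 504 and 505: identify unlabelled trunks and collect chain brands.

    extract_features: callable, POI list -> feature vector (hypothetical)
    recognizer:       callable, feature vector -> bool (hypothetical)
    labeled_trunks:   trunks already marked by hand, skipped here
    """
    dictionary = set()
    for (_city, trunk), members in groups.items():
        if trunk in labeled_trunks:
            continue  # only not-yet-identified title trunks are identified
        if recognizer(extract_features(members)):
            dictionary.add(trunk)
    return dictionary
```

Because the dictionary is a set, a trunk recognized as a chain brand word in several cities is stored once.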
Preferably, as shown in Figure 6, the device further comprises a first extracting unit 506, a second recognition unit 507, a first computing unit 508, a first judging unit 509 and a first regulating unit 510, wherein:
The first extracting unit 506 is configured to, before the first recognition unit 504 uses the trained chain brand word recognizer to identify the not-yet-identified title trunks among the title trunks corresponding to all POI data groups, extract from all POI data groups the POI data groups whose title trunks are marked as chain brand word or non-chain brand word as check data, the check data being different from the training data;
The second recognition unit 507 is configured to use the chain brand word recognizer to identify the title trunks of the check data, identifying the title trunks that are chain brand words;
The first computing unit 508 is configured to calculate, from the chain brand word recognizer's recognition result on the check data, the recognizer's recognition accuracy and/or recall rate for chain brand words, where the recognition accuracy equals the number of title trunks in the recognition result that are correctly chain brand words divided by the number of title trunks identified as chain brand words in the recognition result, the recall rate equals the number of title trunks in the recognition result that are correctly chain brand words divided by the number of title trunks marked as chain brand words in the check data, and a title trunk is correctly a chain brand word when it is both marked as and identified as a chain brand word;
The first judging unit 509 is configured to judge whether the recognition accuracy and/or recall rate is greater than or equal to its corresponding threshold;
The first regulating unit 510 is configured to, if the judgment of the first judging unit 509 is negative, adjust the chain brand word recognizer and trigger the second recognition unit 507, the first computing unit 508 and the first judging unit 509 to run again with the adjusted chain brand word recognizer;
The first recognition unit 504 is then specifically configured to use the adjusted chain brand word recognizer to identify the not-yet-identified title trunks among the title trunks corresponding to all POI data groups.
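The cooperation of units 507 to 510 amounts to a check-and-adjust loop, sketched below. The names `check_and_adjust`, `ThresholdRecognizer` and `lower_threshold` are hypothetical, as is the `max_rounds` cap: the text repeats the check until the thresholds are met and does not specify an upper bound. Lowering a single threshold is only one toy adjustment strategy among those the description mentions.

```python
def check_and_adjust(recognizer, adjust, check_data,
                     min_precision, min_recall, max_rounds=10):
    """Repeat identify / compute / judge until the thresholds are met.

    check_data: title trunk -> (feature vector, marked_as_chain_brand)
    adjust:     callable returning an adjusted recognizer (hypothetical)
    """
    marked = {t for t, (_, is_chain) in check_data.items() if is_chain}
    for _ in range(max_rounds):
        # Second recognition unit 507: identify the check data.
        identified = {t for t, (features, _) in check_data.items()
                      if recognizer(features)}
        # First computing unit 508: accuracy and recall.
        correct = identified & marked
        precision = len(correct) / len(identified) if identified else 0.0
        recall = len(correct) / len(marked) if marked else 0.0
        # First judging unit 509.
        if precision >= min_precision and recall >= min_recall:
            return recognizer
        # First regulating unit 510.
        recognizer = adjust(recognizer, precision, recall)
    raise RuntimeError("recognizer did not reach the thresholds")


class ThresholdRecognizer:
    """Toy recognizer: chain brand if the first feature reaches a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def __call__(self, features):
        return features[0] >= self.threshold


def lower_threshold(recognizer, precision, recall):
    """Toy adjustment: accept more title trunks on the next round."""
    return ThresholdRecognizer(recognizer.threshold - 1.0)
```

The recognizer returned by the loop is the one that the first recognition unit 504 would then apply to the not-yet-identified title trunks.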
Preferably, the chain brand word recognizer is a linear classifier of the form:
$$ y = \sum_{i} (W_i \times X_i) + b $$
where $W_i$ is the weight coefficient of the i-th recognition feature, $X_i$ is the value of the i-th recognition feature, and $b$ is a constant term. When $y$ is greater than or equal to a preset threshold, the title trunk corresponding to the POI data group is identified as a chain brand word; when $y$ is less than the preset threshold, the title trunk corresponding to the POI data group is identified as a non-chain brand word.
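The decision rule of the linear classifier can be written directly from the formula. The function names and the example numbers are illustrative only; the weights, constant term and threshold would come from training.

```python
def linear_score(weights, features, bias):
    """y = sum(W_i * X_i) + b over all recognition features."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def is_chain_brand(weights, features, bias, threshold):
    """Chain brand word when y reaches the preset threshold, else not."""
    return linear_score(weights, features, bias) >= threshold
```

For instance, with weights `[0.5, 2.0]`, feature values `[4.0, 1.0]` and constant term `-1.0`, the score is 0.5 × 4.0 + 2.0 × 1.0 − 1.0 = 3.0, so the title trunk is identified as a chain brand word for any threshold up to 3.0.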
The recognition features of a POI data group are any one or any combination of the following:
The spatial distribution distance; the spatial distribution entropy; the proportion of POI data in the POI data group whose titles carry a branch indicator; the proportion of POI data in the POI data group whose titles carry a door mark; and the category score of the POI data group. The category score is obtained by aggregating the POI data of the same category in the POI data group into data groups and taking the preset score corresponding to the category of the data group that contains the most POI data. The preset score is derived from the preset prior probability that a chain brand establishment occurs in that category. The prior probability equals N/M, where M is the number of POI data corresponding to the title trunks marked as chain brand words in the training data, and N is the number of POI data among those M whose category is the same as the category of the data group containing the most POI data.
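The category score can be sketched as follows. The text says only that the preset score is derived from the prior probability N/M, so using the prior itself as the score is an assumption of this example, as are the function names and the list-of-category-strings input.

```python
from collections import Counter


def category_priors(chain_poi_categories):
    """Prior N/M per category.

    chain_poi_categories: the category of every POI whose title trunk is
    marked as a chain brand word in the training data (M items in total).
    """
    m = len(chain_poi_categories)
    counts = Counter(chain_poi_categories)
    return {category: n / m for category, n in counts.items()}


def category_score(group_categories, priors):
    """Score of a POI data group: the prior of its most common category."""
    if not group_categories:
        return 0.0
    dominant, _count = Counter(group_categories).most_common(1)[0]
    # Assumption: the preset score is the prior itself; unseen categories
    # get 0.0.
    return priors.get(dominant, 0.0)
```

With training categories `["restaurant", "restaurant", "bank", "hotel"]`, M is 4 and the prior for `restaurant` is 2/4; a POI data group made up mostly of banks would receive the `bank` prior of 1/4.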
As can be seen from the above-described embodiment, compared with prior art, the invention has the advantages that:
A device is provided that trains a chain brand word recognizer on all POI data in a POI database and on the recognition features extracted from the POI data, uses the chain brand word recognizer to automatically identify the title trunks that are chain brand words among all POI data in the POI database, and builds a chain brand word dictionary from the title trunks so identified. This automatic identification improves work efficiency; moreover, by periodically running the identification on the title trunks of all POI data in the POI database, the dictionary can be kept up to date.
Embodiment six
Corresponding to the chain brand word dictionary and classifier dictionary establishing method described above, an embodiment of the present invention also provides a chain brand word dictionary and classifier dictionary establishing device. Fig. 7 is a structural diagram of an embodiment of a chain brand word dictionary and classifier dictionary establishing device of the present invention. The device comprises a second aggregation unit 701, a second feature extraction unit 702, a second training unit 703, a third recognition unit 704 and a second dictionary establishing unit 705. The internal structure and connections of the device are described further below in conjunction with its operating principle.
The second aggregation unit 701 is configured to obtain, from the user query logs, the POI data that different users obtained by querying with the same query word in the same city, and to aggregate the obtained POI data into a POI data group, the POI data group corresponding to the query word;
The second feature extraction unit 702 is configured to extract the recognition features of the POI data group from each POI data group;
The second training unit 703 is configured to extract, from all POI data groups, POI data groups whose query words are marked as chain brand word, classifier or generic word as training data, and to train the recognizer on the recognition features of the training data;
The third recognition unit 704 is configured to use the trained recognizer to identify the not-yet-identified query words among the query words corresponding to all POI data groups, identifying the query words that are chain brand words and those that are classifiers;
The second dictionary establishing unit 705 is configured to store the query words that are chain brand words in a preset chain brand word dictionary, and to store the query words that are classifiers in a preset classifier dictionary.
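The aggregation performed by the second aggregation unit 701 can be sketched as a grouping of log entries by city and query word. The log entry layout (`city`, `query`, `poi_id`) and the per-POI count kept for each group are assumptions for illustration; the text does not specify the log format.

```python
from collections import Counter, defaultdict


def group_by_query(log_entries):
    """Aggregate logged POI results per (city, query word).

    Each entry is assumed to be a dict with 'city', 'query' and 'poi_id'.
    The result maps (city, query) -> {poi_id: number of log entries}.
    """
    groups = defaultdict(Counter)
    for entry in log_entries:
        groups[(entry["city"], entry["query"])][entry["poi_id"]] += 1
    return {key: dict(counts) for key, counts in groups.items()}
```

Keeping a count per POI rather than a bare set is a choice of this sketch; it leaves the information needed for count-based features available to the feature extraction step.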
Preferably, as shown in Figure 8, the device further comprises a second extracting unit 706, a fourth recognition unit 707, a second computing unit 708, a second judging unit 709 and a second regulating unit 710, wherein:
The second extracting unit 706 is configured to, before the third recognition unit 704 uses the trained recognizer to identify the not-yet-identified query words among the query words corresponding to all POI data groups, extract from all POI data groups the POI data groups whose query words are marked as chain brand word, classifier or generic word as check data, the check data being different from the training data;
The fourth recognition unit 707 is configured to use the recognizer to identify the query words of the check data, identifying the query words that are chain brand words and the query words that are classifiers;
The second computing unit 708 is configured to calculate, from the recognizer's recognition result on the check data, the recognizer's recognition accuracy and/or recall rate for chain brand words and its recognition accuracy and/or recall rate for classifiers, where: the recognition accuracy for chain brand words (or classifiers) equals the number of query words in the recognition result that are correctly chain brand words (or classifiers) divided by the number of query words identified as chain brand words (or classifiers) in the recognition result; the recall rate for chain brand words (or classifiers) equals the number of query words in the recognition result that are correctly chain brand words (or classifiers) divided by the number of query words marked as chain brand words (or classifiers); and a query word is correctly a chain brand word (or classifier) when it is both marked as and identified as a chain brand word (or classifier);
The second judging unit 709 is configured to judge whether the recognition accuracy and/or recall rate for chain brand words is greater than or equal to its corresponding threshold, and whether the recognition accuracy and/or recall rate for classifiers is greater than or equal to its corresponding threshold;
The second regulating unit 710 is configured to, if the recognition accuracy and/or recall rate for chain brand words is below its corresponding threshold, or the recognition accuracy and/or recall rate for classifiers is below its corresponding threshold, adjust the recognizer and trigger the fourth recognition unit 707, the second computing unit 708 and the second judging unit 709 to run again with the adjusted recognizer;
The third recognition unit 704 is then specifically configured to use the adjusted recognizer to identify the not-yet-identified query words among the query words corresponding to all POI data groups.
Preferably, the recognizer comprises a first recognizer, a second recognizer and a third recognizer, and the second training unit 703 comprises:
A first recognizer training subunit, configured to train the first recognizer on the recognition features of the POI data groups whose query words are marked as chain brand words and the POI data groups whose query words are marked as classifiers in the training data, obtaining a first recognizer that identifies, from the recognition features of a POI data group, whether its query word is a suspected classifier or a suspected chain brand word;
A second recognizer training subunit, configured to train the second recognizer on the recognition features of the POI data groups whose query words are marked as chain brand words and the POI data groups whose query words are marked as generic words in the training data, obtaining a second recognizer that identifies, from the recognition features of a POI data group, whether its query word is a chain brand word or a generic word;
A third recognizer training subunit, configured to train the third recognizer on the recognition features of the POI data groups whose query words are marked as classifiers and the POI data groups whose query words are marked as generic words in the training data, obtaining a third recognizer that identifies, from the recognition features of a POI data group, whether its query word is a classifier or a generic word.
Preferably, the third recognition unit 704 comprises:
A first recognition result determination subunit, configured to input the recognition features of the POI data group corresponding to a not-yet-identified query word into the first recognizer, the first recognizer outputting a first recognition result indicating that the query word corresponding to the POI data group is a suspected chain brand word or a suspected classifier;
A second recognition result determination subunit, configured to input the recognition features of the POI data group corresponding to a query word that is a suspected chain brand word in the first recognition result into the second recognizer, the second recognizer outputting a second recognition result indicating that the query word corresponding to the POI data group is a generic word or a chain brand word;
A third recognition result determination subunit, configured to input the recognition features of the POI data group corresponding to a query word that is a suspected classifier in the first recognition result into the third recognizer, the third recognizer outputting a third recognition result indicating that the query word of the POI data group is a generic word or a classifier;
A chain brand word/classifier extraction subunit, configured to extract the chain brand words and classifiers from the second recognition result and the third recognition result.
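The three recognizers form a two-stage cascade, sketched below. The recognizers are passed in as plain callables that return label strings; these label strings and the function names are assumptions for illustration, and the way each recognizer is trained is outside the scope of the sketch.

```python
def cascade(features, first, second, third):
    """Route one POI data group through the three recognizers.

    first:  features -> "chain_brand" or "category" (suspected type)
    second: features -> "chain_brand" or "generic"
    third:  features -> "category" or "generic"
    """
    if first(features) == "chain_brand":
        # Suspected chain brand word: confirm against generic words.
        return second(features)
    # Suspected classifier: confirm against generic words.
    return third(features)


def extract_words(groups, first, second, third):
    """Collect chain brand words and classifiers from the final results.

    groups: query word -> feature vector of its POI data group
    """
    chain_brand_words, category_words = set(), set()
    for query, features in groups.items():
        label = cascade(features, first, second, third)
        if label == "chain_brand":
            chain_brand_words.add(query)
        elif label == "category":
            category_words.add(query)
    return chain_brand_words, category_words
```

Each query word passes through exactly two recognizers, so generic words are filtered out in the second stage regardless of which suspected type the first stage assigned.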
Preferably, the recognition features of a POI data group are any one or any combination of the following:
The number of POI data in the POI data group; the click distribution entropy of the POI data group; the number of categories of POI data in the POI data group; the by-category click distribution entropy of the POI data group; the spatial distribution distance; the spatial distribution entropy; the number of cities in which the same query word appears; the proportion of POI data in the POI data group whose titles carry a branch indicator; the proportion of POI data in the POI data group whose titles carry a door mark; and the clicked ratio of POI data, where the clicked ratio of POI data equals M divided by N, M is the number of title trunks obtained by extracting title trunks from the titles of the POI data in the POI data group, and N is the number of POI data in the POI data group.
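The two click distribution entropy features listed above are not defined in this passage. The sketch below assumes they are Shannon entropies of the click shares, taken per POI and per category respectively; both that definition and the input layout (`clicks` and `category_of` dicts) are assumptions, not the patent's stated formula.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)


def click_entropy(clicks):
    """Assumed click distribution entropy: entropy over per-POI clicks.

    clicks: poi_id -> number of clicks within one POI data group
    """
    return shannon_entropy(list(clicks.values()))


def click_entropy_by_category(clicks, category_of):
    """Assumed by-category variant: pool clicks per category first.

    category_of: poi_id -> category
    """
    per_category = Counter()
    for poi_id, n in clicks.items():
        per_category[category_of[poi_id]] += n
    return shannon_entropy(list(per_category.values()))
```

Under this assumed definition, clicks spread evenly over many POIs give a high entropy, while clicks concentrated on one POI give an entropy of zero.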
As can be seen from the above-described embodiment, compared with prior art, the invention has the advantages that:
A method is provided that trains a chain brand word recognizer on all POI data in a POI database and on the recognition features extracted from the POI data, uses the chain brand word recognizer to automatically identify the title trunks that are chain brand words among all POI data in the POI database, and builds a chain brand word dictionary from the title trunks so identified. This automatic identification improves work efficiency; moreover, by periodically running the identification on the title trunks of all POI data in the POI database, the dictionary can be kept up to date.
It should be noted that a person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium can be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The method and device provided by the present invention for establishing a chain brand word dictionary and a classifier dictionary have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims (18)

1. A chain brand word dictionary establishing method, characterized by comprising:
Aggregating POI data with the same title trunk in the point of interest (POI) database of the same city into a POI data group, the POI data group corresponding to the title trunk;
Extracting the recognition features of the POI data group from each POI data group;
Extracting, from all POI data groups, POI data groups whose title trunks are marked as chain brand word or non-chain brand word as training data, and training a chain brand word recognizer on the recognition features of the training data;
Using the trained chain brand word recognizer to identify the not-yet-identified title trunks among the title trunks corresponding to all POI data groups, identifying the title trunks that are chain brand words;
Storing the title trunks that are chain brand words in a preset chain brand word dictionary.
2. The method according to claim 1, characterized in that, before the trained chain brand word recognizer is used to identify the not-yet-identified title trunks among the title trunks corresponding to all POI data groups, the method further comprises a checking process, the checking process comprising:
Extracting, from all POI data groups, POI data groups whose title trunks are marked as chain brand word or non-chain brand word as check data, the check data being different from the training data;
Using the chain brand word recognizer to identify the title trunks of the check data, identifying the title trunks that are chain brand words;
Calculating, from the chain brand word recognizer's recognition result on the check data, the recognizer's recognition accuracy and/or recall rate for chain brand words, wherein the recognition accuracy equals the number of title trunks in the recognition result that are correctly chain brand words divided by the number of title trunks identified as chain brand words in the recognition result, the recall rate equals the number of title trunks in the recognition result that are correctly chain brand words divided by the number of title trunks marked as chain brand words in the check data, and a title trunk is correctly a chain brand word when it is both marked as and identified as a chain brand word;
Judging whether the recognition accuracy and/or recall rate is greater than or equal to its corresponding threshold;
If not, adjusting the chain brand word recognizer, and repeating the second to fourth steps of the checking process with the adjusted chain brand word recognizer.
3. The method according to claim 2, characterized in that the chain brand word recognizer is a linear classifier of the form:
$$ y = \sum_{i} (W_i \times X_i) + b $$
Wherein $W_i$ is the weight coefficient of the i-th recognition feature, $X_i$ is the value of the i-th recognition feature, and $b$ is a constant term; when $y$ is greater than or equal to a preset threshold, the title trunk corresponding to the POI data group is identified as a chain brand word, and when $y$ is less than the preset threshold, the title trunk corresponding to the POI data group is identified as a non-chain brand word.
4. The method according to any one of claims 1 to 3, characterized in that the recognition features of a POI data group are any one or any combination of the following:
The spatial distribution distance; the spatial distribution entropy; the proportion of POI data in the POI data group whose titles carry a branch indicator; the proportion of POI data in the POI data group whose titles carry a door mark; and the category score of the POI data group, wherein the category score is obtained by aggregating the POI data of the same category in the POI data group into data groups and taking the preset score corresponding to the category of the data group that contains the most POI data, the preset score is derived from the preset prior probability that a chain brand establishment occurs in that category, and the prior probability equals N/M, where M is the number of POI data corresponding to the title trunks marked as chain brand words in the training data, and N is the number of POI data among those M whose category is the same as the category of the data group containing the most POI data.
5. A chain brand word dictionary and classifier dictionary establishing method, characterized by comprising:
Obtaining, from user query logs, the POI data that different users obtained by querying with the same query word in the same city, and aggregating the obtained POI data into a POI data group, the POI data group corresponding to the query word;
Extracting the recognition features of the POI data group from each POI data group;
Extracting, from all POI data groups, POI data groups whose query words are marked as chain brand word, classifier or generic word as training data, and training a recognizer on the recognition features of the training data;
Using the trained recognizer to identify the not-yet-identified query words among the query words corresponding to all POI data groups, identifying the query words that are chain brand words and those that are classifiers;
Storing the query words that are chain brand words in a preset chain brand word dictionary, and storing the query words that are classifiers in a preset classifier dictionary.
6. The method according to claim 5, characterized in that, before the trained recognizer is used to identify the not-yet-identified query words among the query words corresponding to all POI data groups, the method further comprises a checking process, the checking process comprising:
Extracting, from all POI data groups, POI data groups whose query words are marked as chain brand word, classifier or generic word as check data, the check data being different from the training data;
Using the recognizer to identify the query words of the check data, identifying the query words that are chain brand words and the query words that are classifiers;
Calculating, from the recognizer's recognition result on the check data, the recognizer's recognition accuracy and/or recall rate for chain brand words and its recognition accuracy and/or recall rate for classifiers, wherein: the recognition accuracy for chain brand words (or classifiers) equals the number of query words in the recognition result that are correctly chain brand words (or classifiers) divided by the number of query words identified as chain brand words (or classifiers) in the recognition result; the recall rate for chain brand words (or classifiers) equals the number of query words in the recognition result that are correctly chain brand words (or classifiers) divided by the number of query words marked as chain brand words (or classifiers); and a query word is correctly a chain brand word (or classifier) when it is both marked as and identified as a chain brand word (or classifier);
Judging whether the recognition accuracy and/or recall rate for chain brand words is greater than or equal to its corresponding threshold, and whether the recognition accuracy and/or recall rate for classifiers is greater than or equal to its corresponding threshold;
If the recognition accuracy and/or recall rate for chain brand words is below its corresponding threshold, or the recognition accuracy and/or recall rate for classifiers is below its corresponding threshold, adjusting the recognizer, and repeating the second to fourth steps of the checking process with the adjusted recognizer.
7. The method according to claim 5, characterized in that the recognizer comprises a first recognizer, a second recognizer and a third recognizer, and training the recognizer based on the recognition features of the training data specifically comprises:
Training the first recognizer based on the recognition features of the POI data groups in the training data whose query words are marked as chain brand words and of the POI data groups whose query words are marked as classifiers, to obtain a first recognizer that identifies, from the recognition features of a POI data group, whether the query word of the POI data group is a suspected classifier or a suspected chain brand word;
Training the second recognizer based on the recognition features of the POI data groups in the training data whose query words are marked as chain brand words and of the POI data groups whose query words are marked as generic words, to obtain a second recognizer that identifies, from the recognition features of a POI data group, whether the query word of the POI data group is a chain brand word or a generic word;
Training the third recognizer based on the recognition features of the POI data groups in the training data whose query words are marked as classifiers and of the POI data groups whose query words are marked as generic words, to obtain a third recognizer that identifies, from the recognition features of a POI data group, whether the query word of the POI data group is a classifier or a generic word.
8. The method according to claim 7, characterized in that using the trained recognizer to identify the unidentified query words among the query words corresponding to all POI data groups, and identifying the query words that are chain brand words and classifiers, specifically comprises:
Inputting the recognition features of the POI data group corresponding to an unidentified query word into the first recognizer, and outputting from the first recognizer a first recognition result indicating that the query word corresponding to the POI data group is a suspected chain brand word or a suspected classifier;
Inputting the recognition features of the POI data groups corresponding to the query words that are suspected chain brand words in the first recognition result into the second recognizer, the second recognizer outputting a second recognition result indicating that the query word corresponding to the POI data group is a generic word or a chain brand word;
Inputting the recognition features of the POI data groups corresponding to the query words that are suspected classifiers in the first recognition result into the third recognizer, the third recognizer outputting a third recognition result indicating that the query word of the POI data group is a generic word or a classifier;
Extracting the chain brand words and the classifiers from the second recognition result and the third recognition result.
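The three-stage identification in claims 7 and 8 can be sketched as a small cascade. The recognizers are passed in as plain callables and the label strings are hypothetical; the claims do not prescribe how each recognizer is implemented, so this only illustrates the routing between them.

```python
def cascade_label(features, first, second, third):
    """Route one POI data group's features through the three recognizers.

    first  -> "suspected_brand" or "suspected_category"
    second -> "brand" or "generic"
    third  -> "category" or "generic"
    All label strings are assumptions made for this sketch.
    """
    if first(features) == "suspected_brand":
        return second(features)
    return third(features)


def build_word_banks(groups, first, second, third):
    """groups maps each unidentified query word to its recognition features.
    Returns the chain brand words and the classifiers as two sets."""
    brand_words, category_words = set(), set()
    for query_word, features in groups.items():
        label = cascade_label(features, first, second, third)
        if label == "brand":
            brand_words.add(query_word)
        elif label == "category":
            category_words.add(query_word)
    return brand_words, category_words
```

Query words that either second-stage recognizer labels as generic fall into neither word bank.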
9. The method according to any one of claims 5 to 8, characterized in that the recognition feature of a POI data group is any one or any combination of the following:
The number of POI data in the POI data group; the click distribution entropy of the POI data group; the number of categories of POI data in the POI data group; the per-category click distribution entropy of the POI data group; the spatial distribution distance; the spatial distribution entropy; the number of cities in which the same query word occurs; the ratio of POI data in the POI data group whose titles carry a branch mark; the ratio of POI data in the POI data group whose titles carry a door mark; the clicked ratio of the POI data, wherein the clicked ratio of the POI data equals M divided by N, where M is the number of title trunks obtained by extracting title trunks from the titles of the POI data in the POI data group, and N is the number of POI data in the POI data group.
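As an illustration of the entropy-type features in this list, the sketch below computes a distribution entropy from per-POI click counts. This section does not define the entropy formula, so the use of Shannon entropy with the natural logarithm is an assumption, as is the function name.

```python
import math


def distribution_entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts,
    e.g. clicks per POI in one POI data group. Assumed definition."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            share = count / total
            entropy -= share * math.log(share)
    return entropy
```

Under this assumed definition, clicks concentrated on a single POI give an entropy of zero, while clicks spread evenly over many POIs give a high value. The same function could be applied to POI counts per spatial cell for a spatial distribution entropy.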
10. A device for establishing a chain brand word dictionary, characterized by comprising:
A first aggregation unit, for aggregating the POI data with the same title trunk in the POI database of the same city into one POI data group, the POI data group corresponding to the title trunk;
A first feature extraction unit, for extracting the recognition features of each POI data group from that POI data group;
A first training unit, for extracting from all POI data groups the POI data groups whose title trunks are marked as chain brand words and non-chain brand words as training data, and training a chain brand word recognizer based on the recognition features of the training data;
A first recognition unit, for using the trained chain brand word recognizer to identify the unidentified title trunks among the title trunks corresponding to all POI data groups, identifying the title trunks that are chain brand words;
A first dictionary establishing unit, for storing the title trunks that are chain brand words in a preset chain brand word dictionary.
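The aggregation performed by the first aggregation unit can be sketched as a grouping by city and title trunk. The record layout (dictionaries with `city` and `trunk` keys) is hypothetical, and the title trunk is assumed to have been extracted already by a separate step.

```python
from collections import defaultdict


def group_by_title_trunk(pois):
    """Aggregate POI records that share a city and a title trunk.

    Each record is a dict with hypothetical keys "city" and "trunk".
    Returns a dict mapping (city, trunk) to the list of records.
    """
    groups = defaultdict(list)
    for poi in pois:
        groups[(poi["city"], poi["trunk"])].append(poi)
    return dict(groups)
```

Each resulting list plays the role of one POI data group, from which the recognition features are then extracted.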
11. The device according to claim 10, characterized by further comprising:
A first extraction unit, for extracting from all POI data groups, before the first recognition unit uses the trained chain brand word recognizer to identify the unidentified title trunks among the title trunks corresponding to all POI data groups, the POI data groups whose title trunks are marked as chain brand words and non-chain brand words as check data, the check data being different from the training data;
A second recognition unit, for using the chain brand word recognizer to identify the title trunks of the check data, identifying the title trunks that are chain brand words;
A first computing unit, for calculating, from the recognition result of the chain brand word recognizer on the check data, the recognition accuracy and/or recognition recall rate of the chain brand word recognizer for chain brand words, wherein the recognition accuracy equals the number of title trunks in the recognition result that are accurately chain brand words divided by the number of title trunks identified as chain brand words in the recognition result, the recall rate equals the number of title trunks in the recognition result that are accurately chain brand words divided by the number of title trunks marked as chain brand words in the check data, and a title trunk that is accurately a chain brand word is a title trunk that is both marked as and identified as a chain brand word;
A first judging unit, for judging whether the recognition accuracy and/or recognition recall rate is greater than or equal to its corresponding threshold;
A first adjusting unit, for adjusting the chain brand word recognizer if the judgment result of the first judging unit is negative, and triggering the second recognition unit, the first computing unit and the first judging unit to operate again with the adjusted chain brand word recognizer;
The first recognition unit is then specifically used for identifying, with the adjusted chain brand word recognizer, the unidentified title trunks among the title trunks corresponding to all POI data groups.
12. The device according to claim 11, characterized in that the chain brand word recognizer is a linear classifier, the linear classifier being:
$$ y = \sum_{i} (W_i \times X_i) + b $$
Wherein $W_i$ is the weight coefficient of the i-th recognition feature, $X_i$ is the value of the i-th recognition feature, and $b$ is a constant term; when $y$ is greater than or equal to a predetermined threshold, the title trunk corresponding to the POI data group is identified as a chain brand word, and when $y$ is less than the predetermined threshold, the title trunk corresponding to the POI data group is identified as a non-chain brand word.
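The decision rule of the linear classifier can be sketched in a few lines. The function names and the example numbers in the usage note are illustrative assumptions; the claim does not state how the weights or the threshold are chosen.

```python
def linear_score(weights, features, bias):
    """Compute y = sum(W_i * X_i) + b for one POI data group."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def is_chain_brand(weights, features, bias, threshold):
    """True when y is greater than or equal to the predetermined threshold."""
    return linear_score(weights, features, bias) >= threshold
```

For example, with weights `[0.5, 2.0]`, feature values `[2.0, 1.0]` and constant term `-1.0`, the score is 2.0, so the title trunk is identified as a chain brand word for any threshold up to and including 2.0.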
13. The device according to any one of claims 10 to 12, characterized in that the recognition feature of a POI data group is any one or any combination of the following:
The spatial distribution distance; the spatial distribution entropy; the ratio of POI data in the POI data group whose titles carry a branch mark; the ratio of POI data in the POI data group whose titles carry a door mark; and the category score of the POI data group, wherein the category score is obtained by aggregating the POI data of the same category in the POI data group into one data group and taking the preset score corresponding to the category of the data group containing the most POI data; the preset score is obtained from a preset prior probability that a chain brand establishment appears in that category; and the prior probability equals N/M, where M is the number of POI data corresponding to the title trunks marked as chain brand words in the training data, and N is the number of those M POI data whose category is the same as the category of the data group containing the most POI data.
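The category score can be sketched as a lookup of the N/M prior for the dominant category of a POI data group. Using the prior probability directly as the preset score is an assumption made here, since the claim only says the score is obtained from the prior; the function names are likewise hypothetical.

```python
from collections import Counter


def category_priors(brand_poi_categories):
    """brand_poi_categories: the category of every POI whose title trunk is
    marked as a chain brand word in the training data (M entries).
    Returns a dict mapping each category to N / M."""
    m = len(brand_poi_categories)
    counts = Counter(brand_poi_categories)
    return {category: n / m for category, n in counts.items()}


def category_score(group_categories, priors):
    """Score of one POI data group: the prior of its most frequent category.
    Categories unseen in the training data score 0.0 (an assumption)."""
    if not group_categories:
        return 0.0
    dominant, _ = Counter(group_categories).most_common(1)[0]
    return priors.get(dominant, 0.0)
```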
14. A device for establishing a chain brand word dictionary and a classifier dictionary, characterized by comprising:
A second aggregation unit, for obtaining from the user query log the POI data that different users obtained by querying with the same query word in the same city, and aggregating the obtained POI data into one POI data group, the POI data group corresponding to the query word;
A second feature extraction unit, for extracting the recognition features of each POI data group from that POI data group;
A second training unit, for extracting from all POI data groups the POI data groups whose query words are marked as chain brand words, classifiers and generic words as training data, and training a recognizer based on the recognition features of the training data;
A third recognition unit, for using the trained recognizer to identify the unidentified query words among the query words corresponding to all POI data groups, identifying the query words that are chain brand words and classifiers;
A second dictionary establishing unit, for storing the query words that are chain brand words in a preset chain brand word dictionary, and storing the query words that are classifiers in a preset classifier dictionary.
15. The device according to claim 14, characterized by further comprising:
A second extraction unit, for extracting from all POI data groups, before the third recognition unit uses the trained recognizer to identify the unidentified query words among the query words corresponding to all POI data groups, the POI data groups whose query words are marked as chain brand words, classifiers and generic words as check data, the check data being different from the training data;
A fourth recognition unit, for using the recognizer to identify the query words of the check data, identifying which query words are chain brand words and which are classifiers;
A second computing unit, for calculating, from the recognition result of the recognizer on the check data, the recognition accuracy and/or recognition recall rate of the recognizer for chain brand words, and the recognition accuracy and/or recognition recall rate of the recognizer for classifiers, wherein: the recognition accuracy for chain brand words/classifiers equals the number of query words in the recognition result that are accurately chain brand words/classifiers divided by the number of query words identified as chain brand words/classifiers in the recognition result; the recognition recall rate for chain brand words/classifiers equals the number of query words that are accurately chain brand words/classifiers divided by the number of query words marked as chain brand words/classifiers in the recognition result; and a query word that is accurately a chain brand word/classifier is a query word that is both marked as and identified as a chain brand word/classifier;
A second judging unit, for judging whether the recognition accuracy and/or recognition recall rate for chain brand words is greater than or equal to its corresponding threshold, and judging whether the recognition accuracy and/or recognition recall rate for classifiers is greater than or equal to its corresponding threshold;
A second adjusting unit, for adjusting the recognizer if the recognition accuracy and/or recognition recall rate for chain brand words is less than its corresponding threshold, or the recognition accuracy and/or recognition recall rate for classifiers is less than its corresponding threshold, and triggering the fourth recognition unit, the second computing unit and the second judging unit to operate again with the adjusted recognizer;
The third recognition unit is then specifically used for identifying, with the adjusted recognizer, the unidentified query words among the query words corresponding to all POI data groups.
16. The device according to claim 14, characterized in that the recognizer comprises a first recognizer, a second recognizer and a third recognizer, and the second training unit comprises:
A first recognizer training subunit, for training the first recognizer based on the recognition features of the POI data groups in the training data whose query words are marked as chain brand words and of the POI data groups whose query words are marked as classifiers, to obtain a first recognizer that identifies, from the recognition features of a POI data group, whether the query word of the POI data group is a suspected classifier or a suspected chain brand word;
A second recognizer training subunit, for training the second recognizer based on the recognition features of the POI data groups in the training data whose query words are marked as chain brand words and of the POI data groups whose query words are marked as generic words, to obtain a second recognizer that identifies, from the recognition features of a POI data group, whether the query word of the POI data group is a chain brand word or a generic word;
A third recognizer training subunit, for training the third recognizer based on the recognition features of the POI data groups in the training data whose query words are marked as classifiers and of the POI data groups whose query words are marked as generic words, to obtain a third recognizer that identifies, from the recognition features of a POI data group, whether the query word of the POI data group is a classifier or a generic word.
17. The device according to claim 16, characterized in that the third recognition unit comprises:
A first recognition result determination subunit, for inputting the recognition features of the POI data group corresponding to an unidentified query word into the first recognizer, and outputting from the first recognizer a first recognition result indicating that the query word corresponding to the POI data group is a suspected chain brand word or a suspected classifier;
A second recognition result determination subunit, for inputting the recognition features of the POI data groups corresponding to the query words that are suspected chain brand words in the first recognition result into the second recognizer, the second recognizer outputting a second recognition result indicating that the query word corresponding to the POI data group is a generic word or a chain brand word;
A third recognition result determination subunit, for inputting the recognition features of the POI data groups corresponding to the query words that are suspected classifiers in the first recognition result into the third recognizer, the third recognizer outputting a third recognition result indicating that the query word of the POI data group is a generic word or a classifier;
A chain brand word/classifier extraction subunit, for extracting the chain brand words and the classifiers from the second recognition result and the third recognition result.
18. The device according to any one of claims 14 to 17, characterized in that the recognition feature of a POI data group is any one or any combination of the following:
The number of POI data in the POI data group; the click distribution entropy of the POI data group; the number of categories of POI data in the POI data group; the per-category click distribution entropy of the POI data group; the spatial distribution distance; the spatial distribution entropy; the number of cities in which the same query word occurs; the ratio of POI data in the POI data group whose titles carry a branch mark; the ratio of POI data in the POI data group whose titles carry a door mark; the clicked ratio of the POI data, wherein the clicked ratio of the POI data equals M divided by N, where M is the number of title trunks obtained by extracting title trunks from the titles of the POI data in the POI data group, and N is the number of POI data in the POI data group.
CN201310439450.6A 2013-09-24 2013-09-24 Chain brand word dictionary, classifier dictionary method for building up and device Active CN104462143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310439450.6A CN104462143B (en) 2013-09-24 2013-09-24 Chain brand word dictionary, classifier dictionary method for building up and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310439450.6A CN104462143B (en) 2013-09-24 2013-09-24 Chain brand word dictionary, classifier dictionary method for building up and device

Publications (2)

Publication Number Publication Date
CN104462143A true CN104462143A (en) 2015-03-25
CN104462143B CN104462143B (en) 2018-01-30

Family

ID=52908199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310439450.6A Active CN104462143B (en) 2013-09-24 2013-09-24 Chain brand word dictionary, classifier dictionary method for building up and device

Country Status (1)

Country Link
CN (1) CN104462143B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7310602B2 (en) * 2004-09-27 2007-12-18 Kabushiki Kaisha Equos Research Navigation apparatus
CN101350154A (en) * 2008-09-16 2009-01-21 北京搜狗科技发展有限公司 Method and apparatus for ordering electronic map data
CN102047249A (en) * 2008-05-27 2011-05-04 高通股份有限公司 Method and apparatus for aggregating and presenting data associated with geographic locations
US8073200B2 (en) * 2007-06-06 2011-12-06 Sony Corporation Information processing apparatus, information processing method, and computer program

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183908A (en) * 2015-09-30 2015-12-23 北京奇虎科技有限公司 Point of interest (POI) data classifying method and device
CN105183908B (en) * 2015-09-30 2019-05-28 北京奇虎科技有限公司 A kind of classification method and device of point of interest POI data
CN106933883A (en) * 2015-12-31 2017-07-07 中移(苏州)软件技术有限公司 Point of interest Ordinary search word sorting technique, device based on retrieval daily record
CN106933883B (en) * 2015-12-31 2019-12-27 中移(苏州)软件技术有限公司 Method and device for classifying common search terms of interest points based on search logs
WO2018176913A1 (en) * 2017-03-31 2018-10-04 北京三快在线科技有限公司 Search method and apparatus, and non-temporary computer-readable storage medium
US11144594B2 (en) 2017-03-31 2021-10-12 Beijing Sankuai Online Technology Co., Ltd Search method, search apparatus and non-temporary computer-readable storage medium for text search
CN108280198A (en) * 2018-01-29 2018-07-13 口碑(上海)信息技术有限公司 List generation method and device
CN109885752A (en) * 2019-01-14 2019-06-14 口碑(上海)信息技术有限公司 Brand word method for digging, device, equipment and readable storage medium storing program for executing
CN109885752B (en) * 2019-01-14 2021-03-02 口碑(上海)信息技术有限公司 Brand word mining method, device, equipment and readable storage medium
CN110781283A (en) * 2019-09-16 2020-02-11 腾讯大地通途(北京)科技有限公司 Chain brand word bank generation method and device and electronic equipment
CN110781283B (en) * 2019-09-16 2023-12-08 腾讯大地通途(北京)科技有限公司 Chain brand word stock generation method and device and electronic equipment
CN111782979A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Point of interest brand classification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN104462143B (en) 2018-01-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200511

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 102200, No. 8, No., Changsheng Road, Changping District science and Technology Park, Beijing, China. 1-5

Patentee before: AUTONAVI SOFTWARE Co.,Ltd.