CN105550169A - Method and device for identifying point of interest names based on character length - Google Patents

Method and device for identifying point of interest names based on character length

Info

Publication number
CN105550169A
Authority
CN
China
Prior art keywords
interest point
point name
string length
candidate
text participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510921183.5A
Other languages
Chinese (zh)
Inventor
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201510921183.5A
Publication of CN105550169A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a method and device for identifying point of interest names based on string length. The method comprises the following steps: performing word segmentation on a text string to be identified to obtain text segmented words; screening out, from the text segmented words, candidate text segmented words within a certain string length range; and comparing the screened candidate text segmented words with the point of interest names corresponding to the string length range to judge whether the candidate text segmented words are point of interest names. According to the embodiment, the subsequent amount of computation is greatly reduced, so memory consumption is reduced, memory constraints are effectively eased, and lookup efficiency is improved.

Description

Method and device for identifying point of interest names based on character length
Technical field
The present invention relates to the technical field of computer processing, and in particular to a method for identifying point of interest names based on string length and a device for identifying point of interest names based on string length.
Background
A point of interest (Point of Interest, POI), also called an "information point", carries several kinds of information, such as a name, a category, a longitude and a latitude.
In a geographic information system, a POI may be a house, a shop, a mailbox, a bus stop, and so on.
Traditional geographic information collection requires surveyors to obtain the longitude and latitude of a point of interest with precise surveying instruments and then mark it.
Precisely because collecting POI data is so time-consuming and laborious, the number of POIs in a geographic information system represents, to a certain extent, the value of the whole system.
To enrich the POI data of a geographic information system, in scenarios such as text mining it is often necessary to judge whether a text contains any POI name from a given set of POI names.
At present, the usual approach is to build a dictionary from the given set of POI names. A given text string is split into characters, and each substring formed by n adjacent characters (denoted an n-length substring) is looked up in the dictionary; if it is found, the text string is considered to contain a POI name.
The given set of POI names often contains on the order of tens of millions of entries, so the data volume is very large and loading the dictionary consumes a large amount of memory. In some cases, for example when the dictionary is loaded for Hadoop distributed computing and memory is limited by the system, directly traversing the dictionary fails.
Moreover, when the n-length substrings of a text string are looked up in the dictionary, much of the data is invalid, which greatly increases the amount of computation and makes lookup very inefficient.
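The baseline described in this background section can be sketched as follows. This is only an illustrative sketch, not code from the patent: the function name `contains_poi_ngram`, the `max_n` parameter and the sample strings are assumptions. It counts lookups to show how many substrings are tested for a single hit.

```python
def contains_poi_ngram(text, poi_names, max_n):
    """Baseline: look up every substring of up to max_n adjacent characters."""
    poi_set = set(poi_names)  # the whole name set must be held in memory
    lookups = 0
    hits = []
    for start in range(len(text)):
        for n in range(1, max_n + 1):
            if start + n > len(text):
                break
            lookups += 1
            piece = text[start:start + n]
            if piece in poi_set:
                hits.append(piece)
    return hits, lookups


# Hypothetical sample data: one real hit, many wasted lookups.
hits, lookups = contains_poi_ngram("我在北京大学门口", ["北京大学", "天安门"], max_n=4)
```

Most of the lookups here test substrings that cannot be POI names, which is the inefficiency the invention sets out to reduce.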
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for identifying point of interest names based on string length, and a corresponding device, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for identifying point of interest names based on string length is provided, comprising:
Performing word segmentation on a text string to be identified to obtain text segmented words;
Screening out, from the text segmented words, candidate text segmented words within a certain string length range;
Comparing the screened candidate text segmented words with the point of interest names corresponding to the string length range, to judge whether the candidate text segmented words are point of interest names.
Optionally, the step of screening out, from the text segmented words, candidate text segmented words within a certain string length range comprises:
Looking up a preset point of interest name dictionary, the dictionary having one or more point of interest name sets, where the longest string length and the shortest string length of the point of interest names in a set form the string length range;
Calculating the string length of the text segmented words;
Screening out, from the text segmented words, candidate text segmented words whose string length is within the string length range.
Optionally, the point of interest names in a point of interest name set have the same keyword;
The step of screening out, from the text segmented words, candidate text segmented words whose string length is within the string length range comprises:
Screening out, from the text segmented words, candidate text segmented words that match the keyword and whose string length is within the string length range.
Optionally, the same keyword is the first word of the name.
Optionally, the step of comparing the screened candidate text segmented words with the point of interest names corresponding to the string length range, to judge whether the candidate text segmented words are point of interest names, comprises:
Putting the candidate text segmented words and the point of interest names in the point of interest name set into the same container;
Comparing the candidate text segmented words and the point of interest names in the same container, to judge whether the candidate text segmented words are point of interest names.
Optionally, the step of comparing the candidate text segmented words and the point of interest names in the same container, to judge whether the candidate text segmented words are point of interest names, comprises:
Sorting the candidate text segmented words and the point of interest names in the same container;
Comparing each candidate text segmented word with the one or more point of interest names adjacent to it in the sorted order;
When the candidate text segmented word is identical to an adjacent point of interest name, determining that the candidate text segmented word is a point of interest name;
When the candidate text segmented word is not identical to any adjacent point of interest name, determining that the candidate text segmented word is not a point of interest name.
Optionally, the method further comprises:
Dividing one or more point of interest names into the same point of interest name set;
Determining, for each point of interest name set, the string length range of its point of interest names;
Generating, for the point of interest name set, a point of interest name dictionary at least according to the string length range.
Optionally, the step of dividing one or more point of interest names into the same point of interest name set comprises:
Dividing one or more point of interest names that have the same keyword into the same point of interest name set.
Optionally, the step of dividing one or more point of interest names that have the same keyword into the same point of interest name set comprises:
Performing word segmentation on one or more point of interest names to obtain name segmented words;
Setting the name segmented word that is the first word of a point of interest name as its keyword;
Dividing one or more point of interest names that have the same keyword into the same point of interest name category.
Optionally, the step of determining, for each point of interest name set, the string length range of its point of interest names comprises:
Calculating the string length of each point of interest name in each point of interest name set;
Forming the string length range from the shortest string length and the longest string length.
Optionally, the step of generating, for the point of interest name set, a point of interest name dictionary at least according to the string length range comprises:
Generating, for the point of interest name set, the point of interest name dictionary with the keyword as the key and the string length range as the value.
According to another aspect of the present invention, a device for identifying point of interest names based on string length is provided, comprising:
A text string segmentation module, adapted to perform word segmentation on a text string to be identified to obtain text segmented words;
A candidate text segmented word screening module, adapted to screen out, from the text segmented words, candidate text segmented words within a certain string length range;
A point of interest name judging module, adapted to compare the screened candidate text segmented words with the point of interest names corresponding to the string length range, to judge whether the candidate text segmented words are point of interest names.
Optionally, the candidate text segmented word screening module is further adapted to:
Look up a preset point of interest name dictionary, the dictionary having one or more point of interest name sets, where the longest string length and the shortest string length of the point of interest names in a set form the string length range;
Calculate the string length of the text segmented words;
Screen out, from the text segmented words, candidate text segmented words whose string length is within the string length range.
Optionally, the point of interest names in a point of interest name set have the same keyword;
The candidate text segmented word screening module is further adapted to:
Screen out, from the text segmented words, candidate text segmented words that match the keyword and whose string length is within the string length range.
Optionally, the same keyword is the first word of the name.
Optionally, the point of interest name judging module is further adapted to:
Put the candidate text segmented words and the point of interest names in the point of interest name set into the same container;
Compare the candidate text segmented words and the point of interest names in the same container, to judge whether the candidate text segmented words are point of interest names.
Optionally, the point of interest name judging module is further adapted to:
Sort the candidate text segmented words and the point of interest names in the same container;
Compare each candidate text segmented word with the one or more point of interest names adjacent to it in the sorted order;
When the candidate text segmented word is identical to an adjacent point of interest name, determine that the candidate text segmented word is a point of interest name;
When the candidate text segmented word is not identical to any adjacent point of interest name, determine that the candidate text segmented word is not a point of interest name.
Optionally, the device further comprises:
A point of interest name dividing module, adapted to divide one or more point of interest names into the same point of interest name set;
A string length range statistics module, adapted to determine, for each point of interest name set, the string length range of its point of interest names;
A point of interest name dictionary generation module, adapted to generate, for the point of interest name set, a point of interest name dictionary at least according to the string length range.
Optionally, the point of interest name dividing module is further adapted to:
Divide one or more point of interest names that have the same keyword into the same point of interest name set.
Optionally, the point of interest name dividing module is further adapted to:
Perform word segmentation on one or more point of interest names to obtain name segmented words;
Set the name segmented word that is the first word of a point of interest name as its keyword;
Divide one or more point of interest names that have the same keyword into the same point of interest name category.
Optionally, the string length range statistics module is further adapted to:
Calculate the string length of each point of interest name in each point of interest name set;
Form the string length range from the shortest string length and the longest string length.
Optionally, the point of interest name dictionary generation module is further adapted to:
Generate, for the point of interest name set, the point of interest name dictionary with the keyword as the key and the string length range as the value.
The embodiments of the present invention screen out suspected POI name fragments from the text segmented words of the text string to be identified, based on attributes such as string length and keyword, and then compare them with POI names to judge whether each suspected fragment is a real POI name. Because of this preliminary attribute-based screening, the suspected POI name fragments obtained are more targeted and a large number of POI names can be excluded. This greatly reduces the subsequent amount of computation, thereby reducing memory consumption, effectively easing memory constraints and improving lookup efficiency.
The embodiments of the present invention generate the point of interest name dictionary from attributes such as string length range and keyword. Because the data for these attributes is simple, both loading this dictionary and looking up entries in it require very little memory, which ensures that the preliminary screening of suspected POI name fragments is efficient.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flow chart of the steps of method embodiment 1 for identifying point of interest names based on string length according to an embodiment of the present invention;
Fig. 2 shows a flow chart of the steps of method embodiment 2 for identifying point of interest names based on string length according to an embodiment of the present invention;
Fig. 3 shows a structural block diagram of device embodiment 1 for identifying point of interest names based on string length according to an embodiment of the present invention; and
Fig. 4 shows a structural block diagram of device embodiment 2 for identifying point of interest names based on string length according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be conveyed completely to those skilled in the art.
Referring to Fig. 1, a flow chart of the steps of method embodiment 1 for identifying point of interest names based on string length according to an embodiment of the present invention is shown, which may specifically comprise the following steps:
Step 101: perform word segmentation on a text string to be identified to obtain text segmented words;
In embodiments of the present invention, a crawler may follow the links between web pages in advance to fetch and save pages from the internet. The pages fetched by the crawler are stored in a web page database and form a large body of search resources. The text strings in these pages can be mined for POI names, for use in fields such as geographic information services.
In a specific implementation, word segmentation may be performed on the text string to be identified to obtain text segmented words.
In embodiments of the present invention, word segmentation may be performed in one or more of the following ways:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a preset machine dictionary; if a string is found in the dictionary, the match succeeds (a word is identified).
2. Segmentation based on feature scanning or mark cutting: words with obvious features are identified and cut out of the string first, and these words are used as breakpoints to divide the original string into smaller strings, which are then passed to mechanical segmentation, reducing the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: the rich part-of-speech information helps segmentation decisions, and the segmentation result is in turn checked and adjusted during tagging, improving segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a person's understanding of a sentence in order to identify words. The basic idea is to carry out syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. Such a system usually comprises three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words, sentences and so on to resolve segmentation ambiguity; that is, it simulates how a person understands a sentence. This method requires a large amount of linguistic knowledge and information.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters occur next to each other reflects fairly well how likely they are to form a word. The frequency of each combination of adjacent characters in a corpus can therefore be counted and their mutual information computed, together with the adjacent co-occurrence probability of two Chinese characters X and Y. The mutual information reflects how tightly the characters are bound together. When the tightness exceeds a certain threshold, the character group can be considered likely to form a word.
Of course, the above segmentation methods are only examples. When implementing the embodiments of the present invention, other segmentation methods may be set according to the actual situation or adopted by those skilled in the art according to actual needs, and the embodiments of the present invention are not limited in this respect.
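As an illustration of the first, string-matching approach listed above, the following is a minimal sketch of forward maximum matching. The patent does not prescribe this particular strategy; the function name `forward_max_match`, the `max_word_len` parameter and the tiny lexicon are assumptions made for the example.

```python
def forward_max_match(text, lexicon, max_word_len=4):
    """Greedy left-to-right segmentation against a preset lexicon."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking until a lexicon entry matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical lexicon and input.
segmented = forward_max_match("北京大学图书馆", {"北京", "大学", "北京大学", "图书馆"})
```

With this lexicon the longest match wins, so the input is split into "北京大学" and "图书馆" rather than into three shorter words.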
Step 102: screen out, from the text segmented words, candidate text segmented words within a certain string length range;
In embodiments of the present invention, a preliminary screening based on string length may be carried out to exclude part of the invalid data and so reduce the amount of computation.
In a specific implementation, a preset point of interest name dictionary may be looked up.
The point of interest name dictionary has one or more point of interest name sets;
The longest string length and the shortest string length of the point of interest names in a set form the string length range.
The string length of the text segmented words is calculated, and candidate text segmented words whose string length is within the string length range are screened out from the text segmented words.
To exclude further invalid data, the point of interest names in a point of interest name set have the same keyword, and the data can be screened further based on this keyword.
In a specific implementation, the shared keyword may be the first word of the name. It may equally be the last word or any specified word within the name, and the embodiments of the present invention are not limited in this respect.
For example, the point of interest name dictionary may look as follows:
Key      Value
T_1      S_1/L_1
T_2      S_2/L_2
...      ...
T_i      S_i/L_i
...      ...
T_n      S_n/L_n
In this point of interest name dictionary, the keyword is the Key, such as T_n (n is a positive integer; T_n denotes the n-th keyword), and the string length range is the Value, such as S_n/L_n (the n-th string length range), where L_n is the longest string length and S_n is the shortest string length.
Therefore, candidate text segmented words that match the keyword and whose string length is within the string length range can be screened out from the text segmented words.
It should be noted that, besides the keyword, the point of interest names in a point of interest name set may also share other attributes, and the embodiments of the present invention are not limited in this respect.
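The keyword and length screening of step 102 can be sketched as a simple dictionary lookup. This is a hedged illustration only: the dictionary contents, the function name `is_candidate`, and the choice to count length in segmented words (following the dictionary-generation example later in the description) are assumptions.

```python
# Hypothetical dictionary: keyword (first word) -> (shortest, longest) length,
# with length counted in segmented words.
poi_dictionary = {"北京": (2, 3), "上海": (2, 2)}


def is_candidate(words, dictionary):
    """Return True if a fragment (list of segmented words) passes the screening."""
    if not words:
        return False
    length_range = dictionary.get(words[0])  # keyword match on the first word
    if length_range is None:
        return False
    shortest, longest = length_range
    return shortest <= len(words) <= longest


fragments = [["北京", "大学"], ["北京"], ["广州", "塔"], ["上海", "火车", "站"]]
candidates = [f for f in fragments if is_candidate(f, poi_dictionary)]
```

Only the keyword and two integers per keyword are needed for this check, so the full list of POI names does not have to be loaded for the preliminary screening.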
Step 103: compare the screened candidate text segmented words with the point of interest names corresponding to the string length range, to judge whether the candidate text segmented words are point of interest names.
In embodiments of the present invention, the candidate text segmented words from the preliminary screening match certain attributes (such as string length and keyword) of the points of interest in a point of interest name set. They are suspected point of interest names and can be identified further.
In a specific implementation, the candidate text segmented words and the point of interest names in the point of interest name set may be put into the same container (which may be called a bucket), and the candidate text segmented words and point of interest names in the same container are compared to judge whether the candidate text segmented words are point of interest names.
In one example, the candidate text segmented words and point of interest names in the same container may be sorted by some attribute, for example by string length or by pinyin, so that each candidate text segmented word ends up next to the point of interest names with similar attributes.
Each candidate text segmented word is compared with the one or more point of interest names adjacent to it in the sorted order. When the candidate is identical to an adjacent point of interest name, it is determined to be a point of interest name; when it is not identical to any adjacent point of interest name, it is determined not to be a point of interest name.
Of course, the above way of judging is only an example. When implementing the embodiments of the present invention, other ways of judging may be set according to the actual situation or adopted by those skilled in the art according to actual needs, and the embodiments of the present invention are not limited in this respect.
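The sort-and-compare-neighbours judgment described for step 103 might look like the sketch below. It sorts by plain string order rather than by string length or pinyin, which is an assumption made for simplicity, as are the function name `is_poi_by_sorting` and the tagging scheme.

```python
def is_poi_by_sorting(candidate, poi_names):
    """Judge a candidate by comparing it with its neighbours after sorting."""
    # Tag entries so the candidate can be located after sorting; tag 0 places
    # the candidate directly before an identical POI name (tag 1).
    container = [(name, 1) for name in poi_names] + [(candidate, 0)]
    container.sort()
    position = container.index((candidate, 0))
    neighbours = (container[max(position - 1, 0):position]
                  + container[position + 1:position + 2])
    return any(name == candidate for name, tag in neighbours if tag == 1)


# Hypothetical bucket contents.
bucket_names = ["北京站", "北京大学", "北京饭店"]
found = is_poi_by_sorting("北京大学", bucket_names)
missing = is_poi_by_sorting("北京大", bucket_names)
```

Because identical strings sort next to each other, checking only the adjacent entries is enough to decide whether the candidate equals some name in the container.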
To help those skilled in the art better understand the embodiments of the present invention, the identification of point of interest names is further described below through an example of a concrete application scenario:
For a text string C to be identified, word segmentation is performed on C, and each word t obtained is looked up in the point of interest name dictionary. If it is found, the Key corresponding to word t in the point of interest name dictionary is denoted T_t, and the Value corresponding to T_t is S_t/L_t.
All substrings that start with t and whose string length is at least S_t and at most L_t are suspected POI name fragments.
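The extraction of suspected POI name fragments in this scenario can be sketched as follows. The function name `suspected_fragments` and the sample data are hypothetical, and the sketch assumes that substring length is counted in segmented words, consistent with how the dictionary-generation example counts length.

```python
def suspected_fragments(words, dictionary):
    """Collect every run of words that starts with a keyword t and whose
    length lies within that keyword's range S_t..L_t."""
    fragments = []
    for start, word in enumerate(words):
        if word not in dictionary:
            continue
        shortest, longest = dictionary[word]
        for length in range(shortest, longest + 1):
            if start + length > len(words):
                break  # the run would extend past the end of the text
            fragments.append("".join(words[start:start + length]))
    return fragments


# Hypothetical segmented text and dictionary.
words = ["我", "在", "北京", "大学", "门口"]
fragments = suspected_fragments(words, {"北京": (2, 3)})
```

Only words that appear as dictionary keys generate fragments, so far fewer substrings are produced than in an exhaustive n-character scan.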
All extracted suspected POI name fragments and the original POI names are divided into buckets according to the signature of each string. Bucketing means that entries with the same signature are placed together; within each bucket it is then determined whether a suspected POI name fragment really is some POI name.
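A minimal sketch of bucketing by string signature is given below. The patent does not say how the signature is computed; using an MD5 digest reduced modulo a bucket count is purely an assumption for illustration, as are the names `signature` and `match_in_buckets`.

```python
import hashlib
from collections import defaultdict


def signature(text, num_buckets=16):
    """Assumed signature: MD5 digest of the string reduced to a bucket index."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def match_in_buckets(fragments, poi_names, num_buckets=16):
    """Place fragments and POI names in buckets, then compare within each bucket."""
    buckets = defaultdict(lambda: {"fragments": [], "names": set()})
    for name in poi_names:
        buckets[signature(name, num_buckets)]["names"].add(name)
    for fragment in fragments:
        buckets[signature(fragment, num_buckets)]["fragments"].append(fragment)
    confirmed = []
    for bucket in buckets.values():
        confirmed.extend(f for f in bucket["fragments"] if f in bucket["names"])
    return sorted(confirmed)


confirmed = match_in_buckets(["北京大学", "北京大"], ["北京大学", "北京站"])
```

Identical strings always receive the same signature, so a fragment only ever needs to be compared with the names in its own bucket. In a distributed setting each bucket could be processed separately, so that no single worker holds the full name set.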
The embodiments of the present invention screen out suspected POI name fragments from the text segmented words of the text string to be identified, based on attributes such as string length and keyword, and then compare them with POI names to judge whether each suspected fragment is a real POI name. Because of this preliminary attribute-based screening, the suspected POI name fragments obtained are more targeted and a large number of POI names can be excluded. This greatly reduces the subsequent amount of computation, thereby reducing memory consumption, effectively easing memory constraints and improving lookup efficiency.
Referring to Fig. 2, a flow chart of the steps of method embodiment 2 for identifying point of interest names based on string length according to an embodiment of the present invention is shown, which may specifically comprise the following steps:
Step 201: divide one or more point of interest names into the same point of interest name set;
In embodiments of the present invention, one or more point of interest names sharing one or more attributes may be divided into the same point of interest name set, so that the point of interest names are classified.
Taking the keyword as an example, one or more point of interest names with the same keyword may be divided into the same point of interest name set.
Furthermore, if the keyword is the first word, word segmentation may be performed on one or more point of interest names to obtain name segmented words; the name segmented word that is the first word of a point of interest name is set as its keyword, and one or more point of interest names with the same keyword are divided into the same point of interest name category.
In embodiments of the present invention, word segmentation may be performed in one or more of the ways described for step 101: segmentation based on string matching, segmentation based on feature scanning or mark cutting, segmentation based on understanding, and segmentation based on statistics.
Of course, these segmentation methods are only examples. When implementing the embodiments of the present invention, other segmentation methods may be set according to the actual situation or adopted by those skilled in the art according to actual needs, and the embodiments of the present invention are not limited in this respect.
It should be noted that, besides the keyword, the point of interest names in a point of interest name set may also share other attributes, and the embodiments of the present invention are not limited in this respect.
Step 202, for each interest point name set, determine the string length range of the interest point names in it;
In a specific implementation, the string length of each interest point name in each interest point name set can be calculated; from these string lengths, the shortest and the longest are selected, and together they form the string length range.
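A minimal sketch of this statistic follows. It assumes that each interest point name has already been segmented into a list of words and that the string length is the number of words, as in the application scenario described in this document; the function name `length_range` and the sample names are hypothetical.

```python
def length_range(segmented_names):
    """Return (shortest, longest) string length for one interest point name set.

    Assumption: each name is a list of words and its length is the word count.
    """
    lengths = [len(tokens) for tokens in segmented_names]
    return min(lengths), max(lengths)


# One interest point name set whose names share the first word "北京"
poi_set = [
    ["北京", "大学"],
    ["北京", "西", "站"],
    ["北京", "动物园", "南", "门"],
]
shortest, longest = length_range(poi_set)
```

If string length were instead measured in characters, only the `len(tokens)` expression would need to change.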
Step 203, for the interest point name set, generate an interest point name dictionary according to at least the string length range;
In practical applications, for each interest point name set, the interest point name dictionary can be generated with the keyword as the key (Key) and the string length range as the value (Value).
To help those skilled in the art better understand the embodiment of the present invention, the way the interest point name dictionary is generated is further described below through a concrete application scenario:
The given POI names are segmented; the first word of each segmented name is taken as its keyword, and POI names with the same keyword are grouped into one class, forming an interest point name set.
For each interest point name set, the number of words in each segmented POI name (i.e., its string length) is calculated, and the longest and shortest string lengths are found.
The interest point name dictionary is generated with the first word of the POI names (i.e., the keyword) as the Key and the combination of the shortest and longest string lengths for that first word as the Value.
For example, if the segmented POI names have n (n is a positive integer) different first words, denoted T_1, T_2, ..., T_n, and the POI names corresponding to the i-th first word have longest string length L_i and shortest string length S_i, then the format of the interest point name dictionary is as follows:
Key     Value
T_1     S_1/L_1
T_2     S_2/L_2
...     ...
T_i     S_i/L_i
...     ...
T_n     S_n/L_n
In this interest point name dictionary, the keyword serves as the Key, e.g., T_n, and the string length range serves as the Value, e.g., S_n/L_n, where L_n is the longest string length and S_n is the shortest string length.
The embodiment of the present invention generates the interest point name dictionary from attributes such as the string length range and the keyword. Because the data for these attributes is simple, the memory needed to load the dictionary and to perform lookups in it is small, which keeps the preliminary screening of suspected POI name fragments efficient.
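The dictionary generation described above can be sketched as follows. This is an illustrative sketch only: it assumes the POI names are already segmented into word lists, uses the first word as the keyword, and stores the range as a `(shortest, longest)` tuple; the function name `build_poi_dictionary` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_poi_dictionary(segmented_names):
    """Map each first word (Key) to its (shortest, longest) length range (Value)."""
    # Group segmented names by their first word to form interest point name sets
    name_sets = defaultdict(list)
    for tokens in segmented_names:
        name_sets[tokens[0]].append(tokens)

    # Keep only the compact length range per keyword, not the names themselves
    dictionary = {}
    for keyword, names in name_sets.items():
        lengths = [len(tokens) for tokens in names]
        dictionary[keyword] = (min(lengths), max(lengths))
    return dictionary


segmented_names = [
    ["北京", "大学"],
    ["北京", "西", "站"],
    ["上海", "虹桥", "火车", "站"],
]
poi_dictionary = build_poi_dictionary(segmented_names)
```

Because each entry holds only a keyword and two integers, the dictionary stays small even for a large number of POI names, which matches the memory argument made above.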
Step 204, perform word segmentation on the text string to be identified to obtain text tokens;
Step 205, look up the preset interest point name dictionary;
The interest point name dictionary has one or more interest point name sets;
The longest and shortest string lengths of the interest point names in an interest point name set form the string length range;
Step 206, calculate the string length of the text tokens;
Step 207, filter out, from the text tokens, the candidate text tokens whose string length is within the string length range;
Step 208, compare the filtered candidate text tokens with the interest point names corresponding to the string length range, to judge whether the candidate text tokens are interest point names.
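Steps 204 to 208 can be sketched roughly as below. The sketch rests on several assumptions that the text does not spell out: the text string is already segmented, a candidate is a run of consecutive text tokens that starts at a word found in the dictionary, string length is counted in words, and the final comparison is a simple set membership test. All names (`find_poi_names`, `poi_dictionary`, `poi_sets`) are hypothetical.

```python
def find_poi_names(text_tokens, poi_dictionary, poi_sets):
    """Return the interest point names recognised in a segmented text string.

    poi_dictionary: keyword -> (shortest, longest) length range
    poi_sets:       keyword -> set of segmented names stored as tuples
    """
    found = []
    for start, token in enumerate(text_tokens):
        # Step 205: only a token that is a dictionary key can start a candidate
        if token not in poi_dictionary:
            continue
        shortest, longest = poi_dictionary[token]
        # Steps 206-207: keep only candidates whose length lies in the range
        for length in range(shortest, longest + 1):
            end = start + length
            if end > len(text_tokens):
                break
            candidate = tuple(text_tokens[start:end])
            # Step 208: compare the candidate with the names of that set
            if candidate in poi_sets[token]:
                found.append("".join(candidate))
    return found


poi_dictionary = {"北京": (2, 3)}
poi_sets = {"北京": {("北京", "大学"), ("北京", "西", "站")}}
text_tokens = ["我", "在", "北京", "西", "站", "等", "你"]
matches = find_poi_names(text_tokens, poi_dictionary, poi_sets)
```

The length range bounds the inner loop, so tokens that are not keywords and runs that are too short or too long are discarded before any name comparison takes place.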
For simplicity of description, the method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the embodiment of the present invention is not limited by the described order of actions, because according to the embodiment of the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiment of the present invention.
Referring to Fig. 3, a structural block diagram of device embodiment 1 for identifying interest point names based on string length according to an embodiment of the present invention is shown, which may specifically include the following modules:
A text string segmentation module 301, adapted to perform word segmentation on a text string to be identified to obtain text tokens;
A candidate text token screening module 302, adapted to filter out, from the text tokens, candidate text tokens within a certain string length range;
An interest point name judging module 303, adapted to compare the filtered candidate text tokens with the interest point names corresponding to the string length range, to judge whether the candidate text tokens are interest point names.
In an embodiment of the present invention, the candidate text token screening module 302 may be further adapted to:
Look up a preset interest point name dictionary, the interest point name dictionary having one or more interest point name sets, where the longest and shortest string lengths of the interest point names in an interest point name set form the string length range;
Calculate the string length of the text tokens;
Filter out, from the text tokens, the candidate text tokens whose string length is within the string length range.
In an embodiment of the present invention, the interest point names in an interest point name set may have the same keyword;
The candidate text token screening module 302 may be further adapted to:
Filter out, from the text tokens, the candidate text tokens that match the keyword and whose string length is within the string length range.
In a specific implementation, the shared keyword may be the first word.
In an embodiment of the present invention, the interest point name judging module 303 may be further adapted to:
Put the candidate text token and the interest point names of the interest point name set into the same container;
Compare the candidate text token with the interest point names in the same container, to judge whether the candidate text token is an interest point name.
In an embodiment of the present invention, the interest point name judging module 303 may be further adapted to:
Sort the candidate text token and the interest point names in the same container;
Compare the candidate text token with one or more interest point names adjacent to it in the sorted order;
When the candidate text token is identical to an adjacent interest point name in the sorted order, determine that the candidate text token is an interest point name;
When the candidate text token is not identical to any adjacent interest point name in the sorted order, determine that the candidate text token is not an interest point name.
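The container-based comparison can be sketched as follows. This is a sketch under assumptions: names and candidates are plain strings, the container is a Python list, and each entry carries a flag (0 for a stored interest point name, 1 for the candidate) so that the candidate can be located after sorting. The function name `is_poi_name` is hypothetical.

```python
def is_poi_name(candidate, poi_names):
    """Judge a candidate by sorting it together with the stored names.

    After sorting, an identical stored name, if any, sits directly next to the
    candidate, so only the immediate neighbours need to be compared.
    """
    # Put the candidate and the interest point names into the same container
    container = [(name, 0) for name in poi_names] + [(candidate, 1)]
    container.sort()

    index = container.index((candidate, 1))
    # Collect the entries directly before and after the candidate, if present
    neighbours = container[max(index - 1, 0):index] + container[index + 1:index + 2]
    return any(name == candidate for name, _ in neighbours)


poi_names = ["北京大学", "北京西站", "北京南站"]
hit = is_poi_name("北京西站", poi_names)
miss = is_poi_name("北京东站", poi_names)
```

Sorting costs more than a hash lookup for a single candidate, so this layout mainly pays off when many candidates are sorted together with the names in one pass; that batching is not shown here.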
Referring to Fig. 4, a structural block diagram of device embodiment 2 for identifying interest point names based on string length according to an embodiment of the present invention is shown, which may specifically include the following modules:
An interest point name dividing module 401, adapted to divide one or more interest point names into the same interest point name set;
A string length range statistics module 402, adapted to determine, for each interest point name set, the string length range of the interest point names in it;
An interest point name dictionary generation module 403, adapted to generate, for the interest point name set, an interest point name dictionary according to at least the string length range;
A text string segmentation module 404, adapted to perform word segmentation on a text string to be identified to obtain text tokens;
A candidate text token screening module 405, adapted to look up the preset interest point name dictionary, calculate the string length of the text tokens, and filter out, from the text tokens, the candidate text tokens whose string length is within the string length range;
The interest point name dictionary has one or more interest point name sets, and the longest and shortest string lengths of the interest point names in an interest point name set form the string length range;
An interest point name judging module 406, adapted to compare the filtered candidate text tokens with the interest point names corresponding to the string length range, to judge whether the candidate text tokens are interest point names.
In an embodiment of the present invention, the interest point name dividing module 401 may be further adapted to:
Divide one or more interest point names having the same keyword into the same interest point name set.
In an embodiment of the present invention, the interest point name dividing module 401 may be further adapted to:
Perform word segmentation on one or more interest point names to obtain name tokens;
Set the name token that is the first word of the interest point name as the keyword;
Divide one or more interest point names having the same keyword into the same interest point name set.
In an embodiment of the present invention, the string length range statistics module 402 may be further adapted to:
Calculate the string length of the interest point names in each interest point name set;
Form the string length range from the shortest string length and the longest string length.
In an embodiment of the present invention, the interest point name dictionary generation module 403 may be further adapted to:
For the interest point name set, generate the interest point name dictionary with the keyword as the key and the string length range as the value.
Since the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and the above description of a specific language is intended to disclose the best mode of the invention.
Numerous specific details are described in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in the embodiments may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying interest point names based on string length according to the embodiment of the present invention. The present invention may also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.

Claims (10)

1. A method for identifying interest point names based on string length, comprising:
Performing word segmentation on a text string to be identified to obtain text tokens;
Filtering out, from the text tokens, candidate text tokens within a certain string length range;
Comparing the filtered candidate text tokens with the interest point names corresponding to the string length range, to judge whether the candidate text tokens are interest point names.
2. The method of claim 1, characterized in that the step of filtering out, from the text tokens, candidate text tokens within a certain string length range comprises:
Looking up a preset interest point name dictionary, the interest point name dictionary having one or more interest point name sets, where the longest and shortest string lengths of the interest point names in an interest point name set form the string length range;
Calculating the string length of the text tokens;
Filtering out, from the text tokens, the candidate text tokens whose string length is within the string length range.
3. The method of any one of claims 1-2, characterized in that the interest point names in the interest point name set have the same keyword;
The step of filtering out, from the text tokens, the candidate text tokens whose string length is within the string length range comprises:
Filtering out, from the text tokens, the candidate text tokens that match the keyword and whose string length is within the string length range.
4. The method of any one of claims 1-3, characterized in that the same keyword is the first word.
5. The method of any one of claims 1-4, characterized in that the step of comparing the filtered candidate text tokens with the interest point names corresponding to the string length range, to judge whether the candidate text tokens are interest point names, comprises:
Putting the candidate text token and the interest point names of the interest point name set into the same container;
Comparing the candidate text token with the interest point names in the same container, to judge whether the candidate text token is an interest point name.
6. The method of any one of claims 1-5, characterized in that the step of comparing the candidate text token with the interest point names in the same container, to judge whether the candidate text token is an interest point name, comprises:
Sorting the candidate text token and the interest point names in the same container;
Comparing the candidate text token with one or more interest point names adjacent to it in the sorted order;
When the candidate text token is identical to an adjacent interest point name in the sorted order, determining that the candidate text token is an interest point name;
When the candidate text token is not identical to any adjacent interest point name in the sorted order, determining that the candidate text token is not an interest point name.
7. The method of any one of claims 1-6, characterized by further comprising:
Dividing one or more interest point names into the same interest point name set;
Determining, for each interest point name set, the string length range of the interest point names in it;
Generating, for the interest point name set, an interest point name dictionary according to at least the string length range.
8. The method of any one of claims 1-7, characterized in that the step of dividing one or more interest point names into the same interest point name set comprises:
Dividing one or more interest point names having the same keyword into the same interest point name set.
9. The method of any one of claims 1-8, characterized in that the step of dividing one or more interest point names having the same keyword into the same interest point name set comprises:
Performing word segmentation on one or more interest point names to obtain name tokens;
Setting the name token that is the first word of the interest point name as the keyword;
Dividing one or more interest point names having the same keyword into the same interest point name set.
10. A device for identifying interest point names based on string length, comprising:
A text string segmentation module, adapted to perform word segmentation on a text string to be identified to obtain text tokens;
A candidate text token screening module, adapted to filter out, from the text tokens, candidate text tokens within a certain string length range;
An interest point name judging module, adapted to compare the filtered candidate text tokens with the interest point names corresponding to the string length range, to judge whether the candidate text tokens are interest point names.
CN201510921183.5A 2015-12-11 2015-12-11 Method and device for identifying point of interest names based on character length Pending CN105550169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510921183.5A CN105550169A (en) 2015-12-11 2015-12-11 Method and device for identifying point of interest names based on character length

Publications (1)

Publication Number Publication Date
CN105550169A true CN105550169A (en) 2016-05-04

Family

ID=55829358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510921183.5A Pending CN105550169A (en) 2015-12-11 2015-12-11 Method and device for identifying point of interest names based on character length

Country Status (1)

Country Link
CN (1) CN105550169A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063670A (en) * 2018-08-16 2018-12-21 大连民族大学 Block letter language of the Manchus word recognition methods based on prefix grouping
CN109740406A (en) * 2018-08-16 2019-05-10 大连民族大学 Non-division block letter language of the Manchus word recognition methods and identification network
WO2019227581A1 (en) * 2018-05-29 2019-12-05 平安科技(深圳)有限公司 Interest point recognition method, apparatus, terminal device, and storage medium
CN110716992A (en) * 2018-06-27 2020-01-21 百度在线网络技术(北京)有限公司 Method and device for recommending name of point of interest
CN111026937A (en) * 2019-11-13 2020-04-17 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting POI name and computer storage medium
WO2020082890A1 (en) * 2018-10-25 2020-04-30 阿里巴巴集团控股有限公司 Text restoration method and apparatus, and electronic device
CN117332126A (en) * 2023-09-11 2024-01-02 中科驭数(北京)科技有限公司 Character string filtering method, device, acceleration card and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101324439A (en) * 2008-07-29 2008-12-17 江苏华科导航科技有限公司 Navigation apparatus for searching interest point and method for searching interest point
WO2012172160A1 (en) * 2011-06-16 2012-12-20 Nokia Corporation Method and apparatus for resolving geo-identity
CN103955450A (en) * 2014-05-06 2014-07-30 杭州东信北邮信息技术有限公司 Automatic extraction method of new words
CN104537122A (en) * 2015-01-26 2015-04-22 北京奇艺世纪科技有限公司 Keyword determination method and keyword determination device
CN104699835A (en) * 2015-03-31 2015-06-10 北京奇虎科技有限公司 Method and device used for determining webpages including POI (point of interest) data
CN105069079A (en) * 2015-07-31 2015-11-18 北京奇虎科技有限公司 Method and device for screening point of interest POI data
CN105117425A (en) * 2015-07-31 2015-12-02 北京奇虎科技有限公司 Method and apparatus for selecting interest point of POI data
CN105138708A (en) * 2015-09-30 2015-12-09 北京奇虎科技有限公司 Method and device for identifying names of points of interest (POI)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019227581A1 (en) * 2018-05-29 2019-12-05 平安科技(深圳)有限公司 Interest point recognition method, apparatus, terminal device, and storage medium
CN110716992A (en) * 2018-06-27 2020-01-21 百度在线网络技术(北京)有限公司 Method and device for recommending name of point of interest
CN110716992B (en) * 2018-06-27 2022-05-27 百度在线网络技术(北京)有限公司 Method and device for recommending name of point of interest
CN109063670A (en) * 2018-08-16 2018-12-21 大连民族大学 Block letter language of the Manchus word recognition methods based on prefix grouping
CN109740406A (en) * 2018-08-16 2019-05-10 大连民族大学 Non-division block letter language of the Manchus word recognition methods and identification network
CN109740406B (en) * 2018-08-16 2020-09-22 大连民族大学 Non-segmentation printed Manchu word recognition method and recognition network
WO2020082890A1 (en) * 2018-10-25 2020-04-30 阿里巴巴集团控股有限公司 Text restoration method and apparatus, and electronic device
CN111026937A (en) * 2019-11-13 2020-04-17 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting POI name and computer storage medium
US11768892B2 (en) 2019-11-13 2023-09-26 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for extracting name of POI, device and computer storage medium
CN117332126A (en) * 2023-09-11 2024-01-02 中科驭数(北京)科技有限公司 Character string filtering method, device, acceleration card and medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160504
