CN104424177A - Method and device for extracting core words - Google Patents


Info

Publication number
CN104424177A
Authority
CN
China
Prior art keywords
participle
word
core
core word
unknown
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310376577.8A
Other languages
Chinese (zh)
Other versions
CN104424177B (en)
Inventor
彭松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Autonavi Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Autonavi Software Co Ltd filed Critical Autonavi Software Co Ltd
Priority to CN201310376577.8A priority Critical patent/CN104424177B/en
Publication of CN104424177A publication Critical patent/CN104424177A/en
Application granted granted Critical
Publication of CN104424177B publication Critical patent/CN104424177B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

An embodiment of the invention discloses a method and a device for extracting core words, which can extract accurate core words from a query word input by a user and thereby improve query accuracy. The method includes: segmenting the query word using a preset segmentation mode to obtain the segments forming the query word; matching the segments against the words in a core word dictionary and a non-core word dictionary respectively; and, if the segments include a segment matching the core word dictionary and/or a segment matching the non-core word dictionary and also include unknown segments, determining the segments matching the core word dictionary as core words of the query word, and also taking as core words the unknown segments, or the segments obtained by splicing unknown segments, that meet a preset core word length standard. An unknown segment is a segment that matches no word in either the core word dictionary or the non-core word dictionary.

Description

A method and device for extracting core words
Technical field
The present invention relates to the field of text processing, and in particular to a method and device for extracting core words.
Background
In electronic map query applications, a POI query based on the query word input by a user is commonly performed as follows: the query word is first segmented, each segment is then matched against the POI database to obtain multiple query results, and the result that occurs most frequently is taken as the result of the query. This approach has the following technical defect. Segmenting a query word yields multiple segments, but some of them are not core words of the query word (a core word is the smallest complete word unit that accurately expresses the meaning of the query word). If the most frequent result was obtained from such non-core segments, it may not be what the user actually needs, so the query result is inaccurate or wrong. For example, for the query word "Beijing Tongrentang Traditional Chinese Medicine Hospital", segmentation yields "Beijing", "Tongrentang" and "Traditional Chinese Medicine Hospital". After querying with these three segments, "Beijing Tongrentang Pharmacy" is found to occur most frequently and is output as the query result, although the user is actually looking for a traditional Chinese medicine hospital, not a pharmacy.
The prior art queries by the segments of the query word and takes the most frequent result as the final result. In contrast, the present invention extracts the core words of the query word and queries with them. Because a core word is the smallest complete word unit that accurately expresses the meaning of the query word, it accurately expresses the user's query intent, so querying by core words gives more accurate results and improves query accuracy. To extract core words, the query word is first segmented using a preset segmentation mode, and the resulting segments are matched against the words in a preset core word dictionary and a preset non-core word dictionary respectively. The core word dictionary stores known, accurate core words, so any segment that matches it can be taken as a core word of the query word. The non-core word dictionary stores words that have been verified not to be core words, so, apart from the segments that match the core word dictionary, the unknown segments that match neither dictionary are the ones most likely to be core words. Taking as core words the unknown segments, or the segments obtained by splicing unknown segments, that meet a preset core word length standard therefore raises the probability of extracting accurate core words, and querying with those core words gives more accurate results.
Summary of the invention
In view of this, the main purpose of the embodiments of the present invention is to provide a method and device for extracting core words that can extract reasonably accurate core words from the query word input by a user, thereby improving query accuracy.
In a first aspect of the embodiments of the present invention, a method for extracting core words is provided. The method may comprise:
Segmenting a query word using a preset segmentation mode to obtain the segments forming the query word;
Matching the segments of the query word against the words in a preset core word dictionary and a preset non-core word dictionary respectively;
If the segments of the query word include a segment matching the core word dictionary and/or a segment matching the non-core word dictionary, and also include unknown segments, then:
Determining the segments matching the core word dictionary as core words of the query word; and
Taking as core words of the query word the unknown segments, or the segments obtained by splicing unknown segments, that meet a preset core word length standard, where an unknown segment is a segment that matches no word in either the core word dictionary or the non-core word dictionary.
In a second aspect of the embodiments of the present invention, a device for extracting core words is provided. The device may comprise:
A segmentation unit, configured to segment a query word using a preset segmentation mode to obtain the segments forming the query word;
A segment matching unit, configured to match the segments of the query word against the words in a preset core word dictionary and a preset non-core word dictionary respectively;
A first core word extraction unit, configured to, if the segments of the query word include a segment matching the core word dictionary and/or a segment matching the non-core word dictionary, and also include unknown segments: determine the segments matching the core word dictionary as core words of the query word; and take as core words of the query word the unknown segments, or the segments obtained by splicing unknown segments, that meet a preset core word length standard, where an unknown segment is a segment that matches no word in either the core word dictionary or the non-core word dictionary.
As can be seen, the present invention has the following beneficial effect: core words obtained from dictionary matches and from length-qualified unknown segments express the user's query intent more accurately than raw segments, which improves query accuracy.
Brief description of the drawings
Fig. 1 is a first flowchart of the method for extracting core words provided by an embodiment of the present invention;
Fig. 2 is a second flowchart of the method for extracting core words provided by an embodiment of the present invention;
Fig. 3 is a first structural diagram of the device for extracting core words provided by an embodiment of the present invention;
Fig. 4 is a second structural diagram of the device for extracting core words provided by an embodiment of the present invention;
Fig. 5 is a third structural diagram of the device for extracting core words provided by an embodiment of the present invention;
Fig. 6 is a fourth structural diagram of the device for extracting core words provided by an embodiment of the present invention;
Fig. 7 is a fifth structural diagram of the device for extracting core words provided by an embodiment of the present invention.
Detailed description
To make the above objects, features and advantages of the present invention easier to understand, the embodiments of the present invention are described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flowchart of a method for extracting core words provided by an embodiment of the present invention is shown. The method can be applied to map search, nearby (perimeter) search, and any other application scenario in which a query word is input to perform a query. A core word dictionary for storing known core words and a non-core word dictionary for storing known non-core words can be configured in advance. The method comprises:
S110, segmenting a query word using a preset segmentation mode to obtain the segments forming the query word;
The preset segmentation mode may be, for example, a basic segmentation mode or a hybrid segmentation mode; the present invention does not limit it. To make the embodiments easier to understand, the two modes are briefly introduced below:
In the basic segmentation mode, the query word is matched against a basic dictionary of basic Chinese units and split at the matched words to obtain the segments. The basic dictionary contains the basic Chinese units that can form words, which may also be single characters. For example, "China Mobile Online Business Hall" is split in the basic segmentation mode into "China", "Mobile", "Online", "Business Hall".
In the hybrid segmentation mode, the query word is first matched against the basic dictionary and split at the matched words to obtain basic segments; the various combinations of the basic segments are then matched against an extended dictionary of compound words, and the query word is split at the matched words to obtain the final segments. The extended dictionary contains combinations of basic-dictionary words that can be joined into a single word. For example, "China Mobile Online Business Hall" is split in the hybrid segmentation mode into "China Mobile", "Online", "Business Hall". Compared with the basic segmentation mode, the hybrid segmentation mode yields fewer segments, each carrying more detailed information. For this reason, the hybrid segmentation mode is preferred as the first preset segmentation mode in the embodiments of the present invention.
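The two segmentation modes can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how the query word is matched against the dictionaries, so greedy longest-match is assumed here, and the function names, the dictionary contents, and the Chinese strings (a back-translation of the "China Mobile Online Business Hall" example) are all assumptions.

```python
def basic_segment(query, basic_dict):
    """Greedy longest-match split (an assumed matching strategy).

    Characters that match no dictionary word become single-character segments.
    """
    max_len = max(map(len, basic_dict), default=1)
    segments, i = [], 0
    while i < len(query):
        for size in range(min(max_len, len(query) - i), 0, -1):
            piece = query[i:i + size]
            if size == 1 or piece in basic_dict:
                segments.append(piece)
                i += size
                break
    return segments


def hybrid_segment(query, basic_dict, extended_dict):
    """Basic segmentation, then merge adjacent basic segments found in extended_dict."""
    basic = basic_segment(query, basic_dict)
    merged, i = [], 0
    while i < len(basic):
        # Try the longest run of adjacent basic segments first.
        for j in range(len(basic), i, -1):
            candidate = "".join(basic[i:j])
            if j - i == 1 or candidate in extended_dict:
                merged.append(candidate)
                i = j
                break
    return merged


# Hypothetical dictionaries for the example query.
BASIC = {"中国", "移动", "网上", "营业厅"}
EXTENDED = {"中国移动"}
QUERY = "中国移动网上营业厅"

print(basic_segment(QUERY, BASIC))             # four basic segments
print(hybrid_segment(QUERY, BASIC, EXTENDED))  # three segments, "中国移动" merged
```

The hybrid result has one segment fewer than the basic result, which matches the observation that hybrid segmentation yields fewer, more informative segments.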
S120, matching the segments of the query word against the words in the core word dictionary and the non-core word dictionary respectively;
S130, if the segments of the query word include a segment matching the core word dictionary and/or a segment matching the non-core word dictionary, and also include unknown segments, then:
Determining the segments matching the core word dictionary as core words of the query word; and taking as core words of the query word the unknown segments, or the segments obtained by splicing unknown segments, that meet a preset core word length standard, where an unknown segment is a segment that matches no word in either the core word dictionary or the non-core word dictionary.
It should be noted that a query word may have only one core word or several.
Specifically, taking as core words of the query word the unknown segments, or the segments obtained by splicing unknown segments, that meet the preset core word length standard can be implemented, for example, as follows:
If there are consecutive unknown segments, the consecutive unknown segments can be spliced into one segment, and the spliced segment is determined as a core word of the query word if its length is within a preset second length range (the second length range can be 4 to 12 bytes, i.e. 2 to 6 Chinese characters);
If there is a non-consecutive unknown segment, it is determined as a core word of the query word if its length is within the preset second length range.
It should be noted that consecutive unknown segments are at least two unknown segments that occupy adjacent positions in the query word, and a non-consecutive unknown segment is an unknown segment whose preceding and following segments in the query word are both not unknown segments.
For example, the query word is "China Minsheng Bank Wangjing self-service ATM", and segmenting it yields "China", "Minsheng", "Bank", "self-service", "ATM", where "China" and "Bank" match the non-core word dictionary and "Minsheng", "self-service" and "ATM" are unknown segments. By the rules above, the segment "China" before "Minsheng" and the segment "Bank" after it are both not unknown segments, so "Minsheng" is a non-consecutive unknown segment; its length is within the preset second length range, so "Minsheng" is confirmed as a core word of the query word. "self-service" and "ATM" are two adjacent, consecutive unknown segments, so they are spliced into one segment, "self-service ATM"; the length of the spliced segment is within the second length range, so "self-service ATM" is also determined as a core word of the query word.
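The handling of unknown segments described above can be sketched as follows. This is a minimal sketch under stated assumptions: length is measured in GBK bytes (the text gives the range as 4 to 12 bytes, i.e. 2 to 6 Chinese characters, without naming an encoding), the function names are hypothetical, and the Chinese segments are a back-translation of the Minsheng example, restricted to the five segments the text lists.

```python
def gbk_len(text):
    """Length in bytes under GBK (assumed encoding): 2 per Chinese character, 1 per ASCII."""
    return len(text.encode("gbk"))


def core_words_from_unknown(segments, is_unknown, min_bytes=4, max_bytes=12):
    """Splice each run of adjacent unknown segments and keep it if its length qualifies.

    A run of length one is a non-consecutive unknown segment; the same
    (second) length range is applied to both kinds of candidates.
    """
    core, run = [], []
    for seg in list(segments) + [None]:  # None is a sentinel that flushes the last run
        if seg is not None and is_unknown(seg):
            run.append(seg)
            continue
        if run:
            candidate = "".join(run)
            if min_bytes <= gbk_len(candidate) <= max_bytes:
                core.append(candidate)
            run = []
    return core


# Hypothetical dictionaries: "中国" and "银行" are known non-core words.
CORE_DICT = set()
NON_CORE_DICT = {"中国", "银行"}
SEGMENTS = ["中国", "民生", "银行", "自助", "ATM"]


def is_unknown(seg):
    return seg not in CORE_DICT and seg not in NON_CORE_DICT


print(core_words_from_unknown(SEGMENTS, is_unknown))
```

With these inputs, "民生" is kept as a non-consecutive unknown segment of 4 bytes and "自助" plus "ATM" are spliced into a 7-byte candidate, mirroring the two core words found in the example.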
In the method for extracting core words provided by the embodiments of the present invention, the core word dictionary stores known, accurate core words and the non-core word dictionary stores known non-core words. Taking the segments that match the core word dictionary as core words of the query word is therefore highly accurate, and an unknown segment, because it is not a word in the non-core word dictionary, is very likely to be a core word. For this reason, S130 of the method flow shown in Fig. 1 makes a further judgment on whether each unknown segment is a core word. The method shown in Fig. 1 can extract accurate core words in at least the following three cases:
Case 1: the segments forming the query word include only segments matching the core word dictionary and unknown segments. In this case, the solution provided by the embodiments of the present invention is to determine the segments matching the core word dictionary as core words, and also to extract as core words the unknown segments, or the segments obtained by splicing unknown segments, that meet the core word length standard;
Case 2: the segments forming the query word include only segments matching the non-core word dictionary and unknown segments. In this case, the solution is to extract as core words the unknown segments, or the segments obtained by splicing unknown segments, that meet the core word length standard;
Case 3: the segments forming the query word include segments matching the core word dictionary, segments matching the non-core word dictionary, and unknown segments. In this case, the solution is to determine the segments matching the core word dictionary as core words, and also to extract as core words the unknown segments, or the segments obtained by splicing unknown segments, that meet the core word length standard.
As can be seen, applying the embodiments of the present invention makes it possible to extract accurate core words and thereby improve query accuracy.
For example, the embodiments of the present invention can be applied in map search, perimeter query, and any other scenario in which a query word is input to perform a query. If the query word has core words, they can be used as keywords to query the point of interest (such as a destination address) corresponding to the query word, thereby improving query accuracy.
Besides the three cases above, some other cases also arise in practice, including:
Case 4: all segments forming the query word match the non-core word dictionary. In this case, the solution provided by the embodiments of the present invention is: find the segments of the query word that are administrative region names; for each such segment, judge whether the adjacent following segment is an administrative region name; if not, splice the administrative region segment and its adjacent following segment into one segment; then take the other segments of the query word together with the spliced segment as the new segments of the query word, and re-execute S120 on the new segments.
Case 5: all segments forming the query word are unknown segments. In this case, the solution is: judge whether the length of the query word is within a preset first length range (the first length range can be 4 to 12 bytes, i.e. 2 to 6 Chinese characters), and if so, determine the query word itself as its core word.
Case 6: all segments forming the query word match the core word dictionary. In this case, the solution is to take the segments matching the core word dictionary as the core words of the query word.
Case 7: some segments forming the query word match the core word dictionary and the remaining segments match the non-core word dictionary. In this case, the solution is the same as for case 6 and is not repeated here.
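The seven cases depend only on which kinds of segments are present, so they can be told apart with a small classifier. The sketch below is an illustration under assumptions: the label strings "core", "non_core" and "unknown", the function name, and the action summaries are hypothetical, not terms from the patent.

```python
def classify_case(labels):
    """Map the per-segment match labels of a query word to case numbers 1 to 7."""
    kinds = set(labels)
    if not kinds:
        raise ValueError("a query word must have at least one segment")
    has_core = "core" in kinds
    has_non_core = "non_core" in kinds
    if "unknown" in kinds:
        if has_core and has_non_core:
            return 3
        if has_core:
            return 1
        if has_non_core:
            return 2
        return 5  # every segment is unknown
    if has_core and has_non_core:
        return 7
    if has_core:
        return 6
    return 4  # every segment matches the non-core word dictionary


# Short reminders of the action each case triggers, paraphrased from the text.
ACTIONS = {
    1: "core-dictionary matches + length-qualified unknown segments",
    2: "length-qualified unknown segments",
    3: "core-dictionary matches + length-qualified unknown segments",
    4: "splice region name with following segment, then match again",
    5: "whole query word, if its length is in the first length range",
    6: "core-dictionary matches",
    7: "core-dictionary matches",
}

labels = ["non_core", "unknown", "non_core", "unknown", "unknown"]
print(classify_case(labels), ACTIONS[classify_case(labels)])
```

Cases 1 to 3 share the unknown-segment handling of S130, which is why the first branch tests for unknown segments before anything else.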
To help those skilled in the art understand the technical solution of the present invention more clearly, the technical solution is described in detail below with a detailed flowchart. Referring to Fig. 2, the method comprises:
S210, segmenting a query word using a preset segmentation mode to obtain the segments forming the query word;
S220, matching the segments of the query word against the words in the core word dictionary and the non-core word dictionary respectively;
The core word dictionary and the non-core word dictionary can be obtained in advance by sorting a large volume of words with software or manually. The core words stored in the core word dictionary in the embodiments of the present invention have the following characteristic: a core word is an entity name formed by splicing a province, city or district name with a name that is not a province, city or district name, such as "Bank of China", "Haikou Electrical Appliances", "Peking University" or "China Mobile". When compiling the core word dictionary, province, city and district names can be treated as one part of a core word: software first automatically appends other nouns to the province, city and district names, and the known, accurate core words are then obtained by manual screening. The non-core word dictionary in this embodiment may comprise a high-frequency dictionary, a venue type dictionary, or an administrative region name dictionary, where:
The high-frequency dictionary can be obtained in advance as follows: for each city, segment the names of all POIs in the POI database of that city, count the frequency of every segment, take the segments whose frequency exceeds a preset frequency threshold as high-frequency words, and add them to the preset high-frequency dictionary. Each high-frequency word is stored in the high-frequency dictionary in the form (keyword, adcode, citycode, frequency), where keyword is the high-frequency word, adcode is the administrative region code, and citycode is the telephone area code of the administrative region. For example, if analysis of the POI database of Beijing finds that "agency" is a high-frequency word, it is stored in the high-frequency dictionary as: agency + Beijing + 010 + frequency. If analysis of the POI database of Shanghai finds that "agency" is a high-frequency word, it is stored as: agency + Shanghai + 021 + frequency.
The venue type dictionary can be obtained in advance by manual sorting, and the words it contains can be "catering", "hotel", "bank", "parking lot", "shopping mall", "supermarket" and the like;
The administrative region name dictionary can be obtained in advance by manual sorting, and the administrative region names it contains can be the names of provinces, cities, districts, townships, villages, roads and the like.
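Building the high-frequency dictionary for one city can be sketched as follows. This is a minimal sketch with hypothetical data: the POI names, the whitespace-based stand-in segmenter, the threshold value, and the function name are assumptions, and the adcode is shown as the city name only because the text's own example does so.

```python
from collections import Counter


def build_high_frequency_entries(poi_names, segment, adcode, citycode, threshold):
    """Return (keyword, adcode, citycode, frequency) records for one city.

    A segment becomes a high-frequency word when its count over all POI
    names of the city is greater than the threshold.
    """
    counts = Counter(seg for name in poi_names for seg in segment(name))
    return sorted(
        (keyword, adcode, citycode, freq)
        for keyword, freq in counts.items()
        if freq > threshold
    )


# Hypothetical POI names for one city; str.split stands in for a real segmenter.
BEIJING_POIS = ["Sunrise Agency", "Harbor Agency", "Lotus Cafe"]
entries = build_high_frequency_entries(
    BEIJING_POIS, str.split, adcode="Beijing", citycode="010", threshold=1
)
print(entries)
```

Running the same function once per city yields separate records for the same keyword, as in the Beijing and Shanghai "agency" entries.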
Preferably, to make it quick and intuitive to see how each segment matches the core word dictionary and the non-core word dictionary, segments can be marked during matching: if a segment matches a word in the core word dictionary or the non-core word dictionary, the segment is given the mark of the dictionary it matches; if a segment matches no word in either dictionary, it is marked as unknown. The matching result can then be looked up quickly from the mark. For example, a segment matching the core word dictionary is marked 4, a segment matching the administrative region name dictionary is marked 3, a segment matching the venue type dictionary is marked 2, a segment matching the high-frequency dictionary is marked 1, and all other segments are marked 0. Segmenting the POI "Haikou Electrical Appliances Co., Ltd." yields "Haikou", "Electrical Appliances", "Co., Ltd."; after the segments are matched against the core word dictionary, high-frequency dictionary, venue type dictionary and administrative region name dictionary and marked accordingly, the result is Haikou (3), Electrical Appliances (2), Co., Ltd. (2).
Preferably, to avoid the situation where the same segment could be placed in several dictionaries at once when the core word dictionary and non-core word dictionary are built, the embodiments of the present invention preset a priority for the four dictionaries, for example: core word dictionary > administrative region name dictionary > venue type dictionary > high-frequency dictionary. When a segment could be placed in several dictionaries, it is stored in the one with the highest priority. For example, if "shopping mall" is both a venue type word and a high-frequency word, it is added to the higher-priority venue type dictionary.
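The marking scheme and the dictionary priority can be sketched together. The mark values 4, 3, 2, 1 and 0 come from the text; the constant and function names and the dictionary contents are hypothetical. One design difference is worth noting: the text resolves overlaps when the dictionaries are built, by storing a word only in the highest-priority dictionary, whereas this sketch resolves them at lookup time by checking dictionaries in priority order, which gives the same mark.

```python
CORE, REGION, VENUE, HIGH_FREQ, UNKNOWN = 4, 3, 2, 1, 0


def mark_segment(seg, core, region, venue, high_freq):
    """Return the mark of the highest-priority dictionary containing seg, else 0."""
    for mark, dictionary in (
        (CORE, core),
        (REGION, region),
        (VENUE, venue),
        (HIGH_FREQ, high_freq),
    ):
        if seg in dictionary:
            return mark
    return UNKNOWN


# Hypothetical dictionary contents chosen to reproduce the example marks.
core_words = {"Haikou Electrical Appliances"}
region_names = {"Haikou"}
venue_types = {"Electrical Appliances", "Co., Ltd.", "shopping mall"}
high_freq_words = {"shopping mall", "agency"}

segments = ["Haikou", "Electrical Appliances", "Co., Ltd."]
marks = [
    mark_segment(s, core_words, region_names, venue_types, high_freq_words)
    for s in segments
]
print(list(zip(segments, marks)))
```

A word present in both the venue type and high-frequency sets, such as "shopping mall" here, receives mark 2 because the venue type dictionary is checked first.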
S230, if the segments of the query word include a segment matching the core word dictionary and/or a segment matching the non-core word dictionary, and also include unknown segments, then:
S230.1, if there are consecutive unknown segments, splicing the consecutive unknown segments into one segment, and judging whether the length of the spliced segment is within the preset second length range;
S230.2, if so, determining the spliced segment as a core word of the query word; if not, determining that the spliced segment is not a core word of the query word;
S230.3, if there is a non-consecutive unknown segment, judging whether its length is within the preset second length range;
S230.4, if so, determining the non-consecutive unknown segment as a core word of the query word; if not, determining that the non-consecutive unknown segment is not a core word of the query word;
S230.5, determining the segments matching the core word dictionary as core words of the query word;
S240, if all segments of the query word are unknown segments, then:
S240.1, judging whether the length of the query word is within the preset first length range;
S240.2, if so, determining the query word itself as its core word; if not, determining that the query word has no core word, or re-segmenting the query word using another preset segmentation mode and repeating S220;
For example, no segment of the query word "Tiananmen" matches a word in the core word dictionary or the non-core word dictionary, and its length is within the preset first length range (e.g. 2 to 6 Chinese characters), so the query word "Tiananmen" itself can be determined as the core word.
S250, if no segment of the query word matches a word in the core word dictionary, but every segment matches a word in the non-core word dictionary, then:
S250.1, finding the segments of the query word that are administrative region names;
For example, the segments that match words in the administrative region name dictionary can be looked up.
S250.2, judging whether the segment adjacent to and following an administrative region segment is itself an administrative region name;
S250.3, if not, splicing the administrative region segment and its adjacent following segment into one segment; if so, doing nothing;
For example, segmenting the query word "Haikou Electrical Appliances Co., Ltd." yields "Haikou", "Electrical Appliances", "Co., Ltd.". If no segment matches a word in the core word dictionary but every segment matches a word in the non-core word dictionary, the segment "Haikou" is found to be an administrative region name. The segment following "Haikou" is "Electrical Appliances", which is not an administrative region name, so "Haikou" and "Electrical Appliances" can be spliced into one segment, "Haikou Electrical Appliances".
S250.4, taking the other segments of the query word together with the spliced segment as the new segments of the query word, and re-executing S220 on the new segments.
It should be noted that the other segments of the query word mentioned in S250.4 are the segments obtained by segmentation other than the administrative region segment and its adjacent following segment.
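The splicing in S250.1 to S250.4 can be sketched as a single pass over the segments. This is a minimal sketch: the function name and the region name set are hypothetical, the Chinese strings are a back-translation of the Haikou example, and the result is only the new segment list, which would then be matched against the dictionaries again as in S220.

```python
def splice_region_with_next(segments, region_names):
    """Merge each administrative region segment with a following non-region segment."""
    result, i = [], 0
    while i < len(segments):
        seg = segments[i]
        nxt = segments[i + 1] if i + 1 < len(segments) else None
        if seg in region_names and nxt is not None and nxt not in region_names:
            result.append(seg + nxt)  # spliced segment, a candidate core word
            i += 2
        else:
            result.append(seg)  # kept unchanged as one of the "other segments"
            i += 1
    return result


REGION_NAMES = {"海口", "北京", "海淀"}  # hypothetical administrative region dictionary
segments = ["海口", "电器", "有限公司"]
print(splice_region_with_next(segments, REGION_NAMES))
```

A region name followed by another region name is left alone, so in a sequence such as region, region, noun only the second region name is spliced with the noun.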
S260, if every segment of the query word matches a word in the core word dictionary, or some segments of the query word match the core word dictionary and the remaining segments match the non-core word dictionary, then:
S260.1, taking the segments that match words in the core word dictionary as the core words of the query word.
The inventor has found that, in practice, if no segment of the query word matches a word in the core word dictionary or the non-core word dictionary, the query word itself can be determined to be a core word when its length is within a certain range; and if every segment of the query word matches a word in the core word dictionary, or some segments match words in the core word dictionary and the remaining segments match words in the non-core word dictionary, the segments matching the core word dictionary are the core words of the query word. The inventor has also found that an administrative region name spliced with the adjacent following segment that is not an administrative region name has a high probability of being a core word. Therefore, when no segment of the query word matches a word in the core word dictionary but every segment matches a word in the non-core word dictionary, this embodiment splices each administrative region segment with its adjacent following segment, takes the other segments of the query word together with the spliced segment as the new segments of the query word, and re-executes on the new segments the step of matching the segments against the words in the core word dictionary and the non-core word dictionary, thereby raising the probability of extracting accurate core words.
In addition, when applying the above embodiment does not yield a core word, it is possible to switch to the next preset segmentation mode and apply the embodiment again to extract core words, which raises the probability of extracting one. For example, if the preset segmentation modes comprise the basic segmentation mode and the hybrid segmentation mode, preferably the hybrid segmentation mode is used first to segment the query word, and core word extraction is performed on the resulting segments. If no core word is extracted, the preset basic segmentation mode is used to segment the query word again, and the subsequent core word extraction flow is carried out.
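Switching to the next preset segmentation mode when nothing is extracted amounts to a loop over segmenters in order of preference. The sketch below uses hypothetical stand-ins: `extract_with_fallback`, the two toy segmenters, and the toy extraction function are assumptions for illustration, with a hybrid-style segmenter tried before a basic-style one as the text prefers.

```python
def extract_with_fallback(query, segmenters, extract):
    """Try each segmentation mode in order until core word extraction succeeds."""
    for segmenter in segmenters:
        core_words = extract(segmenter(query))
        if core_words:
            return core_words
    return []  # no preset segmentation mode produced a core word


# Toy stand-ins: the "hybrid" segmenter keeps the query whole, the "basic" one splits it.
def toy_hybrid(query):
    return [query]


def toy_basic(query):
    return query.split()


KNOWN_CORE = {"Minsheng"}  # hypothetical core word dictionary


def toy_extract(segments):
    return [seg for seg in segments if seg in KNOWN_CORE]


print(extract_with_fallback("China Minsheng Bank", [toy_hybrid, toy_basic], toy_extract))
```

In this toy run the first mode finds nothing, so the second mode is tried and succeeds; a real extraction function would apply the full flow from S220 onward.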
Referring to Fig. 3, which is a schematic structural diagram of a device for extracting core words provided by an embodiment of the present invention, the device can be configured in any equipment that needs to perform queries by core word, such as equipment for map search or input prompting. As shown in the figure, the device may comprise:
A participle unit 310, configured to segment a query word using a preset participle mode to obtain the participles that make up the query word;
A participle matching unit 320, configured to match the participles of the query word against the words in a preset core word dictionary and a preset non-core word dictionary, respectively;
A first core word extracting unit 330, configured to, if the participles of the query word include a participle that matches the core word dictionary and/or a participle that matches the non-core word dictionary, and also include an unknown participle: determine the participle that matches the core word dictionary to be a core word of the query word; and take an unknown participle, or a participle obtained by splicing unknown participles, that meets a preset core word length standard as a core word of the query word, where an unknown participle is a participle that matches no word in either the core word dictionary or the non-core word dictionary.
When taking an unknown participle, or a participle obtained by splicing unknown participles, that meets the preset core word length standard as a core word of the query word, the first core word extracting unit 330 is specifically configured to: if there are consecutive unknown participles, determine the spliced participle whose length is within a second preset length range to be a core word of the query word; and if there are non-consecutive unknown participles, determine each non-consecutive unknown participle whose length is within the second preset length range to be a core word of the query word.
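The handling of unknown participles by the first core word extracting unit can be sketched as below. This is an assumption-laden illustration: the function name `core_words_from_unknown`, the single `known_words` set (standing for the union of both dictionaries), and measuring length in characters are choices made for the example, and the bounds of the second preset length range are not given in the source.

```python
from itertools import groupby


def core_words_from_unknown(segments, known_words, min_len, max_len):
    """Return unknown participles, spliced where consecutive, whose length
    falls inside the (assumed inclusive) range [min_len, max_len]."""
    core_words = []
    # groupby yields maximal runs of consecutive known / unknown participles.
    for is_unknown, run in groupby(segments, key=lambda s: s not in known_words):
        if not is_unknown:
            continue
        # A run of one participle is a non-consecutive unknown participle;
        # a longer run is spliced into a single participle.
        candidate = "".join(run)
        if min_len <= len(candidate) <= max_len:
            core_words.append(candidate)
    return core_words


candidates = core_words_from_unknown(["a", "bc", "x", "de"], {"x"}, 2, 4)
```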
With the device for extracting core words provided by the embodiment of the present invention, the core word dictionary stores known, accurate core words, and the non-core word dictionary stores known non-core words. Taking a participle that matches the core word dictionary as a core word of the query word is therefore relatively accurate. An unknown participle, since it is not a word in the non-core word dictionary, is quite likely to be a core word, so the first core word extracting unit 330 in the device shown in Fig. 3 further judges whether an unknown participle is a core word. The device provided by the embodiment of the present invention can thus extract accurate core words and correspondingly improve query accuracy.
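The three-way matching that underlies this reasoning can be sketched as a simple classification. The function name `classify_segments` and the use of Python sets as dictionaries are assumptions; giving the core word dictionary precedence when a word appears in both is also an assumption, since the source does not address that case.

```python
def classify_segments(segments, core_dict, non_core_dict):
    """Split participles into (core, non_core, unknown) lists, keeping order."""
    core, non_core, unknown = [], [], []
    for seg in segments:
        if seg in core_dict:
            core.append(seg)        # known core word
        elif seg in non_core_dict:
            non_core.append(seg)    # known non-core word
        else:
            unknown.append(seg)     # matches neither dictionary
    return core, non_core, unknown


core, non_core, unknown = classify_segments(
    ["nearby", "hospital", "xyz"], {"hospital"}, {"nearby"}
)
```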
Preferably, for the case where none of the participles of the query word matches a word in the core word dictionary but all of them match words in the non-core word dictionary, the device described in the embodiment of the present invention may further comprise a first participle recomposition unit 340, as shown in Fig. 4.
A first participle recomposition unit 340, configured to, if none of the participles of the query word matches a word in the core word dictionary but all of them match words in the non-core word dictionary: find, among the participles of the query word, a participle that is an administrative region name; judge whether the participle immediately following it is an administrative region name; if not, splice the administrative region name participle and the participle immediately following it into one participle; and take the other participles of the query word together with the spliced participle as the new participles of the query word, and trigger the participle matching unit 320 for the new participles.
Preferably, for the case where all participles of the query word are unknown participles, the device of Fig. 3 or Fig. 4 may further comprise a second core word extracting unit 350. Fig. 5 shows the device of Fig. 3 further comprising the second core word extracting unit 350:
A second core word extracting unit 350, configured to, if all participles of the query word are unknown participles: judge whether the length of the query word is within a first preset length range, and if so, determine the query word itself to be the core word of the query word.
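The check performed by the second core word extracting unit can be sketched as follows. The function name `whole_query_as_core_word`, the inclusive bounds, and measuring length in characters are assumptions; the source does not give the first preset length range.

```python
def whole_query_as_core_word(query, segments, core_dict, non_core_dict,
                             min_len, max_len):
    """Return [query] when every participle is unknown and the query length
    lies in the (assumed inclusive) range [min_len, max_len]; otherwise []."""
    all_unknown = all(
        seg not in core_dict and seg not in non_core_dict for seg in segments
    )
    # Guard against an empty participle list, for which all() is trivially True.
    if segments and all_unknown and min_len <= len(query) <= max_len:
        return [query]
    return []


result = whole_query_as_core_word("abcd", ["ab", "cd"], {"x"}, {"y"}, 2, 6)
```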
Preferably, the device shown in Fig. 3, Fig. 4, or Fig. 5 may further comprise a third core word extracting unit 360. Fig. 6 shows the device of Fig. 3 further comprising the third core word extracting unit 360:
A third core word extracting unit 360, configured to, if all participles of the query word match words in the core word dictionary, or if some participles of the query word match words in the core word dictionary and the remaining participles match words in the non-core word dictionary: take the participles that match words in the core word dictionary as the core words of the query word.
The technical solution of the present invention can be applied in any application scenario in which a query word is input to perform a query, such as map search and perimeter query. The aforementioned device in the embodiment of the present invention may further comprise a query unit 370. Fig. 7 shows the device of Fig. 4 further comprising the second core word extracting unit 350 and the query unit 370:
A query unit 370, configured to, if the query word has a core word, use the core word of the query word as a keyword to query the points of interest corresponding to the query word.
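A point-of-interest lookup keyed on core words might look like the sketch below. The source does not say how the query is executed, so everything here is an assumption for illustration: the function name `query_pois`, the in-memory list of names, and substring containment as the matching rule.

```python
def query_pois(core_words, poi_names):
    """Return the POI names that contain at least one core word.

    Substring containment is an assumed matching rule; a real system would
    more likely use an inverted index or a search engine.
    """
    if not core_words:
        return []  # no core word was extracted, so no keyword query is issued
    return [name for name in poi_names if any(w in name for w in core_words)]


pois = ["People's Hospital", "Central Park", "Children's Hospital"]
matches = query_pois(["Hospital"], pois)
```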
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising ..." does not exclude the existence of other identical elements in the process, method, article, or device that comprises the element.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for extracting core words, characterized by comprising:
Segmenting a query word using a preset participle mode to obtain the participles that make up the query word;
Matching the participles of the query word against the words in a preset core word dictionary and a preset non-core word dictionary, respectively;
If the participles of the query word include a participle that matches the core word dictionary and/or a participle that matches the non-core word dictionary, and also include an unknown participle, then:
Determining the participle that matches the core word dictionary to be a core word of the query word; and
Taking an unknown participle, or a participle obtained by splicing unknown participles, that meets a preset core word length standard as a core word of the query word, wherein an unknown participle is a participle that matches no word in either the core word dictionary or the non-core word dictionary.
2. The method according to claim 1, characterized in that, if all participles of the query word match words in the non-core word dictionary, the method further comprises:
Finding, among the participles of the query word, a participle that is an administrative region name;
Judging whether the participle immediately following the administrative region name participle is an administrative region name;
If not, splicing the administrative region name participle and the participle immediately following it into one participle;
Taking the other participles of the query word together with the spliced participle as the new participles of the query word, and, for the new participles of the query word, re-executing the step of matching the participles of the query word against the words in the core word dictionary and the non-core word dictionary, respectively.
3. The method according to claim 1, characterized in that, if all participles of the query word are unknown participles, the method further comprises:
Judging whether the length of the query word is within a first preset length range, and if so, determining the query word to be the core word of the query word.
4. The method according to any one of claims 1 to 3, characterized in that taking an unknown participle, or a participle obtained by splicing unknown participles, that meets the preset core word length standard as a core word of the query word specifically comprises:
If there are consecutive unknown participles, splicing the consecutive unknown participles into one participle, and determining the spliced participle whose length is within a second preset length range to be a core word of the query word;
If there are non-consecutive unknown participles, determining each non-consecutive unknown participle whose length is within the second preset length range to be a core word of the query word.
5. The method according to any one of claims 1 to 3, characterized in that, if the query word has a core word, the method further comprises:
Using the core word of the query word as a keyword to query the points of interest corresponding to the query word.
6. A device for extracting core words, characterized by comprising:
A participle unit, configured to segment a query word using a preset participle mode to obtain the participles that make up the query word;
A participle matching unit, configured to match the participles of the query word against the words in a preset core word dictionary and a preset non-core word dictionary, respectively;
A first core word extracting unit, configured to, if the participles of the query word include a participle that matches the core word dictionary and/or a participle that matches the non-core word dictionary, and also include an unknown participle: determine the participle that matches the core word dictionary to be a core word of the query word; and take an unknown participle, or a participle obtained by splicing unknown participles, that meets a preset core word length standard as a core word of the query word, wherein an unknown participle is a participle that matches no word in either the core word dictionary or the non-core word dictionary.
7. The device according to claim 6, characterized by further comprising:
A first participle recomposition unit, configured to, if none of the participles of the query word matches a word in the core word dictionary but all of them match words in the non-core word dictionary: find, among the participles of the query word, a participle that is an administrative region name; judge whether the participle immediately following it is an administrative region name; if not, splice the administrative region name participle and the participle immediately following it into one participle; and take the other participles of the query word together with the spliced participle as the new participles of the query word, and trigger the participle matching unit for the new participles.
8. The device according to claim 6, characterized by further comprising:
A second core word extracting unit, configured to, if all participles of the query word are unknown participles: judge whether the length of the query word is within a first preset length range, and if so, determine the query word to be the core word of the query word.
9. The device according to any one of claims 6 to 8, characterized in that, when taking an unknown participle, or a participle obtained by splicing unknown participles, that meets the preset core word length standard as a core word of the query word, the first core word extracting unit is specifically configured to: if there are consecutive unknown participles, determine the spliced participle whose length is within a second preset length range to be a core word of the query word; and if there are non-consecutive unknown participles, determine each non-consecutive unknown participle whose length is within the second preset length range to be a core word of the query word.
10. The device according to any one of claims 6 to 8, characterized by further comprising:
A query unit, configured to, if the query word has a core word, use the core word of the query word as a keyword to query the points of interest corresponding to the query word.
CN201310376577.8A 2013-08-26 2013-08-26 A kind of method and device for extracting core word Active CN104424177B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310376577.8A CN104424177B (en) 2013-08-26 2013-08-26 A kind of method and device for extracting core word

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310376577.8A CN104424177B (en) 2013-08-26 2013-08-26 A kind of method and device for extracting core word

Publications (2)

Publication Number Publication Date
CN104424177A true CN104424177A (en) 2015-03-18
CN104424177B CN104424177B (en) 2017-09-15

Family

ID=52973182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310376577.8A Active CN104424177B (en) 2013-08-26 2013-08-26 A kind of method and device for extracting core word

Country Status (1)

Country Link
CN (1) CN104424177B (en)

Also Published As

Publication number Publication date
CN104424177B (en) 2017-09-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200520

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 102200, No. 8, No., Changsheng Road, Changping District science and Technology Park, Beijing, China. 1-5

Patentee before: AUTONAVI SOFTWARE Co.,Ltd.

TR01 Transfer of patent right