CN106933799A - Chinese word segmentation method and device for point of interest (POI) titles - Google Patents

Chinese word segmentation method and device for point of interest (POI) titles

Info

Publication number
CN106933799A
Authority
CN
China
Prior art keywords
word
poi titles
poi
word segmentation
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201511029875.5A
Other languages
Chinese (zh)
Inventor
史川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Navinfo Co Ltd
Original Assignee
Navinfo Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Navinfo Co Ltd
Priority to CN201511029875.5A
Publication of CN106933799A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a Chinese word segmentation method and device for point of interest (POI) titles. The method includes: obtaining a segmentation dictionary produced by processing a predetermined total sample of POI titles, the segmentation dictionary including keywords extracted from the POI titles in the predetermined total sample and the word frequency of each keyword in the predetermined total sample; and performing full segmentation on a first POI title to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI title belongs to different keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for that character. The method and device resolve the segmentation ambiguity that arises for a single character when POI titles are segmented, make the segmentation result more reasonable, and ensure the accuracy of the segmentation.

Description

Chinese word segmentation method and device for point of interest (POI) titles
Technical field
The present invention relates to the field of word segmentation technology, and in particular to a Chinese word segmentation method and device for point of interest (POI) titles.
Background
With the rapid development of the Internet, the amount of information people can access is expanding sharply. This mass of information makes resources easy to obtain, but because information of all kinds is mixed together it also makes information hard to filter. Word segmentation technology allows information to be filtered and organized so that people obtain more accurate and relevant resources, which makes work and daily life more convenient and greatly improves efficiency. Because Chinese words are not separated by delimiters, applying existing Chinese word segmentation techniques to point of interest (POI) titles leads to phrase segmentation ambiguity. The segmentation result then deviates from the actual meaning, which directly affects subsequent information processing and retrieval.
Summary of the invention
The technical problem to be solved by the present invention is to provide a Chinese word segmentation method and device for point of interest (POI) titles, so as to resolve the segmentation ambiguity that arises in Chinese word segmentation of POI titles.
In one aspect, an embodiment of the present invention provides a Chinese word segmentation method for POI titles, including:
obtaining a segmentation dictionary produced by processing a predetermined total sample of POI titles, the segmentation dictionary including keywords extracted from the POI titles in the predetermined total sample and the word frequency of each keyword in the predetermined total sample;
performing full segmentation on a first POI title to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI title belongs to different keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for that character.
When the first POI title includes non-Chinese characters, the method further includes:
converting the first POI title to half-width characters, extracting all non-Chinese character groups in the first POI title and marking their positions, and adding the non-Chinese character groups to the first segmentation result.
After the first segmentation result is obtained, the method further includes:
determining whether the keywords in the first segmentation result include an unregistered word that is not present in the segmentation dictionary;
if so, counting the word frequency of the unregistered word in the predetermined total sample of POI titles and, when that frequency is higher than a predetermined threshold, adding the unregistered word to the segmentation dictionary.
The step of performing full segmentation on the first POI title to be segmented to obtain the first segmentation result includes:
matching the first POI title against the segmentation dictionary using a maximum matching method to obtain a first matching result;
correcting the first matching result according to the principle of fewest single-character tokens to obtain the first segmentation result.
The segmentation dictionary further includes a national road name database and a neighborhood name configuration table.
In another aspect, to implement the above method, an embodiment of the present invention further provides a Chinese word segmentation device for POI titles, including:
an acquisition module, configured to obtain a segmentation dictionary produced by processing a predetermined total sample of POI titles, the segmentation dictionary including keywords extracted from the POI titles in the predetermined total sample and the word frequency of each keyword in the predetermined total sample;
a first segmentation module, configured to perform full segmentation on a first POI title to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI title belongs to different keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for that character.
The device further includes:
a second segmentation module, configured to convert the first POI title to half-width characters, extract all non-Chinese character groups in the first POI title and mark their positions, and add the non-Chinese character groups to the first segmentation result.
The device further includes:
a judging module, configured to determine whether the keywords in the first segmentation result obtained by the segmentation module include an unregistered word that is not present in the segmentation dictionary;
a counting and adding module, configured to, if the result of the judging module is yes, count the word frequency of the unregistered word in the predetermined total sample of POI titles and, when that frequency is higher than a predetermined threshold, add the unregistered word to the segmentation dictionary.
The first segmentation module includes:
a matching unit, configured to match the first POI title against the segmentation dictionary using a maximum matching method to obtain a first matching result;
a correction unit, configured to correct the first matching result according to the principle of fewest single-character tokens to obtain the first segmentation result.
The segmentation dictionary further includes a national road name database and a neighborhood name configuration table.
The above technical solution of the present invention has at least the following beneficial effects:
For a single character in a POI title that can be segmented into different keywords, the technical solution compares the word frequencies of those keywords in the segmentation dictionary and takes the keyword with the highest word frequency as the segmentation result for that character. This resolves the segmentation ambiguity that arises for a single character when POI titles are segmented, makes the segmentation result more reasonable, and ensures the accuracy of the segmentation.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the Chinese word segmentation method for POI titles according to a method embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the Chinese word segmentation device for POI titles according to a device embodiment of the present invention;
Fig. 3 is another schematic structural diagram of the Chinese word segmentation device for POI titles according to a device embodiment of the present invention;
Fig. 4 is a further schematic structural diagram of the Chinese word segmentation device for POI titles according to a device embodiment of the present invention;
Fig. 5 is an example flow of the Chinese word segmentation method for POI titles according to a specific embodiment of the present invention.
Detailed description
To make the technical problem to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the drawings and specific embodiments.
Method embodiment
Referring to Fig. 1, which shows a schematic flowchart of the Chinese word segmentation method for POI titles according to a method embodiment of the present invention, the method may include:
Step S101: obtain a segmentation dictionary produced by processing a predetermined total sample of POI titles, the segmentation dictionary including keywords extracted from the POI titles in the predetermined total sample and the word frequency of each keyword in the predetermined total sample.
In the above embodiment, the predetermined total sample of POI titles is organized to obtain a segmentation dictionary for segmenting POI titles. The organizing may be done manually; this embodiment does not limit the specific processing mode. The segmentation dictionary includes the keywords extracted from the POI titles in the predetermined total sample and the word frequency of each keyword in that sample. The keywords may be extracted according to attributes predefined for POI titles and stored in the segmentation dictionary under those attributes.
In addition, the predetermined total sample of POI titles is a set of POI titles collected and recorded in advance. The number of POI titles in the sample is sufficiently large and their coverage sufficiently wide. The embodiments of the present invention do not limit how the sample is collected or recorded.
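As an illustration of the dictionary obtained in step S101, the following minimal sketch models one dictionary entry as a keyword, a few attribute flags and a word frequency. The class name `DictionaryEntry`, the helper functions and the whitespace-separated row format are assumptions made for this example; the sample rows reuse values from the storage table given in the specific example further below.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class DictionaryEntry:
    """One keyword of the segmentation dictionary (hypothetical layout)."""
    idcode: str      # identifier, e.g. "K00271"
    name: str        # the keyword itself
    location: bool   # True if the keyword denotes a place
    adjective: bool  # part-of-speech flag
    brand: bool      # True if the keyword is a brand
    noun: bool       # part-of-speech flag
    frequency: int   # word frequency in the predetermined total sample


def parse_row(row: str) -> DictionaryEntry:
    """Parse 'IDCODE NAME Y/N Y/N Y/N Y/N FREQUENCY' (assumed row format)."""
    idcode, name, loc, adj, brand, noun, freq = row.split()
    return DictionaryEntry(idcode, name, loc == "Y", adj == "Y",
                           brand == "Y", noun == "Y", int(freq))


def build_dictionary(rows: Iterable[str]) -> Dict[str, DictionaryEntry]:
    """Index the entries by keyword so that lookups are O(1)."""
    return {entry.name: entry for entry in map(parse_row, rows)}


ROWS = [
    "K00270 Cinema Y N N Y 2065345",
    "K00271 KFC N N Y Y 7844",
    "K00272 Hardware N N N Y 48732",
]
dictionary = build_dictionary(ROWS)
print(dictionary["KFC"].brand, dictionary["KFC"].frequency)  # True 7844
```

Indexing by keyword keeps both the membership test and the frequency lookup cheap, which is what the later segmentation steps rely on.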
Step S102: perform full segmentation on a first POI title to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI title belongs to different keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for that character.
In the above embodiment, while full segmentation is performed on the first POI title to be segmented, a single character may yield several different keywords under different segmentation modes, that is, the segmentation of that character is ambiguous. In that case, the word frequencies recorded in the segmentation dictionary show how often each candidate keyword is used in the predetermined total sample of POI titles, and the keyword with the highest word frequency is taken as the segmentation result for the character. This gives a more accurate segmentation of the characters in the first POI title. For example, full segmentation of the title "复北大药房东汉阳路店" (roughly "Fubei Pharmacy, East Hanyang Road Branch") yields "复", "北大" (Peking University), "大药房" (large pharmacy), "药房" (pharmacy), "东汉阳路" (East Hanyang Road) and "店" (shop). Here the single character "大" can belong to two different keywords, "北大" and "大药房", depending on the segmentation mode. The segmentation is decided by the word frequencies of "北大" and "大药房" in the predetermined total sample of POI titles: the segmentation dictionary gives the two keywords word frequencies of 213 and 43782 respectively, so the segmentation result for the character "大" is "大药房".
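The frequency-based choice described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the Chinese title is a reconstruction of the example, the function names are hypothetical, and only the frequencies 213 and 43782 come from the text (the other frequencies are invented placeholders).

```python
from typing import Dict, List, Optional, Tuple

# Keyword -> word frequency in the total sample.
# 213 and 43782 are from the example; the other two values are made up.
FREQ: Dict[str, int] = {"北大": 213, "大药房": 43782, "药房": 5000, "东汉阳路": 12}


def full_segmentation(title: str, freq: Dict[str, int]) -> List[Tuple[int, int, str]]:
    """Return every dictionary keyword occurring in title as (start, end, word)."""
    spans = []
    for start in range(len(title)):
        for end in range(start + 1, len(title) + 1):
            if title[start:end] in freq:
                spans.append((start, end, title[start:end]))
    return spans


def resolve_character(title: str, index: int, freq: Dict[str, int]) -> Optional[str]:
    """Among the keywords covering title[index], pick the most frequent one."""
    candidates = [word for start, end, word in full_segmentation(title, freq)
                  if start <= index < end]
    if not candidates:
        return None  # the character is not covered by any dictionary keyword
    return max(candidates, key=lambda word: freq[word])


title = "复北大药房东汉阳路店"
print(resolve_character(title, title.index("大"), FREQ))  # 大药房
```

The character "大" is covered by both "北大" and "大药房"; the higher frequency of "大药房" decides the ambiguity, as in the example.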
In one possible implementation of the method embodiment, when the first POI title includes non-Chinese characters, the method further includes:
converting the first POI title to half-width characters, extracting all non-Chinese character groups in the first POI title and marking their positions, and adding the non-Chinese character groups to the first segmentation result.
Here, when the first POI title to be segmented includes non-Chinese characters, the first POI title is converted to half-width characters, all non-Chinese character groups in it are extracted and their positions are marked, and the non-Chinese character groups are added to the first segmentation result. The marked positions then serve as natural delimiters when the remaining Chinese characters of the first POI title are segmented.
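A minimal sketch of this preprocessing, assuming that "non-Chinese character groups" means runs of ASCII letters and digits (the text does not define the exact character classes). The function names and the sample title are hypothetical.

```python
import re
from typing import List, Tuple


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    space (U+3000) to their half-width counterparts."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


# Assumption: a non-Chinese group is a run of ASCII letters and digits.
NON_CHINESE = re.compile(r"[A-Za-z0-9]+")


def extract_non_chinese(title: str) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Return the half-width title and the (start, end, text) of each group."""
    half = to_half_width(title)
    groups = [(m.start(), m.end(), m.group()) for m in NON_CHINESE.finditer(half)]
    return half, groups


half, groups = extract_non_chinese("ＫＦＣ肯德基２４小时店")
print(half)    # KFC肯德基24小时店
print(groups)  # [(0, 3, 'KFC'), (6, 8, '24')]
```

The recorded start and end offsets are what later act as natural delimiters: the Chinese text between two groups can be segmented independently.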
In one possible implementation of the method embodiment, after the first segmentation result is obtained, the method further includes:
determining whether the keywords in the first segmentation result include an unregistered word that is not present in the segmentation dictionary;
if so, counting the word frequency of the unregistered word in the predetermined total sample of POI titles and, when that frequency is higher than a predetermined threshold, adding the unregistered word to the segmentation dictionary.
In the above embodiment, unregistered words in the first segmentation result are identified and their word frequencies are counted over the predetermined total sample of POI titles. When the word frequency of an unregistered word is higher than the predetermined threshold, the unregistered word is added to the segmentation dictionary, which extends the set of keywords in the dictionary.
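The unregistered-word step can be sketched as below. The counting method (substring occurrences over the sample titles), the function names, the sample titles and the threshold are assumptions made for illustration; the text only requires that the frequency be counted in the total sample and compared with a predetermined threshold.

```python
from typing import Dict, Iterable, List


def count_in_sample(word: str, sample_titles: Iterable[str]) -> int:
    """Count non-overlapping occurrences of word across all sample titles."""
    return sum(title.count(word) for title in sample_titles)


def add_frequent_unregistered(tokens: Iterable[str],
                              dictionary: Dict[str, int],
                              sample_titles: List[str],
                              threshold: int) -> List[str]:
    """Add tokens missing from the dictionary whose sample frequency is
    strictly higher than threshold; return the words that were added."""
    added = []
    for token in tokens:
        if token in dictionary:
            continue  # already a registered keyword
        frequency = count_in_sample(token, sample_titles)
        if frequency > threshold:
            dictionary[token] = frequency
            added.append(token)
    return added


# Hypothetical data for illustration only.
dictionary = {"大药房": 43782}
sample = ["新洋大药房", "新洋超市", "新洋宾馆", "复北大药房"]
added = add_frequent_unregistered(["新洋", "大药房", "复北"], dictionary, sample, threshold=2)
print(added)  # ['新洋']
```

With this toy sample, "新洋" occurs three times and is added, while "复北" occurs once and stays out of the dictionary.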
In step S102, the step of performing full segmentation on the first POI title to be segmented to obtain the first segmentation result may include:
matching the first POI title against the segmentation dictionary using a maximum matching method to obtain a first matching result; and correcting the first matching result according to the principle of fewest single-character tokens to obtain the first segmentation result.
In the above embodiment, the first POI title is matched against the keywords in the segmentation dictionary using a maximum matching method, which gives a first matching result made up of the matched dictionary keywords. Then, according to the principle of fewest single-character tokens, adjacent single characters in the first matching result are merged into one token, which corrects the first matching result and gives the first segmentation result. For example, matching "新洋大药房" (roughly "Xinyang Pharmacy") gives the first matching result "新/洋/大药房"; correcting it according to the principle of fewest single-character tokens gives the first segmentation result "新洋/大药房". In this embodiment, the maximum matching method may be one or more of forward maximum matching, reverse maximum matching and bidirectional maximum matching.
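The two sub-steps can be sketched as follows: forward maximum matching, then merging of adjacent single characters. This is a minimal illustration under assumptions: the function names and the `max_len` window are hypothetical, the Chinese strings are reconstructions of the examples, and only the forward variant of maximum matching is shown.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Scan left to right, always taking the longest dictionary word that
    starts at the current position; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def merge_single_chars(tokens: List[str]) -> List[str]:
    """Merge each run of adjacent single-character tokens into one token,
    which reduces the number of single-character tokens in the result."""
    merged = []
    run = ""
    for token in tokens:
        if len(token) == 1:
            run += token
            continue
        if run:
            merged.append(run)
            run = ""
        merged.append(token)
    if run:
        merged.append(run)
    return merged


first_match = forward_max_match("新洋大药房", {"大药房"})
print(first_match)                      # ['新', '洋', '大药房']
print(merge_single_chars(first_match))  # ['新洋', '大药房']
```

A lone single character with no single-character neighbour, such as a trailing "店", is left unchanged by the merge.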
The segmentation dictionary further includes a national road name database and a neighborhood name configuration table.
To sum up, for a single character in a POI title that can be segmented into different keywords, the Chinese word segmentation method for POI titles provided by the method embodiment compares the word frequencies of those keywords in the segmentation dictionary and takes the keyword with the highest word frequency as the segmentation result for that character. This resolves the segmentation ambiguity that arises for a single character when POI titles are segmented, makes the segmentation result more reasonable, and ensures the accuracy of the segmentation.
The present invention is described in more detail below through a specific implementation example.
Referring to Fig. 5, which shows an example flow of the Chinese word segmentation method for POI titles according to a specific embodiment of the present invention, the steps of the method include:
A. Extract keywords according to the characteristics of the POI titles in the predetermined total sample of POI titles. For example, from a title such as "Shengda Hardware, Electrical Appliances, General Merchandise and Tools Firm", the keywords "hardware", "electrical appliances", "general merchandise", "tools" and "firm" need to be extracted. These keywords are stored in the segmentation dictionary under preset attributes, which include part of speech, whether the keyword is a brand, whether it is a place, and so on. The word frequency of each keyword in the predetermined total sample of POI titles is added to the segmentation dictionary alongside it. The storage format is shown in the following table:
IDCODE  NAME      LOCATION  ADJECTIVE  BRAND  NOUN  FREQUENCY
K00270  Cinema    Y         N          N      Y     2065345
K00271  KFC       N         N          Y      Y     7844
K00272  Hardware  N         N          N      Y     48732
K00273  Company   Y         N          N      Y     884245
K00274  Fast      N         Y          N      N     1045623
In addition, a national road name database and a neighborhood name configuration table are added to the segmentation dictionary.
B. Preprocess the POI title to be segmented. The POI title is converted to half-width characters; the various delimiter symbols in the title, such as dashes and brackets, are recorded as segmentation marks; and the English words, digits and similar groups in the title are extracted and their positions marked.
C. Perform Chinese word segmentation on the POI title to be segmented. The POI title is fully segmented using the forward maximum matching algorithm. For example, full segmentation of "复北大药房东汉阳路店" based on the segmentation dictionary yields "复", "北大", "大药房", "药房", "东汉阳路" and "店" (here the road name "东汉阳路", East Hanyang Road, is found in the national road name database and is therefore not split). Ambiguities that arise during segmentation are resolved by word frequency: in the predetermined total sample of POI titles the word frequency of "北大" is 213 and that of "大药房" is 43782, so "大药房" is chosen and the first matching result is "复/北/大药房/东汉阳路/店". The first matching result is then corrected according to the principle of fewest single-character tokens, giving the first segmentation result "复北/大药房/东汉阳路/店".
D. Handle unregistered words. Determine whether the first segmentation result contains unregistered words that are not present in the segmentation dictionary. If so, count the word frequency of each unregistered word in the predetermined total sample of POI titles and, when the word frequency reaches the predetermined threshold, add the unregistered word to the segmentation dictionary.
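Step C above can be approximated end to end by the sketch below. It is a deliberate simplification and an assumption of this example, not the procedure of the embodiment: instead of a forward maximum matching pass followed by a frequency check, it collects all candidate keywords, keeps road names whole, and then greedily picks non-overlapping keywords in order of decreasing frequency. The function name, the Chinese strings and the frequency of "药房" are hypothetical; 213 and 43782 come from the example.

```python
from typing import Dict, List, Set, Tuple


def segment(title: str, freq: Dict[str, int], road_names: Set[str]) -> List[str]:
    """Greedy, frequency-driven segmentation (simplified sketch)."""
    n = len(title)
    spans: List[Tuple[float, int, int]] = []
    for start in range(n):
        for end in range(start + 2, n + 1):
            piece = title[start:end]
            if piece in road_names:
                spans.append((float("inf"), start, end))  # never split a road name
            elif piece in freq:
                spans.append((freq[piece], start, end))

    # Choose spans by descending weight, skipping any that overlap a chosen one.
    taken = [False] * n
    chosen: List[Tuple[int, int]] = []
    for _, start, end in sorted(spans, key=lambda s: (-s[0], s[1])):
        if not any(taken[start:end]):
            chosen.append((start, end))
            for k in range(start, end):
                taken[k] = True

    # Uncovered characters between chosen spans are emitted as one merged
    # token, which mirrors the "fewest single-character tokens" correction.
    tokens: List[str] = []
    pos = 0
    for start, end in sorted(chosen):
        if pos < start:
            tokens.append(title[pos:start])
        tokens.append(title[start:end])
        pos = end
    if pos < n:
        tokens.append(title[pos:])
    return tokens


FREQ = {"北大": 213, "大药房": 43782, "药房": 5000}  # 5000 is a made-up value
ROADS = {"东汉阳路"}
result = segment("复北大药房东汉阳路店", FREQ, ROADS)
print("/".join(result))  # 复北/大药房/东汉阳路/店
```

Under these assumed inputs the sketch reproduces the first segmentation result of the example; a production segmenter would also need the delimiter positions from step B and the unregistered-word update from step D.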
Device embodiment
Referring to Fig. 2, which shows a schematic structural diagram of the Chinese word segmentation device for POI titles according to a device embodiment of the present invention. To implement the above method embodiment, the device embodiment provides a Chinese word segmentation device for POI titles, which may include:
an acquisition module 210, configured to obtain a segmentation dictionary produced by processing a predetermined total sample of POI titles, the segmentation dictionary including keywords extracted from the POI titles in the predetermined total sample and the word frequency of each keyword in the predetermined total sample;
a first segmentation module 220, configured to perform full segmentation on a first POI title to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI title belongs to different keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for that character.
On the basis of Fig. 2, referring to Fig. 3, which shows another schematic structural diagram of the Chinese word segmentation device for POI titles according to the device embodiment, the device may further include:
a second segmentation module 230, configured to convert the first POI title to half-width characters, extract all non-Chinese character groups in the first POI title and mark their positions, and add the non-Chinese character groups to the first segmentation result.
On the basis of Fig. 2, referring to Fig. 4, which shows a further schematic structural diagram of the Chinese word segmentation device for POI titles according to the device embodiment, the device may further include:
a judging module 240, configured to determine whether the keywords in the first segmentation result obtained by the segmentation module include an unregistered word that is not present in the segmentation dictionary;
a counting and adding module 250, configured to, if the result of the judging module is yes, count the word frequency of the unregistered word in the predetermined total sample of POI titles and, when that frequency is higher than a predetermined threshold, add the unregistered word to the segmentation dictionary.
The first segmentation module 220 may include:
a matching unit, configured to match the first POI title against the segmentation dictionary using a maximum matching method to obtain a first matching result;
a correction unit, configured to correct the first matching result according to the principle of fewest single-character tokens to obtain the first segmentation result.
The segmentation dictionary further includes a national road name database and a neighborhood name configuration table.
The Chinese word segmentation device for POI titles provided by the above device embodiment is based on the same concept as the above method embodiment. For its specific implementation, refer to the method embodiment; the details are not repeated here.
In summary, for a single character in a POI title that can be segmented into different keywords, the Chinese word segmentation method and device for POI titles provided by the above embodiments compare the word frequencies of those keywords in the segmentation dictionary and take the keyword with the highest word frequency as the segmentation result for that character. This resolves the segmentation ambiguity that arises for a single character when POI titles are segmented, makes the segmentation result more reasonable, and ensures the accuracy of the segmentation.
The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.
It should be noted that, for brevity, the foregoing embodiments are each described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and that the actions involved are not necessarily required by the present invention.
In addition, in the embodiments of the invention, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations.

Claims (10)

1. A Chinese word segmentation method for point of interest (POI) titles, characterized by comprising:
obtaining a segmentation dictionary produced by processing a predetermined total sample of POI titles, the segmentation dictionary including keywords extracted from the POI titles in the predetermined total sample and the word frequency of each keyword in the predetermined total sample;
performing full segmentation on a first POI title to be segmented to obtain a first segmentation result, wherein, if the same single character in the first POI title belongs to different keywords under different segmentation modes, then, according to the word frequencies in the predetermined total sample of the keywords obtained under the different segmentation modes, the keyword with the highest word frequency is taken as the segmentation result for the single character.
2. The method according to claim 1, characterized in that, when the first POI title includes non-Chinese characters, the method further comprises:
converting the first POI title to half-width characters, extracting all non-Chinese character groups in the first POI title and marking the positions of the non-Chinese character groups, and adding the non-Chinese character groups to the first segmentation result.
3. The method according to claim 1, characterized in that, after the first segmentation result is obtained, the method further comprises:
determining whether the keywords in the first segmentation result include an unregistered word that is not present in the segmentation dictionary;
if so, counting the word frequency of the unregistered word in the predetermined total sample of POI titles and, when the frequency of the unregistered word is higher than a predetermined threshold, adding the unregistered word to the segmentation dictionary.
4. The method according to claim 1, characterized in that the step of performing full segmentation on the first POI title to be segmented to obtain the first segmentation result comprises:
matching the first POI title against the segmentation dictionary using a maximum matching method to obtain a first matching result;
correcting the first matching result according to the principle of fewest single-character tokens to obtain the first segmentation result.
5. The method according to claim 1, characterized in that the segmentation dictionary further includes a national road name database and a neighborhood name configuration table.
6. A Chinese word segmentation device for point of interest (POI) titles, characterized by comprising:
An acquisition module, configured to obtain a word segmentation dictionary produced by processing a predetermined total sample of POI titles, the word segmentation dictionary including keywords extracted from the POI titles of the predetermined total sample and the word frequency of each keyword in the predetermined total sample of POI titles;
A first word segmentation module, configured to perform full segmentation on a first POI title to be segmented to obtain a first word segmentation result, wherein, if a single character in the first POI title belongs to multiple keywords under different segmentation modes, the word frequencies of those keywords in the predetermined total sample of POI titles are compared, and the keyword with the highest word frequency is taken as the word segmentation result for that character.
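The dictionary handled by the acquisition module pairs each keyword with its word frequency in the total sample. A minimal sketch of building such a mapping is shown below; how keywords are extracted from the sample is not described in this section, so the candidate keyword list is a hypothetical input, and frequency is assumed to mean non-overlapping substring occurrences.

```python
from collections import Counter


def build_segmentation_dictionary(sample_titles, candidate_keywords):
    """Return {keyword: word frequency} over the sample of POI titles.

    `candidate_keywords` is assumed to come from a separate extraction step.
    Keywords that never occur in the sample are dropped.
    """
    freq = Counter()
    for title in sample_titles:
        for keyword in candidate_keywords:
            freq[keyword] += title.count(keyword)
    return {keyword: n for keyword, n in freq.items() if n > 0}


titles = ["北京烤鸭店", "北京饭店"]
print(build_segmentation_dictionary(titles, ["北京", "烤鸭", "饭店", "银行"]))
# {'北京': 2, '烤鸭': 1, '饭店': 1}
```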
7. The device according to claim 6, characterized in that the device further comprises:
A second word segmentation module, configured to convert the first POI title to half-width characters, extract all non-Chinese character groups in the first POI title and mark the position of each non-Chinese character group, and add the non-Chinese character groups to the first word segmentation result.
8. The device according to claim 6, characterized in that the device further comprises:
A judging module, configured to determine whether the keywords in the first word segmentation result obtained by the word segmentation module include an unregistered word that is not present in the word segmentation dictionary;
A statistics and adding module, configured to, if the judging module's result is yes, count the word frequency of the unregistered word in the predetermined total sample of POI titles and, when the word frequency of the unregistered word is higher than a predetermined threshold, add the unregistered word to the word segmentation dictionary.
9. The device according to claim 6, characterized in that the first word segmentation module comprises:
A matching unit, configured to match the first POI title against the word segmentation dictionary according to a maximum matching method to obtain a first matching result;
A correction unit, configured to correct the first matching result according to the principle of fewest single-character segments to obtain the first word segmentation result.
10. The device according to claim 6, characterized in that the word segmentation dictionary further includes: a nationwide link name library and a neighborhood name configuration table.
CN201511029875.5A 2015-12-31 2015-12-31 A kind of Chinese word cutting method and device of point of interest POI titles Pending CN106933799A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511029875.5A CN106933799A (en) 2015-12-31 2015-12-31 A kind of Chinese word cutting method and device of point of interest POI titles

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511029875.5A CN106933799A (en) 2015-12-31 2015-12-31 A kind of Chinese word cutting method and device of point of interest POI titles

Publications (1)

Publication Number Publication Date
CN106933799A true CN106933799A (en) 2017-07-07

Family

ID=59443618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511029875.5A Pending CN106933799A (en) 2015-12-31 2015-12-31 A kind of Chinese word cutting method and device of point of interest POI titles

Country Status (1)

Country Link
CN (1) CN106933799A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107424612A (en) * 2017-07-28 2017-12-01 北京搜狗科技发展有限公司 Processing method, device and machine readable media
CN109033082A (en) * 2018-07-19 2018-12-18 深圳创维数字技术有限公司 The learning training method, apparatus and computer readable storage medium of semantic model
CN109710087A (en) * 2018-12-28 2019-05-03 北京金山安全软件有限公司 Input method model generation method and device
CN113688628A (en) * 2021-07-28 2021-11-23 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060116869A1 (en) * 2002-07-30 2006-06-01 Hitoshi Kimura Automatic keyword extraction apparatus, method, recording medium and program
CN103678684A (en) * 2013-12-25 2014-03-26 沈阳美行科技有限公司 Chinese word segmentation method based on navigation information retrieval
CN104077275A (en) * 2014-06-27 2014-10-01 北京奇虎科技有限公司 Method and device for performing word segmentation based on context
CN104778159A (en) * 2015-03-31 2015-07-15 北京奇虎科技有限公司 Word segmenting method and device based on word weights
CN105159885A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Point-of-interest name identification method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107424612A (en) * 2017-07-28 2017-12-01 北京搜狗科技发展有限公司 Processing method, device and machine readable media
CN109033082A (en) * 2018-07-19 2018-12-18 深圳创维数字技术有限公司 The learning training method, apparatus and computer readable storage medium of semantic model
CN109033082B (en) * 2018-07-19 2022-06-10 深圳创维数字技术有限公司 Learning training method and device of semantic model and computer readable storage medium
CN109710087A (en) * 2018-12-28 2019-05-03 北京金山安全软件有限公司 Input method model generation method and device
CN113688628A (en) * 2021-07-28 2021-11-23 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium
CN113688628B (en) * 2021-07-28 2023-09-22 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN103123618B (en) Text similarity acquisition methods and device
CN105426539B (en) A kind of lucene Chinese word cutting method based on dictionary
CN103186524B (en) A kind of place name identification method and apparatus
CN104462547B (en) A kind of method and system of configurable collecting webpage data
CN103336766A (en) Short text garbage identification and modeling method and device
CN108446388A (en) Text data quality detecting method, device, equipment and computer readable storage medium
CN104298665A (en) Identification method and device of evaluation objects of Chinese texts
CN106933799A (en) A kind of Chinese word cutting method and device of point of interest POI titles
CN104504024B (en) Keyword method for digging based on content of microblog and system
CN104408093A (en) News event element extracting method and device
CN101937436B (en) Text classification method and device
CN104281653A (en) Viewpoint mining method for ten million microblog texts
CN103761239A (en) Method for performing emotional tendency classification to microblog by using emoticons
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
Prokić et al. Recognising groups among dialects
CN105224520B (en) A kind of Chinese patent document term automatic identifying method
CN107562843B (en) News hot phrase extraction method based on title high-frequency segmentation
CN105095091B (en) A kind of software defect code file localization method based on Inverted Index Technique
CN104899335A (en) Method for performing sentiment classification on network public sentiment of information
CN104239321B (en) A kind of data processing method and device of Search Engine-Oriented
CN106445906A (en) Generation method and apparatus for medium-and-long phrase in domain lexicon
CN104866558A (en) Training method of social networking account mapping model, mapping method and system
CN107967364A (en) Web documents transmissibility appraisal procedure and device
CN104346382B (en) Use the text analysis system and method for language inquiry
CN107526792A (en) A kind of Chinese question sentence keyword rapid extracting method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170707