WO2008107305A2 - Search-based word segmentation method and device for language without word boundary tag - Google Patents

Search-based word segmentation method and device for language without word boundary tag

Info

Publication number
WO2008107305A2
Authority
WO
WIPO (PCT)
Prior art keywords
word segmentation
candidate word
segment
units
search results
Prior art date
Application number
PCT/EP2008/052051
Other languages
English (en)
French (fr)
Other versions
WO2008107305A3 (en)
Inventor
Wen Liu
Yong Qin
Xin Jing Wang
Original Assignee
International Business Machines Corporation
Ibm United Kingdom Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corporation, Ibm United Kingdom Limited
Publication of WO2008107305A2
Publication of WO2008107305A3

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/53: Processing of non-Latin text

Definitions

  • the present invention relates to the field of word segmentation technologies for a language without a word boundary tag, and in particular to a search-based word segmentation method and device for a language without a word boundary tag.
  • a sentence will typically comprise a set of consecutive characters, with no delimiter (i.e., separator) between words. How to delimit words depends upon whether the word in question is a phoneme word, a vocabulary word, a morphology word, a sentence-making-based word, a semantics word or a psychology word. Consequently, for any word-based language processing, for example Text-to-Speech (i.e. speech synthesis, or TTS), document feature extraction, automatic document abstraction, automatic document sorting, and Chinese text searching, the first step is to segment each sentence into words.
  • NLP Chinese Natural Language Processing
  • word segmentation involves mainly two research issues: word boundary disambiguation and unknown word identification.
  • these two issues are considered to be two separate tasks, and hence are dealt with using different components in a cascaded or consecutive manner.
  • owing to specific linguistic properties of Chinese words, a major difficulty in Chinese word segmentation is that the output can vary depending upon different linguistic definitions of words and different engineering requirements.
  • in dictionary-based methods, a predefined dictionary is used along with artificial grammar rules.
  • sentences are segmented in accordance with the dictionaries, and the grammar rules are used to improve the performance.
  • A typical dictionary-based technique is called maximum matching, in which an input sentence is compared with entries in a dictionary to find the entry that includes the greatest number of matching characters.
  • the accuracy of this type of method is seriously affected by the limited coverage of the dictionary and the lack of robust statistical inference in the rules. Since it is virtually impossible to list all the words in a predefined dictionary and impossible to update the dictionary in a timely manner, the accuracy of such methods degrades sharply as new words appear.
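The maximum matching technique described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular tool: the function name `forward_maximum_match`, the `max_len` window, and the toy Latin-letter dictionary are all assumptions made for the example. It uses the common greedy forward variant, which takes the longest dictionary entry at each position and falls back to a single character when nothing matches.

```python
def forward_maximum_match(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching (illustrative sketch).

    At each position, take the longest dictionary entry that starts there;
    if no entry matches, emit a single character.
    """
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; letters stand in for characters.
toy_dictionary = {"ab", "abc", "cd", "de"}
print(forward_maximum_match("abcde", toy_dictionary))
print(forward_maximum_match("abxde", toy_dictionary))
```

In the second call the character "x" is not covered by the dictionary and is emitted on its own, which illustrates the coverage limitation noted above.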
  • Statistical machine learning methods are word segmentation methods for text using probabilities or a cost-based scoring mechanism instead of dictionaries.
  • Current statistical machine learning methods fall roughly into the following categories: 1) the MSRSeg method, involving two parts, where one part is a generic segmenter, which is based upon the framework of linear mixture models and unifies five features of word-level Chinese language processing, including lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and the other part is a set of output adaptors for adapting an output of the generic segmenter to different application-specific standards; 2) an approach in which information of adjacent characters is utilized to join the N-grams and their adjacent characters; 3) a maximum likelihood approach; 4) an approach employing neural networks; 5) a unified HHMM (Hierarchical Hidden Markov Model)-based framework within which a Chinese lexical analyzer is introduced; 6) an approach in which various available features in a sentence are extracted to construct a generalized model, and then various probabilistic models are derived based upon this model; and
  • Transformation-based methods are initially used in POS (Part-of-Speech) tagging and parsing.
  • the main idea of these methods is to try to learn a set of n-gram rules from a training corpus and to apply them to segmentation of a new text.
  • the learning algorithm compares the corpus (serving as a dictionary) with its un-segmented counterpart to find the rules.
  • One transformation-based method trains taggers based on manually annotated data so as to automatically assign Chinese characters with tags that indicate the position of a character within a word.
  • the tagged output is then converted into segmented text for evaluation.
  • Another transformation-based method presented is Chinese word segmentation algorithms based upon the so-called LMR tagging.
  • the LMR taggers in such a method are implemented with the Maximum Entropy Markov Model, and transformation-based learning is adopted to combine results of two LMR taggers that scan an input in opposite directions.
  • a further transformation-based method presents a statistical framework, and identifies domain-specific or strongly time-dependent words based upon linear models, and then performs adaptation to standards by a post-processor performing a series of conversions on an output from the generic segmenter to implement a single word-segmentation system.
  • the transformation-based methods learn N-gram rules from training corpora, and therefore are still limited to training corpora.
  • Combining Methods are methods which combine several current methods or various information.
  • dictionary and word frequency information can be combined; a maximum entropy model and a transformation-based model can be combined; several Support Vector Machines can be trained, and how a dynamic weighted method works for the segmentation task can be explored; a Hidden Markov Model-based word segmenter and a Support Vector Machine-based chunker can be combined for this task.
  • (Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation, Li, M., Gao, J.F., Huang, C.N., and Li, J.F., Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Jul. 2003, pp. 1-7)
  • an unsupervised training approach is proposed to resolve overlapping ambiguities in Chinese word segmentation, which trains a set of Naïve Bayesian classifiers from an unlabelled Chinese text corpus.
  • a system can be conveniently customized to meet various user-defined standards in the segmentation of MDWs (Morphologically Derived Words).
  • all MDWs contain word trees where root nodes correspond to maximal words and leaf nodes correspond to minimal words.
  • Each non-terminal node in the tree is associated with a resolution parameter, which determines whether its children are to be displayed as a single word or separate words.
  • Different outputs of segmentation can be obtained from different cuts of the word tree, which cuts are specified by the user through the different value combinations of those resolution parameters.
  • the combining methods merely combine the several types of methods described previously, and therefore may still be similarly limited.
  • the present invention accordingly provides, in a first aspect, a search-based word segmentation method for a language without a word boundary tag, comprising the steps of: a. providing at least one search engine with a segment of a text comprising at least one segment; b. searching for the segment through the at least one search engine, and returning search results; and c. selecting a word segmentation approach for the segment in accordance with at least part of the returned search results.
  • the at least part of the returned search results are top-ranked search results.
  • the step c comprises the steps of: extracting, from the at least part of the returned search results, all candidate word segmentation units appearing in the segment; scoring the extracted candidate word segmentation units; ranking subsets of extracted candidate word segmentation units in accordance with scores, wherein the candidate word segmentation units in each subset form sequentially the segment; and selecting a highest-ranked subset as the word segmentation approach for the segment.
  • the step c further comprises the step of filtering out, from the extracted candidate word segmentation units, an invalid candidate word segmentation unit, which is one of a unigram or a word segmentation unit that does not appear in the segment.
  • the method for scoring the candidate word segmentation units is frequency-based, and for the part of the search results, a ratio of the number of occurrences of the scored candidate word segmentation units to the total number of occurrences of all the candidate segmentation units is taken as the scores of the scored candidate word segmentation units.
  • the method for scoring the candidate word segmentation units is SVM (Support Vector Machine)-based, using an SVM classifier or an SVM regression model to score each candidate word segmentation unit; and representing the candidate word segmentation units, which are data points, as feature vectors so as to train the SVM classifier and the SVM regression model.
  • SVM Support Vector Machine
  • a feature extracted for each candidate word segmentation unit comprises one or a combination of the following features: the number of characters in the candidate word segmentation unit; an average occurrence rate, which is the number of times that the candidate word segmentation unit appears, divided by the number of documents in the search results returned by the search engine; and a document frequency, which is the number of search results containing the candidate word segmentation unit.
  • a subset of candidate word segmentation units with the highest average score of candidate word segmentation units is selected as the word segmentation approach for the segment.
  • the extracting of candidate word segmentation units from the returned search results is implemented via extracting highlighted phrases in the returned snippets.
  • a word segmentation unit is obtained by viewing adjacencies of positions of terms in a document using information provided from an indexing table.
  • a search-based word segmentation device for a language without a word boundary tag, comprising: at least one search engine, adapted to receive a segment of a text comprising at least one segment, to search in a search network for the segment, and to return search results; and a word segmentation result generating means, adapted to select a word segmentation approach for the segment in accordance with at least part of the returned search results.
  • the at least part of the search results returned by the at least one search engine are top-ranked search results.
  • the word segmentation result generating means is further adapted to: extract, from the at least part of the returned search results, all candidate word segmentation units appearing in the segment; score the extracted candidate word segmentation units; rank subsets of extracted candidate word segmentation units in accordance with scores, wherein the candidate word segmentation units in each subset form sequentially the segment; and select a highest-ranked subset as the word segmentation approach for the segment.
  • the word segmentation result generating means is further adapted to filter out, from the extracted candidate word segmentation units, an invalid candidate word segmentation unit, which is one of a unigram or a word segmentation unit that does not appear in the segment.
  • the word segmentation result generating means scores the candidate word segmentation units in a frequency-based manner
  • the word segmentation result generating means is further adapted to: for the part of the search results, take a ratio of the number of occurrences of the scored candidate word segmentation units to the total number of occurrences of all the candidate segmentation units as the scores of the scored candidate word segmentation units.
  • the word segmentation result generating means scores the candidate word segmentation units in a SVM (Support Vector Machine)-based manner, uses an SVM classifier or an SVM regression model to score each candidate word segmentation unit, and represents the candidate word segmentation units, which are data points, as feature vectors so as to train the SVM classifier and the SVM regression model.
  • a feature extracted for each candidate word segmentation unit comprises one or a combination of the following features: the number of characters in the candidate word segmentation unit; an average occurrence rate, which is the number of times that the candidate word segmentation unit appears, divided by the number of documents in the search results returned by the search engine; and a document frequency, which is the number of search results containing the candidate word segmentation unit.
  • the word segmentation result generating means is further adapted to select a subset of candidate word segmentation units with the highest average score of candidate word segmentation units as the word segmentation approach for the segment.
  • the word segmentation result generating means extracts candidate word segmentation units from the returned search results by extracting highlighted phrases in the returned snippets.
  • the word segmentation result generating means is adapted to use information provided from an indexing table to view adjacencies of positions of terms in a document to obtain a word segmentation unit.
  • a computer program comprising computer program code to, when loaded into a computer system and executed thereon, cause said computer system to perform all the steps of a method according to the first aspect.
  • a preferred embodiment of the present invention provides a search-based word segmentation method and device for a language without a word boundary tag, which can solve relatively well the problem of word segmentation for a language without a word boundary tag and thus overcome the disadvantages in the prior art.
  • the invention uses search results returned from a search engine to segment words, and thus combats the limitations of the current word segmentation approaches in terms of flexibility, dependence upon coverage of dictionaries, available training data corpora, processing of a new word, etc.
  • a search-based word segmentation method for a language without a word boundary tag including the steps of: a. providing at least one search engine with a segment of a text including at least one segment; b. searching for the segment through the at least one search engine, and returning search results; and c. selecting a word segmentation approach for the segment in accordance with at least part of the returned search results.
  • a search-based word segmentation device for a language without a word boundary tag, including: at least one search engine, adapted to receive a segment of a text including at least one segment, to search in a search network for the segment, and to return search results; and a word segmentation result generating means, adapted to select a word segmentation approach for the segment in accordance with at least part of the returned search results.
  • the invention may be advantageous in the following.
  • the invention uses a search technology for word segmentation of a language without a word boundary, such as
  • the invention lies in detection of a new word.
  • the invention provides a very easy way to identify an OOV (out-of-vocabulary) word, e.g. "[…]" (SARS), while new words emerge every day, since information available on the Internet is dynamic and updated rapidly.
  • In contrast, the previous methods require support from dictionaries, and the dictionaries are limited regardless of whether they are used for a real-time query (e.g. a dictionary-based method) or for training a word segmentation model (e.g. a statistical method, etc.).
  • various word segmentation units can be provided through a search engine. For instance, a query "[…]" ("had a try") returns "[…]" ("tried"), "[…]" ("a try"), and "[…]" ("had a try") from Yahoo! Search.
  • This feature, together with the word segmentation unit scoring step in the invention, enables the inventive method and device to adapt to various standards.
  • the manual labeling of a training corpus is a time-consuming and tedious task, while the inventive method and device may be entirely unsupervised, since the only step in the invention which may require training relates to the scoring function. According to the invention, if a "term frequency" is used as a scoring criterion for word segmentation units, then no model needs to be trained, thus making the entire solution unsupervised. Since the preferred embodiments of the invention use numerous documents retrieved through a search engine from the Internet to obtain initial word segmentation units, and the documents are human-written and hence in compliance with natural language, the inventive method and device can directly obtain a correct word segmentation result without a natural language analysis of the documents, in contrast with the previous methods.
  • Fig.1 is a schematic diagram of elementary elements in a search-based word segmentation system for a language without a word boundary tag according to an embodiment of the invention.
  • Fig.2 depicts a search-based word segmentation method for a language without a word boundary tag according to an embodiment of the invention.
  • Fig.3 depicts a flow chart of an example of the search-based word segmentation method according to an embodiment of the invention.
  • Fig.4 depicts search results of the search using the public Yahoo! search engine.
  • Fig.5 depicts one illustrative word segmentation result according to the invention.
  • Fig.6 depicts another illustrative word segmentation result according to the invention.
  • a segment of the text including at least one segment is provided as a query content to at least one search engine 1.
  • the query content can be provided to the engine, for instance, through a keyboard input, a manual input, a voice input, a direct operation on the text (e.g. a segment of text is selected for the operation), or any other available way.
  • the segments of the text can be separated by punctuation marks or other marking contents or symbols.
  • searches for the query content (segment) are made by the search engine 1 in a search network 2, such as the Internet, and the search results are returned.
  • a word segmentation generating means 3 selects an optimal word segmentation approach for the submitted segment in accordance with the returned search results.
  • a sentence is segmented by punctuation into a group of sentence units. Then each sentence unit is submitted as a query to a search engine. All candidate phrases (i.e. the hits), called candidate word segmentation units, are extracted from snippets of the documents which are returned from the search engine. A score can be calculated for each candidate word segmentation unit. All the candidate word segmentation units form a plurality of subsets. The candidate word segmentation units in each subset are cascaded to form the submitted query, that is, a "path" (i.e. sequence), and an optimal "path" is taken as the word segmentation result of the submitted sentence unit.
  • Fig.3 depicts a flow chart of an example of the search-based word segmentation method according to the embodiment of the invention.
  • a document S is input, e.g. a Chinese document.
  • the given document S is segmented by punctuation into sentence units, thus giving $\{s_i\}$ illustrated in Fig.3, where $i$ indicates the $i$-th item in $\{s_i\}$.
  • respective items are processed until all the items in $\{s_i\}$ are processed.
  • each of the segmented sentence units, i.e. each $s_i \in \{s_i\}$, is submitted to a search engine, which typically provides various word segmentation units.
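The first step, splitting a document into sentence units at punctuation, can be sketched as below. This is only an illustration under stated assumptions: the delimiter set in `_DELIMITERS` and the function name `split_into_sentence_units` are hypothetical choices for the example, and the sample text is made up; the source does not specify which marks are treated as separators.

```python
import re

# Assumed delimiter set: a few common Chinese and ASCII punctuation marks.
_DELIMITERS = "。！？；，、：,.!?;:"
_SPLIT_RE = re.compile("[" + re.escape(_DELIMITERS) + r"\s]+")


def split_into_sentence_units(document):
    """Split a document into non-empty sentence units at punctuation or whitespace."""
    return [unit for unit in _SPLIT_RE.split(document) if unit]


# Each resulting unit would then be submitted as one query to a search engine.
for unit in split_into_sentence_units("他高兴地说。今天天气好，我们走吧！"):
    print(unit)
```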
  • a set of all word segmentation units $\{w_j\}$, returned from all search engines, is collected, where $j$ is an index of a word segmentation unit.
  • if a public search engine like Yahoo!, Google, etc. is used, one can extract a candidate word segmentation unit from HTML source files of returned search results, that is, extract a highlighted phrase in returned snippets, such as a red one illustrated in Fig.4, which illustrates search results of the public Yahoo! search engine for "[…]" ("he said horr").
  • if a self-maintained search engine is available, information directly provided from an indexing table can be used to view adjacencies of positions of terms in a document to obtain a word segmentation unit.
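Extracting highlighted phrases from snippet HTML could look like the sketch below. It assumes, purely for illustration, that the search engine marks query hits with `<b>...</b>` tags; the actual markup varies by engine and is not specified in the source. The class `HighlightExtractor` and the function `extract_highlighted_phrases` are hypothetical names, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class HighlightExtractor(HTMLParser):
    """Collect the text inside a given highlight tag (assumed to be <b>)."""

    def __init__(self, tag="b"):
        super().__init__()
        self.tag = tag
        self.depth = 0      # nesting depth inside highlight tags
        self.buffer = []    # text pieces of the phrase being read
        self.phrases = []   # completed highlighted phrases

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                phrase = "".join(self.buffer).strip()
                if phrase:
                    self.phrases.append(phrase)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_highlighted_phrases(snippet_html, tag="b"):
    """Return highlighted phrases from one snippet, in document order."""
    parser = HighlightExtractor(tag)
    parser.feed(snippet_html)
    parser.close()
    return parser.phrases


print(extract_highlighted_phrases("他<b>高兴</b>地<b>说</b>"))
```

Each returned phrase is one candidate word segmentation unit; repeated phrases are kept so that occurrences can be counted later.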
  • the invention will not be limited to this, but it is also possible to collect all highlighted phrases given in search results from a public or self-maintained search engine, and to combine the search results. Indeed, the collection of candidate word segmentation units based upon multiple search engines provided with different local segmentation models will yield a better segmentation performance, because a feature (e.g. frequency) is calculated based upon top-ranked documents, and local segmentation models affect the search results and hence the candidate word segmentation units.
  • a search engine preliminarily segments a submitted query into a set of terms.
  • the search engine indexes all documents that contain one or more of these terms (i.e. hits), calculates a score for each document based upon the hits, ranks the documents, and finally outputs the top-ranked documents (e.g. the first 1,000) to the user.
  • a distribution (e.g. frequency) of a term indicates the popularity of the term, or how probably certain characters will associate with each other. Still referring to Fig.4, as can be seen, "[…]" appears four times. If the frequency with which a term appears is used as a criterion for evaluating a candidate word segmentation unit, then "[…]" ("happy") will be preferred to "[…]" ("hesville"), as the former has a higher frequency than the latter. On the other hand, an n-gram or a local segmentation model as adopted by the search engine may not be effective per se.
  • the collected candidate word segmentation units are highlighted phrases in snippets of retrieved documents. Because Web documents are human-written, they follow natural language. Even if the local segmentation of a search engine is not correct, the local segmentation will be corrected by those documents, that is, by the way people speak. Taking an extreme case as an example, it can be assumed that a search engine separates each character, i.e. neither a local segmentation model nor an n-gram is adopted, and the search engine uses each unigram (i.e. each term contains only one character) as a term to index the documents. In this case, these terms will be adjacent to each other in the retrieved documents.
  • Fig.4 illustrates an example of Yahoo!
  • invalid word segmentation units can preferably be filtered out from $\{w_j\}$.
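The filtering step can be sketched in a few lines. Following the definition given earlier in the text, a candidate is treated as invalid if it is a unigram or does not appear in the submitted segment. The function name `filter_candidates` and the sample data are assumptions for illustration.

```python
def filter_candidates(candidates, segment):
    """Drop invalid units: unigrams and units that do not occur in the segment."""
    return [w for w in candidates if len(w) > 1 and w in segment]


# Made-up candidates for a made-up query segment.
print(filter_candidates(["高兴", "说", "高兴地", "天气"], "他高兴地说"))
```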
  • in step S1105, all collected candidate word segmentation units are scored, and various available scoring methods can be used for this step.
  • two scoring methods will be described illustratively, namely a frequency-based method and an SVM (Support Vector Machine)-based method.
  • the frequency-based method is used as a scoring method.
  • the simplest way is to use, based upon the search results, the occurrence frequencies of all terms in each $w_j$ as scores.
  • the occurrence frequencies of all terms are defined as Eq.(1) below:
  • $$S_{tf}(w_j) = \frac{\sum_{k=1}^{N} N_k(w_j)}{\sum_{j'} \sum_{k=1}^{N} N_k(w_{j'})} \qquad (1)$$
  • where $S_{tf}(w_j)$ indicates a term frequency score of $w_j$,
  • $N$ gives the number of documents retrieved by $s_i$, and
  • $N_k(w_j)$ is the number of times that $w_j$ appears in snippets of the $k$-th document in the case that a public search engine is used.
  • Eq.(1) gives the ratio of the number of occurrences of $w_j$ to the total number of occurrences of all the segmentation units $\{w_j\}$ corresponding to the query $s_i$.
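The frequency-based score, i.e. the ratio of a unit's occurrences to the total occurrences of all candidate units for the query, can be sketched as follows. This is a simplified illustration: occurrences are counted here by plain substring matching in snippet text, which is an assumption of the example rather than something the source prescribes, and `term_frequency_scores` is a hypothetical name.

```python
from collections import Counter


def term_frequency_scores(snippets, candidates):
    """Score each candidate as its share of all candidate occurrences."""
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    counts = Counter()
    for snippet in snippets:
        for w in unique:
            counts[w] += snippet.count(w)  # non-overlapping substring count
    total = sum(counts.values())
    if total == 0:
        return {w: 0.0 for w in unique}
    return {w: counts[w] / total for w in unique}


# Made-up snippets; letters stand in for characters.
print(term_frequency_scores(["ab ab ab", "cd"], ["ab", "cd"]))
```

Because the scores are normalized by the total count, they sum to one whenever at least one candidate occurs.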
  • this method corresponds to the maximum likelihood criterion.
  • this criterion minimizes an empirical risk on a dataset when the dataset is large enough (in compliance with the law of large numbers).
  • the use of the maximum likelihood method as a nonlinear fitting method in the embodiment may be advantageous in that parameters estimated in this method will maximize a positive logarithmic likelihood value or minimize a negative logarithmic likelihood value.
  • as for the SVM-based method, when a dataset is not large enough, one may resort to minimizing a structural risk, and the SVM-based method is an algorithm that tries to minimize the structural risk on a dataset.
  • Different kernels may be tried, such as RBF kernel, sigmoid kernel, linear and polynomial kernels. It is possible to choose either an SVM classifier or an SVM regression model to score a word segmentation unit.
  • since an SVM regression model requires providing a numerical score for each training data point, it is generally difficult to specify a score strategy.
  • in this embodiment, SVM classifiers are therefore used to score each word segmentation unit.
  • each data point (i.e. candidate word segmentation unit) is represented as a feature vector. For instance, one or a combination of the following three types of features can be extracted for each word segmentation unit:
  • LEN: The "LEN" feature is defined as the number of characters in a word segmentation unit. A longer word segmentation unit is preferred to a shorter one, because the former indicates a better semantic unit in applications such as speech synthesis and speech recognition.
  • AVGOCCU: The "AVGOCCU" feature is defined as an average occurrence rate, that is, the number of times that a word segmentation unit appears, preferably in a set of "valid" word segmentation units (i.e., those which remain in the set of word segmentation units after invalid word segmentation units are filtered out), divided by the number of documents returned by the search engine. A higher AVGOCCU value indicates a better word segmentation unit.
  • DF: The "DF" feature is defined as a document frequency, that is, for a word segmentation unit, how many search results contain the word segmentation unit. The larger the DF, the better the word segmentation unit.
  • one or more features also can be used as the feature(s) of a word segmentation unit.
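The three features described above can be computed with a short helper. The sketch below assumes that each search result is available as a plain-text snippet and counts occurrences by substring matching; both are simplifications for illustration, and `extract_features` is a hypothetical name. Training an SVM on such vectors is out of scope here.

```python
def extract_features(unit, snippets):
    """Return the LEN, AVGOCCU and DF features for one candidate unit."""
    occurrences = [snippet.count(unit) for snippet in snippets]
    n_docs = len(snippets)
    return {
        # LEN: number of characters in the unit.
        "LEN": len(unit),
        # AVGOCCU: total occurrences divided by the number of returned documents.
        "AVGOCCU": sum(occurrences) / n_docs if n_docs else 0.0,
        # DF: number of search results that contain the unit at least once.
        "DF": sum(1 for count in occurrences if count > 0),
    }


# Made-up snippets; letters stand in for characters.
print(extract_features("ab", ["ab ab", "cd", "ab"]))
```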
  • in step S1106, an optimal subset of candidate word segmentation units is determined from the candidate word segmentation units in accordance with the scoring results obtained in step S1105.
  • Various methods can be utilized in an embodiment of the invention to determine an optimal subset of candidate word segmentation units.
  • the highest-ranked path can be found through terms of a reconstructed query sentence.
  • An illustrative path-finding method is dynamic programming.
  • This constraint facilitates the generation of $w^* = w_1 w_2 \ldots w_j \ldots w_n$ by limiting the selection of $w_{j+1}$ given $w_j$.
  • the beginning character of $w_{j+1}$ should be the one which immediately follows the ending character of $w_j$ in the character string $s_i$.
  • An example of the ranking function is given in Eq.(2) below, which defines the optimal subset $w^*$ of word segmentation units as the subset of word segmentation units that gives a sequence with the highest path score:
  • $$w^* = \arg\max_{w} \frac{1}{n} \sum_{j=1}^{n} S(w_j) \qquad (2)$$
  • where $S(w_j)$ is a score given via either the frequency-based method or the SVM-based method, and
  • $n$ is the number of word segmentation units contained in the optimal subset.
  • in step S1107, the optimal subset of word segmentation units is output as the way in which the query sentence is segmented.
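Path selection by dynamic programming can be sketched as below. The example picks, among all sequences of scored units that concatenate exactly to the segment, the one with the highest average unit score, matching the average-score criterion stated earlier in the text. Because an average is not additive, the table is indexed by both position and number of units; that design, the name `best_segmentation`, and the toy scores are assumptions of this sketch rather than details taken from the source.

```python
def best_segmentation(segment, scores):
    """Return (units, average_score) for the best covering path, or None."""
    if not segment:
        return None
    n = len(segment)
    # best[i][k] = (best total score, previous position) for covering
    # segment[:i] with exactly k units.
    best = [dict() for _ in range(n + 1)]
    best[0][0] = (0.0, None)
    for i in range(n):
        for k, (total, _) in best[i].items():
            for unit, score in scores.items():
                # The next unit must start right where the previous one ended.
                if unit and segment.startswith(unit, i):
                    j = i + len(unit)
                    cand = total + score
                    if k + 1 not in best[j] or cand > best[j][k + 1][0]:
                        best[j][k + 1] = (cand, i)
    if not best[n]:
        return None  # no sequence of candidates covers the whole segment
    # Choose the unit count whose best path has the highest average score.
    k = max(best[n], key=lambda count: best[n][count][0] / count)
    average = best[n][k][0] / k
    units, i = [], n
    while k > 0:
        prev = best[i][k][1]
        units.append(segment[prev:i])
        i, k = prev, k - 1
    return units[::-1], average


# Toy scores; letters stand in for characters.
print(best_segmentation("abcd", {"ab": 0.4, "cd": 0.3, "abc": 0.2, "d": 0.1}))
```

When no combination of candidates covers the segment, the function returns `None`; a caller might then fall back to single characters or to another segmenter.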
  • Fig.5 illustrates word segmentation results of the inventive method vs. the IBM Full-Parser (a current dictionary-based word segmentation tool used by IBM). "[…]" is a new word, and does not exist in the dictionaries of the IBM Full-Parser due to limitations of the dictionary-based method. Therefore, the IBM Full-Parser segments "[…]" into four independent word units. However, the new word can be identified correctly by the inventive method since the latter uses a set of documents, e.g. the Internet, and thus can be dynamic and updated in real time.
  • Fig.6 gives an example for this, and illustrates word segmentation results of the inventive method for an illustrative sentence have titles of a technical post and these haven't titles of a technical post )" vs. the IBM Full-Parser.
  • involves different meanings possibly segmented in the way of either " ⁇ P f ⁇ (monk)" and “7 ⁇ (haven't)" or (have)”.
  • the illustrative sentence gives the context of a technical post)" is meaningless to " ⁇ P ⁇ ". Therefore, the context information actually defines that a correct word segmentation approach should be the latter one, " ⁇ P", and
  • inventive method may be encoded as a program, which may be stored on a computer readable storage medium and executed by a computer to implement the inventive method. Therefore, a product of a computer program encoded according to the inventive method, and a computer readable storage medium, which stores the computer program, shall be encompassed by the invention.
  • Various languages without a word boundary tag can be processed, various methods for inputting a query can be used, one or more search engines can be utilized, static or dynamic weighting can be performed on search results obtained from different search engines, and any other scoring method for candidate word segmentation units or ranking method for subsets of candidate word segmentation units can be used.
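
The scoring and ranking steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the function names `frequency_scores` and `best_segmentation` are hypothetical, the frequency-based score is assumed to be the fraction of search-result snippets containing a candidate unit, and subsets are assumed to be ranked by the mean score of their n units (suggested by the role of n in the text, but not confirmed by it). An ASCII query is used because the Chinese examples are not legible here.

```python
from typing import Dict, List, Optional, Tuple


def frequency_scores(candidates: List[str], snippets: List[str]) -> Dict[str, float]:
    """Assumed frequency-based score: fraction of snippets containing the unit."""
    if not snippets:
        return {unit: 0.0 for unit in candidates}
    return {
        unit: sum(unit in snippet for snippet in snippets) / len(snippets)
        for unit in candidates
    }


def best_segmentation(
    query: str, unit_scores: Dict[str, float]
) -> Optional[Tuple[List[str], float]]:
    """Return the units that tile `query` exactly with the highest mean score.

    Returns None when the query is empty or no subset of units covers it.
    Exhaustive search: fine for short queries, exponential in the worst case.
    """
    best: Optional[Tuple[List[str], float]] = None

    def extend(pos: int, path: List[str]) -> None:
        nonlocal best
        if pos == len(query):
            if path:
                mean = sum(unit_scores[u] for u in path) / len(path)
                if best is None or mean > best[1]:
                    best = (list(path), mean)
            return
        for unit in unit_scores:
            # Skip empty units so the recursion always advances.
            if unit and query.startswith(unit, pos):
                path.append(unit)
                extend(pos + len(unit), path)
                path.pop()

    extend(0, [])
    return best


if __name__ == "__main__":
    # Hypothetical scores for candidate units of the query "thecatsat".
    scores = {"the": 0.9, "cat": 0.8, "sat": 0.7, "cats": 0.3, "at": 0.4}
    print(best_segmentation("thecatsat", scores))
```

Averaging rather than summing the unit scores keeps the ranking from favouring subsets merely because they contain more units; any other ranking method could be substituted, as the text notes.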

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/EP2008/052051 2007-03-07 2008-02-20 Search-based word segmentation method and device for language without word boundary tag WO2008107305A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710086030.9 2007-03-07
CNA2007100860309A CN101261623A (zh) 2007-03-07 2007-03-07 Search-based word segmentation method and device for language without word boundary tag

Publications (2)

Publication Number Publication Date
WO2008107305A2 true WO2008107305A2 (en) 2008-09-12
WO2008107305A3 WO2008107305A3 (en) 2008-11-06

Family

ID=39707621

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2008/052051 WO2008107305A2 (en) 2007-03-07 2008-02-20 Search-based word segmentation method and device for language without word boundary tag

Country Status (3)

Country Link
US (1) US8131539B2 (en)
CN (1) CN101261623A (zh)
WO (1) WO2008107305A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353309A (zh) * 2019-12-25 2020-06-30 北京合力亿捷科技股份有限公司 Method and system for processing communication quality complaint addresses based on text analysis

Families Citing this family (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8219407B1 (en) 2007-12-27 2012-07-10 Great Northern Research, LLC Method for processing the output of a speech recognizer
US20090326916A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Unsupervised chinese word segmentation for statistical machine translation
CA2639438A1 (en) * 2008-09-08 2010-03-08 Semanti Inc. Semantically associated computer search index, and uses therefore
CN101430680B (zh) * 2008-12-31 2011-01-19 阿里巴巴集团控股有限公司 Method and system for selecting a word segmentation sequence for text in a language without word boundary tags
US20100191758A1 (en) * 2009-01-26 2010-07-29 Yahoo! Inc. System and method for improved search relevance using proximity boosting
EP2488963A1 (en) * 2009-10-15 2012-08-22 Rogers Communications Inc. System and method for phrase identification
US9081868B2 (en) * 2009-12-16 2015-07-14 Google Technology Holdings LLC Voice web search
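CN102411563B (zh) * 2010-09-26 2015-06-17 阿里巴巴集团控股有限公司 Method, device and system for identifying target words
JP5043209B2 (ja) * 2011-03-04 2012-10-10 楽天株式会社 Set expansion processing device, set expansion processing method, program, and recording medium
CN102955773B (zh) * 2011-08-31 2015-12-02 国际商业机器公司 Method and system for identifying chemical names in Chinese documents
CN102567529B (zh) * 2011-12-30 2013-11-06 北京理工大学 Cross-language text classification method based on two-view active learning
TWI608367B (zh) * 2012-01-11 2017-12-11 國立臺灣師範大學 Chinese text readability measurement system and method thereof
CN103324607B (zh) * 2012-03-20 2016-11-23 北京百度网讯科技有限公司 Method and device for word segmentation of Thai text
TW201403354A (zh) * 2012-07-03 2014-01-16 Univ Nat Taiwan Normal System and method for constructing a mathematical model of Chinese text readability using data dimensionality reduction and nonlinear algorithms
CN104462051B (zh) * 2013-09-12 2018-10-02 腾讯科技(深圳)有限公司 Word segmentation method and device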
US9817823B2 (en) 2013-09-17 2017-11-14 International Business Machines Corporation Active knowledge guidance based on deep document analysis
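CN104517106B (zh) * 2013-09-29 2017-11-28 北大方正集团有限公司 List recognition method and system
CN103558926A (zh) * 2013-11-12 2014-02-05 金蝶软件(中国)有限公司 Place name input method and device
CN103559177A (zh) * 2013-11-12 2014-02-05 金蝶软件(中国)有限公司 Place name recognition method and device
CN103699524A (zh) * 2013-12-18 2014-04-02 百度在线网络技术(北京)有限公司 Word segmentation method and mobile terminal
CN105335446A (zh) * 2014-08-13 2016-02-17 中国科学院声学研究所 Word-vector-based short text classification model generation method and classification method
CN104156454B (zh) * 2014-08-18 2018-09-18 腾讯科技(深圳)有限公司 Method and device for error correction of search terms
CN104933023B (zh) * 2015-05-12 2017-09-01 深圳市华傲数据技术有限公司 Chinese address word segmentation and labeling method
CN104933024B (zh) * 2015-05-12 2017-09-01 深圳市华傲数据技术有限公司 Chinese address word segmentation and labeling method
CN104866472B (zh) * 2015-06-15 2017-10-27 百度在线网络技术(北京)有限公司 Method and device for generating a word segmentation training set
CN106355628B (zh) * 2015-07-16 2019-07-05 中国石油化工股份有限公司 Method and device for annotating knowledge points in images and text, and method and system for correcting image-text annotations
CN105095196B (zh) * 2015-07-24 2017-11-14 北京京东尚科信息技术有限公司 Method and device for discovering new words in text
CN105260482A (zh) * 2015-11-16 2016-01-20 金陵科技学院 Device and method for discovering new Internet words based on crowdsourcing
CN106708893B (zh) * 2015-11-17 2018-09-28 华为技术有限公司 Method and device for error correction of search query terms
CN105550170B (zh) * 2015-12-14 2018-10-12 北京锐安科技有限公司 Chinese word segmentation method and device
CN106095759B (zh) * 2016-06-20 2019-05-24 西安交通大学 Invoice goods classification method based on heuristic rules
CN111381751A (zh) * 2016-10-18 2020-07-07 北京字节跳动网络技术有限公司 Text processing method and device
TWI656450B (zh) * 2017-01-06 2019-04-11 香港商光訊網絡科技有限公司 Method and system for extracting knowledge from a Chinese corpus
JP6778654B2 (ja) * 2017-06-08 2020-11-04 日本電信電話株式会社 Word segmentation estimation model learning device, word segmentation device, method, and program
CN107295375A (zh) * 2017-06-13 2017-10-24 中国传媒大学 Variety show content feature acquisition system and application system
CN107301170B (zh) 2017-06-19 2020-12-22 北京百度网讯科技有限公司 Method and device for segmenting sentences based on artificial intelligence
CN109284763A (zh) * 2017-07-19 2019-01-29 阿里巴巴集团控股有限公司 Method and server for generating word segmentation training data
CN110945514B (zh) * 2017-07-31 2023-08-25 北京嘀嘀无限科技发展有限公司 System and method for segmenting sentences
CN107480136B (zh) * 2017-08-02 2020-07-03 逄泽沐风 Method for emotion curve analysis applied to film scripts
CN110020120B (zh) * 2017-10-10 2023-11-10 腾讯科技(北京)有限公司 Feature word processing method, device and storage medium in a content delivery system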
US10607604B2 (en) * 2017-10-27 2020-03-31 International Business Machines Corporation Method for re-aligning corpus and improving the consistency
CN108320740B (zh) * 2017-12-29 2021-01-19 深圳和而泰数据资源与云技术有限公司 Speech recognition method, device, electronic equipment and storage medium
CN108509425B (zh) * 2018-04-10 2021-08-24 中国人民解放军陆军工程大学 Novelty-based method for discovering new Chinese words
US11003854B2 (en) * 2018-10-30 2021-05-11 International Business Machines Corporation Adjusting an operation of a system based on a modified lexical analysis model for a document
US10949622B2 (en) * 2018-10-30 2021-03-16 The Florida International University Board Of Trustees Systems and methods for segmenting documents
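CN110309504B (zh) * 2019-05-23 2023-10-31 平安科技(深圳)有限公司 Word-segmentation-based text processing method, device, equipment and storage medium
CN110399452A (zh) * 2019-07-23 2019-11-01 福建奇点时空数字科技有限公司 Named entity list generation method based on instance feature modeling
CN111090720B (zh) * 2019-11-22 2023-09-12 北京捷通华声科技股份有限公司 Method and device for adding hot words
CN111274806B (zh) * 2020-01-20 2020-11-06 医惠科技有限公司 Method and device for word segmentation and part-of-speech recognition, and method and device for analyzing electronic medical records
CN113448935B (zh) * 2020-03-24 2024-04-26 伊姆西Ip控股有限责任公司 Method, electronic device and computer program product for providing log information
CN111444716A (zh) * 2020-03-30 2020-07-24 深圳市微购科技有限公司 Title word segmentation method, terminal and computer-readable storage medium
CN112765975B (zh) * 2020-12-25 2023-08-04 北京百度网讯科技有限公司 Word segmentation ambiguity processing method, device, equipment and medium
CN113704501A (zh) * 2021-08-10 2021-11-26 上海硬通网络科技有限公司 Application label acquisition method, device, electronic equipment and storage medium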

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1014278A1 (en) * 1998-12-22 2000-06-28 Xerox Corporation System for providing cross-lingual information retrieval
US20050222998A1 (en) * 2004-03-31 2005-10-06 Oce-Technologies B.V. Apparatus and computerised method for determining constituent words of a compound word

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5583763A (en) * 1993-09-09 1996-12-10 Mni Interactive Method and apparatus for recommending selections based on preferences in a multi-user system
AU2003245506A1 (en) * 2002-06-13 2003-12-31 Mark Logic Corporation Parent-child query indexing for xml databases
TW575813B (en) * 2002-10-11 2004-02-11 Intumit Inc System and method using external search engine as foundation for segmentation of word
US7680648B2 (en) * 2004-09-30 2010-03-16 Google Inc. Methods and systems for improving text segmentation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1014278A1 (en) * 1998-12-22 2000-06-28 Xerox Corporation System for providing cross-lingual information retrieval
US20050222998A1 (en) * 2004-03-31 2005-10-06 Oce-Technologies B.V. Apparatus and computerised method for determining constituent words of a compound word

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DATABASE WPI Thomson Scientific, London, GB; AN 2004-569248 XP002494848 "System and method of using external search engine as foundation for segmentation of word" & TW 575 813 A (INTUMIT INC [TW]) 11 February 2004 (2004-02-11) *

Also Published As

Publication number Publication date
US8131539B2 (en) 2012-03-06
CN101261623A (zh) 2008-09-10
US20080221863A1 (en) 2008-09-11
WO2008107305A3 (en) 2008-11-06

Similar Documents

Publication Publication Date Title
US8131539B2 (en) Search-based word segmentation method and device for language without word boundary tag
Kim et al. Two-stage multi-intent detection for spoken language understanding
CN106537370B (zh) Method and system for robust tagging of named entities in the presence of source and translation errors
US6243669B1 (en) Method and apparatus for providing syntactic analysis and data structure for translation knowledge in example-based language translation
US6442524B1 (en) Analyzing inflectional morphology in a spoken language translation system
US6278968B1 (en) Method and apparatus for adaptive speech recognition hypothesis construction and selection in a spoken language translation system
US20200226126A1 (en) Vector-based contextual text searching
CN110991180A (zh) Command recognition method based on keywords and Word2Vec
Sen et al. Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning-based methods
Comas et al. Sibyl, a factoid question-answering system for spoken documents
Kuo et al. Learning transliteration lexicons from the web
Nguyen et al. An ontology-based approach for key phrase extraction
Sen et al. Bangla natural language processing: A comprehensive review of classical machine learning and deep learning based methods
Khoufi et al. Chunking Arabic texts using conditional random fields
Dandapat Part-of-Speech tagging for Bengali
Tukur et al. Parts-of-speech tagging of Hausa-based texts using hidden Markov model
Sarkar et al. Bengali noun phrase chunking based on conditional random fields
Tedla Tigrinya Morphological Segmentation with Bidirectional Long Short-Term Memory Neural Networks and its Effect on English-Tigrinya Machine Translation
Rajendran et al. Text processing for developing unrestricted Tamil text to speech synthesis system
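KR20040018008A (ko) Part-of-speech tagging device and tagging method
KR100463376B1 (ko) Translation engine device for translating a source language into a target language, and translation method thereof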
Tüselmann et al. Named entity linking on handwritten document images
SAMIR et al. AMAZIGH NAMED ENTITY RECOGNITION: A NOVEL APPROACH.
Chen et al. Automatic title generation for Chinese spoken documents using an adaptive k nearest-neighbor approach.
Lindberg et al. Improving part of speech disambiguation rules by adding linguistic knowledge

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08709129

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08709129

Country of ref document: EP

Kind code of ref document: A2