WO2017177809A1 - Word segmentation method and system for language text - Google Patents

Word segmentation method and system for language text

Info

Publication number
WO2017177809A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
word boundary
boundary
language text
credibility
Prior art date
Application number
PCT/CN2017/077830
Other languages
English (en)
French (fr)
Inventor
陈晓 (Chen Xiao)
李航 (Li Hang)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP17781785.5A (granted as EP3416064B1)
Publication of WO2017177809A1
Priority to US16/134,393 (granted as US10691890B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/53 Processing of non-Latin text

Definitions

  • the present application relates to the field of natural language processing and, more particularly, to a word segmentation method and system for a linguistic text.
  • Word segmentation is one of the basic problems of natural language processing. All languages without word boundary markers (such as Chinese, Japanese, Arabic, etc.) face word segmentation problems.
  • the word segmentation system has a wide range of applications in the fields of information retrieval, machine translation, and question and answer systems.
  • for example, an information retrieval system can retrieve the relevant documents as long as the word segmentation system ensures that all occurrences of "Jiang Wenyuan" in the documents are segmented consistently.
  • for a machine translation system, however, if the string "Jiang Wenyuan" is divided into "Jiang" and "Wenyuan", the word "Jiang" may be incorrectly translated into the English word "ginger", making the translation results of the machine translation system inaccurate.
  • current word segmentation systems can each only meet the needs of a specific application and are difficult to reuse in different application scenarios. Considering that some companies and organizations in the industry need to use word segmentation in several different application scenarios, the usual solution is to customize a different word segmentation system for each application. This approach leads to wasted resources and difficult system maintenance.
  • the present application provides a word segmentation method and system for language text, which can adapt to different needs of word segmentation systems in various application scenarios.
  • a first aspect provides a method for word segmentation of a language text, comprising: obtaining a to-be-processed first language text and a credibility threshold, where the credibility threshold is used to indicate the word segmentation precision required by the first language text; segmenting the first language text using a first word segmentation method to obtain a first word boundary set; dividing the first word boundary set into a trusted second word boundary set and an untrusted third word boundary set according to the credibility threshold; selecting a second language text from the first language text according to the third word boundary set, the second language text including the word corresponding to each word boundary in the third word boundary set; segmenting the second language text using a second word segmentation method to obtain a fourth word boundary set, where the segmentation precision of the second word segmentation method is higher than that of the first word segmentation method; and determining the second word boundary set and the fourth word boundary set as the word segmentation result of the first language text.
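  • the two-stage flow of the first aspect can be sketched as follows. This is a minimal sketch, not the patent's implementation: `simple_seg`, `credibility`, and `complex_seg` are hypothetical stand-in callables, and word boundaries are modeled as character offsets where a word ends.

```python
def segment(text, threshold, simple_seg, credibility, complex_seg):
    """Two-stage word segmentation (illustrative sketch of the claimed method).

    simple_seg(text)     -> first word boundary set (character offsets)
    credibility(text, b) -> credibility of boundary b, in [0, 1]
    complex_seg(span)    -> boundary offsets within the re-segmented span
    """
    first = sorted(set(simple_seg(text)))                 # first boundary set
    trusted = [b for b in first if credibility(text, b) > threshold]
    untrusted = [b for b in first if b not in trusted]    # third boundary set
    fourth = set()
    for b in untrusted:
        # The "second language text" for b spans from the previous trusted
        # boundary (or the text start) to the next one (or the text end).
        left = max([t for t in trusted if t < b], default=0)
        right = min([t for t in trusted if t > b], default=len(text))
        fourth |= {left + nb for nb in complex_seg(text[left:right])}
    return sorted(set(trusted) | fourth)                  # final result

# Toy demo with deterministic stand-ins:
simple = lambda t: [2, 4, len(t)]
cred = lambda t, b: 0.9 if b in (2, len(t)) else 0.1
cplx = lambda span: [len(span)]   # treats the whole untrusted span as one word
print(segment("ABCDEF", 0.5, simple, cred, cplx))   # -> [2, 6]
```

With the toy callbacks, the untrusted boundary at offset 4 is discarded and the span "CDEF" is re-segmented as a single word, so only the trusted boundaries survive.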
  • the word segmentation precision required for the first language text can thus be flexibly adjusted, so the method can adapt to application scenarios with different requirements on word segmentation precision. For example, for a scenario requiring high word segmentation precision, the user can input a higher credibility threshold; for a scenario requiring lower precision (but higher speed), the user can input a lower credibility threshold.
  • with reference to the first aspect, dividing the first word boundary set into the trusted second word boundary set and the untrusted third word boundary set according to the credibility threshold includes: selecting at least one word corresponding to each word boundary from the context of the word boundary in the first word boundary set; extracting features of the at least one word corresponding to each word boundary; determining, by a pre-trained classifier and according to the features of the at least one word corresponding to each word boundary, the credibility of the word boundary in the context; adding the word boundaries of the first word boundary set whose credibility is greater than the credibility threshold to the second word boundary set; and adding the word boundaries of the first word boundary set whose credibility is less than or equal to the credibility threshold to the third word boundary set.
  • with reference to the first aspect, determining, by the pre-trained classifier and according to the features of the at least one word corresponding to each word boundary, the credibility of the word boundary in the context includes: determining the credibility according to P(True | B_i, c) = exp(S(True, B_i, c)) / Σ_t exp(S(t, B_i, c)), where S(t, B_i, c) = Σ_j λ_j · f_j(t, B_i, c); P(True | B_i, c) represents the credibility of the i-th word boundary B_i of the first word boundary set in the context c; S(t, B_i, c) represents the score of the i-th word boundary B_i in the context c; f_j(t, B_i, c) represents the j-th feature of the features of the at least one word; λ_j represents a parameter of the classifier; and t represents a class corresponding to the classifier.
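  • the log-linear score and its normalization can be sketched as follows. Treating the normalization as a softmax over the classes t is an assumption consistent with the description ("the score is normalized"), and the weights and features below are toy values, not trained parameters.

```python
import math

def score(weights, feats):
    """S(t, B_i, c) = sum_j lambda_j * f_j(t, B_i, c): a weighted feature sum."""
    return sum(lam * f for lam, f in zip(weights, feats))

def credibility(weights_by_class, feats):
    """Normalize the per-class scores with a softmax; return P(True | B_i, c)."""
    scores = {t: score(w, feats) for t, w in weights_by_class.items()}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores["True"]) / z

# Two classes (boundary credible or not) and two toy features per boundary:
weights = {"True": [1.0, -0.5], "False": [-1.0, 0.5]}
p = credibility(weights, [2.0, 1.0])
print(round(p, 4))   # -> 0.9526
```

With two classes this reduces to a logistic sigmoid of the score difference; the threshold comparison then decides whether the boundary is trusted.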
  • Linear classifiers speed up the classification of word boundaries.
  • with reference to the first aspect, selecting at least one word corresponding to each word boundary from the context of the word boundary in the first word boundary set includes: selecting, from the context of the word boundary, the word corresponding to the word boundary, the previous word of that word, and the next word of that word.
  • with reference to the first aspect, the parameters of the classifier are parameters obtained by training on a target language text, where the target language text is a language text obtained by segmenting, with the first word segmentation method, a language text whose word boundaries are known.
  • using the first word segmentation method to segment the language text with known word boundaries, obtaining the target language text, and training the classifier parameters on that target language text is more consistent with the actual situation (in practice, each language text to be segmented is first segmented by the first word segmentation method), so the trained classifier is more accurate.
  • with reference to the first aspect, selecting the at least one word from the context of each word boundary in the first word boundary set includes: determining the context of the word boundary according to the position of the word boundary in the first language text, and selecting the at least one word from that context.
  • a word segmentation system for a linguistic text comprising a module capable of performing the method of the first aspect.
  • a third aspect provides a word segmentation system for a language text, comprising a memory for storing a program, a processor for executing the program, and when the program is executed, the processor executes the method of the first aspect .
  • a computer readable medium storing program code for execution by a word segmentation system, the program code comprising instructions for performing the method of the first aspect.
  • the features corresponding to each of the at least one word include: the word length of the word, the cost corresponding to the word, the type of the word in the dictionary, the phoneme of the word, whether the word contains an affix, and whether the word contains a case marker, where the cost corresponding to the word is the cost occupied by the word in the word path, and the word path is the word path formed by the segmentation result obtained with the first word segmentation method.
  • the classifier can be a linear classifier.
  • the parameters of the linear classifier are weights for each of the features of the at least one word. Linear classifiers can reduce computational complexity.
  • the confidence threshold can be used to indicate the speed of word segmentation required for the first language text.
  • the word segmentation speed of the first word segmentation method may be higher than the word segmentation speed of the second word segmentation method.
  • a word corresponding to a word boundary may refer to a word that is divided by the word boundary. For example, it can refer to a word in the word segmentation that precedes the word boundary.
  • the parameter of the classifier is a parameter obtained based on training of a target language text
  • the target language text is a language text obtained by segmenting, with the first word segmentation method, a language text whose word boundaries are known, and comparing the resulting word boundary set with the manually labeled word boundary set.
  • the training data for the parameters of the classifier includes the language text used for training, its set of known word boundaries, and the word boundary set obtained by segmenting that language text with the first word segmentation method.
  • the context of each word boundary may refer to the surroundings of the word boundary in the first language text, for example, the words to the left and/or right of the word boundary in the first language text.
  • FIG. 1 is a diagram showing an example of the structure of a word segmentation system according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a word segmentation process according to an embodiment of the present application.
  • FIG. 3 is a schematic flow chart of simple word segmentation of a language text according to an embodiment of the present application.
  • FIG. 4 is an exemplary diagram of a word map.
  • FIG. 5 is a schematic flow chart of performing complex word segmentation on a language text according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a training process of a classifier according to an embodiment of the present application.
  • FIG. 7 is a diagram showing an example of a target language text of an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a complex word segmentation module according to an embodiment of the present application.
  • FIG. 9 is a diagram showing an example of a token-based word segmentation method according to an embodiment of the present application.
  • FIG. 10 is a graph of credibility threshold versus word segmentation results in an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a word segmentation system of a language text according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a word segmentation system of a language text according to an embodiment of the present application.
  • hereinafter, the first word segmentation method is called the simple word segmentation method, and the module corresponding to it is called the simple word segmentation module. The simple word segmentation method can use a word segmentation algorithm with fast segmentation speed and high segmentation consistency, including but not limited to the shortest-path word segmentation algorithm; the second word segmentation method is called the complex word segmentation method, and the module corresponding to it is called the complex word segmentation module.
  • the complex word segmentation method can use a word segmentation algorithm with high accuracy and high algorithm complexity, including but not limited to word segmentation algorithm based on word tagging method.
  • FIG. 1 is a diagram showing an example of the structure of a word segmentation system according to an embodiment of the present application.
  • the input of the word segmentation system includes not only the input first language text but also the credibility threshold 101, and the output of the word segmentation system is based on the credibility threshold 101 to segment the first language text. The result of the word segmentation.
  • the functions of each module are described in detail below.
  • the credibility threshold 101 is a parameter input by the user; the credibility judging module uses it as the threshold for determining whether the segmentation of the simple word segmentation module is credible.
  • the credibility threshold may be, for example, a real number ranging between 0 and 1.
  • the appropriate value of the credibility threshold may differ between applications of the word segmentation system. For example, an information retrieval system has higher requirements on segmentation speed and segmentation consistency, so the credibility threshold can be set lower (such as less than 0.5), while a machine translation system requires higher segmentation correctness, so the credibility threshold can be set higher (for example, greater than 0.7).
  • the credibility determination module 202 can determine whether the word segmentation result output by the simple word segmentation module 201 is authentic.
  • the credibility judging module 202 may be a pre-trained classifier, which may be a linear classifier or a non-linear classifier.
  • the merge output module 301 may be a module that combines the word segmentation results of the simple word segmentation module 201 and the complex word segmentation module 203.
  • the core word segmentation module includes three modules: a simple word segmentation module 201, a credibility judgment module 202, and a complex word segmentation module 203.
  • the first language text input by the user is first divided by the simple word segmentation module to obtain a first word boundary set.
  • the word segmentation result of the simple word segmentation module 201 can then be passed to the credibility determination module 202 along with the confidence threshold 101 entered by the user.
  • the credibility judging module 202 determines the credibility of each word boundary in the first word boundary set and divides the first word boundary set into a trusted word boundary set and an untrusted word boundary set. The trusted word boundary set can be passed directly to the merge output module 301 as part of the final segmentation output; the untrusted word boundary set can be passed to the complex word segmentation module 203 for further segmentation, whose result is then output to the merge output module 301, merged with the trusted word boundary set, and output as the final segmentation result of the first language text.
  • one possible technical solution of the simple word segmentation module 201 is to adopt a dictionary-based word segmentation method and perform semantic disambiguation using a language model and a least word segmentation principle.
  • the simple word segmentation module 201 can segment the first language text using the process shown in FIG. 3:
  • the first language text can be segmented by using a dictionary, and the word map corresponding to the word segmentation result is established.
  • the word map shown in Fig. 4 can be established. As can be seen from Figure 4, there are two intersecting edges over the text "China has". This situation is called segmentation ambiguity and is disambiguated in the following steps.
  • the shortest path can then be found in the word map using a shortest-path search method, that is, the path from the leftmost node to the rightmost node with the fewest edges. If there is a unique shortest path, the segmentation represented by that path is used as the segmentation result of the simple word segmentation module 201.
  • if there are several shortest paths, the cost of each path can be calculated and the path with the least cost used as the segmentation result of the simple word segmentation module 201.
  • the path cost can be calculated using a unary language model.
  • the unary language model can be expressed by the following formula: C(S) = Σ_{w∈S} C(w), that is, the cost C(S) of the sentence S equals the sum of the costs of all the words w in the sentence, where the cost C(w) of each word is calculated from its probability P(w) in the unary language model (for example, C(w) = -log P(w)).
  • both the dictionary and the unary language model can be derived from a word segmentation training corpus.
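  • a minimal sketch of dictionary-based segmentation under a unigram model follows, assuming the common convention C(w) = -log P(w). The dictionary probabilities below are toy values, and boundaries are reported as word-end character offsets.

```python
import math

def min_cost_segment(text, probs):
    """Find the segmentation path with the least total cost, where each
    dictionary word w costs C(w) = -log P(w) under the unary (unigram)
    language model."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i]: minimum cost of segmenting text[:i]
    back = [0] * (n + 1)          # back[i]: start offset of the last word
    best[0] = 0.0
    for i in range(n):
        if best[i] == math.inf:
            continue
        for j in range(i + 1, n + 1):
            w = text[i:j]
            if w in probs:
                cost = best[i] - math.log(probs[w])
                if cost < best[j]:
                    best[j], back[j] = cost, i
    boundaries, i = [], n         # recover the word-end offsets
    while i > 0:
        boundaries.append(i)
        i = back[i]
    return sorted(boundaries)

probs = {"ab": 0.4, "a": 0.3, "b": 0.2, "c": 0.1}
print(min_cost_segment("abc", probs))   # -> [2, 3]
```

In the demo, the path "ab"+"c" is cheaper than "a"+"b"+"c" because -log(0.4) - log(0.1) < -log(0.3) - log(0.2) - log(0.1), so the ambiguity is resolved in favor of the two-word path.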
  • the implementation of the simple word segmentation module 201 includes, but is not limited to, the above technical solutions. Any word segmentation method with low computational complexity, high speed, and high segmentation consistency can be implemented as the simple word segmentation module 201.
  • one possible technical solution of the credibility judging module 202 is: a linear classifier.
  • for each word boundary in the first word boundary set, the linear classifier classifies the features extracted from the boundary's context, calculates the credibility of the word boundary, and compares that credibility with the credibility threshold 101 to determine whether the word boundary is credible.
  • the credibility determination module 202 can use the algorithm illustrated in FIG. 5 to divide the word boundaries in the first set of word boundaries into a set of trusted word boundaries and a set of untrusted word boundaries.
  • the credibility judging module 202 can extract the following features from the context of B i :
  • the word length of the current word W_i, the word length of the previous word W_{i-1}, and the word length of the next word W_{i+1};
  • the types of W_i, W_{i-1}, W_{i+1} in the dictionary (person name, place name, institution name, etc.);
  • other characteristics of W_i, W_{i-1}, W_{i+1} (such as phoneme, whether the word contains an affix, whether it contains a case marker);
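  • the features listed above can be extracted roughly as follows. The dictionary encoding (a word-to-type map) and the flat numeric feature layout are illustrative assumptions, not the patent's exact encoding.

```python
def boundary_features(words, i, dictionary):
    """Extract features for word boundary B_i from W_{i-1}, W_i and W_{i+1}.

    `dictionary` maps a word to its type (e.g. 'person', 'place'); words
    outside the sentence contribute empty-word features.
    """
    feats = []
    for j in (i - 1, i, i + 1):
        w = words[j] if 0 <= j < len(words) else ""
        feats.append(len(w))                       # word length
        feats.append(1 if w in dictionary else 0)  # has a dictionary type
    return feats

dictionary = {"Beijing": "place"}
print(boundary_features(["I", "Beijing", "go"], 1, dictionary))
# -> [1, 0, 7, 1, 2, 0]
```

The resulting vector is what the linear classifier's weights λ_j would be applied to; real systems would add the cost, affix, and case-marker features in the same fashion.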
  • the linear classifier can then be utilized to calculate the score of the word boundary B_i: S(t, B_i, c) = Σ_j λ_j · f_j(t, B_i, c), where j represents the index of a feature used by the linear classifier, f_j(t, B_i, c) represents the j-th feature corresponding to the word boundary B_i, λ_j represents a parameter of the classifier, and t represents a class corresponding to the classifier.
  • the score is then normalized (for example, with a softmax over the classes t) to obtain the credibility P(True | B_i, c) of the word boundary B_i.
  • the parameters λ_j of the classifier need to be trained (i.e., the weight corresponding to each feature is learned) before the linear classifier is used.
  • the classifier parameters λ_j can be trained in a machine learning manner based on a training data set.
  • the target language text used to train the classifier parameters may be obtained by segmenting, with the simple word segmentation module, a language text whose word boundaries are known (hereinafter referred to as the word segmentation training corpus). See Figure 6 for the process of creating the target language text.
  • the word boundary marks in the word segmentation training corpus may first be removed to obtain an unmarked language text; the text may then be segmented by the simple word segmentation module 201 to obtain its segmentation result.
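  • the training-data construction can be sketched as below: strip the gold boundary marks, re-segment the raw text with a (stand-in) simple segmenter, and label each predicted boundary True iff it also appears in the gold boundary set. The `two_cut` segmenter is a toy stand-in, not the patent's algorithm.

```python
def make_training_examples(gold_words, simple_seg):
    """Label the simple segmenter's boundaries against the gold standard."""
    text = "".join(gold_words)           # word boundary marks removed
    gold, pos = set(), 0
    for w in gold_words:                 # gold word-end offsets
        pos += len(w)
        gold.add(pos)
    return [(b, b in gold) for b in simple_seg(text)]

# Stand-in simple segmenter that cuts after every two characters:
two_cut = lambda t: list(range(2, len(t) + 1, 2))
print(make_training_examples(["ab", "c", "de"], two_cut))
# -> [(2, True), (4, False)]
```

The (boundary, label) pairs, combined with the features extracted around each boundary, form the supervised training set for the classifier parameters λ_j.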
  • the standard training method may be used to perform the training of the classifier to obtain the classifier parameters.
  • the complex word segmentation module 203 can be composed of two parts, as shown in FIG. 8.
  • the untrusted word boundary collector is responsible for collecting consecutive untrusted word boundaries. For example, given a fragment such as "Snow/den" (pieces of the unregistered word "Snowden" separated by untrusted boundaries), it combines the language text segments separated by those untrusted word boundaries into "Snowden" as the input to the complex tokenizer.
  • the complex tokenizer may employ a word segmentation method based on character position tagging.
  • the general principle of the tagging-based word segmentation method is to convert the word segmentation problem of the language text into the problem of assigning a tag to each character in the language text.
  • the tags B, E, O in FIG. 9 indicate the position of a character within a word: O represents a single-character word, B represents the first character of a multi-character word, and E represents the other (non-initial) positions of a multi-character word.
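  • the B/E/O scheme described above can be written out directly; the round trip between words and tags below is a sketch of the tagging-based formulation, using ASCII strings in place of Chinese characters.

```python
def words_to_tags(words):
    """Convert segmented words into per-character B/E/O position tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("O")                  # single-character word
        else:
            tags.append("B")                  # first character of a word
            tags.extend("E" * (len(w) - 1))   # remaining positions
    return tags

def tags_to_words(chars, tags):
    """Recover the segmentation from characters and their B/E/O tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag in ("B", "O") or not words:
            words.append(ch)                  # start a new word
        else:
            words[-1] += ch                   # extend the current word
    return words

print(words_to_tags(["ab", "c"]))             # -> ['B', 'E', 'O']
print(tags_to_words("abc", ["B", "E", "O"]))  # -> ['ab', 'c']
```

A sequence model trained on (character, tag) pairs then predicts tags for new text, and `tags_to_words` converts the predicted tags back into a segmentation.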
  • the complex tokenizer can employ the following training methods:
  • the training corpus with word segmentation marks can first be converted into characters and word-position tags, as shown in FIG. 9;
  • machine learning models can then be used to learn which tag should be assigned to each character in a given context.
  • the complex tokenizer can then perform word segmentation as follows:
  • the manner in which the complex word segmentation module 203 is implemented is not specifically limited in the embodiments of the present application; any word segmentation method or algorithm with high accuracy and strong unregistered-word recognition capability can be used to implement the complex word segmentation module 203.
  • the credibility threshold 101 can be a continuous variable that is provided by the user of the word segmentation system together with the first language text.
  • this variable can represent the application scenario's requirements on the word segmentation results. For example, in an information retrieval scenario, the word segmentation results must be produced quickly and consistently, while a machine translation or automatic question-answering scenario requires high accuracy of the word segmentation results.
  • the confidence threshold can be set to a real number ranging between 0 and 1.
  • Figure 10 shows the impact of the credibility threshold on the word segmentation results. It can be seen from Fig. 10 that the higher the credibility threshold, the stronger the unregistered-word recognition and disambiguation abilities and the higher the correctness of the segmentation results, but the lower the segmentation speed and consistency. Conversely, the lower the credibility threshold, the faster the segmentation and the stronger the consistency, but the weaker the unregistered-word recognition and disambiguation abilities and the lower the segmentation correctness.
  • the confidence threshold h can be set to 0.2.
  • the credibility judging module 202 can judge the credibility of each word boundary " ⁇ " in the result of the word segmentation.
  • the credibility judging module 202 calculates the credibility P(True | B_i, c) of each word boundary B_i. If P(True | B_i, c) > h, B_i is transmitted to the merge output module 301 as a trusted word boundary; if P(True | B_i, c) ≤ h, B_i is transmitted to the complex word segmentation module 203 as an untrusted word boundary. If, after the judgment by the credibility judging module 202, all the word boundaries in the above segmentation result are trusted word boundaries, the segmentation result is output directly to the merge output module 301.
  • the merge output module 301 collates the word segmentation results.
  • the word segmentation result output by the merge output module 301 can be: fan ○ and ○ convention ○ end ○ previous ○ make ○ your_small ○ one ○ time ○ . ○
  • this final word segmentation result shows that: 1) the two unregistered words "Fan Denggao" and "Wang Xiaoju" are not identified; 2) the "Xiaoju" in "Wang Xiaoju" and the "Xiaoju" in "small gathering" are segmented consistently.
  • after the judgment by the credibility judging module 202, the following word segmentation result is obtained ("/" indicates an untrustworthy word boundary): Fan/Elevation/Just ○ and ○ / ○ /Convention ○ End of the year ○ Before ○ Come ○ Yes ○ ○ once ○ . ○
  • the untrusted boundary collector collects successive untrusted boundaries from the above results, forming an untrusted interval (the portion underlined in the example sentences below):
  • the complex word segmentation module 203 performs word segmentation for each untrusted interval.
  • the merge output module 301 collates the word segmentation results.
  • the result of the word segmentation output by the merge output module 301 can be:
  • there may be one or more complex word segmentation modules in the embodiments of the present application.
  • when there are multiple complex word segmentation modules, the output of each complex word segmentation module can be used as the input of the next one, and each complex word segmentation module can receive a new credibility threshold before performing its segmentation.
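  • chaining several complex word segmentation modules, each with its own threshold, can be sketched as follows. The stage interfaces and the toy refiner are illustrative assumptions, not the patent's API; boundaries are again character offsets.

```python
def cascade(text, first_seg, refiners, credibility):
    """Chain word segmentation modules: each stage's output feeds the next,
    and each refining stage carries its own credibility threshold."""
    boundaries = set(first_seg(text))
    for refine, threshold in refiners:
        trusted = {b for b in boundaries if credibility(text, b) > threshold}
        untrusted = boundaries - trusted
        # Untrusted boundaries are re-segmented by this stage's refiner.
        boundaries = trusted | {nb for b in untrusted for nb in refine(text, b)}
    return sorted(boundaries)

first = lambda t: [2, 4]
cred = lambda t, b: 0.8 if b == 2 else 0.3
shift = lambda t, b: [b + 1]   # toy refiner: move an untrusted boundary right
print(cascade("ABCDEF", first, [(shift, 0.5)], cred))   # -> [2, 5]
```

Each (refiner, threshold) pair plays the role of one complex word segmentation module; adding a second pair re-examines the first pair's output under the new threshold.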
  • The word segmentation method for a language text according to the embodiments of the present application has been described in detail above with reference to FIGS. 1 through 10.
  • a word segmentation system for a language text according to an embodiment of the present application will be described in detail below with reference to FIGS. 11 and 12. It should be understood that the word segmentation system of FIG. 11 or FIG. 12 is capable of performing the various steps of the method described above, and to avoid repetition, it will not be described in detail herein.
  • FIG. 11 is a schematic structural diagram of a word segmentation system of a language text according to an embodiment of the present application.
  • the word segmentation system 1100 of Figure 11 includes:
  • the input module 1110 is configured to obtain a to-be-processed first language text and a credibility threshold, where the credibility threshold is used to indicate the word segmentation precision, word segmentation speed, or word segmentation consistency required by the first language text;
  • the first word segmentation module 1120 is configured to segment the first language text using the first word segmentation method to obtain a first word boundary set;
  • the credibility judging module 1130 is configured to divide the first word boundary set into a trusted second word boundary set and an untrusted third word boundary set according to the credibility threshold;
  • a selection module 1140, configured to select a second language text from the first language text according to the third word boundary set, where the second language text includes the word corresponding to each word boundary in the third word boundary set;
  • a second word segmentation module 1150, configured to segment the second language text using the second word segmentation method to obtain a fourth word boundary set, where the segmentation precision of the second word segmentation method is higher than that of the first word segmentation method;
  • the output module 1160 is configured to determine the second word boundary set and the fourth word boundary set as a word segmentation result of the first language text.
  • the word segmentation precision required for the first language text can thus be flexibly adjusted, so the system can adapt to application scenarios with different requirements on word segmentation precision. For example, for a scenario requiring high word segmentation precision, the user can input a higher credibility threshold; for a scenario requiring lower precision (but higher speed), the user can input a lower credibility threshold.
  • the credibility judging module 1130 is specifically configured to: select at least one word corresponding to each word boundary from the context of the word boundary in the first word boundary set; extract the features of the at least one word corresponding to each word boundary; determine, by a pre-trained classifier and according to the features of the at least one word corresponding to each word boundary, the credibility of the word boundary in the context; add the word boundaries of the first word boundary set whose credibility is greater than the credibility threshold to the second word boundary set; and add the word boundaries of the first word boundary set whose credibility is less than or equal to the credibility threshold to the third word boundary set.
  • the credibility judging module 1130 is specifically configured to determine the credibility of each word boundary in the context according to P(True | B_i, c) = exp(S(True, B_i, c)) / Σ_t exp(S(t, B_i, c)), where S(t, B_i, c) = Σ_j λ_j · f_j(t, B_i, c); P(True | B_i, c) represents the credibility of the i-th word boundary B_i of the first word boundary set in the context c; S(t, B_i, c) represents the score of the i-th word boundary B_i in the context c; f_j(t, B_i, c) represents the j-th feature of the features of the at least one word; λ_j represents a parameter of the classifier; and t represents a class corresponding to the classifier.
  • the credibility judging module 1130 is specifically configured to select, from the context of each word boundary, the word corresponding to the word boundary, the previous word of that word, and the next word of that word.
  • the parameter of the classifier is a parameter obtained by training based on a target language text
  • the target language text is a language text obtained by segmenting, with the first word segmentation method, a language text whose word boundaries are known.
  • FIG. 12 is a schematic structural diagram of a word segmentation system of a language text according to an embodiment of the present application.
  • the word segmentation system 1200 of Figure 12 includes:
  • a memory 1210 configured to store a program
  • the processor 1220 is configured to execute a program in the memory 1210.
  • the processor 1220 obtains a to-be-processed first language text and a credibility threshold, where the credibility threshold is used to indicate the word segmentation precision, word segmentation speed, or word segmentation consistency required by the first language text; segments the first language text using the first word segmentation method to obtain a first word boundary set; divides the first word boundary set into a trusted second word boundary set and an untrusted third word boundary set according to the credibility threshold; selects a second language text from the first language text according to the third word boundary set, where the second language text includes the word corresponding to each word boundary in the third word boundary set; segments the second language text using the second word segmentation method to obtain a fourth word boundary set, where the segmentation precision of the second word segmentation method is higher than that of the first word segmentation method; and determines the second word boundary set and the fourth word boundary set as the word segmentation result of the first language text.
  • by adjusting the credibility threshold, the word segmentation accuracy required for the first language text can be flexibly adjusted, so that the system can adapt to various application scenarios with different requirements on segmentation accuracy. For example, in a scenario requiring high segmentation accuracy, the user can input a higher credibility threshold, so that more word boundaries are re-segmented by the more accurate second method; in a scenario requiring high speed rather than high accuracy, the user can input a lower credibility threshold.
  • the processor 1220 is specifically configured to: select at least one word corresponding to each word boundary from the context of each word boundary in the first word boundary set; extract features of the at least one word corresponding to each word boundary; determine, using a pre-trained classifier and according to the features of the at least one word corresponding to each word boundary, the credibility of each word boundary in the context; add the word boundaries of the first word boundary set whose credibility is greater than the credibility threshold to the second word boundary set; and add the word boundaries of the first word boundary set whose credibility is less than or equal to the credibility threshold to the third word boundary set.
  • the processor 1220 is specifically configured to determine the credibility of each word boundary in the context according to

    P(True|B_i, c) = exp(S(True, B_i, c)) / Σ_{t∈{True,False}} exp(S(t, B_i, c))

    where P(True|B_i, c) represents the credibility of the i-th word boundary B_i of the first word boundary set in the context c, S(t, B_i, c) represents the score of the i-th word boundary B_i in the context c,

    S(t, B_i, c) = Σ_j β_j f_j(t, B_i, c),

    f_j(t, B_i, c) represents the j-th feature among the features of the at least one word, β_j represents a parameter of the classifier, t represents a class of the classifier, and t ∈ {True, False}.
  • the processor 1220 is specifically configured to select, from the context of each word boundary, the word corresponding to each word boundary, the word preceding the word corresponding to each word boundary, and the word following the word corresponding to each word boundary.
  • the parameters of the classifier are parameters obtained through training based on a target language text, where the target language text is a language text obtained by segmenting, using the first word segmentation method, a language text whose word boundaries are known.
  • the disclosed systems, apparatuses, and methods may be implemented in other manners. The apparatus embodiments described above are merely examples. For instance, the division into units is only a division by logical function; in actual implementation there may be other divisions: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
  • when the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. The part of the technical solution of the present application that is essential, or that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

Embodiments of the present invention provide a word segmentation method and system for language text. The method includes: obtaining a first language text to be processed and a credibility threshold; segmenting the first language text using a first word segmentation method to obtain a first word boundary set; dividing the first word boundary set, according to the credibility threshold, into a trusted second word boundary set and an untrusted third word boundary set; selecting, according to the third word boundary set, a second language text from the first language text, where the second language text includes the word corresponding to each word boundary in the third word boundary set; segmenting the second language text using a second word segmentation method to obtain a fourth word boundary set; and determining the second word boundary set and the fourth word boundary set as the word segmentation result of the first language text. By adjusting the credibility threshold, the word segmentation accuracy required for the first language text can be flexibly adjusted, so that the method can adapt to various application scenarios with different requirements on segmentation accuracy.

Description

Word segmentation method and system for language text
This application claims priority to Chinese Patent Application No. 201610225943.3, filed with the Chinese Patent Office on April 12, 2016 and entitled "Word Segmentation Method and System for Language Text", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of natural language processing and, more specifically, to a word segmentation method and system for language text.
Background
Word segmentation is one of the fundamental problems of natural language processing. All languages without word boundary marks (such as Chinese, Japanese, and Arabic) face the segmentation problem. Word segmentation systems are widely used in information retrieval, machine translation, question answering, and other fields.
Different applications place different requirements on the output of a word segmentation system. For example, an information retrieval system demands high segmentation speed and consistency, but is relatively tolerant of segmentation errors, such as a low recognition rate for out-of-vocabulary words (words not covered by the segmentation system). A machine translation system, in contrast, demands high segmentation correctness, while its requirement on consistency is relatively low. For example, the string "姜文远" (a person's name) is an out-of-vocabulary word. In an information retrieval application, even if the segmentation system does not treat "姜文远" as one word but splits it into the two words "姜" and "文远", the retrieval system can still find the relevant documents as long as every occurrence of "姜文远" in the documents is segmented consistently. In a machine translation system, however, if "姜文远" is split into "姜" and "文远", the character "姜" may be mistranslated as the English word "ginger", making the translation result inaccurate.
Current word segmentation systems can each satisfy only the needs of one particular application and are difficult to reuse across application scenarios. Companies and institutions in the industry that need segmentation in several different scenarios therefore usually build a dedicated segmentation system for each application, which wastes resources and complicates system maintenance.
Summary
The present application provides a word segmentation method and system for language text that can adapt to the different requirements that multiple application scenarios place on a segmentation system.
According to a first aspect, a word segmentation method for language text is provided, including: obtaining a first language text to be processed and a credibility threshold, where the credibility threshold is used to indicate the word segmentation accuracy required for the first language text; segmenting the first language text using a first word segmentation method to obtain a first word boundary set; dividing the first word boundary set, according to the credibility threshold, into a trusted second word boundary set and an untrusted third word boundary set; selecting, according to the third word boundary set, a second language text from the first language text, where the second language text includes the word corresponding to each word boundary in the third word boundary set; segmenting the second language text using a second word segmentation method to obtain a fourth word boundary set, where the word segmentation accuracy of the second word segmentation method is higher than that of the first word segmentation method; and determining the second word boundary set and the fourth word boundary set as the word segmentation result of the first language text.
By adjusting the credibility threshold, the word segmentation accuracy required for the first language text can be flexibly adjusted, so that the method can adapt to various application scenarios with different requirements on segmentation accuracy. For example, in a scenario requiring high segmentation accuracy, the user can input a higher credibility threshold, so that more word boundaries are re-segmented by the more accurate second method; in a scenario requiring high speed rather than high accuracy, the user can input a lower credibility threshold.
With reference to the first aspect, in a first implementation of the first aspect, dividing the first word boundary set, according to the credibility threshold, into a trusted second word boundary set and an untrusted third word boundary set includes: selecting at least one word corresponding to each word boundary from the context of each word boundary in the first word boundary set; extracting features of the at least one word corresponding to each word boundary; determining, using a pre-trained classifier and according to the features of the at least one word corresponding to each word boundary, the credibility of each word boundary in the context; adding the word boundaries of the first word boundary set whose credibility is greater than the credibility threshold to the second word boundary set; and adding the word boundaries of the first word boundary set whose credibility is less than or equal to the credibility threshold to the third word boundary set.
A pre-trained classifier enables fast classification of the first word boundary set.
With reference to the first implementation of the first aspect, in a second implementation of the first aspect, determining, using the pre-trained classifier and according to the features of the at least one word corresponding to each word boundary, the credibility of each word boundary in the context includes: determining the credibility of each word boundary in the context according to

P(True|B_i, c) = exp(S(True, B_i, c)) / Σ_{t∈{True,False}} exp(S(t, B_i, c))

where P(True|B_i, c) represents the credibility of the i-th word boundary B_i of the first word boundary set in the context c, S(t, B_i, c) represents the score of the i-th word boundary B_i in the context c,

S(t, B_i, c) = Σ_j β_j f_j(t, B_i, c),

f_j(t, B_i, c) represents the j-th feature among the features of the at least one word, β_j represents a parameter of the classifier, t represents a class of the classifier, and t ∈ {True, False}.
A linear classifier speeds up the classification of word boundaries.
With reference to either of the first and second implementations of the first aspect, in a third implementation of the first aspect, selecting at least one word corresponding to each word boundary from the context of each word boundary in the first word boundary set includes: selecting, from the context of each word boundary, the word corresponding to each word boundary, the word preceding the word corresponding to each word boundary, and the word following the word corresponding to each word boundary.
With reference to any one of the first to third implementations of the first aspect, in a fourth implementation of the first aspect, the parameters of the classifier are parameters obtained through training based on a target language text, where the target language text is a language text obtained by segmenting, using the first word segmentation method, a language text whose word boundaries are known.
Segmenting a language text with known word boundaries using the first word segmentation method to obtain the target language text, and training the classifier parameters on that target language text, matches the actual situation more closely (in practice, every language text to be segmented first passes through the first word segmentation method), so the trained classifier is more accurate.
With reference to any one of the first to fourth implementations of the first aspect, in a fifth implementation of the first aspect, selecting at least one word from the context of each word boundary in the first word boundary set includes: determining the context of the word boundary according to the position of the word boundary in the first language text, and selecting the at least one word from that context.
According to a second aspect, a word segmentation system for language text is provided, including modules capable of performing the method of the first aspect.
According to a third aspect, a word segmentation system for language text is provided, including a memory configured to store a program and a processor configured to execute the program; when the program is executed, the processor performs the method of the first aspect.
According to a fourth aspect, a computer-readable medium is provided, storing program code for execution by a word segmentation system, the program code including instructions for performing the method of the first aspect.
In some implementations, the features corresponding to each of the at least one word include: the word length of each word, the cost corresponding to each word, the type of each word in the dictionary, the phonology of each word, whether each word contains an affix, and whether each word contains a case marker, where the cost corresponding to each word is the cost the word occupies in a word path, and the word path is the path formed by the segmentation result produced by the first word segmentation method.
In some implementations, the classifier may be a linear classifier. In one example, the parameters of the linear classifier are the weights of the features of the at least one word. A linear classifier can reduce computational complexity.
In some implementations, the credibility threshold may be used to indicate the word segmentation speed required for the first language text.
In some implementations, the segmentation speed of the first word segmentation method may be higher than that of the second word segmentation method.
In some implementations, the word corresponding to a word boundary may be the word delimited by that boundary, for example, the word immediately preceding the boundary in the segmentation result.
In some implementations, the parameters of the classifier are parameters obtained through training based on a target language text, where the target language text is obtained by segmenting a language text with known word boundaries using the first word segmentation method and comparing the resulting word boundary set with the manually annotated word boundary set.
In some implementations, the training data for the classifier parameters includes a language text used for training, the known word boundary set of that language text, and the word boundary set obtained by segmenting that language text using the first word segmentation method.
In some implementations, the context of each word boundary may be the context of the corresponding word in the first language text, for example, the words to the left and/or right of the word boundary in the first language text.
Brief Description of Drawings
FIG. 1 is a diagram of an example structure of a word segmentation system according to an embodiment of the present application.
FIG. 2 is a schematic diagram of a word segmentation procedure according to an embodiment of the present application.
FIG. 3 is a schematic flowchart of simple segmentation of a language text according to an embodiment of the present application.
FIG. 4 is an example diagram of a word graph.
FIG. 5 is a schematic flowchart of complex segmentation of a language text according to an embodiment of the present application.
FIG. 6 is a schematic flowchart of the training process of a classifier according to an embodiment of the present application.
FIG. 7 is an example diagram of a target language text according to an embodiment of the present application.
FIG. 8 is a schematic structural diagram of a complex segmentation module according to an embodiment of the present application.
FIG. 9 is an example diagram of a tag-based word segmentation method according to an embodiment of the present application.
FIG. 10 is a graph of the effect of the credibility threshold on the segmentation result according to an embodiment of the present application.
FIG. 11 is a schematic structural diagram of a word segmentation system for language text according to an embodiment of the present application.
FIG. 12 is a schematic structural diagram of a word segmentation system for language text according to an embodiment of the present application.
Detailed Description
For ease of understanding, the first word segmentation method is hereinafter called the simple segmentation method, and its corresponding module the simple segmentation module. The simple segmentation method may use segmentation algorithms that are fast and highly consistent, including but not limited to the shortest-path segmentation algorithm. The second word segmentation method is hereinafter called the complex segmentation method, and its corresponding module the complex segmentation module. The complex segmentation method may use segmentation algorithms of high accuracy and high algorithmic complexity, including but not limited to character-tagging-based segmentation algorithms.
FIG. 1 is a diagram of an example structure of the word segmentation system according to an embodiment of the present application. Referring to FIG. 1, overall, the input of the segmentation system includes not only the first language text but also a credibility threshold 101, and the output of the system is the segmentation result obtained by segmenting the first language text under the credibility threshold 101. The functions of the modules are described in detail below.
Credibility threshold 101: a user-supplied parameter used by the credibility judgment module as the threshold for deciding whether the current segmentation by the simple segmentation module is trustworthy. The credibility threshold may, for example, be a real number between 0 and 1, and its value may differ by application scenario: an information retrieval system demands high segmentation speed and consistency, so the threshold can be set low (for example, below 0.5), whereas a machine translation system demands high segmentation correctness, so the threshold can be set high (for example, above 0.7).
In some embodiments, the credibility judgment module 202 may judge whether the segmentation result output by the simple segmentation module 201 is trustworthy. The credibility judgment module 202 may be a pre-trained classifier, which may be either linear or non-linear.
In some embodiments, the merge output module 301 may be a module that merges and outputs the segmentation results of the simple segmentation module 201 and the complex segmentation module 203.
As can be seen from FIG. 1, the core segmentation component includes three modules: the simple segmentation module 201, the credibility judgment module 202, and the complex segmentation module 203. Taking FIG. 2 as an example, the segmentation procedure based on these three modules is described below.
Specifically, the first language text input by the user is first segmented by the simple segmentation module to obtain the first word boundary set. The segmentation result of the simple segmentation module 201 is then passed, together with the user-supplied credibility threshold 101, to the credibility judgment module 202. The credibility judgment module 202 judges the credibility of each word boundary in the first word boundary set and divides the set into a trusted boundary set and an untrusted boundary set. The trusted boundary set can be passed directly to the merge output module 301 as part of the final output; the untrusted boundary set can be passed to the complex segmentation module 203 for further segmentation and then output to the merge output module 301, where it is merged with the trusted boundary set and output as the final result for the first language text.
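The flow just described can be sketched in a few lines. This is a minimal illustration, not the patented implementation: `simple_segment`, `credibility`, and `complex_segment` are hypothetical stand-ins for the three modules, and boundaries are represented as character offsets.

```python
def run_pipeline(text, threshold, simple_segment, credibility, complex_segment):
    """Segment `text` with the fast method, keep the boundaries the classifier
    trusts, and re-segment the spans around the untrusted boundaries."""
    boundaries = simple_segment(text)                        # first word boundary set
    trusted = [b for b in boundaries if credibility(b, text) > threshold]
    untrusted = [b for b in boundaries if credibility(b, text) <= threshold]
    refined = complex_segment(text, untrusted)               # higher-accuracy pass
    return sorted(set(trusted) | set(refined))               # merged final result
```

Raising `threshold` sends more boundaries through `complex_segment`, trading speed and consistency for accuracy, which is exactly the knob the credibility threshold 101 exposes.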
In some embodiments, one feasible technical scheme for the simple segmentation module 201 is dictionary-based segmentation with ambiguity resolution using a language model and the fewest-words principle.
In some embodiments, the simple segmentation module 201 may segment the first language text using the procedure shown in FIG. 3:
S310. Build a word graph.
Specifically, the first language text can be segmented against a dictionary and a word graph built from the segmentation result. Taking the first language text "市场中国有企业" as an example, the word graph of FIG. 4 can be built. As FIG. 4 shows, two crossing edges exist over the fragment "中国有"; this situation is called segmentation ambiguity and is resolved in the following steps.
S320. First ambiguity resolution.
In some embodiments, a shortest-path search can be used to find the shortest path in the word graph, that is, the path with the fewest edges from the leftmost node to the rightmost node. If a unique shortest path exists, the segmentation represented by that path is taken as the result of the simple segmentation module 201.
S330. Second ambiguity resolution.
In some embodiments, if the shortest-path search of S320 finds multiple shortest paths in the word graph, the cost of each path can be computed to find the path with the smallest cost, and the smallest-cost path is taken as the result of the simple segmentation module 201.
In some embodiments, a unigram language model can be used to compute path cost. The unigram model can be expressed by the following formulas:

C(S) = Σ_{w∈S} C(w)      (1)
C(w) = −log(P(w))      (2)

where the cost C(S) of a sentence S equals the sum of the costs of all words w in the sentence, and the cost C(w) of a word w is computed from its probability P(w) under the unigram language model.
In some embodiments, both the dictionary and the unigram language model can be obtained from a segmentation training corpus. It should be understood that the implementation of the simple segmentation module 201 includes but is not limited to the above scheme; any segmentation method with low computational complexity, high speed, and high segmentation consistency can serve as an implementation of the simple segmentation module 201.
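The dictionary-plus-unigram-cost scheme above can be sketched as a single dynamic program over the word lattice. This is an illustrative simplification: it collapses the two disambiguation passes into one minimum-cost search using formulas (1) and (2), and the dictionary probabilities are toy values, not the patent's exact algorithm.

```python
import math

def segment_min_cost(text, word_probs):
    """Find the segmentation whose total cost sum(-log P(w)) is minimal."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i]: minimal cost of segmenting text[:i]
    back = [0] * (n + 1)          # backpointer to the previous boundary
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            w = text[j:i]
            if w in word_probs:
                cost = best[j] - math.log(word_probs[w])   # C(w) = -log P(w)
                if cost < best[i]:
                    best[i], back[i] = cost, j
    words, i = [], n              # recover the word sequence from backpointers
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

With `{"ab": 0.5, "c": 0.5, "a": 0.2, "b": 0.2, "abc": 0.1}`, the text `"abc"` segments as `["ab", "c"]`, since that path's summed cost is lower than either `["abc"]` or `["a", "b", "c"]`.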
In some embodiments, one feasible technical scheme for the credibility judgment module 202 is a linear classifier. For each word boundary in the first word boundary set, the linear classifier classifies the boundary using features extracted from its context, computes the credibility of the boundary, and compares the credibility with the credibility threshold 101 to decide whether the boundary is trustworthy.
In some embodiments, the credibility judgment module 202 may use the algorithm shown in FIG. 5 to divide the word boundaries of the first word boundary set into a trusted boundary set and an untrusted boundary set.
S510. Extract features corresponding to the word boundaries in the first word boundary set.
Assume the i-th word boundary of the first word boundary set is B_i and B_i corresponds to word W_i. The credibility judgment module 202 can extract the following features from the context of B_i:
the word length of the current word W_i, of the previous word W_{i−1}, and of the next word W_{i+1};
the costs of W_i, W_{i−1}, and W_{i+1};
the dictionary types of W_i, W_{i−1}, and W_{i+1} (person name, place name, organization name, and so on);
other features of W_i, W_{i−1}, and W_{i+1} (for example phonology, whether an affix is contained, whether a case marker is contained);
various combinations of the above features.
S520. Credibility computation.
In some embodiments, the credibility of word boundary B_i can be computed with the linear classifier:

P(True|B_i, c) = exp(S(True, B_i, c)) / Σ_{t∈{True,False}} exp(S(t, B_i, c))      (3)
S(t, B_i, c) = Σ_j β_j f_j(t, B_i, c)      (4)

where j denotes the index of a feature used by the linear classifier, f_j(t, B_i, c) denotes the j-th feature corresponding to word boundary B_i, β_j denotes a parameter of the classifier, S(t, B_i, c) denotes the linear classifier's score for word boundary B_i, and t denotes a class of the classifier. Normalizing this score according to formula (3) gives the credibility P(True|B_i, c) of word boundary B_i (in the embodiments of this application, credibility is expressed as the trust probability P).
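Formulas (3) and (4) amount to a two-class log-linear (maximum entropy) model, which can be sketched as follows. This is an illustrative reduction: `beta` keyed by `(class, feature_index)` and numeric feature values are assumptions, not the patent's exact feature representation.

```python
import math

def boundary_credibility(features, beta):
    """P(True|B,c): linear score per class (formula (4)), normalised over
    t in {True, False} (formula (3))."""
    def score(t):
        return sum(beta[(t, j)] * f for j, f in enumerate(features))
    s_true, s_false = score(True), score(False)
    z = math.exp(s_true) + math.exp(s_false)   # normalisation over both classes
    return math.exp(s_true) / z
```

The returned probability is then compared with the threshold h in step S530 to route the boundary to either the merge output module or the complex segmentation module.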
S530. Credibility decision.
Let the credibility threshold 101 be h. If P(True|B_i, c) > h, then B_i is trusted and is passed to the merge output module 301; if P(True|B_i, c) ≤ h, then B_i is untrusted and is passed to the complex segmentation module 203.
In some embodiments, before the linear classifier is used, its parameters β_j must be trained (that is, the weight corresponding to each feature must be trained). For example, the classifier parameters β_j can be obtained through machine learning on a training data set.
In some embodiments, the target language text used to train the classifier parameters can be obtained by simple segmentation of a language text with known word boundaries (hereinafter, the segmentation training corpus). The construction process of the target language text is shown in FIG. 6.
S610. Segment the segmentation training corpus with the simple segmentation module 201.
In some embodiments, the word boundary marks in the segmentation training corpus can first be removed to obtain an unmarked language text; the simple segmentation module 201 can then segment this text to obtain the segmentation result of the simple segmentation module 201.
S620. Compare the word boundaries in the above segmentation result one by one with the correct word boundaries of the segmentation training corpus.
Through S620, a word boundary set output by the simple segmentation module 201 is obtained in which every word boundary carries a correct/incorrect label; this yields the target language text required for training the classifier. FIG. 7 shows an example of the construction of the target language text.
Further, after the above training data is obtained, a standard training method can be used to train the classifier and obtain the classifier parameters.
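The labelling step of S620 can be sketched as a set comparison. Boundary positions as character offsets and the function name are illustrative assumptions; the real corpus carries richer context for feature extraction.

```python
def label_boundaries(predicted, gold):
    """Label each boundary from the fast segmenter True if it also appears
    in the human-annotated gold segmentation, else False (S620)."""
    gold_set = set(gold)
    return [(b, b in gold_set) for b in predicted]
```

The `(boundary, label)` pairs, together with the features extracted around each boundary, form the training set for the classifier parameters β_j.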
In some embodiments, the complex segmentation module 203 may consist of two parts; see FIG. 8.
In some embodiments, the untrusted-boundary collector may be responsible for collecting consecutive untrusted word boundaries. Taking "斯诺/登/" as an example, the text fragments delimited by these untrusted boundaries can be merged into "斯诺登" and used as the input of the complex segmenter.
In some embodiments, the complex segmenter may use character-tagging-based segmentation. The rough principle of this method is to convert the problem of segmenting a language text into the problem of assigning a tag to every character of the text. Referring to FIG. 9, the tags B, E, and O in FIG. 9 can denote the position of a character within a word: O denotes a single-character word; B denotes the head of a multi-character word, that is, its first character; and E denotes every position of a multi-character word other than the head.
In some embodiments, the complex segmenter may be trained as follows:
First, a training corpus with segmentation marks can be converted into characters and word-position tags, as shown in FIG. 9.
Second, a machine learning model (maximum entropy model, conditional random field, structured perceptron, and so on) can be used to learn which tag to assign to each character in a given context.
In some embodiments, the complex segmenter may segment as follows:
First, using the complex segmenter's parameters obtained in training, a tag can be assigned to every character of the input sentence.
Then, the segmentation is determined from the character tags.
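The word-to-tag conversion of FIG. 9 and the reverse decoding step can be sketched with the B/E/O tag set described above. This is a sketch of the encoding only; the trained tagger that predicts the tags is not shown.

```python
def words_to_tags(words):
    """O: single-character word; B: first character of a multi-character
    word; E: every following character of that word."""
    tags = []
    for w in words:
        tags += ["O"] if len(w) == 1 else ["B"] + ["E"] * (len(w) - 1)
    return tags

def tags_to_words(chars, tags):
    """Recover the segmentation from the per-character tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag in ("O", "B") or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # "E": extend the current word
    return words
```

Round-tripping a segmentation through `words_to_tags` and `tags_to_words` reproduces it, which is what makes the tagging formulation equivalent to segmentation.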
It should be understood that the embodiments of the present application do not specifically limit the implementation of the complex segmentation module 203; any segmentation method or algorithm with high accuracy and strong out-of-vocabulary recognition can serve as an implementation scheme of the complex segmentation module 203.
In some embodiments, the credibility threshold 101 may be a continuous variable, for example supplied by the user of the segmentation system and passed to the system together with the first language text. The variable can represent the application scenario's requirement on the segmentation result: for example, information retrieval requires fast and highly consistent segmentation, while machine translation or question answering requires highly accurate segmentation. In one example, the credibility threshold can be set to a real number between 0 and 1.
FIG. 10 shows the effect of the credibility threshold on the segmentation result. As FIG. 10 shows, the higher the credibility threshold, the stronger the out-of-vocabulary recognition and ambiguity resolution and the higher the correctness of the segmentation result, while segmentation speed and consistency decrease. The lower the credibility threshold, the faster and more consistent the segmentation, while out-of-vocabulary recognition, ambiguity resolution, and segmentation correctness decline.
The embodiments of the present application are described below in more detail with specific examples. Note that the following examples are only intended to help a person skilled in the art understand the embodiments of the present application, not to limit them to the specific values or scenarios illustrated. A person skilled in the art can obviously make various equivalent modifications or variations from the examples given, and such modifications or variations also fall within the scope of the embodiments of the present application.
Assume the first language text to be processed is: 范登高便和王小聚约定年底之前一定要小聚一次。 (roughly, "Fan Denggao and Wang Xiaoju agreed to get together once before the end of the year"). For an application scenario with a high consistency requirement, the credibility threshold h can be set to 0.2.
First, after the simple segmentation module 201, the following segmentation result is obtained ("\" denotes a word boundary): 范\登高\便\和\王\小聚\约定\年底\之前\一定\要\小聚\一\次\。\
Then, the credibility judgment module 202 can judge the credibility of each word boundary "\" in this segmentation result.
For example, if B_i denotes the i-th word boundary of the above segmentation result, the credibility judgment module 202 can compute the probability P(True|B_i, c) that the boundary is trustworthy (True) in a given context c. When P(True|B_i, c) > h, B_i can be passed to the merge output module 301 as a trusted boundary. When P(True|B_i, c) ≤ h, B_i can be passed to the complex segmentation module 203 for processing as an untrusted boundary. If, after the judgment of the credibility judgment module 202, all boundaries of the above segmentation result are trusted, the whole segmentation result can be output to the merge output module 301.
Then, the merge output module 301 arranges and outputs the segmentation result.
The segmentation result output by the merge output module 301 can be: 范\登高\便\和\王\小聚\约定\年底\之前\一定\要\小聚\一\次\。\
In this embodiment, because the segmentation result must be highly consistent, a low credibility threshold (h = 0.2) was set. The final segmentation result reflects that: 1) the two out-of-vocabulary words "范登高" and "王小聚" were not recognized; 2) "小聚" is segmented consistently in the two fragments "王小聚" and "小聚一次".
The following example uses the same first language text, 范登高便和王小聚约定年底之前一定要小聚一次, with a credibility threshold h = 0.9.
After the credibility judgment module 202, the following segmentation result is obtained ("/" denotes an untrusted word boundary): 范/登高/便\和\王/小聚/约定\年底\之前\一定\要\小聚\一\次\。\
Next, the untrusted-boundary collector collects consecutive untrusted boundaries from the above result to form untrusted intervals (the underlined parts of the example sentence below):
范/登高/便\和\王/小聚/约定\年底\之前\一定\要\小聚\一\次\。\
Then, the complex segmentation module 203 segments each untrusted interval.
After the complex segmentation module 203, both untrusted intervals are recognized as person names.
Then, the merge output module 301 arranges and outputs the segmentation result.
The segmentation result output by the merge output module 301 can be:
范登高\便\和\王小聚\约定\年底\之前\一定\要\小聚\一\次\。\
In this embodiment, the segmentation result must be highly correct, so a high credibility threshold (h = 0.9) was set. The result reflects that both out-of-vocabulary words "范登高" and "王小聚" were recognized, but "小聚" is segmented inconsistently between the two fragments "王小聚" and "小聚一次".
It should be understood that there may be one or more complex segmentation modules in the embodiments of the present application. When there are multiple complex segmentation modules, the output of one complex segmentation module can serve as the input of the next, and before each complex segmentation module performs its segmentation, a new credibility threshold can be received.
The word segmentation method for language text according to the embodiments of the present application has been described above in detail with reference to FIG. 1 to FIG. 10. The word segmentation system for language text according to the embodiments of the present application is described below in detail with reference to FIG. 11 and FIG. 12. It should be understood that the systems of FIG. 11 and FIG. 12 can perform the steps of the method described above; to avoid repetition, details are not described again here.
FIG. 11 is a schematic structural diagram of a word segmentation system for language text according to an embodiment of the present application. The word segmentation system 1100 of FIG. 11 includes:
an input module 1110, configured to obtain a first language text to be processed and a credibility threshold, where the credibility threshold is used to indicate the word segmentation accuracy, word segmentation speed, or word segmentation consistency required for the first language text;
a first segmentation module 1120, configured to segment the first language text using a first word segmentation method to obtain a first word boundary set;
a credibility judgment module 1130, configured to divide the first word boundary set, according to the credibility threshold, into a trusted second word boundary set and an untrusted third word boundary set;
a selection module 1140, configured to select, according to the third word boundary set, a second language text from the first language text, where the second language text includes the word corresponding to each word boundary in the third word boundary set;
a second segmentation module 1150, configured to segment the second language text using a second word segmentation method to obtain a fourth word boundary set, where the word segmentation accuracy of the second word segmentation method is higher than that of the first word segmentation method;
an output module 1160, configured to determine the second word boundary set and the fourth word boundary set as the word segmentation result of the first language text.
By adjusting the credibility threshold, the word segmentation accuracy required for the first language text can be flexibly adjusted, so that the system can adapt to various application scenarios with different requirements on segmentation accuracy. For example, in a scenario requiring high segmentation accuracy, the user can input a higher credibility threshold, so that more word boundaries are re-segmented by the more accurate second method; in a scenario requiring high speed rather than high accuracy, the user can input a lower credibility threshold.
Optionally, in an embodiment, the credibility judgment module 1130 is specifically configured to: select at least one word corresponding to each word boundary from the context of each word boundary in the first word boundary set; extract features of the at least one word corresponding to each word boundary; determine, using a pre-trained classifier and according to the features of the at least one word corresponding to each word boundary, the credibility of each word boundary in the context; add the word boundaries of the first word boundary set whose credibility is greater than the credibility threshold to the second word boundary set; and add the word boundaries of the first word boundary set whose credibility is less than or equal to the credibility threshold to the third word boundary set.
Optionally, in an embodiment, the credibility judgment module 1130 is specifically configured to determine the credibility of each word boundary in the context according to

P(True|B_i, c) = exp(S(True, B_i, c)) / Σ_{t∈{True,False}} exp(S(t, B_i, c))

where P(True|B_i, c) represents the credibility of the i-th word boundary B_i of the first word boundary set in the context c, S(t, B_i, c) represents the score of the i-th word boundary B_i in the context c,

S(t, B_i, c) = Σ_j β_j f_j(t, B_i, c),

f_j(t, B_i, c) represents the j-th feature among the features of the at least one word, β_j represents a parameter of the classifier, t represents a class of the classifier, and t ∈ {True, False}.
Optionally, in an embodiment, the credibility judgment module 1130 is specifically configured to select, from the context of each word boundary, the word corresponding to each word boundary, the word preceding the word corresponding to each word boundary, and the word following the word corresponding to each word boundary.
Optionally, in an embodiment, the parameters of the classifier are parameters obtained through training based on a target language text, where the target language text is a language text obtained by segmenting, using the first word segmentation method, a language text whose word boundaries are known.
FIG. 12 is a schematic structural diagram of a word segmentation system for language text according to an embodiment of the present application. The word segmentation system 1200 of FIG. 12 includes:
a memory 1210, configured to store a program;
a processor 1220, configured to execute the program in the memory 1210. When the program is executed, the processor 1220 obtains a first language text to be processed and a credibility threshold, where the credibility threshold is used to indicate the word segmentation accuracy, word segmentation speed, or word segmentation consistency required for the first language text; segments the first language text using a first word segmentation method to obtain a first word boundary set; divides the first word boundary set, according to the credibility threshold, into a trusted second word boundary set and an untrusted third word boundary set; selects, according to the third word boundary set, a second language text from the first language text, where the second language text includes the word corresponding to each word boundary in the third word boundary set; segments the second language text using a second word segmentation method to obtain a fourth word boundary set, where the word segmentation accuracy of the second word segmentation method is higher than that of the first word segmentation method; and determines the second word boundary set and the fourth word boundary set as the word segmentation result of the first language text.
By adjusting the credibility threshold, the word segmentation accuracy required for the first language text can be flexibly adjusted, so that the system can adapt to various application scenarios with different requirements on segmentation accuracy. For example, in a scenario requiring high segmentation accuracy, the user can input a higher credibility threshold, so that more word boundaries are re-segmented by the more accurate second method; in a scenario requiring high speed rather than high accuracy, the user can input a lower credibility threshold.
Optionally, in an embodiment, the processor 1220 is specifically configured to: select at least one word corresponding to each word boundary from the context of each word boundary in the first word boundary set; extract features of the at least one word corresponding to each word boundary; determine, using a pre-trained classifier and according to the features of the at least one word corresponding to each word boundary, the credibility of each word boundary in the context; add the word boundaries of the first word boundary set whose credibility is greater than the credibility threshold to the second word boundary set; and add the word boundaries of the first word boundary set whose credibility is less than or equal to the credibility threshold to the third word boundary set.
Optionally, in an embodiment, the processor 1220 is specifically configured to determine the credibility of each word boundary in the context according to

P(True|B_i, c) = exp(S(True, B_i, c)) / Σ_{t∈{True,False}} exp(S(t, B_i, c))

where P(True|B_i, c) represents the credibility of the i-th word boundary B_i of the first word boundary set in the context c, S(t, B_i, c) represents the score of the i-th word boundary B_i in the context c,

S(t, B_i, c) = Σ_j β_j f_j(t, B_i, c),

f_j(t, B_i, c) represents the j-th feature among the features of the at least one word, β_j represents a parameter of the classifier, t represents a class of the classifier, and t ∈ {True, False}.
Optionally, in an embodiment, the processor 1220 is specifically configured to select, from the context of each word boundary, the word corresponding to each word boundary, the word preceding the word corresponding to each word boundary, and the word following the word corresponding to each word boundary.
Optionally, in an embodiment, the parameters of the classifier are parameters obtained through training based on a target language text, where the target language text is a language text obtained by segmenting, using the first word segmentation method, a language text whose word boundaries are known.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not described again here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely examples: the division into units is only a division by logical function, and in actual implementation there may be other divisions; multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
When the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the part of the technical solution of the present application that is essential, or that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

  1. 一种语言文本的分词方法,其特征在于,包括:
    获取待处理的第一语言文本和可信度阈值,所述可信度阈值用于指示所述第一语言文本所需的分词精度、分词速度或分词一致性;
    采用第一分词方式,对所述第一语言文本进行分词,得到第一词边界集合;
    根据所述可信度阈值,将所述第一词边界集合划分成可信的第二词边界集合和不可信的第三词边界集合;
    根据所述第三词边界集合,从所述第一语言文本中选取第二语言文本,所述第二语言文本包括所述第三词边界集合中的每个词边界对应的词;
    采用第二分词方式,对所述第二语言文本进行分词,得到第四词边界集合,其中,所述第二分词方式的分词精度高于所述第一分词方式的分词精度;
    将所述第二词边界集合和所述第四词边界集合确定为所述第一语言文本的分词结果。
  2. 如权利要求1所述的方法,其特征在于,所述根据所述可信度阈值,将所述第一词边界集合划分成可信的第二词边界集合和不可信的第三词边界集合,包括:
    从所述第一词边界集合中的每个词边界的上下文中选取所述每个词边界对应的至少一个词;
    提取所述每个词边界对应的至少一个词的特征;
    根据所述每个词边界对应的至少一个词的特征,通过预先训练得到的分类器,确定所述每个词边界在所述上下文中的可信度;
    将所述第一词边界集合中的可信度大于所述可信度阈值的词边界添加至所述第二词边界集合;
    将所述第一词边界集合中的可信度小于或等于所述可信度阈值的词边界添加至所述第三词边界集合。
  3. 如权利要求2所述的方法,其特征在于,所述根据所述每个词边界对应的至少一个词的特征,通过预先训练得到的分类器,确定所述每个词边界在所述上下文中的可信度,包括:
    根据
    Figure PCTCN2017077830-appb-100001
    确定所述每个词边界在所述上下文中的可信度,其中,P(True|Bi,c)表示所述第一词边界集合中的第i个词边界Bi在所述上下文c中的可信度,S(t,Bi,c)表示所述第i个词边界Bi在所述上下文c中的得分,
    Figure PCTCN2017077830-appb-100002
    fj(t,Bi,c)表示所述至少一个词的特征中的第j个特征,βj表示所述分类器的参数,t表示所述分类器对应的类,且t∈{True,False}。
  4. 如权利要求2或3所述的方法,其特征在于,所述从所述第一词边界集合中的每个词边界的上下文中选取所述每个词边界对应的至少一个词,包括:
    从所述每个词边界的上下文中选取所述每个词边界对应的词、所述每个词边界对应的词的前一词,以及所述每个词边界对应的词的后一词。
  5. 如权利要求2-4中任一项所述的方法,其特征在于,所述分类器的参数是基于目标语言文本训练得到的参数,所述目标语言文本是采用第一分词方式对词边界已知的语言文本进行分词后得到的语言文本。
  6. 一种语言文本的分词系统,其特征在于,包括:
    输入模块,用于获取待处理的第一语言文本和可信度阈值,所述可信度阈值用于指示所述第一语言文本所需的分词精度、分词速度或分词一致性;
    第一分词模块,用于采用第一分词方式,对所述第一语言文本进行分词,得到第一词边界集合;
    可信度判断模块,用于根据所述可信度阈值,将所述第一词边界集合划分成可信的第二词边界集合和不可信的第三词边界集合;
    选取模块,用于根据所述第三词边界集合,从所述第一语言文本中选取第二语言文本,所述第二语言文本包括所述第三词边界集合中的每个词边界对应的词;
    第二分词模块,用于采用第二分词方式,对所述第二语言文本进行分词,得到第四词边界集合,其中,所述第二分词的分词精度高于所述第一分词的分词精度;
    输出模块,用于将所述第二词边界集合和所述第四词边界集合确定为所述第一语言文本的分词结果。
  7. 如权利要求6所述的分词系统,其特征在于,所述可信度判断模块具体用于从所述第一词边界集合中的每个词边界的上下文中选取所述每个词边界对应的至少一个词;提取所述每个词边界对应的至少一个词的特征;根据所述每个词边界对应的至少一个词的特征,通过预先训练得到的分类器,确定所述每个词边界在所述上下文中的可信度;将所述第一词边界集合中的可信度大于所述可信度阈值的词边界添加至所述第二词边界集合;将所述第一词边界集合中的可信度小于或等于所述可信度阈值的词边界添加至所述第三词边界集合。
  8. 如权利要求7所述的分词系统,其特征在于,所述可信度判断模块具体用于根据
    Figure PCTCN2017077830-appb-100003
    确定所述每个词边界在所述上下文中的可信度,其中,P(True|Bi,c)表示所述第一词边界集合中的第i个词边界Bi在所述上下文c中的可信度,S(t,Bi,c)表示所述第i个词边界Bi在所述上下文c中的得分,
    Figure PCTCN2017077830-appb-100004
    fj(t,Bi,c)表示所述至少一个词的特征中的第j个特征,βj表示所述分类器的参数,t表示所述分类器对应的类,且t∈{True,False}。
  9. 如权利要求7或8所述的分词系统,其特征在于,所述可信度判断模块具体用于从所述每个词边界的上下文中选取所述每个词边界对应的词、所述每个词边界对应的词的前一词,以及所述每个词边界对应的词的后一词。
  10. 如权利要求7-9中任一项所述的分词系统,其特征在于,所述分类器的参数是基于目标语言文本训练得到的参数,所述目标语言文本是采用第一分词方式对词边界已知的语言文本进行分词后得到的语言文本。
PCT/CN2017/077830 2016-04-12 2017-03-23 Word segmentation method and system for language text WO2017177809A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP17781785.5A EP3416064B1 (en) 2016-04-12 2017-03-23 Word segmentation method and system for language text
US16/134,393 US10691890B2 (en) 2016-04-12 2018-09-18 Word segmentation method and system for language text

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610225943.3 2016-04-12
CN201610225943.3A CN107291684B (zh) 2016-04-12 2016-04-12 语言文本的分词方法和系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/134,393 Continuation US10691890B2 (en) 2016-04-12 2018-09-18 Word segmentation method and system for language text

Publications (1)

Publication Number Publication Date
WO2017177809A1 true WO2017177809A1 (zh) 2017-10-19

Family

ID=60041359

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/077830 WO2017177809A1 (zh) 2016-04-12 2017-03-23 语言文本的分词方法和系统

Country Status (4)

Country Link
US (1) US10691890B2 (zh)
EP (1) EP3416064B1 (zh)
CN (1) CN107291684B (zh)
WO (1) WO2017177809A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020151218A1 (zh) * 2019-01-22 2020-07-30 福建亿榕信息技术有限公司 电力专业词库生成方法及装置、存储介质

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291684B (zh) * 2016-04-12 2021-02-09 华为技术有限公司 语言文本的分词方法和系统
CN110569496B (zh) * 2018-06-06 2022-05-17 腾讯科技(深圳)有限公司 实体链接方法、装置及存储介质
CN109408818B (zh) * 2018-10-12 2023-04-07 平安科技(深圳)有限公司 新词识别方法、装置、计算机设备及存储介质
CN109388806B (zh) * 2018-10-26 2023-06-27 北京布本智能科技有限公司 一种基于深度学习及遗忘算法的中文分词方法
CN110222329B (zh) * 2019-04-22 2023-11-24 平安科技(深圳)有限公司 一种基于深度学习的中文分词方法和装置
CN110390948B (zh) * 2019-07-24 2022-04-19 厦门快商通科技股份有限公司 一种快速语音识别的方法及系统
CN111274353B (zh) * 2020-01-14 2023-08-01 百度在线网络技术(北京)有限公司 文本切词方法、装置、设备和介质
CN112069319B (zh) * 2020-09-10 2024-03-22 杭州中奥科技有限公司 文本抽取方法、装置、计算机设备和可读存储介质
CN112131866A (zh) * 2020-09-25 2020-12-25 马上消费金融股份有限公司 一种分词方法、装置、设备及可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479191A (zh) * 2010-11-22 2012-05-30 阿里巴巴集团控股有限公司 提供多粒度分词结果的方法及其装置
CN104866472A (zh) * 2015-06-15 2015-08-26 百度在线网络技术(北京)有限公司 分词训练集的生成方法和装置
CN104899190A (zh) * 2015-06-04 2015-09-09 百度在线网络技术(北京)有限公司 分词词典的生成方法和装置及分词处理方法和装置
US20160026618A1 (en) * 2002-12-24 2016-01-28 At&T Intellectual Property Ii, L.P. System and method of extracting clauses for spoken language understanding

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7031911B2 (en) * 2002-06-28 2006-04-18 Microsoft Corporation System and method for automatic detection of collocation mistakes in documents
CN100504851C (zh) 2007-06-27 2009-06-24 腾讯科技(深圳)有限公司 一种中文分词方法及系统
US20090326916A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Unsupervised chinese word segmentation for statistical machine translation
CN101739393B (zh) * 2008-11-20 2012-07-04 苗玉水 汉语文本智能分词法
CN101950284B (zh) * 2010-09-27 2013-05-08 北京新媒传信科技有限公司 中文分词方法及系统
CN102402502A (zh) * 2011-11-24 2012-04-04 北京趣拿信息技术有限公司 用于搜索引擎的分词处理方法和装置
CN103324626B (zh) * 2012-03-21 2016-06-29 北京百度网讯科技有限公司 一种建立多粒度词典的方法、分词的方法及其装置
CN103324612B (zh) 2012-03-22 2016-06-29 北京百度网讯科技有限公司 一种分词的方法及装置
CN102693219B (zh) * 2012-06-05 2014-11-05 苏州大学 一种中文事件的抽取方法及系统
CN104281622B (zh) * 2013-07-11 2017-12-05 华为技术有限公司 一种社交媒体中的信息推荐方法和装置
CN104462051B (zh) * 2013-09-12 2018-10-02 腾讯科技(深圳)有限公司 分词方法及装置
CN104750705B (zh) * 2013-12-27 2019-05-28 华为技术有限公司 信息回复方法及装置
CN104951458B (zh) * 2014-03-26 2019-03-01 华为技术有限公司 基于语义识别的帮助处理方法及设备
CN105095182B (zh) * 2014-05-22 2018-11-06 华为技术有限公司 一种回复信息推荐方法及装置
CN104317882B (zh) 2014-10-21 2017-05-10 北京理工大学 一种决策级中文分词融合方法
CN106372053B (zh) * 2015-07-22 2020-04-28 华为技术有限公司 句法分析的方法和装置
CN105446955A (zh) * 2015-11-27 2016-03-30 贺惠新 一种自适应的分词方法
CN107291684B (zh) * 2016-04-12 2021-02-09 华为技术有限公司 语言文本的分词方法和系统
CN107608973A (zh) * 2016-07-12 2018-01-19 华为技术有限公司 一种基于神经网络的翻译方法及装置
CN108269110B (zh) * 2016-12-30 2021-10-26 华为技术有限公司 基于社区问答的物品推荐方法、系统及用户设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160026618A1 (en) * 2002-12-24 2016-01-28 At&T Intellectual Property Ii, L.P. System and method of extracting clauses for spoken language understanding
CN102479191A (zh) * 2010-11-22 2012-05-30 阿里巴巴集团控股有限公司 提供多粒度分词结果的方法及其装置
CN104899190A (zh) * 2015-06-04 2015-09-09 百度在线网络技术(北京)有限公司 分词词典的生成方法和装置及分词处理方法和装置
CN104866472A (zh) * 2015-06-15 2015-08-26 百度在线网络技术(北京)有限公司 分词训练集的生成方法和装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3416064A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020151218A1 (zh) * 2019-01-22 2020-07-30 福建亿榕信息技术有限公司 电力专业词库生成方法及装置、存储介质

Also Published As

Publication number Publication date
EP3416064A1 (en) 2018-12-19
CN107291684A (zh) 2017-10-24
US20190018836A1 (en) 2019-01-17
EP3416064A4 (en) 2019-04-03
CN107291684B (zh) 2021-02-09
US10691890B2 (en) 2020-06-23
EP3416064B1 (en) 2023-05-10

Similar Documents

Publication Publication Date Title
WO2017177809A1 (zh) 语言文本的分词方法和系统
US10061768B2 (en) Method and apparatus for improving a bilingual corpus, machine translation method and apparatus
CN105095204B (zh) 同义词的获取方法及装置
US11113470B2 (en) Preserving and processing ambiguity in natural language
TW202020691A (zh) 特徵詞的確定方法、裝置和伺服器
WO2016127677A1 (zh) 地址结构化方法及装置
CN110413787B (zh) 文本聚类方法、装置、终端和存储介质
CN103678684A (zh) 一种基于导航信息检索的中文分词方法
CN105068997B (zh) 平行语料的构建方法及装置
CN110427612B (zh) 基于多语言的实体消歧方法、装置、设备和存储介质
CN110083832B (zh) 文章转载关系的识别方法、装置、设备及可读存储介质
CN113590810B (zh) 摘要生成模型训练方法、摘要生成方法、装置及电子设备
CN108763192B (zh) 用于文本处理的实体关系抽取方法及装置
JP2019032704A (ja) 表データ構造化システムおよび表データ構造化方法
JP2018025956A (ja) モデル作成装置、推定装置、方法、及びプログラム
JP2021501387A (ja) 自然言語処理のための表現を抽出するための方法、コンピュータ・プログラム及びコンピュータ・システム
JP5097802B2 (ja) ローマ字変換を用いる日本語自動推薦システムおよび方法
JP4979637B2 (ja) 複合語の区切り位置を推定する複合語区切り推定装置、方法、およびプログラム
US8666987B2 (en) Apparatus and method for processing documents to extract expressions and descriptions
CN109325237B (zh) 用于机器翻译的完整句识别方法与系统
US10002450B2 (en) Analyzing a document that includes a text-based visual representation
KR101126186B1 (ko) 형태적 중의성 동사 분석 장치, 방법 및 그 기록 매체
CN113553410B (zh) 长文档处理方法、处理装置、电子设备和存储介质
Pan Collaborative Recognition and Recovery of the Chinese Intercept Abbreviation
Liu et al. Collaborative Recognition and Recovery of the Chinese Intercept Abbreviation

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2017781785

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2017781785

Country of ref document: EP

Effective date: 20180913

NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17781785

Country of ref document: EP

Kind code of ref document: A1