CN1193304C - Method and system for identifying property of new word in non-divided text - Google Patents

Method and system for identifying property of new word in non-divided text

Publication number
CN1193304C
Authority
CN
China
Prior art keywords
speech
character
probability
sequence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB011353570A
Other languages
Chinese (zh)
Other versions
CN1369877A (en
Inventor
吴安迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Publication of CN1369877A
Application granted
Publication of CN1193304C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a method and apparatus for segmenting text by identifying new or rare words in the text. Under the present invention, a sub-string of single characters in the text is identified. For each character in the sub-string, an independent word probability is calculated that indicates the probability that each single character represents a single-character word. The probabilities for all of the characters in the sub-string are combined to form a total probability. If the total probability is below a threshold, the characters in the sub-string are considered to form a single multi-character word. In a further embodiment, the system determines parts of speech for multi-character words that are not found in the dictionary.

Description

Method for segmenting an input character sequence of an unsegmented language
Technical field
The present invention relates generally to computer-based methods for recognizing text. More specifically, the invention relates to segmenting text that contains new words in a language.
Background technology
Word segmentation is the process of identifying the individual words that make up an expression of language, such as a text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, performing natural language understanding, and searching a collection of documents for specific words or phrases.
Because spaces and punctuation marks usually delimit the words in a text, segmenting English text is fairly straightforward. In unsegmented text such as Japanese or Chinese, however, word boundaries are implicit rather than explicit. That is, unsegmented text typically contains no spaces or punctuation between words. These languages therefore cannot be segmented in the same way as English.
Most existing systems segment text with a simple word breaker. Such word breakers typically combine characters into possible segments and then look those segments up in a dictionary. If a segment is found in the dictionary, it is retained as one possible segmentation of the text.
Although this technique works well for words that are in the dictionary, it does not work for rarely used words or for new words in the language, because such words typically cannot be found in the dictionary. In general, instead of identifying a group of characters as forming a single "rare" word, the dictionary technique identifies the group as a set of single-character words.
Some systems try to extend dictionary-based segmentation with statistical calculations that help identify these "rare" words. Under this approach, the probability of a multi-character "rare" word is compared with a threshold. If the probability exceeds the threshold, the character group is identified as a word. However, because a "rare" word is by definition rare in the language, its probability is almost always below the threshold. Thus, even these extended systems usually fail to identify "rare" words correctly.
If the text is to be parsed grammatically, the segmentation system must not only identify the rare or new words in the text but also identify the possible parts of speech of those words. A segmentation system is therefore needed that handles rarely used words better and that can identify parts of speech for rare or new words.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for segmenting text by identifying new or rare words in the text. Under the invention, a sub-string of single characters in the text is identified. For each character in the sub-string, an independent word probability is calculated that indicates the probability that the character represents a single-character word. The probabilities of all characters in the sub-string are combined to form a total probability. If the total probability is below a threshold, the characters in the sub-string are considered to form a single multi-character word.
In a further embodiment, the system determines parts of speech for multi-character words that are not found in the dictionary. To do so, the system determines a set of probabilities for each character in the word. Each probability describes the likelihood of finding that character at its current position in a word that has the same length as the multi-character word and a particular part of speech. For example, for a two-character word "AB", the system can determine for character "A" a first probability of appearing as the first character of a two-character noun, a second probability of appearing as the first character of a two-character verb, and a third probability of appearing as the first character of a two-character adjective.
The probabilities of the individual characters are combined on a per-part-of-speech basis to form a separate total probability for each part of speech. Each total probability is then compared with a threshold. In the example above, the probability of "A" appearing as the first character of a two-character noun is combined with the probability of "B" appearing as the second character of a two-character noun to give the total probability that "AB" is a noun. Each part of speech whose total probability exceeds the threshold is added as a possible part of speech of the multi-character word.
Description of drawings
Fig. 1 is a block diagram of an exemplary general-purpose computing system suitable for implementing the present invention.
Fig. 2 is a block diagram of a hand-held device in which the present invention may be practiced.
Fig. 3 is a more detailed block diagram of the components of one embodiment of the present invention.
Fig. 4 is a flow diagram of a method of segmenting text and identifying parts of speech under one example embodiment of the present invention.
Embodiment
Fig. 1 illustrates an example of a suitable computing system environment 100 in which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Nor should the computing environment be interpreted as having any dependency or requirement relating to any one component, or combination of components, illustrated in the exemplary operating environment 100.
The invention is operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so on, that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments, where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
With reference to Fig. 1, an exemplary system for implementing the invention includes a general-purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components, including the system memory, to processing unit 120. System bus 121 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, also known as the Mezzanine bus.
Computer 110 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computer 110 and include both volatile and nonvolatile media and both removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computer 110. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
System memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read-only memory (ROM) 131 and random-access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, Fig. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
Computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, Fig. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid-state RAM, solid-state ROM, and the like. Hard disk drive 141 is typically connected to system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to system bus 121 by a removable memory interface such as interface 150.
The drives and their associated computer storage media discussed above and illustrated in Fig. 1 provide storage of computer-readable instructions, data structures, program modules, and other data for computer 110. In Fig. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
A user may enter commands and information into computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball, or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to processing unit 120 through a user input interface 160 that is coupled to the system bus, but they may be connected by other interface and bus structures, such as a parallel port, game port, or universal serial bus (USB). A monitor 191 or other type of display device is also connected to system bus 121 via an interface, such as video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through output peripheral interface 190.
Computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer 180. Remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to computer 110. The logical connections depicted in Fig. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.
When used in a LAN networking environment, computer 110 is connected to LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, computer 110 typically includes a modem 172 or other means for establishing communications over WAN 173, such as the Internet. Modem 172, which may be internal or external, may be connected to system bus 121 via user input interface 160 or another appropriate mechanism. In a networked environment, program modules depicted relative to computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, Fig. 1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between the computers may be used.
Fig. 2 is a block diagram of a mobile device 200, which is an exemplary computing environment. Mobile device 200 includes a microprocessor 202, memory 204, input/output (I/O) components 206, and a communication interface 208 for communicating with remote computers or other mobile devices. In one embodiment, these components are coupled for communication with one another over a suitable bus 210.
Memory 204 is implemented as non-volatile electronic memory, such as random-access memory (RAM) with a battery back-up module (not shown), so that information stored in memory 204 is not lost when the general power to mobile device 200 is shut down. A portion of memory 204 is preferably allocated as addressable memory for program execution, while another portion of memory 204 is preferably used for storage, such as to simulate storage on a disk drive.
Memory 204 includes an operating system 212, application programs 214, and an object store 216. During operation, operating system 212 is preferably executed by processor 202 from memory 204. In a preferred embodiment, operating system 212 is a WINDOWS CE brand operating system commercially available from Microsoft. Operating system 212 is preferably designed for mobile devices and implements database features that applications 214 can use through a set of exposed application programming interfaces and methods. The objects in object store 216 are maintained by applications 214 and operating system 212, at least partially in response to calls to the exposed application programming interfaces and methods.
Communication interface 208 represents the numerous devices and technologies that allow mobile device 200 to send and receive information. The devices include, for example, wired and wireless modems, satellite receivers, and broadcast tuners. Mobile device 200 can also be directly connected to a computer to exchange data with it. In such cases, communication interface 208 can be an infrared transceiver or a serial or parallel communication connection, all of which are capable of transmitting streaming information.
Input/output components 206 include a variety of input devices, such as a touch-sensitive screen, buttons, rollers, and a microphone, as well as a variety of output devices, including an audio generator, a vibrating device, and a display. The devices listed above are examples and need not all be present on mobile device 200. In addition, mobile device 200 may have other input/output devices within the scope of the invention.
Some embodiments of the present invention provide a method and apparatus for segmenting text by identifying rare and new multi-character words. Other embodiments provide a method and system for identifying the parts of speech of words that are not in the dictionary. Fig. 3 is a block diagram of the various components of one embodiment of the invention. Fig. 4 is a flow diagram of a method under one embodiment of the invention that uses the components of Fig. 3.
In step 400 of Fig. 4, word breaker 302 of Fig. 3 identifies combinations of adjacent characters in input text 300 that appear in a small lexicon (also referred to as a dictionary) 304. Small lexicon 304 is small in the sense that it stores only a limited amount of grammatical information for each word. It does not necessarily contain a small number of words. In fact, in some embodiments, small lexicon 304 contains a large number of words.
Under one embodiment of the invention, word breaker 302 searches for words in small lexicon 304 using a data structure called a trie. In a trie, words are not listed sequentially but are instead represented by chains of states. Each state represents an individual character and contains one or more sub-states, where each sub-state contains a character that follows the current state's character in at least one word of small lexicon 304. Each state also indicates whether its character appears as the last character of a word formed by the chain of states leading to it.
With the trie data structure, the possible words in a character string such as "ABCD" can be determined in parallel. For example, the system can start at the state associated with character "A". If that state indicates that "A" appears by itself as a word in small lexicon 304, "A" is marked as one possible segment of the string. The system then checks whether a sub-state for character "B" extends from the "A" state. If a "B" sub-state exists, the "B" state is examined to determine whether "B" is the last character of any word. If it is, the string "AB" is identified as a possible segment. The system then checks whether a sub-state for character "C" extends from the "B" state. If no sub-state for "C" extends from the current state, the system stops following the current chain and begins a new chain starting at character "B". The process of starting a new chain is repeated for each character in the input string, so that every character is examined as the possible start of a chain.
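The trie lookup described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the names `TrieNode`, `build_trie`, and `find_lexicon_words`, and the toy lexicon, are assumptions introduced here.

```python
class TrieNode:
    """One state: sub-states keyed by character, plus an end-of-word flag."""

    def __init__(self):
        self.children = {}
        self.is_word_end = False


def build_trie(words):
    """Build a trie from an iterable of lexicon words (hypothetical helper)."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word_end = True
    return root


def find_lexicon_words(text, root):
    """Return (start, end) spans of every lexicon word found in text."""
    spans = []
    for start in range(len(text)):      # each character may start a chain
        node = root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:            # no sub-state: abandon this chain
                break
            if node.is_word_end:        # chain so far spells a lexicon word
                spans.append((start, end + 1))
    return spans


# Toy lexicon, assumed for illustration only.
lexicon = build_trie(["A", "AB", "BCD", "D"])
found = ["ABCD"[s:e] for s, e in find_lexicon_words("ABCD", lexicon)]
print(found)  # ['A', 'AB', 'BCD', 'D']
```

Restarting the walk at every character position mirrors the statement that each character is examined as the possible start of a chain.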
Once the words stored in small lexicon 304 have been identified in step 400, the method of Fig. 4 proceeds to step 402, where word breaker 302 uses a set of proper name rules 306 to identify words that are not stored in small lexicon 304 but that represent proper names, such as personal names or place names.
The words identified from small lexicon 304 and proper name rules 306 are placed in a word lattice, which is provided to a lexical lookup 310 that accesses a large lexicon 312. Large lexicon 312 contains more lexical information than small lexicon 304. In fact, in many embodiments, small lexicon 304 is built from, and periodically updated with reference to, large lexicon 312.
Using large lexicon 312, lexical lookup 310 expands, at step 406 of Fig. 4, the amount of lexical information stored in the word lattice for each word in the lattice. The added information includes such things as the source of the word, whether the word can be used in a proper noun, and other morphological and grammatical details of the word.
The word lattice with expanded lexical information is passed from lexical lookup 310 to derivational morphology 314. At step 408 of Fig. 4, derivational morphology 314 combines adjacent segments in the lattice to form larger multi-segment words. For example, derivational morphology 314 can append suffix strings, insert infix strings, and prepend prefix strings to other segments to form larger words. In some embodiments, some or all of these derivational morphology rules are applied by word breaker 302 in step 402 rather than by morphology component 314 in step 408. Morphology component 314 has the advantage, however, that the richer information available in the large lexicon can be fed into the derivational morphology rules. In addition, derivational morphology component 314 can be used to identify compound segments and to extract named entities, such as names of people, institutions, and geographic locations, as well as other units such as dates and times.
The larger words built by derivational morphology 314, along with their morphological information, are added to the word lattice. In most embodiments, the larger words built by derivational morphology 314 do not replace the smaller segments. Instead, they are placed in the lattice alongside them.
The expanded word lattice produced by derivational morphology 314 is provided to new word identifier 320, which attempts to identify new words at step 410 of Fig. 4. To do so, new word identifier 320 first searches the lattice for sequences of single characters that are not part of any multi-character word in the lattice. For each sequence found, new word identifier 320 determines a probability describing the likelihood that every character in the sequence is a single-character word. Under one embodiment, this is done by summing or averaging the independent word probabilities of the characters in the sequence, where each independent word probability describes the likelihood that the character appears as a single-character word in a piece of text.
Under one embodiment, the independent word probability of a character is:
$$IWP(c) = \frac{N(Word(c))}{N(c)} \qquad (1)$$
where N(Word(c)) is the number of times character c occurs as a single-character word in a corpus, N(c) is the total number of times c occurs in the corpus (as a single-character word or within a multi-character word), and IWP(c) is the independent word probability of the character. In some embodiments, the probability for each character is computed from a parsed corpus by identifying all leaves of the parse structures and counting both each occurrence of the character and each occurrence of the character as an independent word.
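As a rough illustration of the counting just described, the following Python sketch computes the independent word probability (IWP) of each character from a corpus that has already been segmented into words. The function name `independent_word_probabilities` and the toy corpus are assumptions made for this example. The source obtains its word boundaries from the leaves of parse structures rather than from a pre-segmented list.

```python
from collections import Counter


def independent_word_probabilities(segmented_corpus):
    """Map each character c to N(Word(c)) / N(c).

    segmented_corpus: iterable of sentences, each a list of words.
    """
    single_word_counts = Counter()  # N(Word(c)): c alone forms a word
    total_counts = Counter()        # N(c): every occurrence of c
    for sentence in segmented_corpus:
        for word in sentence:
            for ch in word:
                total_counts[ch] += 1
            if len(word) == 1:
                single_word_counts[word] += 1
    return {ch: single_word_counts[ch] / total_counts[ch]
            for ch in total_counts}


# Toy corpus, assumed for illustration only.
toy_corpus = [["A", "BC", "A"], ["AB", "C"]]
iwp = independent_word_probabilities(toy_corpus)
print(iwp["C"])  # 0.5
```

Here "C" occurs twice, once inside "BC" and once on its own, so its IWP is 1/2.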
Under one embodiment, the parsed corpus used to produce the independent word probabilities is not large enough to contain every character. This is not a problem as long as the corpus contains the characters that frequently occur as single-character words.
Under one embodiment, the independent word probability is computed for each character in the corpus before the new word identifier is used. These probabilities are stored in independent word probability storage 322, which identifier 320 accesses during new word identification.
When new word identifier 320 finds a sequence of single characters that is not part of a multi-character word, it retrieves the independent word probabilities stored for those characters in storage 322. The independent word probabilities of the characters in the sequence are summed or averaged to produce a total independent word probability for the sequence.
The total independent word probability of the sequence is compared with a threshold probability to decide whether the characters are more likely to form a single word or a sequence of single-character words. If the total is less than the threshold, the sequence is considered to form a single word, and this new word is added to the word lattice. If the total is greater than the threshold, the sequence is considered to be a sequence of single-character words.
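The threshold test above can be sketched as follows. The function names, the probability values, and the threshold of 0.3 are all hypothetical, since the source gives no concrete numbers. Treating a character missing from the probability table as having probability 0.0 is likewise an assumption of this sketch.

```python
def total_iwp(sequence, iwp, combine="average"):
    """Combine per-character independent word probabilities.

    Characters missing from the table are treated as 0.0 (an assumption).
    """
    probs = [iwp.get(ch, 0.0) for ch in sequence]
    total = sum(probs)
    return total / len(probs) if combine == "average" else total


def is_new_word(sequence, iwp, threshold):
    """True if the sequence is more likely one word than several."""
    return total_iwp(sequence, iwp) < threshold


# Hypothetical probabilities and threshold, for illustration only.
iwp_table = {"A": 0.1, "B": 0.2, "C": 0.9}
print(is_new_word("AB", iwp_table, threshold=0.3))  # True
print(is_new_word("BC", iwp_table, threshold=0.3))  # False
```

A low total means the characters rarely stand alone as words, which is why falling below the threshold, rather than exceeding it, marks the sequence as a new word.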
Under one embodiment, different thresholds are used for sequences of different lengths. For example, a two-character sequence has one threshold and a four-character sequence has a different one. Under some embodiments, sequences are limited to two, three, or four characters.
For a single-character sequence of more than two characters, one embodiment of the invention determines a total independent word probability for each possible combination of adjacent characters. For example, if the sequence is "ABC", new word identifier 320 produces a separate total independent word probability for each of "AB", "BC", and "ABC". Each sequence whose total is less than the threshold is then added to the word lattice as a new word. Thus, if the total for "AB" is greater than the threshold while the totals for "BC" and "ABC" are less than the threshold, both "BC" and "ABC" are added to the lattice as possible words, but "AB" is not.
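The "ABC" example can be reproduced with a short sketch that enumerates every run of adjacent characters and applies a length-specific threshold. The function name `candidate_new_words`, the probabilities, and the thresholds are hypothetical, and averaging is used as one of the two combination methods the text allows.

```python
def candidate_new_words(sequence, iwp, thresholds):
    """Return every run of adjacent characters whose averaged IWP is
    below the threshold for its length.

    thresholds: dict mapping run length to threshold (values assumed).
    """
    found = []
    for length in sorted(thresholds):
        for start in range(len(sequence) - length + 1):
            sub = sequence[start:start + length]
            total = sum(iwp[ch] for ch in sub) / length
            if total < thresholds[length]:
                found.append(sub)
    return found


# Hypothetical values chosen so that "AB" fails and "BC", "ABC" pass.
iwp_values = {"A": 0.9, "B": 0.3, "C": 0.1}
length_thresholds = {2: 0.5, 3: 0.5, 4: 0.5}
candidates = candidate_new_words("ABC", iwp_values, length_thresholds)
print(candidates)  # ['BC', 'ABC']
```

Because overlapping candidates such as "BC" and "ABC" are both kept, choosing among them is left to whatever consumes the lattice afterwards.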
In the embodiment of Figs. 3 and 4, before a new word is added to the word lattice, its part of speech is determined by part-of-speech identifier 324 at step 412 of Fig. 4. Note that a single new word may represent several parts of speech. Part-of-speech identifier 324 therefore attempts to identify all of the parts of speech that the new word may represent.
To determine which parts of speech a new word may represent, part-of-speech identifier 324 uses part-of-speech probabilities for each character in the word. A character's part-of-speech probability describes the likelihood of the character appearing in a word of a particular part of speech, given the length of the word and the character's position in it. For example, a part-of-speech probability can describe the likelihood that character "A" appears as the second character of a three-character noun.
A part-of-speech probability can be written as P followed by Cat, Loc, and Len, where P denotes probability, Cat is an abbreviation for the category, or part of speech, of the word, Loc is the position of the character in the word, and Len is the length of the word in characters. For example, the probability of a character appearing as the second character of a four-character verb is written Pv24, the probability of a character appearing as the first character of a two-character noun is written Pn12, and the probability of a character appearing as the third character of a four-character adjective is written Pa34.
To limit the number of probabilities that must be computed for each character, one embodiment of the invention restricts the parts of speech to noun (n), verb (v), and adjective (a), and restricts word length to between two and four characters. The length attribute of a probability can thus be 2, 3, or 4, and the position attribute can be 1, 2, 3, or 4. Given these restricted value sets, each character can have at most 27 part-of-speech probabilities.
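The figure of 27 follows from three parts of speech times the 2 + 3 + 4 = 9 valid position and length pairs. A quick enumeration in Python confirms the count. The label format simply mirrors the Pv24 style used above, and the variable name `labels` is an assumption of this sketch.

```python
# Enumerate every valid (category, position, length) combination.
labels = [
    f"P{cat}{loc}{length}"
    for cat in ("n", "v", "a")
    for length in (2, 3, 4)
    for loc in range(1, length + 1)  # position cannot exceed word length
]
print(len(labels))  # 27
```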
Under one embodiment, each part-of-speech probability is determined by counting the words in a dictionary of the language that have the particular length and the particular part of speech and that contain the character at the particular position. Thus, to determine Pn23 for character "A", embodiments of the invention count the three-character nouns in the dictionary that have "A" as their second character. This count is then divided by the number of words in the dictionary that contain the character to obtain the part-of-speech probability. Expressed as a formula:
$$P_{cat,loc,len}(c) = \frac{N(cat,loc,len(c))}{N(c)} \qquad (2)$$
where P_{cat,loc,len}(c) is the part-of-speech probability of character c, N(cat,loc,len(c)) is the number of words that have the particular part of speech and the particular length with the character at the particular position, and N(c) is the total number of words that contain the character.
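The counting behind Equation (2) can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the dictionary is modeled as a list of hypothetical (word, part of speech) pairs, each pair is treated as one entry when computing N(c), and the toy entries are invented.

```python
from collections import Counter

def pos_probabilities(dictionary):
    """Map (char, cat, loc, len) to N(cat, loc, len(c)) / N(c).

    `dictionary` is an iterable of (word, cat) pairs (an assumed layout).
    """
    positional = Counter()   # N(cat, loc, len(c))
    containing = Counter()   # N(c): entries that contain the character
    for word, cat in dictionary:
        for loc, char in enumerate(word, start=1):
            positional[(char, cat, loc, len(word))] += 1
        for char in set(word):          # count each entry once per character
            containing[char] += 1
    return {key: n / containing[key[0]] for key, n in positional.items()}

# Invented toy dictionary; Latin letters stand in for characters.
toy_dictionary = [("AB", "n"), ("AC", "v"), ("BA", "n"), ("CAB", "n")]
probs = pos_probabilities(toy_dictionary)
print(probs[("A", "n", 2, 3)])  # 0.25: one of the four entries containing "A"
```

Here "A" occurs in four entries, and exactly one of them is a three-character noun with "A" in second position, so Pn23 for "A" is 1/4.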
Under one embodiment, the counts used to determine the part-of-speech probabilities are taken only over the head words of the dictionary. Under one embodiment, a dictionary with 85,135 head words is used.
Under one embodiment, the part-of-speech probabilities are calculated for each character in the language before a passage of text is received for processing. In Fig. 3, these part-of-speech probabilities are shown as being stored in part-of-speech (POS) probability storage 326.
When new-word identifier 320 identifies a new word, part-of-speech identifier 324 accesses POS probability storage 326 to retrieve the appropriate part-of-speech probabilities for each character in the word. For example, for the new word "AB", part-of-speech identifier 324 retrieves part-of-speech probabilities Pn12, Pv12, and Pa12 for character "A" and part-of-speech probabilities Pn22, Pv22, and Pa22 for character "B". The probabilities for the same part of speech are then summed or averaged to form a total probability for the word. Thus, Pn12 of character "A" is added to Pn22 of character "B" to give the total probability that the word "AB" is a noun. Similarly, Pv12 of character "A" is added to Pv22 of character "B" to give the total probability that the word "AB" is a verb.
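The lookup and combination for a word such as "AB" can be sketched as below. The function name, the key layout of the store, and the probability values are hypothetical; the text specifies only that probabilities for the same part of speech are summed or averaged.

```python
def total_pos_probabilities(word, char_probs, combine="sum"):
    """Combine per-character POS probabilities into one total per part of speech.

    `char_probs` maps (char, cat, loc, len) to a probability (assumed layout);
    a missing entry is treated as 0.0, which is also an assumption.
    """
    length = len(word)
    totals = {}
    for cat in ("n", "v", "a"):
        values = [
            char_probs.get((char, cat, loc, length), 0.0)
            for loc, char in enumerate(word, start=1)
        ]
        total = sum(values)
        totals[cat] = total / length if combine == "average" else total
    return totals

# Invented values standing in for POS probability storage 326.
store = {
    ("A", "n", 1, 2): 0.5,   ("B", "n", 2, 2): 0.25,
    ("A", "v", 1, 2): 0.125, ("B", "v", 2, 2): 0.125,
}
totals = total_pos_probabilities("AB", store)
print(totals)  # {'n': 0.75, 'v': 0.25, 'a': 0.0}
```

Passing `combine="average"` divides each total by the word length, which corresponds to the averaging alternative.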
Each total probability is then compared with a threshold to determine whether the word sufficiently represents one or more parts of speech. Thus, the total probability that the word is a noun is compared with the threshold, the total probability that the word is a verb is compared with the threshold, and the total probability that the word is an adjective is compared with the threshold. Note that although a single threshold is used above, in other embodiments different thresholds are used for different parts of speech.
Each part of speech whose total probability exceeds its threshold is considered a possible part of speech for the new word. Therefore, when part-of-speech identifier 324 adds the word to the word lattice, it assigns to the word a part of speech whose total probability exceeds the threshold. Under one embodiment, if a word has several parts of speech that exceed the threshold, the word is inserted once for each such part of speech. Thus, if the total probability that the word is a noun exceeds the threshold and the total probability that the word is a verb exceeds the threshold, the word is added to the word lattice once as a noun and once as a verb. In other embodiments, the word is added to the word lattice only once and is assigned the part of speech with the highest total probability.
Under one embodiment, if none of the total part-of-speech probabilities exceeds its threshold, part-of-speech identifier 324 inserts the word into the word lattice and designates it a noun.
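The threshold comparison, the two insertion variants, and the noun fallback can be summarized in one small function. This is a sketch only: the function name and the threshold values are assumptions, since the text gives no numeric thresholds.

```python
def assign_parts_of_speech(totals, thresholds, best_only=False):
    """Return the parts of speech under which a new word enters the lattice.

    `totals` and `thresholds` map a part-of-speech tag to a number.
    """
    passing = [cat for cat, p in totals.items() if p > thresholds[cat]]
    if not passing:
        return ["n"]   # fallback embodiment: designate the word a noun
    if best_only:
        # alternative embodiment: one entry, highest total probability
        return [max(passing, key=lambda cat: totals[cat])]
    return passing     # one lattice entry per passing part of speech

totals = {"n": 0.75, "v": 0.25, "a": 0.0}
same_threshold = {"n": 0.2, "v": 0.2, "a": 0.2}    # hypothetical values
print(assign_parts_of_speech(totals, same_threshold))                  # ['n', 'v']
print(assign_parts_of_speech(totals, same_threshold, best_only=True))  # ['n']
```

Keeping `thresholds` as a mapping covers both the single-threshold case and the embodiments that use a different threshold for each part of speech.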
Although and the method for identification neologisms of the present invention illustrated the method for the part of speech of definite neologisms in combination, do not need to realize in the lump these two aspects of the present invention.Therefore, can under this method that does not have the part of speech that is used for determining neologisms, use the method for these identification neologisms, and this method that is used for discerning the part of speech of neologisms can be used together with the method for any identification neologisms.
The speech battle array that is generated by part of speech recognizer 324 is provided for syntax analyzer 316, and the latter utilizes the speech battle array of this expansion to carry out grammatical analysis in the step 414 of Fig. 4.In one embodiment, thus utilize by incrementally set up bigger phrase from less speech and less phrase and create the bottom-up graphic analysis of grammatical analysis and carry out grammatical analysis.In order to set up bigger phrase, syntax analyzer 316 is used the morphology sign syntax rule of determining speech or phrase and is determined how to make up them to form bigger speech or phrase.At an embodiment, utilize the two-dimensional grammar of checking two adjacent speech or phrase to determine how to make up them.
The grammatical analysis that syntax analyzer 316 carries out relates to all fields in this expansion speech battle array.This analyzer is limited to only combination and represents the character of phase hyphen in the original input text and final analysis to cross over the entire chapter input text.Thereby this syntax analyzer does not produce the effective phrase that relates to two stack fields, does not produce the phrase of a group field of the integrality of not representing input string yet.
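The span constraint on a final parse can be illustrated with a small check. This is not the parser itself; it is a sketch that assumes each lattice field is a hypothetical (start, end) pair of half-open character offsets into the input text.

```python
def covers_input(fields, text_length):
    """Return True if the fields tile the input with no overlap and no gap."""
    position = 0
    for start, end in sorted(fields):
        if start != position or end <= start:
            return False   # overlap, gap, or empty field
        position = end
    return position == text_length

print(covers_input([(0, 2), (2, 5)], 5))  # True: contiguous and complete
print(covers_input([(0, 3), (2, 5)], 5))  # False: the two fields overlap
print(covers_input([(0, 2), (3, 5)], 5))  # False: character 2 is not covered
```

A set of fields that fails this check corresponds to either an overlapping combination or an incomplete segmentation, both of which the parser rules out.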
During the analysis, parser 316 determines whether each new word identified by new-word identifier 320 forms part of a valid phrase, based on the parts of speech that part-of-speech identifier 324 identified for it. If a new word does not form part of a valid phrase, it is not identified as a word of the text. Under some embodiments of the invention, roughly half of the words identified by new-word identifier 320 do not form part of a valid phrase and are therefore discarded during parsing. The remaining new words identified by new-word identifier 320 are selected as parts of valid phrases, and in many cases they are critical to forming a valid parse of the text. In one test, about 21 percent of the sentences parsed by parser 316 could not be parsed without the words provided by new-word identifier 320.
In some embodiments, parser 316 produces several valid syntactic parses, each representing a distinct valid segmentation of the input text. In one embodiment, each such valid parse is passed to logical form generator 318, which identifies the semantic relations within each parse. These semantic relations can then be used to select the valid parses that are most likely to be the correct parse of the input. This semantic identification is shown at step 416 of Fig. 4.
Although the present invention has been described with reference to particular embodiments, those skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

Claims (13)

1. A method of segmenting an input sequence of characters from a non-segmented language, the method comprising:
identifying a sequence of single characters in the input sequence of characters;
for each character, determining an independent word probability that indicates the likelihood of the character appearing as a single-character word;
combining the independent word probabilities of the characters in the sequence of single characters to determine a total independent word probability for the sequence of single characters; and
if the total independent word probability is less than a threshold, designating the sequence of single characters as a single word.
2. The method of claim 1, wherein identifying a sequence of single characters in the input sequence of characters comprises performing a lexical lookup to group characters of the input sequence into multi-character words, and searching the input sequence for sequences of single characters that are not included in a multi-character word.
3. The method of claim 2, wherein identifying a sequence of single characters in the input sequence of characters further comprises performing derivational morphological analysis to identify further multi-character words before searching the input sequence for sequences of single characters that are not included in a multi-character word.
4. The method of claim 3, wherein identifying a sequence of single characters in the input sequence of characters further comprises performing proper-name recognition to identify further multi-character words before searching the input sequence for sequences of single characters that are not included in a multi-character word.
5. The method of claim 1, wherein determining the independent word probability of a character comprises counting the number of times the character appears as a single-character word in a corpus and dividing that count by the number of times the character appears in the corpus.
6. The method of claim 1, wherein combining the independent word probabilities comprises averaging the independent word probabilities.
7. The method of claim 1, wherein combining the independent word probabilities comprises summing the independent word probabilities.
8. The method of claim 1, further comprising assigning a part of speech to the single word designated for the sequence of single characters.
9. The method of claim 8, wherein assigning a part of speech comprises: for each character in the single word, determining a part-of-speech probability that describes the likelihood of the character for a particular part of speech, given the length of the single word and the position of the character in the single word;
combining the part-of-speech probabilities of all of the characters in the single word to determine a total part-of-speech probability; and
if the total part-of-speech probability exceeds a threshold, assigning to the single word the part of speech associated with the total part-of-speech probability.
10. The method of claim 9, wherein determining a part-of-speech probability for a character comprises:
counting the number of words in a dictionary that represent the part of speech, have the length of the single word, and contain the character at the same position as in the single word, to provide a part-of-speech count; and
dividing the part-of-speech count by the number of words in the dictionary that contain the character, to produce the part-of-speech probability for the character.
11. The method of claim 9, wherein determining a part-of-speech probability comprises retrieving a previously determined part-of-speech probability from memory.
12. The method of claim 9, further comprising designating the single word as a noun if the total part-of-speech probability of no part of speech exceeds the threshold.
13. The method of claim 1, further comprising:
identifying a sub-sequence of single characters within the sequence of single characters;
combining the independent word probabilities of the characters in the sub-sequence of single characters to determine a total independent word probability for the sub-sequence of single characters; and
if the total independent word probability is less than a threshold, designating the sub-sequence of single characters as a single word.
CNB011353570A 2000-10-04 2001-10-08 Method and system for identifying property of new word in non-divided text Expired - Fee Related CN1193304C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US23752200P 2000-10-04 2000-10-04
US60/237,522 2000-10-04

Publications (2)

Publication Number Publication Date
CN1369877A CN1369877A (en) 2002-09-18
CN1193304C true CN1193304C (en) 2005-03-16

Family

ID=22894078

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB011353570A Expired - Fee Related CN1193304C (en) 2000-10-04 2001-10-08 Method and system for identifying property of new word in non-divided text

Country Status (2)

Country Link
CN (1) CN1193304C (en)
TW (1) TW548600B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271450B (en) * 2007-03-19 2010-09-29 株式会社东芝 Method and device for cutting language model

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100347723C (en) * 2005-07-15 2007-11-07 清华大学 Off-line hand writing Chinese character segmentation method with compromised geomotric cast and sematic discrimination cost
DE602006016846D1 (en) * 2005-11-23 2010-10-21 Dun & Bradstreet Inc SYSTEM AND METHOD FOR BROWSING AND COMPARING DATA WITH IDEOGRAMMATIC CONTENTS
CN100478961C (en) * 2007-09-17 2009-04-15 中国科学院计算技术研究所 New word of short-text discovering method and system
CN100489863C (en) * 2007-09-27 2009-05-20 中国科学院计算技术研究所 New word discovering method and system thereof
CN101882226B (en) * 2010-06-24 2013-07-24 汉王科技股份有限公司 Method and device for improving language discrimination among characters
CN107203542A (en) * 2016-03-17 2017-09-26 阿里巴巴集团控股有限公司 Phrase extracting method and device
CN109858010B (en) * 2018-11-26 2023-01-24 平安科技(深圳)有限公司 Method and device for recognizing new words in field, computer equipment and storage medium
CN109858011B (en) * 2018-11-30 2022-08-19 平安科技(深圳)有限公司 Standard word bank word segmentation method, device, equipment and computer readable storage medium
CN112101308B (en) * 2020-11-11 2021-02-09 北京云测信息技术有限公司 Method and device for combining text boxes based on language model and electronic equipment

Also Published As

Publication number Publication date
TW548600B (en) 2003-08-21
CN1369877A (en) 2002-09-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170825

Address after: Washington State

Patentee after: Microsoft Technology Licensing, LLC

Address before: Washington, USA

Patentee before: Microsoft Corp.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20050316

Termination date: 20191008