CN1920812A

CN1920812A - Language processing system

Info

Publication number: CN1920812A
Application number: CNA2006101256010A
Authority: CN
Inventors: 濑户重宣
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2005-08-24
Filing date: 2006-08-24
Publication date: 2007-02-28
Anticipated expiration: 2026-08-24
Also published as: US7917352B2; JP2007058509A; CN1920812B; US20070055496A1

Abstract

The present invention provides a language processing system that prevents, in advance, the generation of a word sequence containing words that the system user does not want. The language processing system comprises a forbidden-morpheme storage unit 202 that stores use-forbidden morphemes; a sequence candidate generation unit 111 that generates, from text written without word breaks, a plurality of word sequence candidates each segmented into a plurality of morphemes; and an optimal sequence selection unit 112 that reads the use-forbidden morphemes from the forbidden-morpheme storage unit 202, excludes the word sequence candidates containing them, and selects from the remaining candidates the optimal word sequence, namely the one with the highest inter-morpheme connectability.

Description

Language processing system

Technical field

The present invention relates to morphological analysis technology, and in particular to a language processing system.

Background technology

Text-to-speech systems make use of the following function: a user-registered word added by the system user is given priority in speech synthesis over a system word registered in the system in advance. For example, even if the system word "神戸 (こうべ)" is registered in the system, once the system user adds the user-registered word "神戸 (かんべ)", the reading "神戸 (かんべ)" takes priority over "神戸 (こうべ)" when speech is synthesized.

However, in a language such as Japanese, in which words are written without separation (unlike languages that place spaces between words to make the text easier to read), morphological analysis may produce a word sequence that does not contain the morpheme corresponding to the user-registered word, even when the unsegmented text contains that word. For example, suppose that, for a text in which "神戸" is preceded by the character for "slope", the system user wants the "神戸" part to be output with the reading "かんべ" and has registered it as a user-registered word. If, during morphological analysis, the system generates a word sequence that places a boundary inside "神戸" (attaching "神" to the preceding character and splitting off "戸"), the reading "神戸 (かんべ)" is not output. Conversely, the following technique has been proposed for text containing words that are undesirable for the system user, such as broadcast-prohibited terms: after the word sequence has been determined by morphological analysis, morphemes that match a broadcast-prohibited term recorded in a list are detected and then either skipped during reading or replaced with other words (see, for example, Patent Document 1). However, no system prevents, before the segmented word sequence is determined, the generation of a word sequence containing words that are undesirable for the system user.

The same problem exists in languages in which words are written separately. Even though the word boundaries are evident, if the word sequence is determined in morphological analysis by evaluating the strength of the connection between each word and its neighbors, registering a user-registered word does not guarantee that a word sequence containing the corresponding morpheme will be generated.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H5-165486

Summary of the invention

The invention provides a language processing system that prevents, in advance, the generation of a word sequence containing words that are undesirable for the system user.

According to a first aspect of the present invention, there is provided a language processing system comprising: a forbidden-morpheme storage unit that stores use-forbidden morphemes; a sequence candidate generation unit that generates, from text written without word breaks, a plurality of word sequence candidates each segmented into a plurality of morphemes; and an optimal sequence selection unit that reads the use-forbidden morphemes from the forbidden-morpheme storage unit, excludes the candidates containing a use-forbidden morpheme from the plurality of word sequence candidates, and selects from the remaining candidates the optimal word sequence with the highest inter-morpheme connectability.

According to a second aspect of the present invention, there is provided a language processing system comprising: a forbidden-morpheme storage unit that stores use-forbidden morphemes; a sequence candidate generation unit that reads the use-forbidden morphemes stored in the forbidden-morpheme storage unit and, without using them, generates from text written without word breaks a plurality of word sequence candidates each segmented into a plurality of morphemes; and an optimal sequence selection unit that selects from the plurality of word sequence candidates the optimal word sequence with the highest inter-morpheme connectability.

According to the present invention, a language processing system can be provided that prevents, in advance, the generation of a word sequence containing words that are undesirable for the system user.

Description of drawings

Fig. 1 is a block diagram of the language processing system of Embodiment 1 of the present invention.

Fig. 2 is a first schematic diagram of the lattice structure generated by the language processing system of Embodiment 1 for a Japanese example.

Fig. 3 is a first schematic diagram of the lattice structure generated by the language processing system of Embodiment 1 for a Chinese example.

Fig. 4 is a first schematic diagram of the lattice structure generated by the language processing system of Embodiment 1 for an English example.

Fig. 5 is a first table of forbidden morphemes stored in the forbidden-morpheme storage unit of Embodiment 1 for a Japanese example.

Fig. 6 is a first table of forbidden morphemes stored in the forbidden-morpheme storage unit of Embodiment 1 for a Chinese example.

Fig. 7 is a first table of forbidden morphemes stored in the forbidden-morpheme storage unit of Embodiment 1 for an English example.

Fig. 8 is a second schematic diagram of the lattice structure generated by the language processing system of Embodiment 1 for a Japanese example.

Fig. 9 is a second schematic diagram of the lattice structure generated by the language processing system of Embodiment 1 for a Chinese example.

Fig. 10 is a second schematic diagram of the lattice structure generated by the language processing system of Embodiment 1 for an English example.

Fig. 11 is a flowchart of the language processing method of Embodiment 1.

Fig. 12 is a second table of forbidden morphemes stored in the forbidden-morpheme storage unit of Embodiment 1.

Fig. 13 is a first schematic diagram of the lattice structure generated by the language processing system of Embodiment 1 for another English example.

Fig. 14 is a first table of forbidden morphemes stored in the forbidden-morpheme storage unit of Embodiment 1 for another English example.

Fig. 15 is a second schematic diagram of the lattice structure generated by the language processing system of Embodiment 1 for another English example.

Fig. 16 is a block diagram of the language processing system of Embodiment 2.

Fig. 17 is a schematic diagram of the lattice structure generated by the language processing system of Embodiment 2 for a Japanese example.

Fig. 18 is a schematic diagram of the lattice structure generated by the language processing system of Embodiment 2 for a Chinese example.

Fig. 19 is a schematic diagram of the lattice structure generated by the language processing system of Embodiment 2 for an English example.

Fig. 20 is a schematic diagram of the lattice structure generated by the language processing system of Embodiment 2 for another English example.

Fig. 21 is a flowchart of the language processing method of Embodiment 2.

Fig. 22 is a block diagram of the language processing system of Embodiment 3.

Fig. 23 is a flowchart of the language processing method of Embodiment 3.

Fig. 24 is a block diagram of the language processing system of Embodiment 4.

Fig. 25 is a table of forbidden morphemes stored in the forbidden-morpheme storage unit of Embodiment 4 for a Japanese example.

Fig. 26 is a diagram illustrating, for a Chinese example, how forbidden morphemes are additionally stored in the forbidden-morpheme storage unit in Embodiment 4.

Fig. 27 is a diagram illustrating, for an English example, how forbidden morphemes are additionally stored in the forbidden-morpheme storage unit in Embodiment 4.

Fig. 28 is a flowchart of the language processing method of Embodiment 4.

Fig. 29 is a diagram illustrating, for another Chinese example, how forbidden morphemes are additionally stored in the forbidden-morpheme storage unit in Embodiment 4.

Fig. 30 is a block diagram of the language processing system of Embodiment 5.

Fig. 31 is a diagram illustrating, for a Chinese example, how forbidden morphemes are additionally stored in the forbidden-morpheme storage unit in Embodiment 5.

Fig. 32 is a diagram illustrating, for an English example, how forbidden morphemes are additionally stored in the forbidden-morpheme storage unit in Embodiment 5.

Fig. 33 is a flowchart of the language processing method of Embodiment 5.

Fig. 34 is a diagram illustrating, for another Chinese example, how forbidden morphemes are additionally stored in the forbidden-morpheme storage unit in Embodiment 5.

Fig. 35 is a diagram illustrating, for another English example, how forbidden morphemes are additionally stored in the forbidden-morpheme storage unit in Embodiment 5.

Embodiment

Next, embodiments of the present invention are described with reference to the drawings. In the drawings described below, identical or similar parts are given identical or similar reference numerals. The embodiments shown below exemplify devices and methods that embody the technical idea of the present invention, and that technical idea does not limit the arrangement of components and the like to what is described below. Various modifications can be made to the technical idea of the present invention within the scope of the claims.

(embodiment 1)

As shown in Fig. 1, the language processing system of Embodiment 1 comprises a central processing unit (CPU) 100a and a data storage device 200 connected to the CPU 100a. The data storage device 200 in turn comprises a forbidden-morpheme storage unit 202 and a system dictionary storage unit 201. The forbidden-morpheme storage unit 202 stores forbidden morphemes, that is, morphemes paired with a reading whose use is forbidden. The system dictionary storage unit 201 stores a system dictionary that records the readings and parts of speech of a plurality of words. The CPU 100a comprises a sequence candidate generation unit 111 and an optimal sequence selection unit 112. The sequence candidate generation unit 111 generates, from text written without word breaks, a plurality of word sequence candidates each segmented into a plurality of morphemes. The optimal sequence selection unit 112 reads the use-forbidden morphemes from the forbidden-morpheme storage unit 202, excludes the candidates that contain them from the plurality of word sequence candidates, and selects from the remaining candidates the optimal word sequence with the highest inter-morpheme connectability.

Specifically, the sequence candidate generation unit 111 refers to the system dictionary, decomposes the input unsegmented text into a plurality of morphemes, and generates a lattice structure in which the morphemes are arranged on lattice points. For example, suppose the Japanese text "主記憶上空間が" is input and the system dictionary contains the morphemes "主 (ぬし)", "主 (しゅ)", "主 (あるじ)", "主 (おも)", "記憶 (きおく)", "上空 (うわそら)", "上 (うえ)", "上 (かみ)", "上 (じょう)", "空 (そら)", "空 (くう)", "空 (から)", "空間 (くうかん)", "間 (かん)", "間 (あいだ)", "間 (はざま)", and "が", each with its reading attached. The sequence candidate generation unit 111 then generates the lattice structure 50 shown in Fig. 2 as the combination of the morphemes registered in the system dictionary. The lattice structure 50 contains a plurality of word sequence candidates. Starting from "主 (ぬし)", for example, it yields candidates such as "主 (ぬし) 記憶 (きおく) 上空 (うわそら) 間 (かん) が" and "主 (ぬし) 記憶 (きおく) 上 (うえ) 空間 (くうかん) が".

Similarly, suppose the Chinese text "你看他拿着火车票" is input and the system dictionary contains the morphemes "你 (ni3)" ... "车票 (che1piao4)", each with its reading attached. The sequence candidate generation unit 111 then generates the lattice structure 50 shown in Fig. 3 as the combination of the morphemes registered in the system dictionary. The lattice structure 50 contains a plurality of word sequence candidates. Starting from "着", for example, it yields candidates such as "着 (zhe) 火车票 (huo3che1piao4)" and "着火 (zhao2huo3) 车票 (che1piao4)".

Likewise, suppose the English text "Drink much mate" is input and the system dictionary contains the morphemes "drink" ... "mate", each with its reading attached. The sequence candidate generation unit 111 then generates the lattice structure 50 shown in Fig. 4 as the combination of the morphemes registered in the system dictionary. The lattice structure 50 contains a plurality of word sequence candidates. Starting from "much", for example, it yields candidates such as "much mate [meit]" and "much mate [ma:tei]".
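The lattice generation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the sequence candidate generation unit 111: the dictionary is a plain mapping from surface form to readings, the names `build_lattice` and `enumerate_sequences` are hypothetical, and the entries reproduce the Japanese example in what is assumed to be its original script.

```python
from collections import defaultdict

# Hypothetical system dictionary: surface form -> readings (Japanese example).
SYSTEM_DICT = {
    "主": ["ぬし", "しゅ", "あるじ", "おも"],
    "記憶": ["きおく"],
    "上空": ["うわそら"],
    "上": ["うえ", "かみ", "じょう"],
    "空": ["そら", "くう", "から"],
    "空間": ["くうかん"],
    "間": ["かん", "あいだ", "はざま"],
    "が": ["が"],
}

def build_lattice(text, dictionary):
    """Return {start: [(end, surface, reading), ...]} for every dictionary match."""
    lattice = defaultdict(list)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            surface = text[start:end]
            for reading in dictionary.get(surface, []):
                lattice[start].append((end, surface, reading))
    return dict(lattice)

def enumerate_sequences(lattice, length, pos=0):
    """Yield every path through the lattice that covers the whole text."""
    if pos == length:
        yield []
        return
    for end, surface, reading in lattice.get(pos, []):
        for rest in enumerate_sequences(lattice, length, end):
            yield [(surface, reading)] + rest

text = "主記憶上空間が"
lattice = build_lattice(text, SYSTEM_DICT)
candidates = list(enumerate_sequences(lattice, len(text)))
```

Enumerating every path is only practical for short inputs; it is used here to make the notion of a "word sequence candidate" concrete.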

The forbidden-morpheme storage unit 202 shown in Fig. 1 stores forbidden morphemes, each paired with a reading that the system user does not want output. For example, as shown in Fig. 5, for the character "主" it stores the forbidden morpheme "主 (おも)", to which the unwanted reading "おも" is attached, and for the character string "上空" it stores the forbidden morpheme "上空 (うわそら)", to which the unwanted reading "うわそら" is attached.

Similarly, as shown in Fig. 6, for the character "看" it stores the forbidden morpheme "看 (ka1)", to which the unwanted reading "ka1" is attached, and for the character string "着火" it stores the forbidden morpheme "着火 (zhao2huo3)", to which the unwanted reading "zhao2huo3" is attached.

Likewise, as shown in Fig. 7, for the character string "mate" it stores the forbidden morpheme "mate [ma:tei]", to which the unwanted reading "ma:tei" is attached.

The optimal sequence selection unit 112 shown in Fig. 1 comprises a prohibition module 114 and a selection module 12. The prohibition module 114 searches the morphemes contained in the lattice structure 50 shown in Fig. 2 for any morpheme that corresponds to a forbidden morpheme stored in the forbidden-morpheme storage unit 202. When it finds a forbidden morpheme in the lattice structure 50, it deletes that morpheme from the lattice structure 50. For example, when the forbidden morphemes "主 (おも)" and "上空 (うわそら)" are stored in the forbidden-morpheme storage unit 202 as shown in Fig. 5, they are deleted from the lattice structure 50 as shown in Fig. 8.

Similarly, as shown in Fig. 9, the forbidden morphemes "看 (ka1)" and "着火 (zhao2huo3)" are deleted from the lattice structure 50.

Likewise, as shown in Fig. 10, the forbidden morpheme "mate [ma:tei]" is deleted from the lattice structure 50.
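The deletion step performed by the prohibition module 114 can be illustrated with the English example. The sketch below assumes that forbidden morphemes are held as (surface, reading) pairs and that the lattice maps a start position to its outgoing edges; `prune_lattice` is a hypothetical name, and the readings given for "drink" and "much" are placeholders, since the source only spells out the two readings of "mate".

```python
# Forbidden morphemes as (surface, reading) pairs, following Fig. 7.
FORBIDDEN = {("mate", "ma:tei")}

def prune_lattice(lattice, forbidden):
    """Return a copy of the lattice without any forbidden (surface, reading) edge."""
    return {
        start: [edge for edge in edges if (edge[1], edge[2]) not in forbidden]
        for start, edges in lattice.items()
    }

# Lattice for "Drink much mate": start index -> [(end, surface, reading), ...].
lattice = {
    0: [(1, "drink", "drink")],   # placeholder reading
    1: [(2, "much", "much")],     # placeholder reading
    2: [(3, "mate", "meit"), (3, "mate", "ma:tei")],
}

pruned = prune_lattice(lattice, FORBIDDEN)
```

Returning a new dictionary instead of mutating the input keeps the original lattice available, which is convenient when comparing the state before and after deletion (Fig. 4 versus Fig. 10).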

The selection module 12 shown in Fig. 1 uses a search algorithm such as depth-first search or breadth-first search to select, from the lattice structure 50 of Fig. 8 from which the forbidden morphemes have been deleted, the optimal word sequence: the one whose inter-morpheme connectability is highest and whose reading is judged most appropriate. Heuristics such as the longest-match method, the minimum-phrase-count method, and the minimum-cost method are used together during selection. Here the selection module 12 selects "主 (しゅ) 記憶 (きおく) 上 (じょう) 空間 (くうかん) が" from the lattice structure 50 as the word sequence with the highest inter-morpheme connectability. The audio file generation unit 116 generates an audio file for outputting the reading of the optimal word sequence.
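One way to picture the selection step is a depth-first search that keeps the cheapest complete path, in the spirit of the minimum-cost heuristic. The sketch below is illustrative only: the cost table is invented, `best_sequence` and `connection_cost` are hypothetical names, and the lattice is the fragment "上空間" of the Japanese example (in assumed original script) after the forbidden morpheme 上空 (うわそら) has been deleted.

```python
def best_sequence(lattice, length, connection_cost):
    """Depth-first search for the complete path with the lowest total cost.

    Assumes non-negative costs, so a partial path that already costs as much
    as the best complete path found so far can be abandoned.
    """
    best_cost, best_path = float("inf"), None

    def dfs(pos, path, total):
        nonlocal best_cost, best_path
        if total >= best_cost:
            return
        if pos == length:
            best_cost, best_path = total, list(path)
            return
        prev = path[-1] if path else None
        for end, surface, reading in lattice.get(pos, []):
            path.append((surface, reading))
            dfs(end, path, total + connection_cost(prev, (surface, reading)))
            path.pop()

    dfs(0, [], 0.0)
    return best_path

lattice = {
    0: [(1, "上", "うえ"), (1, "上", "かみ"), (1, "上", "じょう")],
    1: [(2, "空", "そら"), (2, "空", "くう"), (2, "空", "から"), (3, "空間", "くうかん")],
    2: [(3, "間", "かん"), (3, "間", "あいだ"), (3, "間", "はざま")],
}

# Hypothetical costs: 1.0 per morpheme, cheaper for one well-connected pair.
PAIR_COST = {(("上", "じょう"), ("空間", "くうかん")): 0.2}

def connection_cost(prev, cur):
    return PAIR_COST.get((prev, cur), 1.0)

best = best_sequence(lattice, 3, connection_cost)
```

With these made-up costs the search settles on 上 (じょう) followed by 空間 (くうかん), matching the segmentation chosen in the text. A real system would derive the costs from its dictionary and connection tables.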

The data storage device 200 also comprises a lattice structure storage unit 203 and an optimal sequence storage unit 204. The lattice structure storage unit 203 stores the lattice structure 50 generated by the sequence candidate generation unit 111. The optimal sequence storage unit 204 stores the optimal word sequence selected by the optimal sequence selection unit 112. The CPU 100a is further connected to a speaker 342, an input device 340, an output device 341, a program storage device 230, and a temporary storage device 231. The speaker 342 outputs as sound the reading of the optimal word sequence contained in the audio file. A keyboard or a pointing device such as a mouse can be used as the input device 340. An image display device such as a liquid crystal display or monitor, a printer, or the like can be used as the output device 341. The program storage device 230 stores the operating system and other programs that control the CPU 100a. The temporary storage device 231 successively stores the calculation results of the CPU 100a. Recording media capable of recording programs, such as semiconductor memory, magnetic disks, optical disks, magneto-optical disks, and magnetic tape, can be used as the program storage device 230 and the temporary storage device 231.

Next, the language processing method of Embodiment 1 is described using the flowchart shown in Fig. 11.

(a) In step S100, unsegmented text containing kanji is input through the input device 340 shown in Fig. 1 to the sequence candidate generation unit 111 of the CPU 100a. Here, as an example, the text "主記憶上空間が" is assumed to have been input. Then, in step S101, the sequence candidate generation unit 111 refers to the system dictionary stored in the system dictionary storage unit 201, decomposes the input text "主記憶上空間が" into a plurality of morphemes, and generates the lattice structure 50 shown in Fig. 2, which is formed from those morphemes. The sequence candidate generation unit 111 stores the generated lattice structure 50 in the lattice structure storage unit 203.

(b) In step S102, the prohibition module 114 shown in Fig. 1 reads the lattice structure 50 shown in Fig. 2 from the lattice structure storage unit 203. The prohibition module 114 then searches the morphemes contained in the lattice structure 50 for any morpheme that corresponds to a forbidden morpheme stored in the forbidden-morpheme storage unit 202. Here, when the forbidden morphemes "主 (おも)" and "上空 (うわそら)" are stored in the forbidden-morpheme storage unit 202 as shown in Fig. 5, the prohibition module 114 deletes "主 (おも)" and "上空 (うわそら)" from the lattice structure 50 as shown in Fig. 8. The prohibition module 114 then writes the lattice structure 50 from which the forbidden morphemes have been deleted back to the lattice structure storage unit 203.

(c) In step S103, the selection module 12 reads the lattice structure 50 from which the forbidden morphemes have been deleted from the lattice structure storage unit 203. The selection module 12 then uses a search algorithm and heuristics to select, from the lattice structure 50 shown in Fig. 8, the optimal word sequence whose reading is judged most appropriate. Here the selection module 12 selects "主 (しゅ) 記憶 (きおく) 上 (じょう) 空間 (くうかん) が" as the optimal word sequence. The optimal sequence selection unit 112 then stores the selected optimal word sequence in the optimal sequence storage unit 204.

(d) In step S104, the audio file generation unit 116 reads the optimal word sequence "主 (しゅ) 記憶 (きおく) 上 (じょう) 空間 (くうかん) が" from the optimal sequence storage unit 204 and converts its reading into an audio file. The reading of the optimal word sequence contained in the audio file is then output from the speaker 342, and the language processing method of Embodiment 1 ends.
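Steps S100 to S104 can be strung together as a small pipeline. The following is a deliberately simplified sketch using the English example: each input token becomes one lattice column, selection is a placeholder that takes the first surviving reading, and audio generation is replaced by returning the reading string. All function names, and the readings given for "drink" and "much", are assumptions made for illustration.

```python
def generate_lattice(tokens, dictionary):
    """S101: one column of (surface, reading) alternatives per token."""
    return [[(tok, reading) for reading in dictionary[tok]] for tok in tokens]

def delete_forbidden(lattice, forbidden):
    """S102: drop every alternative that is a forbidden morpheme."""
    return [[m for m in column if m not in forbidden] for column in lattice]

def select_optimal(lattice):
    """S103: placeholder selection that takes the first surviving alternative."""
    return [column[0] for column in lattice]

def render_reading(sequence):
    """S104: stand-in for audio file generation; returns the reading string."""
    return " ".join(reading for _, reading in sequence)

SYSTEM_DICT = {"drink": ["drink"], "much": ["much"], "mate": ["ma:tei", "meit"]}
FORBIDDEN = {("mate", "ma:tei")}

tokens = "Drink much mate".lower().split()      # S100: input text
lattice = generate_lattice(tokens, SYSTEM_DICT)
output = render_reading(select_optimal(delete_forbidden(lattice, FORBIDDEN)))
unfiltered = render_reading(select_optimal(lattice))
```

Here `output` uses the reading "meit", whereas `unfiltered`, which skips step S102, falls back to the unwanted "ma:tei". If every reading of a token were forbidden, `select_optimal` would raise an `IndexError`; a fuller version would need to handle that case.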

As described above, with the language processing system and language processing method of Embodiment 1 shown in Figs. 1 and 11, even when the system dictionary contains a word with a reading that the user does not want output, storing the forbidden morpheme in the forbidden-morpheme storage unit 202 in advance prevents the unwanted reading from being assigned to the input text. The reading the user wants can therefore be assigned to the text with higher probability. The example in Fig. 5 stores combinations of surface form and reading in the forbidden-morpheme storage unit 202. Alternatively, as shown in Fig. 12, combinations of surface form, reading, and part of speech may be stored in the forbidden-morpheme storage unit 202.

For example, suppose the English text "Colored pencil leads break easily" is input and the system dictionary contains the morphemes "colored" ... "easily", each with its reading attached. The sequence candidate generation unit 111 then generates the lattice structure 50 shown in Fig. 13 as the combination of the morphemes registered in the system dictionary.

Here, as shown in Fig. 14, for the character string "pencil", the forbidden morpheme "pencil (v) [pensl]", to which the part of speech v and the reading "pensl" that the system user does not want output are attached, is stored in the forbidden-morpheme storage unit 202.

As a result, the prohibition module 114 deletes the forbidden morpheme "pencil (v) [pensl]" from the lattice structure 50, as shown in Fig. 15.

In this way, not only the reading of a word but also its grammatical role can be handled correctly, which improves the naturalness of the output, for example its intonation, when the text is read aloud.
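Matching on part of speech as well as surface form and reading can be sketched with two small record types. The sketch assumes that a forbidden entry with no part of speech (as in Fig. 5) matches any part of speech, which is a convention chosen here rather than one stated in the text; `Morpheme` and `ForbiddenEntry` are hypothetical names, and the noun entry for "pencil" is assumed to share the reading "pensl".

```python
from typing import NamedTuple, Optional

class Morpheme(NamedTuple):
    surface: str
    reading: str
    pos: str

class ForbiddenEntry(NamedTuple):
    surface: str
    reading: str
    pos: Optional[str] = None  # None means "any part of speech" (assumed convention)

    def matches(self, m: Morpheme) -> bool:
        return (
            self.surface == m.surface
            and self.reading == m.reading
            and (self.pos is None or self.pos == m.pos)
        )

forbidden = [ForbiddenEntry("pencil", "pensl", "v")]
alternatives = [Morpheme("pencil", "pensl", "n"), Morpheme("pencil", "pensl", "v")]

# Keep only the alternatives that no forbidden entry matches.
kept = [m for m in alternatives if not any(f.matches(m) for f in forbidden)]
```

Only the verb alternative is removed, so the noun reading of "pencil" remains available to the selection step.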

(embodiment 2)

The language processing system of Embodiment 2 differs from the language processing system shown in Fig. 1 in that, as shown in Fig. 16, a prohibition unit 214 is connected to a sequence candidate generation unit 211. When the system dictionary storage unit 201 holds a morpheme that matches a forbidden morpheme stored in the forbidden-morpheme storage unit 202, the prohibition unit 214 configures the sequence candidate generation unit 211 so that it is forbidden to refer to the matching morpheme registered in the system dictionary.

Therefore, when the text "主記憶上空間が" is input to the sequence candidate generation unit 211, for example, the sequence candidate generation unit 211 does not refer to the morphemes in the system dictionary that match the forbidden morphemes "上空 (うわそら)" and "間 (かん)", and generates a lattice structure 51 that contains no forbidden morphemes from the outset, as shown in Fig. 17. The other components of the language processing system shown in Fig. 16 are the same as in Fig. 1, so their description is omitted.

Similarly, when the Chinese text "你看他拿着火车票" is input to the sequence candidate generation unit 211, the sequence candidate generation unit 211 does not refer to the morphemes in the system dictionary that match the forbidden morphemes "看 (ka1)" and "着火 (zhao2huo3)", and generates a lattice structure 51 that contains no forbidden morphemes from the outset, as shown in Fig. 18.

Likewise, when the English text "Drink much mate" is input to the sequence candidate generation unit 211, the sequence candidate generation unit 211 does not refer to the morpheme in the system dictionary that matches the forbidden morpheme "mate [ma:tei]", and generates a lattice structure 51 that contains no forbidden morphemes from the outset, as shown in Fig. 19.

Furthermore, when the English text "Colored pencil leads break easily" is input to the sequence candidate generation unit 211, the sequence candidate generation unit 211 does not refer to the morpheme in the system dictionary that matches the forbidden morpheme "pencil (v) [pensl]", and generates a lattice structure 51 that contains no forbidden morphemes from the outset, as shown in Fig. 20.
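The difference from Embodiment 1 is where the filtering happens: the forbidden entries are hidden at dictionary lookup, so they never enter the lattice. A minimal sketch of that idea, assuming a wrapper class around the dictionary (the name `FilteredDictionary` and its `lookup` method are hypothetical):

```python
class FilteredDictionary:
    """View of a system dictionary that never returns a forbidden morpheme."""

    def __init__(self, entries, forbidden):
        self._entries = entries            # surface -> list of readings
        self._forbidden = set(forbidden)   # {(surface, reading), ...}

    def lookup(self, surface):
        """Return the readings of `surface` that are not forbidden."""
        return [
            reading
            for reading in self._entries.get(surface, [])
            if (surface, reading) not in self._forbidden
        ]

entries = {"mate": ["meit", "ma:tei"]}
dictionary = FilteredDictionary(entries, {("mate", "ma:tei")})
readings = dictionary.lookup("mate")
```

The underlying entries are left untouched, so changing the forbidden set does not require rebuilding the system dictionary.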

Next, the language processing method of Embodiment 2 is described using the flowchart shown in Fig. 21.

(a) In step S200, the unsegmented text "主記憶上空間が", which contains kanji, is input through the input device 340 shown in Fig. 16 to the sequence candidate generation unit 211 of the CPU 100b. In step S201, when the system dictionary storage unit 201 holds a morpheme that matches a forbidden morpheme stored in the forbidden-morpheme storage unit 202, the prohibition unit 214 configures the sequence candidate generation unit 211 so that it is forbidden to refer to the matching morpheme registered in the system dictionary.

(b) In step S202, the sequence candidate generation unit 211 refers to the system dictionary stored in the system dictionary storage unit 201, decomposes the input text "主記憶上空間が" into a plurality of morphemes, and generates the lattice structure 51 shown in Fig. 17, which is formed from those morphemes. Because step S201 has forbidden the sequence candidate generation unit 211 from referring to morphemes in the system dictionary that match forbidden morphemes, the generated lattice structure 51 contains no forbidden morphemes. The sequence candidate generation unit 211 stores the generated lattice structure 51 in the lattice structure storage unit 203.

(c) In step S203, the optimal sequence selection unit 212 reads the lattice structure 51, which contains no forbidden morphemes, from the lattice structure storage unit 203. The optimal sequence selection unit 212 then uses a search algorithm and heuristics to select from the lattice structure 51 the optimal word sequence whose reading is judged most appropriate. Step S204 is then performed in the same way as step S104, and the language processing method of Embodiment 2 ends.

As described above, the language processing system and language processing method of Embodiment 2 shown in Figs. 16 and 21 can also prevent unwanted readings from being assigned to the input text.

(embodiment 3)

The language processing system of Embodiment 3 differs from the language processing system shown in Fig. 1 in that, as shown in Fig. 22, a prohibition unit 314 is connected to an optimal sequence selection unit 312. When the system dictionary storage unit 201 holds a morpheme that matches a forbidden morpheme stored in the forbidden-morpheme storage unit 202, the prohibition unit 314 configures the optimal sequence selection unit 312 so that it is forbidden to select, as the optimal word sequence, any word sequence candidate that contains a forbidden morpheme. The other components of the language processing system shown in Fig. 22 are the same as in Fig. 1, so their description is omitted.

Then, use the language processing method of flowchart text embodiment 3 shown in Figure 23.

(a) in step S300, generate the text of writing continuously " main note Yi goes up empty Inter Ga " that parts 111 inputs comprise Chinese character to the sequence candidates of CPU100c by input media shown in Figure 1 340.Then, in step S301, sequence candidates generates parts 111 with reference to the system's dictionary that is kept in system's dictionary memory unit 201, will be decomposed into a plurality of morphemes as " main note Yi goes up empty Inter Ga " of input text, and then generate the grid system shown in Figure 2 50 that forms with a plurality of morphemes.Sequence candidates generates parts 111 grid system 50 that generates is saved in the grid system memory unit 203.

(b) in step S302, forbid parts 314 in system's dictionary memory unit 201, preserve be kept at the situation of forbidding the morpheme that morpheme is consistent of forbidding in the morpheme memory unit 202 under, be provided with and forbid optimal sequence alternative pack 312 select to comprise forbid morpheme the word sequence candidate as optimum word sequence.In step S303, optimal sequence alternative pack 312 is read grid system 50 from grid system memory unit 203.Then, optimal sequence alternative pack 312 uses heuristic algorithm and exploratory method, selects to be judged as the immediate optimum word sequence of pronunciation from grid system 50.Then, ground the same implementation step S304, the language processing method of end embodiment 3 with step S104.

As described above, the language processing system and language processing method of embodiment 3 shown in Figures 22 and 23 prevent an undesired pronunciation from being assigned to the input text.
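The selection behaviour of embodiment 3 can be illustrated with a small sketch. It is not the patented implementation: the `Morpheme` record, the per-morpheme `cost` values, and the function `select_optimum` are hypothetical, and a simple cheapest-path search stands in for the heuristics and connection scoring that the description leaves unspecified. The lattice entries are assumed for illustration and do not reproduce Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    start: int    # character offset where the morpheme begins
    end: int      # character offset just past the morpheme
    surface: str
    reading: str
    cost: int     # hypothetical score; lower means more plausible


def select_optimum(lattice, text_len, forbidden):
    """Return the cheapest morpheme path covering the text, skipping forbidden morphemes."""
    inf = float("inf")
    best = [inf] * (text_len + 1)    # best[i]: cheapest cost of reaching offset i
    back = [None] * (text_len + 1)   # back[i]: last morpheme on the cheapest path to i
    best[0] = 0
    # Sorting by start offset guarantees best[m.start] is final when m is visited.
    for m in sorted(lattice, key=lambda m: m.start):
        if (m.surface, m.reading) in forbidden:
            continue  # the selector is prohibited from using this morpheme
        if best[m.start] + m.cost < best[m.end]:
            best[m.end] = best[m.start] + m.cost
            back[m.end] = m
    if best[text_len] == inf:
        return None  # no candidate survives the prohibition
    path, pos = [], text_len
    while pos > 0:
        path.append(back[pos])
        pos = back[pos].start
    return path[::-1]


# Assumed lattice for the text "主記憶上空間が" (7 characters).
LATTICE = [
    Morpheme(0, 1, "主", "しゅ", 1),
    Morpheme(1, 3, "記憶", "きおく", 1),
    Morpheme(3, 4, "上", "じょう", 2),
    Morpheme(3, 5, "上空", "うわそら", 1),
    Morpheme(4, 6, "空間", "くうかん", 2),
    Morpheme(5, 6, "間", "かん", 1),
    Morpheme(6, 7, "が", "が", 1),
]

unrestricted = select_optimum(LATTICE, 7, set())
restricted = select_optimum(LATTICE, 7, {("上空", "うわそら")})
```

With these assumed costs, the unrestricted search picks the path through 上空(うわそら), while forbidding that morpheme leaves the path through 上(じょう) and 空間(くうかん) as the cheapest one.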

(embodiment 4)

The language processing system of embodiment 4 differs from the language processing system shown in Figure 1 in that, as shown in Figure 24, CPU100d further includes an error range specifying part 120 and a forbidden morpheme appending part 121. Suppose, for example, that for the input text "主記憶上空間が" the optimal sequence selecting part 112 has mistakenly selected "主(しゅ) 記憶(きおく) 上空(うわそら) 間(かん) が" as the optimum word sequence. In this case, the error range specifying part 120 accepts from the system user the designation of the misread morphemes, that is, the morphemes in the mistakenly selected optimum word sequence to which an undesired pronunciation has been assigned. For example, when the character string "上空間" is specified, the error range specifying part 120 collates the character string "上空間" with the lattice structure 50, divides it into the morpheme "上空(うわそら)" and the morpheme "間(かん)", and identifies each of them as a misread morpheme. The forbidden morpheme appending part 121 appends the misread morphemes to the forbidden morpheme storing part 202 and saves them there as forbidden morphemes. Figure 25 shows an example of forbidden morphemes appended to the forbidden morpheme storing part in this way. The other components of the language processing system shown in Figure 24 are the same as in Figure 1, so their description is omitted.

Similarly, as shown in Figure 26, suppose that for the Chinese input text "你看他拿着火车票" the optimal sequence selecting part 112 has mistakenly selected "你(ni3)" "看(kan4)" "他(ta1)" "拿(na2)" "着火(zhao2huo3)" "车票(che1piao4)" as the optimum word sequence. The error range specifying part 120 accepts from the system user the designation of the misread morphemes in the mistakenly selected optimum word sequence to which an undesired pronunciation has been assigned. For example, when the character string "着火车票" is specified, the error range specifying part 120 collates the character string "着火车票" with the lattice structure 50, divides it into the morpheme "着火(zhao2huo3)" and the morpheme "车票(che1piao4)", and identifies each of them as a misread morpheme. The forbidden morpheme appending part 121 appends the misread morphemes to the forbidden morpheme storing part 202 and saves them there as forbidden morphemes.

In addition, as shown in Figure 27, suppose that for the English input text "Drink much mate" the optimal sequence selecting part 112 has mistakenly selected "drink (v)" "much (adv)" "mate (n) [ma:tei]" as the optimum word sequence. The error range specifying part 120 accepts from the system user the designation of the misread morpheme in the mistakenly selected optimum word sequence to which an undesired pronunciation has been assigned. For example, when the character string "mate" is specified, the error range specifying part 120 collates the character string "mate" with the lattice structure 50, determines the morpheme "mate (n) [meit]", and identifies it as the misread morpheme. The forbidden morpheme appending part 121 appends the misread morpheme to the forbidden morpheme storing part 202 and saves it there as a forbidden morpheme.

Next, the language processing method of embodiment 4 is described with reference to the flowchart shown in Figure 28.

(a) Steps S400 and S401 shown in Figure 28 are carried out in the same way as steps S100 and S101 shown in Figure 11. In step S402, the prohibiting part 114 shown in Figure 24 reads the lattice structure from the lattice structure storing part 203. From the plurality of morphemes contained in the lattice structure, the prohibiting part 114 then deletes every morpheme that corresponds to a forbidden morpheme held in the forbidden morpheme storing part 202. It is assumed that, at this point, the morphemes "上空(うわそら)" and "間(かん)" are not yet held in the forbidden morpheme storing part 202. The prohibiting part 114 then writes the lattice structure from which the forbidden morphemes have been deleted back to the lattice structure storing part 203.

(b) In step S403, the optimal sequence selecting part 112 reads from the lattice structure storing part 203 the lattice structure from which the forbidden morphemes have been deleted. The optimal sequence selecting part 112 then uses heuristics and a search method to select, from the lattice structure shown in Figure 8 from which the forbidden morphemes have been deleted, the optimum word sequence judged to have the most plausible pronunciation. Here, the optimal sequence selecting part 112 selects "主(しゅ) 記憶(きおく) 上空(うわそら) 間(かん) が" as the optimum word sequence. The optimal sequence selecting part 112 then saves this mistakenly selected optimum word sequence in the optimal sequence storing part 204, and the output device 341 outputs it.

(c) In step S404, the error range specifying part 120 accepts the input of an error range from the system user through the input device 340. When the system user inputs, as the error range, the character string "上空間" contained in the mistakenly selected optimum word sequence "主(しゅ) 記憶(きおく) 上空(うわそら) 間(かん) が", the error range specifying part 120 collates the character string "上空間" with the lattice structure 50, divides it into the morpheme "上空(うわそら)" and the morpheme "間(かん)", and identifies each of them as a misread morpheme. The error range specifying part 120 then transfers the misread morphemes to the forbidden morpheme appending part 121.

Similarly, for the Chinese input text "你看他拿着火车票", when the system user inputs, as the error range, the character string "着火车票" contained in the mistakenly selected optimum word sequence "你(ni3)" "看(kan4)" "他(ta1)" "拿(na2)" "着火(zhao2huo3)" "车票(che1piao4)", the error range specifying part 120 collates the character string "着火车票" with the lattice structure 50, divides it into the morpheme "着火(zhao2huo3)" and the morpheme "车票(che1piao4)", and identifies each of them as a misread morpheme. The error range specifying part 120 then transfers the misread morphemes to the forbidden morpheme appending part 121.

Similarly, for the English input text "Drink much mate", when the system user inputs, as the error range, the character string "mate" contained in the mistakenly selected optimum word sequence "drink (v)" "much (adv)" "mate (n) [ma:tei]", the error range specifying part 120 collates the character string "mate" with the lattice structure 50, determines the morpheme "mate (n) [meit]", and identifies it as the misread morpheme. The error range specifying part 120 then transfers the misread morpheme to the forbidden morpheme appending part 121.

(d) In step S405, the forbidden morpheme appending part 121 saves the misread morpheme "上空(うわそら)" and the misread morpheme "間(かん)" in the forbidden morpheme storing part 202, each as a forbidden morpheme, and the language processing method of embodiment 4 ends.

As described above, with the language processing system and language processing method of embodiment 4 shown in Figures 24 and 28, a word sequence candidate containing the forbidden morpheme "上空(うわそら)" or the forbidden morpheme "間(かん)" is no longer selected as the optimum word sequence from the next run onward.
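The interplay between step S402 (deleting forbidden morphemes from the lattice) and step S405 (appending misread morphemes to the store) can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary-based lattice entries and the helper names `prune_lattice` and `append_forbidden` are hypothetical, and the forbidden morpheme storing part 202 is modelled as a plain Python set.

```python
def prune_lattice(lattice, forbidden):
    """Sketch of step S402: drop every lattice entry that matches a forbidden morpheme."""
    return [m for m in lattice if (m["surface"], m["reading"]) not in forbidden]


def append_forbidden(forbidden, misread):
    """Sketch of step S405: save each misread morpheme as a forbidden morpheme."""
    for m in misread:
        forbidden.add((m["surface"], m["reading"]))


# Assumed lattice entries for "主記憶上空間が" (not taken from the figures).
lattice = [
    {"surface": "主", "reading": "しゅ"},
    {"surface": "記憶", "reading": "きおく"},
    {"surface": "上", "reading": "じょう"},
    {"surface": "上空", "reading": "うわそら"},
    {"surface": "空間", "reading": "くうかん"},
    {"surface": "間", "reading": "かん"},
    {"surface": "が", "reading": "が"},
]

forbidden = set()                                 # empty on the first run
first_pass = prune_lattice(lattice, forbidden)    # nothing is deleted yet

# Morphemes the user flagged through the error range in step S404.
misread = [lattice[3], lattice[5]]
append_forbidden(forbidden, misread)

second_pass = prune_lattice(lattice, forbidden)   # the next run no longer sees them
```

On the second pass the entries 上空(うわそら) and 間(かん) are gone, so no candidate built from them can be selected.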

The error range specified in step S404 does not have to coincide with morpheme boundaries in the optimum word sequence. Specifically, instead of "上空(うわそら)間(かん)", "空(そら)間(かん)" may be specified as the error range. In this case, the forbidden morpheme appending part 121 can save the morpheme "上空(うわそら)", which partly contains the "空(そら)" specified as the error range, in the forbidden morpheme storing part 202 as a forbidden morpheme. Embodiment 4 has been described as an example in which the error range specifying part 120 and the forbidden morpheme appending part 121 are added to the language processing system shown in Figure 1, but they can of course also be added to the language processing system shown in Figure 10 or Figure 22.

Similarly, for the error range specified in step S404 in the Chinese example, as shown in Figure 29, "火车票" may be specified as the error range instead of "着火车票". In this case, the forbidden morpheme appending part 121 can also save the morpheme "着火(zhao2huo3)", which partly contains the "火" specified as the error range, in the forbidden morpheme storing part 202 as a forbidden morpheme.
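One way to picture how an error range, whether aligned with morpheme boundaries or not, maps onto misread morphemes is an interval-overlap check. The sketch below is an assumption-laden illustration rather than the method of the error range specifying part 120: the tuple layout `(start, end, surface, reading)` and the function `overlapping_morphemes` are hypothetical, and only the first occurrence of the specified string is considered.

```python
def overlapping_morphemes(text, selected, error_string):
    """Return the selected morphemes that overlap the specified error range.

    `selected` holds (start, end, surface, reading) tuples with character offsets
    into `text`. A morpheme that is only partly covered still counts as misread.
    """
    start = text.find(error_string)   # first occurrence only (simplification)
    if start < 0:
        return []
    end = start + len(error_string)
    return [m for m in selected if m[0] < end and start < m[1]]


TEXT = "主記憶上空間が"
SELECTED = [
    (0, 1, "主", "しゅ"),
    (1, 3, "記憶", "きおく"),
    (3, 5, "上空", "うわそら"),
    (5, 6, "間", "かん"),
    (6, 7, "が", "が"),
]

exact = overlapping_morphemes(TEXT, SELECTED, "上空間")   # range on morpheme boundaries
partial = overlapping_morphemes(TEXT, SELECTED, "空間")   # range cuts through 上空
```

Both calls return 上空(うわそら) and 間(かん): the partial range "空間" covers only the second character of 上空, yet that morpheme is still reported.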

(embodiment 5)

The language processing system of embodiment 5 differs from the language processing system shown in Figure 1 in that, as shown in Figure 30, CPU100e further includes a pronunciation input part 122, a comparison extracting part 123, and the forbidden morpheme appending part 121. Suppose that for the input text "主記憶上空間が" the optimal sequence selecting part 112 has mistakenly selected "主(しゅ) 記憶(きおく) 上空(うわそら) 間(かん) が" as the optimum word sequence. In this case, the pronunciation input part 122 accepts from the system user the input of the correct pronunciation "しゅきおくじょうくうかんが" of the input text "主記憶上空間が". The comparison extracting part 123 compares the pronunciation of the mistakenly selected optimum word sequence with the correct pronunciation and extracts the difference part "うわそら", that is, the part of the pronunciation of the mistakenly selected optimum word sequence that differs from the correct pronunciation. The forbidden morpheme appending part 121 saves the misread morpheme "上空(うわそら)", to which the pronunciation of the difference part "うわそら" has been assigned, in the forbidden morpheme storing part 202 as a forbidden morpheme. The other components of the language processing system shown in Figure 30 are the same as in Figure 1, so their description is omitted.

Similarly, as shown in Figure 31, suppose that for the Chinese input text "你看他拿着火车票" the optimal sequence selecting part 112 has mistakenly selected "你(ni3)" "看(kan4)" "他(ta1)" "拿(na2)" "着火(zhao2huo3)" "车票(che1piao4)" as the optimum word sequence. In this case, the pronunciation input part 122 accepts from the system user the input of the correct pronunciation "ni3 kan4 ta1 na2 zhe huo3che1 piao4" of the input text "你看他拿着火车票". The comparison extracting part 123 compares the pronunciation of the mistakenly selected optimum word sequence with the correct pronunciation and extracts the difference part "zhe huo3 che1 piao4", that is, the part that differs from the correct pronunciation. The forbidden morpheme appending part 121 saves the misread morphemes "着火(zhao2huo3)" and "车票(che1piao4)", which correspond to the difference part "zhe huo3 che1 piao4", in the forbidden morpheme storing part 202 as forbidden morphemes.

In addition, as shown in Figure 32, suppose that for the English input text "Drink much mate" the optimal sequence selecting part 112 has mistakenly selected "drink (v)" "much (adv)" "mate (n) [ma:tei]" as the optimum word sequence. In this case, the pronunciation input part 122 accepts from the system user the input of the correct pronunciation "drink mats meit" of the input text "Drink much mate". The comparison extracting part 123 compares the pronunciation of the mistakenly selected optimum word sequence with the correct pronunciation and extracts the difference part "meit", that is, the part that differs from the correct pronunciation. The forbidden morpheme appending part 121 saves the misread morpheme "mate (n) [meit]", to which the pronunciation of the difference part "meit" has been assigned, in the forbidden morpheme storing part 202 as a forbidden morpheme.

Next, the language processing method of embodiment 5 is described with reference to the flowchart shown in Figure 33.

(a) Steps S500 to S503 shown in Figure 33 are carried out in the same way as steps S400 to S403 shown in Figure 28, and it is assumed that the optimal sequence selecting part 112 has mistakenly selected "主(しゅ) 記憶(きおく) 上空(うわそら) 間(かん) が" as the optimum word sequence. The optimal sequence selecting part 112 then saves the mistakenly selected optimum word sequence in the optimal sequence storing part 204, and the output device 341 outputs it.

(b) In step S504, the pronunciation input part 122 accepts from the system user, through the input device 340, the input of the correct pronunciation "しゅきおくじょうくうかんが" of the text "主記憶上空間が". The pronunciation input part 122 saves the correct pronunciation "しゅきおくじょうくうかんが" in the pronunciation storing part 205. In step S405, the comparison extracting part 123 reads the mistakenly selected optimum word sequence "主(しゅ) 記憶(きおく) 上空(うわそら) 間(かん) が" from the optimal sequence storing part 204 and reads the correct pronunciation "しゅきおくじょうくうかんが" from the pronunciation storing part 205. The comparison extracting part 123 then compares the pronunciation of the mistakenly selected optimum word sequence with the correct pronunciation and extracts the difference part "うわそら", that is, the part that differs from the correct pronunciation.

(c) In step S505, the comparison extracting part 123 transfers to the forbidden morpheme appending part 121 the misread morpheme "上空(うわそら)", which is contained in the mistakenly selected optimum word sequence and to which the pronunciation of the difference part "うわそら" has been assigned. The forbidden morpheme appending part 121 saves the misread morpheme "上空(うわそら)" in the forbidden morpheme storing part 202 as a forbidden morpheme, and the language processing method of embodiment 5 ends.

As described above, with the language processing system and language processing method of embodiment 5 shown in Figures 30 and 33, a word sequence candidate containing the forbidden morpheme "上空(うわそら)" is no longer selected as the optimum word sequence from the next run onward. Embodiment 5 has been described as an example in which the pronunciation input part 122, the comparison extracting part 123, and the forbidden morpheme appending part 121 are added to the language processing system shown in Figure 1, but they can of course also be added to the language processing system shown in Figure 16 or Figure 22.
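The comparison performed in embodiment 5 can be approximated by trimming the common prefix and suffix of the two pronunciations and reporting the morphemes that fall inside what remains. The description does not say how the comparison extracting part 123 aligns the readings, so the sketch below is only one possible simplification; the function `extract_misread` and the `(surface, reading)` pair layout are hypothetical.

```python
def extract_misread(selected, correct_reading):
    """Return the (surface, reading) pairs whose reading lies in the difference part.

    The difference part is what is left of the wrong reading after removing the
    prefix and suffix it shares with the correct reading (a simplifying assumption).
    """
    wrong = "".join(reading for _, reading in selected)
    limit = min(len(wrong), len(correct_reading))

    prefix = 0
    while prefix < limit and wrong[prefix] == correct_reading[prefix]:
        prefix += 1

    suffix = 0
    while (suffix < limit - prefix
           and wrong[-1 - suffix] == correct_reading[-1 - suffix]):
        suffix += 1

    diff_start, diff_end = prefix, len(wrong) - suffix

    misread, pos = [], 0
    for surface, reading in selected:
        nxt = pos + len(reading)
        if pos < diff_end and diff_start < nxt:   # reading overlaps the difference part
            misread.append((surface, reading))
        pos = nxt
    return misread


SELECTED = [
    ("主", "しゅ"),
    ("記憶", "きおく"),
    ("上空", "うわそら"),
    ("間", "かん"),
    ("が", "が"),
]
CORRECT = "しゅきおくじょうくうかんが"

misread = extract_misread(SELECTED, CORRECT)
```

For the Japanese example the shared prefix is "しゅきおく" and the shared suffix is "かんが", so the difference part is "うわそら" and only 上空(うわそら) is reported. An alignment this coarse would not reproduce the two-morpheme result given for the Chinese example, which suggests the actual comparison works differently there.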

(other embodiment)

Embodiments of the invention have been described above, but the descriptions and drawings that form part of this disclosure should not be understood as limiting the invention. Various alternative forms of implementation, embodiments, and application techniques will be apparent to those skilled in the art from this disclosure. For example, it has been explained that the pronunciation input part 122 shown in Figure 30 accepts from the system user the input of the correct pronunciation of the input text. Alternatively, the pronunciation input part 122 may accept from the system user the input of a morpheme that forms part of the input text and to which the correct pronunciation has been assigned. For example, when the optimal sequence selecting part 112 has mistakenly selected "主(しゅ) 記憶(きおく) 上空(うわそら) 間(かん) が" as the optimum word sequence, the pronunciation input part 122 may accept the input of the morpheme "空間(くうかん)" with the correct pronunciation assigned, and the comparison extracting part 123 may extract the morphemes "上空(うわそら)" and "間(かん)", which are inconsistent with the morpheme "空間(くうかん)".

Similarly, as shown in Figure 34, when for the Chinese input text "你看他拿着火车票" the optimal sequence selecting part 112 has mistakenly selected "你(ni3)" "看(kan4)" "他(ta1)" "拿(na2)" "着火(zhao2huo3)" "车票(che1piao4)" as the optimum word sequence, the pronunciation input part 122 may accept the input of the morpheme "火车票(huo3 che1 piao4)" with the correct pronunciation assigned, and the comparison extracting part 123 may extract the morphemes "着火(zhao2huo3)" and "车票(che1piao4)", which are inconsistent with the morpheme "火车票(huo3 che1 piao4)".

Similarly, as shown in Figure 35, when for the English input text "Drink much mate" the optimal sequence selecting part 112 has mistakenly selected "drink (v)" "much (adv)" "mate (n) [ma:tei]" as the optimum word sequence, the pronunciation input part 122 may accept the input of the morpheme "mate (n) [meit]" with the correct pronunciation assigned, and the comparison extracting part 123 may extract the morpheme "mate (n) [ma:tei]", which is inconsistent with the morpheme "mate (n) [meit]".

In the embodiments, an example has been shown in which the audio file generating part 116 generates an audio file for outputting the pronunciation of the optimum word sequence. However, the system does not have to generate the audio file directly from the optimum word sequence; it may instead generate a pronunciation information (phonetic symbol) file from the optimum word sequence and then generate the audio file from that phonetic symbol file. Figure 1 shows an example in which the speaker 342 is connected to CPU100a, but the speaker 342 does not have to be connected to CPU100a, and the generated audio file can of course be used on another computer or audio system.

The language processing method described above can be expressed as a series of processes or operations connected in time sequence. Therefore, to execute the language processing method on CPU100a shown in Figure 1, the language processing method shown in Figure 5 can be realized by a computer program product that specifies the functions performed by the processor and other components in CPU100a. Here, the computer program product is a recording medium, a recording device, or the like that CPU100a can read from and write to. Recording media include memory devices, magnetic disk devices, optical disk devices, and other devices capable of recording a program. The invention thus naturally encompasses various embodiments that are not described here. Accordingly, the technical scope of the invention is determined solely by the invention-specifying matters of the claims that are appropriate in view of the above description.

Claims

1. A language processing system, characterized by comprising:

a forbidden morpheme storing part that stores a use-forbidden morpheme;

a sequence candidate generating part that generates, from a continuously written text, a plurality of word sequence candidates each written separately as a plurality of morphemes; and

an optimal sequence selecting part that reads the use-forbidden morpheme from the forbidden morpheme storing part, excludes from the plurality of word sequence candidates any candidate containing the use-forbidden morpheme, and selects, from among the plurality of word sequence candidates, the optimum word sequence for which the likelihood of connection between the plurality of morphemes is the highest.

2. A language processing system, characterized by comprising:

a forbidden morpheme storing part that stores a use-forbidden morpheme;

a sequence candidate generating part that reads the use-forbidden morpheme stored in the forbidden morpheme storing part, prohibits use of the use-forbidden morpheme, and generates, from a continuously written text, a plurality of word sequence candidates each written separately as a plurality of morphemes; and

an optimal sequence selecting part that selects, from among the plurality of word sequence candidates, the optimum word sequence for which the likelihood of connection between the plurality of morphemes is the highest.

3. language processing system according to claim 1 and 2 is characterized in that also comprising:

Accept the error range specified parts that the quilt in the above-mentioned optimum word sequence has added the appointment of misreading morpheme of the pronunciation different with the correct pronunciation of above-mentioned text.

4. language processing system according to claim 1 and 2 is characterized in that also comprising:

The pronunciation of above-mentioned optimum word sequence and the correct pronunciation of above-mentioned text are compared, from above-mentioned optimum word sequence, extract the contrast of misreading morpheme that has been added the pronunciation different out and extract parts out with above-mentioned correct pronunciation.

5. The language processing system according to claim 3, characterized by further comprising:

a forbidden morpheme appending part that appends the misread morpheme to the forbidden morpheme storing part and saves it there as the use-forbidden morpheme.

6. The language processing system according to claim 4, characterized by further comprising: