CN101470700A - Text template generator, text generation equipment, text checking equipment and method thereof - Google Patents

Publication number
CN101470700A
Authority
CN
China
Prior art keywords
speech
text
usage
search
groove position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007103066231A
Other languages
Chinese (zh)
Inventor
靳简明
吴根清
许荔秦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC China Co Ltd
Original Assignee
NEC China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC China Co Ltd filed Critical NEC China Co Ltd
Priority to CNA2007103066231A priority Critical patent/CN101470700A/en
Publication of CN101470700A publication Critical patent/CN101470700A/en
Pending legal-status Critical Current

Abstract

The invention relates to a text template generator, a text generation device, a text inspection device and methods thereof. The text template generator comprises a slot position determination unit for determining, according to constraint conditions, the positions of the words in an input text that need to be replaced, as slot positions, and a target replacement determining unit for determining, according to the constraint conditions, the targets that replace the slot positions, so as to generate a text template containing the targets; the desired template can thus be generated under different constraint conditions. The invention further provides a text generation device and method, a text inspection device and method for checking whether a text conforms to idiomatic usage, and a system and method for generating text that conforms to idiomatic usage, so that it can be determined whether an input text conforms to idiomatic usage and text that conforms to idiomatic usage can be output.

Description

Text template generator, text generation device, text inspection device and methods thereof
Technical field
The present invention relates to the technical field of natural language processing and, more specifically, to a text template generator and method thereof, a text generation device and method thereof, a text inspection device and method thereof for checking whether a text conforms to idiomatic usage, and a system and method thereof for generating text that conforms to idiomatic usage.
Background art
With the widespread adoption of computers and the growing use of the Internet, computers are becoming increasingly common in every field. Many everyday applications employ natural language processing techniques, for example text classification systems and text search engines.
A text classification system sorts texts into different classes, where texts in the same class share common features. Depending on the application, the texts to be classified may be articles, e-mails, short messages, sentences, phrases and so on, and the classification features may be semantic, formal, syntactic and so on. For example, an anti-spam e-mail system, which identifies and blocks spam, is a text classification system. A short message labeling system is also a text classification system: it can attach different labels, such as urgent, spam or private, to short messages that are sent or received. Specific operations can then be carried out according to the label of a short message. For example, a mobile phone always rings after receiving a short message labeled urgent, and a short message labeled private cannot be viewed unless the correct password is entered. The number of training samples is a key factor affecting a text classification system; in general, the more corpus the system uses, the more accurate it is. Building a sufficient corpus for a text classification system is therefore very important. Corpus construction is time-consuming and tedious, so generating idiomatic text for use as corpus would be very useful.
A text search engine searches for all documents relevant to an input query text. Usually a search engine only finds documents that contain the query text exactly; that is, it cannot find documents that do not contain the query text even though they are closely related to it. A generation method that can produce related query texts would therefore improve the performance of a search engine.
Generating idiomatic text usually involves two main steps: a text generation step and an idiomatic-text checking step.
Existing text generation methods include grammar-based methods, template-based methods and statistics-based methods.
A grammar-based method first determines what the generated text should convey (that is, the concepts, where a concept is a semantic unit); second, it computes the relations between the concepts; third, it generates the syntactic structure of the text according to those relations; fourth, it generates the text describing each concept; finally, it generates the actual text according to the syntactic structure. Grammars that have been used include standardized grammar, phrase structure grammar, systemic grammar, tree-adjoining grammar, generalized augmented transition network grammar, categorial grammar and others. Grammar-based methods are relatively effective, but the grammars are hard to build and are language-dependent.
Template-based methods are used in settings where messages of similar structure are generated frequently. Usually the structure of the generated text is fixed, or a prototype text is given, and some open fields are filled in a specific restricted way. A typical setting is generating weather forecast text, for example: "Today it is _ degrees and the weather is _." This method is easy to implement but can only be used in specific settings.
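The template-based approach described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function name fill_template, the use of "_" as the open-field marker and the sample values are assumptions made for the example.

```python
def fill_template(template: str, values: list[str], slot: str = "_") -> str:
    """Fill the open fields of a fixed template, left to right."""
    parts = template.split(slot)
    if len(parts) - 1 != len(values):
        raise ValueError("number of values must match number of open fields")
    pieces = [parts[0]]
    for value, tail in zip(values, parts[1:]):
        pieces.append(value)   # value placed in the open field
        pieces.append(tail)    # fixed text that follows the field
    return "".join(pieces)


# Hypothetical weather forecast template and values.
forecast = fill_template("Today it is _ degrees and the weather is _.", ["21", "sunny"])
print(forecast)
```

The structure of the output is fixed by the template, which is why the approach is simple to implement but limited to settings where such a template exists.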
A statistics-based method generates text according to language statistics such as N-gram models and entropy information. The method generates text concept by concept, that is, each step generates the text describing one concept. If a concept can be described in several ways, the most probable description is selected using language statistics, according to the text generated so far and the concepts that still need to be generated. The method of selecting the most probable description is language-independent and easy to implement, but the method of generating the concept descriptions is language-dependent and hard to implement.
Existing idiomatic-text checking methods include semantics-based methods and methods based on word-class collocation.
A semantics-based method uses a semantic dictionary to check whether a word collocation is reasonable. For example, "watch TV" is a reasonable collocation, but "eat TV" is not.
A method based on word-class collocation checks reasonableness according to word-class collocation information. For example, part of speech is one kind of word class: the collocation pattern "verb + noun" is reasonable, but the collocation pattern "adjective + verb" is not.
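A word-class collocation check of this kind can be sketched as below. The small word-class table and the set of reasonable patterns are hypothetical stand-ins for the collocation dictionary a linguist would build; only the "verb + noun" and "adjective + verb" patterns come from the text.

```python
# Hypothetical word-class table (part of speech used as the word class).
WORD_CLASS = {
    "watch": "verb",
    "eat": "verb",
    "run": "verb",
    "TV": "noun",
    "happy": "adjective",
}

# Collocation patterns regarded as reasonable; "verb + noun" is from the text.
REASONABLE_PATTERNS = {("verb", "noun")}


def is_reasonable(first: str, second: str) -> bool:
    """Check a two-word collocation against the allowed class patterns."""
    pattern = (WORD_CLASS.get(first), WORD_CLASS.get(second))
    return pattern in REASONABLE_PATTERNS


print(is_reasonable("watch", "TV"))   # verb + noun
print(is_reasonable("happy", "run"))  # adjective + verb
print(is_reasonable("eat", "TV"))     # also verb + noun
```

Because the check looks only at word classes, "eat TV" is accepted just like "watch TV"; a semantic dictionary would be needed to tell the two apart.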
Japanese patent application JP11328180 proposes a method for supporting sentence generation in a target language using sentence structure frames, and sentence generation using example sentences corresponding to the sentence structure frames. When the user inputs a predicate that serves as the main verb of the sentence to be generated, a sentence structure frame retrieval section retrieves, from a sentence structure frame database storage section, the sentence structure frames available for that predicate, and the frames are then listed and displayed. When a slot of a sentence structure frame selected from the list is chosen and a noun phrase is input, a noun phrase analysis section extracts a keyword from the noun phrase, obtains the semantic information of the keyword from an analysis grammar dictionary storage section, and sends it to a semantic information matching section. When the semantic information from the noun phrase analysis section matches the semantic restriction information of the slot, an edit control section fixes the noun phrase of the slot, and when the noun phrases of all slots have been fixed, generation of the target sentence is complete. In short, the text generated in that patent application is a sentence with the same structure as the input sentence. It uses grammar-based and template-based text generation, because the sentence structure is analyzed and the sentence is generated according to the structural information, and it uses idiomatic-text checking based on word-class collocation, because part-of-speech collocation information and word-class semantic information are used.
Japanese patent application JP2064859 proposes a method that, in addition to presenting synonyms, presents synonymous expressions for an idiomatic expression when a text element is part of that idiomatic expression, so that text can be rewritten effectively; an idiomatic expression dictionary and a synonymous expression dictionary are used to rewrite Japanese sentences. The text generated in that patent application is a synonymous expression of the input sentence. Because only some phrases can be changed into synonymous expressions, the application uses a template-based text generation method; because it uses an idiomatic expression dictionary and a synonymous expression dictionary, it uses a semantics-based idiomatic-text checking method.
The paper (Use of statistical N-gram models in natural language generation for machine translation. Fu-Hua Liu, Liang Gu, Yuqing Gao, Picheny, M. IBM T.J. Watson Res. Center, Yorktown Heights, NY, USA. Proceedings of 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003, pages I-636 to I-639, vol. 1) describes various language modeling problems that arise in a speech-to-speech translation system. In that paper, when a statistical natural language generation model based on maximum entropy is used to generate target-language sentences as translation output, various inflection and synonym problems appear, because a compromise is made in the semantic representation to avoid the data sparseness problem. The paper uses an N-gram model as a post-processing step to improve generation performance. The generated text is a sentence or phrase for machine translation; because maximum entropy and N-gram models are used, the paper adopts a statistics-based text generation method and a semantics-based idiomatic-text checking method.
In summary, among text generation methods, grammar-based methods have many applications, but grammars are hard to build and differ from language to language. Statistics-based methods are language-independent, but the quality of the generated text is low. Template-based methods suit restricted conditions, for example where the structure of the generated text is fixed or a sample text is provided. A template-based method can only generate text that satisfies a fixed template (constraint condition); there is no corresponding method for changing or generating the template (constraint condition).
For the method for inspection of usage text, comparatively useful based on method of semantic, but the structure of semantic dictionary is very expensive.Method based on speech classification collocation is comparatively coarse, and needs the linguist that speech is divided into class and makes up dictionary of collocations.Can not determine accurately whether text meets usage.
Summary of the invention
The present invention is proposed to solve the above problems. It can generate a text template according to constraint conditions, can check whether a text conforms to idiomatic usage, and can also generate text that conforms to idiomatic usage.
According to a first aspect of the invention, a text template generator is proposed, comprising:
a slot position determination unit for determining, according to constraint conditions, the positions of the words in the input text that need to be replaced, as slot positions; and
a target replacement determining unit for determining, according to the constraint conditions, the targets that replace the slot positions, thereby generating a text template containing the targets.
According to a second aspect of the invention, a text template generation method is proposed, comprising:
a slot position determining step of determining, according to constraint conditions, the positions of the words in the input text that need to be replaced, as slot positions; and
a target replacement determining step of determining, according to the constraint conditions, the targets that replace the slot positions, thereby generating a text template containing the targets.
According to a third aspect of the invention, a text generation device is proposed, comprising:
a text template generator for generating a text template from the analyzed text according to constraint conditions; and
a word-replacement-based text generating apparatus for generating text according to the text template, using a dictionary associated with the constraint conditions.
According to a fourth aspect of the invention, a text generation method is proposed, comprising:
a text template generation step of generating a text template from the analyzed text according to constraint conditions; and
a word-replacement-based text generation step of generating text according to the text template, using a dictionary associated with the constraint conditions.
According to a fifth aspect of the invention, a text inspection device for checking whether a text conforms to idiomatic usage is proposed, comprising:
a word selection unit for selecting the words to be checked from the segmented text;
a word pair generation unit for generating the word pairs related to each of the words to be checked;
a word usage intensity calculation unit for calculating, according to the occurrence count of each word pair, the word usage intensity of each word to be checked that was selected from the text; and
a text usage intensity calculation unit for calculating the text usage intensity of the text according to the word usage intensities.
According to a sixth aspect of the invention, a text inspection method for checking whether a text conforms to idiomatic usage is proposed, comprising:
a word selection step of selecting the words to be checked from the segmented text;
a word pair generation step of generating the word pairs related to each of the words to be checked;
a word usage intensity calculation step of calculating, according to the occurrence count of each word pair, the word usage intensity of each word to be checked that was selected from the text; and
a text usage intensity calculation step of calculating the text usage intensity of the text according to the word usage intensities.
According to a seventh aspect of the invention, a system for generating idiomatic text is proposed, comprising:
a text generation device according to the invention, for generating text;
a text inspection device according to the invention, for judging whether the generated text is idiomatic text; and
a text selection device for selecting the idiomatic text according to the judgment result.
According to an eighth aspect of the invention, a method for generating idiomatic text is proposed, comprising the steps of:
generating text by the text generation method according to the invention;
judging, by the text inspection method according to the invention, whether the generated text is idiomatic text; and
selecting the idiomatic text according to the judgment result.
Description of drawings
Fig. 1a is a schematic diagram of a text generation device according to the invention that generates text from an input text and constraint conditions;
Fig. 1b is a schematic diagram of a text inspection device according to the invention that produces a text usage intensity from an input text;
Fig. 1c is a schematic diagram of an idiomatic text generation system according to the invention that generates text conforming to idiomatic usage from an input text and constraint conditions;
Fig. 2 is a hardware structure diagram of an example system for generating idiomatic text according to the invention;
Fig. 3 is a structural diagram of the text generation device according to the invention;
Fig. 4 is a structural diagram of the text analyzer of the text generation device according to the invention;
Fig. 5a is a structural diagram of the text template generator of the text generation device according to the invention;
Fig. 5b is a flowchart of text template generation by the text template generator;
Fig. 6 is a schematic diagram of the word-replacement-based text generating apparatus of the text generation device according to the invention;
Fig. 7 is a flowchart of the word-replacement-based text generation method according to the invention;
Fig. 8 is a detailed structural diagram of the text inspection device according to the invention;
Fig. 9 is a flowchart of text checking by the text inspection device according to the invention;
Fig. 10 is a schematic diagram of the word pair search unit according to the invention;
Fig. 11 is a flowchart of the word pair search performed by the word pair search unit according to the invention;
Fig. 12 is a schematic diagram of one application of the invention;
Fig. 13 is a schematic diagram of another application of the invention.
Detailed description of the embodiments
Preferred embodiments of the present invention are described below with reference to the drawings. In the drawings, identical elements are denoted by identical reference symbols or numerals. In the following description, detailed descriptions of known functions and configurations are omitted so as not to obscure the subject matter of the invention.
Fig. 1a is a schematic diagram of a text generation device according to the invention that generates text from an input text and constraint conditions. Referring to Fig. 1a, text generation device 110 according to the invention generates text 904 that satisfies the constraint conditions from input text 901 and text generation constraint conditions 902. A text may be a word, a phrase or a sentence.
Fig. 1b is a schematic diagram of a text inspection device according to the invention that produces a text usage intensity from an input text. Referring to Fig. 1b, text inspection device 220 according to the invention checks input text 901 and outputs the text usage intensity 905 corresponding to the input text, so that whether the input text conforms to idiomatic usage can be determined from the usage intensity.
Fig. 1c is a schematic diagram of an idiomatic text generation system according to the invention that generates text conforming to idiomatic usage from an input text and constraint conditions. Referring to Fig. 1c, text generation device 110 generates text 904 that satisfies the predetermined constraint conditions and supplies it to text inspection device 220; text inspection device 220 checks whether the text 904 generated by text generation device 110 conforms to idiomatic usage; text selection device 230 selects the text that conforms to idiomatic usage from the check results of text inspection device 220 and outputs it.
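The generate, check and select flow of Fig. 1c can be sketched as follows. This is only an illustration under stated assumptions: the names select_idiomatic and toy_intensity, the scores, and the use of a numeric threshold to decide conformity are hypothetical; the text only says that selection follows the check result.

```python
from typing import Callable, Iterable, List


def select_idiomatic(
    candidates: Iterable[str],
    usage_intensity: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep the generated texts whose usage intensity reaches the threshold."""
    return [text for text in candidates if usage_intensity(text) >= threshold]


def toy_intensity(text: str) -> float:
    """Stand-in for the text inspection device; the scores are made up."""
    scores = {"watch TV": 0.9, "eat TV": 0.1}
    return scores.get(text, 0.0)


generated = ["watch TV", "eat TV"]  # stand-in for the text generation device output
selected = select_idiomatic(generated, toy_intensity, threshold=0.5)
print(selected)
```

Passing the scoring function as a parameter mirrors the separation between text inspection device 220 and text selection device 230.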
Fig. 2 shows the hardware structure of the system of Fig. 1c for generating idiomatic text. The system may be, for example, a computer system running a specific program. Reference numeral 09 denotes the key components of the system. The system comprises: a CPU 01, which provides computing capability for application programs; an internal bus 04, over which the system exchanges data between memory 06 and permanent storage 07 (which may be a hard disk or flash memory); an input device 03, for example a keyboard for key input or a microphone for voice input, which accepts the user's input text 901 and text generation constraint conditions 902; an output device (not shown); and auxiliary components 02. Storage 07 stores operating system files 071, idiomatic text generation system files 073, the generated text 904, relation dictionary 52, local database 53, local corpus 56 and other files 072 that support the operation of the system. Memory 06 holds operating system 061, idiomatic text generation system 063 and other application programs 062. The system further comprises a network card 05 and an Internet search engine 55. The system interacts with the Internet 08 through network card 05 in order to search web pages or other websites 081 by means of Internet search engine 55. As in Fig. 1c, this system can be used to generate text that conforms to idiomatic usage.
Fig. 3 shows the detailed structure of the text generation device according to the invention. Text generation device 110 comprises: a text analyzer 11 for analyzing input text 901, for example by word segmentation, part-of-speech tagging and grammatical analysis; a text template generator 12 for generating, based on the input text generation constraint conditions 902 and using the text analyzed by text analyzer 11 and relation dictionary 52, a text template that satisfies the constraint conditions, and for activating dictionary 133; a word-replacement-based text generating apparatus 131 for replacing the predetermined words in the text template using dictionary 133; and a storage 134 for storing the generated text. Relation dictionary 52 may include a thesaurus, an antonym dictionary, the WordNet dictionary, the HowNet dictionary and other specific dictionaries. Dictionary 133 may include several dictionaries, for example a thesaurus, a translation dictionary and so on. The constraint conditions may include: the desired number of texts to be generated; information on the part of speech of the words to be replaced; information on the syntax to be generated; information on the kind of text to be generated; and so on. In this embodiment, dictionary 133 may alternatively be activated by word-replacement-based text generating apparatus 131 according to the input text generation constraint conditions, rather than by the text template generator. The text generation device may also omit storage 134 and output the generated text directly.
Fig. 4 shows an example of the text analyzer. Text analyzer 11 comprises a word segmentation unit 111, a part-of-speech (POS) tagging unit 112, a semantic analysis unit 113 and a syntax analysis unit 114. Text analyzer 11 is language-dependent; it analyzes the received input text 901 and outputs the text analysis result. Usually, word segmentation unit 111 of text analyzer 11 divides input text 901 into a sequence of words, and POS tagging unit 112 tags the part of speech of each word. The segmentation result and the tagging result may influence each other. Semantic analysis unit 113 and syntax analysis unit 114 then perform semantic analysis and syntactic analysis of the input text respectively, and the text analysis result is output. In the invention, text analyzer 11 may omit word segmentation unit 111; for example, English input text does not need to be segmented. Text analyzer 11 may also include only one of semantic analysis unit 113 and syntax analysis unit 114.
Tables 1 and 2 below show the results of analyzing, with text analyzer 11, the English example sentence "I am very happy to meet you" and a Chinese example sentence (rendered in English as "you may remember once in a while that he comes"), respectively.
[Table 1: image A200710306623D00161 in the original publication]
[Table 2: images A200710306623D00162 and A200710306623D00171 in the original publication]
Referring to Fig. 5a, text template generator 12 comprises a slot position determination unit 50, a target replacement determining unit 52, a dictionary activation unit 54 and a template knowledge database 124. Although Fig. 5a shows text template generator 12 as including dictionary activation unit 54, text template generator 12 may clearly also omit dictionary activation unit 54 and output the generated text template directly.
Fig. 5b shows the flowchart of text template generation by the text template generator. At S511, slot position determination unit 50 of text template generator 12 determines, according to the input constraint conditions, the positions of the words to be replaced in the analyzed input text, as slot positions. Each position that needs to be replaced is one slot position. Slot positions are determined in one of the following three ways: (1) If the input constraint conditions explicitly specify the part of speech or the word to be replaced, such as "part of speech to replace: verb" or "word to replace: may", the positions of the replaceable words in the input text can be determined directly from the text analysis result. (2) If the input constraint conditions explicitly specify the word class to be replaced, such as "word class to replace: motion", the positions of the words to be replaced are determined with a semantic dictionary (a kind of relation dictionary) such as HowNet. (3) If the input constraint conditions do not explicitly specify the part of speech, word class or word to be replaced, for example when the constraint condition given is "synonymous text", the part of speech, word class or word that may be replaced is determined from the preset template knowledge database 124, and the slot positions are determined accordingly. These kinds of constraint conditions may be combined, and the method of determining slot positions in the invention is not limited to the three cases above.
At S512, target replacement determining unit 52 of text template generator 12 determines the target replacement of each slot according to the input constraint conditions. The target replacement of a slot is the part of speech, word class or word by which the slot determined in step S511 may be replaced. If the input constraint conditions explicitly specify the target part of speech or word, such as "target part of speech: noun" or "target word: football", the target replacement of the slot can be determined directly. If the input constraint conditions explicitly specify the target word class, such as "target word class: leisure", the target words that may be substituted are determined with a semantic dictionary such as HowNet. If the input constraint conditions are not specific, such as "synonymous text", the target part of speech, word class or word is obtained from the preset template knowledge database. Then, at S513, the corresponding dictionary is activated according to the input constraint conditions. For example, when the constraint condition is "synonymous text", the thesaurus is activated. At S514, the generated text template is output.
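Steps S511 to S514 can be sketched for the simplest case, in which the constraint names the part of speech or word to replace directly. Everything below is an assumption made for illustration: the Slot class, the constraint keys (replace_pos, replace_word, target_pos, text_kind), the POS tags of the example sentence and the mapping from "synonymous text" to a dictionary name are hypothetical and do not come from the patent's tables.

```python
from dataclasses import dataclass

# Hypothetical mapping from the requested kind of text to the dictionary to activate (S513).
DICTIONARY_FOR_KIND = {"synonymous text": "thesaurus"}


@dataclass
class Slot:
    position: int   # index of the word to be replaced (S511)
    original: str   # the word currently at that position
    target: str     # what may fill the slot, here a part of speech (S512)


def generate_template(analysed, constraint):
    """Return the slots of the template and the name of the dictionary to activate."""
    slots = []
    for position, (word, pos) in enumerate(analysed):
        # S511: a word becomes a slot if its POS or the word itself is named.
        if pos == constraint.get("replace_pos") or word == constraint.get("replace_word"):
            # S512: the target defaults to the same part of speech.
            target = constraint.get("target_pos", pos)
            slots.append(Slot(position, word, target))
    dictionary = DICTIONARY_FOR_KIND.get(constraint.get("text_kind"))  # S513
    return slots, dictionary  # S514


# Illustrative analysis result; the POS tags are assumed, not taken from Table 1.
analysed = [("I", "pronoun"), ("am", "verb"), ("very", "adverb"), ("happy", "adjective"),
            ("to", "particle"), ("meet", "verb"), ("you", "pronoun")]
constraint = {"replace_pos": "adjective", "text_kind": "synonymous text"}
slots, dictionary = generate_template(analysed, constraint)
print(slots, dictionary)
```

Cases (2) and (3), which consult a semantic dictionary or the template knowledge database, would replace the simple comparison in the loop with a lookup in those resources.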
Referring to Fig. 6, the word-replacement-based text generating apparatus comprises: an input unit 62 for receiving the text template; a slot filling unit 64 for filling the slot positions in the text template using dictionary 133; and an output unit 66 for outputting the generated text.
The flow by which the word-replacement-based text generating apparatus generates text is described below with reference to Fig. 7. First, at S611, input unit 62 receives the text template. At S612, slot filling unit 64 selects from the activated dictionary 133 a word that meets the replacement condition of a slot and fills the slot with it. At S613, slot filling unit 64 judges whether any unfilled slot remains; if so, S612 is carried out again; otherwise, at S614, output unit 66 outputs the generated text.
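The loop of S611 to S614 can be sketched as follows. The function name fill_slots, the representation of the template as a token list plus slot positions, the sample thesaurus and the choice of the first candidate are assumptions for illustration only.

```python
def fill_slots(tokens, slot_positions, dictionary):
    """Fill every slot of a template with a word from the activated dictionary."""
    filled = list(tokens)            # S611: the received template
    pending = list(slot_positions)
    while pending:                   # S613: repeat while unfilled slots remain
        position = pending.pop(0)
        candidates = dictionary.get(tokens[position], [])
        if candidates:
            filled[position] = candidates[0]  # S612: fill the slot (first candidate)
    return " ".join(filled)          # S614: output the generated text


# Hypothetical activated thesaurus.
thesaurus = {"happy": ["glad", "pleased"]}
tokens = "I am very happy to meet you".split()
generated = fill_slots(tokens, [3], thesaurus)
print(generated)
```

Choosing a different candidate for each slot, or enumerating all combinations, would yield several generated texts from one template.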
Table 3 below gives examples of the template generated, the dictionary activated and the text finally generated for given input texts and constraint conditions.
[Table 3: images A200710306623D00181 and A200710306623D00191 in the original publication]
Fig. 8 shows the text inspection device according to the invention. Text inspection device 220 checks the input text and calculates its usage intensity value, from which it can be determined whether the input text conforms to customary expression. The larger the usage intensity value of a text, the better the text conforms to customary expression.
Text inspection device 220 comprises: a text input unit (not shown) for receiving the input; a text segmenter 82 for segmenting the text; a word selection unit 84 for selecting the words to be checked from the segmented text; a word pair generation unit 86 for generating the word pairs related to each of the words to be checked; a word pair search unit 92 for calculating the occurrence counts of the word pairs; a word usage intensity calculation unit 88 for calculating the usage intensity of each word; a text usage intensity calculation unit 90 for calculating the usage intensity of the text; and an output unit (not shown) for outputting the usage intensity of the text.
The following describes in detail how text inspection device 220 checks whether the input text conforms to idiomatic usage. Referring to Fig. 9, at S911 the input unit receives the input text. At S912, text segmenter 82 divides the text into words. Then, at S913, word selection unit 84 selects from the text the words to be checked. Selection may be done in one of the following ways, but is not limited to them: 1) select every word in turn; 2) select only non-stop words; 3) select only predetermined words.
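The three selection modes of S913 can be sketched as below. The mode names, the stop-word list and the function name select_words are hypothetical; a real system would use its own stop-word list.

```python
# Illustrative stop-word list (an assumption, not from the patent).
STOP_WORDS = {"a", "the", "of", "in", "was", "out"}


def select_words(words, mode="all", predetermined=frozenset()):
    """Select the words to be checked from a segmented text."""
    if mode == "all":            # 1) every word in turn
        return list(words)
    if mode == "non_stop":       # 2) only non-stop words
        return [w for w in words if w.lower() not in STOP_WORDS]
    if mode == "predetermined":  # 3) only predetermined words
        return [w for w in words if w in predetermined]
    raise ValueError(f"unknown selection mode: {mode}")


words = "A little boy was standing out in front of a store window".split()
print(select_words(words, "non_stop"))
print(select_words(words, "predetermined", {"boy"}))
```

Restricting the check to non-stop or predetermined words reduces the number of word pairs that have to be searched in the later steps.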
At S914, word pair generation unit 86 generates the word pairs related to the word to be checked. A search window can be used to generate the related word pairs. Suppose the current check word is W_j and the search window is set to (n, m), meaning that the n words before W_j and the m words after W_j are related to W_j. Using the search window, a total of m+n+2 word pairs related to the check word W_j can be searched (where n and m are each greater than 1). For a search window (n, m) in which n and m both equal 1, three related word pairs can be searched. A search window of width (n, m) comprises the following search word pairs: 1) the current word; 2) the string formed by the first word before the current word and the current word; 3) the strings formed by each of the second through n-th words before the current word, any word, and the current word; 4) the string formed by the current word and the first word after it; 5) the strings formed by the current word, any word, and each of the second through m-th words after the current word; 6) the string formed by the first word before the current word, the current word, and the first word after the current word.
Table 4 below shows the m+n+2 word pairs searched.
Figure A200710306623D00201
Figure A200710306623D00211
Table 5 shows the related word pairs of the check word "boy" searched with search window (2, 2) and with search window (1, 1) when the input text is "A little boy was standing out in front of a store window".
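The search-window enumeration can be sketched for the "boy" example. The "*" wildcard notation for "any word", the use of one wildcard per skipped position and the ordering of the returned terms are assumptions; the source only states which combinations are searched.

```python
from typing import List

WILDCARD = "*"  # assumed notation for "any word"


def search_terms(words: List[str], j: int, n: int, m: int) -> List[str]:
    """Build the search terms for words[j] with search window (n, m)."""
    current = words[j]
    terms = [current]                                   # the current word
    for k in range(1, n + 1):                           # words before it
        if j - k < 0:
            break
        gap = [WILDCARD] * (k - 1)                      # skipped positions
        terms.append(" ".join([words[j - k]] + gap + [current]))
    for k in range(1, m + 1):                           # words after it
        if j + k >= len(words):
            break
        gap = [WILDCARD] * (k - 1)
        terms.append(" ".join([current] + gap + [words[j + k]]))
    if j - 1 >= 0 and j + 1 < len(words):               # previous+current+next
        terms.append(" ".join([words[j - 1], current, words[j + 1]]))
    return terms


sentence = "A little boy was standing out in front of a store window".split()
narrow = search_terms(sentence, 2, 1, 1)
wide = search_terms(sentence, 2, 2, 2)
```

With window (2, 2) the sketch yields six terms, which agrees with the m+n+2 count given in the text; near the start or end of a sentence fewer terms are produced because some neighbours do not exist.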
After the word pairs have been generated, at S915 each word pair is searched for in the corpus and its occurrence count is obtained. At S916, the usage strength Idiomatic(W_j) of the check word is calculated according to Formula 1, where Formula 2 maps the occurrence counts corresponding to the m+n+2 word pairs onto (0, 1).
$$\mathrm{Idiomatic}(W_j) = \sum_{i=1}^{n+m+1} w_i \, P(N_i) \qquad \text{(Formula 1)}$$
where $P(N_i)$ is a mapping function:
$$P(N_i) = \frac{\left( \dfrac{\ln N_i - 1}{\ln N_i + 1} \right)^{2}}{\ln\left( \ln N_0 - \ln N_i \right)} \qquad \text{(Formula 2)}$$
Here $w_i$ is a weight satisfying $\sum_{i=1}^{n+m+1} w_i = 1$, and $N_i$ is the occurrence count of the i-th word pair. If the search window is (1, 1), Formula 1 reduces to
$$\mathrm{Idiomatic}(W) = 0.25 \times P(N_1) + 0.25 \times P(N_2) + 0.5 \times P(N_3).$$
Table 6 below gives examples of calculating word usage strength.
[Example] A little boy was...: boy: 105,000,000; little boy: 1,530,000; boy was: 1,350,000; little boy was: 259,000; Idiomatic(boy) = 0.461
[Example] A small boy was...: boy: 105,000,000; small boy: 1,130,000; boy was: 1,350,000; small boy was: 26,000; Idiomatic(boy) = 0.411
[Example] A small boy were...: boy: 105,000,000; small boy: 1,130,000; boy were: 480,000; small boy were: 333; Idiomatic(boy) = 0.332
[Example] I like eating food.: eating: 37,000,000; like eating: 1,120,000; eating food: 1,010,000; like eating food: 11,800; Idiomatic(eating) = 0.452
[Example] I like eating brick.: eating: 37,000,000; like eating: 1,120,000; eating brick: 70; like eating brick: 5; Idiomatic(eating) = 0.209
As can be seen, the word usage strengths of the check words "boy" and "eating" differ across the different input texts.
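Formulas 1 and 2 can be checked numerically against the "boy" rows of the table. The sketch below assumes that N_0 is the occurrence count of the check word on its own and that, for window (1, 1), the counts are ordered as previous+current, current+next, previous+current+next with weights 0.25, 0.25 and 0.5; under that reading it gives values close to the 0.461 and 0.411 listed for the first two examples. The domain guard is an addition of this sketch, not something the source specifies.

```python
import math
from typing import Sequence


def mapped_count(n_i: int, n_0: int) -> float:
    """Formula 2, with n_0 assumed to be the count of the check word alone."""
    # Guard added for this sketch: the outer logarithm must be positive.
    if n_i < 1 or n_0 <= math.e * n_i:
        raise ValueError("counts are outside the range handled by this sketch")
    ratio = (math.log(n_i) - 1) / (math.log(n_i) + 1)
    return ratio ** 2 / math.log(math.log(n_0) - math.log(n_i))


def word_usage_strength(n_0: int, pair_counts: Sequence[int],
                        weights: Sequence[float]) -> float:
    """Formula 1: weighted sum of the mapped word pair counts."""
    if len(pair_counts) != len(weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("need one weight per word pair, summing to 1")
    return sum(w * mapped_count(n, n_0) for w, n in zip(weights, pair_counts))


WEIGHTS_1_1 = (0.25, 0.25, 0.5)   # window (1, 1), as in the reduced formula
little = word_usage_strength(105_000_000, (1_530_000, 1_350_000, 259_000), WEIGHTS_1_1)
small = word_usage_strength(105_000_000, (1_130_000, 1_350_000, 26_000), WEIGHTS_1_1)
```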
Then, at S917, it is judged whether another word needs to be checked; if so, S913 to S916 are carried out again; otherwise S918 is carried out. At S918, after the word usage strengths of all the words to be checked have been obtained, the usage strength of the text is calculated according to one of Formulas 3 to 5.
$$\mathrm{Idiomatic}(\mathrm{Text}) = \min_{\forall W_i \in \mathrm{Text}} \mathrm{Idiomatic}(W_i) \qquad \text{(Formula 3)}$$
Here the usage strength of the text is determined by the checked word that has the lowest usage strength.
$$\mathrm{Idiomatic}(\mathrm{Text}) = \sum_{W_i \in \mathrm{Text}} q_i \, \mathrm{Idiomatic}(W_i) \qquad \text{(Formula 4)}$$
where $q_i$ is a weight satisfying $\sum_i q_i = 1$; $q_i$ can be determined by properties of $W_i$ such as its part of speech.
$$\mathrm{Idiomatic}(\mathrm{Text}) = \prod_i \mathrm{Idiomatic}(W_i) \qquad \text{(Formula 5)}$$
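The three ways of combining word usage strengths into a text usage strength (Formulas 3 to 5) can be sketched as follows. The function names are hypothetical, and the equal weights used in the example are only one possible choice of q_i.

```python
import math
from typing import Sequence


def text_strength_min(word_strengths: Sequence[float]) -> float:
    """Formula 3: the weakest checked word decides the text strength."""
    return min(word_strengths)


def text_strength_weighted(word_strengths: Sequence[float],
                           weights: Sequence[float]) -> float:
    """Formula 4: weighted sum, with the weights q_i summing to 1."""
    if len(word_strengths) != len(weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("need one weight per word, summing to 1")
    return sum(q * s for q, s in zip(weights, word_strengths))


def text_strength_product(word_strengths: Sequence[float]) -> float:
    """Formula 5: product of the word usage strengths."""
    return math.prod(word_strengths)


strengths = [0.461, 0.452, 0.332]          # illustrative word strengths
weakest = text_strength_min(strengths)
average = text_strength_weighted(strengths, [1 / 3, 1 / 3, 1 / 3])
product = text_strength_product(strengths)
```

The minimum reacts to a single unidiomatic word, whereas the product shrinks as more words are checked, so scores from Formula 5 are only comparable between texts with the same number of checked words.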
Searching only the local database lets the word pair search unit find the word pairs and obtain the occurrence count of each word pair very quickly; however, if a large number of word pairs and accurate occurrence counts are desired, the Internet and the local corpus can be searched to obtain the occurrence counts of the relevant word pairs. At S919, the calculated text usage strength is output, from which it can be decided whether the text conforms to idiomatic usage.
Figure 10 shows an example of the word pair search unit. The word pair search unit comprises: a local search unit 101 that performs local searches; a local database 102; a judging unit 103; a corpus search unit 104; an Internet search engine 55 that searches the Internet 08; and a local search engine 106 that searches a local corpus 108.
Referring to Figure 11, at S1111, after the word pair search unit receives the word pairs related to the word to be checked, local search unit 101 searches local database 102 for the occurrence count of each word pair. At S1112, if the occurrence count of the related word pair has been found, S1115 is carried out. If judging unit 103 determines that the related word pair was not found in local database 102, then at S1113 corpus search unit 104 performs a further search. Corpus search unit 104 uses Internet search engine 55 and local search engine 106 to search at least one of the Internet 08 and local corpus 108, respectively, thereby obtaining the occurrence count of the word pair. At S1114, the obtained occurrence count is written to local database 102. At S1115, the occurrence count of the word pair is output.
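The S1111 to S1115 flow amounts to a lookup with a local cache and a slower fallback. In the sketch below the local database is modelled as a plain dictionary and the corpus or Internet search as a caller-supplied function; both are stand-ins chosen for the example, and `fake_search` returns a fixed number purely for demonstration.

```python
from typing import Callable, Dict, List


def lookup_count(term: str, local_db: Dict[str, int],
                 corpus_search: Callable[[str], int]) -> int:
    """Return the occurrence count of a word pair, caching new results."""
    count = local_db.get(term)          # S1111: search the local database
    if count is None:                   # S1112: not found locally
        count = corpus_search(term)     # S1113: search corpus or Internet
        local_db[term] = count          # S1114: write back to the database
    return count                        # S1115: output the occurrence count


local_db: Dict[str, int] = {"little boy": 1_530_000}
searched: List[str] = []


def fake_search(term: str) -> int:
    """Stand-in for a real search engine; records what was requested."""
    searched.append(term)
    return 259_000


first = lookup_count("little boy was", local_db, fake_search)
second = lookup_count("little boy was", local_db, fake_search)  # cache hit
cached = lookup_count("little boy", local_db, fake_search)
```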
Figure 12 is a schematic diagram of one application of the invention. The invention can be applied to an automatic short message classifier system. The system has two stages: a training stage and a classification stage.
In the training stage, each short message (SM) is labeled manually. Labels can include personal, urgent, spam and so on. Because manual labeling is expensive and time-consuming, the idiomatic text generation device of the invention is used to generate short messages that conform to idiomatic usage. Each time, a labeled short message is sent to the text generation device with the generation constraint condition set to synonym, so that every generated short message carries the same label as the input short message. Both the manually labeled short messages and the generated short messages are used to train the short message classifier. In general, the more training data is available, the better the classifier performs.
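The label-preserving augmentation described for the training stage can be sketched as follows. The synonym table and the one-word-at-a-time substitution are toy stand-ins for the text generation device, introduced only to show how generated messages inherit the label of the message they were derived from.

```python
from typing import Callable, Dict, List, Tuple

Labelled = Tuple[str, str]  # (short message text, label)


def synonym_variants(text: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Toy generator: replace one word at a time with each listed synonym."""
    words = text.split()
    variants: List[str] = []
    for i, word in enumerate(words):
        for alternative in synonyms.get(word.lower(), []):
            variants.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return variants


def augment(labelled: List[Labelled],
            generate: Callable[[str], List[str]]) -> List[Labelled]:
    """Add generated messages, each with the label of its source message."""
    augmented = list(labelled)
    for text, label in labelled:
        augmented.extend((variant, label) for variant in generate(text))
    return augmented


SYNONYMS = {"call": ["phone"]}          # illustrative synonym table
training = augment([("call me now", "personal")],
                   lambda text: synonym_variants(text, SYNONYMS))
```

In a complete pipeline the generated variants would presumably also be passed through the usage check before being added to the training set.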
Figure 13 is a schematic diagram of another application of the invention. This system uses the idiomatic text generation device of the invention to enhance a text search engine. First, a query string and its expansion mode (that is, the constraint condition) are sent to the idiomatic text generation device, which generates a plurality of expanded query strings. The original query string and the expanded query strings are then sent to the search engine to search for relevant text.
Although the invention has been described using Chinese and English as examples, it can clearly be applied to text generation and checking in other languages.
Although the invention has been described with reference to specific embodiments, it should not be limited by these embodiments but only by the claims. It should be understood that those of ordinary skill in the art can change or modify the embodiments without departing from the scope and spirit of the invention.

Claims (44)

1. A text template generator, comprising:
a slot position determination unit for determining, according to a constraint condition, the positions of the words to be replaced in an input text as slot positions; and
a target replacement determination unit for determining, according to the constraint condition, the targets that replace the slot positions, thereby generating a text template comprising the targets.
2. A text template generation method, comprising:
a slot position determination step of determining, according to a constraint condition, the positions of the words to be replaced in an input text as slot positions; and
a target replacement determination step of determining, according to the constraint condition, the targets that replace the slot positions, thereby generating a text template comprising the targets.
3. A text generation device, comprising:
a text template generator for generating a text template from analyzed text according to a constraint condition; and
a word-replacement-based text generating apparatus for generating text from the text template by using a dictionary associated with the constraint condition.
4. The device as claimed in claim 3, wherein the analyzed text is produced by a text analyzer.
5. The device as claimed in claim 4, wherein the text analyzer comprises:
a part-of-speech (POS) tagging unit for tagging the POS of the words in the text; and
a semantic analysis unit for analyzing the semantics of the tagged words.
6. The device as claimed in claim 4, wherein the text analyzer comprises:
a part-of-speech (POS) tagging unit for tagging the POS of the words in the text; and
a syntax analysis unit for analyzing the syntax of the tagged text.
7. The device as claimed in claim 4, wherein the text analyzer comprises:
a part-of-speech (POS) tagging unit for tagging the POS of the words in the text;
a semantic analysis unit for analyzing the semantics of the tagged words; and
a syntax analysis unit for analyzing the syntax of the text output by the semantic analysis unit.
8. The device as claimed in any one of claims 5 to 7, wherein the text analyzer further comprises:
a word segmentation unit for segmenting the input text into words.
9. The device as claimed in claim 3, wherein the text template generator comprises:
a slot position determination unit for determining, according to the constraint condition, the positions of the words to be replaced in the input text as slot positions; and
a target replacement determination unit for determining, according to the constraint condition, the targets that replace the slot positions, thereby generating a text template comprising the targets.
10. The device as claimed in claim 9, wherein, when the constraint condition specifies the POS or the word to be replaced, the slot position determination unit determines the slot positions using the analyzed text.
11. The device as claimed in claim 9, wherein, when the constraint condition specifies the category of the words to be replaced, the slot position determination unit and the target replacement determination unit use a grammatical dictionary to determine, respectively, the slot positions and the words that can replace the slot positions.
12. The device as claimed in claim 9, wherein, when the constraint condition specifies a feature of the text to be replaced, the slot position determination unit and the target replacement determination unit determine, respectively, the slot positions and the words, POS or word categories that can be used to replace the slot positions.
13. The device as claimed in claim 3, wherein the word-replacement-based text generating apparatus selects from the dictionary the words serving as replacement targets and fills them into the word positions, thereby generating the filled text.
14. A text generation method, comprising:
a text template generation step of generating a text template from analyzed text according to a constraint condition; and
a word-replacement-based text generation step of generating text from the text template by using a dictionary associated with the constraint condition.
15. The method as claimed in claim 14, further comprising a text analysis step of analyzing the input text.
16. The method as claimed in claim 15, wherein the text analysis step comprises:
a part-of-speech (POS) tagging step of tagging the POS of the words in the text; and
a semantic analysis step of analyzing the semantics of the tagged words.
17. The method as claimed in claim 15, wherein the text analysis step comprises:
a part-of-speech (POS) tagging step of tagging the POS of the words in the text; and
a syntax analysis step of analyzing the syntax of the tagged text.
18. The method as claimed in claim 15, wherein the text analysis step comprises:
a part-of-speech (POS) tagging step of tagging the POS of the words in the text;
a semantic analysis step of analyzing the semantics of the tagged words; and
a syntax analysis step of analyzing the syntax of the text obtained by the semantic analysis step.
19. The method as claimed in any one of claims 16 to 18, wherein the text analysis step further comprises:
a word segmentation step of segmenting the input text into words.
20. The method as claimed in claim 14, wherein the text template generation step comprises:
a slot position determination step of determining, according to the constraint condition, the positions of the words to be replaced in the input text as slot positions; and
a target replacement determination step of determining, according to the constraint condition, the targets that replace the slot positions, thereby generating a text template comprising the targets.
21. The method as claimed in claim 20, wherein, when the constraint condition specifies the POS or the word to be replaced, the slot position determination step comprises a step of determining the slot positions using the analyzed text.
22. The method as claimed in claim 20, wherein, when the constraint condition specifies the category of the words to be replaced, the slot position determination step comprises a step of determining the slot positions by using a grammatical dictionary, and the target replacement determination step comprises a step of determining, by using the grammatical dictionary, the words that can replace the slot positions.
23. The method as claimed in claim 20, wherein, when the constraint condition specifies a feature of the text to be replaced, the slot position determination step comprises a step of determining the slot positions, and the target replacement determination step comprises a step of determining the words, POS or word categories that can be used to replace the slot positions.
24. The method as claimed in claim 14, wherein the word-replacement-based text generation step comprises a step of selecting from the dictionary the words serving as replacement targets and filling them into the word positions, thereby generating the filled text.
25. A text inspection device for checking whether a text conforms to idiomatic usage, comprising:
a word screening unit for selecting the words to be checked from segmented text;
a word pair generation unit for generating the word pairs related to each of the words to be checked;
a word usage strength computing unit for calculating, from the occurrence count of each word pair, the word usage strength of each word to be checked that was screened from the text; and
a text usage strength computing unit for calculating the text usage strength of the text from the word usage strengths.
26. The device as claimed in claim 25, wherein:
the word screening unit selects the words to be checked in one of the following ways: selecting each word in sequence, selecting non-stop words, and selecting predetermined words.
27. The device as claimed in claim 25, wherein:
a word pair search unit uses a search window to search for the occurrence count of each word pair related to the word to be checked, wherein the search window is (m, n), indicating that the m words before the word to be checked and the n words after the word to be checked are related to the word to be checked.
28. The device as claimed in claim 25, wherein:
the word pair search unit can search m+n+2 word pairs using the search window.
29. The device as claimed in claim 25, wherein the word pair search unit comprises:
a local search unit for searching a local database to obtain the occurrence count of a word pair; and
a corpus search unit for, when the occurrence count of the word pair is not found in the local database, searching at least one of the Internet and a local corpus to obtain the occurrence count of the word pair and adding it to the local database.
30. The device as claimed in claim 25, wherein:
the word usage strength computing unit calculates the word usage strength of each word using first predetermined weights and the values obtained by mapping the occurrence count of each word pair onto the interval (0, 1).
31. The device as claimed in claim 25, wherein:
the text usage strength computing unit takes the smallest of the word usage strengths of the words to be checked as the usage strength of the text.
32. The device as claimed in claim 25, wherein:
the text usage strength computing unit calculates the text usage strength from second predetermined weights and the word usage strengths.
33. The device as claimed in claim 25, wherein:
the text usage strength computing unit calculates the text usage strength from the word usage strengths.
34. A text inspection method for checking whether a text conforms to idiomatic usage, comprising:
a word screening step of selecting the words to be checked from segmented text;
a word pair generation step of generating the word pairs related to each of the words to be checked;
a word usage strength calculation step of calculating, from the occurrence count of each word pair, the word usage strength of each word to be checked that was screened from the text; and
a text usage strength calculation step of calculating the text usage strength of the text from the word usage strengths.
35. The method as claimed in claim 34, wherein:
the word screening step comprises a step of selecting the words to be checked in one of the following ways: selecting each word in sequence, selecting non-stop words, and selecting predetermined words.
36. The method as claimed in claim 34, wherein:
a word pair search step comprises a step of using a search window to search for the occurrence count of each word pair related to the word to be checked, wherein the search window is (m, n), indicating that the m words before the word to be checked and the n words after the word to be checked are related to the word to be checked.
37. The method as claimed in claim 36, wherein:
the word pair search step can search m+n+2 word pairs using the search window.
38. The method as claimed in claim 34, wherein the word pair search step comprises:
a local search step of searching a local database to obtain the occurrence count of a word pair; and
a corpus search step of, when the occurrence count of the word pair is not found in the local database, searching at least one of the Internet and a local corpus to obtain the occurrence count of the word pair and adding it to the local database.
39. The method as claimed in claim 34, wherein:
the word usage strength calculation step comprises a step of calculating the word usage strength of each word using first predetermined weights and the values obtained by mapping the occurrence count of each word pair onto the interval (0, 1).
40. The method as claimed in claim 34, wherein:
the text usage strength calculation step comprises a step of taking the smallest of the word usage strengths of the words to be checked as the usage strength of the text.
41. The method as claimed in claim 34, wherein:
the text usage strength calculation step comprises a step of calculating the text usage strength from second predetermined weights and the word usage strengths.
42. The method as claimed in claim 34, wherein:
the text usage strength calculation step comprises a step of calculating the text usage strength from the word usage strengths.
43. A system for generating idiomatic text, comprising:
the text generation device as claimed in claim 3, for generating text;
the text inspection device as claimed in claim 25, for judging whether the generated text is idiomatic text; and
a text selection device for selecting the idiomatic text according to the judgment result.
44. A method for generating idiomatic text, comprising the steps of:
generating text by using the text generation method as claimed in claim 14;
judging whether the generated text is idiomatic text by using the text inspection method as claimed in claim 34; and
selecting the idiomatic text according to the judgment result.
CNA2007103066231A 2007-12-28 2007-12-28 Text template generator, text generation equipment, text checking equipment and method thereof Pending CN101470700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007103066231A CN101470700A (en) 2007-12-28 2007-12-28 Text template generator, text generation equipment, text checking equipment and method thereof

Publications (1)

Publication Number Publication Date
CN101470700A true CN101470700A (en) 2009-07-01

Family

ID=40828177

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007103066231A Pending CN101470700A (en) 2007-12-28 2007-12-28 Text template generator, text generation equipment, text checking equipment and method thereof

Country Status (1)

Country Link
CN (1) CN101470700A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20090701