CN104050156B - Device, method and electronic equipment for extracting maximum noun phrases - Google Patents

Device, method and electronic equipment for extracting maximum noun phrases

Info

Publication number
CN104050156B
Authority
CN
China
Prior art keywords
noun phrase
language
maximum
template
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310084666.5A
Other languages
Chinese (zh)
Other versions
CN104050156A (en)
Inventor
葛乃晟
付亦雯
郑仲光
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201310084666.5A
Publication of CN104050156A
Application granted
Publication of CN104050156B
Expired - Fee Related

Abstract

The invention provides a device, a method and electronic equipment for extracting maximum noun phrases, to overcome the low processing accuracy of existing language data processing techniques. The device includes: a noun phrase determining unit that determines, in a reference language translation of a target language sentence to be processed, the reference language noun phrases that match a noun phrase template, the noun phrase template including part-of-speech labels of the reference language; a marking unit that marks a noun phrase label on the target language noun phrases in the target language sentence that correspond to the determined reference language noun phrases; and a maximum noun phrase determining unit that determines the phrases in the target language sentence that match a maximum noun phrase template to be maximum noun phrases, the maximum noun phrase template including part-of-speech labels of the target language and/or noun phrase labels. The above technique of the invention can be applied to the field of data processing.

Description

Device, method and electronic equipment for extracting maximum noun phrases
Technical field
The present invention relates to the field of data processing, and more particularly to a device, a method and electronic equipment for extracting maximum noun phrases.
Background art
With the rapid development of information technology and network technology, data processing has become a popular and indispensable field. However, because data and data sources are rich and diverse, the purposes and requirements of processing also differ.
Language data, as one of many types of data, is extremely common in people's daily life and work. For example, the text in e-mails, in short messages exchanged between mobile phones, and in the various documents that people need to handle in study and work is all language data. Existing techniques for processing language data still suffer from insufficient processing accuracy.
Summary of the invention
A brief summary of the present invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit the scope of the invention. Its purpose is merely to present some concepts in a simplified form as a prelude to the more detailed description that follows.
In view of this, the present invention provides a device, a method and electronic equipment for extracting maximum noun phrases, so as to at least solve the problem of the low processing accuracy of existing language data processing techniques.
According to one aspect of the present invention, there is provided a device for extracting maximum noun phrases, the device including: a noun phrase determining unit for determining, in a reference language translation of a target language sentence to be processed, a reference language noun phrase that matches any one of at least one noun phrase template, wherein each noun phrase template includes at least one part-of-speech label of the reference language arranged in its corresponding predetermined order; a marking unit for marking a noun phrase label on the target language noun phrase in the target language sentence that corresponds to the determined reference language noun phrase; and a maximum noun phrase determining unit for determining a phrase in the target language sentence that matches any one of at least one maximum noun phrase template to be a maximum noun phrase, wherein each maximum noun phrase template includes at least one part-of-speech label of the target language and/or at least one noun phrase label arranged in its corresponding predetermined order.
According to another aspect of the present invention, there is also provided a method for extracting maximum noun phrases, the method including: determining, in a reference language translation of a target language sentence to be processed, a reference language noun phrase that matches any one of at least one noun phrase template, wherein each noun phrase template includes at least one part-of-speech label of the reference language arranged in its corresponding predetermined order; marking a noun phrase label on the target language noun phrase in the target language sentence that corresponds to the determined reference language noun phrase; and determining a phrase in the target language sentence that matches any one of at least one maximum noun phrase template to be a maximum noun phrase, wherein each maximum noun phrase template includes at least one part-of-speech label of the target language and/or at least one noun phrase label arranged in its corresponding predetermined order.
According to another aspect of the present invention, there is also provided electronic equipment that includes the device for extracting maximum noun phrases described above.
According to a further aspect of the present invention, there is also provided a program product storing machine-readable instruction code which, when executed, causes the machine to perform the method for extracting maximum noun phrases described above.
In addition, according to other aspects of the present invention, there is also provided a computer-readable storage medium on which the program product described above is stored.
The device, method and electronic equipment for extracting maximum noun phrases according to embodiments of the present invention first use noun phrase templates of the reference language (such as English) to determine reference language noun phrases, and on this basis then use maximum noun phrase templates of the target language (such as Chinese) to determine the maximum noun phrases of the target language. The maximum noun phrases are thus determined by hierarchical two-stage processing, and at least one of the following benefits can be obtained: the processing result is more accurate; the processing is less complex; reference language noun phrase templates built from a bilingual aligned corpus reflect the characteristics of noun phrases relatively accurately; data processing such as noun phrase extraction performed with these reference language noun phrase templates has higher processing accuracy and a better effect; target language maximum noun phrase templates built from a target language corpus in which noun phrases and maximum noun phrases have been determined reflect the characteristics of maximum noun phrases relatively accurately; and data processing such as maximum noun phrase extraction performed with these target language maximum noun phrase templates has higher processing accuracy and a better effect.
These and other advantages of the present invention will become more apparent from the following detailed description of preferred embodiments of the invention in conjunction with the accompanying drawings.
Brief description of the drawings
The present invention may be better understood by reference to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to denote the same or similar parts. The drawings, together with the detailed description below, are included in and form part of this specification, and serve to further illustrate preferred embodiments of the invention and to explain its principles and advantages. In the drawings:
Fig. 1 is a block diagram schematically showing an example structure of a device for extracting maximum noun phrases according to an embodiment of the present invention.
Fig. 2 is a block diagram schematically showing another example structure of a device for extracting maximum noun phrases according to an embodiment of the present invention.
Fig. 3 is a block diagram schematically showing a possible example structure of the noun phrase template obtaining unit shown in Fig. 2.
Fig. 4 is a block diagram schematically showing a further example structure of a device for extracting maximum noun phrases according to an embodiment of the present invention.
Fig. 5 is a block diagram schematically showing a possible example structure of the maximum noun phrase template obtaining unit shown in Fig. 4.
Fig. 6 is a flow chart schematically showing an exemplary process of a method for extracting maximum noun phrases according to an embodiment of the present invention.
Fig. 7 is a flow chart schematically showing some of the steps in another possible exemplary process of the method for extracting maximum noun phrases according to an embodiment of the present invention.
Fig. 8 is a flow chart schematically showing some of the steps in a further possible exemplary process of the method for extracting maximum noun phrases according to an embodiment of the present invention.
Fig. 9 is a simplified structural diagram of the hardware configuration of a possible information processing device that can be used to implement the device and the method for extracting maximum noun phrases according to embodiments of the present invention.
Those skilled in the art will appreciate that elements in the drawings are illustrated only for simplicity and clarity and are not necessarily drawn to scale. For example, the dimensions of some elements in the drawings may be exaggerated relative to other elements in order to help improve understanding of the embodiments of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below in conjunction with the accompanying drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual embodiment, many implementation-specific decisions must be made in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and these constraints may vary from one implementation to another. Moreover, it should also be understood that although such development work may be very complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, in order to avoid obscuring the present invention with unnecessary detail, only the device structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details having little relation to the invention are omitted.
An embodiment of the present invention provides a device for extracting maximum noun phrases, the device including: a noun phrase determining unit for determining, in a reference language translation of a target language sentence to be processed, a reference language noun phrase that matches any one of at least one noun phrase template, wherein each noun phrase template includes at least one part-of-speech label of the reference language arranged in its corresponding predetermined order; a marking unit for marking a noun phrase label on the target language noun phrase in the target language sentence that corresponds to the determined reference language noun phrase; and a maximum noun phrase determining unit for determining a phrase in the target language sentence that matches any one of at least one maximum noun phrase template to be a maximum noun phrase, wherein each maximum noun phrase template includes at least one part-of-speech label of the target language and/or at least one noun phrase label arranged in its corresponding predetermined order.
An example of the device for extracting maximum noun phrases according to an embodiment of the present invention is described in detail below with reference to Fig. 1.
As shown in Fig. 1, the device 100 for extracting maximum noun phrases according to an embodiment of the present invention includes a noun phrase determining unit 110, a marking unit 120 and a maximum noun phrase determining unit 130.
The noun phrase determining unit 110 is used to determine, in the reference language translation of the target language sentence to be processed, the reference language noun phrases that match any one of at least one noun phrase template.
In one implementation of the device for extracting maximum noun phrases according to an embodiment of the present invention, the target language may be a language that can undergo word segmentation.
In one example, the target language may be Chinese. In another example, the target language may also be Japanese or Korean, or another language that can undergo word segmentation as described above. Hereinafter, the embodiments of the present invention are described mainly using Chinese as the example of the target language; the case where another language serves as the target language is similar and is not described again.
It should be noted that a "language that can undergo word segmentation" as mentioned above is a language in which there is no delimiter such as a space between words. Chinese and Japanese, for example, have no such delimiter, so without word segmentation it is not possible to tell which characters form a word.
In addition, in one implementation of the device for extracting maximum noun phrases according to an embodiment of the present invention, the reference language may be a language that does not need word segmentation.
In one example, the reference language may be English. In another example, the reference language may also be French or German, or another language that does not need word segmentation as described above. Hereinafter, the embodiments of the present invention are described mainly using English as the example of the reference language; the case where another language serves as the reference language is similar and is not described again.
It should be noted that a "language that does not need word segmentation" as mentioned above is a language in which there is a delimiter such as a space between words. In an English sentence, for example, adjacent words are separated by spaces. In a language with such delimiters, such as English, it is therefore possible to tell which letters form a word without word segmentation.
It should further be noted that the sentence to be processed is a target language sentence, whereas the noun phrase templates contain part-of-speech labels of the reference language. In each noun phrase template, the part-of-speech labels of the reference language that it contains (at least one) are arranged in a certain order; that is, the part-of-speech labels contained in a noun phrase template are ordered.
In one implementation, the noun phrase templates may be stored in the device 100 in advance, for example in the noun phrase determining unit 110 of the device 100.
In another implementation, the noun phrase templates may also be obtained by another functional unit of the device 100; this implementation is described in detail below with reference to Fig. 2.
As an example, assume that the Chinese (as an example of the target language) sentence to be processed is the one meaning "on the desk there is a book", and that "(/DT) (/NN)+" is a predetermined English (as an example of the reference language) noun phrase template.
Here, "(/DT) (/NN)+" contains the article part-of-speech label "/DT" and the noun part-of-speech label "/NN", and the "+" indicates that the part-of-speech label "/NN" may be repeated in the structure. Note that the part-of-speech labels in "(/DT) (/NN)+" are English part-of-speech labels.
The noun phrase determining unit 110 can determine, in the English translation of "on the desk there is a book", the English noun phrases that match the noun phrase template "(/DT) (/NN)+".
The English translation of "on the desk there is a book" may, for example, be entered by a user, or may be obtained by machine translation.
Assume that machine translation yields the English translation "There is a book on the desk". Then, according to the noun phrase template "(/DT) (/NN)+", the noun phrase determining unit 110 can obtain, in the English translation "There is a book on the desk" of the Chinese sentence to be processed, two English noun phrases that match the template, namely "a book" and "the desk".
In this way, through the processing of the noun phrase determining unit 110, one or more reference language noun phrases can be determined in the reference language translation of the target language sentence to be processed, each of which matches any one of the at least one predetermined noun phrase template.
That is, if a reference language noun phrase in the reference language translation of the target language sentence to be processed matches one or more of the at least one predetermined noun phrase template, the noun phrase determining unit 110 selects it. In this way, the noun phrase determining unit 110 determines one or more reference language noun phrases.
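The template matching described above can be sketched as a regular-expression search over the sequence of part-of-speech labels. This is a minimal illustration, not the patented implementation: the function names `template_to_regex` and `find_phrases` are hypothetical, and the tags assigned to "There", "is" and "on" are assumed Penn Treebank-style tags that the text does not give.

```python
import bisect
import re


def template_to_regex(template):
    """Compile a template such as '(/DT) (/NN)+' into a regex over '/TAG ' tokens."""
    compact = template.replace(" ", "")
    # Each '(/TAG)' becomes a group matching exactly one tag token; a trailing '+' is kept.
    pattern = re.sub(r"\(/([^)]+)\)",
                     lambda m: "(?:/" + re.escape(m.group(1)) + " )",
                     compact)
    return re.compile(pattern)


def find_phrases(tagged, template):
    """Return the word sequences whose tag sequence matches the template."""
    pattern = template_to_regex(template)
    starts, pos = [], 0
    for _, tag in tagged:
        starts.append(pos)          # character offset of each token in the tag string
        pos += len(tag) + 2         # '/' + tag + ' '
    tag_string = "".join(f"/{tag} " for _, tag in tagged)
    phrases = []
    for m in pattern.finditer(tag_string):
        first = bisect.bisect_left(starts, m.start())
        last = bisect.bisect_left(starts, m.end())
        phrases.append(" ".join(word for word, _ in tagged[first:last]))
    return phrases


# Tagged English translation; tags other than DT and NN are assumptions.
tagged = [("There", "EX"), ("is", "VBZ"), ("a", "DT"), ("book", "NN"),
          ("on", "IN"), ("the", "DT"), ("desk", "NN")]
phrases = find_phrases(tagged, "(/DT) (/NN)+")
print(phrases)  # ['a book', 'the desk']
```

Encoding every label as "/TAG " keeps "/NN" from matching inside a longer label such as "/NNP", because the trailing space must follow the label directly.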
Then, for each of the one or more reference language noun phrases determined by the noun phrase determining unit 110, the marking unit 120 finds the target language noun phrase in the target language sentence that corresponds to that reference language noun phrase. In this way, the marking unit 120 can find one or more target language noun phrases in the target language sentence.
The marking unit 120 can mark each of the target language noun phrases it has found with the noun phrase label.
In the example above, in which the Chinese sentence to be processed is "on the desk there is a book", the two English noun phrases determined by the noun phrase determining unit 110 are "a book" and "the desk", as described earlier. In the Chinese sentence, the Chinese noun phrases corresponding to "the desk" and "a book" are "desk" and "a book", respectively. The marking unit 120 can therefore mark the noun phrase label on "desk" and "a book" in the Chinese sentence.
In one example, after "desk" and "a book" in the Chinese sentence have been marked with the noun phrase label, the result can be expressed (glossed word for word) as:
"[desk]NP on have [a book]NP".
Here, [...]NP denotes a noun phrase marked with the noun phrase label, and NP stands for "noun phrase label". In other examples, the noun phrase label may also be represented by characters of another form and is not limited to the above example.
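The marking step can be sketched as projecting each reference language span onto the target sentence through a word alignment. This is only an illustration under stated assumptions: the word-for-word glosses used as target tokens, the alignment pairs, and the helper names `project_span` and `mark_noun_phrases` are all hypothetical and do not come from the text.

```python
def project_span(ref_span, alignment):
    """Map a reference-language span (start, end) to the covering target-language span."""
    start, end = ref_span
    targets = [t for r, t in alignment if start <= r < end]
    if not targets:
        return None                      # no aligned target word: nothing to mark
    return (min(targets), max(targets) + 1)


def mark_noun_phrases(target_tokens, spans):
    """Wrap each target span in [...]NP and return the marked sentence."""
    span_end = {s: e for s, e in spans}
    out, i = [], 0
    while i < len(target_tokens):
        if i in span_end:
            e = span_end[i]
            out.append("[" + " ".join(target_tokens[i:e]) + "]NP")
            i = e
        else:
            out.append(target_tokens[i])
            i += 1
    return " ".join(out)


# English: There(0) is(1) a(2) book(3) on(4) the(5) desk(6)
# Target sentence, glossed word for word (assumed segmentation).
target_tokens = ["desk", "on", "have", "a", "book"]
# Assumed (reference index, target index) alignment pairs, 0-based.
alignment = [(1, 2), (2, 3), (3, 4), (4, 1), (6, 0)]
ref_spans = [(2, 4), (5, 7)]             # "a book", "the desk"

projected = [project_span(s, alignment) for s in ref_spans]
target_spans = [s for s in projected if s is not None]
marked = mark_noun_phrases(target_tokens, target_spans)
print(marked)  # [desk]NP on have [a book]NP
```

Taking the minimum and maximum aligned positions is one simple way to obtain a contiguous target span; other projection strategies are possible.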
Then, based on the result marked by the marking unit 120 and using predetermined maximum noun phrase templates, the maximum noun phrase determining unit 130 can determine the phrase or phrases in the target language sentence that match at least one of the predetermined maximum noun phrase templates to be the maximum noun phrases of the target language sentence.
It should be noted that a maximum noun phrase template includes at least one part-of-speech label of the target language and/or at least one noun phrase label arranged in its corresponding predetermined order.
That is, a maximum noun phrase template may contain only one part-of-speech label, or only one noun phrase label. Alternatively, it may contain a combination of any number of part-of-speech labels and any number of noun phrase labels.
In other implementations, a maximum noun phrase template may also contain predetermined keywords. A predetermined keyword may, for example, be located between adjacent part-of-speech labels and/or noun phrase labels. Note that a predetermined keyword is not necessarily present between every two adjacent part-of-speech labels and/or noun phrase labels.
For example, a predetermined keyword may be any of the function words such as prepositions, conjunctions and auxiliary words, or a modal verb.
In the example above, in which the result marked by the marking unit 120 is "[desk]NP on have [a book]NP", assume that the maximum noun phrase templates include:
* > X+ < * (hereinafter referred to as the first maximum noun phrase template);
* > (/a) 的 X+ < * (hereinafter referred to as the second maximum noun phrase template);
* > X+ (non-Chinese character) X+ < * (hereinafter referred to as the third maximum noun phrase template); and
, > can (/v) X+ X < 被 (hereinafter referred to as the fourth maximum noun phrase template).
Here, the part between ">" and "<" in "> ... <" represents the maximum noun phrase; the parts before ">" (the left side) and after "<" (the right side) represent context constraints, and "*" means that the context is unconstrained. X denotes a noun phrase marked with the noun phrase label. (/a) is the adjective part of speech, and (/v) is the verb part of speech.
Note that the above are only a few examples of maximum noun phrase templates; in practical applications, maximum noun phrase templates are not limited to these.
According to the first maximum noun phrase template, in "[desk]NP on have [a book]NP" the phrase "desk" has the structure of the first maximum noun phrase template: "desk" corresponds to the "X" in "* > X+ < *", "+" denotes repetition of "X" (there is none here), and the context is unconstrained. Similarly, the phrase "a book" also matches the structure of the first maximum noun phrase template, with "a book" corresponding to the "X" in "* > X+ < *". In addition, no phrase in "[desk]NP on have [a book]NP" matches the second to fourth maximum noun phrase templates.
Thus, through the processing of the noun phrase determining unit 110, the marking unit 120 and the maximum noun phrase determining unit 130, the maximum noun phrases obtained for "on the desk there is a book" in the above example are "desk" and "a book".
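Matching the first maximum noun phrase template against the marked sentence can be sketched as follows. The sketch assumes that "X+" with unconstrained context means a maximal run of adjacent labelled noun phrases; the sequence encoding, the tags given to the non-NP words, and the function name `match_first_template` are hypothetical.

```python
from itertools import groupby


def match_first_template(sequence):
    """Return phrases matching '* > X+ < *': runs of adjacent X items, any context."""
    phrases = []
    for is_np, group in groupby(sequence, key=lambda item: item[0] == "X"):
        if is_np:
            phrases.append(" ".join(text for _, text in group))
    return phrases


# Marked target sentence as (symbol, text) pairs; "X" marks a labelled noun phrase.
# The tags "F" and "v" for the other two words are assumed for illustration.
sequence = [("X", "desk"), ("F", "on"), ("v", "have"), ("X", "a book")]
maximum_nps = match_first_template(sequence)
print(maximum_nps)  # ['desk', 'a book']
```

Templates with context constraints or keywords, such as the second to fourth templates, would additionally need to check the items to the left and right of each candidate run.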
It should be noted that a maximum noun phrase as mentioned above is a noun phrase that is not contained in any other noun phrase. In terms of a syntax tree, it is the set of all leaf nodes under the first NP node encountered when descending from the root. However, the noun phrase structure of the target language (such as Chinese) is extremely complex, its word segmentation often contains many ambiguities, and its part-of-speech tagging also contains a considerable number of errors. Noun phrase recognition for the target language (such as Chinese) using traditional methods is therefore often of low accuracy, and so is the recognition of maximum noun phrases.
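The syntax-tree view of a maximum noun phrase can be illustrated with a small sketch. The tree encoding (nested tuples whose first element is the node label) and the example parse are assumptions made for illustration only; the text does not prescribe a tree format.

```python
def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, *children = node
    return [word for child in children for word in leaves(child)]


def maximal_nps(node):
    """Return the phrases under the first NP met on each path down from the root."""
    if isinstance(node, str):
        return []
    label, *children = node
    if label == "NP":
        # Stop here: nested NPs below this node belong to this maximum noun phrase.
        return [" ".join(leaves(node))]
    return [phrase for child in children for phrase in maximal_nps(child)]


# Hypothetical parse of "the cover of the book is red".
tree = ("S",
        ("NP", ("NP", "the", "cover"),
               ("PP", "of", ("NP", "the", "book"))),
        ("VP", "is", ("ADJP", "red")))
result = maximal_nps(tree)
print(result)  # ['the cover of the book']
```

The inner phrases "the cover" and "the book" are not reported because each is contained in a larger noun phrase.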
By contrast, the above device for extracting maximum noun phrases according to an embodiment of the present invention extracts maximum noun phrases by first using noun phrase templates of the reference language (such as English) to determine reference language noun phrases and, on this basis, then using maximum noun phrase templates of the target language (such as Chinese) to determine the maximum noun phrases of the target language. The maximum noun phrases are thus determined by hierarchical two-stage processing, and the processing result is more accurate. In addition, the processing performed with the above device is of relatively low complexity.
Fig. 2 schematically shows another example of the device for extracting maximum noun phrases according to an embodiment of the present invention. As shown in Fig. 2, in addition to a noun phrase determining unit 210, a marking unit 220 and a maximum noun phrase determining unit 230, the device 200 for extracting maximum noun phrases also includes a noun phrase template obtaining unit 240 for obtaining noun phrase templates. The noun phrase determining unit 210, the marking unit 220 and the maximum noun phrase determining unit 230 of the device 200 shown in Fig. 2 may have the same structures and functions as the noun phrase determining unit 110, the marking unit 120 and the maximum noun phrase determining unit 130, respectively, of the device 100 described above with reference to Fig. 1, and can achieve similar technical effects; they are not described again here.
The noun phrase template obtaining unit 240 can determine at least one reference language noun phrase template corresponding to a predetermined bilingual aligned corpus of the target language and the reference language by counting the part-of-speech labels contained in the reference language noun phrases that correspond to the target language noun phrases in the bilingual aligned corpus.
The bilingual aligned corpus contains multiple bilingual sentence pairs. The two sentences of each pair are translations of each other and aligned with each other, and the words of the target language sentence and of the reference language sentence in each pair are aligned individually (for example, using an existing alignment method). For example, it is known which word of the target language sentence a given word of the reference language sentence translates, and vice versa. The following is one aligned sentence pair:
As the aligned sentence pair above shows, a word may be translated into one or more words, or may not be translated at all.
In the aligned sentence pair above, the bilingual correspondences are as follows:
?:in
Economical:economic
Field:field
Cooperation:cooperation
Fig. 3 shows a possible example structure of the noun phrase template obtaining unit 240.
As shown in Fig. 3, the noun phrase template obtaining unit 240 may include a first determining subunit 310 and a second determining subunit 320.
The target language noun phrases in the bilingual aligned corpus may be predetermined, for example by user input. In other implementations, the target language noun phrases in the bilingual aligned corpus may also be determined by other existing techniques.
In addition, each reference language sentence in the bilingual aligned corpus may, for example, be part-of-speech tagged in advance.
For example, assume that one bilingual sentence pair in the above bilingual aligned corpus includes the following Chinese (as an example of the target language) sentence:
"Because the market demands a higher degree of accuracy, for example for FCC stage II E-911 services, mobile stations 200 capable of performing positioning measurements are expected to spread widely on the market." (hereinafter referred to as the first Chinese sentence)
Also assume that the English (as an example of the reference language) sentence included in this bilingual sentence pair is as follows:
“An accuracy of because market need better, such as FCC stage II for E-911 services, it is possible to perform a localization measurement mobile station 200 are expected will on the market from overflowing.” (hereinafter referred to as the first English sentence)
After word segmentation and part-of-speech tagging, the first Chinese sentence becomes:
because|p market|n need|v higher|a 的|u degree-of-accuracy|n ,|w for-example|v FCC stage|n II|m E-911 service|vn ,|w can|v perform|v positioning|n measurement|v 的|u mobile|vn station|v 200|m 被|p expect|v will|v at|p market|n on|F spread-unchecked|v .|w
Here, the character to the right of each "|" indicates the part of speech of the word to its left: p denotes a preposition, n a noun, v a verb, a an adjective, u an auxiliary word, w a punctuation mark, m a numeral, vn a word that is both verb and noun, F a noun of locality, and so on.
In addition, the part-of-speech tagged first English sentence is:
“An/DT accuracy/NN of/IN because/IN market/NN need/VV better/JJR ,/, such/JJ as/IN FCC/NNP stage/NN II/NNP for/IN E/NNP -/: 911/CD services/NN ,/, it/PRP is/VBZ possible/JJ to/TO perform/VV a/DT localization/NN measurement/NN mobile/JJ station/NN 200/CD are/VBP expected/VBN will/MD on/IN the/DT market/NN from/IN overflowing/VBG ./.”
Similarly, the characters to the right of each "/" indicate the part of speech of the word to its left: DT denotes an article, NN a noun, IN a preposition, JJR the comparative form of an adjective, JJ an adjective, NNP a proper noun, VV a verb, CD a numeral, PRP a pronoun, TO the infinitive marker, VBG the -ing form of a verb, and so on. The meanings of these part-of-speech symbols can be found in publicly available references in the field and are not listed one by one here.
Alignment relation between first English sentence and the first Chinese sentence is as follows:
1:5 2:6 4:1 5:2 6:3 7:4 8:7 9:8 10:8 11:9 12:10 13:11 14:12 15:12 16:12 17:12 18:13 19:14 20:15 22:15 24:16 26:17 27:18 28:20 29:21 30:22 31:23 32:24 33:25 34:28 35:26 36:27 37:29 38:29 39:30
Note that in the above alignment relation, the number to the left of each colon is the position of a word in the first English sentence, and the number to the right of the colon is the position of a word in the first Chinese sentence. The English word at the position on the left of a colon is aligned with the Chinese word at the position on the right of that colon.
Note also that in each bilingual sentence pair of the bilingual alignment corpus, the words of the reference-language sentence and of the target-language sentence are aligned with each other, but the word order of the reference-language sentence and/or the target-language sentence is not necessarily fully accurate. For example, for the first Chinese sentence and the first English sentence above, the alignment between the words of the two sentences is relatively accurate, as described earlier, whereas the word order of the first English sentence can be relatively inaccurate.
Thus, for each (predetermined) target-language noun phrase in the bilingual alignment corpus, the first determination subelement 310 can find, in the reference-language sentence corresponding to that target-language noun phrase, the reference-language noun phrase aligned with it, and generate a first candidate template of the reference-language noun phrase template from the part-of-speech labels that this reference-language noun phrase contains.
Taking the first Chinese sentence as an example, assume that "degree of accuracy" is one of the predetermined target-language noun phrases. "Degree of accuracy" has position 6 in the first Chinese sentence, and the alignment relation shows that the 2nd word of the first English sentence is aligned with the 6th word of the first Chinese sentence. Therefore, in the reference-language sentence corresponding to the target-language noun phrase "degree of accuracy" (i.e., the first English sentence), the reference-language noun phrase aligned with "degree of accuracy" is "accuracy". The part-of-speech label that "accuracy" contains is "/NN", so a first candidate template can be generated: (/NN)+.
Note that when a reference-language noun phrase contains multiple part-of-speech labels, the labels in the template should appear in the same order as the corresponding words appear in the reference-language sentence.
By processing each predetermined target-language noun phrase in the bilingual alignment corpus in this way, multiple first candidate templates can be obtained.
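The derivation of a first candidate template from the alignment can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, the alignment pairs are read from the alignment relation above, and the rule for writing "+" is an assumption (a run of identical labels is collapsed to one label, and NN is always written as repeatable), because the text does not state how the "+" mark is assigned.

```python
# Part-of-speech-tagged first English sentence and its alignment, as given in the text.
TAGGED = (
    "An/DT accuracy/NN of/IN because/IN market/NN need/VV better/JJR ,/, such/JJ as/IN "
    "FCC/NNP stage/NN II/NNP for/IN E/NNP -/: 911/CD services/NN ,/, it/PRP is/VBZ "
    "possible/JJ to/TO perform/VV a/DT localization/NN measurement/NN mobile/JJ "
    "station/NN 200/CD are/VBP expected/VBN will/MD on/IN the/DT market/NN from/IN "
    "overflowing/VBG ./."
)
ALIGNMENT = (
    "1:5 2:6 4:1 5:2 6:3 7:4 8:7 9:8 10:8 11:9 12:10 13:11 14:12 15:12 16:12 17:12 "
    "18:13 19:14 20:15 22:15 24:16 26:17 27:18 28:20 29:21 30:22 31:23 32:24 33:25 "
    "34:28 35:26 36:27 37:29 38:29 39:30"
)


def parse_tags(tagged):
    """Return the label of every word/LABEL token, in sentence order."""
    return [token.rsplit("/", 1)[1] for token in tagged.split()]


def parse_alignment(text):
    """Return (reference position, target position) pairs, both 1-based."""
    return [tuple(int(n) for n in pair.split(":")) for pair in text.split()]


def candidate_template(target_positions, alignment, tags):
    """Build a first candidate template for one target-language noun phrase."""
    # Reference words aligned with the phrase, kept in reference-sentence order.
    ref_positions = sorted({ref for ref, tgt in alignment if tgt in target_positions})
    collapsed = []
    for pos in ref_positions:
        tag = tags[pos - 1]
        if not collapsed or collapsed[-1] != tag:  # collapse runs of one label
            collapsed.append(tag)
    # Assumption: NN is written as repeatable, other labels as single occurrences.
    return "".join(f"(/{t})+" if t == "NN" else f"(/{t})" for t in collapsed)


tags = parse_tags(TAGGED)
alignment = parse_alignment(ALIGNMENT)
print(candidate_template({6}, alignment, tags))  # "degree of accuracy" -> (/NN)+
```

Under the same assumptions, the target positions 17 and 18 ("positioning", "measure") map to the two adjacent NN words "localization measurement" and also yield (/NN)+.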
For example, the first determination subelement 310 may obtain the following first candidate templates:
(/NN)+
(/DT)(/NN)+
(/DT)(/NN)+(/JJ)+(/NN)+
(/JJ)
(/NN)+(/DT)(/NN)+
(/JJ)(/VV)(/NN)+
(/VV)(/NN)+(/CC)(/NN)+
(/DT)(/NN)(/VBG)(/NN)+
(/DT)(/NN)+(/CC)(/NN)+
(/JJ)(/NN)+(/PREP)(/NN)+
Then, the second determination subelement 320 can count the number of occurrences of each first candidate template and determine the first candidate templates whose number of occurrences is higher than a first predetermined threshold as reference-language noun phrase templates. In this way, at least one reference-language noun phrase template is obtained. The first predetermined threshold may, for example, be set empirically or determined by testing, which is not described further here.
For example, assume that the first predetermined threshold is 100. Assume further that "(/VV)(/NN)+(/CC)(/NN)+" occurs 120 times, "(/NN)+" occurs 200 times, "(/DT)(/NN)+" occurs 110 times, and each of the remaining first candidate templates occurs fewer than 100 times. The reference-language noun phrase templates finally determined are then:
(/VV)(/NN)+(/CC)(/NN)+
(/NN)+
(/DT)(/NN)+
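The counting and thresholding step can be sketched as below. The function name `select_templates` is hypothetical, and the count of 40 for "(/JJ)" is an invented stand-in for "fewer than 100"; the other counts and the threshold come from the example above.

```python
from collections import Counter


def select_templates(candidates, threshold):
    """Keep the candidate templates that occur more often than the threshold."""
    counts = Counter(candidates)
    return [template for template, n in counts.items() if n > threshold]


# One list entry per occurrence of a candidate template in the corpus.
candidates = (
    ["(/VV)(/NN)+(/CC)(/NN)+"] * 120
    + ["(/NN)+"] * 200
    + ["(/DT)(/NN)+"] * 110
    + ["(/JJ)"] * 40  # hypothetical count below the threshold
)
selected = select_templates(candidates, threshold=100)
print(selected)
```

The comparison is strict because the text requires the number of occurrences to be higher than the threshold.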
In this way, the noun phrase template obtaining unit 240 can use the predetermined bilingual alignment corpus to obtain reference-language noun phrase templates that contain reference-language part-of-speech labels. The templates so obtained can be used by other data processing; for example, they can be stored in advance in other equipment and called when processing is needed. Because they are built from a bilingual alignment corpus, these reference-language noun phrase templates reflect the characteristics of noun phrases relatively accurately, so data processing that uses them, such as noun phrase extraction, achieves relatively high accuracy. Using these templates can also improve the efficiency of subsequent processing.
Another example of the device for extracting a maximum noun phrase according to an embodiment of the present invention is described in detail below with reference to Fig. 4.
In the example shown in Fig. 4, the device 400 for extracting a maximum noun phrase includes, in addition to a noun phrase determining unit 410, a mark unit 420 and a maximum noun phrase determining unit 430, a maximum noun phrase template obtaining unit 450. The noun phrase determining unit 410, mark unit 420 and maximum noun phrase determining unit 430 of the device 400 shown in Fig. 4 can have the same structure and function as the noun phrase determining unit 110, mark unit 120 and maximum noun phrase determining unit 130 of the device 100 described above with reference to Fig. 1, and can achieve similar technical effects, so they are not described again here.
The maximum noun phrase template obtaining unit 450 is configured to, based on a target-language corpus in which noun phrases and maximum noun phrases have been determined, count at least the part-of-speech labels and noun phrase labels corresponding to each determined maximum noun phrase in the target-language corpus, so as to determine at least one target-language maximum noun phrase template corresponding to the corpus.
The noun phrases and maximum noun phrases in the target-language corpus may, for example, be determined in advance according to user input.
Fig. 5 shows a kind of possible exemplary construction of maximum noun phrase template obtaining unit 450.
As shown in Fig. 5, in one example, the maximum noun phrase template obtaining unit 450 can include a part-of-speech tagging subelement 510, a label for labelling subelement 520, a third determination subelement 530 and a fourth determination subelement 540.
The part-of-speech tagging subelement 510 is configured to perform part-of-speech tagging on each sentence in the target-language corpus.
The label for labelling subelement 520 is configured to attach the noun phrase label to each determined noun phrase in the target-language corpus. Its specific processing is similar to that of the mark unit 120 described above with reference to Fig. 1 and can achieve similar effects, so it is not described again here.
For example, after processing by the label for labelling subelement 520, a sentence in the target-language corpus reads:
Because|p [market]NP needs|v higher|a 's|u [degree of accuracy]NP ,|w for example|v [FCC stage]NP II|m E-911 [services]NP ,|w can|v perform|v [positioning measurement]NP 's|u [mobile station 200]NP by|p expected|v will|v at|p [market]NP on|F overflow|v .|w
The third determination subelement 530 is configured to obtain second candidate templates of the target-language maximum noun phrase template from the part-of-speech labels and noun phrase labels contained in each determined maximum noun phrase in the target-language corpus.
Assume that the phrase "mobile station 200 that can perform positioning measurement" in the above sentence has been determined in advance as a maximum noun phrase. From the part-of-speech labels and noun phrase labels it contains, the following second candidate template can be obtained:
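(/v)(/v)X<的>X+
Here, X stands for the noun phrase label, and the auxiliary word "的" (glossed as "'s" in the tagged sentence) that links "can perform positioning measurement" to "mobile station 200" serves as a predetermined keyword, so the template uses the word itself rather than its part-of-speech label.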
By processing each determined maximum noun phrase in the target-language corpus in this way, multiple second candidate templates can be obtained.
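A second candidate template can be assembled from a labelled maximum noun phrase roughly as follows. This is a sketch under stated assumptions: the function name, the `(text, label)` item format, and the notation (X for a noun phrase label, angle brackets for a keyword) are illustrative, the keyword list containing only the auxiliary word "的" is assumed, and English glosses stand in for the other target-language words. No repetition marks such as "+" are emitted, because the text does not say how they are chosen.

```python
KEYWORDS = {"的"}  # assumed list of predetermined keywords


def second_candidate_template(items, keywords=KEYWORDS):
    """Turn a labelled maximum noun phrase into a candidate template string.

    items is a list of (text, label) pairs; label is "NP" for a chunk that
    already carries the noun phrase label, otherwise a part-of-speech label.
    """
    parts = []
    for text, label in items:
        if label == "NP":
            parts.append("X")           # noun phrase label
        elif text in keywords:
            parts.append(f"<{text}>")   # keyword: keep the word itself
        else:
            parts.append(f"(/{label})")  # ordinary word: keep its part of speech
    return "".join(parts)


maximum_noun_phrase = [
    ("can", "v"),
    ("perform", "v"),
    ("positioning measurement", "NP"),
    ("的", "u"),
    ("mobile station 200", "NP"),
]
print(second_candidate_template(maximum_noun_phrase))  # (/v)(/v)X<的>X
```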
The fourth determination subelement 540 can count the number of occurrences of each second candidate template and determine the second candidate templates whose number of occurrences is higher than a second predetermined threshold as target-language maximum noun phrase templates. In this way, at least one target-language maximum noun phrase template is obtained. The second predetermined threshold may, for example, be set empirically or determined by testing, which is not described further here.
In this way, the maximum noun phrase template obtaining unit 450 can use a target-language corpus in which noun phrases and maximum noun phrases have been determined to obtain target-language maximum noun phrase templates that contain target-language part-of-speech labels and noun phrase labels. The templates so obtained can be used by other data processing; for example, they can be stored in advance in other equipment and called when processing is needed. Because they are built from such a corpus, these target-language maximum noun phrase templates reflect the characteristics of maximum noun phrases relatively accurately, so data processing that uses them, such as maximum noun phrase extraction, achieves relatively high accuracy. Using these templates can also improve the efficiency of subsequent processing.
Note also that in some other implementations, the device 400 for extracting a maximum noun phrase can optionally include a noun phrase template obtaining unit 440 in addition to the noun phrase determining unit 410, mark unit 420, maximum noun phrase determining unit 430 and maximum noun phrase template obtaining unit 450 described above. The noun phrase template obtaining unit 440 can have the same structure and function as the noun phrase template obtaining unit 240 described above with reference to Fig. 2 or Fig. 3, and can achieve similar technical effects, so it is not described again here.
In addition, embodiments of the present invention provide a method for extracting a maximum noun phrase, the method including: in a reference-language translation of a target-language sentence to be processed, determining a reference-language noun phrase that matches any one of at least one noun phrase template, where each noun phrase template includes at least one reference-language part-of-speech label arranged in a predetermined order corresponding to that template; attaching a noun phrase label to the target-language noun phrase in the target-language sentence that corresponds to the determined reference-language noun phrase; and determining, in the target-language sentence, a phrase that matches any one of at least one maximum noun phrase template as a maximum noun phrase, where each maximum noun phrase template includes at least one target-language part-of-speech label and/or at least one noun phrase label arranged in a predetermined order corresponding to that template.
An exemplary process of the above method for extracting a maximum noun phrase is described below with reference to Fig. 6.
As shown in Fig. 6, the processing flow 600 of the method for extracting a maximum noun phrase according to an embodiment of the present invention starts at step S610 and then proceeds to step S620.
In step S620, a reference-language noun phrase that matches any one of at least one noun phrase template is determined in the reference-language translation of the target-language sentence to be processed. The flow then proceeds to step S630. Each noun phrase template includes at least one reference-language part-of-speech label arranged in a predetermined order corresponding to that template.
The processing performed in step S620 can, for example, be the same as that of the noun phrase determining unit 110 described above with reference to Fig. 1 and can achieve similar technical effects, so it is not described again here.
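One way to picture the matching in step S620 is to treat a noun phrase template as a regular expression over the label sequence of the tagged translation. The sketch below does this with Python's `re` module; the function names are hypothetical, and reading "+" as "one or more repetitions" is an assumption suggested by the template notation rather than something the text states.

```python
import re


def template_to_regex(template):
    """Convert e.g. "(/DT)(/NN)+" into a regex over a "/DT /NN " label string."""
    def piece(match):
        return "(?:/" + re.escape(match.group(1)) + " )"
    # Any "+" that follows a "(/LABEL)" piece is left in place as a quantifier.
    return re.compile(re.sub(r"\(/([^)]+)\)", piece, template))


def find_phrases(tagged_tokens, template):
    """Return the word sequences whose labels match the template."""
    words, tags = zip(*(token.rsplit("/", 1) for token in tagged_tokens))
    encoded = "".join(f"/{tag} " for tag in tags)
    # Map character offsets in the encoded string back to token indexes.
    starts, offset = {}, 0
    for index, tag in enumerate(tags):
        starts[offset] = index
        offset += len(tag) + 2
    starts[offset] = len(tags)
    return [
        " ".join(words[starts[m.start()]:starts[m.end()]])
        for m in template_to_regex(template).finditer(encoded)
    ]


tokens = "An/DT accuracy/NN of/IN because/IN market/NN need/VV".split()
print(find_phrases(tokens, "(/DT)(/NN)+"))  # ['An accuracy']
print(find_phrases(tokens, "(/NN)+"))       # ['accuracy', 'market']
```

Each label is encoded with a leading "/" and a trailing space so that a label such as NN cannot match inside a longer label such as NNP.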
In one implementation, the noun phrase templates used in step S620 can be obtained in advance as follows: in a predetermined bilingual alignment corpus of the target language and the reference language, count the part-of-speech labels contained in the reference-language noun phrase corresponding to each target-language noun phrase, so as to determine at least one reference-language noun phrase template corresponding to the bilingual alignment corpus.
Each target-language noun phrase in the bilingual alignment corpus may, for example, be predetermined.
In addition, each reference-language sentence in the bilingual alignment corpus may, for example, have been part-of-speech tagged in advance.
In one example, steps S710 and S720 shown in Fig. 7 may optionally also be executed before step S620 (for example between step S610 and step S620, or before step S610).
As shown in Fig. 7, in step S710, first candidate templates of the reference-language noun phrase template can be obtained from the part-of-speech labels contained in the reference-language noun phrase that, in the corresponding reference-language sentence, is aligned with each target-language noun phrase. The flow then proceeds to step S720.
In step S720, the number of occurrences of each first candidate template can be counted, and the first candidate templates whose number of occurrences is higher than the first predetermined threshold are determined as the at least one reference-language noun phrase template.
In this way, through steps S710 and S720, a predetermined bilingual alignment corpus can be used to obtain reference-language noun phrase templates that contain reference-language part-of-speech labels. Because they are built from a bilingual alignment corpus, these templates reflect the characteristics of noun phrases relatively accurately, so data processing that uses them, such as noun phrase extraction, achieves relatively high accuracy.
In addition, the noun phrase templates used in step S620 can be constructed using the processing of the noun phrase template obtaining unit 240 described above with reference to Fig. 2 or Fig. 3, or of its constituent subelements, with similar technical effects, which are not described again here.
In step S630, the noun phrase label is attached to the target-language noun phrase in the target-language sentence that corresponds to the determined reference-language noun phrase. The flow then proceeds to step S640. The processing performed in step S630 can, for example, be the same as that of the mark unit 120 described above with reference to Fig. 1 and can achieve similar technical effects, so it is not described again here.
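The label projection of step S630 can be sketched as follows, assuming the word alignment is available as (reference position, target position) pairs. The function names are hypothetical, the target words are English glosses of the segmented first Chinese sentence, only the alignment pairs needed for the example are listed, and the aligned target words are assumed to be contiguous.

```python
def project_positions(ref_positions, alignment):
    """Return the target positions aligned with the given reference positions."""
    return sorted({tgt for ref, tgt in alignment if ref in ref_positions})


def label_target(words, positions):
    """Wrap the target words at the given 1-based positions in [ ]NP.

    Assumes the positions form one contiguous run.
    """
    first, last = positions[0], positions[-1]
    chunk = "[" + " ".join(words[first - 1:last]) + "]NP"
    return words[:first - 1] + [chunk] + words[last:]


# English glosses of the 30 segmented words of the first Chinese sentence.
target_words = [
    "because", "market", "needs", "higher", "'s", "degree of accuracy", ",",
    "for example", "FCC", "stage", "II", "E-911", "services", ",", "can",
    "perform", "positioning", "measurement", "'s", "mobile", "station", "200",
    "by", "expected", "will", "at", "market", "on", "overflow", ".",
]
alignment = [(24, 16), (26, 17), (27, 18), (28, 20)]  # subset of the pairs in the text

# "localization measurement" occupies reference positions 26 and 27.
positions = project_positions({26, 27}, alignment)
labelled = label_target(target_words, positions)
print(positions)     # [17, 18]
print(labelled[16])  # [positioning measurement]NP
```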
In step S640, a phrase in the target-language sentence that matches any one of at least one maximum noun phrase template is determined as a maximum noun phrase. The flow then proceeds to step S650. Each maximum noun phrase template includes at least one target-language part-of-speech label and/or at least one noun phrase label arranged in a predetermined order corresponding to that template.
The processing performed in step S640 can, for example, be the same as that of the maximum noun phrase determining unit 130 described above with reference to Fig. 1 and can achieve similar technical effects, so it is not described again here.
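The matching in step S640 differs from step S620 in that the sequence being matched mixes part-of-speech labels, noun phrase labels and keywords. A minimal sketch, under the assumptions that X denotes a noun phrase label, angle brackets enclose a keyword, "+" means one or more repetitions, and a keyword is always matched as a word rather than by its part of speech, could look like this (all names and the example template are illustrative, and English glosses stand in for target-language words other than the keyword):

```python
import re

PIECE = re.compile(r"\(/(?P<tag>[^)]+)\)|(?P<np>X)|<(?P<kw>[^>]+)>")


def compile_template(template):
    """Convert e.g. "(/v)(/v)X<的>X+" into a regex over "|symbol|" units."""
    def piece(match):
        if match.group("tag"):
            symbol = "/" + match.group("tag")
        elif match.group("np"):
            symbol = "X"
        else:
            symbol = "<" + match.group("kw") + ">"
        return "(?:" + re.escape("|" + symbol + "|") + ")"
    return re.compile(PIECE.sub(piece, template))


def find_maximum_noun_phrases(items, template, keywords):
    """items: (text, label) pairs; label is "NP" or a part-of-speech label."""
    symbols = []
    for text, label in items:
        if label == "NP":
            symbols.append("X")
        elif text in keywords:
            symbols.append("<" + text + ">")
        else:
            symbols.append("/" + label)
    encoded = "".join("|" + s + "|" for s in symbols)
    # Map character offsets in the encoded string back to item indexes.
    bounds, offset = {}, 0
    for index, s in enumerate(symbols):
        bounds[offset] = index
        offset += len(s) + 2
    bounds[offset] = len(symbols)
    return [
        " ".join(text for text, _ in items[bounds[m.start()]:bounds[m.end()]])
        for m in compile_template(template).finditer(encoded)
    ]


sentence = [
    (",", "w"), ("can", "v"), ("perform", "v"), ("positioning measurement", "NP"),
    ("的", "u"), ("mobile station 200", "NP"), ("by", "p"), ("expected", "v"),
]
found = find_maximum_noun_phrases(sentence, "(/v)(/v)X<的>X+", {"的"})
print(found)  # ['can perform positioning measurement 的 mobile station 200']
```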
In one implementation, the maximum noun phrase templates used in step S640 can be obtained in advance as follows: based on a target-language corpus in which noun phrases and maximum noun phrases have been determined, count at least the part-of-speech labels and noun phrase labels corresponding to each determined maximum noun phrase in the corpus, so as to determine at least one target-language maximum noun phrase template corresponding to the corpus.
In one example, steps S810 to S840 shown in Fig. 8 may optionally also be executed before step S640 (for example between step S610 and step S620, between step S620 and step S630, between step S630 and step S640, or before step S610).
As shown in Fig. 8, in step S810, part-of-speech tagging can be performed on each sentence in the target-language corpus. The flow then proceeds to step S820.
In step S820, the noun phrase label can be attached to each determined noun phrase in the target-language corpus. The flow then proceeds to step S830.
In step S830, second candidate templates of the target-language maximum noun phrase template can be obtained from the part-of-speech labels and noun phrase labels contained in each determined maximum noun phrase in the target-language corpus. The flow then proceeds to step S840.
In step S840, the number of occurrences of each second candidate template can be counted, and the second candidate templates whose number of occurrences is higher than the second predetermined threshold are determined as the at least one target-language maximum noun phrase template.
In this way, through steps S810 to S840, a target-language corpus in which noun phrases and maximum noun phrases have been determined can be used to obtain target-language maximum noun phrase templates that contain target-language part-of-speech labels and noun phrase labels. Because they are built from such a corpus, these templates reflect the characteristics of maximum noun phrases relatively accurately, so data processing that uses them, such as maximum noun phrase extraction, achieves relatively high accuracy.
In addition, the maximum noun phrase templates used in step S640 can be constructed using the processing of the maximum noun phrase template obtaining unit 450 described above with reference to Fig. 4 or Fig. 5, or of its constituent subelements, with similar technical effects, which are not described again here.
Handling process 600 ends at step S650.
The above method for extracting a maximum noun phrase according to an embodiment of the present invention first uses noun phrase templates of the reference language (such as English) to determine reference-language noun phrases, and on that basis uses maximum noun phrase templates of the target language (such as Chinese) to determine target-language maximum noun phrases. Maximum noun phrases are thus determined by hierarchical two-stage processing, and the results obtained are relatively accurate. The processing carried out by this method is also of relatively low complexity.
In addition, embodiments of the present invention provide an electronic equipment that includes the device for extracting a maximum noun phrase described above. In a specific implementation, the electronic equipment can be any one of the following: a computer; a tablet computer; a personal digital assistant; a multimedia playback device; a mobile phone; an electronic paper book; and so on. The electronic equipment has the functions and technical effects of the above device for extracting a maximum noun phrase, which are not described again here.
The constituent units, subunits, modules, etc. of the above device for extracting a maximum noun phrase according to an embodiment of the present invention may be configured by software, firmware, hardware or any combination thereof. When implemented by software or firmware, a program constituting the software or firmware can be installed from a storage medium or a network onto a machine having a dedicated hardware structure (for example, the general-purpose machine 900 shown in Fig. 9); with the various programs installed, the machine can perform the functions of the above constituent units and subunits.
Fig. 9 is a schematic structural diagram of the hardware configuration of one possible information processing equipment that can be used to implement the device and method for extracting a maximum noun phrase according to an embodiment of the present invention.
In Fig. 9, a central processing unit (CPU) 901 performs various processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage part 908 into a random access memory (RAM) 903. The RAM 903 also stores, as needed, data required when the CPU 901 performs the various processing. The CPU 901, ROM 902 and RAM 903 are connected to one another via a bus 904. An input/output interface 905 is also connected to the bus 904.
The following components are also connected to the input/output interface 905: an input part 906 (including a keyboard, a mouse, etc.), an output part 907 (including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, etc.), the storage part 908 (including a hard disk, etc.), and a communication part 909 (including a network interface card such as a LAN card, a modem, etc.). The communication part 909 performs communication processing via a network such as the Internet. A drive 910 may also be connected to the input/output interface 905 as needed. A removable medium 911 such as a magnetic disk, optical disc, magneto-optical disc or semiconductor memory can be mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage part 908 as needed.
When the above series of processing is implemented by software, the program constituting the software can be installed from a network such as the Internet or from a storage medium such as the removable medium 911.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 911 shown in Fig. 9, which stores the program and is distributed separately from the equipment to provide the program to users. Examples of the removable medium 911 include a magnetic disk (including a floppy disk), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 902, a hard disk contained in the storage part 908, or the like, which stores the program and is distributed to users together with the equipment that contains it.
In addition, the invention also provides a program product storing machine-readable instruction code. When read and executed by a machine, the instruction code can carry out the above method for extracting a maximum noun phrase according to an embodiment of the present invention. Accordingly, the various storage media that carry such a program product, such as magnetic disks, optical discs, magneto-optical discs and semiconductor memories, are also included in the disclosure of the invention.
In the above description of specific embodiments of the invention, features described and/or illustrated for one embodiment can be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
In addition, the methods of the embodiments of the present invention are not limited to being executed in the chronological order described in the specification or shown in the drawings; they can also be executed in another chronological order, in parallel, or independently. The execution order of the methods described in this specification therefore does not limit the technical scope of the invention.
It should further be understood that each operation of the above method according to the invention can also be implemented as a computer-executable program stored in various machine-readable storage media.
Moreover, the object of the invention can also be achieved as follows: a storage medium storing the above executable program code is supplied, directly or indirectly, to a system or equipment, and a computer or central processing unit (CPU) in that system or equipment reads and executes the program code.
In this case, as long as the system or equipment has the function of executing a program, embodiments of the invention are not limited to a program, and the program can take any form, for example an object program, a program executed by an interpreter, or a script supplied to an operating system.
The above machine-readable storage media include, but are not limited to: various memories and storage units, semiconductor devices, disk units such as optical, magnetic and magneto-optical disks, and other media suitable for storing information.
In addition, the invention can also be realized by a client computer connecting to a corresponding website on the Internet, downloading and installing the computer program code according to the invention, and then executing the program.
Finally, note that relational terms such as left and right, or first and second, are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or equipment that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or equipment. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or equipment that includes the element.
To sum up, in an embodiment according to the present invention, the invention provides following scheme but not limited to this:
Remark 1. A device for extracting a maximum noun phrase, including:
a noun phrase determining unit configured to determine, in a reference-language translation of a target-language sentence to be processed, a reference-language noun phrase that matches any one of at least one noun phrase template, where each noun phrase template includes at least one reference-language part-of-speech label arranged in a predetermined order corresponding to that template;
a mark unit configured to attach a noun phrase label to the target-language noun phrase in the target-language sentence that corresponds to the determined reference-language noun phrase; and
a maximum noun phrase determining unit configured to determine, in the target-language sentence, a phrase that matches any one of at least one maximum noun phrase template as a maximum noun phrase, where each maximum noun phrase template includes at least one target-language part-of-speech label and/or at least one noun phrase label arranged in a predetermined order corresponding to that template.
Remark 2. The device for extracting a maximum noun phrase according to Remark 1, further including:
a noun phrase template obtaining unit configured to count, in a predetermined bilingual alignment corpus of the target language and the reference language, the part-of-speech labels contained in the reference-language noun phrase corresponding to each target-language noun phrase, so as to determine at least one reference-language noun phrase template corresponding to the bilingual alignment corpus.
Remark 3. The device for extracting a maximum noun phrase according to Remark 2, wherein:
each target-language noun phrase in the bilingual alignment corpus is predetermined; and
each reference-language sentence in the bilingual alignment corpus has been part-of-speech tagged in advance.
Remark 4. The device for extracting a maximum noun phrase according to Remark 2 or 3, wherein the noun phrase template obtaining unit includes:
a first determination subelement configured to obtain first candidate templates of the reference-language noun phrase template from the part-of-speech labels contained in the reference-language noun phrase that, in the corresponding reference-language sentence, is aligned with each target-language noun phrase; and
a second determination subelement configured to count the number of occurrences of each first candidate template and determine the first candidate templates whose number of occurrences is higher than a first predetermined threshold as the at least one reference-language noun phrase template.
Device for extracting maximum noun phrase according to any one of remarks 1-4 for the remarks 5., also includes:
Maximum noun phrase template obtaining unit, it is configured for determining noun phrase and maximum noun phrase Target language corpora, at least count in described target language corpora each determine maximum noun phrase corresponding to Part of speech label and noun phrase label, to determine described target language corpora at least one object language maximum noun corresponding Phrase template.
Remark 6. The device for extracting a maximum noun phrase according to Remark 5, wherein the maximum noun phrase template obtaining unit includes:
A part-of-speech tagging subunit configured to perform part-of-speech tagging on each sentence in the target language corpus;
A label assigning subunit configured to assign a noun phrase label to each determined noun phrase in the target language corpus;
A third determination subunit configured to obtain second candidate templates for the target language maximum noun phrase template from the part-of-speech labels and noun phrase labels contained in each determined maximum noun phrase in the target language corpus; and
A fourth determination subunit configured to count the number of occurrences of each second candidate template and to determine each second candidate template whose number of occurrences is higher than a second predetermined threshold to be one of the at least one target language maximum noun phrase template.
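A minimal Python sketch of the third and fourth subunits of Remark 6 follows. It assumes the corpus has already been part-of-speech tagged and that noun phrases and maximum noun phrases are given as half-open token spans; the helper names, the `NP` label string, and the Penn Chinese Treebank style tags (`NN`, `DEG`, `VV`) are assumptions made for illustration only.

```python
from collections import Counter

def label_sequence(pos_tags, np_spans, start, end):
    """Describe tokens [start, end) by POS labels, collapsing each
    noun phrase span that lies inside the range into a single 'NP' label."""
    span_end = {s: e for s, e in np_spans if start <= s and e <= end}
    labels, i = [], start
    while i < end:
        if i in span_end:
            labels.append("NP")   # noun phrase label replaces its tokens
            i = span_end[i]
        else:
            labels.append(pos_tags[i])
            i += 1
    return tuple(labels)

def max_np_templates(corpus, second_threshold):
    """corpus: iterable of (pos_tags, np_spans, max_np_spans) per sentence."""
    counts = Counter()
    for pos_tags, np_spans, max_np_spans in corpus:
        for start, end in max_np_spans:
            # Third subunit: each label sequence is a second candidate template.
            counts[label_sequence(pos_tags, np_spans, start, end)] += 1
    # Fourth subunit: keep candidates occurring more often than the threshold.
    return {t for t, n in counts.items() if n > second_threshold}

# Hypothetical annotated corpus of three sentences.
corpus = [
    (["NN", "NN", "DEG", "NN", "VV", "NN"], [(0, 2), (3, 4), (5, 6)], [(0, 4)]),
    (["NN", "DEG", "JJ", "NN", "VV"], [(0, 1), (2, 4)], [(0, 4)]),
    (["VV", "NN"], [(1, 2)], [(1, 2)]),
]
max_templates = max_np_templates(corpus, second_threshold=1)
print(max_templates)
```

Collapsing inner noun phrases into one label is what lets a single template such as `NP DEG NP` cover maximum noun phrases whose inner phrases differ in length.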
Remark 7. The device for extracting a maximum noun phrase according to any one of Remarks 1-6, wherein, in at least some of the maximum noun phrase templates, a predetermined keyword is present between some of the adjacent part-of-speech labels and/or noun phrase labels.
Remark 8. The device for extracting a maximum noun phrase according to Remark 7, wherein the predetermined keyword includes at least one of the following:
A preposition, a conjunction, an auxiliary word, and a modal verb.
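One way to picture a template that carries a predetermined keyword between adjacent labels is sketched below in Python. The representation of a template element as a `(kind, value)` pair, the function `matches_at`, and the example sentence are all hypothetical; the patent text does not prescribe a data structure.

```python
def matches_at(items, start, template):
    """Check whether the template matches the sentence items from index start.

    items: list of (text, label) pairs, where label is a POS label or 'NP'.
    template: list of (kind, value) pairs; kind 'label' compares the label,
    kind 'word' compares the surface text against a predetermined keyword.
    """
    if start + len(template) > len(items):
        return False
    for (text, label), (kind, value) in zip(items[start:], template):
        actual = text if kind == "word" else label
        if actual != value:
            return False
    return True

# Hypothetical segmented sentence with noun phrases already labeled 'NP'.
items = [("机器翻译", "NP"), ("的", "DEG"), ("质量", "NP"), ("提高", "VV")]
# Template: NP, then the keyword 的 (an auxiliary word), then NP.
template = [("label", "NP"), ("word", "的"), ("label", "NP")]

print(matches_at(items, 0, template))
print(matches_at(items, 1, template))
```

Tying the middle slot to a specific word rather than to a part-of-speech label makes the template stricter than a pure label sequence.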
Remark 9. The device for extracting a maximum noun phrase according to any one of Remarks 1-8, wherein the reference language is a language that does not require word segmentation.
Remark 10. The device for extracting a maximum noun phrase according to Remark 9, wherein the reference language is any one of the following languages:
English; French; and German.
Remark 11. The device for extracting a maximum noun phrase according to any one of Remarks 1-10, wherein the target language is a language that requires word segmentation.
Remark 12. The device for extracting a maximum noun phrase according to Remark 11, wherein the target language is any one of the following languages:
Chinese; and Japanese.
Remark 13. A method for extracting a maximum noun phrase, including:
Determining, in a reference language translation of a target language sentence to be processed, a reference language noun phrase that matches any one of at least one noun phrase template, wherein each noun phrase template includes at least one part-of-speech label of the reference language arranged in a corresponding predetermined order;
Assigning a noun phrase label to the target language noun phrase in the target language sentence that corresponds to the determined reference language noun phrase; and
Determining, in the target language sentence, a phrase that matches any one of at least one maximum noun phrase template to be a maximum noun phrase, wherein each maximum noun phrase template includes at least one part-of-speech label of the target language and/or at least one noun phrase label arranged in a corresponding predetermined order.
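The three steps of Remark 13 can be sketched end to end in Python. This is an illustration under stated assumptions, not the patented implementation: word alignment is assumed to be given as `(reference_index, target_index)` pairs, matching is greedy and prefers the longest template, and all function names, tags, and the example sentence pair are hypothetical.

```python
def find_spans(labels, templates):
    """Greedy left-to-right matching, trying longer templates first."""
    ordered = sorted(templates, key=len, reverse=True)
    spans, i = [], 0
    while i < len(labels):
        for t in ordered:
            if tuple(labels[i:i + len(t)]) == t:
                spans.append((i, i + len(t)))
                i += len(t)
                break
        else:
            i += 1
    return spans

def project(ref_span, alignment):
    """Map a reference span to the target span covered by its aligned tokens."""
    targets = [t for r, t in alignment if ref_span[0] <= r < ref_span[1]]
    return (min(targets), max(targets) + 1) if targets else None

def extract_max_nps(ref_pos, tgt_words, tgt_pos, alignment,
                    np_templates, max_np_templates):
    # Step 1: find reference language noun phrases via the POS templates.
    ref_nps = find_spans(ref_pos, np_templates)
    # Step 2: label the corresponding target language noun phrases.
    starts = dict(sorted(s for s in (project(r, alignment) for r in ref_nps) if s))
    labels, groups, i = [], [], 0
    while i < len(tgt_words):
        end = starts.get(i, i + 1)
        labels.append("NP" if i in starts else tgt_pos[i])
        groups.append((i, end))   # token range covered by this label
        i = end
    # Step 3: match maximum noun phrase templates on the label sequence.
    return ["".join(tgt_words[groups[a][0]:groups[b - 1][1]])
            for a, b in find_spans(labels, max_np_templates)]

# Hypothetical pair: English reference translation and segmented Chinese sentence.
ref_pos = ["DT", "NN", "IN", "NN", "NN", "VBD"]   # The quality of machine translation improved
tgt_words = ["机器", "翻译", "的", "质量", "提高", "了"]
tgt_pos = ["NN", "NN", "DEG", "NN", "VV", "AS"]
alignment = [(1, 3), (3, 0), (4, 1), (5, 4)]
result = extract_max_nps(ref_pos, tgt_words, tgt_pos, alignment,
                         {("DT", "NN"), ("NN", "NN")},
                         {("NP", "DEG", "NP"), ("NP",)})
print(result)  # prints ['机器翻译的质量']
```

Here the two English noun phrases are projected onto the Chinese tokens, the labeled sequence becomes `NP DEG NP VV AS`, and the longest matching template yields the maximum noun phrase.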
Remark 14. The method for extracting a maximum noun phrase according to Remark 13, wherein the noun phrase template is obtained as follows:
In a predetermined bilingual aligned corpus of the target language and the reference language, counting the part-of-speech labels contained in the reference language noun phrase corresponding to each target language noun phrase, so as to determine at least one reference language noun phrase template corresponding to the bilingual aligned corpus.
Remark 15. The method for extracting a maximum noun phrase according to Remark 14, wherein:
Each target language noun phrase in the bilingual aligned corpus is predetermined; and
Each reference language sentence in the bilingual aligned corpus has been part-of-speech tagged in advance.
Remark 16. The method for extracting a maximum noun phrase according to Remark 14 or 15, wherein the noun phrase template is obtained as follows:
Obtaining first candidate templates for the reference language noun phrase template from the part-of-speech labels contained in the reference language noun phrase that is aligned, in the corresponding reference language sentence, with each target language noun phrase; and
Counting the number of occurrences of each first candidate template, and determining each first candidate template whose number of occurrences is higher than a first predetermined threshold to be one of the at least one reference language noun phrase template.
Remark 17. An electronic device, including the device for extracting a maximum noun phrase according to any one of Remarks 1-12.
Remark 18. The electronic device according to Remark 17, wherein the electronic device is any one of the following devices:
A computer; a tablet computer; a personal digital assistant; a multimedia player; a mobile phone; and an e-book reader.
Remark 19. A program product storing machine-readable instruction code which, when executed, causes the machine to perform the method for extracting a maximum noun phrase according to any one of Remarks 12-16.
Remark 20. A computer-readable storage medium on which the program product according to Remark 19 is stored.

Claims (28)

1. A device for extracting a maximum noun phrase, including:
A noun phrase determining unit configured to determine, in a reference language translation of a target language sentence to be processed, a reference language noun phrase that matches any one of at least one noun phrase template, wherein each noun phrase template includes at least one part-of-speech label of the reference language arranged in a corresponding predetermined order;
A labeling unit configured to assign a noun phrase label to the target language noun phrase in the target language sentence that corresponds to the determined reference language noun phrase; and
A maximum noun phrase determining unit configured to determine, in the target language sentence, a phrase that matches any one of at least one maximum noun phrase template to be a maximum noun phrase, wherein each maximum noun phrase template includes at least one part-of-speech label of the target language and/or at least one noun phrase label arranged in a corresponding predetermined order.
2. The device for extracting a maximum noun phrase according to claim 1, further including:
A noun phrase template obtaining unit configured to count, in a predetermined bilingual aligned corpus of the target language and the reference language, the part-of-speech labels contained in the reference language noun phrase corresponding to each target language noun phrase, so as to determine at least one reference language noun phrase template corresponding to the bilingual aligned corpus.
3. The device for extracting a maximum noun phrase according to claim 2, wherein:
Each target language noun phrase in the bilingual aligned corpus is predetermined; and
Each reference language sentence in the bilingual aligned corpus has been part-of-speech tagged in advance.
4. The device for extracting a maximum noun phrase according to claim 2 or 3, wherein the noun phrase template obtaining unit includes:
A first determination subunit configured to obtain first candidate templates for the reference language noun phrase template from the part-of-speech labels contained in the reference language noun phrase that is aligned, in the corresponding reference language sentence, with each target language noun phrase; and
A second determination subunit configured to count the number of occurrences of each first candidate template and to determine each first candidate template whose number of occurrences is higher than a first predetermined threshold to be one of the at least one reference language noun phrase template.
5. The device for extracting a maximum noun phrase according to any one of claims 1-3, further including:
A maximum noun phrase template obtaining unit configured to count, for a target language corpus in which noun phrases and maximum noun phrases have been determined, at least the part-of-speech labels and noun phrase labels corresponding to each determined maximum noun phrase in the target language corpus, so as to determine at least one target language maximum noun phrase template corresponding to the target language corpus.
6. The device for extracting a maximum noun phrase according to claim 4, further including:
A maximum noun phrase template obtaining unit configured to count, for a target language corpus in which noun phrases and maximum noun phrases have been determined, at least the part-of-speech labels and noun phrase labels corresponding to each determined maximum noun phrase in the target language corpus, so as to determine at least one target language maximum noun phrase template corresponding to the target language corpus.
7. The device for extracting a maximum noun phrase according to claim 5, wherein the maximum noun phrase template obtaining unit includes:
A part-of-speech tagging subunit configured to perform part-of-speech tagging on each sentence in the target language corpus;
A label assigning subunit configured to assign a noun phrase label to each determined noun phrase in the target language corpus;
A third determination subunit configured to obtain second candidate templates for the target language maximum noun phrase template from the part-of-speech labels and noun phrase labels contained in each determined maximum noun phrase in the target language corpus; and
A fourth determination subunit configured to count the number of occurrences of each second candidate template and to determine each second candidate template whose number of occurrences is higher than a second predetermined threshold to be one of the at least one target language maximum noun phrase template.
8. The device for extracting a maximum noun phrase according to claim 6, wherein the maximum noun phrase template obtaining unit includes:
A part-of-speech tagging subunit configured to perform part-of-speech tagging on each sentence in the target language corpus;
A label assigning subunit configured to assign a noun phrase label to each determined noun phrase in the target language corpus;
A third determination subunit configured to obtain second candidate templates for the target language maximum noun phrase template from the part-of-speech labels and noun phrase labels contained in each determined maximum noun phrase in the target language corpus; and
A fourth determination subunit configured to count the number of occurrences of each second candidate template and to determine each second candidate template whose number of occurrences is higher than a second predetermined threshold to be one of the at least one target language maximum noun phrase template.
9. The device for extracting a maximum noun phrase according to any one of claims 1-3, wherein, in at least some of the maximum noun phrase templates, a predetermined keyword is present between some of the adjacent part-of-speech labels and/or noun phrase labels.
10. The device for extracting a maximum noun phrase according to claim 4, wherein, in at least some of the maximum noun phrase templates, a predetermined keyword is present between some of the adjacent part-of-speech labels and/or noun phrase labels.
11. The device for extracting a maximum noun phrase according to claim 5, wherein, in at least some of the maximum noun phrase templates, a predetermined keyword is present between some of the adjacent part-of-speech labels and/or noun phrase labels.
12. The device for extracting a maximum noun phrase according to claim 6, wherein, in at least some of the maximum noun phrase templates, a predetermined keyword is present between some of the adjacent part-of-speech labels and/or noun phrase labels.
13. The device for extracting a maximum noun phrase according to claim 7, wherein, in at least some of the maximum noun phrase templates, a predetermined keyword is present between some of the adjacent part-of-speech labels and/or noun phrase labels.
14. The device for extracting a maximum noun phrase according to claim 8, wherein, in at least some of the maximum noun phrase templates, a predetermined keyword is present between some of the adjacent part-of-speech labels and/or noun phrase labels.
15. The device for extracting a maximum noun phrase according to any one of claims 1-3, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
16. The device for extracting a maximum noun phrase according to claim 4, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
17. The device for extracting a maximum noun phrase according to claim 5, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
18. The device for extracting a maximum noun phrase according to claim 6, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
19. The device for extracting a maximum noun phrase according to claim 7, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
20. The device for extracting a maximum noun phrase according to claim 8, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
21. The device for extracting a maximum noun phrase according to claim 9, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
22. The device for extracting a maximum noun phrase according to claim 10, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
23. The device for extracting a maximum noun phrase according to claim 11, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
24. The device for extracting a maximum noun phrase according to claim 12, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
25. The device for extracting a maximum noun phrase according to claim 13, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
26. The device for extracting a maximum noun phrase according to claim 14, wherein the reference language is a language that does not require word segmentation, and the target language is a language that requires word segmentation.
27. A method for extracting a maximum noun phrase, including:
Determining, in a reference language translation of a target language sentence to be processed, a reference language noun phrase that matches any one of at least one noun phrase template, wherein each noun phrase template includes at least one part-of-speech label of the reference language arranged in a corresponding predetermined order;
Assigning a noun phrase label to the target language noun phrase in the target language sentence that corresponds to the determined reference language noun phrase; and
Determining, in the target language sentence, a phrase that matches any one of at least one maximum noun phrase template to be a maximum noun phrase, wherein each maximum noun phrase template includes at least one part-of-speech label of the target language and/or at least one noun phrase label arranged in a corresponding predetermined order.
28. An electronic device, including the device for extracting a maximum noun phrase according to any one of claims 1-26.
CN201310084666.5A 2013-03-15 2013-03-15 For extracting device, method and the electronic equipment of maximum noun phrase Expired - Fee Related CN104050156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310084666.5A CN104050156B (en) 2013-03-15 2013-03-15 For extracting device, method and the electronic equipment of maximum noun phrase

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310084666.5A CN104050156B (en) 2013-03-15 2013-03-15 For extracting device, method and the electronic equipment of maximum noun phrase

Publications (2)

Publication Number Publication Date
CN104050156A CN104050156A (en) 2014-09-17
CN104050156B true CN104050156B (en) 2017-03-01

Family

ID=51503010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310084666.5A Expired - Fee Related CN104050156B (en) 2013-03-15 2013-03-15 For extracting device, method and the electronic equipment of maximum noun phrase

Country Status (1)

Country Link
CN (1) CN104050156B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021234A (en) * 2016-05-31 2016-10-12 徐子涵 Label extraction method and system
CN107861952A (en) * 2017-09-25 2018-03-30 沈阳航空航天大学 Neural machine translation method based on Maximal noun phrase divide-and-conquer strategy
CN110532567A (en) * 2019-09-04 2019-12-03 北京百度网讯科技有限公司 Extracting method, device, electronic equipment and the storage medium of phrase

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271448A (en) * 2007-03-19 2008-09-24 株式会社东芝 Chinese language fundamental noun phrase recognition, its regulation generating method and apparatus
CN101751385A (en) * 2008-12-19 2010-06-23 华建机器翻译有限公司 Multilingual information extraction method adopting hierarchical pipeline filter system structure
CN102681981A (en) * 2011-03-11 2012-09-19 富士通株式会社 Natural language lexical analysis method, device and analyzer training method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
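Automatic alignment of noun phrases in a Chinese-English bilingual corpus; Liu Dongming et al.; Journal of Chinese Information Processing; 20030930; Vol. 17, No. 5; pp. 6-12 *
Automatic identification of Chinese maximal noun phrases; Zhou Qiang et al.; Journal of Software; 20001231; Vol. 11, No. 2; pp. 195-201 *
Automatic identification of Chinese maximal noun phrases combining statistics and rules; Dai Cui et al.; Journal of Chinese Information Processing; 20081130; Vol. 22, No. 6; pp. 110-115 *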

Also Published As

Publication number Publication date
CN104050156A (en) 2014-09-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170301

Termination date: 20180315