CN102375839A - Method and device for acquiring target data set from candidate data set, and translation machine - Google Patents

Method and device for acquiring target data set from candidate data set, and translation machine

Info

Publication number
CN102375839A
Authority
CN
China
Prior art keywords
target data
data
characteristic
candidate
candidate data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201010257678XA
Other languages
Chinese (zh)
Inventor
郑仲光
何中军
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201010257678XA
Publication of CN102375839A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a device for acquiring a target data set from a candidate data set, and a translation machine. The method comprises the following steps of: extracting characteristics from a target data sample; and extracting target data from the candidate data set by using the characteristics to form the target data set. According to the implementation mode, the target data can be extracted from the candidate data set according to the provided sample.

Description

Method and apparatus for obtaining a target data set from a candidate data set, and machine translator
Technical field
The present application relates to data extraction, and in particular to a method and an apparatus for obtaining a target data set from a candidate data set. The application further relates to a machine translator.
Background
Conventionally, specific target data are obtained from a candidate data set on the basis of a specific target data sample by manual selection, that is, by judging the similarity between the data in the candidate data set and the target data sample. In some cases, data are even picked at random from the candidate data set and used as target data. Clearly, such conventional approaches cannot provide high-quality target data.
Summary of the invention
A brief summary of the invention is given below in order to provide a basic understanding of some of its aspects. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description given later.
According to an embodiment of the present application, features are extracted from a target data sample, and target data are extracted from the candidate data set by using the features, thereby forming a target data set.
In this way, a subset is extracted from the candidate data set on the basis of the target data sample, forming a target data set for the specific purpose. The target data set is generated more quickly. In addition, the target data set better meets the requirements of subsequent processing.
Brief description of the drawings
The invention can be better understood by referring to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals denote the same or similar parts throughout. The drawings, together with the detailed description below, are included in and form part of this specification, and serve to further illustrate preferred embodiments of the invention and to explain its principles and advantages. In the drawings:
Fig. 1 is a flowchart of a method for obtaining a target data set from a candidate data set according to an embodiment of the invention,
Fig. 2 is a flowchart of a method for obtaining a target data set from a candidate data set according to another embodiment of the invention,
Fig. 3 is a flowchart of a method for obtaining a target data set from a candidate data set according to another embodiment of the invention,
Fig. 4 is a flowchart of a method for obtaining a target data set from a candidate data set according to another embodiment of the invention,
Fig. 5 is a schematic diagram of an apparatus for obtaining a target data set from a candidate data set according to an embodiment of the invention,
Fig. 6 is a schematic diagram of the extraction unit of an apparatus for obtaining a target data set from a candidate data set according to an embodiment of the invention,
Fig. 7 is a schematic diagram of the extraction unit of an apparatus for obtaining a target data set from a candidate data set according to another embodiment of the invention,
Fig. 8 is a schematic diagram of the extraction unit of an apparatus for obtaining a target data set from a candidate data set according to another embodiment of the invention, and
Fig. 9 is a schematic block diagram of a computer that can be used to implement embodiments of the invention.
Detailed description of the embodiments
Example embodiments of the invention are described below with reference to the accompanying drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in the specification. It should be understood, however, that many implementation-specific decisions may be made in developing any such actual implementation in order to achieve the developer's specific goals, and that these decisions may vary from one implementation to another.
It should also be noted that, to avoid obscuring the invention with unnecessary detail, the drawings show only the apparatus structures closely related to the solution according to the invention, and other details of little relevance are omitted.
First embodiment
Fig. 1 shows a flowchart of a method for obtaining a target data set from a candidate data set according to an embodiment of the present application. To obtain the target data set from the candidate data set, features are extracted from a target data sample in S110. The target data sample may comprise one or more data items, each of which comprises data elements. A data item may be a character string, a sentence or a set of pictures; correspondingly, a data element may be a character, a word or a picture. The features may be any features; as a non-limiting example, an extracted feature may consist of at least some of the data elements. For example, if the target data sample is a sentence, its data elements are the words that make up the sentence, and the extracted features are at least one word of the sentence. When several features are extracted from the target data sample, the weight of each feature is determined from the frequency with which the feature occurs in the target data sample: the higher the frequency, the higher the weight. The features with high weights are chosen as the features of the target data sample.
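The frequency-based weighting in S110 can be illustrated with a minimal Python sketch. It assumes a text sample split on whitespace, uses relative frequency as the weight, and keeps the top_k features; the function name extract_features and the parameter top_k are hypothetical and do not come from the application.

```python
from collections import Counter

def extract_features(sample_sentences, top_k=5):
    """Return the top_k most frequent words with relative-frequency weights.

    Assumption: words are whitespace-separated tokens; the application does
    not prescribe a tokenizer or a particular weighting formula.
    """
    words = [w.lower() for s in sample_sentences for w in s.split()]
    counts = Counter(words)
    total = sum(counts.values())
    # Higher frequency means higher weight; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: count / total for word, count in ranked[:top_k]}

sample = ["the motor drives the pump", "the pump controls the flow"]
features = extract_features(sample, top_k=2)
print(features)
```

In practice a stop-word filter would probably be applied first, since function words such as "the" dominate raw frequency counts; that step is omitted here for brevity.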
In S120, target data are extracted from the candidate data set by using the features, and the target data set is formed. The candidate data set may be a data set that contains the target data set, and may comprise pictures, text, corpora and so on. The target data set to be formed is a data set specific to the target data sample. In S120, the candidate data that have the features extracted from the target data sample in S110 are located in the candidate data set to form the target data set. Here the candidate data may likewise comprise data elements, which may again be characters, words or pictures.
With this method, relevant information (here, the target data) can for example be extracted from a large information base or collection of information (here, the candidate data set) according to information from a field the user is interested in (here, the target data sample), thereby forming customized information (here, the target data set). For example, suppose a user is interested in information on the computer field. An article related to the computer field can then be used as the sample to search the vast amount of information on the Internet for information related to that field, and the information closely related to the computer field is selected and provided to the user as the target data set.
Second embodiment
As can be seen from Fig. 2, this embodiment is a refinement of the embodiment shown in Fig. 1. For conciseness, descriptions of parts whose effect and function are identical to those in Fig. 1 are omitted.
In S110, features are extracted from the target data sample. S110 in this embodiment is identical to S110 in the embodiment shown in Fig. 1 and is not described again here.
In S130, the candidate data set is queried by using the features. The features of the target data sample extracted in S110 are used as information-retrieval keywords to query the candidate data in the candidate data set. Here a keyword may be a character, a word or a picture. In other words, each candidate data item in the candidate data set is compared against the extracted features to find the candidate data that have the features.
In S140, the target data are obtained according to the similarity between the candidate data returned by the query and the target data sample. The similarity may be determined by how many of the features a returned candidate data item contains: the more features it contains, the more similar it is to the target data sample, and vice versa. Alternatively, the frequency with which a feature occurs in the candidate data may serve as the basis for evaluating similarity: among several candidate data items that have the feature, the higher the frequency of the feature in an item, the higher its similarity, and vice versa. Several evaluation methods may also be combined into a comprehensive evaluation of similarity. In addition, the similarity may be determined by the similarity score produced by an information retrieval method.
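A minimal Python sketch of S130 and S140 follows. It assumes word-level features with weights, scores a candidate by the summed weight of the distinct features it contains, and keeps candidates above a threshold. The names query, similarity, select_targets and threshold are hypothetical; the application leaves the concrete similarity measure open.

```python
def query(candidates, features):
    """S130 sketch: return the candidates that contain at least one feature."""
    feature_words = set(features)
    return [text for text in candidates
            if feature_words & set(text.lower().split())]

def similarity(text, features):
    """S140 sketch: summed weight of the distinct features found in the text."""
    words = set(text.lower().split())
    return sum(weight for feature, weight in features.items() if feature in words)

def select_targets(candidates, features, threshold):
    """Rank the queried candidates by similarity and keep those above threshold."""
    scored = [(similarity(text, features), text) for text in query(candidates, features)]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [text for score, text in ranked if score >= threshold]

# Hypothetical features and candidate texts for illustration only.
features = {"pump": 0.5, "valve": 0.3, "motor": 0.2}
candidates = ["the pump and the valve", "a motor", "the weather is fine"]
targets = select_targets(candidates, features, threshold=0.5)
print(targets)
```

Replacing similarity with a count of feature occurrences, or with the score returned by a retrieval engine, would give the other evaluation variants the paragraph mentions.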
Third embodiment
As can be seen from Fig. 3, this embodiment is a refinement of the embodiment shown in Fig. 1. For conciseness, descriptions of parts whose effect and function are identical to those in Fig. 1 are omitted.
In S110, features are extracted from the target data sample. S110 in this embodiment is identical to S110 in the embodiment shown in Fig. 1 and is not described again here.
In S150, the candidate data in the candidate data set are clustered by using the features of the target data sample extracted in S110.
In S160, a suitable class is selected as the target data according to the similarity between the classes produced by the clustering and the target data sample. A suitable class is understood here as a class produced by the clustering that has a high similarity to the target data sample.
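S150 and S160 can be sketched as follows. The application does not name a clustering algorithm, so this sketch substitutes a simple single-pass ("leader") clustering over Jaccard similarity of feature sets purely for brevity; all function names and the threshold value are assumptions.

```python
def feature_set(text, features):
    """Reduce a candidate text to the set of sample features it contains."""
    return frozenset(w for w in text.lower().split() if w in features)

def jaccard(a, b):
    """Jaccard similarity; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster(candidates, features, threshold=0.5):
    """S150 sketch: join the first class whose leader is similar enough."""
    classes = []  # list of (leader_feature_set, member_texts)
    for text in candidates:
        fs = feature_set(text, features)
        for leader, members in classes:
            if jaccard(fs, leader) >= threshold:
                members.append(text)
                break
        else:
            classes.append((fs, [text]))
    return classes

def best_class(classes, sample_features):
    """S160 sketch: pick the class whose leader is most similar to the sample."""
    return max(classes, key=lambda c: jaccard(c[0], sample_features))[1]

# Hypothetical features and candidate texts for illustration only.
features = frozenset({"pump", "valve", "motor"})
candidates = ["pump valve leak", "valve and pump", "sunny weather", "motor noise"]
classes = cluster(candidates, features)
selected = best_class(classes, features)
print(selected)
```

Any standard clustering method (for example k-means over feature vectors) could take the place of the single-pass scheme; only the final step of ranking classes by similarity to the sample is taken from the text.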
Fourth embodiment
As can be seen from Fig. 4, the embodiment shown in Fig. 4 combines the second embodiment and the third embodiment.
In S110, features are extracted from the target data sample.
In S130, the candidate data set is queried by using the features, and in S170 the similarity between the candidate data returned by the query and the target data sample is determined.
In S150, the candidate data in the candidate data set are clustered by using the features of the target data sample extracted in S110, and in S180 the similarity between the classes produced by the clustering and the target data sample is determined. For conciseness, detailed descriptions of the parts of Fig. 4 whose effect and function are identical to those in Fig. 2 and Fig. 3 are omitted; see the descriptions of Fig. 2 and Fig. 3 for details.
In S190, the similarity between the classes produced by the clustering and the target data sample is compared with the similarity between the candidate data returned by the query and the target data sample, and suitable candidate data are selected as the target data according to the comparison result. That is, the similarities to the target data sample of the candidate data extracted as in Fig. 2 and as in Fig. 3 are compared, and the candidate data with high similarity are chosen as the target data. This combines the advantages of the two methods and thus provides more accurate target data.
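The comparison in S190 is described only in general terms. One possible reading, sketched below in Python, is to give each candidate the higher of its retrieval score and the score of the class it belongs to, and then keep the best-scoring candidates. The function combine, the parameter keep and the score dictionaries are assumptions made for illustration.

```python
def combine(retrieval_scores, cluster_scores, keep=2):
    """S190 sketch: per candidate, take the higher of the two similarity
    scores, then keep the `keep` highest-scoring candidates."""
    combined = {}
    for cand in set(retrieval_scores) | set(cluster_scores):
        combined[cand] = max(retrieval_scores.get(cand, 0.0),
                             cluster_scores.get(cand, 0.0))
    # Sort by descending score; ties are broken by candidate name.
    ranked = sorted(combined, key=lambda c: (-combined[c], c))
    return ranked[:keep]

# Hypothetical scores: candidate id -> similarity to the target data sample.
retrieval_scores = {"a": 0.9, "b": 0.2}
cluster_scores = {"b": 0.7, "c": 0.4}
chosen = combine(retrieval_scores, cluster_scores, keep=2)
print(chosen)
```

Other combinations, such as averaging the two scores or requiring both to exceed a threshold, would be equally consistent with the description.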
Fifth embodiment
In this embodiment, the method for obtaining a target data set from a candidate data set according to the embodiment of the present application is applied, by way of example, to a target data sample in text form. In this case, features may be extracted from the target data sample by means of interval n-gram phrases, n-gram phrases or a vocabulary. Correspondingly, the candidate data set may comprise a plurality of texts, each of which can be regarded as a set of sentences. When the target data sample is a single sentence, it can therefore be regarded as a text comprising only one sentence. The description here uses texts merely as an example; the method according to the embodiment of the present application applies equally to a target data sample consisting of a single sentence.
In an embodiment that queries with interval n-gram phrases or n-gram phrases, a window of size S is applied to the target data sample, so that the target data sample is represented as $[w_1, w_2, \ldots, w_S]$. An n-gram phrase generated from the target data sample is written $P(w_1, w_2, \ldots, w_S)$, where $w_i$ is the i-th word in the window, $|P| \le n$, and n is a natural number. The words of a phrase may be contiguous or non-contiguous: a contiguous phrase is an n-gram phrase, and a non-contiguous phrase is an interval n-gram phrase. For example, if the sentence "promote the stable development of the national economy" is represented with interval n-gram phrases and n = 3, the phrases obtained are "promote", "promote national", "promote national economy", "promote country", "promote national stability", "promote economic", "promote economy" and "promote economic stability". The candidate data set is queried with the n-gram phrases extracted from the target data sample, and the candidate data are determined according to a comprehensive evaluation, thereby forming the target data.
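The difference between n-gram phrases and interval n-gram phrases can be made concrete with a short Python sketch. It assumes the window is already given as a list of words; interval_ngrams returns every order-preserving selection of up to n words, so its output contains the contiguous phrases as well as the non-contiguous ones. Both function names are hypothetical.

```python
from itertools import combinations

def ngrams(window, n):
    """Contiguous phrases of 1..n words taken from the window."""
    return [tuple(window[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(window) - k + 1)]

def interval_ngrams(window, n):
    """Order-preserving phrases of 1..n words; the words need not be adjacent."""
    return [combo
            for k in range(1, n + 1)
            for combo in combinations(window, k)]

window = ["a", "b", "c"]  # a window of size S = 3
print(ngrams(window, 2))
print(interval_ngrams(window, 2))
```

For longer texts the window would be slid over the word sequence, for example with window = words[i:i + S] for each start position i, and the resulting phrases merged into one feature set.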
In an embodiment that queries with a vocabulary, a vocabulary is formed from the target data sample. All words of the target data sample are sorted in descending order of the frequency with which they occur in the sample. The candidate data set can then be queried with the group of highest-frequency words, or alternatively with the group of medium-frequency words, and the candidate data are determined according to similarity, thereby forming the target data. Here the similarity may, for example, be determined by how many of the highest-frequency words an item contains.
Likewise, the candidate data set may be clustered by means of interval n-gram phrases, n-gram phrases or a vocabulary formed from the target data sample. For example, in an embodiment that clusters the candidate data with interval n-gram phrases, a feature set F in the form of n-gram phrases is formed from the target data sample. Every data item in the candidate data set is converted into a feature vector $V_s = \langle f_1{:}w_1, f_2{:}w_2, \ldots, f_m{:}w_m \rangle$, where $f_i$ is an n-gram phrase found in F, $w_i$ is the corresponding weight, and i is a natural number. Preferably, the weight of a feature is determined from the frequency with which the feature occurs in the target data sample. All vectors together form a feature matrix. Clustering is performed, and the top N classes are selected from the clustering result according to similarity to obtain the target data, where N is a natural number that can be chosen empirically.
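The conversion of candidate data into feature vectors can be sketched in Python as follows. For simplicity the sketch matches only contiguous phrases on word boundaries and uses a fixed weight per feature; the feature set, the weights and the function name to_vector are assumptions for illustration.

```python
def to_vector(text, feature_weights):
    """One entry per feature in F: its weight if the phrase occurs in the
    text (matched on word boundaries), otherwise 0.0."""
    padded = f" {' '.join(text.lower().split())} "
    return [weight if f" {phrase} " in padded else 0.0
            for phrase, weight in feature_weights.items()]

# Hypothetical feature set F with weights taken from the target data sample.
feature_weights = {"pump": 0.5, "control valve": 0.3, "motor": 0.2}
candidates = ["the control valve and pump", "motor"]

# The feature matrix has one row per candidate and one column per feature.
matrix = [to_vector(text, feature_weights) for text in candidates]
print(matrix)
```

The rows of the matrix can then be passed to whichever clustering method is chosen, and the N classes closest to the sample's own vector are kept.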
An embodiment that clusters the candidate data with a vocabulary is similar to the above embodiment that clusters the candidate data with interval n-gram phrases, and is not described again here.
Furthermore, in an embodiment in which the method for obtaining a target data set from a candidate data set according to the embodiment of the present application is applied to the field of translation, the candidate data set is a translation corpus comprising mutually corresponding corpus material in at least two languages. Here "mutually corresponding" means that the material in the two languages is a translation of each other. The term "corpus material" is understood here as terms, sentences or articles. The target data set is a subset extracted from the translation corpus for a specific purpose. The target data sample is a text, prepared for that specific purpose, in at least one of the at least two languages, for example an article or a sentence written in one of the languages.
By way of example, the candidate data set is an English-Chinese corpus for the field of control, which covers electromechanical control, electrical control, mechanical control and so on, while the target data sample relates only to English material from the field of electrical control. According to the method of the embodiment of the present application, features can be extracted from the target data sample (here, the English material from the electrical control field, or alternatively Chinese material), and target data can be extracted from the candidate data set (here, the English-Chinese control corpus) by using the features, thereby forming the target data set (here, an English-Chinese corpus for electrical control). The n-gram phrases, interval n-gram phrases or vocabulary mentioned above can be formed from the English (or Chinese) electrical control material and used to query the English-Chinese control corpus, to cluster it, or to do both in combination. The candidate data are determined according to the similarity between the candidate data (here, the English or Chinese side of the English-Chinese control corpus) and the target data sample (the corresponding English or Chinese material), and the target data set, namely an English-Chinese bilingual corpus for electrical control, is formed. A corpus for a specific purpose is thus obtained.
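Applied to a parallel corpus, the extraction amounts to scoring one language side against the sample features and keeping the whole sentence pair. The Python sketch below assumes English-side word features and placeholder strings for the Chinese side; the function name filter_parallel_corpus and the parameter min_hits are hypothetical.

```python
def filter_parallel_corpus(pairs, features, min_hits=1):
    """Keep (english, chinese) pairs whose English side contains at least
    min_hits of the sample features; the aligned Chinese side follows along."""
    selected = []
    for english, chinese in pairs:
        words = set(english.lower().split())
        if len(words & features) >= min_hits:
            selected.append((english, chinese))
    return selected

# Hypothetical control-field corpus; the Chinese side is shown as placeholders.
pairs = [
    ("the relay closes the circuit", "<Chinese translation 1>"),
    ("the gear drives the shaft", "<Chinese translation 2>"),
]
# Hypothetical features extracted from an electrical-control sample.
features = {"relay", "circuit", "voltage"}

electrical_corpus = filter_parallel_corpus(pairs, features)
print(electrical_corpus)
```

Because the pairs are aligned, matching on either language side is enough to carry both sides into the target data set.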
Of course, those skilled in the art will understand that the candidate data set may also comprise other language pairs, for example German-English or English-French. The target data sample and the candidate data set may also cover a broader scope and are not confined to technical fields; an ordinary dictionary is one example.
Similarly, the method for obtaining a target data set from a candidate data set according to the embodiment of the present application can also be applied to specialized dictionaries, to form a target data set that explains idioms, set phrases and technical terms. The difference from the application in the translation field is that the candidate data set comprises a term subset and an explanatory corpus. The rest of the processing is the same as in the translation field and is not described again here.
It should also be noted that the method of the embodiment of the present application has been described here using only corpora for translation and explanation as examples. Those skilled in the art should understand, however, that the embodiment is not limited to the above text-processing fields, but can be applied to any field in which the most relevant data are to be selected from a candidate data set according to the features of the target data. For example, if the target data sample is an image, features are extracted from the image, and images having the extracted features are extracted from the candidate data set, here a picture library; several images may be extracted.
Seventh embodiment
Fig. 5 shows a schematic diagram of an apparatus for obtaining a target data set from a candidate data set according to an embodiment of the present application. The apparatus has a feature extraction unit 510 for extracting features from a target data sample, and an extraction unit 520 for extracting target data from the candidate data set by using the features to form the target data set. The feature extraction unit 510 supplies the extracted features to the extraction unit 520, so that the latter can extract target data from the candidate data set to form the target data set. As before, the target data sample may comprise one or more data items, each of which comprises data elements. A data item may be a character string, a sentence or a set of pictures; correspondingly, a data element may be a character, a word or a picture. The features may be any features. As a non-limiting example, an extracted feature may consist of at least some of the data elements. For example, if the target data sample is a sentence, its data elements are the words that make up the sentence, and the extracted features are at least one word of the sentence. When several features are extracted from the target data sample, the weight of each feature is determined from the frequency with which the feature occurs in the target data sample: the higher the frequency, the higher the weight. The features with high weights are chosen as the features of the target data sample. The details of how the apparatus operates correspond to the flowcharts of the method described with reference to Fig. 1 to Fig. 4 and are not repeated here.
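The division of labour between the two units can be sketched as two small Python classes wired together. The class names mirror units 510 and 520, but the method names, the top_k parameter and the word-overlap extraction rule are assumptions, not the apparatus as claimed.

```python
from collections import Counter

class FeatureExtractionUnit:
    """Sketch of unit 510: extracts the most frequent words of the sample."""
    def __init__(self, top_k=3):
        self.top_k = top_k

    def extract(self, sample):
        counts = Counter(sample.lower().split())
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return {word for word, _ in ranked[:self.top_k]}

class ExtractionUnit:
    """Sketch of unit 520: keeps candidates that contain any feature."""
    def extract(self, candidates, features):
        return [c for c in candidates if features & set(c.lower().split())]

class Apparatus:
    """Unit 510 supplies its features to unit 520, as in Fig. 5."""
    def __init__(self, top_k=3):
        self.feature_unit = FeatureExtractionUnit(top_k)
        self.extraction_unit = ExtractionUnit()

    def run(self, sample, candidates):
        features = self.feature_unit.extract(sample)
        return self.extraction_unit.extract(candidates, features)

# Hypothetical sample and candidates for illustration only.
apparatus = Apparatus(top_k=2)
target_set = apparatus.run("pump pump valve valve motor",
                           ["a pump", "a motor", "valve stuck"])
print(target_set)
```

Swapping ExtractionUnit for a class that queries, clusters, or does both would correspond to the variants of the extraction unit shown in Fig. 6 to Fig. 8.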
In another embodiment, the only difference from the above embodiment is that the extraction unit 520 further comprises a query unit 521 and a generation unit 522, as shown in Fig. 6. That is, the extraction unit 520 in this embodiment extracts candidate data and generates the target data set by means of information retrieval. The query unit 521 queries the candidate data set by using the features. The generation unit 522 obtains the target data according to the similarity between the candidate data returned by the query unit 521 and the target data sample. For the detailed operation of the query unit 521 and the generation unit 522, see the method embodiments described above; it is not repeated here.
In another embodiment, the only difference from the above embodiment is that the extraction unit 520 comprises a clustering unit 523 and a generation unit 522, as shown in Fig. 7. That is, the extraction unit 520 in this embodiment extracts candidate data and generates the target data set by means of clustering. The clustering unit 523 clusters the data in the candidate data set by using the features. The generation unit 522 selects a suitable class as the target data according to the similarity between the classes produced by the clustering and the target data sample. For the detailed operation of the clustering unit 523 and the generation unit 522, see the method embodiments described above; it is not repeated here.
Furthermore, in another embodiment, the extraction unit 520 comprises a query unit 521, a clustering unit 523, a comparison unit 524 and a generation unit 522, as shown in Fig. 8. The extraction unit 520 in this embodiment extracts candidate data by both information retrieval and clustering, compares the similarities to the target data sample of the candidate data obtained in these two ways, and generates the target data set; in other words, it combines the information retrieval and clustering approaches described above. In this embodiment, the clustering unit 523 clusters the candidate data in the candidate data set by using the features. The comparison unit 524 compares the similarity between the classes produced by the clustering and the target data sample with the similarity between the candidate data returned by the query and the target data sample. The generation unit 522 is configured to select suitable candidate data as the target data according to the comparison result.
In one embodiment, the apparatus according to the embodiment of the present application is applied to the field of translation. In this case, the candidate data set is a translation corpus comprising mutually corresponding corpus material in at least two languages, the target data set is a subset extracted from the translation corpus for a specific purpose, and the target data sample is a text, prepared for that specific purpose, in at least one of the at least two languages. The details of the operation of the apparatus are the same as the process described in the fifth embodiment and are not repeated here.
An embodiment of the present application also proposes a machine translator having a corpus database that comprises mutually corresponding corpus material in at least two languages, the corpus database comprising a target data set obtained by the method according to the embodiment of the present application.
In addition, an embodiment of the present application also proposes a machine translator having the apparatus according to the embodiment of the present application.
The feature extraction unit and the extraction unit of the apparatus for obtaining a target data set from a candidate data set can be configured by software, firmware, hardware or a combination thereof. The specific means or manner of configuration are well known to those skilled in the art and are not described here. When implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 900 shown in Fig. 9), which can perform various functions when the various programs are installed.
In Fig. 9, a central processing unit (CPU) 901 performs various kinds of processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores, as required, the data needed when the CPU 901 performs the various kinds of processing. The CPU 901, the ROM 902 and the RAM 903 are connected to one another via a bus 904. An input/output interface 905 is also connected to the bus 904.
The following components are connected to the input/output interface 905: an input section 906 (including a keyboard, a mouse and the like), an output section 907 (including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a loudspeaker and the like), the storage section 908 (including a hard disk and the like), and a communication section 909 (including a network interface card such as a LAN card, a modem and the like). The communication section 909 performs communication processing via a network such as the Internet. A drive 910 may also be connected to the input/output interface 905 as required. A removable medium 911 such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory is mounted on the drive 910 as required, so that a computer program read from it is installed into the storage section 908 as required.
When the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 911.
Those skilled in the art will understand that the storage medium is not limited to the removable medium 911 shown in Fig. 9, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 911 include magnetic disks (including floppy disks (registered trademark)), optical discs (including compact disc read-only memory (CD-ROM) and digital versatile discs (DVD)), magneto-optical discs (including MiniDisc (MD) (registered trademark)) and semiconductor memory. Alternatively, the storage medium may be the ROM 902, a hard disk contained in the storage section 908, or the like, in which the program is stored and which is distributed to the user together with the device that contains it.
The invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above method according to the embodiments of the invention can be carried out.
Correspondingly, a storage medium carrying the above program product that stores machine-readable instruction code is also included in the disclosure of the invention. The storage medium includes, but is not limited to, floppy disks, optical discs, magneto-optical discs, memory cards, memory sticks and the like.
Finally, it should be noted that the terms "comprise", "include" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements comprises not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Moreover, in the absence of further restrictions, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises it.
Although embodiments of the invention have been described in detail above with reference to the accompanying drawings, it should be understood that the embodiments described above merely serve to illustrate the invention and do not limit it. Those skilled in the art can make various modifications and changes to the above embodiments without departing from the spirit and scope of the invention. The scope of the invention is therefore defined only by the appended claims and their equivalents.
As can be seen from the above description, the embodiments of the present invention provide the following schemes:
Remark 1. A method for obtaining a target data set from a candidate data set, comprising:
extracting features from a target data sample; and
using the features to extract target data from the candidate data set to form the target data set.
Remark 2. The method according to Remark 1, wherein
extracting the target data from the candidate data set comprises: querying the candidate data set using the features, and obtaining the target data according to the similarity between the retrieved candidate data and the target data sample.
Remark 3. The method according to Remark 1, wherein
extracting the target data from the candidate data set comprises: clustering the candidate data in the candidate data set using the features, and selecting a suitable class as the target data according to the similarity between the classes produced by the clustering and the target data sample.
Remark 4. The method according to Remark 2, wherein
extracting the target data from the candidate data set comprises: clustering the data in the candidate data set using the features; comparing the similarity between the classes produced by the clustering and the target data sample with the similarity between the retrieved candidate data and the target data sample; and selecting suitable candidate data as the target data according to the comparison result.
Remark 5. The method according to any one of Remarks 1 to 4, wherein
each item of the data comprises data elements, and the features are formed from at least a part of the data elements.
Remark 6. The method according to any one of Remarks 1 to 4, wherein
the weight of a feature is determined based on the frequency with which the feature occurs in the target data sample.
Remark 7. The method according to any one of Remarks 1 to 4, wherein the data elements are characters, words, or pictures and, correspondingly, the data comprise character strings, sentences, or picture sets.
Remark 8. The method according to any one of Remarks 1 to 4, wherein the candidate data set is a translation corpus comprising mutually corresponding language material in at least two languages, the target data set is a subset extracted from the translation corpus for a specific purpose, and the target data sample is text, prepared for that specific purpose, in at least one of the at least two languages.
Remark 9. The method according to Remark 8, wherein the features are n-gram phrases, interval n-gram phrases, or words.
Remark 10. A device for obtaining a target data set from a candidate data set, comprising:
a feature extraction unit for extracting features from a target data sample; and
an extraction unit for using the features to extract target data from the candidate data set to form the target data set.
Remark 11. The device according to Remark 10, wherein
the extraction unit comprises a query unit and a generation unit; the query unit queries the candidate data set using the features, and the generation unit obtains the target data according to the similarity between the candidate data retrieved by the query unit and the target data sample.
Remark 12. The device according to Remark 10, wherein
the extraction unit comprises a clustering unit and a generation unit; the clustering unit clusters the data in the candidate data set using the features, and the generation unit selects a suitable class as the target data according to the similarity between the classes produced by the clustering and the target data sample.
Remark 13. The device according to Remark 11, wherein
the extraction unit further comprises a clustering unit and a comparison unit; the clustering unit clusters the candidate data in the candidate data set using the features; the comparison unit compares the similarity between the classes produced by the clustering and the target data sample with the similarity between the retrieved candidate data and the target data sample; and the generation unit is configured to select suitable candidate data as the target data according to the comparison result.
Remark 14. The device according to any one of Remarks 10 to 13, wherein
each item of the data comprises data elements, and the features are formed from at least a part of the data elements.
Remark 15. The device according to any one of Remarks 10 to 13, wherein
the feature extraction unit is configured to determine the weight of a feature based on the frequency with which the feature occurs in the target data sample.
Remark 16. The device according to any one of Remarks 10 to 13, wherein the data elements are characters, words, or pictures and, correspondingly, the data comprise character strings, sentences, or picture sets.
Remark 17. The device according to any one of Remarks 10 to 13, wherein the candidate data set is a translation corpus comprising mutually corresponding language material in at least two languages, the target data set is a subset extracted from the translation corpus for a specific purpose, and the target data sample is text, prepared for that specific purpose, in at least one of the at least two languages.
Remark 18. The device according to Remark 17, wherein the features are n-gram phrases, interval n-gram phrases, or words.
Remark 19. A translation machine having a corpus database that comprises mutually corresponding language material in at least two languages, the corpus database comprising a target data set obtained by the method according to any one of Remarks 1 to 9.
Remark 20. A translation machine having the device according to any one of Remarks 10 to 18.
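
As an illustration only, the query-based scheme summarized in the remarks (features drawn from a target data sample, weights based on feature frequency, selection by similarity) might be sketched in Python as follows. The function names, the whitespace tokenization, the relative-frequency weighting, the summed-weight similarity, and the threshold value are all assumptions made for this sketch; the text above does not fix any of them.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(sample_sentences, max_n=2):
    """Weight each n-gram by its relative frequency in the target data sample.

    Relative frequency is an assumed weighting; the remarks only state that
    the weight is based on the frequency of the feature in the sample.
    """
    counts = Counter()
    for sentence in sample_sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def similarity(sentence, features, max_n=2):
    """Sum the weights of the distinct sample features found in a candidate.

    This overlap score is a hypothetical stand-in for the similarity measure.
    """
    tokens = sentence.lower().split()
    found = set()
    for n in range(1, max_n + 1):
        found.update(g for g in ngrams(tokens, n) if g in features)
    return sum(features[g] for g in found)


def select_target_data(candidates, features, threshold):
    """Keep the candidates whose similarity to the sample reaches the threshold."""
    return [c for c in candidates if similarity(c, features) >= threshold]


# Toy data, invented for this example.
sample = ["the patent claims a translation method",
          "the claims define the scope of the patent"]
candidates = ["the patent claims are listed below",
              "the weather is sunny today",
              "a translation method is claimed in the patent"]

features = extract_features(sample)
target_set = select_target_data(candidates, features, threshold=0.2)
```

With these toy inputs, the sentence about the weather shares only the word "the" with the sample, so it falls below the assumed threshold and is left out of the target data set, while the two patent-related sentences are kept.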

Claims (10)

1. A method for obtaining a target data set from a candidate data set, comprising:
extracting features from a target data sample; and
using the features to extract target data from the candidate data set to form the target data set.
2. The method according to claim 1, wherein
extracting the target data from the candidate data set comprises: querying the candidate data set using the features, and obtaining the target data according to the similarity between the retrieved candidate data and the target data sample.
3. The method according to claim 1, wherein
extracting the target data from the candidate data set comprises: clustering the candidate data in the candidate data set using the features, and selecting a suitable class as the target data according to the similarity between the classes produced by the clustering and the target data sample.
4. A device for obtaining a target data set from a candidate data set, comprising:
a feature extraction unit for extracting features from a target data sample; and
an extraction unit for using the features to extract target data from the candidate data set to form the target data set.
5. The device according to claim 4, wherein
the extraction unit comprises a query unit and a generation unit; the query unit queries the candidate data set using the features, and the generation unit obtains the target data according to the similarity between the candidate data retrieved by the query unit and the target data sample.
6. The device according to claim 4, wherein
the extraction unit comprises a clustering unit and a generation unit; the clustering unit clusters the data in the candidate data set using the features, and the generation unit selects a suitable class as the target data according to the similarity between the classes produced by the clustering and the target data sample.
7. The device according to claim 5, wherein
the extraction unit further comprises a clustering unit and a comparison unit; the clustering unit clusters the candidate data in the candidate data set using the features; the comparison unit compares the similarity between the classes produced by the clustering and the target data sample with the similarity between the retrieved candidate data and the target data sample; and the generation unit is configured to select suitable candidate data as the target data according to the comparison result.
8. The device according to any one of claims 4 to 7, wherein
each item of the data comprises data elements, and the features are formed from at least a part of the data elements.
9. A translation machine having a corpus database that comprises mutually corresponding language material in at least two languages, the corpus database comprising a target data set obtained by the method according to any one of claims 1 to 3.
10. A translation machine having the device according to any one of claims 4 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010257678XA CN102375839A (en) 2010-08-17 2010-08-17 Method and device for acquiring target data set from candidate data set, and translation machine

Publications (1)

Publication Number Publication Date
CN102375839A true CN102375839A (en) 2012-03-14

Family

ID=45794462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010257678XA Pending CN102375839A (en) 2010-08-17 2010-08-17 Method and device for acquiring target data set from candidate data set, and translation machine

Country Status (1)

Country Link
CN (1) CN102375839A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040068493A1 (en) * 2002-10-04 2004-04-08 International Business Machines Corporation Data retrieval method, system and program product
CN1801140A (en) * 2004-12-30 2006-07-12 中国科学院自动化研究所 Method and apparatus for automatic acquisition of machine translation template
CN101008943A (en) * 2006-01-23 2007-08-01 富士施乐株式会社 Word alignment apparatus, example sentence bilingual dictionary, word alignment method, and program product for word alignment
CN101571852A (en) * 2008-04-28 2009-11-04 富士通株式会社 Dictionary generating device and information retrieving device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832305A (en) * 2020-07-03 2020-10-27 广州小鹏车联网科技有限公司 User intention identification method, device, server and medium
CN111832305B (en) * 2020-07-03 2023-08-25 北京小鹏汽车有限公司 User intention recognition method, device, server and medium
CN114781409A (en) * 2022-05-12 2022-07-22 北京百度网讯科技有限公司 Text translation method and device, electronic equipment and storage medium
CN114781409B (en) * 2022-05-12 2023-12-01 北京百度网讯科技有限公司 Text translation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120314