CN103729445B - The acquisition methods and device of vocabulary translation - Google Patents

The acquisition methods and device of vocabulary translation

Info

Publication number
CN103729445B
Authority
CN
China
Prior art keywords
vocabulary
translated
translation
search results
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310745535.7A
Other languages
Chinese (zh)
Other versions
CN103729445A (en)
Inventor
王海峰
吴华
刘占一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310745535.7A
Publication of CN103729445A
Application granted
Publication of CN103729445B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3337Translation of the query language, e.g. Chinese to English
    • G06F16/3338Query expansion

Abstract

The present invention provides a method and device for acquiring the translation of a vocabulary. The method includes: obtaining the vocabulary to be translated and generating first search results according to it; extracting, from the first search results, at least one associated entity vocabulary related to the vocabulary to be translated; generating a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary; searching according to the search condition to obtain second search results; and extracting the translation corresponding to the vocabulary to be translated from the second search results. By extracting associated entity vocabulary from the first search results, using it to build the search condition that yields the second search results, and then extracting the translation from those results, the method can quickly obtain translations of neologisms. It is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience.

Description

The acquisition methods and device of vocabulary translation
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for acquiring the translation of a vocabulary.
Background
With the development of the Internet, people are no longer satisfied with obtaining information from material in a single language and increasingly want to obtain information from data in other languages, so cross-language information acquisition has to be realized through automatic translation by computer systems. Current machine translation systems can meet basic reading needs. They mainly mine mutual-translation word pairs from Chinese web pages in which the two languages follow a specific layout (for example, the English must appear in parentheses, adjacent to its Chinese translation). For example, in the text "... the survey was carried out by the Economic Mobility Project, commissioned by the nonpartisan survey organization 皮尤基金会 (Pew Charitable Trusts) ...", a machine translation system can obtain the translation "Pew Charitable Trusts" for "皮尤基金会".
However, vocabulary arising from breaking news events or hot news is, first, not included in existing translation dictionaries and, second, difficult to translate correctly by automatic translation methods, so translation accuracy is low. In addition, such vocabulary generally has to be translated by professional translators in light of the news background, which is labor-intensive, inconvenient, not intelligent, and gives a poor user experience.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, a first object of the present invention is to propose a method for acquiring the translation of a vocabulary. The method can quickly obtain translations of neologisms; it is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience.
A second object of the present invention is to propose a device for acquiring the translation of a vocabulary.
To achieve the above objects, the method for acquiring the translation of a vocabulary according to embodiments of the first aspect of the present invention includes the following steps: obtaining the vocabulary to be translated and generating first search results according to the vocabulary to be translated; extracting, from the first search results and according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated, wherein the vocabulary to be translated and the at least one associated entity vocabulary belong to a first language; generating a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein the search condition belongs to a second language; searching according to the search condition to obtain second search results; and extracting the translation corresponding to the vocabulary to be translated from the second search results.
In the method of the embodiments of the present invention, associated entity vocabulary is extracted from the first search results generated according to the vocabulary to be translated, the second search results are obtained with the search condition generated from the associated entity vocabulary, and the translation corresponding to the vocabulary to be translated is finally extracted from the second search results. Translations of neologisms can thus be obtained quickly; the method is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience. In addition, the multilingual web pages related to a neologism that a search engine retrieves are timely, so the translations obtained are also highly timely.
To achieve the above objects, the device for acquiring the translation of a vocabulary according to embodiments of the second aspect of the present invention includes: a to-be-translated vocabulary acquisition module, for obtaining the vocabulary to be translated; a first search module, for generating first search results according to the vocabulary to be translated; an extraction module, for extracting, from the first search results and according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated, wherein the vocabulary to be translated and the at least one associated entity vocabulary belong to a first language; a search condition generation module, for generating a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein the search condition belongs to a second language; a second search module, for searching according to the search condition to obtain second search results; and a translation extraction module, for extracting the translation corresponding to the vocabulary to be translated from the second search results.
In the device of the embodiments of the present invention, associated entity vocabulary is extracted from the first search results generated according to the vocabulary to be translated, the second search results are obtained with the search condition generated from the associated entity vocabulary, and the translation corresponding to the vocabulary to be translated is finally extracted from the second search results. Translations of neologisms can thus be obtained quickly; the device is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments taken with the accompanying drawings, in which:
Fig. 1 is a flow chart of the method for acquiring the translation of a vocabulary according to an embodiment of the invention;
Fig. 2 is a flow chart of extracting at least one associated entity vocabulary related to the vocabulary to be translated according to an embodiment of the invention;
Fig. 3 is a flow chart of generating the search condition for the vocabulary to be translated from at least one associated entity vocabulary according to an embodiment of the invention;
Fig. 4 is a flow chart of the method for acquiring the translation of a vocabulary according to another embodiment of the invention;
Fig. 5 is a flow chart of detecting web page similarity according to an embodiment of the invention;
Fig. 6 is a flow chart of performing translation detection on the vocabulary to be translated and its corresponding translation according to an embodiment of the invention;
Fig. 7 is a structural diagram of the device for acquiring the translation of a vocabulary according to an embodiment of the invention;
Fig. 8 is a structural diagram of the device for acquiring the translation of a vocabulary according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it. On the contrary, the embodiments of the invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
In the description of the present invention, the terms "first", "second" and the like are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. Unless otherwise clearly specified and limited, the terms "connected" and "connection" are to be interpreted broadly: a connection may be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediary. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances. In addition, in the description of the present invention, unless otherwise stated, "multiple" means two or more.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, fragment or portion of code that includes one or more executable instructions for implementing a specific logical function or step of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The method and device for acquiring the translation of a vocabulary according to the embodiments of the present invention are described below with reference to the drawings.
Fig. 1 is the flow chart of the acquisition methods of vocabulary translation according to an embodiment of the invention.
As shown in Fig. 1, the method for acquiring the translation of a vocabulary includes the following steps:
S101, obtain the vocabulary to be translated, and generate first search results according to the vocabulary to be translated.
In an embodiment of the present invention, the vocabulary to be translated may be a neologism from a breaking news event or hot news, a newly popular expression, or any other vocabulary not included in existing translation dictionaries. The user can enter the vocabulary to be translated into a search engine and run a search to generate the first search results. For example, searching for the vocabulary to be translated "aunt" (大妈 in the original Chinese) returns multiple web pages related to "aunt"; these web pages and their content are the first search results.
S102, extract from the first search results, according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated. The vocabulary to be translated and the at least one associated entity vocabulary belong to a first language.
In an embodiment of the present invention, as shown in Fig. 2, extracting the at least one associated entity vocabulary from the first search results specifically includes the following steps:
S201, obtain at least one source-language web page from the first search results, wherein the at least one source-language web page belongs to the first language.
In an embodiment of the present invention, the first search results generated according to the vocabulary to be translated may contain multiple source-language web pages related to that vocabulary. At least one of these source-language web pages is obtained in order to obtain the content in it that relates to the vocabulary to be translated. The vocabulary to be translated and the at least one source-language web page belong to the first language.
S202, extract the vocabulary in the at least one source-language web page, and record the number of occurrences.
For example, multiple web pages related to "aunt" can be obtained according to the vocabulary to be translated "aunt". The vocabulary in at least one of these web pages is then extracted. For example, from a web page containing "As a result, the price of gold held steady for a while. The press hailed the Chinese aunt for defeating Wall Street.", vocabulary such as "Wall Street", "China" and "price of gold" is extracted, and the number of times each of these occurs is recorded.
S203, take the vocabulary whose number of occurrences exceeds a preset count threshold as the at least one associated entity vocabulary related to the vocabulary to be translated.
In an embodiment of the present invention, if the number of occurrences of a vocabulary in the source-language web pages exceeds the preset count threshold, the vocabulary is considered highly correlated with the vocabulary to be translated and is taken as an associated entity vocabulary. For example, if the entity vocabulary "price of gold" and "Wall Street" each occur more often than the preset count threshold in the web pages obtained for "aunt", then "price of gold" and "Wall Street" can be taken as associated entity vocabulary related to "aunt".
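Steps S202 and S203 can be sketched as a counting pass followed by a threshold filter. The following is a minimal illustration only: the function name `extract_associated_vocabulary`, the pre-supplied candidate list, the sample pages (English stand-ins for the source-language pages) and the threshold value are all assumptions, since the text does not say how vocabulary is segmented or what threshold is used.

```python
from collections import Counter

def extract_associated_vocabulary(pages, candidates, to_translate, threshold):
    """Count candidate occurrences across pages (S202) and keep those whose
    count exceeds the preset threshold (S203)."""
    counts = Counter()
    for page in pages:
        for word in candidates:
            counts[word] += page.count(word)
    # The vocabulary to be translated itself is not an associated entity.
    return [w for w, n in counts.most_common()
            if n > threshold and w != to_translate]

# Hypothetical stand-ins for source-language pages from the first search.
pages = [
    "The gold price held steady. The press hailed the aunt for defeating Wall Street.",
    "Wall Street analysts watched the gold price as the aunt kept buying.",
    "The aunt phenomenon moved the gold price again.",
]
candidates = ["gold price", "Wall Street", "press", "aunt"]
associated = extract_associated_vocabulary(pages, candidates, "aunt", threshold=1)
print(associated)
```

With these sample pages, "press" occurs only once and is filtered out, while the two more frequent entities are kept.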
S103, generate the search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein the search condition belongs to a second language.
In an embodiment of the present invention, as shown in Fig. 3, generating the search condition for the vocabulary to be translated according to the at least one associated entity vocabulary specifically includes the following steps, illustrated below with the sentence "The press hailed the Chinese aunt for defeating Wall Street." as an example.
S301, translate the at least one associated entity vocabulary to generate the translation corresponding to each associated entity vocabulary.
The associated entity vocabulary related to "aunt", such as "price of gold" and "Wall Street", is translated into English: "price of gold" is translated as "gold price", "Wall Street" as "Wall Street", "press" as "The press", and so on.
S302, combine the translations corresponding to the at least one associated entity vocabulary to generate the search condition for the vocabulary to be translated.
The translated entity vocabulary from the web page is combined to generate English search conditions such as "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press".
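The combination in step S302 can be illustrated by joining quoted translations with "+", as in the example search conditions. This is a sketch under assumptions: the function name `build_search_conditions` and the choice to enumerate every three-term combination are hypothetical, because the text does not state how terms are selected for a given condition.

```python
from itertools import combinations

def build_search_conditions(translated_terms, size=3):
    """Combine translated entity vocabulary into quoted, '+'-joined
    second-language search conditions (S302)."""
    return [
        "+".join(f'"{term}"' for term in combo)
        for combo in combinations(translated_terms, size)
    ]

# Translations produced in S301 (values taken from the example in the text).
terms = ["gold price", "Wall Street", "hold steady", "The press"]
conditions = build_search_conditions(terms)
for condition in conditions:
    print(condition)
```

Each resulting string can be submitted to a search engine as one candidate search condition.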
S104, search according to the search condition to obtain the second search results.
Searching with the search condition "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press" returns multiple English web pages. The content of one of these web pages includes "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...".
S105, extract the translation corresponding to the vocabulary to be translated from the second search results.
From the web page content "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...", the translation "dama" of the vocabulary to be translated "aunt" can be obtained.
In the method of the embodiments of the present invention, associated entity vocabulary is extracted from the first search results generated according to the vocabulary to be translated, a search condition is generated from the associated entity vocabulary, a search with that condition yields the second search results, and the translation corresponding to the vocabulary to be translated is finally extracted from the second search results. The method is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience. In addition, the multilingual web pages related to a neologism that a search engine retrieves are timely, so the translations obtained are also highly timely.
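The flow of steps S101 to S105 can be summarized as a small pipeline. The sketch below is illustrative only: the function `acquire_translation` and all of its callable parameters are hypothetical placeholders for a search engine, an entity extractor, a dictionary lookup and a translation extractor, none of which the text specifies in detail. The stubs in the usage example simply return the values from the worked example.

```python
from typing import Callable, List

def acquire_translation(
    vocabulary: str,
    search_source: Callable[[str], List[str]],
    extract_entities: Callable[[str, List[str]], List[str]],
    translate_entity: Callable[[str], str],
    search_target: Callable[[str], List[str]],
    extract_translation: Callable[[str, List[str]], str],
) -> str:
    first_results = search_source(vocabulary)                   # S101
    entities = extract_entities(vocabulary, first_results)      # S102
    translated = [translate_entity(e) for e in entities]        # S301
    condition = "+".join(f'"{t}"' for t in translated)          # S302
    second_results = search_target(condition)                   # S104
    return extract_translation(vocabulary, second_results)      # S105

# Usage with stub components standing in for real services.
result = acquire_translation(
    "aunt",
    search_source=lambda q: ["... price of gold ... Wall Street ..."],
    extract_entities=lambda q, pages: ["price of gold", "Wall Street"],
    translate_entity=lambda e: {"price of gold": "gold price"}.get(e, e),
    search_target=lambda cond: [
        "The press hailed the Chinese dama for defeating Wall Street."
    ],
    extract_translation=lambda q, pages: "dama",
)
print(result)
```

Passing the components in as callables keeps the sketch independent of any particular search engine or dictionary.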
Fig. 4 is the flow chart of the acquisition methods of vocabulary translation in accordance with another embodiment of the present invention.
As shown in Fig. 4, the method for acquiring the translation of a vocabulary includes the following steps:
S401, obtain the vocabulary to be translated, and generate first search results according to the vocabulary to be translated.
In an embodiment of the present invention, the vocabulary to be translated may be a neologism from a breaking news event or hot news, a newly popular expression, or any other vocabulary not included in existing translation dictionaries. The user can enter the vocabulary to be translated into a search engine and run a search to generate the first search results. For example, searching for the vocabulary to be translated "aunt" returns multiple web pages related to "aunt"; these web pages and their content are the first search results.
S402, extract from the first search results, according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated, wherein the vocabulary to be translated and the at least one associated entity vocabulary belong to a first language.
In an embodiment of the present invention, at least one source-language web page is first obtained from the first search results, the vocabulary in the at least one source-language web page is extracted and its number of occurrences recorded, and the vocabulary whose number of occurrences exceeds a preset count threshold is then taken as the at least one associated entity vocabulary related to the vocabulary to be translated.
For example, multiple web pages related to "aunt" can be obtained according to the vocabulary to be translated "aunt". From a web page containing "As a result, the price of gold held steady for a while. The press hailed the Chinese aunt for defeating Wall Street.", vocabulary such as "Wall Street", "China" and "price of gold" is extracted and the number of occurrences of each is recorded; when the number of occurrences exceeds the preset count threshold, "price of gold" and "Wall Street" can be taken as associated entity vocabulary related to "aunt".
S403, generate the search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein the search condition belongs to a second language.
In an embodiment of the present invention, the at least one associated entity vocabulary is first translated to generate the corresponding translations, and these translations are then combined to generate the search condition for the vocabulary to be translated.
For example, the associated entity vocabulary related to "aunt", such as "price of gold" and "Wall Street", is translated into English: "price of gold" is translated as "gold price", "Wall Street" as "Wall Street", "press" as "The press", and so on. These translations are then combined to generate English search conditions such as "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press".
S404, search according to the search condition to obtain the second search results.
Continuing the example above, searching with the search condition "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press" returns multiple English web pages. The content of one of these web pages includes "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...".
S405, extract the translation corresponding to the vocabulary to be translated from the second search results.
Continuing the example above, from the web page content "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...", the translation "dama" of the vocabulary to be translated "aunt" can be obtained.
S406, perform translation detection on the vocabulary to be translated and its corresponding translation.
As shown in Fig. 5, before translation detection is performed on the vocabulary to be translated and its corresponding translation, the method specifically includes the following steps:
S501, obtain at least one target-language web page from the second search results.
The second search results and the target-language web page belong to the second language.
S502, obtain the similarity between the at least one source-language web page and the at least one target-language web page.
In an embodiment of the present invention, the source-language web page belongs to the first language and the target-language web page belongs to the second language, so the similarity between web pages across languages has to be calculated with a method based on mutually translated vocabulary in context. The calculation formula is as follows:
Here, F and E denote the source-language web page and the target-language web page respectively, α_i denotes a weight that is learned automatically from a development set, and f_i(E, F) denotes a feature. Specifically, f_i(E, F) comprises two kinds of features: the average probability r_normal(E, F) that common words are translations of each other, and the average probability r_NE(E, F) that entity vocabulary NE are translations of each other. The specific formula is as follows:
Here, p(e | f) is the translation probability from f to e, F and E denote the source-language web page and the target-language web page respectively, and f and e denote vocabulary in the source-language web page F and the target-language web page E respectively.
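The two formulas referred to in step S502 are not present in this text. A form consistent with the surrounding definitions is given below as a reconstruction, not as the original: the weighted sum follows from the description of α_i as weights on the features f_i(E, F), while the exact way p(e | f) is averaged (here, the best match in E for each f, averaged over F) is an assumption.

```latex
\mathrm{Sim}(E, F) = \sum_{i} \alpha_i \, f_i(E, F),
\qquad f_i \in \{\, r_{\mathrm{normal}},\; r_{\mathrm{NE}} \,\}

% Assumed averaging scheme for both feature types; for r_NE the sums
% range over entity vocabulary only.
r(E, F) = \frac{1}{|F|} \sum_{f \in F} \max_{e \in E} \, p(e \mid f)
```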
S503, perform translation detection according to the similarity between the at least one source-language web page and the at least one target-language web page.
Specifically, when Sim(E, F) > β, the content of the source-language web page and the target-language web page is considered similar. The threshold β is calculated automatically in advance from a development set.
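The similarity test of steps S502 and S503 can be sketched numerically. Everything concrete here is an assumption made for illustration: the function names, the use of the best-matching target word when averaging p(e | f), the toy probability table, the weights and the threshold. Source-language words are shown in English gloss, and in practice the weights and β would be learned from a development set as the text states.

```python
def average_translation_probability(source_words, target_words, p):
    """Average over source words f of the best p(e|f) among target words e.
    Using the maximum over e is an assumption, not stated in the text."""
    if not source_words:
        return 0.0
    total = 0.0
    for f in source_words:
        total += max((p.get((e, f), 0.0) for e in target_words), default=0.0)
    return total / len(source_words)

def page_similarity(f_common, e_common, f_entities, e_entities, p, alphas):
    """Weighted sum of the two feature types described in S502."""
    features = [
        average_translation_probability(f_common, e_common, p),      # r_normal
        average_translation_probability(f_entities, e_entities, p),  # r_NE
    ]
    return sum(a * x for a, x in zip(alphas, features))

# Toy translation table keyed by (e, f); all values are hypothetical.
p = {
    ("steady", "stable"): 0.8,
    ("press", "media"): 0.6,
    ("gold price", "price of gold"): 0.7,
    ("Wall Street", "Wall Street"): 0.9,
}
sim = page_similarity(
    ["stable", "media"], ["steady", "press"],
    ["price of gold", "Wall Street"], ["gold price", "Wall Street"],
    p, alphas=(0.5, 0.5),
)
beta = 0.6  # hypothetical threshold
print(sim > beta)
```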
As shown in Fig. 6, performing translation detection on the vocabulary to be translated and its corresponding translation specifically includes the following steps:
S601, detect the correlation between the vocabulary to be translated and its corresponding translation.
In an embodiment of the present invention, the correlation between the vocabulary to be translated and its corresponding translation can be measured in various ways, such as frequency, mutual information or hypothesis testing. This embodiment uses the frequency freq(f, e) for illustration. The frequency is how often the vocabulary to be translated and its corresponding translation appear together in a source-language web page and a target-language web page. The higher the frequency, the higher the degree to which the vocabulary to be translated and its corresponding translation are translations of each other.
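The frequency freq(f, e) can be illustrated as a count over paired pages. This is a sketch under assumptions: the function name `cooccurrence_frequency`, the pairing of each source-language page with one target-language page, and the sample texts (source pages shown in English gloss) are hypothetical, and substring matching stands in for whatever tokenization a real system would use.

```python
def cooccurrence_frequency(f, e, page_pairs):
    """Count (source page, target page) pairs in which f occurs in the
    source page and e occurs in the paired target page."""
    return sum(1 for source, target in page_pairs if f in source and e in target)

# Hypothetical page pairs judged similar in S503.
page_pairs = [
    ("The press hailed the Chinese aunt for defeating Wall Street.",
     "The press hailed the Chinese dama for defeating Wall Street."),
    ("The aunt rushed to buy gold.",
     "Chinese dama rushed to buy gold."),
    ("The price of gold held steady.",
     "The gold price held steady."),
]
print(cooccurrence_frequency("aunt", "dama", page_pairs))
print(cooccurrence_frequency("aunt", "gold", page_pairs))
```

On this toy data the candidate "dama" co-occurs with "aunt" more often than "gold" does, which is the signal the frequency feature is meant to capture.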
S602, detect the context similarity between the vocabulary to be translated and its corresponding translation.
In an embodiment of the present invention, the context similarity between the vocabulary to be translated and its corresponding translation is calculated in the same way as the web page similarity in step S502, so the calculation is not described again here.
S603, perform translation detection according to the correlation and the context similarity.
In an embodiment of the present invention, translation detection can be performed according to the following formula:
Here, η_i(e, f) denotes features such as the co-occurrence frequency of the word pair and the context similarity, and β_i denotes a weight that is learned automatically from a development set.
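The formula referred to in step S603 is not present in this text. A form consistent with the definitions of η_i(e, f) and β_i, and with the selection of the highest-scoring target-language vocabulary e* in step S407, is sketched below as a reconstruction; the linear combination and the argmax are assumptions.

```latex
e^{*} = \operatorname*{arg\,max}_{e} \; \sum_{i} \beta_i \, \eta_i(e, f)
```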
S407, if the translation detection standard is judged to be met, provide the translation corresponding to the vocabulary to be translated to the user.
In an embodiment of the present invention, the target-language vocabulary e* with the highest degree of mutual translation is provided to the user as the translation corresponding to the vocabulary to be translated.
In the method of the embodiments of the present invention, detecting the similarity between source-language web pages and target-language web pages makes it possible to find multilingual web pages with similar content effectively; detecting the correlation and context similarity between the vocabulary to be translated and its translation further improves the accuracy of the translations obtained for neologisms and improves the user experience.
Fig. 7 is the structural representation of the acquisition device of vocabulary translation according to an embodiment of the invention.
As shown in fig. 7, the acquisition device of vocabulary translation includes:Bilingual lexicon acquisition module 100 to be translated, the first search module 200th, extraction module 300, search condition generation module 400, the second search module 500 and translation extraction module 600.Wherein, Search condition generation module 400 is specifically included:Translation submodule 410 and combination submodule 420.
Specifically, bilingual lexicon acquisition module 100 to be translated is used to obtain vocabulary to be translated.
In an embodiment of the present invention, vocabulary to be translated can be the neologisms in the media event or hot news of burst, or Person is emerging popular vocabulary etc., the vocabulary do not included in existing dictionary for translation.
First search module 200 is for according to vocabulary to be translated the first Search Results of generation.
User can be input into vocabulary to be translated using search engine and scan for, and then the first search module 200 generates first Search Results.For example, treat translation vocabulary " aunt " to scan for, the first search module 200 can obtain multiple and " big The related webpage of mother ", then above-mentioned webpage and its content are the first Search Results.
The extraction module 300 is used to extract, according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated from the first search results, wherein the vocabulary to be translated and the at least one associated entity vocabulary belong to a first language.
Specifically, the extraction module 300 obtains at least one source-language webpage from the first search results, extracts the vocabulary in the at least one source-language webpage and records the number of occurrences of each, and then takes the vocabulary whose number of occurrences exceeds a preset threshold as the at least one associated entity vocabulary related to the vocabulary to be translated, wherein the at least one source-language webpage belongs to the first language.
In an embodiment of the present invention, the extraction module 300 first obtains at least one source-language webpage from the first search results, extracts the vocabulary in the at least one source-language webpage, and records the number of occurrences; the vocabulary whose number of occurrences exceeds the preset threshold is then taken as the at least one associated entity vocabulary related to the vocabulary to be translated.
For example, according to the vocabulary to be translated "aunt", multiple webpages related to "aunt" can be obtained. From a webpage containing the text "Then, the gold price held steady for a while. The press hailed the Chinese aunt for defeating Wall Street.", vocabulary such as "Wall Street", "China", and "gold price" is extracted, and the number of occurrences of each is recorded. When the number of occurrences exceeds the preset threshold, "gold price" and "Wall Street" can be taken as associated entity vocabulary related to "aunt".
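The counting-and-threshold step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each source-language webpage has already been segmented into a list of candidate words (the text does not say how segmentation or entity recognition is done), and the function name, sample pages, and threshold value are hypothetical.

```python
from collections import Counter


def extract_associated_vocabulary(pages, query, threshold):
    """Return the words whose total count across all pages exceeds threshold.

    pages     -- list of token lists, one per source-language webpage (assumed input)
    query     -- the vocabulary to be translated; excluded from the result
    threshold -- the preset occurrence threshold
    """
    counts = Counter()
    for tokens in pages:
        counts.update(token for token in tokens if token != query)
    # Sort so the result does not depend on the order in which words were seen.
    return sorted(word for word, count in counts.items() if count > threshold)


# Hypothetical, pre-segmented pages (English glosses used for readability).
pages = [
    ["gold price", "Wall Street", "aunt", "gold price"],
    ["Wall Street", "aunt", "China"],
    ["gold price", "Wall Street", "press"],
]
associated = extract_associated_vocabulary(pages, "aunt", threshold=2)
```

With these sample pages, only "gold price" and "Wall Street" occur more than twice, so they are the only words kept.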
The search condition generation module 400 is used to generate a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein the search condition belongs to a second language. Specifically, the search condition generation module 400 includes a translation submodule 410 and a combination submodule 420. The following description uses the sentence "The press hailed the Chinese aunt for defeating Wall Street." as an example.
The translation submodule 410 is used to translate the at least one associated entity vocabulary and generate the translation corresponding to the at least one associated entity vocabulary.
The associated entity vocabulary related to "aunt", such as "gold price" and "Wall Street", is translated to generate English translations. The source-language word for "gold price" is translated into "gold price", the word for "Wall Street" into "Wall Street", the word for "press" into "The press", and so on.
The combination submodule 420 is used to combine the translations corresponding to the at least one associated entity vocabulary to generate the search condition for the vocabulary to be translated.
The translations of the entity vocabulary in the webpage are combined to generate English search conditions such as "gold price" + "Wall Street" + "hold steady" or "gold price" + "defeating" + "The press".
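The translate-then-combine step can be sketched as below. This is an illustration under stated assumptions: the bilingual lexicon, the Chinese keys (back-translations chosen for this example, not quoted from the source), the function name, and the choice to enumerate fixed-size combinations are all hypothetical. The text only says that the translations are combined.

```python
from itertools import combinations


def build_search_conditions(entity_words, lexicon, size=2):
    """Translate entity words with a lexicon and join every size-sized
    combination of the translations into one quoted search condition."""
    translations = [lexicon[word] for word in entity_words if word in lexicon]
    return [
        " + ".join(f'"{term}"' for term in combo)
        for combo in combinations(translations, size)
    ]


# Hypothetical lexicon; words missing from it are simply skipped.
lexicon = {"金价": "gold price", "华尔街": "Wall Street", "新闻界": "The press"}
conditions = build_search_conditions(["金价", "华尔街", "新闻界"], lexicon, size=2)
```

Each resulting string pairs two translated terms in the quoted style used in the examples above, and any of them could be submitted to a second-language search.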
The second search module 500 is used to perform a search according to the search condition to obtain second search results.
Using the search condition "gold price" + "Wall Street" + "hold steady" or "gold price" + "defeating" + "The press", multiple English webpages can be found. The content of one of these webpages includes "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...".
The translation extraction module 600 is used to extract the translation corresponding to the vocabulary to be translated from the second search results.
From the webpage content "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...", the translation "dama" of the vocabulary to be translated "aunt" (大妈) can be obtained.
The vocabulary translation acquisition device of the embodiment of the present invention extracts associated entity vocabulary from the first search results generated according to the vocabulary to be translated, obtains second search results using the search condition generated from the associated entity vocabulary, and finally extracts the translation corresponding to the vocabulary to be translated from the second search results. The translation of a new word can thus be obtained quickly. The approach is convenient and intelligent, effectively improves the accuracy of the obtained translation, and improves the user experience. In addition, because the multilingual webpages related to the new word are retrieved through a search engine and are therefore timely, the obtained translation is also highly timely.
Fig. 8 is a schematic structural diagram of a vocabulary translation acquisition device according to an embodiment of the present invention.
As shown in Fig. 8, the vocabulary translation acquisition device includes: a to-be-translated vocabulary acquisition module 100, a first search module 200, an extraction module 300, a search condition generation module 400, a second search module 500, a translation extraction module 600, a translation detection module 700, and a webpage similarity detection module 800. The search condition generation module 400 specifically includes a translation submodule 410 and a combination submodule 420. The translation detection module 700 specifically includes a correlation detection submodule 710, a similarity detection submodule 720, and a translation detection submodule 730.
The translation detection module 700 is used to perform translation detection on the vocabulary to be translated and its corresponding translation, and to provide the corresponding translation to the user when the translation detection criterion is judged to be met.
Specifically, the translation detection module 700 includes: a correlation detection submodule 710, a similarity detection submodule 720, and a translation detection submodule 730.
The correlation detection submodule 710 is used to detect the correlation between the vocabulary to be translated and its corresponding translation.
In an embodiment of the present invention, the correlation between the vocabulary to be translated and its corresponding translation can be measured by various methods, such as frequency, mutual information, or hypothesis testing. This embodiment uses the frequency freq(f, e) for illustration, where the frequency is how often the vocabulary to be translated and its corresponding translation appear together in the source-language webpages and the target-language webpages. The higher the frequency, the higher the degree of mutual translation between the vocabulary to be translated and its corresponding translation.
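A minimal sketch of the frequency measure freq(f, e) follows. It assumes, as an interpretation not spelled out in the text, that source-language and target-language webpages are available as pairs and that "appearing together" means f occurs in the source page while e occurs in the paired target page. The function name and sample pages are hypothetical.

```python
def cooccurrence_frequency(f, e, page_pairs):
    """Count the page pairs where f occurs in the source-language page
    and e occurs in the paired target-language page."""
    return sum(
        1
        for source_text, target_text in page_pairs
        if f in source_text and e in target_text
    )


# Hypothetical (source page, target page) pairs.
page_pairs = [
    ("中国大妈打败华尔街", "The press hailed the Chinese dama for defeating Wall Street."),
    ("大妈抢购黄金", "Chinese dama rushed to buy gold."),
    ("大妈抢购黄金", "Shoppers rushed to buy gold."),
]
freq = cooccurrence_frequency("大妈", "dama", page_pairs)
```

A candidate that co-occurs with the source word in more page pairs receives a higher count, which matches the statement that a higher frequency indicates a higher degree of mutual translation.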
The similarity detection submodule 720 is used to detect the context similarity between the vocabulary to be translated and its corresponding translation.
In an embodiment of the present invention, when the similarity detection submodule 720 detects the context similarity between the vocabulary to be translated and its corresponding translation, the formula used is the same as the formula for calculating webpage similarity in step S502, and is not repeated here.
The translation detection submodule 730 is used to perform translation detection according to the correlation and the context similarity.
In an embodiment of the present invention, the translation detection submodule 730 can detect the translation according to the following formula:
Here, η_i(e, f) denotes features such as the co-occurrence frequency of the word pair and the context similarity, and β_i denotes the corresponding weights, which are learned automatically from a development set.
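The detection formula itself does not appear in this text. One form consistent with the definitions of η_i(e, f) and β_i, offered only as an assumption, is a weighted linear combination of the features. The name Score is introduced here for illustration and does not come from the source.

```latex
\mathrm{Score}(e, f) = \sum_{i} \beta_i \, \eta_i(e, f)
```

Under this reading, a candidate translation e would be accepted for the vocabulary to be translated f when its score satisfies the translation detection criterion.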
The webpage similarity detection module 800 is used to obtain at least one target-language webpage from the second search results, obtain the similarity between the at least one source-language webpage and the at least one target-language webpage, and perform translation detection according to that similarity.
Here, the second search results and the target-language webpage belong to the second language.
In an embodiment of the present invention, the source-language webpage belongs to the first language and the target-language webpage belongs to the second language, so the webpage similarity detection module 800 needs to calculate the cross-language webpage similarity using a method based on mutually translated context vocabulary. The calculation formula is as follows:
Here, F and E denote the source-language webpage and the target-language webpage respectively, α_i denotes a weight learned automatically from a development set, and f_i(E, F) denotes a feature.
Specifically, f_i(E, F) includes two types of features: the average probability r_normal(E, F) that ordinary words are translations of each other, and the average probability r_NE(E, F) that entity vocabulary (NE) items are translations of each other. The specific formulas are as follows:
Here, p(e | f) is the probability of translating f into e, F and E denote the source-language webpage and the target-language webpage respectively, and f and e denote vocabulary in the source-language webpage F and the target-language webpage E respectively.
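The similarity and feature formulas do not appear in this text. The following is one reconstruction consistent with the symbols defined above, given strictly as an assumption. The weighted sum follows from α_i being weights on the features f_i(E, F). The averaged best-match form of the two features, and the sets F_normal and F_NE (the ordinary words and entity words of F), are hypothetical.

```latex
\mathrm{Sim}(E, F) = \sum_{i} \alpha_i \, f_i(E, F),
\qquad f_i \in \{\, r_{normal},\; r_{NE} \,\}

r_{normal}(E, F) = \frac{1}{|F_{normal}|} \sum_{f \in F_{normal}} \max_{e \in E} p(e \mid f),
\qquad
r_{NE}(E, F) = \frac{1}{|F_{NE}|} \sum_{f \in F_{NE}} \max_{e \in E} p(e \mid f)
```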
Specifically, when Sim(E, F) > β, the content of the source-language webpage and the target-language webpage can be considered similar. The threshold β is calculated automatically in advance from a development set.
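The threshold test can be illustrated with a small sketch. Because the exact feature formulas are not available in this text, the code assumes that each feature is the average, over source-page words, of the best translation probability p(e | f) into any word of the target page, and that Sim(E, F) is a weighted sum of the two features. The function names, probability table, weights, and threshold are all hypothetical.

```python
def average_translation_probability(source_words, target_words, prob):
    """Average over source words of the best p(e | f) for any target word.

    prob maps (f, e) pairs to translation probabilities; missing pairs count as 0.
    """
    if not source_words:
        return 0.0
    total = 0.0
    for f in source_words:
        total += max((prob.get((f, e), 0.0) for e in target_words), default=0.0)
    return total / len(source_words)


def webpage_similarity(src_normal, src_entities, target_words, prob, alphas):
    """Weighted sum of the ordinary-word feature and the entity-word feature."""
    r_normal = average_translation_probability(src_normal, target_words, prob)
    r_ne = average_translation_probability(src_entities, target_words, prob)
    return alphas[0] * r_normal + alphas[1] * r_ne


def pages_are_similar(sim, beta):
    """Apply the Sim(E, F) > beta test."""
    return sim > beta


# Hypothetical translation probabilities and page contents.
prob = {("金价", "gold"): 0.8, ("华尔街", "Wall Street"): 0.9, ("保持", "held"): 0.5}
target_words = ["gold", "held", "Wall Street", "press"]
sim = webpage_similarity(
    src_normal=["保持", "一段时间"],
    src_entities=["金价", "华尔街"],
    target_words=target_words,
    prob=prob,
    alphas=(0.4, 0.6),
)
similar = pages_are_similar(sim, beta=0.5)
```

In practice the weights and the threshold would be learned from a development set, as the text states, rather than fixed by hand as in this example.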
The vocabulary translation acquisition device of the embodiment of the present invention can effectively find multilingual webpages with similar content by detecting the similarity between source-language webpages and target-language webpages. By detecting the correlation and the context similarity between the vocabulary to be translated and its translation, it further improves the accuracy of the translation obtained for a new word and improves the user experience.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented with software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following technologies well known in the art can be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the present invention is defined by the claims and their equivalents.

Claims (10)

1. A method for acquiring a vocabulary translation, characterized by comprising:
obtaining a vocabulary to be translated, and generating first search results according to the vocabulary to be translated;
extracting, according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated from the first search results, wherein the vocabulary to be translated and the at least one associated entity vocabulary belong to a first language;
generating a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein generating the search condition for the vocabulary to be translated according to the at least one associated entity vocabulary specifically comprises:
translating the at least one associated entity vocabulary to generate a translation corresponding to the at least one associated entity vocabulary; and
combining the translation corresponding to the at least one associated entity vocabulary to generate the search condition for the vocabulary to be translated, the search condition belonging to a second language;
performing a search according to the search condition to obtain second search results; and
extracting the translation corresponding to the vocabulary to be translated from the second search results.
2. The method of claim 1, characterized in that extracting, according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated from the first search results specifically comprises:
obtaining at least one source-language webpage from the first search results, the at least one source-language webpage belonging to the first language;
extracting the vocabulary in the at least one source-language webpage, and recording the number of occurrences; and
taking the vocabulary whose number of occurrences exceeds a preset threshold as the at least one associated entity vocabulary related to the vocabulary to be translated.
3. The method of claim 2, characterized by further comprising:
performing translation detection on the vocabulary to be translated and the translation corresponding to the vocabulary to be translated; and
if the translation detection criterion is judged to be met, providing the translation corresponding to the vocabulary to be translated to a user.
4. The method of claim 3, characterized in that performing translation detection on the vocabulary to be translated and the translation corresponding to the vocabulary to be translated specifically comprises:
detecting the correlation between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated;
detecting the context similarity between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated; and
performing the translation detection according to the correlation and the context similarity.
5. The method of claim 3, characterized in that, before performing translation detection on the vocabulary to be translated and the translation corresponding to the vocabulary to be translated, the method further comprises:
obtaining at least one target-language webpage from the second search results;
obtaining the similarity between the at least one source-language webpage and the at least one target-language webpage; and
performing translation detection according to the similarity between the at least one source-language webpage and the at least one target-language webpage.
6. A device for acquiring a vocabulary translation, characterized by comprising:
a to-be-translated vocabulary acquisition module, used to obtain a vocabulary to be translated;
a first search module, used to generate first search results according to the vocabulary to be translated;
an extraction module, used to extract, according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated from the first search results, wherein the vocabulary to be translated and the at least one associated entity vocabulary belong to a first language;
a search condition generation module, used to generate a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein the search condition generation module specifically comprises:
a translation submodule, used to translate the at least one associated entity vocabulary and generate a translation corresponding to the at least one associated entity vocabulary; and
a combination submodule, used to combine the translation corresponding to the at least one associated entity vocabulary to generate the search condition for the vocabulary to be translated, the search condition belonging to a second language;
a second search module, used to perform a search according to the search condition to obtain second search results; and
a translation extraction module, used to extract the translation corresponding to the vocabulary to be translated from the second search results.
7. The device of claim 6, characterized in that the extraction module obtains at least one source-language webpage from the first search results, extracts the vocabulary in the at least one source-language webpage and records the number of occurrences, and takes the vocabulary whose number of occurrences exceeds a preset threshold as the at least one associated entity vocabulary related to the vocabulary to be translated, wherein the at least one source-language webpage belongs to the first language.
8. The device of claim 7, characterized by further comprising:
a translation detection module, used to perform translation detection on the vocabulary to be translated and the translation corresponding to the vocabulary to be translated, and to provide the translation corresponding to the vocabulary to be translated to a user when the translation detection criterion is judged to be met.
9. The device of claim 8, characterized in that the translation detection module specifically comprises:
a correlation detection submodule, used to detect the correlation between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated;
a similarity detection submodule, used to detect the context similarity between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated; and
a translation detection submodule, used to perform the translation detection according to the correlation and the context similarity.
10. The device of claim 8, characterized by further comprising:
a webpage similarity detection module, used to obtain at least one target-language webpage from the second search results, obtain the similarity between the at least one source-language webpage and the at least one target-language webpage, and perform translation detection according to the similarity between the at least one source-language webpage and the at least one target-language webpage.
CN201310745535.7A 2013-12-30 2013-12-30 The acquisition methods and device of vocabulary translation Active CN103729445B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310745535.7A CN103729445B (en) 2013-12-30 2013-12-30 The acquisition methods and device of vocabulary translation

Publications (2)

Publication Number Publication Date
CN103729445A CN103729445A (en) 2014-04-16
CN103729445B true CN103729445B (en) 2017-04-05

Family

ID=50453519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310745535.7A Active CN103729445B (en) 2013-12-30 2013-12-30 The acquisition methods and device of vocabulary translation

Country Status (1)

Country Link
CN (1) CN103729445B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970732B (en) * 2014-05-22 2017-05-10 北京百度网讯科技有限公司 Mining method and device of new word translation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012128419A1 (en) * 2011-03-21 2012-09-27 주식회사 코난테크놀로지 Search system and search method for providing integrated multimedia contents
CN103136192A (en) * 2011-11-30 2013-06-05 北京百度网讯科技有限公司 Method and system of identifying translation demand
CN103324680A (en) * 2012-06-01 2013-09-25 微软公司 Language learning opportunities and general search engine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant