CN103729445B - The acquisition methods and device of vocabulary translation - Google Patents
- Publication number: CN103729445B (application CN201310745535.7A)
- Authority: CN (China)
- Prior art keywords: vocabulary, translated, translation, search results, search
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3337—Translation of the query language, e.g. Chinese to English
- G06F16/3338—Query expansion
Abstract
The present invention proposes an acquisition method and device for vocabulary translation. The method includes: obtaining vocabulary to be translated, and generating first search results according to the vocabulary to be translated; extracting, according to the vocabulary to be translated, at least one associated entity vocabulary item related to the vocabulary to be translated from the first search results; generating a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary item; searching according to the search condition to obtain second search results; and extracting the translation corresponding to the vocabulary to be translated from the second search results. In the acquisition method of the embodiments of the present invention, associated entity vocabulary is extracted from the first search results, the second search results are obtained with the search condition generated from the associated entity vocabulary, and the translation corresponding to the vocabulary to be translated is finally extracted from the second search results. The translation of a neologism can therefore be obtained quickly. The method is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to an acquisition method and device for vocabulary translation.
Background technology
With the development of the Internet, people are no longer satisfied with obtaining information from material in a single language and increasingly seek information from material in other languages, so cross-language information acquisition needs to be realized through computer-based automatic translation systems. Current machine translation systems can meet basic reading needs. They are mainly used to obtain mutual-translation word pairs from Chinese web pages in which the two languages follow a specific layout (for example, the English must appear in parentheses and be adjacent to the Chinese translation). For example, in the text "... the survey was commissioned by the non-partisan research organization the Pew foundation (Pew Charitable Trusts) and carried out by the Economic Mobility Project ...", a machine translation system can obtain the translation "Pew Charitable Trusts" for the Chinese name of the Pew foundation.
However, vocabulary arising from sudden news events or hot news is, first, not included in existing translation dictionaries and, second, difficult to translate correctly with automatic translation methods, so the translation accuracy is relatively low. In addition, translating such vocabulary usually requires professional translators to work from the background of the hot news, which is labor-intensive, inconvenient, and not intelligent, and results in a poor user experience.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, a first object of the present invention is to propose an acquisition method for vocabulary translation. The method can quickly obtain the translation of a neologism; it is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience.
A second object of the present invention is to propose an acquisition device for vocabulary translation.
To achieve the above objects, the acquisition method for vocabulary translation according to embodiments of the first aspect of the present invention includes the following steps: obtaining vocabulary to be translated, and generating first search results according to the vocabulary to be translated; extracting, according to the vocabulary to be translated, at least one associated entity vocabulary item related to the vocabulary to be translated from the first search results, wherein the vocabulary to be translated and the at least one associated entity vocabulary item belong to a first language; generating a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary item, wherein the search condition belongs to a second language; searching according to the search condition to obtain second search results; and extracting the translation corresponding to the vocabulary to be translated from the second search results.
In the acquisition method for vocabulary translation of the embodiments of the present invention, associated entity vocabulary is extracted from the first search results generated according to the vocabulary to be translated, the second search results are obtained with the search condition generated from the associated entity vocabulary, and the translation corresponding to the vocabulary to be translated is finally extracted from the second search results. The translation of a neologism can therefore be obtained quickly; the method is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience. In addition, the multilingual web pages related to a neologism that a search engine retrieves are timely, so the translations obtained are also highly timely.
To achieve the above objects, the acquisition device for vocabulary translation according to embodiments of the second aspect of the present invention includes: a to-be-translated vocabulary acquisition module, configured to obtain vocabulary to be translated; a first search module, configured to generate first search results according to the vocabulary to be translated; an extraction module, configured to extract, according to the vocabulary to be translated, at least one associated entity vocabulary item related to the vocabulary to be translated from the first search results, wherein the vocabulary to be translated and the at least one associated entity vocabulary item belong to a first language; a search condition generation module, configured to generate a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary item, wherein the search condition belongs to a second language; a second search module, configured to search according to the search condition to obtain second search results; and a translation extraction module, configured to extract the translation corresponding to the vocabulary to be translated from the second search results.
In the acquisition device for vocabulary translation of the embodiments of the present invention, associated entity vocabulary is extracted from the first search results generated according to the vocabulary to be translated, the second search results are obtained with the search condition generated from the associated entity vocabulary, and the translation corresponding to the vocabulary to be translated is finally extracted from the second search results. The translation of a neologism can therefore be obtained quickly; the device is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the present invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of an acquisition method for vocabulary translation according to an embodiment of the present invention;
Fig. 2 is a flow chart of extracting at least one associated entity vocabulary item related to the vocabulary to be translated according to an embodiment of the present invention;
Fig. 3 is a flow chart of generating the search condition of the vocabulary to be translated from at least one associated entity vocabulary item according to an embodiment of the present invention;
Fig. 4 is a flow chart of an acquisition method for vocabulary translation according to another embodiment of the present invention;
Fig. 5 is a flow chart of detecting web page similarity according to an embodiment of the present invention;
Fig. 6 is a flow chart of performing translation detection on the vocabulary to be translated and its corresponding translation according to an embodiment of the present invention;
Fig. 7 is a structural diagram of an acquisition device for vocabulary translation according to an embodiment of the present invention;
Fig. 8 is a structural diagram of an acquisition device for vocabulary translation according to an embodiment of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and should not be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
In the description of the present invention, the terms "first", "second", and the like are used for descriptive purposes only and should not be understood as indicating or implying relative importance. Unless otherwise expressly specified and limited, the terms "connected" and "connection" should be interpreted broadly: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances. Furthermore, unless otherwise stated, "multiple" means two or more.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The acquisition method and device for vocabulary translation of the embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of an acquisition method for vocabulary translation according to an embodiment of the present invention.
As shown in Fig. 1, the acquisition method for vocabulary translation includes the following steps:
S101: obtain vocabulary to be translated, and generate first search results according to the vocabulary to be translated.
In an embodiment of the present invention, the vocabulary to be translated may be a neologism from a sudden news event or hot news, a newly emerging popular word, or other vocabulary not included in existing translation dictionaries. The user can enter the vocabulary to be translated into a search engine to perform a search and thereby generate the first search results. For example, searching for the vocabulary to be translated "aunt" returns multiple web pages related to "aunt"; these web pages and their content constitute the first search results.
S102: extract, according to the vocabulary to be translated, at least one associated entity vocabulary item related to the vocabulary to be translated from the first search results, wherein the vocabulary to be translated and the at least one associated entity vocabulary item belong to a first language.
In an embodiment of the present invention, as shown in Fig. 2, extracting the at least one associated entity vocabulary item related to the vocabulary to be translated from the first search results specifically includes the following steps:
S201: obtain at least one source-language web page from the first search results, wherein the at least one source-language web page belongs to the first language.
In an embodiment of the present invention, the first search results generated according to the vocabulary to be translated may contain multiple source-language web pages related to that vocabulary. At least one of these source-language web pages is obtained in order to get the content in it that relates to the vocabulary to be translated. The vocabulary to be translated and the at least one source-language web page belong to the first language.
S202: extract the vocabulary in the at least one source-language web page, and record the number of occurrences of each vocabulary item.
For example, according to the vocabulary to be translated "aunt", multiple web pages related to "aunt" can be obtained, and the vocabulary in at least one of them is extracted. From a web page containing "As a result, the price of gold held steady for a while. The press hailed the Chinese aunt for defeating Wall Street.", vocabulary such as "Wall Street", "China", and "price of gold" is extracted, and the number of times each item occurs is recorded.
S203: take the vocabulary whose number of occurrences exceeds a preset count threshold as the at least one associated entity vocabulary item related to the vocabulary to be translated.
In an embodiment of the present invention, if the number of occurrences of a vocabulary item in the source-language web pages exceeds the preset count threshold, the item is considered highly relevant to the vocabulary to be translated and is taken as an associated entity vocabulary item. For example, if the entity vocabulary "price of gold" and "Wall Street" occur more often than the preset count threshold in the web pages obtained for "aunt", then "price of gold" and "Wall Street" can be taken as associated entity vocabulary related to "aunt".
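Steps S201 to S203 amount to counting and thresholding. The following Python sketch is a minimal illustration under stated assumptions, not the patent's implementation: pages are assumed to be already segmented into words, the function name `extract_associated_entities`, the `stop_words` parameter, and the sample data are hypothetical, and English glosses stand in for first-language words. Deciding which words are entities is left out.

```python
from collections import Counter


def extract_associated_entities(pages, query_word, min_count, stop_words=frozenset()):
    """Count words across tokenized source pages and keep the frequent ones."""
    counts = Counter()
    for tokens in pages:
        # Skip the to-be-translated word itself and any stop words (S202).
        counts.update(t for t in tokens if t != query_word and t not in stop_words)
    # Keep words whose count is strictly greater than the threshold (S203).
    return [word for word, n in counts.most_common() if n > min_count]


# Hypothetical, pre-tokenized source-language pages for the query "aunt".
pages = [
    ["aunt", "price of gold", "Wall Street", "China"],
    ["aunt", "price of gold", "Wall Street"],
    ["aunt", "price of gold", "press"],
]
entities = extract_associated_entities(pages, "aunt", min_count=1)
print(entities)
```

With a threshold of 1, only "price of gold" (3 occurrences) and "Wall Street" (2 occurrences) survive, which mirrors the example in the text.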
S103: generate a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary item, wherein the search condition belongs to a second language.
In an embodiment of the present invention, as shown in Fig. 3, generating the search condition for the vocabulary to be translated according to the at least one associated entity vocabulary item specifically includes the following steps. The description below uses the sentence "The press hailed the Chinese aunt for defeating Wall Street." as an example.
S301: translate the at least one associated entity vocabulary item to generate its corresponding translation.
The associated entity vocabulary related to "aunt", such as "price of gold" and "Wall Street", is translated into English: "price of gold" is translated as "gold price", "Wall Street" as "Wall Street", "press" as "The press", and so on.
S302: combine the translations corresponding to the at least one associated entity vocabulary item to generate the search condition for the vocabulary to be translated.
The translated entity vocabulary from the web page is combined to generate an English search condition such as "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press".
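Steps S301 and S302 can be sketched as a dictionary lookup followed by combination. The sketch below is an assumption-laden illustration: the text does not say how translations are looked up or how many terms go into one search condition, so `build_search_conditions`, its `translate` callable, the `group_size` parameter, and `toy_dictionary` are all hypothetical.

```python
from itertools import combinations


def build_search_conditions(entities, translate, group_size=2):
    """Translate each associated entity word, then combine translations into queries."""
    # S301: translate; drop entities for which no translation is available.
    translated = [t for t in (translate(word) for word in entities) if t]
    # S302: quote each translation and join with "+", as in the example queries.
    return ["+".join(f'"{t}"' for t in group)
            for group in combinations(translated, group_size)]


# Hypothetical bilingual dictionary; keys are glosses of first-language words.
toy_dictionary = {
    "price of gold": "gold price",
    "Wall Street": "Wall Street",
    "press": "The press",
}
conditions = build_search_conditions(
    ["price of gold", "Wall Street", "press"], toy_dictionary.get
)
for condition in conditions:
    print(condition)
```

Generating every pair is only one possible choice; a real system might instead use all translations in a single query or rank combinations by entity frequency.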
S104: search according to the search condition to obtain second search results.
Searching with the search condition "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press" can return multiple English web pages. The content of one of these web pages includes "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...".
S105: extract the translation corresponding to the vocabulary to be translated from the second search results.
From the web page content "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...", the translation "dama" of the vocabulary to be translated "aunt" can be obtained.
In the acquisition method for vocabulary translation of the embodiments of the present invention, associated entity vocabulary is extracted from the first search results generated according to the vocabulary to be translated, a search condition is generated from the associated entity vocabulary, a search is performed with the search condition to obtain second search results, and the translation corresponding to the vocabulary to be translated is finally extracted from the second search results. The method is convenient and intelligent, effectively improves the accuracy of the translations obtained for neologisms, and improves the user experience. In addition, the multilingual web pages related to a neologism that a search engine retrieves are timely, so the translations obtained are also highly timely.
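The data flow of steps S101 to S105 can be summarized in a few lines. This is a hedged sketch only: every stage is passed in as a callable because the text does not fix a search engine, an extraction technique, or an API, and the name `acquire_translation` and the stub stages in the usage example are hypothetical.

```python
def acquire_translation(word, search_source, extract_entities, build_conditions,
                        search_target, extract_translation):
    """Run steps S101 to S105, with each stage supplied by the caller."""
    first_results = search_source(word)                # S101: first-language search
    entities = extract_entities(first_results, word)   # S102: associated entity words
    conditions = build_conditions(entities)            # S103: second-language queries
    second_results = [page                             # S104: second-language search
                      for condition in conditions
                      for page in search_target(condition)]
    return extract_translation(second_results, word)   # S105: pick the translation


# Stub stages that replay the running example; none of them calls a real service.
result = acquire_translation(
    "aunt",
    search_source=lambda w: [["aunt", "price of gold", "Wall Street"]],
    extract_entities=lambda pages, w: [t for t in pages[0] if t != w],
    build_conditions=lambda ents: ['"gold price"+"Wall Street"'],
    search_target=lambda cond: [
        "The press hailed the Chinese dama for defeating Wall Street."
    ],
    extract_translation=lambda pages, w: "dama" if any("dama" in p for p in pages) else None,
)
print(result)
```

Passing the stages as parameters keeps the sketch independent of any particular search backend and makes each step replaceable in tests.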
Fig. 4 is a flow chart of an acquisition method for vocabulary translation according to another embodiment of the present invention.
As shown in Fig. 4, the acquisition method for vocabulary translation includes the following steps:
S401: obtain vocabulary to be translated, and generate first search results according to the vocabulary to be translated.
In an embodiment of the present invention, the vocabulary to be translated may be a neologism from a sudden news event or hot news, a newly emerging popular word, or other vocabulary not included in existing translation dictionaries. The user can enter the vocabulary to be translated into a search engine to perform a search and thereby generate the first search results. For example, searching for the vocabulary to be translated "aunt" returns multiple web pages related to "aunt"; these web pages and their content constitute the first search results.
S402: extract, according to the vocabulary to be translated, at least one associated entity vocabulary item related to the vocabulary to be translated from the first search results, wherein the vocabulary to be translated and the at least one associated entity vocabulary item belong to a first language.
In an embodiment of the present invention, at least one source-language web page is first obtained from the first search results, the vocabulary in the at least one source-language web page is extracted and its number of occurrences recorded, and the vocabulary whose number of occurrences exceeds a preset count threshold is then taken as the at least one associated entity vocabulary item related to the vocabulary to be translated.
For example, according to the vocabulary to be translated "aunt", multiple web pages related to "aunt" can be obtained. From a web page containing "As a result, the price of gold held steady for a while. The press hailed the Chinese aunt for defeating Wall Street.", vocabulary such as "Wall Street", "China", and "price of gold" is extracted, and the number of times each item occurs is recorded. When the number of occurrences exceeds the preset count threshold, "price of gold" and "Wall Street" can be taken as associated entity vocabulary related to "aunt".
S403: generate a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary item, wherein the search condition belongs to a second language.
In an embodiment of the present invention, the at least one associated entity vocabulary item is first translated to generate its corresponding translation, and the translations are then combined to generate the search condition for the vocabulary to be translated.
For example, the associated entity vocabulary related to "aunt", such as "price of gold" and "Wall Street", is translated into English: "price of gold" is translated as "gold price", "Wall Street" as "Wall Street", "press" as "The press", and so on. These translations are then combined to generate an English search condition such as "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press".
S404: search according to the search condition to obtain second search results.
Continuing the above example, searching with the search condition "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press" can return multiple English web pages. The content of one of these web pages includes "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...".
S405: extract the translation corresponding to the vocabulary to be translated from the second search results.
Continuing the above example, from the web page content "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...", the translation "dama" of the vocabulary to be translated "aunt" can be obtained.
S406: perform translation detection on the vocabulary to be translated and its corresponding translation.
As shown in Fig. 5, before translation detection is performed on the vocabulary to be translated and its corresponding translation, the method specifically includes the following steps:
S501: obtain at least one target-language web page from the second search results.
The second search results and the target-language web page belong to the second language.
S502: obtain the similarity between the at least one source-language web page and the at least one target-language web page.
In an embodiment of the present invention, the source-language web page belongs to the first language and the target-language web page belongs to the second language, so the similarity between web pages in different languages needs to be calculated with a method based on mutually translated context vocabulary. The calculation formula is as follows:
where F and E denote the source-language web page and the target-language web page respectively, α_i denotes a weight that is learned automatically from a development set, and f_i(E, F) denotes a feature. Specifically, f_i(E, F) includes two kinds of features: the average probability r_normal(E, F) that common words are translations of each other, and the average probability r_NE(E, F) that entity vocabulary (NE) items are translations of each other. The specific formulas are as follows:
where p(e | f) is the probability of translating f into e, F and E denote the source-language web page and the target-language web page respectively, and f and e denote vocabulary in the source-language web page F and the target-language web page E respectively.
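The formulas referred to above are not reproduced in this text. A form consistent with the surrounding definitions, offered as an assumption rather than as the patent's exact formula, is a weighted sum of the two features:

```latex
\mathrm{Sim}(E, F) = \sum_{i} \alpha_i \, f_i(E, F)
                   = \alpha_1 \, r_{\mathrm{normal}}(E, F) + \alpha_2 \, r_{\mathrm{NE}}(E, F)
```

One plausible reading of "average probability" (again an assumption) takes, for each source word, the best translation probability into the target page and averages over the source words of the relevant class:

```latex
r(E, F) = \frac{1}{|F|} \sum_{f \in F} \max_{e \in E} p(e \mid f)
```

Here r stands for either r_normal or r_NE, with F and E restricted to common words or to entity vocabulary respectively.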
S503: perform translation detection according to the similarity between the at least one source-language web page and the at least one target-language web page.
Specifically, when Sim(E, F) > β, the content of the source-language web page and the target-language web page can be considered similar. The threshold β is calculated automatically in advance from the development set.
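The similarity test of steps S502 and S503 can be illustrated with a small Python sketch. It assumes a weighted sum of two features and assumes that each feature averages, over source words, the best translation probability found in the target page; neither assumption is confirmed by the text. The function names, the page layout as dictionaries, the probability table `p`, the weights `alpha`, and the threshold `beta` are all hypothetical.

```python
def average_translation_probability(source_words, target_words, p):
    """Average over source words of the best p(e | f) found in the target page."""
    if not source_words or not target_words:
        return 0.0
    best = [max(p.get((e, f), 0.0) for e in target_words) for f in source_words]
    return sum(best) / len(best)


def page_similarity(source_page, target_page, p, alpha):
    """Weighted sum of the common-word feature and the entity-word feature."""
    r_normal = average_translation_probability(
        source_page["common"], target_page["common"], p)
    r_ne = average_translation_probability(
        source_page["entities"], target_page["entities"], p)
    return alpha[0] * r_normal + alpha[1] * r_ne


# Toy translation table keyed by (target word e, source word f).
p = {
    ("steady", "stable"): 0.5,
    ("gold price", "price of gold"): 0.75,
    ("Wall Street", "Wall Street"): 1.0,
}
F = {"common": ["stable", "period"], "entities": ["price of gold", "Wall Street"]}
E = {"common": ["steady", "while"], "entities": ["gold price", "Wall Street"]}

sim = page_similarity(F, E, p, alpha=(0.5, 0.5))
beta = 0.5  # hypothetical threshold; the text learns it from a development set
is_similar = sim > beta
print(sim, is_similar)
```

In this toy case the common-word feature is 0.25 and the entity feature is 0.875, so the pages pass the threshold mainly because their entity vocabulary matches.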
As shown in Fig. 6, performing translation detection on the vocabulary to be translated and its corresponding translation specifically includes the following steps:
S601: detect the correlation between the vocabulary to be translated and its corresponding translation.
In an embodiment of the present invention, the correlation between the vocabulary to be translated and its corresponding translation can be measured by various methods, such as frequency, mutual information, and hypothesis testing. This embodiment uses the frequency freq(f, e) for illustration. The frequency is the number of times the vocabulary to be translated and its corresponding translation appear together in a source-language web page and a target-language web page. The higher the frequency, the higher the degree of mutual translation between the vocabulary to be translated and its corresponding translation.
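The frequency feature freq(f, e) can be sketched as a count over paired pages. The sketch assumes that source and target pages have already been paired (for example by the similarity check) and reduced to sets of words; the function name `cooccurrence_frequency` and the sample page pairs are hypothetical.

```python
def cooccurrence_frequency(page_pairs, source_word, candidate):
    """Count page pairs where the source page has f and the target page has e."""
    return sum(
        1 for source_page, target_page in page_pairs
        if source_word in source_page and candidate in target_page
    )


# Hypothetical (source page, target page) pairs, each reduced to a set of words.
page_pairs = [
    ({"aunt", "price of gold"}, {"dama", "gold", "price"}),
    ({"aunt", "Wall Street"}, {"dama", "Wall", "Street"}),
    ({"aunt", "press"}, {"press", "market"}),
]
freq = {
    candidate: cooccurrence_frequency(page_pairs, "aunt", candidate)
    for candidate in ("dama", "gold", "press")
}
print(freq)
```

Here "dama" co-occurs with "aunt" in two page pairs and the other candidates in one, so frequency alone already favors "dama".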
S602: detect the context similarity between the vocabulary to be translated and its corresponding translation.
In an embodiment of the present invention, the context similarity between the vocabulary to be translated and its corresponding translation is calculated in the same way as the web page similarity in step S502, and is not described again here.
S603: perform translation detection according to the correlation and the context similarity.
In an embodiment of the present invention, the translation can be detected according to the following formula:
where η_i(e, f) denotes features such as the co-occurrence frequency of the word pair and the context similarity, and β_i denotes a weight that is learned automatically from the development set.
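The detection formula is not reproduced in this text. A form consistent with the definitions of η_i(e, f) and β_i, and with the selection of the target-language vocabulary e* with the highest degree of mutual translation, would be the following; it is an assumption, not the patent's confirmed formula:

```latex
e^{*} = \arg\max_{e} \sum_{i} \beta_i \, \eta_i(e, f)
```

Here f is the vocabulary to be translated and e ranges over the candidate translations found in the target-language web pages. The weights β_i are distinct from the similarity threshold β used in step S503.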
S407: if the translation detection criteria are determined to be met, provide the translation corresponding to the vocabulary to be translated to the user.
In an embodiment of the present invention, the target-language vocabulary e* with the highest degree of mutual translation is provided to the user as the translation corresponding to the vocabulary to be translated.
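Choosing e* can be sketched as scoring each candidate with weighted features and keeping the best one if it clears a criterion. The sketch assumes a weighted-sum score and a simple minimum-score criterion, neither of which is specified in the text; `select_translation`, `min_score`, and the feature values are hypothetical.

```python
def select_translation(candidates, features, weights, min_score=0.0):
    """Return the best-scoring candidate, or None if it fails the criterion."""
    if not candidates:
        return None

    def score(e):
        # Weighted sum of feature values for candidate e.
        return sum(w * feature(e) for w, feature in zip(weights, features))

    best = max(candidates, key=score)
    return best if score(best) > min_score else None


# Hypothetical feature values for two candidate translations of "aunt".
frequency = {"dama": 2, "gold": 1}
context_similarity = {"dama": 0.75, "gold": 0.25}
features = [frequency.get, context_similarity.get]

best = select_translation(["dama", "gold"], features, weights=(0.5, 0.5), min_score=1.0)
print(best)
```

Returning `None` when no candidate clears the criterion corresponds to not offering a translation to the user.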
In the acquisition method for vocabulary translation of the embodiments of the present invention, detecting the similarity between source-language web pages and target-language web pages makes it possible to find multilingual web pages with similar content effectively. Detecting the correlation and context similarity between the vocabulary to be translated and its translation further improves the accuracy of the translations obtained for neologisms and improves the user experience.
Fig. 7 is a structural diagram of an acquisition device for vocabulary translation according to an embodiment of the present invention.
As shown in Fig. 7, the acquisition device for vocabulary translation includes: a to-be-translated vocabulary acquisition module 100, a first search module 200, an extraction module 300, a search condition generation module 400, a second search module 500, and a translation extraction module 600. The search condition generation module 400 specifically includes a translation submodule 410 and a combination submodule 420.
Specifically, the to-be-translated vocabulary acquisition module 100 is configured to obtain the vocabulary to be translated.
In an embodiment of the present invention, the vocabulary to be translated may be a neologism from a sudden news event or hot news, a newly emerging popular word, or other vocabulary not included in existing translation dictionaries.
The first search module 200 is configured to generate first search results according to the vocabulary to be translated.
The user can enter the vocabulary to be translated into a search engine to perform a search, and the first search module 200 then generates the first search results. For example, when the vocabulary to be translated "aunt" is searched, the first search module 200 can obtain multiple web pages related to "aunt"; these web pages and their content constitute the first search results.
The extraction module 300 is configured to extract, according to the vocabulary to be translated, at least one associated entity vocabulary item related to the vocabulary to be translated from the first search results, wherein the vocabulary to be translated and the at least one associated entity vocabulary item belong to a first language.
Specifically, the extraction module 300 obtains at least one source-language web page from the first search results, extracts the vocabulary in the at least one source-language web page and records its number of occurrences, and then takes the vocabulary whose number of occurrences exceeds a preset count threshold as the at least one associated entity vocabulary item related to the vocabulary to be translated, wherein the at least one source-language web page belongs to the first language.
In an embodiment of the present invention, the extraction module 300 first obtains at least one source-language web page from the first search results, extracts the vocabulary in the at least one source-language web page and records its number of occurrences, and then takes the vocabulary whose number of occurrences exceeds the preset count threshold as the at least one associated entity vocabulary item related to the vocabulary to be translated.
For example, according to the vocabulary to be translated "aunt", multiple web pages related to "aunt" can be obtained. From a web page containing "As a result, the price of gold held steady for a while. The press hailed the Chinese aunt for defeating Wall Street.", vocabulary such as "Wall Street", "China", and "price of gold" is extracted, and the number of times each item occurs is recorded. When the number of occurrences exceeds the preset count threshold, "price of gold" and "Wall Street" can be taken as associated entity vocabulary related to "aunt".
The search condition generation module 400 is configured to generate a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein the search condition belongs to a second language. Specifically, the search condition generation module 400 includes a translation submodule 410 and a combination submodule 420. The following description takes "The press hailed the Chinese aunt for defeating Wall Street." as an example.
The translation submodule 410 is configured to translate the at least one associated entity vocabulary to generate a translation corresponding to the at least one associated entity vocabulary.
The associated entity vocabulary related to "aunt", such as "price of gold" and "Wall Street", is translated to generate English translations. "Price of gold" is translated into "gold price", "Wall Street" into "Wall Street", "press" into "The press", and so on.
The combination submodule 420 is configured to combine the translations corresponding to the at least one associated entity vocabulary to generate the search condition for the vocabulary to be translated.
The translated entity vocabulary from the web page is combined to generate English search conditions such as "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press".
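A minimal sketch of the translate-then-combine step, under stated assumptions: the bilingual lexicon is a plain dictionary, the function name `build_search_conditions` and the `size` parameter are hypothetical, and conditions are formed from every combination of a fixed number of translated terms. The source does not say how terms are selected for a condition, so the combination strategy here is only one possibility.

```python
from itertools import combinations

def build_search_conditions(entities, lexicon, size=2):
    """Translate entity words and join them into quoted search conditions."""
    # Translation step: keep only entities the (assumed) lexicon can translate.
    translations = [lexicon[e] for e in entities if e in lexicon]
    # Combination step: every `size`-term combination becomes one condition,
    # written in the "term"+"term" style used in the example above.
    return ["+".join(f'"{t}"' for t in combo)
            for combo in combinations(translations, size)]

# Keys are English stand-ins for first-language (source) words.
lexicon = {
    "price of gold": "gold price",
    "Wall Street": "Wall Street",
    "press": "The press",
}
conditions = build_search_conditions(["price of gold", "Wall Street", "press"], lexicon)
for condition in conditions:
    print(condition)
```

Entities missing from the lexicon are silently skipped in this sketch; a production system would more likely fall back to a machine translation service.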
The second search module 500 is configured to perform a search according to the search condition to obtain second search results.
According to the search condition "gold price"+"Wall Street"+"hold steady" or "gold price"+"defeating"+"The press", multiple English web pages can be retrieved. The content of one of these web pages includes "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...".
The translation extraction module 600 is configured to extract the translation corresponding to the vocabulary to be translated from the second search results.
From the web page content "... As a result, the gold price held steady for a while. The press hailed the Chinese dama for defeating Wall Street. ...", the translation "dama" of the vocabulary to be translated "aunt" can be obtained.
With the device for acquiring a vocabulary translation according to the embodiment of the present invention, associated entity vocabulary is extracted from the first search results generated according to the vocabulary to be translated, second search results are obtained using the search condition generated from the associated entity vocabulary, and the translation corresponding to the vocabulary to be translated is finally extracted from the second search results. The translation of a new word can thus be obtained quickly, which is convenient and intelligent, effectively improves the accuracy of the obtained translation, and improves the user experience. In addition, because the multilingual web pages related to the new word are retrieved by a search engine and are therefore timely, the obtained translation is also highly timely.
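The data flow through modules 200 to 600 can be summarized in a short orchestration sketch. Everything here is illustrative: the function name `acquire_translation` and its callable parameters are hypothetical, and each stage is injected as a function because the source does not prescribe a particular search engine or extraction algorithm.

```python
from typing import Callable, List

def acquire_translation(
    word: str,
    search_source: Callable[[str], List[str]],
    extract_entities: Callable[[List[str], str], List[str]],
    build_condition: Callable[[List[str]], str],
    search_target: Callable[[str], List[str]],
    extract_translation: Callable[[List[str], str], str],
) -> str:
    """Chain the five stages; each argument stands in for one module."""
    first_results = search_source(word)                # first search module 200
    entities = extract_entities(first_results, word)   # extraction module 300
    condition = build_condition(entities)              # search condition generation module 400
    second_results = search_target(condition)          # second search module 500
    return extract_translation(second_results, word)   # translation extraction module 600

# Stub stages, used only to show how the pieces connect.
result = acquire_translation(
    "aunt",
    search_source=lambda w: [f"page about {w}"],
    extract_entities=lambda pages, w: ["price of gold", "Wall Street"],
    build_condition=lambda ents: '"gold price"+"Wall Street"',
    search_target=lambda cond: ["The press hailed the Chinese dama"],
    extract_translation=lambda pages, w: "dama",
)
print(result)
```

Passing the stages in as callables keeps the sketch independent of any specific search backend and makes each stage easy to replace or test in isolation.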
Fig. 8 is a structural schematic diagram of a device for acquiring a vocabulary translation according to an embodiment of the present invention.
As shown in Fig. 8, the device for acquiring a vocabulary translation includes: a to-be-translated vocabulary acquisition module 100, a first search module 200, an extraction module 300, a search condition generation module 400, a second search module 500, a translation extraction module 600, a translation detection module 700, and a web page similarity detection module 800. The search condition generation module 400 specifically includes a translation submodule 410 and a combination submodule 420. The translation detection module 700 specifically includes a correlation detection submodule 710, a similarity detection submodule 720, and a translation detection submodule 730.
The translation detection module 700 is configured to perform translation detection on the vocabulary to be translated and the translation corresponding to the vocabulary to be translated, and to provide the translation corresponding to the vocabulary to be translated to the user when it is determined that the translation detection criterion is met.
Specifically, the translation detection module 700 includes a correlation detection submodule 710, a similarity detection submodule 720, and a translation detection submodule 730.
The correlation detection submodule 710 is configured to detect the correlation between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated.
In an embodiment of the present invention, the correlation between the vocabulary to be translated and its corresponding translation can be measured by various methods, such as frequency, mutual information, or hypothesis testing. This embodiment uses the frequency freq(f, e) for illustration, where the frequency is how often the vocabulary to be translated and its corresponding translation appear together in source-language web pages and target-language web pages. The higher the frequency, the more likely the vocabulary to be translated and its corresponding translation are translations of each other.
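One way to compute such a co-occurrence frequency is sketched below. The pairing of source-language and target-language pages, the normalization by the number of pairs, and the name `cooccurrence_freq` are assumptions made for illustration; the source only states that freq(f, e) reflects how often the two words appear together.

```python
def cooccurrence_freq(f, e, page_pairs):
    """Fraction of (source page, target page) pairs where f and e co-occur.

    Each page is assumed to be a set of tokens; how pages are paired
    (for example, by a web page similarity check) is outside this sketch.
    """
    if not page_pairs:
        return 0.0
    hits = sum(1 for src, tgt in page_pairs if f in src and e in tgt)
    return hits / len(page_pairs)

# Hypothetical paired pages, with English stand-ins for source-language tokens.
page_pairs = [
    ({"aunt", "price of gold"}, {"dama", "gold price"}),
    ({"aunt", "Wall Street"}, {"Wall Street", "press"}),
    ({"price of gold"}, {"dama"}),
]
freq = cooccurrence_freq("aunt", "dama", page_pairs)
print(freq)
```

Only the first pair contains both "aunt" on the source side and "dama" on the target side, so one of the three pairs counts as a hit.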
The similarity detection submodule 720 is configured to detect the context similarity between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated.
In an embodiment of the present invention, when the similarity detection submodule 720 detects the context similarity between the vocabulary to be translated and its corresponding translation, the formula used is the same as the formula for calculating web page similarity in step S502, and is not repeated here.
The translation detection submodule 730 is configured to perform translation detection according to the correlation and the context similarity.
In an embodiment of the present invention, the translation detection submodule 730 can perform translation detection according to the following formula:
where η_i(e, f) represents features of the word pair such as its frequency of occurrence and its context similarity, and β_i represents the corresponding weight, which is learned automatically from a development set.
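The formula itself does not appear in this text. A form consistent with the definitions given here (features η_i(e, f) with learned weights β_i) would be a weighted linear combination of the features. The linear form and the name Score are assumptions for illustration, not a reproduction of the original formula:

```latex
\mathrm{Score}(e, f) = \sum_{i} \beta_i \, \eta_i(e, f)
```

Under this reading, a candidate translation e would be accepted for the vocabulary to be translated f when its score satisfies the translation detection criterion, for example by exceeding a threshold.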
The web page similarity detection module 800 is configured to obtain at least one target-language web page from the second search results, obtain the similarity between the at least one source-language web page and the at least one target-language web page, and perform translation detection according to that similarity.
The second search results and the target-language web pages belong to the second language.
In an embodiment of the present invention, the source-language web page belongs to the first language and the target-language web page belongs to the second language, so the web page similarity detection module 800 needs to calculate the cross-language web page similarity using a method based on vocabulary that is mutually translated in context. The calculation formula is as follows:
where F and E represent the source-language web page and the target-language web page respectively, α_i represents a weight that is learned automatically from a development set, and f_i(E, F) represents a feature.
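The calculation formula does not appear in this text. Given features f_i(E, F) with learned weights α_i and the similarity value Sim(E, F) used for thresholding, a weighted linear combination is a natural reading. The linear form is an assumption for illustration, not a reproduction of the original formula:

```latex
\mathrm{Sim}(E, F) = \sum_{i} \alpha_i \, f_i(E, F)
```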
Specifically, f_i(E, F) includes two types of features: the average probability r_normal(E, F) that common words are translations of each other, and the average probability r_NE(E, F) that entity vocabulary (NE) items are translations of each other. The specific formulas are as follows:
where p(e | f) is the translation probability from f to e, F and E represent the source-language web page and the target-language web page respectively, and f and e represent vocabulary in the source-language web page F and the target-language web page E respectively.
Specifically, when Sim(E, F) > β, the content of the source-language web page and the target-language web page can be considered similar. The threshold β is calculated automatically in advance from a development set.
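The similarity check can be illustrated with a small sketch. The exact definitions of r_normal and r_NE are not recoverable from this text, so the code assumes one plausible form: for each source word f, take the best p(e | f) over the target page's words, then average over the source words. The function names, the dictionary-based probability table, and the weight and threshold values are all hypothetical.

```python
def avg_translation_prob(src_words, tgt_words, p):
    """Assumed form: mean over source words f of max over target words e of p(e | f).

    `p` maps (e, f) pairs to translation probabilities; missing pairs count as 0.
    """
    if not src_words or not tgt_words:
        return 0.0
    total = sum(max(p.get((e, f), 0.0) for e in tgt_words) for f in src_words)
    return total / len(src_words)

def page_similarity(src_common, tgt_common, src_ne, tgt_ne, p, alpha_normal, alpha_ne):
    """Weighted sum of the common-word feature and the entity-word feature."""
    r_normal = avg_translation_prob(src_common, tgt_common, p)
    r_ne = avg_translation_prob(src_ne, tgt_ne, p)
    return alpha_normal * r_normal + alpha_ne * r_ne

# Hypothetical probabilities; f* are source words, e* are target words.
p = {("e_common", "f_common"): 0.5, ("e1", "f1"): 0.8, ("e2", "f2"): 0.6}
sim = page_similarity(
    src_common=["f_common"], tgt_common=["e_common"],
    src_ne=["f1", "f2"], tgt_ne=["e1", "e2"],
    p=p, alpha_normal=0.4, alpha_ne=0.6,
)
beta = 0.5  # in the described method this threshold is derived from a development set
print(sim, sim > beta)
```

In the described method the weights and the threshold are learned from a development set; the constants above only make the example concrete.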
With the device for acquiring a vocabulary translation according to the embodiment of the present invention, detecting the similarity between source-language web pages and target-language web pages makes it possible to effectively find multilingual web pages with similar content, and detecting the correlation and context similarity between the vocabulary to be translated and its translation further improves the accuracy of the translation obtained for a new word and improves the user experience.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the invention is defined by the claims and their equivalents.
Claims (10)
1. A method for acquiring a vocabulary translation, characterized by comprising:
obtaining a vocabulary to be translated, and generating first search results according to the vocabulary to be translated;
extracting, from the first search results and according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated, wherein the vocabulary to be translated and the at least one associated entity vocabulary belong to a first language;
generating a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein generating the search condition for the vocabulary to be translated according to the at least one associated entity vocabulary specifically comprises:
translating the at least one associated entity vocabulary to generate a translation corresponding to the at least one associated entity vocabulary; and
combining the translations corresponding to the at least one associated entity vocabulary to generate the search condition for the vocabulary to be translated, the search condition belonging to a second language;
performing a search according to the search condition to obtain second search results; and
extracting the translation corresponding to the vocabulary to be translated from the second search results.
2. The method as claimed in claim 1, characterized in that extracting, from the first search results and according to the vocabulary to be translated, the at least one associated entity vocabulary related to the vocabulary to be translated specifically comprises:
obtaining at least one source-language web page from the first search results, the at least one source-language web page belonging to the first language;
extracting vocabulary from the at least one source-language web page, and recording the number of occurrences; and
taking vocabulary whose number of occurrences exceeds a preset count threshold as the at least one associated entity vocabulary related to the vocabulary to be translated.
3. The method as claimed in claim 2, characterized by further comprising:
performing translation detection on the vocabulary to be translated and the translation corresponding to the vocabulary to be translated; and
if it is determined that the translation detection criterion is met, providing the translation corresponding to the vocabulary to be translated to the user.
4. The method as claimed in claim 3, characterized in that performing translation detection on the vocabulary to be translated and the translation corresponding to the vocabulary to be translated specifically comprises:
detecting the correlation between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated;
detecting the context similarity between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated; and
performing the translation detection according to the correlation and the context similarity.
5. The method as claimed in claim 3, characterized by further comprising, before performing translation detection on the vocabulary to be translated and the translation corresponding to the vocabulary to be translated:
obtaining at least one target-language web page from the second search results;
obtaining the similarity between the at least one source-language web page and the at least one target-language web page; and
performing translation detection according to the similarity between the at least one source-language web page and the at least one target-language web page.
6. A device for acquiring a vocabulary translation, characterized by comprising:
a to-be-translated vocabulary acquisition module, configured to obtain a vocabulary to be translated;
a first search module, configured to generate first search results according to the vocabulary to be translated;
an extraction module, configured to extract, from the first search results and according to the vocabulary to be translated, at least one associated entity vocabulary related to the vocabulary to be translated, wherein the vocabulary to be translated and the at least one associated entity vocabulary belong to a first language;
a search condition generation module, configured to generate a search condition for the vocabulary to be translated according to the at least one associated entity vocabulary, wherein the search condition generation module specifically comprises:
a translation submodule, configured to translate the at least one associated entity vocabulary to generate a translation corresponding to the at least one associated entity vocabulary; and
a combination submodule, configured to combine the translations corresponding to the at least one associated entity vocabulary to generate the search condition for the vocabulary to be translated, the search condition belonging to a second language;
a second search module, configured to perform a search according to the search condition to obtain second search results; and
a translation extraction module, configured to extract the translation corresponding to the vocabulary to be translated from the second search results.
7. The device as claimed in claim 6, characterized in that the extraction module obtains at least one source-language web page from the first search results, extracts vocabulary from the at least one source-language web page and records the number of occurrences, and takes vocabulary whose number of occurrences exceeds a preset count threshold as the at least one associated entity vocabulary related to the vocabulary to be translated, wherein the at least one source-language web page belongs to the first language.
8. The device as claimed in claim 7, characterized by further comprising:
a translation detection module, configured to perform translation detection on the vocabulary to be translated and the translation corresponding to the vocabulary to be translated, and to provide the translation corresponding to the vocabulary to be translated to the user when it is determined that the translation detection criterion is met.
9. The device as claimed in claim 8, characterized in that the translation detection module specifically comprises:
a correlation detection submodule, configured to detect the correlation between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated;
a similarity detection submodule, configured to detect the context similarity between the vocabulary to be translated and the translation corresponding to the vocabulary to be translated; and
a translation detection submodule, configured to perform the translation detection according to the correlation and the context similarity.
10. The device as claimed in claim 8, characterized by further comprising:
a web page similarity detection module, configured to obtain at least one target-language web page from the second search results, obtain the similarity between the at least one source-language web page and the at least one target-language web page, and perform translation detection according to the similarity between the at least one source-language web page and the at least one target-language web page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310745535.7A CN103729445B (en) | 2013-12-30 | 2013-12-30 | The acquisition methods and device of vocabulary translation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103729445A CN103729445A (en) | 2014-04-16 |
CN103729445B true CN103729445B (en) | 2017-04-05 |
Family
ID=50453519
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310745535.7A Active CN103729445B (en) | 2013-12-30 | 2013-12-30 | The acquisition methods and device of vocabulary translation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103729445B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970732B (en) * | 2014-05-22 | 2017-05-10 | 北京百度网讯科技有限公司 | Mining method and device of new word translation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012128419A1 (en) * | 2011-03-21 | 2012-09-27 | 주식회사 코난테크놀로지 | Search system and search method for providing integrated multimedia contents |
CN103136192A (en) * | 2011-11-30 | 2013-06-05 | 北京百度网讯科技有限公司 | Method and system of identifying translation demand |
CN103324680A (en) * | 2012-06-01 | 2013-09-25 | 微软公司 | Language learning opportunities and general search engine |
Also Published As
Publication number | Publication date |
---|---|
CN103729445A (en) | 2014-04-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||