CN108304412B

CN108304412B - Cross-language search method and device for cross-language search

Info

Publication number: CN108304412B
Application number: CN201710025472.6A
Authority: CN
Inventors: 翟飞飞; 张骏; 许静芳; 薛征山; 祝天刚; 于恒
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2017-01-13
Filing date: 2017-01-13
Publication date: 2022-09-30
Anticipated expiration: 2037-01-13
Also published as: CN108304412A

Abstract

The embodiment of the invention provides a cross-language searching method and device and a device for cross-language searching, wherein the method specifically comprises the following steps: acquiring a search word of a first language; obtaining a search result of a second language according to the search word; for each search result in the second language, performing the following steps: determining target translation models corresponding to all preset display parts of the search results; acquiring translation search results corresponding to all preset display parts of the search results by using the target translation model; and displaying the translation search results corresponding to the preset display parts of the search results to the user. The embodiment of the invention can improve the accuracy of the translation search result.

Description

Cross-language search method and device for cross-language search

Technical Field

The invention relates to the technical field of information search, in particular to a cross-language search method and device and a cross-language search device.

Background

With the continuous growth of internet information, users place higher demands on information search: they are no longer satisfied with searching a database in a single language and want to obtain data in multiple languages. For example, if the search term (query) input by the user is "search term a", a search of the Chinese database alone may not fully meet the user's needs, whereas English databases built from European and American websites may contain more and better search results.

Cross-language search technology combines information retrieval technology and machine translation technology. The existing implementation process of the cross-language search scheme may specifically include: firstly, a search word in a source language form is converted into a search word in a target language form through a machine translation technology, and then information retrieval is carried out in a corresponding single language database according to the search word in the source language form and the search word in the target language form respectively to obtain a multi-language search result, wherein the multi-language search result can comprise: search results in a source language and search results in a target language.

In order to meet the requirements of users who do not have the reading capability of the target language or have limited reading capability of the target language, the existing scheme can utilize a translation model to translate the search result of the target language to obtain the translated search result in the source language.

The inventor finds that the prior scheme at least has the following problems in the process of implementing the embodiment of the invention: in the existing scheme, a general translation model is usually adopted to translate the search result of the target language, and the accuracy of translating the search result is easily affected by the limitation of the general translation model, that is, the accuracy of translating the search result obtained in the existing scheme is low.

Disclosure of Invention

In view of the above problems, embodiments of the present invention have been made to provide a cross-language search method, a cross-language search apparatus, and an apparatus for cross-language search that overcome or at least partially solve the above problems, and can improve the accuracy of translation search results.

In order to solve the above problems, the present invention discloses a cross-language search method, comprising:

acquiring a search word of a first language;

obtaining a search result of a second language according to the search word;

for each search result in the second language, performing the following steps:

determining target translation models corresponding to all preset display parts of the search results;

acquiring translation search results corresponding to all preset display parts of the search results by using the target translation model;

and displaying the translation search results corresponding to the preset display parts of the search results to the user.

Optionally, the step of determining a target translation model corresponding to each preset display portion of the search result includes:

determining the display types corresponding to all preset display parts contained in the search results;

and acquiring target translation models corresponding to the preset display parts according to the display types.

Optionally, if the display type corresponding to the preset display part is a title class, the obtaining of the target translation model corresponding to each preset display part includes: obtaining a title translation model, wherein the title translation model is obtained by training according to title corpora;

and/or,

if the display type corresponding to the preset display part is an abstract type, the obtaining of the target translation model corresponding to each preset display part comprises: acquiring an abstract translation model, wherein the abstract translation model is obtained by training according to abstract corpora;

and/or,

if the display type corresponding to the preset display part is a page content type, the obtaining of the target translation model corresponding to each preset display part includes: acquiring a page content translation model, wherein the page content translation model is obtained by training according to preset page content corpora.

Optionally, if the preset display part is a title part, the step of obtaining, by using the target translation model, the translation search result corresponding to each preset display part of the search result includes:

identifying a preset symbol contained in the title part;

dividing the title part into a plurality of semantic units according to the preset symbols;

translating each semantic unit obtained by segmentation by using a first target translation model corresponding to the title part to obtain a translation result corresponding to each semantic unit;

combining the translation results corresponding to the semantic units according to the preset symbols to obtain a first translation search result corresponding to the title part; the first translated search result includes the preset symbol.

Optionally, the translating, by using the first target translation model corresponding to the title part, of each segmented semantic unit includes:

and respectively inputting each semantic unit and the corresponding context thereof into the first target translation model to obtain the translation result corresponding to each semantic unit output by the first target translation model.
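The title-handling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the symbol pattern, the function names, and the `translate_unit` callback (a stand-in for the first target translation model that receives a semantic unit plus the whole title as context) are all hypothetical.

```python
import re
from typing import Callable, List

# Assumed preset symbols: a hyphen surrounded by spaces, "|" and "…".
# Requiring spaces around "-" keeps hyphenated words in one semantic unit.
PRESET_SYMBOLS = re.compile(r"(\s+-\s+|\s*[|…]\s*)")


def split_title(title: str) -> List[str]:
    """Split a title into alternating semantic units and preset symbols."""
    # The capturing group makes re.split keep the separators in the result.
    return [piece for piece in PRESET_SYMBOLS.split(title) if piece != ""]


def translate_title(title: str, translate_unit: Callable[[str, str], str]) -> str:
    """Translate each semantic unit and rejoin the units with the original symbols.

    translate_unit(unit, context) is a hypothetical stand-in for the title
    translation model; the full title is passed as the context.
    """
    out = []
    for piece in split_title(title):
        if PRESET_SYMBOLS.fullmatch(piece):
            out.append(piece)  # preset symbols are carried over unchanged
        else:
            out.append(translate_unit(piece, title))
    return "".join(out)
```

Because the separators are preserved verbatim, the combined result contains the same preset symbols, in the same positions, as the original title.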

Optionally, if the preset display part is an abstract part, the step of obtaining, by using the target translation model, translation search results corresponding to each preset display part of the search results includes:

extracting target content located at a preset position from the abstract part;

and translating the target content by using a second target translation model corresponding to the preset position to obtain a corresponding second translation search result.

Optionally, the method further comprises: determining a target category to which the search result belongs;

the obtaining of the target translation model corresponding to each preset display part according to the display type comprises:

and acquiring a target translation model corresponding to each preset display part by combining the target category to which the search result belongs and the display type corresponding to each preset display part.
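One way to picture the combined lookup is a registry keyed by category and display type. The sketch below is illustrative only: the registry contents, the model names, and the fallback to a category-independent model are assumptions, not details given in the text.

```python
from typing import Dict, Optional, Tuple

# Hypothetical registry: (target category or None, display type) -> model name.
# A None category marks the model used when no category-specific one exists.
MODEL_REGISTRY: Dict[Tuple[Optional[str], str], str] = {
    (None, "title"): "title-general",
    (None, "abstract"): "abstract-general",
    (None, "page_content"): "page-content-general",
    ("medical", "title"): "title-medical",
    ("medical", "abstract"): "abstract-medical",
}


def select_model(display_type: str, category: Optional[str] = None) -> str:
    """Prefer a category-specific model; otherwise fall back by display type."""
    specific = MODEL_REGISTRY.get((category, display_type))
    if specific is not None:
        return specific
    # Assumed fallback: the model trained only on this display type.
    return MODEL_REGISTRY[(None, display_type)]
```

For example, `select_model("title", "medical")` picks the category-specific title model, while `select_model("page_content", "medical")` falls back to the general page content model.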

Optionally, the step of determining a target category to which the search result belongs includes:

respectively matching the content included in the search result with the dictionaries of all preset categories to obtain the matching rate corresponding to all preset categories;

and taking the preset category corresponding to the maximum one of the matching rates corresponding to all the preset categories as the target category to which the search result belongs.
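The dictionary-matching steps can be sketched as below. The text does not define how the matching rate is computed, so this example assumes it is the fraction of whitespace-separated tokens found in a category's dictionary; the function names and the tokenization are likewise hypothetical.

```python
from typing import Dict, List, Set


def matching_rate(tokens: List[str], dictionary: Set[str]) -> float:
    """Assumed definition: share of tokens that appear in the dictionary."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)


def target_category(text: str, dictionaries: Dict[str, Set[str]]) -> str:
    """Return the preset category whose dictionary has the highest matching rate."""
    tokens = text.lower().split()  # naive tokenization, for illustration only
    rates = {name: matching_rate(tokens, words) for name, words in dictionaries.items()}
    return max(rates, key=rates.get)
```

With dictionaries such as `{"medical": {"fever", "dose"}, "finance": {"stock", "bond"}}`, the text "Stock and bond prices" matches the finance dictionary at a higher rate and is assigned to that category.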

Optionally, the step of determining a target category to which the search result belongs includes:

inputting the content included in the search result into a classifier, and taking the classification result output by the classifier as a target class to which the search result belongs; the classifier is obtained by training according to search result samples of all preset categories.

In another aspect, the present invention discloses a cross-language searching device, comprising:

the search word acquisition module is used for acquiring a search word of a first language;

the search result acquisition module is used for acquiring a search result of a second language according to the search word;

the search result processing module is used for processing the search result of each second language;

the search result processing module comprises: a translation model determining module, a translation search result obtaining module, and a translation search result display module;

the translation model determining module is used for determining a target translation model corresponding to each preset display part of the search result aiming at the search result of each second language;

the translation search result acquisition module is used for acquiring translation search results corresponding to all preset display parts of the search results by using the target translation model; and

the translation search result display module is used for displaying the translation search results corresponding to the preset display parts of the search results to the user.

Optionally, the translation model determining module includes: the display type determining submodule and the translation model obtaining submodule;

the display type determining submodule is used for determining the display types corresponding to all preset display parts contained in the search results;

and the translation model acquisition submodule is used for acquiring target translation models corresponding to the preset display parts according to the display types.

Optionally, if the display type corresponding to the preset display part is a title class, the translation model obtaining sub-module includes: a first translation model acquisition unit;

the first translation model acquisition unit is used for acquiring a title translation model, and the title translation model is obtained by training according to title corpora;

and/or,

if the display type corresponding to the preset display part is an abstract type, the translation model obtaining submodule comprises: a second translation model acquisition unit;

the second translation model acquisition unit is used for acquiring an abstract translation model, and the abstract translation model is obtained by training according to abstract linguistic data;

and/or,

if the display type corresponding to the preset display part is a page content type, the translation model obtaining sub-module includes: a third translation model acquisition unit;

the third translation model obtaining unit is used for obtaining a page content translation model, and the page content translation model is obtained by training according to preset page content corpora.

Optionally, if the preset display part is a title part, the translation search result obtaining module includes: an identification submodule, a segmentation submodule, a first translation submodule and a combination submodule;

the identification submodule is used for identifying preset symbols contained in the title part;

the segmentation submodule is used for dividing the title part into a plurality of semantic units according to the preset symbols;

the first translation submodule is used for translating each segmented semantic unit by using a first target translation model corresponding to the title part to obtain a translation result corresponding to each semantic unit;

the combination submodule is used for combining the translation results corresponding to the semantic units according to the preset symbols so as to obtain a first translation search result corresponding to the title part; the first translated search result includes the preset symbol.

Optionally, the first translation submodule includes: a translation unit;

the translation unit is configured to input each semantic unit and the context corresponding to the semantic unit to the first target translation model, so as to obtain a translation result corresponding to each semantic unit output by the first target translation model.

Optionally, if the preset display part is an abstract part, the translation search result obtaining module includes: an extraction submodule and a second translation submodule;

the extraction submodule is used for extracting target content located at a preset position from the abstract part;

and the second translation submodule is used for translating the target content by using a second target translation model corresponding to the preset position to obtain a corresponding second translation search result.

Optionally, the apparatus further comprises: a category determination module;

the category determining module is used for determining a target category to which the search result belongs;

the translation model acquisition sub-module comprises: a model acquisition unit;

and the model acquisition unit is used for acquiring the target translation model corresponding to each preset display part by combining the target category to which the search result belongs and the display type corresponding to each preset display part.

Optionally, the category determining module includes: a matching submodule and a determining submodule;

the matching submodule is used for respectively matching the contents included in the search results with the dictionaries of all the preset categories so as to obtain the matching rate corresponding to each preset category;

the determining submodule is configured to use the preset category corresponding to the largest one of the matching rates corresponding to all the preset categories as the target category to which the search result belongs.

Optionally, the category determination module includes: a classification submodule;

the classification submodule is used for inputting the content included in the search result into a classifier and taking the classification result output by the classifier as the target class to which the search result belongs; the classifier is obtained by training according to search result samples of all preset categories.

In yet another aspect, the present invention discloses an apparatus for cross-language searching, including a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:

acquiring a search word of a first language;

obtaining a search result of a second language according to the search word;

and for each search result in the second language, performing the following steps:

The embodiment of the invention has the following advantages:

in the translation process of the search result of the second language of the cross-language search, the target translation model corresponding to each preset display part of the search result can be determined firstly, and then the translation search result corresponding to the preset display part of the search result is obtained by using the target translation model; therefore, the target translation model can be a translation model matched with each preset display part, namely, the target translation model can translate from the second language to the first language according to the characteristics of each preset display part, and therefore the accuracy of translation search results can be improved.

Drawings

FIG. 1 is a schematic diagram of an application environment for a cross-language search method of the present invention;

FIG. 2 is a flowchart illustrating the steps of a first embodiment of a cross-language search method according to the present invention;

FIG. 3 is a block diagram of an embodiment of a cross-language search apparatus according to the present invention;

FIG. 4 is a block diagram of an apparatus 900 for cross-language search according to the present invention as a terminal; and

FIG. 5 is a schematic structural diagram of an apparatus for cross-language search according to the present invention as a server.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

In the embodiment of the invention, machine translation is regarded as an information transmission process and is explained using a channel model. The idea is that translating a source language sentence into a target language sentence is a probability problem: any target language sentence may be a translation of any source language sentence, only with a different probability, and the task of machine translation is to find the sentence with the highest probability. Concretely, translation is treated as a decoding process carried out through a model. Building a translation model can therefore be divided into three problems: the model problem, the training problem, and the decoding problem. The model problem is to establish a probabilistic model for machine translation, that is, to define how the translation probability from a source language sentence to a target language sentence is calculated. The training problem is to use a corpus to estimate all the parameters of the model. The decoding problem is to find, for any input source language sentence, the translation with the highest probability on the basis of the known model and parameters.
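As an illustration only, and assuming that the "channel model" here refers to the standard noisy-channel formulation of statistical machine translation, the decoding problem can be written as follows. The symbols are introduced for this sketch and do not appear in the text: f is the input source language sentence, e ranges over candidate target language sentences, P(f | e) is the translation model, and P(e) is a language model over the target language.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

The last step holds because P(f) does not depend on e. Under this reading, the model problem defines P(f | e) and P(e), the training problem estimates their parameters from a corpus, and the decoding problem performs the arg max.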

The inventor finds that, in the process of implementing the embodiment of the invention, the existing scheme generally adopts a general translation model to translate the search result of the target language, and for the general translation model, as long as the input text content is the same, the same translation search result can be obtained. However, different types of search results usually have their own characteristics, and thus, translating all types of search results by using a common translation model easily affects the accuracy of the translated search results, that is, the accuracy of the translated search results obtained in the existing scheme is low.

Aiming at the technical problem that the accuracy of a translation search result is low in the existing scheme, the embodiment of the invention provides a cross-language search scheme which can obtain a search word of a first language; obtaining a search result of a second language according to the search word; determining a target translation model corresponding to a preset display part of the search result aiming at the search result of each second language; acquiring a translation search result corresponding to a preset display part of the search result by using the target translation model; and further displaying the translation search result corresponding to the preset display part of the search result to the user. According to the embodiment of the invention, the target translation model corresponding to each preset display part of the search result can be determined firstly, and then the translation search result corresponding to the preset display part of the search result is obtained by using the target translation model; therefore, the target translation model can be a translation model matched with each preset display part, namely, the target translation model can translate from the second language to the first language according to the characteristics of each preset display part, and therefore the accuracy of translation search results can be improved.

In the embodiment of the present invention, the search word in the first language may first be translated into a search word in the second language, and retrieval may then be performed in a database of the second language according to the search word in the second language, so as to obtain the search result in the second language. Therefore, the search result in the second language may be used to represent a search result corresponding to the search word in the second language, and the translated search result may be used to represent the result, in the first language, of translating the search result in the second language. The search result in the second language and the translated search result in the first language may correspond to the same search result (e.g., a web page, a video, a picture, music, etc.); one difference between the two is the language in which they are expressed.

In an application example of the present invention, if the search word in the first language is "search word a" and the corresponding search word in the second language is "translation of search word a", retrieval may be performed in an English database according to "translation of search word a" to obtain English search results, and each preset display part of each search result may be translated using the target translation model corresponding to that part to obtain the corresponding translation search result.

The embodiment of the invention can be applied to platform environments with cross-language search functions, such as search APP, search websites (such as a search engine) and the like, can provide search results from a multi-language database for users, and can provide more accurate translation search results for users so as to meet the requirements of users without target language reading capability or limited target language reading capability. The embodiment of the invention mainly takes the search APP as an example to explain the cross-language search method of the embodiment of the invention, and the cross-language search methods corresponding to other platforms such as a search website can be mutually referred.

The cross-language search method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1, as shown in fig. 1, the client 100 and the server 200 are located in a wired or wireless network, and the client 100 and the server 200 perform data interaction through the wired or wireless network.

The cross-language search process of the embodiment of the present invention may be performed by any one or a combination of the client 100 and the server 200:

for example, the client 100 may receive a search word in a first language input by a user and send the search word in the first language to the server 200; after receiving the search word in the first language, the server 200 may obtain search results in the second language according to the search word, and determine, for each search result in the second language, a target translation model corresponding to each preset display part of the search result; the server 200 may then acquire translation search results corresponding to each preset display part of the search results by using the target translation model, and send the translation search results corresponding to each preset display part to the client 100, so that the client 100 displays the translation search results corresponding to each preset display part of the search results to the user.

Because the process of obtaining the search result and/or the translation search result in the second language is executed by the server 200, the advantage of rich computing resources of the server 200 can be exerted, and the efficiency and the accuracy of obtaining the search result and/or the translation search result in the second language can be improved. For example, the cloud server may be deployed with a plurality of highly configured computing devices, so that the computing devices are utilized to obtain the search result and/or the translation search result in the second language, so as to improve the obtaining efficiency and the obtaining accuracy of the search result and/or the translation search result in the second language; meanwhile, the computing resources at the client 100 side can be saved, and the performance of the intelligent terminal corresponding to the client 100 is improved.

Of course, the process of obtaining the search result and/or the translated search result in the second language may also be executed by the client 100, and the specific execution subject of the process of obtaining the search result and/or the translated search result in the second language is not limited in the embodiment of the present invention.

Optionally, the client 100 may be run on an intelligent terminal, and the intelligent terminal specifically includes but is not limited to: smart phones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like.

Method embodiment one

Referring to fig. 2, a flowchart illustrating steps of a first embodiment of a cross-language search method according to the present invention is shown, which may specifically include the following steps:

step 201, obtaining a search term of a first language;

step 202, obtaining a search result of a second language according to the search word;

for each search result in the second language, performing the following steps:

step 203, determining a target translation model corresponding to each preset display part of the search result;

step 204, acquiring translation search results corresponding to each preset display part of the search results by using the target translation model;

step 205, displaying the translation search results corresponding to each preset display part of the search results to the user.
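Steps 201 to 205 can be sketched end to end as follows. This is a minimal illustration, not the implementation described by the patent: `SearchResult`, `cross_language_search`, and the injected callables (`translate_query`, `search`, and the per-display-type `models`) are hypothetical stand-ins for the query translation, retrieval, and target translation models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Translator = Callable[[str], str]


@dataclass
class SearchResult:
    # Maps a display type (e.g. "title", "abstract") to second-language text.
    parts: Dict[str, str]


def cross_language_search(
    query: str,                                    # step 201: first-language search word
    translate_query: Translator,
    search: Callable[[str], List[SearchResult]],
    models: Dict[str, Translator],
) -> List[Dict[str, str]]:
    second_language_query = translate_query(query)
    results = search(second_language_query)        # step 202
    translated: List[Dict[str, str]] = []
    for result in results:
        translated_parts: Dict[str, str] = {}
        for display_type, text in result.parts.items():
            model = models[display_type]           # step 203: pick the target model
            translated_parts[display_type] = model(text)  # step 204: translate the part
        translated.append(translated_parts)
    return translated                              # step 205: handed to the UI for display
```

Passing the models in as a dictionary keeps the selection step explicit: each preset display part is translated only by the model registered for its display type.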

In the embodiment of the present invention, the search term in the first language may be input by the user in the first language. In practical applications, a client of the search APP or the search website may provide a UI (User Interface), and a User may submit a search term of the first language to the client through a search box, a voice Interface, and the like on the UI. No matter how the user submits the search term in the first language to the client, the client may display the received search term in the first language in the search box. Therefore, in the embodiment of the present invention, the search term in the first language input by the user may include: the user submits the search word in the first language to the client in any mode. It is to be understood that the embodiment of the present invention does not limit the specific manner of obtaining the search term in the first language.

In the embodiment of the present invention, the first language and the second language may be used to represent two different languages, and the first language and the second language may be preset by a user, or may be obtained by a search APP or a search website analyzing the search behavior and/or browsing behavior of the user. Alternatively, the search APP or the search website may take the language most frequently used by the user as the first language and the languages used other than the first language as the second language. For example, if the search behavior of the user indicates that the search terms previously used by the user are all Chinese search terms, the first language can be determined to be Chinese; if the browsing behavior of the user also indicates that the user has accessed a translation website and has translated between Chinese and English through the translation website, it can be determined that the second language is English. It is understood that the number of second languages in the embodiments of the present invention may be one or more; for example, for a user whose native language is Chinese, the first language may be Chinese, and the second language may be one or a combination of English, Japanese, Korean, German and French. The cross-language search method of the embodiment of the invention is mainly described by taking Chinese as the first language and English as the second language as an example, and the cross-language search methods corresponding to other first languages and second languages can be understood by mutual reference.
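The frequency-based option described above can be sketched as follows. The function name and the input format (a list of language codes, one per observed query or page visit) are assumptions made for illustration; how usage is actually recorded is not specified in the text.

```python
from collections import Counter
from typing import List, Tuple


def infer_languages(used_languages: List[str]) -> Tuple[str, List[str]]:
    """Take the most frequently used language as the first language and
    every other observed language as a candidate second language."""
    if not used_languages:
        raise ValueError("no language usage observed")
    counts = Counter(used_languages)
    first_language = counts.most_common(1)[0][0]
    second_languages = [lang for lang in counts if lang != first_language]
    return first_language, second_languages
```

For a history such as `["zh", "zh", "en", "zh", "ja"]`, Chinese is taken as the first language and English and Japanese as second languages.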

In practical applications, step 202 may be implemented by the client or the server translating the search word in the first language into a search word in the second language and then performing a search in a database of the second language according to the search word in the second language, so as to obtain the search result in the second language. Taking English as the second language for example, data from European and American websites may be stored in the English database. It can be understood that, in the embodiment of the present invention, the specific manner of obtaining the search result in the second language according to the search word is not limited.

Alternatively, a plurality of different translation results may be obtained during the translation of the search term in the first language into the search term in the second language, in which case the translation result with the highest confidence may be selected from the plurality of different translation results. Further, a search result of a second language can be obtained according to the translation result with the highest confidence; or respectively searching according to one or more of the multiple different translation results, and taking the results obtained by searching as the search results of the second language. In an application example of the present invention, if the search word in the first language is "search word a", the search word in the second language may be "translation of search word a".
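The two options for handling multiple query translations can be sketched as below. The candidate format (a translation paired with a confidence score) and the function names are hypothetical; the text does not say how confidence is computed or represented.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (translated search word, confidence score)


def best_translation(candidates: List[Candidate]) -> str:
    """Return the translation with the highest confidence."""
    if not candidates:
        raise ValueError("no translation candidates")
    return max(candidates, key=lambda candidate: candidate[1])[0]


def top_translations(candidates: List[Candidate], k: int) -> List[str]:
    """Return up to k translations, ordered by descending confidence,
    for the variant that searches with several translations."""
    ranked = sorted(candidates, key=lambda candidate: candidate[1], reverse=True)
    return [text for text, _ in ranked[:k]]
```

Searching with only the best candidate keeps retrieval cost low, while searching with several candidates trades extra retrieval work for coverage when the query is ambiguous.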

Step 203 may determine, for each search result obtained in step 202, a target translation model corresponding to each preset display portion of the search result; step 204 may obtain the translation search results corresponding to each preset display part of the search results by using the target translation model obtained in step 203. Therefore, the target translation model can be a translation model matched with each preset display part, namely, the target translation model can translate from the second language to the first language according to the characteristics of each preset display part, and therefore the accuracy of translation search results can be improved.

In an optional embodiment of the present invention, the step of determining the target translation model corresponding to each preset presentation part of the search result may include: determining the display types corresponding to all preset display parts contained in the search results; and acquiring a target translation model corresponding to each preset display part according to the display type. The display type can reflect the characteristics of the preset display parts, so that the target translation model matched with each preset display part can be obtained according to the display type of each preset display part, and the accuracy of translating the search result can be improved.

In this embodiment of the present invention, the preset display part may be used to represent display contents preset for the search result, and the embodiment of the present invention may provide the following scheme for acquiring the target translation model corresponding to each preset display part, for the preset display part included in the search result and the display type corresponding to the preset display part:

Acquisition scheme 1,

In acquisition scheme 1, the preset display part may include: a title portion; the display type corresponding to the title portion may include: a title class; the target translation model corresponding to the title portion may include: a title translation model; the title translation model may be obtained by training according to a title corpus.

The title portion of a search result usually has its own features: for example, it is usually expressed as a short sentence or a phrase, or it usually contains special preset symbols such as "-", "|" and "…". Therefore, the embodiment of the present invention may obtain in advance a title corpus derived from search results; optionally, the title corpus may be a bilingual corpus or an aligned corpus (i.e., words that translate into each other within a bilingual sentence pair are paired). A title translation model is then obtained by training according to the title corpus. Because the title corpus shares the features of the title portion, the title translation model trained on it can take into account features such as the short-sentence or phrase form and the preset symbols, so that a more accurate translation search result can be obtained for the title portion.

Acquisition scheme 2,

In acquisition scheme 2, the preset display part may include: an abstract part; the display type corresponding to the abstract part may include: an abstract class; the target translation model corresponding to the abstract part may include: an abstract translation model; the abstract translation model may be obtained by training according to an abstract corpus.

The abstract part of a search result usually has its own features: for example, it is usually expressed as a long sentence, or a specific type of content appears at a specific position (relatively fixed content, such as time or information source, appears at the beginning of the abstract). Therefore, the embodiment of the present invention may obtain in advance an abstract corpus derived from search results; optionally, the abstract corpus may be a bilingual corpus or an aligned corpus. An abstract translation model is then obtained by training according to the abstract corpus. Because the abstract corpus shares the features of the abstract part, the abstract translation model trained on it can take into account the long-sentence form and the fact that specific types of content appear at specific positions, so that a more accurate translation search result can be obtained for the abstract part.

Acquisition scheme 3,

In acquisition scheme 3, the preset display part may include: a page content part; the display type corresponding to the page content part may include: a page content class; the target translation model corresponding to the page content part may include: a page content translation model; the page content translation model may be obtained by training according to a preset page content corpus.

In addition to the title portion and the summary portion, some websites may also set a page content portion in the search results to allow the user to obtain more accurate information about the website through the page content portion. For example, an e-commerce website may set a page content portion in search results that may be used to present a promotional activity to catch the user's eye through the promotional activity. As another example, a news website may set a page content portion in the search results that may be used to present trending news events to catch the user's eyes through the trending news events.

The page content part set by a website usually has features related to the website itself; for example, the page content of an e-commerce website is usually related to merchandise, and the page content of a news website is usually related to news. Therefore, the embodiment of the present invention may obtain in advance a preset page content corpus, which is a corpus derived from search results; optionally, the preset page content corpus may be a bilingual corpus or an aligned corpus. A page content translation model is then obtained by training according to the preset page content corpus. Because the preset page content corpus shares the features of the page content part, the page content translation model trained on it can take those features into account, so that a more accurate translation search result can be obtained for the page content part.

The process of acquiring the target translation model corresponding to each preset display portion is described in detail through the acquiring scheme 1 to the acquiring scheme 3, and it can be understood that a person skilled in the art may adopt any one or a combination of any several of the acquiring scheme 1 to the acquiring scheme 3 according to an actual application requirement, or may also adopt other acquiring schemes for other preset display portions.
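The selection of a target translation model by display type in acquisition schemes 1 to 3 can be sketched as a simple lookup. In this sketch the display type keys ("title", "abstract", "page_content"), the registry `MODELS_BY_DISPLAY_TYPE` and the stub models are all hypothetical stand-ins; a real system would load trained translation models instead.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def make_stub_model(name: str) -> Translator:
    """Build a placeholder 'model' that only tags its input (assumption for illustration)."""
    return lambda text: f"[{name}] {text}"


# Hypothetical registry: one translation model per display type.
MODELS_BY_DISPLAY_TYPE: Dict[str, Translator] = {
    "title": make_stub_model("title-model"),
    "abstract": make_stub_model("abstract-model"),
    "page_content": make_stub_model("page-content-model"),
}


def select_target_model(display_type: str) -> Translator:
    """Return the translation model registered for the given display type."""
    try:
        return MODELS_BY_DISPLAY_TYPE[display_type]
    except KeyError:
        raise ValueError(f"no translation model for display type {display_type!r}") from None


def translate_result(result: Dict[str, str]) -> Dict[str, str]:
    """Translate each preset display part with the model matching its display type."""
    return {part: select_target_model(part)(text) for part, text in result.items()}
```

Other preset display parts could be supported by adding further entries to the registry.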

In an optional embodiment of the present invention, the title translation model, the abstract translation model, and the page content translation model corresponding to the obtaining schemes 1 to 3 may be further optimized according to the target category to which the search result belongs, so as to further improve the accuracy of translating the search result. Accordingly, the method may further comprise: and determining the target category to which the search result belongs. The obtaining of the target translation model corresponding to each preset display part according to the display type corresponding to each preset display part may include: and acquiring a target translation model corresponding to each preset display part by combining the target category to which the search result belongs and the display type corresponding to each preset display part.

Specifically, the obtaining of the target translation model corresponding to each preset display part by combining the target category to which the search result belongs and the display type corresponding to each preset display part may include: if the preset display part is a title part and the corresponding display type is a title class, the title translation model corresponding to the title part may include: a title translation model corresponding to the target category; the title translation model corresponding to the target category is obtained by training according to the title corpus in the target category;

and/or,

if the preset display part is an abstract part and the corresponding display type is an abstract class, the abstract translation model corresponding to the abstract part may include: the abstract translation model corresponding to the target category; the abstract translation model corresponding to the target category is obtained by training according to the abstract linguistic data in the target category;

and/or,

if the preset display part is a page content part and the corresponding display type is a page content class, the page content translation model corresponding to the page content part may include: a page content translation model corresponding to the target category; the page content translation model corresponding to the target category is obtained by training according to the preset page content corpus in the target category.

Optionally, the target categories may include: e-commerce, forums, news, novels, videos, etc. The title corpora, abstract corpora, and preset page content corpora within a target category may be collected from the search results of that target category.
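Combining the target category with the display type can be sketched as a lookup keyed by both values. The table `MODEL_NAMES`, the model names, and in particular the fallback to a category-independent model when no category-specific model exists are assumptions made for this sketch; the embodiment does not describe a fallback.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: (target category or None, display type) -> model name.
# A key whose category is None denotes a category-independent model.
MODEL_NAMES: Dict[Tuple[Optional[str], str], str] = {
    (None, "title"): "title-generic",
    (None, "abstract"): "abstract-generic",
    ("news", "title"): "title-news",
    ("forum", "abstract"): "abstract-forum",
}


def select_model_name(category: Optional[str], display_type: str) -> str:
    """Prefer the model trained for this category; otherwise fall back (assumed behavior)."""
    specific = (category, display_type)
    if specific in MODEL_NAMES:
        return MODEL_NAMES[specific]
    generic = (None, display_type)
    if generic in MODEL_NAMES:
        return MODEL_NAMES[generic]
    raise KeyError(f"no model for category={category!r}, display_type={display_type!r}")
```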

In an optional embodiment of the present invention, the step of determining the target category to which the search result belongs may include: respectively matching the content included in the search result with the dictionaries of all preset categories to obtain the matching rate corresponding to all preset categories; and taking the preset category corresponding to the maximum one of the matching rates corresponding to all the preset categories as the target category to which the search result belongs. The content included in the search result may be content included in a webpage corresponding to the search result (that is, webpage content), or may be content included in a preset display portion of the search result.

Optionally, the process of obtaining the matching rate may include: performing word segmentation on the preset display part and/or the webpage content included in the search result, counting the number N of all words and the number M of words that appear in the dictionary of the preset category, and taking the ratio of M to N as the matching rate.
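The dictionary-based determination of the target category can be sketched as follows. The regular-expression tokenizer is a stand-in assumption for a real word segmenter (text in languages written without spaces would need one), and the function names are illustrative only.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Very rough word segmentation: lowercase alphanumeric runs (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matching_rate(text: str, dictionary: Set[str]) -> float:
    """Return M / N, where N is the word count and M the count of words in the dictionary."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    matched = sum(1 for token in tokens if token in dictionary)
    return matched / len(tokens)


def target_category(text: str, dictionaries: Dict[str, Set[str]]) -> str:
    """Return the preset category whose dictionary gives the largest matching rate."""
    if not dictionaries:
        raise ValueError("no preset categories")
    rates = {name: matching_rate(text, words) for name, words in dictionaries.items()}
    return max(rates, key=rates.get)
```

For instance, with an e-commerce dictionary containing "sale" and "price", the text "Big sale today price drop" has N = 5 and M = 2, giving a matching rate of 0.4 for that category.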

In an optional embodiment of the present invention, the step of determining the target category to which the search result belongs may include: inputting the content included in the search result into a classifier, and taking the classification result output by the classifier as a target class to which the search result belongs; the classifier is obtained by training according to search result samples of all preset categories. The classifier can be used for judging which preset category the search result belongs to, that is, the result output by the classifier is also the target category the search result belongs to.

It should be noted that the various translation models and classifiers of the embodiments of the present invention can be obtained by training with machine learning methods. In addition, the embodiment of the present invention does not limit the specific types of the translation models or classifiers; for example, the types of translation model may include NMT (Neural Machine Translation), SMT (Statistical Machine Translation), and the like, and the types of classifier may include SVM (Support Vector Machine), Bayes classifiers, and the like.
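As one possible instance of a classifier trained on search result samples of the preset categories, a small multinomial naive Bayes classifier with add-one smoothing is sketched below. The class name, the tokenizer and the training samples are hypothetical; the embodiment does not prescribe this or any other particular classifier.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Rough word segmentation: lowercase alphanumeric runs (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesCategoryClassifier:
    """Multinomial naive Bayes over words, with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        self.vocabulary: Set[str] = set()

    def fit(self, samples: Iterable[Tuple[str, str]]) -> None:
        """Train on (search result text, preset category) samples."""
        for text, category in samples:
            tokens = tokenize(text)
            self.doc_counts[category] += 1
            self.word_counts[category].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text: str) -> str:
        """Return the preset category with the highest log-probability score."""
        if not self.doc_counts:
            raise ValueError("classifier has not been trained")
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = max(len(self.vocabulary), 1)
        best_category, best_score = "", -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category
```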

Step 204 may obtain a translation search result corresponding to the preset display part of the search result by using the target translation model obtained in step 203. In an optional embodiment of the present invention, a corresponding translation rule may be preset according to a characteristic of the preset presentation part, and the translation model may be intelligently utilized according to the translation rule, so as to obtain a more accurate translation search result.

The embodiment of the invention can provide the following translation scheme for obtaining the translation search results corresponding to each preset display part of the search results by using the target translation model:

Translation scheme 1,

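In translation scheme 1, the preset display part may include a title portion, and the step of obtaining, by using the target translation model, the translation search result corresponding to each preset display part of the search result may include:

identifying a preset symbol included in the title portion;

segmenting the title portion according to the preset symbol to obtain the semantic units of each part;

translating each semantic unit obtained by segmentation by using a first target translation model corresponding to the title portion, to obtain a translation result corresponding to each semantic unit;

combining the translation results corresponding to the semantic units according to the preset symbol, to obtain a first translation search result corresponding to the title portion, wherein the first translation search result may include the preset symbol.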

The first target translation model corresponding to the title portion may be the title translation model described above, or may be another translation model corresponding to the title portion. The semantic units may be any of characters, words, phrases, and the like.

In practical applications, the title portion usually contains special preset symbols "-", "|", "…", etc., and the embodiment of the present invention may preset the corresponding translation rule for the preset symbol of the title portion, and intelligently utilize the translation model using the translation rule to obtain a more accurate translation search result. Specifically, in the process of translating by using the first target translation model corresponding to the title part, the semantic units on two sides of the preset symbols are translated separately, then the translation results corresponding to the semantic units of all parts are combined, and the preset symbols and the relative positions between the semantic units on two sides of the preset symbols are kept in the combined first translation search result, so that the accuracy of the first translation search result corresponding to the title part can be improved.

In an optional embodiment of the present invention, in order to prevent separate translation from breaking up a phrase or sentence, the step of translating the semantic units of each part by using the first target translation model corresponding to the title portion may include: inputting each semantic unit together with its corresponding context into the first target translation model, to obtain the translation result corresponding to each semantic unit output by the first target translation model. Because the context is taken into account when each semantic unit is translated separately, the integrity and global consistency of the first translation search result can be ensured.
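The handling of preset symbols in the title portion can be sketched as follows: the title is split at the preset symbols, each semantic unit is translated together with its context, and the results are recombined with the symbols kept in place. The pattern `PRESET_SYMBOL_PATTERN` (which treats a hyphen as a separator only when it is surrounded by whitespace), the use of the whole title as the context, and the `translate_unit` callable are assumptions of this sketch.

```python
import re
from typing import Callable, List

# Assumed separators: "|" or "…" with optional spaces, or a hyphen between spaces.
# The capturing group makes re.split keep the separators in its output.
PRESET_SYMBOL_PATTERN = re.compile(r"(\s*[|…]\s*|\s+-\s+)")


def translate_title(title: str, translate_unit: Callable[[str, str], str]) -> str:
    """Translate each semantic unit separately and keep preset symbols in position.

    translate_unit(unit, context) stands in for the first target translation
    model; the full title is passed as the context of every unit.
    """
    pieces: List[str] = PRESET_SYMBOL_PATTERN.split(title)
    combined: List[str] = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            # Odd positions hold the captured preset symbols: copy them unchanged.
            combined.append(piece)
        elif piece:
            combined.append(translate_unit(piece, title))
    return "".join(combined)
```

Because the separators are copied through unchanged, the relative positions of the preset symbols and the semantic units on either side of them are preserved in the combined result.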

Translation scheme 2,

In translation scheme 2, the preset display part may include an abstract part, and the step of obtaining, by using the target translation model, the translation search result corresponding to each preset display part of the search result may include:

extracting target content located at a preset position from the abstract part;

The embodiment of the invention finds the following characteristics of the abstract part: a particular type of content appears at a particular location. For example, relatively fixed content, such as time, information sources, etc., may appear at the beginning of the summary. The following examples of the abstract section are given here:

Example 1, 44 replies - posting time: April 15, 2014

Example 2, 28 minutes ago - MOSCOW, Jan. 11 (Xinhua) -- The Kremlin on Wednesday condensed that has compounded materials on U.S. President-elect Donald search term A translation

Example 1 is the abstract part of a forum-category search result. The contents "44 replies" and "posting time: April 15, 2014" appearing at its beginning position indicate, respectively, the number of replies and the posting time of the post-type search result; the number of replies and the posting time are characteristic of the abstract part of forum-category search results.

Example 2 is the abstract part of a news-category search result. The contents "28 minutes ago" and "MOSCOW, Jan. 11 (Xinhua)" appearing at its beginning position indicate, respectively, the difference between the publication time of the news-type search result and the current time, and the publication date and information source of the news-type search result; these items are characteristic of the abstract part of news-category search results.

It can be understood that examples 1 and 2 above concern search results of the forum category and the news category only; in practice, the abstract parts of search results of other categories also share the feature that a specific type of content appears at a specific position. Therefore, the embodiment of the present invention can exploit this feature by training a corresponding second target translation model for the preset position, so that during translation the target content located at the preset position can be extracted from the abstract part and translated by using the second target translation model corresponding to the preset position, to obtain a corresponding second translation search result. The second target translation model can be obtained by training according to a preset content corpus corresponding to the preset position and thus matches the features of that corpus, so that a more accurate translation search result can be obtained for the target content located at the preset position.

It should be noted that the second target translation model of the embodiment of the present invention may be a translation model corresponding to both the target category and the preset position; that is, the second target translation model may be trained according to the preset content corpus corresponding to the preset position within the target category.
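Extracting the target content at a preset position and translating it with a position-specific model can be sketched as follows. The regular expressions in `LEADING_PATTERNS` only cover English forms like those in examples 1 and 2 and are assumptions, as are the two translator callables standing in for the second target translation model and the abstract translation model.

```python
import re
from typing import Callable, Tuple

# Hypothetical patterns for content expected at the beginning of an abstract.
LEADING_PATTERNS = [
    re.compile(r"^\d+\s+(?:minutes?|hours?|days?)\s+ago\b"),
    re.compile(r"^\d+\s+repl(?:y|ies)\b"),
]


def split_abstract(abstract: str) -> Tuple[str, str]:
    """Split an abstract into (target content at the beginning, remaining text)."""
    for pattern in LEADING_PATTERNS:
        match = pattern.match(abstract)
        if match:
            return abstract[: match.end()], abstract[match.end():]
    return "", abstract


def translate_abstract(
    abstract: str,
    translate_leading: Callable[[str], str],
    translate_body: Callable[[str], str],
) -> str:
    """Translate the leading target content and the rest with separate models."""
    leading, rest = split_abstract(abstract)
    translated_leading = translate_leading(leading) if leading else ""
    translated_rest = translate_body(rest) if rest else ""
    return translated_leading + translated_rest
```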

After obtaining the translation search results corresponding to each preset display part of the search results by using the target translation model in step 204, step 205 may display the translation search results corresponding to each preset display part of the search results to the user, where the client may display the translation search results corresponding to one or more preset display parts of the search results.

To sum up, in the translation process of the second language search result of the cross-language search, the cross-language search method of the embodiment of the present invention may first determine a target translation model corresponding to each preset display portion of the search result, and then obtain a translation search result corresponding to the preset display portion of the search result by using the target translation model; in this way, the target translation model may be a translation model adapted to each preset display portion, that is, the target translation model may perform translation from the second language to the first language according to the characteristics of each preset display portion, so that the accuracy of translating the search result may be improved.
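The overall flow summarized above can be sketched end to end. All four callables passed to `cross_language_search` are hypothetical placeholders for the query translation, the second-language search and the model selection; a search result is represented, as an assumption, by a mapping from display part name to second-language text.

```python
from typing import Callable, Dict, Iterable, List

SearchResult = Dict[str, str]      # display part name -> text of that part
Translator = Callable[[str], str]


def cross_language_search(
    query: str,
    translate_query: Translator,
    search: Callable[[str], Iterable[SearchResult]],
    select_model: Callable[[str], Translator],
) -> List[SearchResult]:
    """Search in the second language and translate each display part back."""
    second_language_query = translate_query(query)
    translated_results: List[SearchResult] = []
    for result in search(second_language_query):
        # Each preset display part is translated by the model chosen for that part.
        translated = {part: select_model(part)(text) for part, text in result.items()}
        translated_results.append(translated)
    return translated_results
```

The returned list is what a client would then display to the user.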

It should be noted that, for simplicity of description, the method embodiments are described as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Further, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

Device embodiment

Referring to fig. 3, a block diagram of a cross-language search apparatus according to an embodiment of the present invention is shown, which may specifically include: a search term acquisition module 301, a search result acquisition module 302, and a search result processing module 303;

the search word obtaining module 301 is configured to obtain a search word in a first language;

the search result obtaining module 302 is configured to obtain a search result in the second language according to the search word;

the search result processing module 303 is configured to process the search result of each second language;

the search result processing module 303 may include: the translation model determining module 3031, the translation search result obtaining module 3032 and the translation search result displaying module 3033;

the translation model determining module 3031 is configured to determine a target translation model corresponding to each preset display portion of the search result;

the translation search result obtaining module 3032 is configured to obtain, by using the target translation model, a translation search result corresponding to each preset display part of the search result; and

the translation search result display module 3033 is configured to display, to the user, the translation search result corresponding to each preset display part of the search result.

Optionally, the translation model determining module 3031 may include: the display type determining submodule and the translation model obtaining submodule;

and the translation model acquisition submodule is used for acquiring a target translation model corresponding to each preset display part according to the display type.

Optionally, if the display type corresponding to the preset display part is a title class, the translation model obtaining sub-module may include: a first translation model acquisition unit;

the first translation model acquisition unit is used for acquiring a title translation model, wherein the title translation model is obtained by training according to title corpora;

and/or,

if the display type corresponding to the preset display part is an abstract type, the translation model obtaining sub-module may include: a second translation model acquisition unit;

the second translation model acquisition unit is used for acquiring an abstract translation model, and the abstract translation model is obtained by training according to abstract corpora;

and/or,

if the display type corresponding to the preset display part is a page content type, the translation model obtaining sub-module may include: a third translation model acquisition unit;

Optionally, if the preset display part is a title part, the translation search result obtaining module 3032 may include: the system comprises a recognition submodule, a segmentation submodule, a first translation submodule and a combination submodule;

the combination submodule is used for combining the translation results corresponding to the semantic units according to the preset symbols so as to obtain a first translation search result corresponding to the title part; the first translated search result can include the preset symbol.

Optionally, the first translation submodule may include: a translation unit;

Optionally, if the preset display part is an abstract part, the translation search result obtaining module 3032 may include: extracting a submodule and a second translation submodule;

Optionally, the apparatus may further include: a category determination module;

the translation model acquisition sub-module may include: a model acquisition unit;

Optionally, the category determination module may include: a matching sub-module and a determining sub-module;

the matching submodule is used for respectively matching the contents which can be included in the search result with the dictionaries of all the preset categories so as to obtain the matching rate corresponding to each preset category;

Optionally, the category determining module may include: a classification submodule;

the classification submodule is used for inputting the contents which can be included in the search result into a classifier and taking the classification result output by the classifier as the target class to which the search result belongs; the classifier is obtained by training according to search result samples of all preset categories.

Since the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.

Fig. 4 is a block diagram illustrating an apparatus 900 for cross-language search as a terminal according to an example embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 4, the apparatus 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.

The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.

The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 906 provides power to the various components of the device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.

The multimedia component 908 comprises a screen providing an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide motion action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when apparatus 900 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 further includes a speaker for outputting audio signals.

The I/O interface 912 provides an interface between the processing component 902 and a peripheral interface module, which may be a keyboard, click wheel, button, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the device 900. For example, the sensor component 914 may detect an open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900; the sensor component 914 may also detect a change in the position of the device 900 or of a component of the device 900, the presence or absence of user contact with the device 900, the orientation or acceleration/deceleration of the device 900, and a change in the temperature of the device 900. The sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 916 is configured to facilitate communications between the apparatus 900 and other devices in a wired or wireless manner. The apparatus 900 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the apparatus 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer readable storage medium, instructions in which, when executed by a processor of a terminal, enable the terminal to perform a cross-language search method, the method comprising: acquiring a search word of a first language; obtaining a search result of a second language according to the search word; and aiming at each search result of the second language, executing the following steps: determining target translation models corresponding to all preset display parts of the search results; acquiring translation search results corresponding to all preset display parts of the search results by using the target translation model; and displaying the translation search results corresponding to each preset display part of the search results to the user.

Optionally, the determining a target translation model corresponding to each preset display part of the search result includes:

and acquiring a target translation model corresponding to each preset display part according to the display type.

and/or,

if the display type corresponding to the preset display part is an abstract type, the obtaining of the target translation model corresponding to each preset display part comprises: acquiring an abstract translation model, wherein the abstract translation model is obtained by training according to abstract corpora;

and/or

Optionally, if the preset display part is a title part, the obtaining, by using the target translation model, a translation search result corresponding to each preset display part of the search result includes:

identifying a preset symbol contained in the title part;

translating each divided semantic unit by using a first target translation model corresponding to the title part to obtain a translation result corresponding to each semantic unit;

Optionally, the translating, by using the first target translation model corresponding to the title part, each semantic unit obtained by segmenting, includes:

and respectively inputting each semantic unit and the context corresponding to the semantic unit to the first target translation model so as to obtain a translation result corresponding to each semantic unit output by the first target translation model.
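The title handling described above (identifying preset symbols, dividing the title into semantic units, translating each unit together with its context, and, as recited in claim 1, recombining the results around the preset symbols) might be sketched as follows. This is an illustrative sketch only: the symbol set `PRESET_SYMBOLS`, the function name `translate_title`, the choice of the full title as the context, and the `(unit, context)` model signature are assumptions, not details given in this specification.

```python
import re
from typing import Callable, Sequence

# Stand-in for the first target translation model: takes a semantic unit
# and its context, returns the translated unit (illustrative assumption).
ContextualTranslator = Callable[[str, str], str]

# Hypothetical preset symbols; the actual set is not given in this passage.
PRESET_SYMBOLS: Sequence[str] = ("|", "-", "_")


def translate_title(
    title: str,
    model: ContextualTranslator,
    symbols: Sequence[str] = PRESET_SYMBOLS,
) -> str:
    """Split a title at preset symbols, translate each unit, and recombine."""
    alternation = "|".join(re.escape(s) for s in symbols)
    # The capture group makes re.split keep the separators, so odd indices
    # hold preset symbols (with surrounding whitespace) and even indices
    # hold semantic units. Note that this naive pattern also splits
    # hyphenated words.
    pieces = re.split(r"(\s*(?:" + alternation + r")\s*)", title)
    out = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1 or not piece:
            out.append(piece)  # keep the preset symbol (or empty unit) as-is
        else:
            # Assumption: the whole title serves as the unit's context.
            out.append(model(piece, title))
    return "".join(out)
```

Because the separators are carried through unchanged, the recombined result still contains the preset symbols in their original positions.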

Optionally, if the preset display part is an abstract part, the obtaining, by using the target translation model, a translation search result corresponding to each preset display part of the search results includes:

extracting target content located at a preset position from the abstract part;
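For the abstract part, a rough sketch of extracting target content at a preset position and translating it with a model tied to that position could look like the following. How a preset position is actually defined is not stated in this passage, so the `first_sentence` extractor, the position name, and the dictionaries of extractors and models are all hypothetical.

```python
from typing import Callable, Dict, Optional

Extractor = Callable[[str], Optional[str]]
Translator = Callable[[str], str]


def first_sentence(abstract: str) -> Optional[str]:
    """Hypothetical preset position: the leading sentence of the abstract."""
    head, sep, _ = abstract.partition(". ")
    if sep:
        return head + "."
    return abstract or None


def translate_abstract(
    abstract: str,
    position: str,
    extractors: Dict[str, Extractor],
    models: Dict[str, Translator],
) -> Optional[str]:
    """Extract the content at a preset position and translate it."""
    # Extract the target content located at the preset position.
    target = extractors[position](abstract)
    if target is None:
        return None  # nothing found at this position
    # Use the (second) target translation model registered for that position.
    return models[position](target)
```

Keying both the extractor and the model by the same position name reflects the idea that each preset position has its own model trained on corpora for that position.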

Optionally, the terminal is further configured such that the one or more programs, executed by the one or more processors, include instructions for:

determining a target category to which the search result belongs;

the obtaining of the target translation model corresponding to each preset display part according to the display type includes:

Optionally, the determining the target category to which the search result belongs includes:

Optionally, the determining the target category to which the search result belongs includes:

FIG. 5 is a block diagram illustrating an apparatus for cross-language searching, implemented as a server, in accordance with an exemplary embodiment. The server 1900 may vary widely depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient storage or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.

The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

The cross-language search method, the cross-language search apparatus, and the apparatus for cross-language searching provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the above description of the embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation on the present invention.

Claims

1. A cross-language search method, comprising:

acquiring a search word of a first language;

obtaining a search result of a second language according to the search word;

displaying the translation search results corresponding to all preset display parts of the search results to a user;

the step of obtaining the translation search results corresponding to each preset display part of the search results by using the target translation model comprises the following steps:

if the preset display part is a title part, identifying a preset symbol contained in the title part; dividing the title part into a plurality of semantic units according to the preset symbol; translating each divided semantic unit by using a first target translation model corresponding to the title part to obtain a translation result corresponding to each semantic unit; combining the translation results corresponding to the semantic units according to the preset symbol to obtain a first translation search result corresponding to the title part; the first translation search result includes the preset symbol; the first target translation model is a title translation model obtained by training according to title corpora;

if the preset display part is an abstract part, extracting target content located at a preset position from the abstract part; translating the target content by using a second target translation model corresponding to the preset position to obtain a corresponding second translation search result; the second target translation model is obtained by training according to preset content corpora corresponding to a preset position.

2. The method of claim 1, wherein the step of determining the target translation model corresponding to each preset presentation portion of the search result comprises:

3. The method according to claim 2, wherein if the display type corresponding to the preset display part is a page content class, the obtaining of the target translation model corresponding to each preset display part comprises: acquiring a page content translation model, wherein the page content translation model is obtained by training according to preset page content corpora.

4. The method according to claim 1, wherein the step of translating each semantic unit obtained by segmentation by using the first target translation model corresponding to the title part comprises:

5. A method according to claim 2 or 3, characterized in that the method further comprises: determining a target category to which the search result belongs;

and acquiring a target translation model corresponding to each preset display part by combining the target category to which the search result belongs and the display type corresponding to each preset display part.

6. The method of claim 5, wherein the step of determining the target category to which the search result belongs comprises:

7. The method of claim 5, wherein the step of determining the target category to which the search result belongs comprises:

inputting the content included in the search result into a classifier, and taking the classification result output by the classifier as a target category to which the search result belongs; the classifier is obtained by training according to search result samples of all preset categories.

8. A cross-language search apparatus, comprising:

the search word acquisition module is used for acquiring search words of a first language;

the search result processing module comprises:

the translation model determining module is used for determining a target translation model corresponding to each preset display part of the search result;

the translation search result display module is used for displaying the translation search results corresponding to all preset display parts of the search results to a user;

if the preset display part is a title part, the translation search result acquisition module comprises: a recognition submodule, a segmentation submodule, a first translation submodule and a combination submodule; if the preset display part is an abstract part, the translation search result acquisition module comprises: an extraction submodule and a second translation submodule;

the first translation submodule is used for translating each segmented semantic unit by using a first target translation model corresponding to the title part to obtain a translation result corresponding to each semantic unit; the first target translation model is a title translation model obtained by training according to title corpus;

the combination submodule is used for combining the translation results corresponding to the semantic units according to the preset symbols so as to obtain a first translation search result corresponding to the title part; the first translation search result comprises the preset symbol;

the second translation submodule is used for translating the target content by using a second target translation model corresponding to the preset position to obtain a corresponding second translation search result; the second target translation model is obtained by training according to preset content corpora corresponding to a preset position.

9. The apparatus of claim 8, wherein the translation model determining module comprises: a display type determining submodule and a translation model obtaining submodule;

10. The apparatus according to claim 9, wherein if the display type corresponding to the preset display part is a page content class, the translation model obtaining submodule comprises: a third translation model acquisition unit;

11. The apparatus of claim 8, wherein the first translation submodule comprises: a translation unit;

12. The apparatus of claim 9 or 10, further comprising: a category determination module;

13. The apparatus of claim 12, wherein the category determination module comprises: a matching submodule and a determining submodule;

14. The apparatus of claim 12, wherein the category determination module comprises: a classification submodule;

15. An apparatus for cross-language searching, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:

acquiring a search word of a first language;

obtaining a search result of a second language according to the search word;

determining a target translation model corresponding to each preset display part of the search result;

obtaining translation search results corresponding to each preset display part of the search results by using the target translation model;

16. The apparatus of claim 15, wherein the step of determining the target translation model corresponding to each preset presentation portion of the search result comprises:

determining display types corresponding to preset display parts contained in the search results;

17. The apparatus according to claim 16, wherein if the display type corresponding to the preset display part is a page content class, the obtaining of the target translation model corresponding to each preset display part comprises: acquiring a page content translation model, wherein the page content translation model is obtained by training according to preset page content corpora.

18. The apparatus according to claim 15, wherein the translating of each semantic unit obtained by segmenting by using the first target translation model corresponding to the title part comprises:

19. The apparatus of claim 16 or 17, further comprising: determining a target category to which the search result belongs;

20. The apparatus of claim 19, wherein the step of determining the target category to which the search result belongs comprises:

21. The apparatus of claim 19, wherein the step of determining the target category to which the search result belongs comprises:

22. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method recited by one or more of claims 1-7.