CN112613327A - Information processing method and device - Google Patents

Information processing method and device

Info

Publication number
CN112613327A
Authority
CN
China
Prior art keywords
sentence
replaced
source
target
replacement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110020630.5A
Other languages
Chinese (zh)
Other versions
CN112613327B (en)
Inventor
刘绍孔
李健
武卫东
陈明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN202110020630.5A priority Critical patent/CN112613327B/en
Publication of CN112613327A publication Critical patent/CN112613327A/en
Application granted granted Critical
Publication of CN112613327B publication Critical patent/CN112613327B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an information processing method and device. With this application, mutually translated vocabulary pairs can be collected automatically, without human participation in the collection process, which saves labor cost. In addition, the semantic similarity between the source sentence in which the low-frequency proper noun has been replaced and each replaced target sentence is calculated, and the low-frequency proper noun replaced in the source sentence is combined with the vocabulary replaced in the replaced target sentence having the highest semantic similarity to form a mutually translated vocabulary pair, which can improve the accuracy of the mutual translation between the two vocabularies of each collected pair.

Description

Information processing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to an information processing method and apparatus.
Background
In machine translation, proper nouns often need to be translated. Proper nouns usually occur with low frequency, and in many cases they need to be translated separately.
When proper nouns are translated separately, a bilingual dictionary for translating proper nouns can be used, in which the translation of a proper noun is looked up directly during translation. Alternatively, a translation model for translating proper nouns can be trained from the bilingual dictionary and then used to translate proper nouns during translation.
For example, when an English sentence is translated, assume that the English sentence is "Fridrich II is a great commander", where "Fridrich II" is a proper noun of the person-name class. "Fridrich II" is replaced with "<NE>" to give "<NE> is a great commander", and "<NE> is a great commander" is then translated by the translation model to give the translated text "<NE> is a great commander", in which "<NE>" is retained.
The translation "fibular spirit second" of "Fridrich II" may then be looked up in the bilingual dictionary, or obtained by having the translation model for translating proper nouns translate "Fridrich II". This translation then replaces "<NE>" in the translated text "<NE> is a great commander", giving the final translation "fibular spirit second is a great commander".
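The placeholder workflow in this example can be sketched as follows. This is a minimal illustration only: `translate_with_placeholder`, `translate_sentence`, and `noun_dictionary` are hypothetical names, and the sentence translator is passed in as a function because the application does not specify a particular translation model.

```python
PLACEHOLDER = "<NE>"


def translate_with_placeholder(sentence, proper_noun, translate_sentence, noun_dictionary):
    """Translate a sentence while handling one proper noun separately.

    translate_sentence: hypothetical callable standing in for the translation
        model; it is assumed to keep the placeholder unchanged in its output.
    noun_dictionary: hypothetical bilingual dictionary mapping a proper noun
        to its translation.
    """
    # Step 1: hide the proper noun behind the placeholder.
    masked = sentence.replace(proper_noun, PLACEHOLDER)
    # Step 2: translate the masked sentence.
    translated = translate_sentence(masked)
    # Step 3: put the separately obtained translation of the proper noun back.
    return translated.replace(PLACEHOLDER, noun_dictionary[proper_noun])
```

The sketch assumes the dictionary already contains the proper noun; collecting such entries automatically is the problem addressed by the rest of the application.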
To translate proper nouns with a bilingual dictionary or with a translation model for translating proper nouns, the dictionary or model must be generated in advance, and to generate the dictionary or train the model, mutually translated vocabulary pairs must first be collected.
Therefore, how to collect mutually translated vocabulary pairs is a technical problem that urgently needs to be solved.
Disclosure of Invention
The application discloses an information processing method and device.
In a first aspect, the present application shows an information processing method, comprising:
obtaining a sentence pair which is translated mutually, wherein the sentence pair comprises a source sentence and a target sentence; the source sentence and the target sentence are in different languages;
identifying low-frequency proper nouns in the source sentences;
replacing the low-frequency proper nouns in the source sentences by using the specific identifiers to obtain the source sentences after replacement;
sequentially replacing each vocabulary in the target sentence by using the specific identifier to respectively obtain a plurality of replaced target sentences;
obtaining semantic similarity between the source sentences after replacement and each target sentence after replacement;
selecting a replaced target sentence with the highest semantic similarity with the replaced source sentence;
and combining the replaced words in the source sentence after replacement and the replaced words in the target sentence after replacement into mutually translated word pairs.
In an optional implementation manner, the mutually translated sentence pair is one of a plurality of obtained mutually translated sentence pairs;
the identifying low frequency proper nouns in the source sentence comprises:
dividing words of the source sentence to obtain a plurality of words included in the source sentence;
determining proper nouns in the plurality of vocabularies based on named entity recognition;
counting the times of occurrence of the proper nouns in a plurality of sentence pairs which are mutually translated;
and determining the proper noun as a low-frequency proper noun under the condition that the occurrence number is lower than a preset threshold value.
In an alternative implementation, the identifying low frequency proper nouns in the source sentence includes:
dividing words of the source sentence to obtain a plurality of words included in the source sentence;
determining proper nouns in the plurality of vocabularies based on named entity recognition;
searching whether the proper nouns exist in a preset word list;
in the case that the proper noun does not exist in the preset word list, splitting the proper noun into at least two sub-words;
searching whether the at least two sub-words exist in the preset word list;
and in the case that the at least two sub-words exist in the preset word list, determining the proper noun as a low-frequency proper noun.
In an optional implementation manner, the sequentially replacing each vocabulary in the target sentence by using the specific identifier to obtain a plurality of replaced target sentences, respectively, includes:
segmenting words of the target sentence to obtain a plurality of words included in the target sentence;
replacing a first vocabulary in the target sentence by using the specific identifier to obtain a replaced target sentence, replacing a second vocabulary in the target sentence by using the specific identifier to obtain a replaced target sentence, and so on, and replacing a last vocabulary in the target sentence by using the specific identifier to obtain a replaced target sentence.
In an optional implementation manner, the obtaining semantic similarity between the replaced source sentence and each replaced target sentence respectively includes:
acquiring semantic vectors of the source sentences after replacement, and respectively acquiring semantic vectors of each target sentence after replacement;
and for any replaced target sentence, inputting the semantic vector of the source sentence after replacement and the semantic vector of the target sentence after replacement into a similarity calculation model, and obtaining the semantic similarity between the source sentence after replacement and the target sentence after replacement, which is output by the similarity calculation model.
In an optional implementation manner, the obtaining a semantic vector of the replaced source sentence includes:
and inputting the replaced source sentences into a preset model which is trained in advance and used for processing sentences to obtain semantic vectors of the replaced source sentences output by a coding layer of the preset model.
In an optional implementation manner, the separately obtaining the semantic vector of each replaced target sentence includes:
and for any replaced target sentence, inputting the replaced target sentence into a preset model which is trained in advance and used for processing sentences, and obtaining the semantic vector of the replaced target sentence output by a coding layer of the preset model.
In a second aspect, the present application shows an information processing apparatus comprising:
a first obtaining module, configured to obtain a sentence pair translated with each other, where the sentence pair includes a source sentence and a target sentence; the source sentence and the target sentence are in different languages;
the recognition module is used for recognizing the low-frequency proper nouns in the source sentences;
the first replacement module is used for replacing the low-frequency proper nouns in the source sentences by using the specific identifiers to obtain the source sentences after replacement;
the second replacement module is used for sequentially replacing each vocabulary in the target sentence by using the specific identifier to respectively obtain a plurality of replaced target sentences;
the second acquisition module is used for acquiring semantic similarity between the source sentences after replacement and each target sentence after replacement;
the selection module is used for selecting a replaced target sentence with the highest semantic similarity with the replaced source sentence;
and the combination module is used for combining the replaced vocabulary in the source sentence after replacement and the replaced vocabulary in the target sentence after replacement into a mutually translated vocabulary pair.
In an optional implementation manner, the mutually translated sentence pair is one of a plurality of obtained mutually translated sentence pairs;
the identification module comprises:
the first word segmentation unit is used for segmenting words of the source sentence to obtain a plurality of words included in the source sentence;
a first recognition unit for determining proper nouns in the plurality of vocabularies based on named entity recognition;
a counting unit for counting the number of times the proper nouns appear in a plurality of sentence pairs which are mutually translated;
a first determination unit configured to determine the proper noun as a low-frequency proper noun if the number of occurrences is lower than a preset threshold.
In an alternative implementation, the identification module includes:
the second word segmentation unit is used for segmenting words of the source sentence to obtain a plurality of words included in the source sentence;
a second recognition unit for determining proper nouns in the plurality of vocabularies based on named entity recognition;
the first searching unit is used for searching whether the proper nouns exist in a preset word list;
the splitting unit is used for splitting the proper noun into at least two sub-words in the case that the proper noun does not exist in the preset word list;
the second searching unit is used for searching whether the at least two sub-words exist in the preset word list;
and the second determining unit is used for determining the proper noun as a low-frequency proper noun in the case that the at least two sub-words exist in the preset word list.
In an optional implementation manner, the second replacement module includes:
the third word segmentation unit is used for segmenting words of the target sentence to obtain a plurality of words included in the target sentence;
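a replacing unit, configured to replace a first vocabulary in the target sentence with the specific identifier to obtain a replaced target sentence, replace a second vocabulary in the target sentence with the specific identifier to obtain a replaced target sentence, and so on, until a last vocabulary in the target sentence is replaced with the specific identifier to obtain a replaced target sentence.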
In an optional implementation manner, the second obtaining module includes:
a first obtaining unit, configured to obtain the semantic vector of the replaced source sentence; a second obtaining unit, configured to obtain the semantic vector of each replaced target sentence respectively;
and the input unit is used for inputting the semantic vector of the source sentence after replacement and the semantic vector of the target sentence after replacement into a similarity calculation model for any replaced target sentence to obtain the semantic similarity between the source sentence after replacement and the target sentence after replacement output by the similarity calculation model.
In an optional implementation manner, the first obtaining unit is specifically configured to: and inputting the replaced source sentences into a preset model which is trained in advance and used for processing sentences to obtain semantic vectors of the replaced source sentences output by a coding layer of the preset model.
In an optional implementation manner, the second obtaining unit is specifically configured to: for any replaced target sentence, input the replaced target sentence into a preset model which is trained in advance and used for processing sentences, and obtain the semantic vector of the replaced target sentence output by a coding layer of the preset model.
In a third aspect, the present application shows an electronic device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the information processing method according to the first aspect.
In a fourth aspect, the present application shows a non-transitory computer-readable storage medium having instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the information processing method according to the first aspect.
In a fifth aspect, the present application shows a computer program product, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the information processing method according to the first aspect.
The technical scheme provided by the application can comprise the following beneficial effects:
In the present application, a sentence pair translated with each other is obtained, where the sentence pair includes a source sentence and a target sentence in different languages. Low-frequency proper nouns in the source sentence are identified. The low-frequency proper noun in the source sentence is replaced with the specific identifier to obtain a replaced source sentence. Each vocabulary in the target sentence is replaced in turn with the specific identifier to obtain a plurality of replaced target sentences. The semantic similarity between the replaced source sentence and each replaced target sentence is obtained. The replaced target sentence with the highest semantic similarity to the replaced source sentence is selected. The vocabulary replaced in the source sentence and the vocabulary replaced in the selected replaced target sentence are combined into a mutually translated vocabulary pair.
With this application, mutually translated vocabulary pairs can be collected automatically, without human participation in the collection process, which saves labor cost. In addition, the semantic similarity between the source sentence in which the low-frequency proper noun has been replaced and each replaced target sentence is calculated, and the low-frequency proper noun replaced in the source sentence is combined with the vocabulary replaced in the replaced target sentence having the highest semantic similarity to form a mutually translated vocabulary pair, which can improve the accuracy of the mutual translation between the two vocabularies of each collected pair.
Drawings
FIG. 1 is a flow chart of the steps of an information processing method of the present application;
FIG. 2 is a block diagram of the structure of an information processing apparatus of the present application;
FIG. 3 is a block diagram of an electronic device shown in the present application;
FIG. 4 is a block diagram of an electronic device shown in the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
Referring to fig. 1, a flowchart illustrating steps of an information processing method according to the present application is shown, where the method may specifically include the following steps:
In step S101, a sentence pair translated with each other is obtained, where the sentence pair includes a source sentence and a target sentence, and the source sentence and the target sentence are in different languages.
In the present application, mutually translated sentence pairs can be crawled from a network by a crawler, for example from translation tools already available on the market.
Each mutually translated sentence pair includes a source sentence and a target sentence corresponding to the source sentence. The source sentence and the target sentence in the same sentence pair are in different languages and have the same semantics.
In an example of the present application, the source sentence in each crawled sentence pair may be a sentence in one specific language, and the target sentence may be a sentence in another specific language. The specific languages may be determined according to actual needs, which is not limited in this application.
The crawled mutually translated sentence pairs can be cached locally, so that the process from step S101 to step S107 of the present application can be executed for each of them.
In this step, one mutually translated sentence pair that has not been selected before may be selected from the cached sentence pairs, and then step S102 may be executed.
In step S102, low frequency proper nouns in the source sentence are identified.
In an embodiment of the present application, the mutually translated sentence pair is one of a plurality of obtained mutually translated sentence pairs.
Therefore, when the low-frequency proper nouns in the source sentence are identified, the source sentence can first be segmented to obtain a plurality of words included in the source sentence.
Any existing word segmentation technology may be used for this purpose, and it is not described in detail here.
For example, word2vec word vectors or seq2seq techniques based on RNN (Recurrent Neural Networks) or the like may be used.
CRF (Conditional Random Field) and BPE (Byte Pair Encoding) techniques may also be used.
For example, when segmenting words in Chinese, a jieba segmenter can be used, and when segmenting words in English, related technologies such as NLTK (Natural Language Toolkit) and open source tools related to BPE, such as subword-nmt and fastBPE, can be used.
Among the plurality of vocabularies included in the source sentence, proper nouns such as person names, place names, and organization names can be determined by named entity recognition (NER).
Proper nouns whose number of occurrences in the plurality of mutually translated sentence pairs is lower than a preset threshold are then determined as low-frequency proper nouns.
For example, for any proper noun, the number of times that the proper noun appears in the plurality of mutually translated sentence pairs can be counted, and the proper noun is determined as a low-frequency proper noun in the case that the number of occurrences is lower than the preset threshold.
The above operation is also performed for each of the other proper nouns, so that the low-frequency proper nouns in the source sentence are identified.
The preset threshold may be, for example, 2, 3, or 4, and may be set according to actual requirements, which is not limited in this application.
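The frequency-based determination above can be sketched as follows. This is a minimal sketch under stated assumptions: the proper nouns of each sentence are assumed to have already been extracted by named entity recognition, and the function name, parameter names, and default threshold of 3 are illustrative rather than taken from the application.

```python
from collections import Counter


def low_frequency_proper_nouns(proper_nouns_per_sentence, threshold=3):
    """Return the proper nouns that occur fewer than `threshold` times.

    proper_nouns_per_sentence: one list of proper nouns per source sentence,
        assumed to come from a named entity recognition step.
    """
    # Count every occurrence of every proper noun across all sentences.
    counts = Counter(
        noun for nouns in proper_nouns_per_sentence for noun in nouns
    )
    # A proper noun is low-frequency when its count is below the threshold.
    return {noun for noun, count in counts.items() if count < threshold}
```

With a threshold of 3, a proper noun that appears exactly three times in the corpus is not treated as low-frequency, since the condition is "lower than" the preset threshold.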
Alternatively, in another embodiment, the low frequency proper nouns may be determined among the proper nouns based on sub-words.
For example, the source sentence is segmented to obtain a plurality of words included in the source sentence; proper nouns in the plurality of words are determined based on named entity recognition; and whether a proper noun exists in a preset word list is searched. In the case that the proper noun exists in the preset word list, it is determined that the proper noun is not a low-frequency proper noun. In the case that the proper noun does not exist in the preset word list, the proper noun is split into at least two sub-words, and whether the at least two sub-words exist in the preset word list is searched; in the case that the at least two sub-words exist in the preset word list, the proper noun is determined as a low-frequency proper noun.
In one example, the preset word list may be a word list of a particular language, such as an English word list, a French word list, a Russian word list, or a German word list.
In one example, assume that the proper noun is "Fridrich". "Fridrich" is not in the English word list, but it can be split into two parts, "Frid" and "rich", each of which is in the English word list, so "Fridrich" can be determined as a low-frequency proper noun.
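The sub-word check can be sketched as follows. This is an illustration under stated assumptions: the application does not say how the split is chosen, so the sketch simply tries every two-way split, and it treats a proper noun for which no such split is found as not low-frequency. The function name and the contents of the word list are hypothetical.

```python
def is_low_frequency_by_subwords(proper_noun, word_list):
    """Decide whether a proper noun is low-frequency using a preset word list.

    Assumption: only two-way splits are tried, although the application
    allows "at least two" sub-words.
    """
    # A proper noun that is itself in the word list is not low-frequency.
    if proper_noun in word_list:
        return False
    # Try every position at which the proper noun can be cut in two.
    for cut in range(1, len(proper_noun)):
        head, tail = proper_noun[:cut], proper_noun[cut:]
        if head in word_list and tail in word_list:
            return True
    # Assumption: no valid split means the rule does not apply.
    return False
```

For instance, with a word list containing "Frid" and "rich" but not "Fridrich", the function reports "Fridrich" as low-frequency, matching the example above.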
In step S103, the low-frequency proper noun in the source sentence is replaced with the specific identifier, resulting in a replaced source sentence.
In an embodiment of the present application, the specific identifier is an identifier set in advance by a technician, for example "<NE>". The specific identifier may also take other forms, which is not limited in this application.
In step S104, each vocabulary in the target sentence is sequentially replaced with a specific identifier, and a plurality of replaced target sentences are obtained.
In this step, the target sentence may be segmented to obtain a plurality of words included in the target sentence. Then the specific identifier is used to replace the first word in the target sentence to obtain one replaced target sentence, the specific identifier is used to replace the second word in the target sentence to obtain another replaced target sentence, and so on, until the specific identifier is used to replace the last word in the target sentence. A plurality of replaced target sentences is thus obtained, and their number equals the number of words obtained by segmenting the target sentence.
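The sequential replacement can be sketched as follows, assuming the target sentence has already been segmented into a list of words. The function name `replaced_target_sentences` is illustrative.

```python
def replaced_target_sentences(target_words, identifier="<NE>"):
    """Return one replaced sentence per word, with that word masked.

    The i-th result is the target sentence in which only the i-th word has
    been replaced by the specific identifier.
    """
    return [
        target_words[:i] + [identifier] + target_words[i + 1:]
        for i in range(len(target_words))
    ]
```

Keeping the results in word order means the index of a replaced target sentence is also the index of the word it masks, which is convenient when the best candidate is selected later.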
In step S105, semantic similarity between the replaced source sentence and each replaced target sentence is obtained.
Wherein, this step can be realized through the following process, including:
1051. Obtain the semantic vector of the replaced source sentence.
In the present application, a preset model may be used to obtain the semantic vector of the source sentence.
The preset model may be an existing model used for processing sentences, such as a translation model. The preset model includes a coding layer, which encodes an input sentence to obtain the semantic vector of the sentence; this semantic vector is usually passed to the layer of the preset model that immediately follows the coding layer.
Therefore, the replaced source sentence can be input into the pre-trained preset model for processing sentences, and the semantic vector of the replaced source sentence output by the coding layer of the preset model is obtained.
1052. Obtain the semantic vector of each replaced target sentence respectively.
Accordingly, in the same manner as step 1051, for any replaced target sentence, the replaced target sentence may be input into the pre-trained preset model for processing sentences, so as to obtain the semantic vector of the replaced target sentence output by the coding layer of the preset model.
The above operation is also performed for each of the other replaced target sentences, so that the semantic vector of each replaced target sentence is obtained.
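Steps 1051 and 1052 can be sketched as follows. The application relies on the coding layer of a pre-trained model, which cannot be reproduced here, so the sketch substitutes a toy encoder that averages per-word vectors. `ToyEncoder`, `encode_all`, and the embedding table are all hypothetical stand-ins, and mean pooling is an assumption rather than the method of the application.

```python
class ToyEncoder:
    """Hypothetical stand-in for the coding layer of a pre-trained model."""

    def __init__(self, embeddings, dimension):
        self.embeddings = embeddings  # word -> list of floats
        self.dimension = dimension

    def encode(self, words):
        """Mean-pool the word vectors; unknown words count as zero vectors."""
        total = [0.0] * self.dimension
        for word in words:
            vector = self.embeddings.get(word, [0.0] * self.dimension)
            total = [t + x for t, x in zip(total, vector)]
        count = max(len(words), 1)  # avoid dividing by zero for empty input
        return [t / count for t in total]


def encode_all(encoder, replaced_source, replaced_targets):
    """Encode the replaced source sentence and every replaced target sentence
    with the same encoder, as steps 1051 and 1052 describe."""
    source_vector = encoder.encode(replaced_source)
    target_vectors = [encoder.encode(words) for words in replaced_targets]
    return source_vector, target_vectors
```

In practice the `encode` method would be replaced by a call that returns the output of the coding layer of the chosen pre-trained model.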
1053. For any replaced target sentence, input the semantic vector of the replaced source sentence and the semantic vector of the replaced target sentence into a similarity calculation model, and obtain the semantic similarity between the replaced source sentence and the replaced target sentence output by the similarity calculation model.
In the present application, the similarity calculation model may be trained in advance, and the specific training mode includes:
A plurality of pieces of training data is obtained, each of which includes two different semantic vectors and the semantic similarity between the two semantic vectors. An initial model is then trained with the training data until the parameters of the model converge, thereby obtaining the similarity calculation model.
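The training procedure can be sketched as follows. The application does not specify the architecture of the similarity calculation model, so this sketch assumes a very simple one: a linear model over the element-wise product of the two vectors, trained by full-batch gradient descent on squared error until the loss stops changing. All names, the learning rate, and the tolerance are illustrative assumptions.

```python
def features(u, v):
    """Element-wise product of the two vectors plus a constant bias term."""
    return [a * b for a, b in zip(u, v)] + [1.0]


def predict(weights, u, v):
    """Predicted semantic similarity of two semantic vectors."""
    return sum(w * f for w, f in zip(weights, features(u, v)))


def mean_squared_error(weights, data):
    return sum((predict(weights, u, v) - y) ** 2 for u, v, y in data) / len(data)


def train(data, learning_rate=0.1, tolerance=1e-9, max_steps=10000):
    """Train on (vector, vector, similarity) triples until convergence."""
    weights = [0.0] * (len(data[0][0]) + 1)  # initial model: all zeros
    previous = mean_squared_error(weights, data)
    for _ in range(max_steps):
        # Gradient of the mean squared error with respect to each weight.
        grads = [0.0] * len(weights)
        for u, v, y in data:
            error = predict(weights, u, v) - y
            for i, f in enumerate(features(u, v)):
                grads[i] += 2 * error * f / len(data)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
        current = mean_squared_error(weights, data)
        if abs(previous - current) < tolerance:  # parameters have converged
            break
        previous = current
    return weights
```

A production model would more likely be a neural network trained on many vector pairs; the loop above only illustrates the "train until convergence" structure described in the text.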
In this way, in this step, for any replaced target sentence, the semantic vector of the replaced source sentence and the semantic vector of the replaced target sentence may be input into the similarity calculation model, so as to obtain the semantic similarity between the replaced source sentence and the replaced target sentence output by the similarity calculation model. The same is done for each of the other replaced target sentences.
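Scoring every replaced target sentence against the replaced source sentence can be sketched as follows. The application uses a trained similarity calculation model; this sketch swaps in cosine similarity as a stand-in, which is an assumption and not the method of the application. The `similarity` parameter makes it possible to pass in any other scoring function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_candidates(source_vector, target_vectors, similarity=cosine_similarity):
    """Return one similarity per replaced target sentence, in input order."""
    return [similarity(source_vector, vector) for vector in target_vectors]
```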
In step S106, the replaced target sentence having the highest semantic similarity with the replaced source sentence is selected.
In step S107, the replaced vocabulary in the replaced source sentence and the replaced vocabulary in the selected replaced target sentence are combined into a vocabulary pair that are translated with each other.
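Steps S106 and S107 can be sketched as follows, assuming the similarities are listed in the same order as the words of the target sentence, so that the i-th similarity belongs to the replaced target sentence in which the i-th word was replaced. The function name is illustrative.

```python
def extract_vocabulary_pair(source_noun, target_words, similarities):
    """Pair the replaced source noun with the best-matching target word.

    similarities[i] is assumed to be the semantic similarity obtained when
    target_words[i] was the word replaced by the specific identifier.
    """
    # S106: index of the replaced target sentence with the highest similarity.
    best = max(range(len(similarities)), key=similarities.__getitem__)
    # S107: the word masked in that sentence is taken as the translation.
    return source_noun, target_words[best]
```

If several candidates share the highest similarity, this sketch keeps the first one; the application does not state how ties are resolved.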
Further, the process of step S101 to step S107 is also executed for each of the other crawled mutually translated sentence pairs, so that mutually translated vocabulary pairs are extracted from every crawled sentence pair. A bilingual dictionary can then be generated from all the extracted vocabulary pairs, or a translation model can be trained from them.
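The overall loop over sentence pairs can be sketched as follows. The three callables are hypothetical hooks for the components described above (low-frequency proper noun identification, the coding layer, and the similarity calculation model), and sentences are assumed to be pre-segmented, non-empty lists of words.

```python
def build_bilingual_dictionary(sentence_pairs, find_low_frequency_nouns,
                               encode, similarity, identifier="<NE>"):
    """Run steps S102 to S107 on every sentence pair and collect the results.

    sentence_pairs: iterable of (source_words, target_words) lists.
    find_low_frequency_nouns, encode, similarity: hypothetical callables
        standing in for the components described in the text.
    """
    dictionary = {}
    for source_words, target_words in sentence_pairs:
        for noun in find_low_frequency_nouns(source_words):          # S102
            replaced_source = [identifier if w == noun else w
                               for w in source_words]                 # S103
            source_vector = encode(replaced_source)
            scores = []
            for i in range(len(target_words)):                        # S104
                replaced_target = (target_words[:i] + [identifier]
                                   + target_words[i + 1:])
                scores.append(
                    similarity(source_vector, encode(replaced_target)))  # S105
            best = max(range(len(scores)), key=scores.__getitem__)    # S106
            dictionary[noun] = target_words[best]                     # S107
    return dictionary
```

The resulting mapping can serve directly as a simple bilingual dictionary, or its entries can be used as training data for a translation model for proper nouns.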
In the present application, a sentence pair translated with each other is obtained, where the sentence pair includes a source sentence and a target sentence in different languages. Low-frequency proper nouns in the source sentence are identified. The low-frequency proper noun in the source sentence is replaced with the specific identifier to obtain a replaced source sentence. Each vocabulary in the target sentence is replaced in turn with the specific identifier to obtain a plurality of replaced target sentences. The semantic similarity between the replaced source sentence and each replaced target sentence is obtained. The replaced target sentence with the highest semantic similarity to the replaced source sentence is selected. The vocabulary replaced in the source sentence and the vocabulary replaced in the selected replaced target sentence are combined into a mutually translated vocabulary pair.
With this application, mutually translated vocabulary pairs can be collected automatically, without human participation in the collection process, which saves labor cost. In addition, the semantic similarity between the source sentence in which the low-frequency proper noun has been replaced and each replaced target sentence is calculated, and the low-frequency proper noun replaced in the source sentence is combined with the vocabulary replaced in the replaced target sentence having the highest semantic similarity to form a mutually translated vocabulary pair, which can improve the accuracy of the mutual translation between the two vocabularies of each collected pair.
It should be noted that, for simplicity of description, the method embodiments are described as a series of actions or combinations of actions, but those skilled in the art will appreciate that the present application is not limited by the described order of actions, since according to the present application some steps may be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary, and that the actions involved are not necessarily required by the present application.
Referring to fig. 2, a block diagram of an information processing apparatus according to the present application is shown, and the apparatus may specifically include the following modules:
a first obtaining module 11, configured to obtain a mutually translated sentence pair, where the sentence pair includes a source sentence and a target sentence, and the source sentence and the target sentence are in different languages;
an identification module 12, configured to identify a low-frequency proper noun in the source sentence;
a first replacement module 13, configured to replace the low-frequency proper noun in the source sentence with a specific identifier to obtain a replaced source sentence;
a second replacement module 14, configured to sequentially replace each vocabulary item in the target sentence with the specific identifier to obtain a plurality of replaced target sentences respectively;
a second obtaining module 15, configured to obtain the semantic similarity between the replaced source sentence and each replaced target sentence;
a selecting module 16, configured to select the replaced target sentence with the highest semantic similarity to the replaced source sentence;
and a combination module 17, configured to combine the vocabulary replaced in the replaced source sentence and the vocabulary replaced in the selected replaced target sentence into a mutually translated vocabulary pair.
In an optional implementation manner, the mutually translated sentence pair is one of a plurality of obtained mutually translated sentence pairs;
the identification module includes:
a first word segmentation unit, configured to segment the source sentence into words to obtain a plurality of vocabulary items included in the source sentence;
a first recognition unit, configured to determine a proper noun among the plurality of vocabulary items based on named entity recognition;
a counting unit, configured to count the number of times the proper noun occurs in the plurality of mutually translated sentence pairs;
and a first determination unit, configured to determine the proper noun to be a low-frequency proper noun if the number of occurrences is lower than a preset threshold.
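The counting and thresholding performed by these units can be sketched as below. Word segmentation and named entity recognition are assumed to have already produced one list of proper nouns per source sentence; the function name, the sample corpus, and the threshold value are hypothetical.

```python
from collections import Counter


def low_frequency_proper_nouns(proper_nouns_per_sentence, threshold):
    """Return the proper nouns whose occurrence count is below `threshold`.

    `proper_nouns_per_sentence` holds one list of proper nouns per source
    sentence, as produced upstream by segmentation plus named entity
    recognition (both outside this sketch).
    """
    counts = Counter(
        noun for nouns in proper_nouns_per_sentence for noun in nouns
    )
    return {noun for noun, count in counts.items() if count < threshold}


# Hypothetical corpus of three source sentences.
corpus_proper_nouns = [["北京"], ["北京", "塞罕坝"], ["北京"]]
rare_nouns = low_frequency_proper_nouns(corpus_proper_nouns, threshold=2)
print(rare_nouns)
```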
In an optional implementation manner, the identification module includes:
a second word segmentation unit, configured to segment the source sentence into words to obtain a plurality of vocabulary items included in the source sentence;
a second recognition unit, configured to determine a proper noun among the plurality of vocabulary items based on named entity recognition;
a first searching unit, configured to search whether the proper noun exists in a preset word list;
a splitting unit, configured to split the proper noun into at least two word segments if the proper noun does not exist in the preset word list;
a second searching unit, configured to search whether the at least two word segments exist in the preset word list;
and a second determining unit, configured to determine the proper noun to be a low-frequency proper noun if the at least two word segments exist in the preset word list.
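A minimal sketch of this word-list check follows. The function name, the sample word list, and the use of character-level splitting (`list`) as the splitter are assumptions made for illustration; the application does not state how the proper noun is split. Returning `False` when the proper noun is already in the word list, or when some segment is missing from it, is also an assumption, since the text above only specifies the positive case.

```python
def is_low_frequency(proper_noun, preset_word_list, split):
    """Apply the word-list test to one proper noun.

    `split` is a caller-supplied function that breaks the proper noun into
    word segments (hypothetical; the splitting method is not specified).
    """
    if proper_noun in preset_word_list:
        # Assumption: a noun already in the word list is not low-frequency.
        return False
    segments = split(proper_noun)
    if len(segments) < 2:
        return False
    # Low-frequency only if every segment is itself in the word list.
    return all(segment in preset_word_list for segment in segments)


# Hypothetical preset word list; character-level splitting via `list`.
word_list = {"北京", "塞", "罕", "坝"}
results = {
    noun: is_low_frequency(noun, word_list, list)
    for noun in ["塞罕坝", "北京", "雄安"]
}
print(results)
```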
In an optional implementation manner, the second replacement module includes:
a third word segmentation unit, configured to segment the target sentence into words to obtain a plurality of vocabulary items included in the target sentence;
and a replacing unit, configured to replace the first vocabulary item in the target sentence with the specific identifier to obtain one replaced target sentence, replace the second vocabulary item in the target sentence with the specific identifier to obtain another replaced target sentence, and so on, until the last vocabulary item in the target sentence is replaced with the specific identifier to obtain a further replaced target sentence.
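The replacing unit can be sketched as a function that yields one replaced target sentence per position. Returning the replaced word alongside each sentence is a design choice of this sketch (it lets a later step recover which word was hidden); the function name and the identifier string `<MASK>` are hypothetical.

```python
def replaced_target_sentences(target_words, identifier="<MASK>"):
    """Return a list of (replaced_word, replaced_sentence) tuples.

    The i-th tuple holds the i-th word of the target sentence and a copy
    of the sentence in which only that word is replaced by `identifier`.
    """
    words = list(target_words)
    return [
        (word, words[:i] + [identifier] + words[i + 1:])
        for i, word in enumerate(words)
    ]


candidates = replaced_target_sentences(["I", "love", "Saihanba"])
for replaced_word, sentence in candidates:
    print(replaced_word, sentence)
```

A target sentence of n words therefore produces exactly n replaced target sentences, and the original word list is left unmodified.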
In an optional implementation manner, the second obtaining module includes:
a first obtaining unit, configured to obtain a semantic vector of the replaced source sentence;
a second obtaining unit, configured to obtain a semantic vector of each replaced target sentence respectively;
and an input unit, configured to, for any replaced target sentence, input the semantic vector of the replaced source sentence and the semantic vector of that replaced target sentence into a similarity calculation model, to obtain the semantic similarity between the replaced source sentence and that replaced target sentence as output by the similarity calculation model.
In an optional implementation manner, the first obtaining unit is specifically configured to: input the replaced source sentence into a pre-trained preset model for processing sentences, to obtain the semantic vector of the replaced source sentence output by an encoding layer of the preset model.
In an optional implementation manner, the second obtaining unit is specifically configured to: for any replaced target sentence, input the replaced target sentence into a pre-trained preset model for processing sentences, to obtain the semantic vector of the replaced target sentence output by an encoding layer of the preset model.
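Once semantic vectors are available, the comparison and selection could look like the sketch below. The application does not specify the similarity calculation model; cosine similarity is used here purely as one common choice, and the vectors are made-up numbers rather than the output of any encoding layer. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar_index(source_vector, target_vectors):
    """Index of the target vector most similar to the source vector."""
    scores = [cosine_similarity(source_vector, t) for t in target_vectors]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up vectors standing in for encoder outputs.
source_vector = [1.0, 0.0]
target_vectors = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
best_index = most_similar_index(source_vector, target_vectors)
print(best_index)
```

The returned index identifies which replaced target sentence to select, and therefore which target word was hidden in it.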
Because the apparatus embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
Fig. 3 is a block diagram of an electronic device 800 shown in the present application. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, images, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as a display and keypad of the electronic device 800. The sensor component 814 may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 4 is a block diagram of an electronic device 1900 shown in the present application. For example, the electronic device 1900 may be provided as a server.
Referring to fig. 4, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources, represented by memory 1932, for storing instructions executable by the processing component 1922, such as application programs. An application program stored in the memory 1932 may include one or more modules, each of which corresponds to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The information processing method and apparatus provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method and core idea of the present application. Meanwhile, for a person skilled in the art, the specific implementations and the scope of application may vary according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. An information processing method, characterized in that the method comprises:
obtaining a mutually translated sentence pair, wherein the sentence pair comprises a source sentence and a target sentence, and the source sentence and the target sentence are in different languages;
identifying a low-frequency proper noun in the source sentence;
replacing the low-frequency proper noun in the source sentence with a specific identifier to obtain a replaced source sentence;
sequentially replacing each vocabulary item in the target sentence with the specific identifier to obtain a plurality of replaced target sentences respectively;
obtaining the semantic similarity between the replaced source sentence and each replaced target sentence;
selecting the replaced target sentence with the highest semantic similarity to the replaced source sentence;
and combining the vocabulary replaced in the replaced source sentence and the vocabulary replaced in the selected replaced target sentence into a mutually translated vocabulary pair.
2. The method of claim 1, wherein the mutually translated sentence pair is one of a plurality of obtained mutually translated sentence pairs;
the identifying a low-frequency proper noun in the source sentence comprises:
segmenting the source sentence into words to obtain a plurality of vocabulary items included in the source sentence;
determining a proper noun among the plurality of vocabulary items based on named entity recognition;
counting the number of times the proper noun occurs in the plurality of mutually translated sentence pairs;
and determining the proper noun to be a low-frequency proper noun when the number of occurrences is lower than a preset threshold.
3. The method of claim 1, wherein the identifying a low-frequency proper noun in the source sentence comprises:
segmenting the source sentence into words to obtain a plurality of vocabulary items included in the source sentence;
determining a proper noun among the plurality of vocabulary items based on named entity recognition;
searching whether the proper noun exists in a preset word list;
splitting the proper noun into at least two word segments when the proper noun does not exist in the preset word list;
searching whether the at least two word segments exist in the preset word list;
and determining the proper noun to be a low-frequency proper noun when the at least two word segments exist in the preset word list.
4. The method of claim 1, wherein the replacing each vocabulary in the target sentence with the specific identifier in turn respectively obtains a plurality of replaced target sentences, comprising:
segmenting words of the target sentence to obtain a plurality of words included in the target sentence;
replacing a first vocabulary in the target sentence by using the specific identifier to obtain a replaced target sentence, replacing a second vocabulary in the target sentence by using the specific identifier to obtain a replaced target sentence, and so on, and replacing a last vocabulary in the target sentence by using the specific identifier to obtain a replaced target sentence.
5. The method of claim 1, wherein obtaining semantic similarity between the replaced source sentence and each replaced target sentence comprises:
acquiring semantic vectors of the source sentences after replacement, and respectively acquiring semantic vectors of each target sentence after replacement;
and for any replaced target sentence, inputting the semantic vector of the source sentence after replacement and the semantic vector of the target sentence after replacement into a similarity calculation model, and obtaining the semantic similarity between the source sentence after replacement and the target sentence after replacement, which is output by the similarity calculation model.
6. The method of claim 5, wherein obtaining the semantic vector of the replaced source sentence comprises:
inputting the replaced source sentence into a pre-trained preset model for processing sentences, to obtain the semantic vector of the replaced source sentence output by an encoding layer of the preset model.
7. The method of claim 5, wherein the obtaining the semantic vector of each replaced target sentence comprises:
for any replaced target sentence, inputting the replaced target sentence into a pre-trained preset model for processing sentences, to obtain the semantic vector of the replaced target sentence output by an encoding layer of the preset model.
8. An information processing apparatus characterized in that the apparatus comprises:
a first obtaining module, configured to obtain a mutually translated sentence pair, wherein the sentence pair comprises a source sentence and a target sentence, and the source sentence and the target sentence are in different languages;
an identification module, configured to identify a low-frequency proper noun in the source sentence;
a first replacement module, configured to replace the low-frequency proper noun in the source sentence with a specific identifier to obtain a replaced source sentence;
a second replacement module, configured to sequentially replace each vocabulary item in the target sentence with the specific identifier to obtain a plurality of replaced target sentences respectively;
a second obtaining module, configured to obtain the semantic similarity between the replaced source sentence and each replaced target sentence;
a selecting module, configured to select the replaced target sentence with the highest semantic similarity to the replaced source sentence;
and a combination module, configured to combine the vocabulary replaced in the replaced source sentence and the vocabulary replaced in the selected replaced target sentence into a mutually translated vocabulary pair.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the information processing method of any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the information processing method of any one of claims 1 to 7.
CN202110020630.5A 2021-01-07 2021-01-07 Information processing method and device Active CN112613327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110020630.5A CN112613327B (en) 2021-01-07 2021-01-07 Information processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110020630.5A CN112613327B (en) 2021-01-07 2021-01-07 Information processing method and device

Publications (2)

Publication Number Publication Date
CN112613327A true CN112613327A (en) 2021-04-06
CN112613327B CN112613327B (en) 2024-07-16

Family

ID=75253504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110020630.5A Active CN112613327B (en) 2021-01-07 2021-01-07 Information processing method and device

Country Status (1)

Country Link
CN (1) CN112613327B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180210878A1 (en) * 2017-01-26 2018-07-26 Samsung Electronics Co., Ltd. Translation method and apparatus, and translation system
WO2019114695A1 (en) * 2017-12-15 2019-06-20 腾讯科技(深圳)有限公司 Translation model-based training method, translation method, computer device and storage medium
CN110909552A (en) * 2018-09-14 2020-03-24 阿里巴巴集团控股有限公司 Translation method and device
CN111339788A (en) * 2020-02-18 2020-06-26 北京字节跳动网络技术有限公司 Interactive machine translation method, apparatus, device and medium
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
CN111539229A (en) * 2019-01-21 2020-08-14 波音公司 Neural machine translation model training method, neural machine translation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
梁珊;: "基于英语语义分析的智能算法研究", 微型电脑应用, no. 10, 31 October 2020 (2020-10-31) *
车万金;余正涛;郭军军;文永华;于志强;: "融入分类词典的汉越混合网络神经机器翻译集外词处理方法", 中文信息学报, no. 12, 31 December 2019 (2019-12-31) *

Also Published As

Publication number Publication date
CN112613327B (en) 2024-07-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant