CN109597988B - Cross-language vocabulary semantic prediction method and device and electronic equipment - Google Patents


Info

Publication number
CN109597988B
CN109597988B · Application CN201811288136.1A
Authority
CN
China
Prior art keywords
language
source language
word
target
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811288136.1A
Other languages
Chinese (zh)
Other versions
CN109597988A (en)
Inventor
孙茂松
岂凡超
林衍凯
朱昊
谢若冰
刘知远
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201811288136.1A
Publication of CN109597988A
Application granted
Publication of CN109597988B
Legal status: Active


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/279 — Recognition of textual entities
    • G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 — Semantic analysis

Abstract

The embodiment of the invention provides a cross-language lexical sememe prediction method and device and electronic equipment, wherein the method comprises the following steps: respectively determining loss functions for source-language and target-language word vector learning; respectively determining loss functions for word vector alignment and for fusing sememe information into the word vectors; selecting a certain number of source-language-word and target-language-word pairs with the same semantics based on the monolingual corpora of the source language and the target language; optimizing each loss function based on the word pairs and the sememe knowledge base already established for the source language, to obtain bilingual word vectors belonging to the same semantic space; and, on the basis of the bilingual word vectors, predicting sememes for a target-language word by looking up the labeled sememes of the source-language words whose vectors are close to the target word's vector. The embodiment of the invention can reasonably reuse an existing sememe knowledge base to predict sememes for cross-language vocabularies, thereby effectively saving the labor and time cost of sememe annotation.

Description

Cross-language vocabulary semantic prediction method and device and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of natural language processing, and in particular to a cross-language lexical sememe prediction method and apparatus and electronic equipment.
Background
In linguistics, a word is defined as the smallest meaningful unit that can be used independently, but it is not the smallest indivisible semantic unit. That is, words can be further subdivided into smaller semantic elements. For example, the word "man" can be further divided into "human", "male" and "adult".
The smallest indivisible semantic unit in human language is called a sememe, and any word or other semantic concept can be represented by a closed set of sememes. The introduction of sememes allows words to be analyzed at a finer granularity and helps in better understanding the nature of language. However, in most natural languages the sememes of a word are implicit rather than explicit; for a few languages, a sememe knowledge base has been constructed for words and other concepts by manual annotation, to support natural language processing tasks such as word similarity calculation, word sense disambiguation and sentiment analysis.
However, most languages have no established sememe knowledge base, so it is inconvenient to determine the sememes of words in these languages, which hinders further understanding and use of them. If the traditional manual sememe-annotation approach were adopted for these languages, high labor and time costs would be incurred; and since there is no one-to-one correspondence between the vocabularies of different languages (for example, the English word "beautiful" corresponds to more than one Chinese word), an established sememe knowledge base cannot simply be translated into other languages.
Disclosure of Invention
In order to overcome the above problems or at least partially solve them, embodiments of the present invention provide a cross-language lexical sememe prediction method, apparatus and electronic device, which reasonably reuse an existing sememe knowledge base to predict sememes for cross-language vocabularies, thereby effectively saving the labor and time cost of sememe annotation.
In a first aspect, an embodiment of the present invention provides a cross-language lexical sememe prediction method, including:
determining a first loss function for learning source-language word vectors from a monolingual corpus of the source language, and a second loss function for learning target-language word vectors from a monolingual corpus of the target language;
respectively determining a third loss function for aligning the source-language word vectors with the target-language word vectors, and a fourth loss function for fusing sememe information into the source-language word vectors;
selecting a certain number of source-language-word and target-language-word pairs with the same semantics based on the monolingual corpora of the source language and the target language;
optimizing and adjusting the first, second, third and fourth loss functions by stochastic gradient descent, based on the word pairs and the sememe knowledge base already established for the source language, to obtain bilingual word vectors belonging to the same semantic space, wherein the bilingual word vectors carry the semantic correspondence between the source-language and target-language word vectors and the fusion relation between the sememes and the source-language word vectors;
and, based on the bilingual word vectors, predicting sememes for a target-language word by looking up the labeled sememes of source-language words whose vectors are close to the target word's vector.
In a second aspect, an embodiment of the present invention provides a cross-language lexical sememe prediction apparatus, including:
a first setting module, configured to determine a first loss function for learning source-language word vectors from a monolingual corpus of the source language, and a second loss function for learning target-language word vectors from a monolingual corpus of the target language;
a second setting module, configured to respectively determine a third loss function for aligning the source-language word vectors with the target-language word vectors, and a fourth loss function for fusing sememe information into the source-language word vectors;
a training word-pair extraction module, configured to select a certain number of source-language-word and target-language-word pairs with the same semantics based on the monolingual corpora of the source language and the target language;
a vector alignment and fusion module, configured to optimize and adjust the first, second, third and fourth loss functions by stochastic gradient descent, based on the word pairs and the sememe knowledge base already established for the source language, to obtain bilingual word vectors belonging to the same semantic space, wherein the bilingual word vectors carry the semantic correspondence between the source-language and target-language word vectors and the fusion relation between the sememes and the source-language word vectors;
and a prediction output module, configured to predict sememes for a target-language word, based on the bilingual word vectors, by looking up the labeled sememes of source-language words whose vectors are close to the target word's vector.
In a third aspect, an embodiment of the present invention provides an electronic device, including: at least one memory, at least one processor, a communication interface, and a bus. The memory, the processor and the communication interface communicate with one another through the bus, and the communication interface is used for information transmission between the electronic device and source-language and target-language devices; the memory stores a computer program operable on the processor, and the processor, when executing the program, implements the cross-language lexical sememe prediction method of the first aspect.
According to the cross-language lexical sememe prediction method and apparatus and the electronic device, through processing steps such as monolingual word vector learning, cross-language word vector alignment, and fusing sememe information into the source-language word vectors, an existing sememe knowledge base can be reasonably reused to predict sememes for cross-language vocabularies. This effectively saves the labor and time cost of sememe annotation, facilitates the sememe-annotation work of linguistics experts, and helps construct sememe knowledge bases for other languages more quickly and better, so the method has good practicability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a cross-language lexical sememe prediction method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a cross-language lexical sememe prediction method according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a cross-language lexical sememe prediction apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without any creative efforts belong to the protection scope of the embodiments of the present invention.
In linguistics, the introduction of sememes allows finer-grained analysis of words, which helps in better understanding the nature of language. However, most languages have no established sememe knowledge base, so it is inconvenient to determine the sememes of words in these languages, which hinders further understanding and use of them. In order to solve the problem of the high labor and time cost of manually annotating sememes, the embodiment of the invention uses a computer to automatically annotate sememes for cross-language vocabularies so as to assist human experts in constructing a complete sememe knowledge base, which has practical significance. Embodiments of the present invention will be described and illustrated below with reference to various embodiments.
Fig. 1 is a flowchart illustrating a cross-language lexical sememe prediction method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
S101, determining a first loss function for learning source-language word vectors from the monolingual corpus of the source language, and a second loss function for learning target-language word vectors from the monolingual corpus of the target language.
It can be understood that, when performing cross-language sememe prediction, a prediction model needs to be established, which includes determining the loss functions of the model and training the constructed initial model. In this step, the loss functions for learning the source-language and target-language word vectors are set: the loss function for learning the source-language word vectors is set as the first loss function, and the loss function for learning the target-language word vectors as the second loss function.
S102, respectively determining a third loss function for aligning the source-language word vectors with the target-language word vectors, and a fourth loss function for fusing sememe information into the source-language word vectors.
It can be understood that, in the embodiment of the present invention, the word vectors of the two languages are aligned in one space through the semantic correspondence of part of the vocabulary of the source and target languages, so that words in different languages with similar semantics also satisfy the similarity property of word vectors. Meanwhile, for the source language, a sememe knowledge base of the vocabulary has been established in advance; from it, the labeled sememes of each source-language word can be queried, and this known sememe-annotation information is fused into the word vectors, so that words annotated with similar sememes have more similar word vectors.
It can be understood that, following this processing idea, this step determines the loss function for aligning the source-language word vectors with the target-language word vectors (the third loss function) and the loss function for fusing sememe information into the source-language word vectors (the fourth loss function), so that these loss functions can be used jointly for prediction in subsequent steps.
In addition, the purpose of learning the word vectors in this way is sememe prediction: words annotated with similar sememes should have similar vectors, so that the words found through the vector-similarity property when predicting the sememes of a target-language word also have similar sememes.
S103, selecting a certain number of source-language-word and target-language-word pairs with the same semantics based on the monolingual corpora of the source language and the target language.
It can be understood that, in order to train the established prediction model, word pairs with the same semantics are selected as training pairs based on the monolingual corpora of the source language and the target language. For example, words with the same meaning are selected from the vocabularies of the two monolingual corpora to form a bilingual seed dictionary; pairs of source-language and target-language words with the same semantics can then be obtained by querying the seed dictionary or translation software.
It can be understood that, for a source-language word selected from the source language, a target-language word with the same semantics can be selected from the vocabulary of the target-language corpus, and the two words form a source-language-word and target-language-word pair. For example, the English word with the same meaning as the Chinese word "吃" is "eat", so the two constitute such a pair. Selection proceeds by this rule until the number of selected pairs meets the requirement.
And S104, optimizing and adjusting the first, second, third and fourth loss functions by stochastic gradient descent, based on the source-language-word and target-language-word pairs and the sememe knowledge base established for the source language, to obtain bilingual word vectors belonging to the same semantic space, wherein the bilingual word vectors carry the semantic correspondence between the source-language and target-language word vectors and the fusion relation between the sememes and the source-language word vectors.
It can be understood that, in this step, on the basis of the processing in the above steps, stochastic gradient descent is adopted: the word vectors of source-language and target-language words with the same semantics are adjusted toward each other, and at the same time each source-language word vector is adjusted toward the vectors of the sememes annotated to the corresponding source-language word, thereby optimizing the combined loss function. Finally, the word vectors of source-language and target-language words with the same semantics are aligned into the same semantic space, forming bilingual word vectors fused with sememe information.
Meanwhile, in model training, distributed representations of monolingual words, i.e., low-dimensional word vectors, are learned from the monolingual corpora of the source and target languages respectively, so that the co-occurrence properties of words in the monolingual corpora are converted into a mathematical representation. It can be understood that, for both the source and the target language, the vocabulary of each language can be learned from its monolingual corpus; the vocabulary of the source language is called source-language words and, similarly, the vocabulary of the target language is called target-language words.
Moreover, whether in the source or the target language, each word corresponds to a word vector: a word vector learned from the monolingual corpus of the source language serves as a source-language word vector, and one learned from the monolingual corpus of the target language serves as a target-language word vector.
That is, the total loss function of the framework is obtained by adding the loss functions of the above steps, and bilingual word vectors belonging to the same semantic space are obtained after training this loss function by stochastic gradient descent. In the same semantic space, words with similar meanings have similar word vectors; for example, since "苹果" and "apple" have similar semantics, the cosine similarity of their word vectors should be large.
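The cosine-similarity check mentioned above can be sketched as follows; this is an illustrative example with invented toy vectors standing in for a pair of aligned bilingual embeddings, not code from the patent:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical aligned vectors for a Chinese word and its English counterpart:
w_zh = np.array([0.9, 0.1, 0.3])
w_en = np.array([0.8, 0.2, 0.25])
sim = cosine_similarity(w_zh, w_en)  # close to 1 when the words are aligned
```

Words with the same meaning should score near 1, unrelated words near 0.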
And S105, based on the bilingual word vectors, predicting sememes for a target-language word by looking up the labeled sememes of source-language words whose vectors are close to the target word's vector.
It can be understood that, in this step, having obtained the bilingual word vectors, the property that source-language and target-language words lie in the same vector space is exploited: the source-language word vectors close to the word vector of the target-language word are found, the corresponding source-language words are determined, and sememes are then predicted for the target word according to the sememes annotated to those source-language words.
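The prediction step can be sketched as follows: a hypothetical toy setup (vector values and sememe labels are invented for illustration) in which the k nearest source-language words vote for their annotated sememes, weighted by cosine similarity:

```python
import numpy as np

def predict_sememes(target_vec, src_vecs, src_sememes, k=3):
    """Score candidate sememes for a target word: each of the k nearest
    source-language words votes for its annotated sememes, weighted by
    its cosine similarity to the target word's vector."""
    norms = np.linalg.norm(src_vecs, axis=1) * np.linalg.norm(target_vec)
    sims = src_vecs @ target_vec / norms
    top = np.argsort(-sims)[:k]
    scores = {}
    for i in top:
        for s in src_sememes[i]:
            scores[s] = scores.get(s, 0.0) + float(sims[i])
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy setup (all values and sememe labels invented for illustration):
src_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
src_sememes = [["human", "male"], ["human", "adult"], ["fruit"]]
ranked = predict_sememes(np.array([1.0, 0.05]), src_vecs, src_sememes, k=2)
```

A sememe shared by several nearby source words accumulates a higher score, matching the intuition that words annotated with similar sememes have similar vectors.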
According to the cross-language lexical sememe prediction method provided by the embodiment of the invention, through processing steps such as monolingual word vector learning, cross-language word vector alignment, and fusing sememe information into the source-language word vectors, an existing sememe knowledge base can be reasonably reused to predict sememes for cross-language vocabularies. This effectively saves the labor and time cost of sememe annotation, facilitates the sememe-annotation work of linguistics experts, and helps construct sememe knowledge bases for other languages more quickly and better, so the method has good practicability.
Optionally, on the basis of the foregoing embodiments, the step of optimizing and adjusting the first loss function and the second loss function specifically includes:
learning the co-occurrence properties among source-language words with the first loss function based on the monolingual corpus of the source language, and converting these properties into low-dimensional real-valued distributed representations to form the source-language word vectors;
and learning the co-occurrence properties among target-language words with the second loss function based on the monolingual corpus of the target language, and converting these properties into low-dimensional real-valued distributed representations to form the target-language word vectors.
Specifically, for the source language, by using the co-occurrence properties between words in the monolingual corpus of the source language, each word can be converted into a low-dimensional real-valued distributed representation, i.e., a word vector (here, a source-language word vector), so that semantically similar words have similar word vectors. For example, the Skip-gram model in word2vec can be used to learn the co-occurrence properties between words in the monolingual corpus of the source language.
Similarly, a similar method can be adopted for the target language: the learning process is basically the same as that of the source-language word vectors, and a Skip-gram model can also be used; the difference is that the training corpus is the monolingual corpus of the target language, yielding the target-language word vectors.
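The co-occurrence properties that a Skip-gram model trains on are (center word, context word) pairs taken within a sliding window. A minimal sketch of how such pairs are enumerated from a monolingual corpus (illustrative only, not the patent's implementation):

```python
def skipgram_pairs(corpus, window=2):
    """Enumerate (center, context) co-occurrence pairs used as skip-gram
    training examples for monolingual word-vector learning.
    `corpus` is a list of tokenized sentences."""
    pairs = []
    for sent in corpus:
        for i, center in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, sent[j]))
    return pairs

# Toy corpus; with window=1 each word pairs only with its direct neighbors.
pairs = skipgram_pairs([["a", "b", "c"]], window=1)
```

In practice a library implementation (e.g. the Skip-gram mode of word2vec) would consume such pairs with negative sampling; the sketch shows only the co-occurrence extraction the text refers to.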
Optionally, on the basis of the foregoing embodiments, the step of determining the loss function for aligning the source-language word vectors with the target-language word vectors further includes:
forming a seed dictionary from the source-language-word and target-language-word pairs with the same semantics;
based on the seed dictionary, a third loss function is determined as follows:
$$\mathcal{L}_{seed} = \sum_{(w_s^S,\, w_t^T) \in D} \left\| \mathbf{w}_s^S - \mathbf{w}_t^T \right\|^2$$
where $w_s^S$ and $w_t^T$ denote a word in the source language and a word in the target language respectively, $\mathbf{w}_s^S$ the source-language word vector corresponding to $w_s^S$, $\mathbf{w}_t^T$ the target-language word vector corresponding to $w_t^T$, and $D$ the seed dictionary.
It can be understood that, in the embodiment of the present invention, when aligning the source-language and target-language word vectors, bilingual word vector alignment is performed based on a seed dictionary. Specifically, a number of translated word pairs of the two languages can conveniently be obtained using online translation software and the like; this set of word pairs is called the seed dictionary. It is only necessary to require that the two cross-language words in each pair of the seed dictionary have similar word vectors. For example, requiring "吃" and "eat" to have similar word vectors can make all words of Chinese and English satisfy the similarity property.
When performing vector alignment training, one only needs to adjust the word vectors $\mathbf{w}_s^S$ and $\mathbf{w}_t^T$ to minimize the value of the third loss function, which brings $\mathbf{w}_s^S$ and $\mathbf{w}_t^T$ as close as possible and achieves vector alignment.
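One stochastic-gradient step on an alignment loss of the form $\sum \|\mathbf{w}_s - \mathbf{w}_t\|^2$ over seed-dictionary pairs can be sketched as follows (a simplified illustration with toy vectors; the patent does not specify this exact implementation):

```python
import numpy as np

def align_step(W_s, W_t, pairs, lr=0.1):
    """One stochastic-gradient step on the seed-dictionary alignment loss
    sum ||w_s - w_t||^2 over the index pairs (s, t) in the seed dictionary.
    Both vectors in each pair are pulled toward each other."""
    for s, t in pairs:
        grad = 2.0 * (W_s[s] - W_t[t])   # gradient of the squared distance
        W_s[s] -= lr * grad
        W_t[t] += lr * grad
    return W_s, W_t

# Toy embedding matrices (one word each); the pair (0, 0) is in the dictionary.
W_s = np.array([[1.0, 0.0]])
W_t = np.array([[0.0, 1.0]])
before = np.linalg.norm(W_s[0] - W_t[0])
align_step(W_s, W_t, [(0, 0)])
after = np.linalg.norm(W_s[0] - W_t[0])
```

Each step shrinks the distance between the paired vectors (here by a factor of 0.6 with lr=0.1), which is exactly the minimization the text describes.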
For example, as shown in Fig. 2, a flowchart of a cross-language lexical sememe prediction method according to another embodiment of the present invention: a monolingual corpus exists for each of the source and target languages, and a seed dictionary is established using online translation software and the like. The source-language and target-language words are aligned into the same semantic space based on the seed dictionary, and sememe prediction for target words is realized on this basis.
On the basis of the foregoing embodiments, after the step of determining the third loss function based on the seed dictionary, the method of an embodiment of the present invention further includes:
respectively setting, for each target-language word, a retrieval index matching it to a source-language word with the same semantics, and forming a hidden vector from the retrieval indexes corresponding to the target-language words;
based on the hidden vector, the monolingual corpus of the source language and the monolingual corpus of the target language, a fifth loss function is determined as follows:
$$\mathcal{L}_{T \to S} = -\sum_{w_t^T \in V^T} c\!\left(w_t^T\right) \log P\!\left(w_t^T \mid w_{m_t}^S\right)$$
wherein
$$P\!\left(w_t^T \mid w_{m_t}^S\right) = \frac{\exp\!\left(\mathbf{w}_t^T \cdot \mathbf{w}_{m_t}^S\right)}{\sum_{w^T \in V^T} \exp\!\left(\mathbf{w}^T \cdot \mathbf{w}_{m_t}^S\right)}$$
where $C^S$ and $C^T$ denote the monolingual corpora of the source and target languages respectively, $m$ the hidden vector, $|V^T|$ the number of retrieval indexes, $m_t$ a retrieval index, $w_t^T$ a target-language word in $C^T$, $c(w_t^T)$ the number of times $w_t^T$ occurs in $C^T$, and $w_{m_t}^S$ the source-language word in $C^S$ matched with $w_t^T$.
It can be understood that, in order to further align the target-language word vectors with the source-language word vectors, bilingual word vector alignment based on a matching mechanism is performed on the basis of the above embodiments. Specifically, assuming that each target-language word has a source-language word matched to it, a hidden variable $m_t$ is used to represent the index of the source-language word matched to the target-language word $w_t^T$; that is, $w_{m_t}^S$ is the source-language word matched to $w_t^T$. A hidden vector $m$ whose length is the target-language vocabulary size $|V^T|$ is thereby obtained, with the hidden variables $m_t$ as its components, and the fifth loss function is then computed from it, where $V^T$ denotes the vocabulary of the target language.
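Under such a matching mechanism, the affinity between a target word and each candidate source word can be modeled with a softmax over dot products of their vectors. A hedged numpy sketch (illustrative, with invented toy vectors; the matched index is the argmax):

```python
import numpy as np

def match_probability(w_t, W_s):
    """Softmax distribution over source-language words for a target-word
    vector w_t: higher dot product => higher matching probability."""
    scores = W_s @ w_t
    scores -= scores.max()          # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy source embedding matrix (two words) and a target vector near word 0.
W_s = np.array([[1.0, 0.0], [0.0, 1.0]])
p = match_probability(np.array([0.9, 0.1]), W_s)
matched_index = int(np.argmax(p))   # plays the role of the hidden variable m_t
```

The argmax of this distribution plays the role of the hidden retrieval index, and the probabilities feed the log-likelihood term of the matching loss.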
Similarly, assuming that each source-language word also has a target-language word matched to it, the resulting loss function has a form similar to the above formula. That is, correspondingly, a sixth loss function for the source-language words is determined for each source-language word by the corresponding processing flow, which is not repeated here.
Correspondingly, the step of optimizing and adjusting the third and fourth loss functions specifically includes: performing a weighted summation of the third, fifth and sixth loss functions to obtain a combined loss function, and optimizing and adjusting the combined loss function together with the fourth loss function.
Optionally, on the basis of the foregoing embodiments, the step of determining the fourth loss function for fusing sememe information into the source-language word vectors specifically includes:
searching for near-synonymous source-language words based on the sememe knowledge base, where near-synonymous source-language words are source-language words whose number of common sememes reaches a preset threshold;
and modifying the source-language word vectors corresponding to the near-synonymous source-language words, determining the fourth loss function shown below so as to fuse sememe information into these source-language word vectors:
$$\mathcal{L}_{syn} = \sum_{i=1}^{|V^S|} \left[ \alpha_i \left\| \hat{\mathbf{w}}_i^S - \bar{\mathbf{w}}_i^S \right\|^2 + \sum_{j:(i,j) \in \mathrm{syn}} \beta_{ij} \left\| \hat{\mathbf{w}}_i^S - \hat{\mathbf{w}}_j^S \right\|^2 \right]$$
where $\bar{\mathbf{w}}_i^S$ represents a source-language word vector before modification, $\hat{\mathbf{w}}_i^S$ the modified source-language word vector, $\mathrm{syn}$ the set of near-synonym pairs of source-language words, and $\alpha_i$ and $\beta_{ij}$ hyper-parameters.
It can be understood that the embodiment of the present invention fuses sememe information into the word vectors based on the similarity relations between words. Specifically, source-language words whose number of common sememes reaches a preset threshold are called near-synonymous source-language words, or "near-synonyms". For example, source-language words sharing more than two sememes are considered "near-synonyms". The entire source-language vocabulary is first searched to find all the "near-synonyms" of each source-language word, and then the word vectors of these source-language words are modified so that the word vectors of "near-synonyms" become closer.
Then, when the vector fusion training is performed, the modified vector $\hat{\mathbf{w}}_i^{S}$ is adjusted so that $\hat{\mathbf{w}}_i^{S}$ and $\mathbf{w}_i^{S}$ remain as similar as possible, that is, the word vector of a word after modification stays as close as possible to its word vector before modification. In addition, $\hat{\mathbf{w}}_i^{S}$ and $\hat{\mathbf{w}}_j^{S}$ are adjusted so that the word vectors of two near-synonyms are also as close as possible.
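The fusion objective of this retrofitting style can be illustrated numerically. The following fragment is only a minimal NumPy sketch under assumed toy data (the vectors, neighbor sets, and hyper-parameter values are invented for illustration and are not part of the embodiment): it computes a loss that pulls each modified vector toward its original vector and toward the vectors of its near-synonyms.

```python
import numpy as np

def fusion_loss(w_orig, w_mod, neighbors, alpha, beta):
    """Sum of alpha-weighted distances to the original vectors plus
    beta-weighted distances between near-synonym vectors."""
    loss = 0.0
    for i in range(len(w_orig)):
        loss += alpha[i] * np.sum((w_mod[i] - w_orig[i]) ** 2)
        for j in neighbors[i]:  # near-synonyms: words sharing enough common sememes
            loss += beta[(i, j)] * np.sum((w_mod[i] - w_mod[j]) ** 2)
    return loss

# hypothetical 2-dimensional vectors for three source language words;
# words 0 and 1 are near-synonyms, word 2 has no near-synonym
w_orig = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])
w_mod = w_orig.copy()
neighbors = {0: [1], 1: [0], 2: []}
alpha = {0: 1.0, 1: 1.0, 2: 1.0}
beta = {(0, 1): 0.5, (1, 0): 0.5}

print(fusion_loss(w_orig, w_mod, neighbors, alpha, beta))  # prints 0.08
```

With unmodified vectors the alpha terms vanish, so the remaining loss is exactly the beta-weighted gap between the two near-synonyms; gradient descent on this objective would shrink that gap while the alpha terms resist drifting too far from the original vectors.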
On the basis of the foregoing embodiments, the step of determining the fourth loss function for fusing sememe information into the source language word vectors specifically includes:

constructing a word-sememe co-occurrence matrix $\mathbf{M}^{S}$ based on the sememe knowledge base, wherein an element $M^{S}_{sj}$ of the word-sememe co-occurrence matrix takes the value 1 if the sememe $x_j$ is annotated to the source language word $w_s^{S}$, and otherwise indicates that no annotation is made;

and determining a fourth loss function shown below by factorizing the word-sememe co-occurrence matrix, so as to fuse the sememe information into the source language word vectors:

$$\mathcal{L}_4=\sum_{w_s^{S}}\sum_{x_j\in X}\big(\mathbf{w}_s^{S}\cdot\mathbf{x}_j+b_s+b'_j-M^{S}_{sj}\big)^2$$

in the formula, $X$ represents the set of sememes, $w_s^{S}$ represents a source language word, $\mathbf{w}_s^{S}$ represents the source language word vector corresponding to $w_s^{S}$, $\mathbf{x}_j$ represents the sememe vector of the sememe $x_j$, and $b_s$ and $b'_j$ respectively represent the biases of the source language word vector $\mathbf{w}_s^{S}$ and the sememe vector $\mathbf{x}_j$.
It is to be understood that this embodiment of the present invention fuses into the word vectors the semantic information represented by the sememes. Specifically, while the word vectors are obtained by factorizing the word-sememe co-occurrence matrix, the word vectors can be modified synchronously, so that the sememe annotation relations between words are embodied in the word vectors.
For the word-sememe co-occurrence matrix $\mathbf{M}^{S}$, an element $M^{S}_{sj}$ takes the value 1 if the sememe $x_j$ is annotated to the source language word $w_s^{S}$, and otherwise indicates no annotation. By factorizing $\mathbf{M}^{S}$, the fourth loss function described above may be defined. The information implied by this loss function is that if the source language word $w_s^{S}$ is annotated with the sememe $x_j$, the dot product of the two vectors should be as close to 1 as possible, and otherwise as close to 0 as possible. What the training process adjusts are the source language word vector $\mathbf{w}_s^{S}$ corresponding to $w_s^{S}$ and the sememe vector $\mathbf{x}_j$ corresponding to the sememe $x_j$.
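The factorization objective can be sketched in a few lines. The NumPy fragment below is an illustration only, with invented dimensions and an all-zero initialization (not the claimed implementation): it computes the squared error between the reconstructed entries $\mathbf{w}_s^{S}\cdot\mathbf{x}_j+b_s+b'_j$ and the observed entries of a small word-sememe co-occurrence matrix.

```python
import numpy as np

def factorization_loss(W, X, b_w, b_x, M):
    """Squared error between the reconstruction w_s . x_j + b_s + b'_j
    and every entry M[s, j] of the word-sememe co-occurrence matrix."""
    pred = W @ X.T + b_w[:, None] + b_x[None, :]
    return float(np.sum((pred - M) ** 2))

# hypothetical toy data: 2 source words, 3 sememes, 2-dimensional vectors
W = np.zeros((2, 2))          # source language word vectors
X = np.zeros((3, 2))          # sememe vectors
b_w = np.zeros(2)             # word biases b_s
b_x = np.zeros(3)             # sememe biases b'_j
M = np.array([[1, 0, 1],      # M[s, j] = 1 iff sememe j annotates word s
              [0, 1, 0]], dtype=float)

print(factorization_loss(W, X, b_w, b_x, M))  # prints 3.0
```

With all parameters at zero, every annotated entry (value 1) contributes an error of 1, so the loss equals the number of annotations; training would push the dot products toward the 0/1 pattern of the matrix, which is exactly how the annotation relations become embodied in the vectors.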
Based on the bilingual word vectors, sememe prediction for a target word can be carried out by searching for the annotated sememes of the source language words whose vectors are close to the target word vector in the target language, according to the following processing flow:

searching for source language words whose word vectors are similar to the target word vector based on the bilingual word vectors, and determining all the annotated sememes of these source language words; for each such sememe, calculating the score of the sememe from the similarity between the source language word vectors of the source language words annotated with the sememe and the target word vector; and selecting the sememes whose scores are higher than a set threshold value as the sememes of the target word.
Optionally, on the basis of the above embodiments, the step of calculating the score of a sememe further includes:

for the sememe $x_j$, calculating the score of the sememe using the following formula:

$$P_{x_j}\big(w_t^{T}\big)=\sum_{w_s^{S}}\cos\big(\mathbf{w}_s^{S},\mathbf{w}_t^{T}\big)\cdot c^{\,r_s}\cdot M^{S}_{sj}$$

in the formula, $P_{x_j}(w_t^{T})$ represents the score of the sememe $x_j$ for the target word $w_t^{T}$; $\mathbf{w}_s^{S}$ and $\mathbf{w}_t^{T}$ respectively represent the source language word vector of a source language word annotated with $x_j$ and the target language word vector; $M^{S}_{sj}$ represents the annotation relation between $x_j$ and $w_s^{S}$, i.e., the corresponding element of the word-sememe co-occurrence matrix; $r_s$ represents the serial number of $w_s^{S}$ when the source language words are arranged in descending order of cosine similarity to $\mathbf{w}_t^{T}$; and $c$ represents a declining-weight hyper-parameter.
It can be understood that, when performing sememe prediction for the target word, the prediction can be made based on the word vector relations obtained in the above embodiments. Specifically, the sememes of the current target word are predicted using the sememe annotations of the source language words near the target language word in the shared semantic space. That is, each sememe is scored according to the cosine similarity between the target word vector and the word vectors of the source language words annotated with that sememe. Finally, the sememes predicted by the system for the target word $w_t^{T}$ are the sememes whose scores are above a certain threshold.
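The scoring flow can be sketched as follows. This NumPy fragment is an illustration, not the claimed implementation: it scores each sememe for a target word by summing the cosine similarities of the source language words annotated with that sememe, weighted by an assumed rank-based decay constant `c`; the vectors, matrix, and value of `c` are all invented for the example.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_sememes(w_t, W_src, M, c=0.8):
    """Score every sememe j for the target vector w_t.

    W_src: source language word vectors (n_words x dim)
    M:     word-sememe co-occurrence matrix (n_words x n_sememes)
    """
    sims = np.array([cosine(w_t, w) for w in W_src])
    ranks = np.argsort(np.argsort(-sims))  # rank 0 = most similar source word
    weights = sims * (c ** ranks)          # similarity with rank-based decay
    return weights @ M                     # one accumulated score per sememe

W_src = np.array([[1.0, 0.0], [0.0, 1.0]])
M = np.array([[1.0, 0.0],   # source word 0 is annotated with sememe 0
              [0.0, 1.0]])  # source word 1 is annotated with sememe 1
w_t = np.array([1.0, 0.0])  # target vector identical to source word 0
scores = score_sememes(w_t, W_src, M)
print(scores)  # sememe 0 outscores sememe 1
```

Sememes whose scores exceed a chosen threshold would then be output as the predicted sememes of the target word; the double `argsort` is a standard trick for turning similarities into descending-order ranks.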
The cross-language lexical sememe prediction method provided by the embodiment of the invention can effectively predict suitable sememes for words of another language, helps linguistic experts with sememe annotation, and thus enables sememe knowledge bases to be constructed for other languages faster and better, and has good practicability.
As another aspect of the embodiments of the present invention, the embodiments of the present invention provide, according to the above embodiments, a cross-language lexical sememe prediction apparatus, which is used to implement the cross-language lexical sememe prediction in the above embodiments. Therefore, the descriptions and definitions in the cross-language lexical sememe prediction method of the embodiments described above can be used for understanding the execution modules in the embodiments of the present invention; specific reference may be made to the embodiments described above, which are not repeated herein.
According to an embodiment of the present invention, a structure of a cross-language vocabulary semantic prediction apparatus is shown in fig. 3, which is a schematic structural diagram of a cross-language vocabulary semantic prediction apparatus provided in an embodiment of the present invention, and the apparatus may be used to implement cross-language vocabulary semantic prediction in the above embodiments of the method, and the apparatus includes a first setting module 301, a second setting module 302, a training word pair extracting module 303, a vector alignment fusion module 304, and a prediction output module 305. Wherein:
the first setting module 301 is configured to determine a first loss function for learning source language word vectors from the monolingual corpus of the source language, and determine a second loss function for learning target language word vectors from the monolingual corpus of the target language; the second setting module 302 is configured to determine a third loss function for aligning the source language word vectors with the target language word vectors, and a fourth loss function for fusing sememe information into the source language word vectors; the training word pair extraction module 303 is configured to select a certain number of pairs of source language words and target language words having the same semantics based on the monolingual corpora of the source language and the target language; the vector alignment fusion module 304 is configured to optimize and adjust the first loss function, the second loss function, the third loss function and the fourth loss function by a stochastic gradient descent method, based on the source language word and target language word pairs and a sememe knowledge base established in the source language, so as to obtain bilingual word vectors belonging to the same semantic space, wherein the bilingual word vectors carry the semantic correspondence between the source language word vectors and the target language word vectors and the fusion relation between the sememes and the source language word vectors; the prediction output module 305 is configured to perform sememe prediction for a target word by searching, based on the bilingual word vectors, for the annotated sememes of the source language words that are close to the target word vector in the target language.
Specifically, when performing cross-language sememe prediction, a sememe prediction model needs to be established, which includes determining the loss functions of the sememe prediction model, training the constructed initial model, and the like. The first setting module 301 sets the loss functions for learning the source language word vectors and the target language word vectors in the sememe prediction model; that is, the loss function for learning the source language word vectors is set as the first loss function, and the loss function for learning the target language word vectors is set as the second loss function.
All the word vectors of the two languages are aligned in one space through the semantic correspondence of some of the words in the source language and the target language, so that words that belong to different languages but have similar semantics also have similar word vectors. Meanwhile, for the source language, a sememe knowledge base of the vocabulary is established in advance; the annotated sememes of each source language word can then be queried from the sememe knowledge base, and this known sememe annotation information of the source language words is fused into the word vectors, so that words annotated with similar sememes have more similar word vectors.
Therefore, for the overall prediction, the second setting module 302 determines, according to the processing idea described above, a loss function for aligning the source language word vectors with the target language word vectors, namely the third loss function, and a loss function for fusing sememe information into the source language word vectors, namely the fourth loss function, so that these loss functions can be used for the overall prediction in the subsequent steps.
In order to train the established sememe prediction model, the training word pair extraction module 303 selects pairs of source language words and target language words having the same semantics based on the monolingual corpora of the source language and the target language. For example, corresponding words with the same semantics are selected from the vocabularies of the monolingual corpora of the source language and the target language to form a bilingual seed dictionary, and pairs of source language words and target language words with the same semantics can then be obtained by querying the seed dictionary or translation software.
It can be understood that, for a source language word selected from the source language, the training word pair extraction module 303 may select from the vocabulary of the target language corpus a target language word having the same semantics as the source language word, and these two words with the same semantics form a source language word and target language word pair. The training word pair extraction module 303 selects according to this rule until the number of selected source language word and target language word pairs meets the requirement.
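The pair-extraction step above can be sketched as a simple dictionary lookup. The fragment below uses invented vocabularies and an invented two-entry seed dictionary purely for illustration; in practice the bilingual seed dictionary or translation software mentioned above would supply the mapping.

```python
def extract_word_pairs(src_vocab, tgt_vocab, seed_dict, limit):
    """Collect (source word, target word) pairs with the same semantics,
    stopping once the required number of pairs has been reached."""
    pairs = []
    for src_word in src_vocab:
        tgt_word = seed_dict.get(src_word)
        if tgt_word in tgt_vocab:      # keep only pairs attested in the target corpus
            pairs.append((src_word, tgt_word))
        if len(pairs) >= limit:
            break
    return pairs

# hypothetical vocabularies and Chinese-English seed dictionary
src_vocab = ["苹果", "水", "山"]
tgt_vocab = {"apple", "water", "mountain"}
seed_dict = {"苹果": "apple", "水": "water"}
print(extract_word_pairs(src_vocab, tgt_vocab, seed_dict, limit=2))
```

The `limit` parameter plays the role of the "certain number" of training pairs required by the module; words absent from the seed dictionary are simply skipped.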
Then, the vector alignment fusion module 304 optimizes and adjusts the combined loss function using a stochastic gradient descent method, by adjusting the word vectors with the same semantics in the source language and the target language, namely the source language word vectors and the target language word vectors, and simultaneously adjusting the vectors of the sememes annotated to each source language word, finally aligning the word vectors of the source language words and the target language words with the same semantics into the same semantic space to form bilingual word vectors fused with the sememe information.
Meanwhile, the vector alignment fusion module 304 may learn distributed representations of the monolingual words, i.e., low-dimensional word vectors, from the monolingual corpus of the source language and the target language, respectively, so as to convert the co-occurrence properties of words and phrases in the monolingual corpus into mathematical representations. It is understood that for the source language and the target language, the vector alignment fusion module 304 can learn different words corresponding to the language from the monolingual corpus thereof, and the included words corresponding to the source language can be referred to as source language words, and similarly, the included words corresponding to the target language can be referred to as target language words.
Moreover, no matter the source language or the target language, the vector alignment fusion module 304 may form a word vector for each vocabulary, and then the word vector learned from the monolingual corpus of the source language may be used as the source language word vector, and the word vector learned from the monolingual corpus of the target language may be used as the target language word vector.
The total loss function of the framework can be obtained by adding the loss functions of the above steps, and the bilingual word vectors belonging to the same semantic space can be obtained after training with this total loss function using a stochastic gradient descent method. Bilingual word vectors lie in the same semantic space in the sense that words with similar meanings have similar word vectors.
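The idea of summing the component losses into one total objective and minimizing it by gradient descent can be sketched abstractly. The fragment below is framework-free and purely illustrative (the four toy quadratic losses, the learning rate, and the finite-difference gradient stand in for the actual loss terms and their analytic gradients; none of this is the patented formulation).

```python
def total_loss(theta, loss_fns):
    """Total framework loss: the sum of the component loss functions."""
    return sum(f(theta) for f in loss_fns)

def sgd_step(theta, loss_fns, lr=0.1, eps=1e-6):
    """One descent step using a central finite-difference gradient."""
    grad = (total_loss(theta + eps, loss_fns)
            - total_loss(theta - eps, loss_fns)) / (2 * eps)
    return theta - lr * grad

# toy stand-ins for the four loss terms, each pulling theta toward a value
loss_fns = [lambda t, a=a: (t - a) ** 2 for a in (0.0, 1.0, 2.0, 3.0)]

theta = 10.0
for _ in range(100):
    theta = sgd_step(theta, loss_fns)
print(round(theta, 3))  # converges near 1.5, the minimizer of the summed losses
```

The minimizer of the summed quadratics is their average; in the same way, minimizing the summed first-to-fourth loss functions trades off monolingual fit, cross-lingual alignment, and sememe fusion in a single parameter space.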
Finally, on the basis of the obtained bilingual word vectors, the prediction output module 305, using the property that the source language words and the target language words lie in the same vector space, searches for the source language word vectors close to the target language word vector of a target language word, determines the source language words corresponding to these source language word vectors, and performs sememe prediction for the target word according to the sememes annotated to these source language words.
According to the cross-language lexical sememe prediction apparatus provided by the embodiment of the invention, by arranging the corresponding execution modules and through processing steps such as monolingual word vector learning, cross-language word vector alignment, and fusing sememe information into the source language word vectors, a sememe knowledge base can be reasonably utilized to perform cross-language lexical sememe prediction, which effectively saves the labor and time cost of sememe annotation, helps linguistic experts with sememe annotation, and thus enables sememe knowledge bases to be constructed for other languages faster and better, and has good practicability.
It is understood that, in the embodiment of the present invention, each relevant program module in the apparatus of each of the above embodiments may be implemented by a hardware processor. Moreover, when the cross-language lexical sememe prediction apparatus of the embodiment of the present invention implements the cross-language lexical sememe prediction of the above method embodiments, the beneficial effects produced are the same as those of the corresponding method embodiments; reference may be made to the method embodiments described above, which are not described herein again.
As another aspect of the embodiment of the present invention, the embodiment provides an electronic device according to the above embodiment, and with reference to fig. 4, a schematic physical structure diagram of the electronic device provided in the embodiment of the present invention includes: at least one memory 401, at least one processor 402, a communication interface 403, and a bus 404.
Wherein, the memory 401, the processor 402 and the communication interface 403 are communicated with each other through the bus 404, and the communication interface 403 is used for information transmission between the electronic device and the source language device and the target language device; the memory 401 stores a computer program that can be executed on the processor 402, and when the processor 402 executes the computer program, the cross-language lexical semantic prediction method according to the above-described embodiment is implemented.
It is understood that the electronic device at least includes a memory 401, a processor 402, a communication interface 403 and a bus 404, and the memory 401, the processor 402 and the communication interface 403 are connected in communication with each other through the bus 404, and can complete communication with each other, for example, the processor 402 reads program instructions of a cross-language lexical-semantic prediction method from the memory 401. In addition, the communication interface 403 may also implement communication connection between the electronic device and the source language device and the target language device, and may complete mutual information transmission, such as cross-language lexical-semantic prediction via the communication interface 403.
When the electronic device is running, the processor 402 calls the program instructions in the memory 401 to perform the methods provided by the above method embodiments, including, for example: determining a first loss function for learning source language word vectors from the monolingual corpus of the source language, and determining a second loss function for learning target language word vectors from the monolingual corpus of the target language; respectively determining a third loss function for aligning the source language word vectors with the target language word vectors and a fourth loss function for fusing sememe information into the source language word vectors; selecting a certain number of pairs of source language words and target language words with the same semantics based on the monolingual corpora of the source language and the target language; optimizing and adjusting the first loss function, the second loss function, the third loss function and the fourth loss function by a stochastic gradient descent method, based on the source language word and target language word pairs and a sememe knowledge base established in the source language, to obtain bilingual word vectors belonging to the same semantic space, wherein the bilingual word vectors carry the semantic correspondence between the source language word vectors and the target language word vectors and the fusion relation between the sememes and the source language word vectors; and, based on the bilingual word vectors, performing sememe prediction for the target word by searching for the annotated sememes of the source language words close to the target word vector in the target language; and the like.
The program instructions in the memory 401 may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Alternatively, all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, where the program may be stored in a computer-readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the cross-language lexical sememe prediction method according to the above embodiments, including, for example: determining a first loss function for learning source language word vectors from the monolingual corpus of the source language, and determining a second loss function for learning target language word vectors from the monolingual corpus of the target language; respectively determining a third loss function for aligning the source language word vectors with the target language word vectors and a fourth loss function for fusing sememe information into the source language word vectors; selecting a certain number of pairs of source language words and target language words with the same semantics based on the monolingual corpora of the source language and the target language; optimizing and adjusting the first loss function, the second loss function, the third loss function and the fourth loss function by a stochastic gradient descent method, based on the source language word and target language word pairs and a sememe knowledge base established in the source language, to obtain bilingual word vectors belonging to the same semantic space, wherein the bilingual word vectors carry the semantic correspondence between the source language word vectors and the target language word vectors and the fusion relation between the sememes and the source language word vectors; and, based on the bilingual word vectors, performing sememe prediction for the target word by searching for the annotated sememes of the source language words close to the target word vector in the target language; and the like.
According to the electronic device and the non-transitory computer-readable storage medium provided by the embodiment of the invention, through processing steps such as monolingual word vector learning, cross-language word vector alignment, and fusing sememe information into the source language word vectors, cross-language lexical sememe prediction can be reasonably performed using a sememe knowledge base, which effectively saves the labor and time cost of sememe annotation, helps linguistic experts with sememe annotation, and enables sememe knowledge bases to be constructed for other languages faster and better, and has good practicability.
It is to be understood that the above-described embodiments of the apparatus, the electronic device and the storage medium are merely illustrative, and that elements described as separate components may or may not be physically separate, may be located in one place, or may be distributed on different network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the technical solutions mentioned above may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a usb disk, a removable hard disk, a ROM, a RAM, a magnetic or optical disk, etc., and includes several instructions for causing a computer device (such as a personal computer, a server, or a network device, etc.) to execute the methods described in the method embodiments or some parts of the method embodiments.
In addition, it should be understood by those skilled in the art that in the specification of the embodiments of the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the description of the embodiments of the invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
However, the disclosed method should not be interpreted as reflecting an intention that: that is, the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of an embodiment of this invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, and not to limit the same; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A cross-language lexical sememe prediction method, comprising:
determining a first loss function for learning source language word vectors from the monolingual corpus of the source language, and determining a second loss function for learning target language word vectors from the monolingual corpus of the target language;

respectively determining a third loss function for aligning the source language word vectors with the target language word vectors and a fourth loss function for fusing sememe information into the source language word vectors;

selecting a certain number of pairs of source language words and target language words with the same semantics based on the monolingual corpora of the source language and the target language;

optimizing and adjusting the first loss function, the second loss function, the third loss function and the fourth loss function by a stochastic gradient descent method, based on the source language word and target language word pairs and a sememe knowledge base established in the source language, to obtain bilingual word vectors belonging to the same semantic space, wherein the bilingual word vectors carry the semantic correspondence between the source language word vectors and the target language word vectors and the fusion relation between the sememes and the source language word vectors;

and, based on the bilingual word vectors, performing sememe prediction for the target word by searching for the annotated sememes of the source language words close to the target word vector in the target language.
2. The method according to claim 1, wherein the step of optimally adjusting the first loss function and the second loss function specifically comprises:
learning, based on the monolingual corpus of the source language, the co-occurrence properties among the source language words using the first loss function, and converting these properties into low-dimensional real-valued distributed representations to form the source language word vectors;

and learning, based on the monolingual corpus of the target language, the co-occurrence properties among the target language words using the second loss function, and converting these properties into low-dimensional real-valued distributed representations to form the target language word vectors.
3. The method of claim 1, wherein the step of determining a third loss function that aligns the source-language word vector with the target-language word vector specifically comprises:
forming a seed dictionary from the pairs of source language words and target language words with the same semantics;

based on the seed dictionary, determining a third loss function as shown below:

$$\mathcal{L}_3=\sum_{(w_s^{S},\,w_t^{T})\in D}\big\|\mathbf{w}_s^{S}-\mathbf{w}_t^{T}\big\|^2$$

in the formula, $w_s^{S}$ and $w_t^{T}$ represent a source language word and a target language word, respectively, $\mathbf{w}_s^{S}$ represents the source language word vector corresponding to $w_s^{S}$, $\mathbf{w}_t^{T}$ represents the target language word vector corresponding to $w_t^{T}$, and $D$ represents the seed dictionary.
4. The method of claim 3, further comprising, after the step of determining the third loss function based on the seed dictionary:
setting, for each target language word, a retrieval index matching it to a source language word on the basis of the same semantics, and forming a hidden vector from the retrieval indexes corresponding to the target language words;
determining, based on the hidden vector, the monolingual corpus of the source language and the monolingual corpus of the target language, a fifth loss function as shown below:

Figure FDA0002405435770000022

wherein

Figure FDA0002405435770000023

in the formulas, $C^{S}$ and $C^{T}$ respectively represent the monolingual corpora of the source language and the target language; $\mathbf{m}$ represents the hidden vector, and $|V^{T}|$ represents the number of retrieval indexes; $m_t$ represents a retrieval index; $w_t^{T}$ represents a target language word in $C^{T}$, and $c(w_t^{T})$ represents the number of times $w_t^{T}$ occurs in $C^{T}$; and $w_{m_t}^{S}$ represents the source language word in $C^{S}$ matched with $w_t^{T}$;
correspondingly, determining, for the source language words, a sixth loss function by the corresponding processing flow with the roles of the source language and the target language exchanged;

correspondingly, the step of optimally adjusting the third loss function and the fourth loss function specifically includes:

performing weighted summation on the third loss function, the fifth loss function and the sixth loss function to obtain a combined loss function, and performing optimization and adjustment on the combined loss function and the fourth loss function.
5. The method of claim 1, wherein the step of determining a fourth loss function for incorporating semantic information into the source language word vector specifically comprises:
searching for near-synonymous source language words in the source language based on the sememe knowledge base, wherein the near-synonymous source language words are source language words whose number of shared sememes reaches a preset threshold;

modifying the source language word vectors corresponding to the near-synonymous source language words, and determining a fourth loss function as shown below, so as to fuse semantic information into the source language word vectors corresponding to the near-synonymous source language words:

$$\mathcal{L}_4 = \sum_i \left( \alpha_i \left\| \mathbf{w}_i^S - \hat{\mathbf{w}}_i^S \right\|^2 + \sum_j \beta_{ij} \left\| \mathbf{w}_i^S - \mathbf{w}_j^S \right\|^2 \right)$$

in the formula, $\hat{\mathbf{w}}_i^S$ represents the source language word vector before modification, $\mathbf{w}_i^S$ represents the modified source language word vector, $\mathbf{w}_j^S$ represents the modified word vector of a source language word near-synonymous with the source language word $w_i^S$, and $\alpha_i$ and $\beta_{ij}$ represent hyper-parameters.
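For illustration only, the retrofitting-style fourth loss of this claim can be sketched as follows; the list/dictionary layout of the hyper-parameters is an assumption and all names are illustrative:

```python
def retrofit_loss(modified, original, neighbors, alpha, beta):
    """Fourth loss (retrofitting form): each modified vector should stay close
    to its original (pre-modification) vector and to the modified vectors of
    its near-synonyms. neighbors[i] lists the indices j of near-synonyms of
    word i; alpha[i] and beta[(i, j)] weight the two kinds of terms."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    loss = 0.0
    for i, w in enumerate(modified):
        loss += alpha[i] * sqdist(w, original[i])      # stay near original vector
        for j in neighbors[i]:
            loss += beta[(i, j)] * sqdist(w, modified[j])  # pull near-synonyms together
    return loss
```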
6. The method of claim 1, wherein the step of determining a fourth loss function for incorporating semantic information into the source language word vector specifically comprises:
constructing a word-sememe co-occurrence matrix based on the sememe knowledge base, wherein an element $M_{sj}$ of the word-sememe co-occurrence matrix takes the value 1 if the sememe $x_j$ is tagged to the source language word $w_s^S$, and 0 otherwise;

determining a fourth loss function as shown below by decomposing the word-sememe co-occurrence matrix, so as to fuse semantic information into the source language word vectors:

$$\mathcal{L}_4 = \sum_{w_s^S} \sum_{x_j \in X} \left( \mathbf{w}_s^S \cdot \mathbf{x}_j + b_s + b_j' - M_{sj} \right)^2$$

in the formula, $X$ represents the set of sememes, $w_s^S$ represents a source language word, $\mathbf{w}_s^S$ represents the corresponding source language word vector, $\mathbf{x}_j$ represents the vector of the sememe $x_j$, and $b_s$ and $b_j'$ respectively represent the biases of the source language word vector $\mathbf{w}_s^S$ and the sememe vector $\mathbf{x}_j$.
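A minimal sketch of the factorization form of the fourth loss, assuming dense lists for the word and sememe vectors and a 0/1 co-occurrence matrix; names are illustrative:

```python
def factorization_loss(word_vecs, sememe_vecs, b_word, b_sememe, M):
    """Fourth loss (factorization form): squared error between the predicted
    score w_s . x_j + b_s + b'_j and the 0/1 entry M[s][j] of the
    word-sememe co-occurrence matrix."""
    loss = 0.0
    for s, w in enumerate(word_vecs):
        for j, x in enumerate(sememe_vecs):
            pred = sum(a * b for a, b in zip(w, x)) + b_word[s] + b_sememe[j]
            loss += (pred - M[s][j]) ** 2
    return loss
```

Minimizing this objective (e.g. by stochastic gradient descent, as elsewhere in the claims) places word and sememe vectors in a shared space where their dot product approximates the labeling relation.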
7. The method according to claim 6, wherein the step of performing sememe prediction for the target vocabulary by searching for the labeled sememes of source language words close to the target vocabulary word vector in the target language based on the bilingual word vectors specifically comprises:

searching for source language words close to the target vocabulary word vector based on the bilingual word vectors, and determining all the labeled sememes of those source language words;

for each sememe, calculating a score based on the similarity between the target vocabulary word vector and the source language word vectors of the source language words labeled with the sememe;

and selecting the sememes whose scores are higher than a set threshold value as the sememes of the target vocabulary.
8. The method of claim 7, wherein the step of calculating the score of a sememe based on the similarity between the source language word vectors of the source language words labeled with the sememe and the target vocabulary word vector specifically comprises:

for the sememe $x_j$, calculating the score of the sememe using the following formula:

$$Score\!\left(x_j, w^T\right) = \sum_s \cos\!\left(\mathbf{w}_s^S, \mathbf{w}^T\right) \cdot M_{sj} \cdot c^{\,r_s}$$

in the formula, $Score(x_j, w^T)$ represents the score of the sememe $x_j$ for the target vocabulary $w^T$; $\mathbf{w}_s^S$ and $\mathbf{w}^T$ respectively represent the source language word vectors of the source language words labeled with $x_j$ and the target vocabulary word vector; $M_{sj}$ denotes the labeling relationship between $x_j$ and $w_s^S$, i.e. the corresponding element in the word-sememe co-occurrence matrix; $r_s$ represents the rank of $w_s^S$ when the source language words are arranged in descending order of cosine similarity to $\mathbf{w}^T$; and $c$ is a declining-confidence hyper-parameter.
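The sememe scoring of this claim can be sketched as below, assuming a declining-confidence hyper-parameter c raised to the descending-similarity rank; names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sememe_score(j, target_vec, source_vecs, M, c=0.8):
    """Score of sememe x_j for a target word: source words labeled with x_j
    (M[s][j] == 1) vote with their cosine similarity to the target word,
    discounted by c ** r_s, where r_s is the word's rank in descending
    similarity order."""
    sims = [cosine(sv, target_vec) for sv in source_vecs]
    order = sorted(range(len(source_vecs)), key=lambda s: -sims[s])
    rank = {s: r for r, s in enumerate(order)}
    return sum(sims[s] * M[s][j] * c ** rank[s] for s in range(len(source_vecs)))
```

Sememes whose scores exceed the set threshold would then be predicted for the target word, per claim 7.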
9. A cross-language lexical-semantic prediction apparatus comprising:
the first setting module is used for determining a first loss function for learning a source language word vector from a monolingual corpus of a source language and determining a second loss function for learning a target language word vector from a monolingual corpus of a target language;
the second setting module is used for respectively determining a third loss function for aligning the source language word vectors with the target language word vectors and a fourth loss function for fusing semantic information into the source language word vectors;
the training word pair extraction module is used for selecting a certain number of source language word and target language word pairs with the same semantics from the monolingual corpora of the source language and the target language;
the vector alignment fusion module is used for optimizing and adjusting the first loss function, the second loss function, the third loss function and the fourth loss function by a stochastic gradient descent method based on the source language word and target language word pairs and an established sememe knowledge base of the source language, to obtain bilingual word vectors in the same semantic space, wherein the bilingual word vectors carry the semantic correspondence between the source language word vectors and the target language word vectors and the fusion relation between the sememes and the source language word vectors;
and the prediction output module is used for performing sememe prediction on the target vocabulary by searching for the labeled sememes of source language words close to the target vocabulary word vector in the target language based on the bilingual word vectors.
10. An electronic device, comprising: at least one memory, at least one processor, a communication interface, and a bus;
the memory, the processor and the communication interface communicate with one another through the bus, and the communication interface is used for information transmission between the electronic device and source language and target language devices;
the memory has stored therein a computer program operable on the processor, which when executed by the processor, implements the method of any one of claims 1 to 8.
CN201811288136.1A 2018-10-31 2018-10-31 Cross-language vocabulary semantic prediction method and device and electronic equipment Active CN109597988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811288136.1A CN109597988B (en) 2018-10-31 2018-10-31 Cross-language vocabulary semantic prediction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109597988A CN109597988A (en) 2019-04-09
CN109597988B true CN109597988B (en) 2020-04-28

Family

ID=65957181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811288136.1A Active CN109597988B (en) 2018-10-31 2018-10-31 Cross-language vocabulary semantic prediction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109597988B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134962A (en) * 2019-05-17 2019-08-16 中山大学 Cross-language plain-text irony recognition method based on intra-attention
CN110457692B (en) * 2019-07-26 2021-02-26 清华大学 Method and device for learning compound word representation
CN111125318B (en) * 2019-12-27 2021-04-30 北京工业大学 Method for improving knowledge graph relation prediction performance based on sememe-semantic item information
CN112464673B (en) * 2020-12-09 2023-05-26 哈尔滨工程大学 Language meaning understanding method fusing sememe information
CN113139391B (en) * 2021-04-26 2023-06-06 北京有竹居网络技术有限公司 Translation model training method, device, equipment and storage medium
CN113343672B (en) * 2021-06-21 2022-12-16 哈尔滨工业大学 Unsupervised bilingual dictionary construction method based on corpus merging
CN113468308B (en) * 2021-06-30 2023-02-10 竹间智能科技(上海)有限公司 Conversation behavior classification method and device and electronic equipment
CN115828930B (en) * 2023-01-06 2023-05-02 山东建筑大学 Distributed word vector space correction method for dynamic fusion of semantic relations

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880600A (en) * 2012-08-30 2013-01-16 北京航空航天大学 Word semantic tendency prediction method based on universal knowledge network
CN107193806A (en) * 2017-06-08 2017-09-22 清华大学 Automatic vocabulary sememe prediction method and device
CN107239443A (en) * 2017-05-09 2017-10-10 清华大学 Training method and server for a word vector learning model
CN108228554A (en) * 2016-12-09 2018-06-29 富士通株式会社 Method, apparatus and electronic device for generating word vectors based on a semantic representation model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant