CN112668346A - Translation method, device, equipment and storage medium - Google Patents

Translation method, device, equipment and storage medium Download PDF

Info

Publication number
CN112668346A
CN112668346A (application CN202011554126.5A)
Authority
CN
China
Prior art keywords
text
information
source language
feature
language text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011554126.5A
Other languages
Chinese (zh)
Other versions
CN112668346B (en)
Inventor
叶忠义
张为泰
刘俊华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202011554126.5A priority Critical patent/CN112668346B/en
Priority to PCT/CN2020/141365 priority patent/WO2022134164A1/en
Publication of CN112668346A publication Critical patent/CN112668346A/en
Application granted granted Critical
Publication of CN112668346B publication Critical patent/CN112668346B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a translation method, apparatus, device and storage medium. A source language text, speech information of the source language text and structure information of each text subunit in the source language text are obtained together. Feature extraction is performed on the source language text, the speech information and the structure information respectively, yielding text features corresponding to the source language text, speech features corresponding to the speech information and structure features corresponding to the source language text, and translation is then performed based on the three kinds of features to obtain a translated target language text. In the translation process, text, pronunciation and structure are thus considered jointly as different cognitive layers of the same source language text. Translating on the basis of the three kinds of features makes full use of the available resource information and enriches the information referenced during translation, so both the robustness of the translation process and the quality of the translation result can be improved.

Description

Translation method, device, equipment and storage medium
Technical Field
The present application relates to the field of machine translation technologies, and in particular, to a translation method, apparatus, device, and storage medium.
Background
In recent years, staged breakthroughs in technologies such as deep learning and transfer learning, together with the massive data generated on the Internet, have brought great opportunities to the development of natural language processing, and natural language processing tasks have made breakthrough progress. Machine translation is one of the important applications among natural language processing tasks, and its accuracy has improved greatly along with this development. The reliability of using machines to automatically process text keeps increasing, so more work can be handed to machines. Deep learning generally depends on a large amount of labeled data; in recent years, with the proposal of various self-supervised pre-training models and semi-supervised and unsupervised algorithms, models can learn from large amounts of raw corpora, which greatly reduces the amount of labeled corpora required for each task and keeps lowering the threshold for applying deep learning. With the accumulation of large-scale parallel corpora and the continuous improvement of computing power, the translation level of machine translation in general domains can even exceed that of humans, so machine translation has become one of the more mature technologies in natural language processing and is gradually entering people's daily lives.
Modern machine translation systems use parallel corpora to learn the mapping between languages end to end, so the translation task becomes a completely data-driven task. When the parallel corpus for a translation task is abundant, the quality of the translation model is high; but when the parallel data for the translation task is limited, the quality of the machine translation model is also greatly limited.
Although machine translation is already highly usable in resource-rich scenarios, its effect in low-resource scenarios still leaves much room for improvement. With globalization, machine translation technology can reduce barriers to cross-language communication, so exploring schemes for improving machine translation quality has important practical significance, especially for machine translation in low-resource scenarios.
Disclosure of Invention
In view of the above problems, the present application is proposed to provide a translation method, apparatus, device and storage medium to improve the quality of machine translation. The specific scheme is as follows:
in a first aspect of the present application, a translation method is disclosed, comprising:
acquiring a source language text, voice information of the source language text and structure information of each text subunit in the source language text;
acquiring text features corresponding to the source language text, voice features corresponding to the voice information and structural features corresponding to the source language text, wherein the structural features are determined based on structural information of each text subunit in the source language text;
and translating the source language text based on the text characteristics, the voice characteristics and the structure characteristics to obtain a translated target language text.
Preferably, the obtaining of the structure information of each text subunit in the source language text includes:
and acquiring an image of each text subunit in the source language text, wherein the image contains the structure information of the text subunits.
Preferably, the obtaining of the text feature corresponding to the source language text, the speech feature corresponding to the speech information, and the structure feature corresponding to the source language text includes:
and respectively coding the source language text, the voice information and the structure information of each text subunit in the source language text to obtain text characteristics corresponding to the source language text, voice characteristics corresponding to the voice information and structure characteristics corresponding to the source language text.
Preferably, the process of encoding the source language text, the speech information, and the structure information of each text subunit in the source language text includes:
coding is performed sequentially in the order of the speech information, the structure information and then the source language text, wherein:
coding the voice information to obtain voice characteristics;
when the structure information is coded, the coding result of the voice information is fused to obtain the structure characteristic of the fused voice information;
and when the source language text is coded, fusing the coding result of the voice information and the coding result of the structure information to obtain text characteristics fusing the voice information and the structure information.
Preferably, when encoding the structure information, fusing the encoding result of the speech information includes:
and fusing the coding result of the voice information and the coding result of the structure information in a first fusion mode.
Preferably, when the source language text is encoded, fusing the encoding result of the speech information and the encoding result of the structure information includes:
and fusing the coding result of the voice information and the coding result of the structure information with the coding result of the source language text in a second fusion mode.
Preferably, the translating the source language text based on the text feature, the voice feature and the structural feature to obtain a translated target language text includes:
determining a splicing feature based on the speech feature, the structural feature and the text feature;
and determining the translated target language text corresponding to the source language text based on the splicing characteristics.
Preferably, the determining a splicing feature based on the speech feature, the structural feature and the text feature comprises:
splicing the voice feature, the structural feature and the text feature to obtain a first splicing feature;
or, alternatively,
fusing the voice feature, the structural feature and the text feature to obtain a fusion feature of the source language;
and splicing the fusion characteristic of the source language, the voice characteristic, the structure characteristic and the text characteristic to obtain a second spliced characteristic.
Preferably, the determining the translated target language text corresponding to the source language text based on the splicing feature includes:
performing information conversion between the source language and the target language based on the splicing characteristics to obtain fusion characteristics, text characteristics, structural characteristics and voice characteristics of the target language;
and performing text decoding by combining the fusion characteristics and the text characteristics of the target language to obtain a target language text of the target language.
Preferably, the information conversion between the source language and the target language based on the splicing feature to obtain the fusion feature, the text feature, the structural feature and the voice feature of the target language includes:
coding the splicing characteristics by a translation coder to obtain a coding result;
and decoding the coding result by a translation decoder to obtain the fusion characteristic, the text characteristic, the structural characteristic and the voice characteristic of the decoded target language.
Preferably, after text decoding is performed in combination with the fusion feature and the text feature of the target language, the method further includes:
performing structure decoding by combining the text decoding result and the structural characteristics of the target language to obtain structural information of the target language text;
and performing voice decoding by combining the text decoding result, the structure decoding result and the voice characteristic of the target language to obtain the voice information of the target language text.
Preferably, the process of obtaining the text features, the speech features and the structure features and of translating the source language text is realized through a pre-trained translation model;
the training process of the translation model comprises the following steps:
inputting a source language training text, voice information of the source language training text and structure information of each text subunit in the source language training text;
the translation model processes the input information and outputs a predicted target language text, structural information of the target language text and voice information of the target language text;
performing iterative training with minimizing a set loss function as the training target, wherein the set loss function comprises:
a text alignment loss between the target language text predicted by the model and the label target language text corresponding to the source language training text, and zero, one or both of a structure alignment loss and a speech alignment loss.
In a second aspect of the present application, there is disclosed a translation apparatus comprising:
the information acquisition unit is used for acquiring the source language text, the voice information of the source language text and the structure information of each text subunit in the source language text;
the feature acquisition unit is used for acquiring text features corresponding to the source language text, voice features corresponding to the voice information and structural features corresponding to the source language text, wherein the structural features are determined based on structural information of each text subunit in the source language text;
and the feature reference unit is used for translating the source language text based on the text feature, the voice feature and the structural feature to obtain a translated target language text.
In a third aspect of the present application, there is disclosed a translation apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the translation method described above.
In a fourth aspect of the present application, a storage medium is disclosed, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the translation method as described above.
In a fifth aspect of the present application, a computer program product is disclosed, which, when run on a terminal device, causes the terminal device to perform the various steps of the translation method described above.
By means of the above technical solution, the translation method obtains the source language text, the speech information of the source language text and the structure information of each text subunit in the source language text together; performs feature extraction on the source language text, the speech information and the structure information respectively to obtain text features corresponding to the source language text, speech features corresponding to the speech information and structure features corresponding to the source language text; and then translates based on these three kinds of features to obtain the translated target language text. In the translation process, text, pronunciation and structure are thus considered jointly as different cognitive layers of the same source language text. Translating on the basis of the three kinds of features makes full use of the available resource information and enriches the information referenced during translation, so both the robustness of the translation process and the quality of the translation result can be improved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flow chart of a translation method according to an embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of a source language text and its corresponding speech information, structure information;
FIG. 3 illustrates a multi-channel machine translation framework schematic;
FIG. 4 illustrates a DFCNN structure;
FIG. 5 is a schematic diagram illustrating the encoding process of a ResNet image encoder;
FIG. 6 illustrates a GRU encoding source language text information;
FIG. 7 illustrates a diagram of a multi-channel hierarchical feature-coded network;
FIG. 8 illustrates a multi-channel hierarchical feature decoding network;
fig. 9 is a schematic structural diagram of a translation apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a translation device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
During the continuous evolution of real-world writing systems, characters with similar structures often appear and tend to represent similar semantics. For example, certain Chinese characters share a similar structure, such as a common radical, and both express distance-related meanings. Moreover, most characters have been passed on with speech as the medium during their development, which also leads to situations where similar pronunciations express similar semantics; certain Chinese characters with similar pronunciations likewise have similar meanings. This means that when only text information is used, the stroke structure, shape properties and pronunciation properties of the text are lost, even though these properties tend to provide rich information.
Based on this recognition, the present application provides a translation scheme in which speech information, structure information and text information are jointly learned as different cognitive levels of the same information to accomplish the translation. In this way, resource information is used more fully, and both the translation quality and the robustness of the translation process are clearly improved.
The scheme can be realized based on a terminal with data processing capacity, and the terminal can be a mobile phone, a computer, a server, a cloud terminal and the like.
Next, with reference to fig. 1, the translation method of the present application may include the following steps:
step S100, obtaining a source language text, voice information of the source language text and structure information of each text subunit in the source language text.
The source language text is the text content to be translated. The speech information of the source language text is standard pronunciation information corresponding to the source language text. The speech information of the source language text may be uploaded by the user or synthesized speech information obtained by performing speech synthesis based on the source language text.
The source language text is composed of a number of text subunits, which may differ depending on the source language. For example, a text subunit for Chinese may be a character, while a text subunit for English may be a word, and so on. The structure information of a text subunit is the way the subunit is displayed as visual information; for example, when the text subunit is a Chinese character, the corresponding structure information may also be referred to as glyph information, such as the stroke order, the stroke composition structure and the overall shape structure of the character.
In an alternative mode, the structural information of the text sub-unit may be obtained in the form of an image of the text sub-unit, where the image includes the structural information of the text sub-unit.
As shown in fig. 2, taking the source language text "I have an apple" as an example, the source language text and its corresponding speech information and structure information are each illustrated in fig. 2.
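Purely as an illustration of this alternative (and not a limitation of the embodiments), the structure information of a Chinese-character text subunit could be obtained by rendering the character as a small grayscale image; the font file, image size and example sentence below are assumptions made only for this sketch.

```python
# Hypothetical sketch: render each text subunit (here, a Chinese character)
# to a grayscale image whose pixels carry its structure (glyph) information.
# The font path and the 32x32 resolution are assumptions, not values from the patent.
from PIL import Image, ImageDraw, ImageFont

def subunit_structure_image(char: str,
                            font_path: str = "NotoSansCJK-Regular.ttc",
                            size: int = 32) -> Image.Image:
    font = ImageFont.truetype(font_path, size - 4)
    img = Image.new("L", (size, size), color=255)   # white background
    draw = ImageDraw.Draw(img)
    draw.text((2, 0), char, fill=0, font=font)      # black glyph
    return img

# Example: one structure image per subunit of the source text
structure_images = [subunit_structure_image(c) for c in "我有一个苹果"]
```

Each resulting image carries the stroke and shape information of its subunit and can be fed to the structure encoder described later.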
Step S110, acquiring text characteristics corresponding to the source language text, voice characteristics corresponding to the voice information and structural characteristics corresponding to the source language text.
Wherein the structural features are determined based on structural information of text sub-units in the source language text.
In the step, the text features corresponding to the source language text are obtained, and the text features represent the text semantics of the source language text. And the voice characteristics corresponding to the voice information embody the acoustic information of the source language text. The structural characteristics corresponding to the source language text are determined by the structural information of each text subunit in the source language text, and represent the writing structure and shape information of each text subunit.
And step S120, translating the source language text based on the text characteristics, the voice characteristics and the structure characteristics to obtain a translated target language text.
In this step, when cross-language translation is performed, the text features, speech features and structure features of the source language text are referenced simultaneously, which greatly enriches the translation reference information and improves the quality of the translation result. Compared with translating by reference to text features alone, this also reduces the influence of textual perturbations on the translation result and improves the robustness of the translation process.
According to the translation method provided by this embodiment of the application, the source language text, the speech information of the source language text and the structure information of each text subunit in the source language text are obtained together; feature extraction is performed on the source language text, the speech information and the structure information respectively to obtain text features corresponding to the source language text, speech features corresponding to the speech information and structure features corresponding to the source language text; and translation is performed based on the three kinds of features to obtain the translated target language text. In the translation process, text, pronunciation and structure are thus considered jointly as different cognitive layers of the same source language text, resource information is fully utilized, and the information referenced during translation is richer, so both the robustness of the translation process and the quality of the translation result can be improved.
Referring to FIG. 3, a multi-channel machine translation framework diagram is illustrated.
The translation model provided by the present application takes three kinds of input: the source language text, the structure information of the source language text and the speech information of the source language text, which correspond respectively to the three parts on the left side of fig. 3, from top to bottom.
And the translation model respectively extracts the features of the three input parts and performs decoding translation on the basis of the extracted features to obtain a target language text.
It should be noted that, in this embodiment, while outputting the target language text, the translation model may further output structure information of the target language text and speech information of the target language text, which correspond to the three parts from top to bottom on the right side in fig. 3 respectively.
That is, the translation model provided by this embodiment can capture respective alignment relationships between the source language and the target language, on the speech, the structure and the text, so as to learn the multidimensional alignment relationship between the source language and the target language, which is more beneficial to improving the quality of the translation result.
It will be appreciated that the source language is illustrated in FIG. 3 as Chinese and the target language is English, but that the source and target languages may be in other languages.
In some embodiments of the present application, a process of obtaining the text feature corresponding to the source language text, the speech feature corresponding to the speech information, and the structural feature corresponding to the source language text in step S110 is described.
In this embodiment, the source language text, the voice information, and the structure information of each text subunit in the source language text may be encoded respectively, so as to obtain a text feature corresponding to the source language text, a voice feature corresponding to the voice information, and a structure feature corresponding to the source language text.
Specifically, the source language text may be encoded by a text encoder in the translation model to obtain text features. And coding the structure information of each text subunit in the source language text by a structure coder in the translation model to obtain the structure characteristics corresponding to the source language text. And coding the voice information of the source language text by a voice coder in the translation model to obtain voice characteristics.
The input to the text encoder may be a vocabulary ID corresponding to each text sub-unit in the source language text. The input to the structure encoder may be image information, such as pixel values of an image, for each text sub-unit in the source language text. The input to the speech encoder may be spectral features extracted from the speech information.
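As an illustrative, non-limiting sketch of how these three inputs might be prepared in practice (the vocabulary, the 16 kHz sample rate and the 80-band log-mel features are assumptions, not values fixed by this application):

```python
# Hypothetical sketch of preparing the three encoder inputs:
# vocabulary IDs for the text encoder, glyph-image pixels for the structure
# encoder, and spectral features of the speech for the speech encoder.
import numpy as np
import librosa

def text_input(subunits, vocab):
    # vocabulary ID for each text subunit (unknown subunits mapped to id 0)
    return np.array([vocab.get(u, 0) for u in subunits], dtype=np.int64)

def structure_input(images):
    # pixel values of each subunit image, scaled to [0, 1]
    return np.stack([np.asarray(img, dtype=np.float32) / 255.0 for img in images])

def speech_input(wav_path):
    # log-mel spectrogram of the (recorded or synthesized) speech
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    return librosa.power_to_db(mel).T        # shape: (frames, 80)
```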
In this embodiment, two different implementations of encoding the three types of information are provided.
In the first mode, three kinds of information are coded independently:
specifically, the text encoder encodes only with reference to the source language text to obtain the corresponding text features.
And the structure encoder only refers to the structure information of each text subunit in the source language text for encoding to obtain the corresponding structure characteristics.
And the voice encoder only refers to the voice information of the source language text for encoding to obtain the voice characteristics.
In the second mode, three information layering fusion codes are adopted:
according to the development of linguistics, the abstraction levels of information of three dimensions of a text are different, and the abstraction levels are from low to high: speech, structure, text information.
Based on the above recognition, this embodiment provides bottom-up multi-level fusion coding: the speech information is encoded first, the structure information is then encoded based on the speech encoding result, and finally the text information is encoded based on both the speech encoding result and the structure encoding result. This bottom-up multi-level fusion coding better matches the developmental law of language, and hierarchical fusion coding allows richer multi-level information to be shared, which improves both the degree of fusion of the multi-level information and the utilization of the source language information.
Specifically, in this embodiment, the encoding may be performed sequentially in the order of the speech information, the structure information and then the source language text, where:
coding the voice information to obtain voice characteristics;
when the structure information is coded, the coding result of the voice information is fused to obtain the structure characteristic of the fused voice information;
and when the source language text is coded, fusing the coding result of the voice information and the coding result of the structure information to obtain text characteristics fusing the voice information and the structure information.
By the hierarchical coding and the information sharing, the data is efficiently utilized and the translation quality is improved.
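The bottom-up order described above can be sketched as follows; the encoder modules and their interfaces are assumed for illustration only:

```python
# Hypothetical sketch of the bottom-up hierarchical fusion encoding:
# speech first, then structure fused with the speech result, then text
# fused with both. The encoder modules and their interfaces are assumptions.
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, speech_enc, structure_enc, text_enc):
        super().__init__()
        self.speech_enc = speech_enc        # e.g. a DFCNN-style encoder
        self.structure_enc = structure_enc  # e.g. a gated ResNet encoder
        self.text_enc = text_enc            # e.g. a gated GRU encoder

    def forward(self, speech, images, token_ids):
        f_sound = self.speech_enc(speech)                  # speech features
        f_pic = self.structure_enc(images, f_sound)        # fuses the speech result
        f_text = self.text_enc(token_ids, f_sound, f_pic)  # fuses both results
        return f_sound, f_pic, f_text
```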
1. First, the encoding process of speech information is described:
In this embodiment, to handle the continuous and long sequences of speech information, a variety of neural network structures may be used for encoding, including but not limited to the deep full-sequence convolutional neural network (DFCNN). Taking the DFCNN as an example, its structure is shown in fig. 4.
The input of the DFCNN includes both the spectral signal of the speech information and the spectrogram converted from it; that is, each frame of speech is subjected to a Fourier transform, time and frequency are taken as the two dimensions of the spectrogram, and the whole utterance is then modeled through a combination of multiple convolution layers and pooling layers.
From the input side, traditional speech features apply various manually designed filter banks after the Fourier transform to extract features, which causes information loss in the frequency domain; the loss is particularly obvious in the high-frequency region.
In this embodiment, the DFCNN takes the spectrogram directly as input, which has a natural advantage over schemes that take traditional speech features as input. In terms of model structure, the DFCNN differs from the CNNs used in traditional speech recognition: it borrows the network configurations that work best in image recognition, where each convolution layer uses small 3x3 kernels and a pooling layer is added after several convolution layers. This greatly enhances the expressive power of the CNN; at the same time, by stacking a very large number of such convolution-pooling pairs, the DFCNN can see very long history and future context, which ensures that it can express the long-term correlations of speech well and makes it more robust than RNN structures. In this embodiment, the output features of the last DFCNN layer may be used as the encoding features of the speech information.
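For illustration only, a DFCNN-style speech encoder of the kind described (small 3x3 convolutions with pooling inserted after every few convolution layers, applied to the spectrogram) might be sketched as follows; the channel sizes and depth are assumptions and not the configuration of fig. 4:

```python
# Hypothetical DFCNN-style speech encoder sketch: stacks of 3x3 convolutions
# with a pooling layer after every few convolutions, applied to the spectrogram.
# Channel sizes and depth are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2))
        self.blocks = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        self.proj = nn.Linear(128, out_dim)

    def forward(self, spectrogram):            # (batch, 1, time, freq)
        x = self.blocks(spectrogram)           # (batch, 128, time/8, freq/8)
        x = x.mean(dim=3).transpose(1, 2)      # pool over frequency -> (batch, time/8, 128)
        return self.proj(x)                    # frame-level speech features
```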
2. Secondly, the encoding process of the structure information is introduced:
in this embodiment, the encoding process of the structure information is described by taking as an example that the structure information of each text subunit in the source language text is embodied in the form of an image of the text subunit. Correspondingly, the structure encoder may be an image encoder.
The input for an image is its 2D pixel information, and various network structures, including but not limited to ResNet, can be employed for image encoding. Meanwhile, the speech encoding result is merged into the image encoding process. Taking a ResNet image encoder as an example, its network structure is shown in fig. 5.
In image recognition tasks, it is generally considered that CNN layers at different depths learn information of different dimensions, and as the network deepens its performance generally improves. The main idea of ResNet is to add a direct connection channel to the network, i.e. the idea of the Highway Network, which allows a certain proportion of the output of a previous layer to be preserved. ResNet is very similar in spirit: it allows the original input information to be passed directly to a later layer, so that the layer only needs to learn the residual of the previous layer's output rather than the entire output; for this reason ResNet is also called a residual network. Traditional convolutional or fully connected networks suffer more or less from information loss during propagation, and also from vanishing or exploding gradients, which makes very deep networks hard to train. ResNet solves these problems to a certain extent: the input information is routed directly to the output, the integrity of the information is protected, and the whole network only needs to learn the difference between input and output, which simplifies the learning objective and its difficulty.
Since the encoding result of the speech information can assist the representation of the image, this embodiment incorporates the speech encoding result into the image representation. Specifically, the encoding result of the speech information and the encoding result of the structure information (i.e., in this embodiment, the encoding result of the image of each text subunit) may be fused in a first fusion manner. The first fusion manner may take various forms; for example, fusion strategies such as a threshold (gating) mechanism or an attention mechanism may be used, and this is not strictly limited in this application.
Referring to fig. 5, taking a threshold (gating) mechanism as an example of the first fusion manner, the speech encoding result is fused between blocks of the ResNet network through a Gate, and the Gate controls the degree to which the speech encoding result is fused. The Gate is computed as follows:
γ_i = σ_g(W_g x_{i-1} + U_g f_sound + h_g)

where x_{i-1} is the input of the (i-1)-th ResNet block, f_sound is the speech encoding result, γ_i is the fusion coefficient of the speech encoding result at the i-th ResNet block, σ_g is the activation function, and W_g, U_g and h_g are Gate-related parameters that can be learned by the network.
In the embodiment, the coding result of the voice information and the coding result of the structural information are adaptively fused through a threshold mechanism, so that the coding result of the structure can utilize the acoustic information, and the robustness is improved.
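The Gate fusion above can be sketched as follows; pooling the block input to a per-channel vector before computing the gate, projecting f_sound to the block's channel dimension, and the way γ_i is applied to mix in the speech result are all simplifying assumptions of this sketch:

```python
# Hypothetical sketch of the Gate fusion between ResNet blocks:
# gamma_i = sigmoid(W_g x_{i-1} + U_g f_sound + h_g), then the speech encoding
# result is mixed into the block input according to gamma_i.
import torch
import torch.nn as nn

class SpeechGate(nn.Module):
    def __init__(self, channels, sound_dim):
        super().__init__()
        self.w_g = nn.Linear(channels, channels)
        self.u_g = nn.Linear(sound_dim, channels)
        self.h_g = nn.Parameter(torch.zeros(channels))
        self.proj = nn.Linear(sound_dim, channels)   # projects f_sound into the image feature space

    def forward(self, x_prev, f_sound):
        # x_prev: (batch, channels, H, W) output of the previous ResNet block
        # f_sound: (batch, sound_dim) pooled speech encoding result
        pooled = x_prev.mean(dim=(2, 3))                                       # per-channel summary of x_prev
        gamma = torch.sigmoid(self.w_g(pooled) + self.u_g(f_sound) + self.h_g)  # fusion coefficient
        fused = x_prev + gamma[:, :, None, None] * self.proj(f_sound)[:, :, None, None]
        return fused                                                           # input to the next block
```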
3. Thirdly, the encoding process of the source language text is introduced:
the abstract level of the characters is higher than that of the voice and the structure, and the text characteristics of the source language text are constructed based on the voice coding result and the structure coding result, so that a better coding result can be obtained. Based on this idea, this embodiment may use a second fusion method to fuse the coding result of the speech information and the coding result of the structure information with the coding result of the source language text. The second fusion method includes, but is not limited to, various fusion mechanisms including GRU (gated loop unit) and attention mechanism, taking GRU as an example, and the structure thereof is shown in fig. 6.
The Gate mechanism is also used in the GRU to fuse text information, speech information and structure information.
z_t = σ_g(W_z x_{t-1} + B_z f_sound + A_z f_pic + U_z h_{t-1} + b_z)

r_t = σ_g(W_r x_{t-1} + B_r f_sound + A_r f_pic + U_r h_{t-1} + b_r)

h'_t = σ_h(W_h x_{t-1} + B_h f_sound + A_h f_pic + r_t h_{t-1} + b_h)

h_t = (1 - z_t) h_{t-1} + z_t h'_t

where f_sound is the speech encoding result, f_pic is the structure encoding result, x_{t-1} is the source language text information input at time t-1, z_t and r_t are gate (threshold) vectors, h_{t-1} is the historical state information, h'_t is the historical state correction information, and W, A, B, U and b are GRU parameters.
From the above formula, it can be seen that the source language text in GRU fuses the speech coding result and the structure coding result through the Gate mechanism, so as to achieve the purpose of hierarchical coding.
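For illustration, the gated fusion above can be sketched as a GRU cell that receives the speech and structure encoding results as extra inputs at every time step; folding the separate weight matrices W, A, B, U into single linear layers over concatenated inputs is an equivalent reparameterization used here only for brevity:

```python
# Hypothetical sketch of a GRU cell extended with the speech encoding result
# f_sound and the structure encoding result f_pic, following the gated update
# equations given above. Dimension choices are illustrative assumptions.
import torch
import torch.nn as nn

class FusedGRUCell(nn.Module):
    def __init__(self, x_dim, sound_dim, pic_dim, h_dim):
        super().__init__()
        in_dim = x_dim + sound_dim + pic_dim
        self.z_gate = nn.Linear(in_dim + h_dim, h_dim)   # [x, f_sound, f_pic, h_prev] -> z_t
        self.r_gate = nn.Linear(in_dim + h_dim, h_dim)   # [x, f_sound, f_pic, h_prev] -> r_t
        self.h_cand = nn.Linear(in_dim + h_dim, h_dim)   # [x, f_sound, f_pic, r*h_prev] -> h'_t

    def forward(self, x, f_sound, f_pic, h_prev):
        u = torch.cat([x, f_sound, f_pic], dim=-1)
        z = torch.sigmoid(self.z_gate(torch.cat([u, h_prev], dim=-1)))
        r = torch.sigmoid(self.r_gate(torch.cat([u, h_prev], dim=-1)))
        h_new = torch.tanh(self.h_cand(torch.cat([u, r * h_prev], dim=-1)))
        return (1 - z) * h_prev + z * h_new
```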
In some embodiments of the application, for the foregoing step S120, a process of translating the source language text to obtain a translated target language text based on the text feature, the speech feature and the structure feature is introduced.
S1, after the text feature is obtained by the text encoder, the speech feature is obtained by the speech encoder, and the structural feature is obtained by the structural encoder, the splicing feature can be determined based on the three features.
An optional splicing mode can be used for directly splicing the voice feature, the structural feature and the text feature to obtain a first splicing feature.
Another alternative splicing method is as follows:
in this embodiment, the speech feature, the structural feature and the text feature may be fused to obtain a fusion feature of the source language.
Specifically, the process of feature fusion can be expressed by referring to the following formula:
s = σ_g(C_s f_text + B_s f_sound + A_s f_pic + b_s)

p = σ_p(C_p f_text + B_p f_sound + A_p f_pic + b_p)

t = σ_t(C_t f_text + B_t f_sound + A_t f_pic + b_t)

f_merge = s·f_sound + p·f_pic + t·f_text

where s, p and t are feature fusion coefficients, f_sound denotes the encoding result of the speech information (i.e. the speech feature), f_pic denotes the structure feature, f_text denotes the text feature, f_merge denotes the fusion feature, A, B and C are feature fusion parameters, and b is a bias value; all of these can be learned by the network.
Further, the fusion feature of the source language, the voice feature, the structure feature and the text feature are spliced to obtain a second splicing feature.
The second splice characteristic can be expressed as:
concat(f_merge, f_sound, f_pic, f_text)

where concat(·) denotes feature concatenation (splicing).
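A compact sketch of this fusion-and-splicing step is given below; assuming the three features share one dimension, and folding the separate A, B, C matrices into one linear layer per coefficient, is done only to keep the illustration short:

```python
# Hypothetical sketch of the feature fusion and splicing described above:
# coefficients s, p, t gate the three features to form f_merge, and the
# second spliced feature is concat(f_merge, f_sound, f_pic, f_text).
import torch
import torch.nn as nn

class FusionSplice(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.coef_s = nn.Linear(3 * dim, dim)   # -> s, gate for the speech feature
        self.coef_p = nn.Linear(3 * dim, dim)   # -> p, gate for the structure feature
        self.coef_t = nn.Linear(3 * dim, dim)   # -> t, gate for the text feature

    def forward(self, f_sound, f_pic, f_text):
        cat = torch.cat([f_text, f_sound, f_pic], dim=-1)
        s = torch.sigmoid(self.coef_s(cat))
        p = torch.sigmoid(self.coef_p(cat))
        t = torch.sigmoid(self.coef_t(cat))
        f_merge = s * f_sound + p * f_pic + t * f_text
        return torch.cat([f_merge, f_sound, f_pic, f_text], dim=-1)   # second spliced feature
```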
And S2, determining the translated target language text corresponding to the source language text based on the obtained splicing characteristics.
Specifically, after the splicing feature is obtained, translation may be performed through a translation model to obtain a translated target language text.
Referring to fig. 7, a diagram of a multi-channel hierarchical eigen-coding network is illustrated.
FIG. 7 is illustrated with Chinese as the source language. Since the structure information is obtained in the form of images of the text subunits, the structure encoder can be represented as an image encoder, and the corresponding structure encoding features can be represented as image features.
Based on the introduction, the present embodiment may sequentially obtain the speech feature, the image feature, the text feature, and the fusion feature through the speech encoder, the image encoder, and the text encoder in a bottom-to-top order. The four features are spliced together to obtain a spliced feature which is used as a final input feature of the translation model encoder.
Through this bottom-up hierarchical feature coding, different hierarchical representations of the semantic features can be acquired, so that richer hierarchical information can be shared. In the final spliced features fed into the translation model encoder, the low-level speech and image features are retained so that the pronunciation and structure alignment of the text between the source and target languages can be learned.
In some embodiments of the present application, reference is made to the process of determining, based on the obtained concatenation characteristics, the translated target language text corresponding to the source language text in S2.
An alternative embodiment is as follows:
and S21, performing information conversion between the source language and the target language based on the splicing characteristics to obtain fusion characteristics, text characteristics, structural characteristics and voice characteristics of the target language.
Specifically, the input splicing characteristics can be encoded by an encoder of the translation model, and an encoding result is obtained. And further, decoding the coding result by a translation decoder of the translation model to obtain the fusion characteristic, the text characteristic, the structural characteristic and the voice characteristic of the decoded target language.
The translation model can adopt a Transformer network, which is an Encoder-Decoder network structure built from attention mechanisms. The Encoder is the translation encoder, and the Decoder is the translation decoder.
The Encoder and the Decoder are each composed of multiple identical layers; each layer is a basic Transformer building block consisting of two sub-layers, namely a multi-head self-attention mechanism and a feed-forward fully connected network.
In this embodiment, the translation model performs information conversion between the source language and the target language based on the concatenation characteristics through a Transformer network to obtain fusion characteristics, text characteristics, structural characteristics, and voice characteristics of the target language.
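As a rough sketch (not the configuration of the disclosed model), the spliced features could be passed through a standard Transformer encoder-decoder and the decoder output projected into the four target-language feature channels; the model sizes and the four projection heads are assumptions:

```python
# Hypothetical sketch: the spliced source-language features are encoded and
# decoded by a Transformer; the decoder output is then projected into the
# fusion, text, structure and speech features of the target language.
import torch
import torch.nn as nn

class TranslationCore(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        # one projection per target-language feature channel (assumed layout)
        self.heads = nn.ModuleDict({name: nn.Linear(d_model, d_model)
                                    for name in ("merge", "text", "pic", "sound")})

    def forward(self, spliced_src, tgt_states):
        # spliced_src: (batch, src_len, d_model) spliced source features
        # tgt_states:  (batch, tgt_len, d_model) decoder-side inputs
        dec = self.transformer(spliced_src, tgt_states)
        return {name: head(dec) for name, head in self.heads.items()}
```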
And S22, performing text decoding by combining the fusion characteristics and the text characteristics of the target language to obtain a target language text of the target language.
Specifically, after the fusion feature and the text feature of the target language are obtained, text decoding may be performed by a text decoder, so as to obtain a target language text.
Further optionally, because the final concatenation characteristics sent to the translation model encoder retain low-level speech and image characteristics, at the decoding end, the structural information of the target language text and the speech information of the target language text can be further decoded.
Specifically, the structure decoding may be performed by combining the result of the text decoding in step S22 and the structural feature of the target language, so as to obtain the structural information of the target language text. And carrying out voice decoding by combining the text decoding result, the structure decoding result and the voice feature of the target language to obtain the voice information of the target language text.
In this embodiment, the translation model can simultaneously output the decoded target language text, the structure information of the target language text and the speech information of the target language text. That is, while supporting cross-language text translation, the translation model of the present application can further output the structure information and the speech information of the target language text, and this information can be called by a user or by other applications to implement corresponding functions. Taking the speech information of the target language text as an example, it can be called by a sound-producing device to implement optional functions such as voice broadcast of the target language text.
Referring to fig. 8, a diagram of a multi-channel hierarchical feature decoding network is illustrated.
In fig. 8, the target language is English. Since the structure information is obtained in the form of images of the text subunits, the structure decoder can be represented as an image decoder, and the corresponding structure decoding features can be represented as image features.
The decoder of the translation model outputs the fusion features, text features, image features and speech features of the target language. Then, in contrast to the multi-channel hierarchical coding process described above, this embodiment may decode in a top-down order, through the text decoder, the image decoder and the speech decoder in turn, to obtain the target language text, the structure of the target language text and the speech information of the target language.
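The top-down decoding order can be sketched as follows, consuming the four target-language feature channels produced by a translation core such as the one sketched earlier; the decoder modules and their interfaces are assumptions:

```python
# Hypothetical sketch of top-down hierarchical decoding: text first, then
# structure conditioned on the text result, then speech conditioned on both.
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, text_dec, image_dec, speech_dec):
        super().__init__()
        self.text_dec = text_dec
        self.image_dec = image_dec
        self.speech_dec = speech_dec

    def forward(self, tgt_feats):
        # tgt_feats: dict with target-language "merge", "text", "pic", "sound" features
        text_out = self.text_dec(tgt_feats["merge"], tgt_feats["text"])      # target language text
        pic_out = self.image_dec(text_out, tgt_feats["pic"])                 # structure of the target text
        sound_out = self.speech_dec(text_out, pic_out, tgt_feats["sound"])   # speech of the target text
        return text_out, pic_out, sound_out
```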
In some embodiments of the present application, a training process for the translation model is further described.
Based on the introduction, the source language text, the voice information of the source language text and the structure information of each text subunit in the source language text are obtained through processing based on the translation model, so that the text characteristics corresponding to the source language text are obtained, the voice characteristics corresponding to the voice information are obtained, the structure characteristics corresponding to the source language text are obtained, and then, the source language text is translated based on the text characteristics, the voice characteristics and the structure characteristics, so that the translated target language text is obtained.
In the training process of the translation model, a source language training text, voice information of the source language training text and structure information of each text subunit in the source language training text are input.
The translation model processes the input information and outputs a predicted target language text, structural information of the target language text, and speech information of the target language text.
And performing iterative training by taking the minimum set loss function as a training target until the translation model is converged, and finishing the training.
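Purely as an illustration of this iterative training (the optimizer, the batch layout and the use of target-side references as labels are assumptions; the set loss function itself is detailed below and sketched after the loss formulas):

```python
# Hypothetical sketch of one training iteration: the model consumes the source
# text, its speech and its structure information, predicts the target text,
# structure and speech, and is updated to minimize the set loss.
import torch

def train_step(model, optimizer, batch, set_loss):
    optimizer.zero_grad()
    pred_text, pred_pic, pred_sound = model(batch["src_text"],
                                            batch["src_speech"],
                                            batch["src_structure"])
    loss = set_loss(pred_text, pred_pic, pred_sound,
                    batch["tgt_text"], batch["tgt_structure"], batch["tgt_speech"])
    loss.backward()
    optimizer.step()
    return loss.item()
```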
In this embodiment, the set loss function may include:
a text alignment loss between the target language text predicted by the model and the label target language text corresponding to the source language training text.
Further, in order to make more full use of the hierarchical multi-level information, the set loss function may further include a structure alignment loss and/or a voice alignment loss.
As described above, the text alignment loss represents the alignment loss between the target language text predicted by the model and the label target language text corresponding to the source language training text.
The text alignment loss can be expressed by the following equation:

loss_text = CE(ŷ_text, y_text)

where loss_text denotes the text alignment loss, CE denotes cross entropy, ŷ_text denotes the target language text predicted by the model, and y_text denotes the label target language text corresponding to the source language training text.
The structure alignment loss represents the alignment loss between the structure of the target language text predicted by the model and the structure of the label target language text.
The structure alignment loss can be expressed by an equation of the following form:

loss_pic = Σ_{j=1}^{n} (1 / (C_j·H_j·W_j)) · || φ_j(ŷ_pic) - φ_j(y_pic) ||

where loss_pic denotes the structure alignment loss, C_j·H_j·W_j denotes the size of the feature representation of the j-th layer in the pre-trained structure encoder, n is the number of network layers of the structure encoder, φ_j(·) denotes the j-th layer feature map of the structure encoder, ŷ_pic denotes the structure of the target language text predicted by the model, and y_pic denotes the structure of the label target language text.
The speech alignment loss represents the alignment loss between the speech information of the target language text predicted by the model and the speech information of the label target language text.
The speech alignment loss can be expressed by the following equation:

loss_sound = MSE(ŷ_sound, y_sound)

where loss_sound denotes the speech alignment loss, MSE denotes the mean square error loss, ŷ_sound denotes the acoustic features of the speech information of the target language text predicted by the model, and y_sound denotes the acoustic features of the speech information of the label target language text.
Based on the above expressions of the individual losses, the total loss function loss adopted in the training process of the translation model in this embodiment can be expressed as:

loss = loss_text, or loss = loss_text + loss_pic, or loss = loss_text + loss_pic + loss_sound.
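A sketch of composing the set loss from the three alignment losses is given below; the L1 distance used for the structure term and the feature_maps interface of the pre-trained structure encoder are assumptions of the sketch:

```python
# Hypothetical sketch of the set loss: text alignment loss plus, optionally,
# structure and speech alignment losses.
import torch
import torch.nn.functional as F

def set_loss(pred_text_logits, pred_pic, pred_sound,
             label_text_ids, label_pic, label_sound,
             structure_encoder=None, use_pic=True, use_sound=True):
    # text alignment loss: cross entropy between predicted and label text
    loss = F.cross_entropy(pred_text_logits.transpose(1, 2), label_text_ids)
    if use_pic and structure_encoder is not None:
        # structure alignment loss: layer-wise feature distance in a pre-trained
        # structure encoder (L1 distance and feature_maps() are assumptions)
        loss_pic = 0.0
        for feat_p, feat_l in zip(structure_encoder.feature_maps(pred_pic),
                                  structure_encoder.feature_maps(label_pic)):
            loss_pic = loss_pic + F.l1_loss(feat_p, feat_l)
        loss = loss + loss_pic
    if use_sound:
        # speech alignment loss: MSE over acoustic features
        loss = loss + F.mse_loss(pred_sound, label_sound)
    return loss
```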
The following describes a translation apparatus provided in an embodiment of the present application, and the translation apparatus described below and the translation method described above may be referred to correspondingly.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a translation apparatus disclosed in the embodiment of the present application.
As shown in fig. 9, the apparatus may include:
the information acquisition unit 11 is configured to acquire a source language text, speech information of the source language text, and structure information of each text subunit in the source language text;
a feature obtaining unit 12, configured to obtain a text feature corresponding to the source language text, a speech feature corresponding to the speech information, and a structural feature corresponding to the source language text, where the structural feature is determined based on structural information of each text subunit in the source language text;
and the feature reference unit 13 is configured to translate the source language text based on the text feature, the speech feature, and the structural feature to obtain a translated target language text.
Optionally, the process of acquiring the structure information of each text subunit in the source language text by the information acquiring unit may include:
and acquiring an image of each text subunit in the source language text, wherein the image contains the structure information of the text subunits.
Optionally, the process of acquiring the text feature corresponding to the source language text, the speech feature corresponding to the speech information, and the structural feature corresponding to the source language text by the feature acquiring unit may include:
and respectively coding the source language text, the voice information and the structure information of each text subunit in the source language text to obtain text characteristics corresponding to the source language text, voice characteristics corresponding to the voice information and structure characteristics corresponding to the source language text.
Optionally, the process of encoding the source language text, the speech information, and the structure information of each text subunit in the source language text by the feature obtaining unit may include:
coding is performed sequentially in the order of the speech information, the structure information and then the source language text, wherein:
coding the voice information to obtain voice characteristics;
when the structure information is coded, the coding result of the voice information is fused to obtain the structure characteristic of the fused voice information;
and when the source language text is coded, fusing the coding result of the voice information and the coding result of the structure information to obtain text characteristics fusing the voice information and the structure information.
Optionally, when the feature obtaining unit encodes the structure information, a process of fusing an encoding result of the speech information may include:
and fusing the coding result of the voice information and the coding result of the structure information in a first fusion mode.
Optionally, when the feature obtaining unit encodes the source language text, the process of fusing the encoding result of the speech information and the encoding result of the structure information may include:
and fusing the coding result of the voice information and the coding result of the structure information with the coding result of the source language text in a second fusion mode.
Optionally, the process of the feature reference unit translating the source language text based on the text feature, the speech feature and the structural feature to obtain a translated target language text may include:
determining a splicing feature based on the speech feature, the structural feature and the text feature;
and determining the translated target language text corresponding to the source language text based on the splicing characteristics.
Optionally, the determining, by the feature reference unit, a process of determining a splicing feature based on the speech feature, the structural feature, and the text feature may include:
splicing the voice feature, the structural feature and the text feature to obtain a first splicing feature;
or, alternatively,
fusing the voice feature, the structural feature and the text feature to obtain a fusion feature of the source language;
and splicing the fusion characteristic of the source language, the voice characteristic, the structure characteristic and the text characteristic to obtain a second spliced characteristic.
Optionally, the determining, by the feature reference unit, a process of the translated target language text corresponding to the source language text based on the splicing feature may include:
performing information conversion between the source language and the target language based on the splicing characteristics to obtain fusion characteristics, text characteristics, structural characteristics and voice characteristics of the target language;
and performing text decoding by combining the fusion characteristics and the text characteristics of the target language to obtain a target language text of the target language.
Optionally, the step of performing information conversion between the source language and the target language by the feature reference unit based on the splicing feature to obtain a fusion feature, a text feature, a structural feature, and a speech feature of the target language may include:
coding the splicing characteristics by a translation coder to obtain a coding result;
and decoding the coding result by a translation decoder to obtain the fusion characteristic, the text characteristic, the structural characteristic and the voice characteristic of the decoded target language.
Optionally, the translation apparatus of the present application may further include: the structure information decoding unit is used for carrying out structure decoding by combining the text decoding result and the structural characteristics of the target language to obtain the structural information of the target language text;
and the voice information decoding unit is used for carrying out voice decoding by combining the text decoding result, the structure decoding result and the voice characteristic of the target language to obtain the voice information of the target language text.
Optionally, the feature obtaining unit and the feature reference unit may be implemented based on a pre-trained translation model, the translation apparatus of the present application may further include a translation model training unit, and the training process for the translation model may include:
inputting a source language training text, voice information of the source language training text and structure information of each text subunit in the source language training text;
the translation model processes the input information and outputs a predicted target language text, structural information of the target language text and voice information of the target language text;
performing iterative training with minimizing a set loss function as the training target, wherein the set loss function comprises:
a text alignment loss between the target language text predicted by the model and the label target language text corresponding to the source language training text, and zero, one or both of a structure alignment loss and a speech alignment loss.
The translation device provided by the embodiment of the application can be applied to translation equipment, such as a terminal: mobile phones, computers, etc. Alternatively, fig. 10 shows a block diagram of a hardware structure of the translation device, and referring to fig. 10, the hardware structure of the translation device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiments of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, and the like;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program, the processor may call the program stored in the memory, and the program is used for:
acquiring a source language text, voice information of the source language text and structure information of each text subunit in the source language text;
acquiring text features corresponding to the source language text, voice features corresponding to the voice information and structural features corresponding to the source language text, wherein the structural features are determined based on structural information of each text subunit in the source language text;
and translating the source language text based on the text characteristics, the voice characteristics and the structure characteristics to obtain a translated target language text.
Optionally, the detailed functions and extended functions of the program may be as described above.
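Only as an orientation aid, the overall flow that the stored program realizes can be sketched as below; feature_extractor and translator are placeholders standing in for the pre-trained model components, not interfaces defined by the application.

```python
def run_translation_program(source_text, voice_info, structure_info,
                            feature_extractor, translator):
    """Sketch of the stored program: acquire inputs -> extract features -> translate."""
    # Step 1: the three kinds of information for the same source language text.
    inputs = (source_text, voice_info, structure_info)
    # Step 2: text, voice and structural features of the source language text.
    text_feat, voice_feat, struct_feat = feature_extractor(*inputs)
    # Step 3: translate with all three features as reference information.
    target_text = translator(text_feat, voice_feat, struct_feat)
    return target_text
```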
Embodiments of the present application further provide a storage medium in which a program suitable for execution by a processor may be stored, the program being configured to:
acquiring a source language text, voice information of the source language text and structure information of each text subunit in the source language text;
acquiring text features corresponding to the source language text, voice features corresponding to the voice information and structural features corresponding to the source language text, wherein the structural features are determined based on structural information of each text subunit in the source language text;
and translating the source language text based on the text characteristics, the voice characteristics and the structure characteristics to obtain a translated target language text.
Optionally, the detailed functions and extended functions of the program may be as described above.
Further, an embodiment of the present application also provides a computer program product, which when running on a terminal device, causes the terminal device to execute the steps of the translation method described in the foregoing embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, the embodiments may be combined as needed, and the same and similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A method of translation, comprising:
acquiring a source language text, voice information of the source language text and structure information of each text subunit in the source language text;
acquiring text features corresponding to the source language text, voice features corresponding to the voice information and structural features corresponding to the source language text, wherein the structural features are determined based on structural information of each text subunit in the source language text;
and translating the source language text based on the text characteristics, the voice characteristics and the structure characteristics to obtain a translated target language text.
2. The method of claim 1, wherein obtaining structural information for each text sub-unit in the source language text comprises:
acquiring an image of each text subunit in the source language text, wherein the image contains the structure information of the text subunit.
3. The method of claim 1, wherein the obtaining of the text feature corresponding to the source language text, the speech feature corresponding to the speech information, and the structural feature corresponding to the source language text comprises:
respectively coding the source language text, the voice information and the structure information of each text subunit in the source language text to obtain text characteristics corresponding to the source language text, voice characteristics corresponding to the voice information and structure characteristics corresponding to the source language text.
4. The method of claim 3, wherein encoding the source language text, the speech information, and the structure information of each text subunit in the source language text comprises:
coding is performed sequentially in the order of the voice information, the structure information and then the source language text, wherein:
coding the voice information to obtain voice characteristics;
when the structure information is coded, the coding result of the voice information is fused to obtain the structure characteristic of the fused voice information;
and when the source language text is coded, fusing the coding result of the voice information and the coding result of the structure information to obtain text characteristics fusing the voice information and the structure information.
5. The method according to claim 4, wherein said fusing the encoding result of the speech information when encoding the structure information comprises:
fusing the coding result of the voice information with the coding result of the structure information in a first fusion mode.
6. The method of claim 4, wherein fusing the encoded result of the speech information and the encoded result of the structure information when encoding the source language text comprises:
fusing the coding result of the voice information and the coding result of the structure information with the coding result of the source language text in a second fusion mode.
7. The method of claim 1, wherein translating the source language text based on the text feature, the speech feature, and the structural feature to obtain a translated target language text comprises:
determining a splicing feature based on the speech feature, the structural feature and the text feature;
and determining the translated target language text corresponding to the source language text based on the splicing characteristics.
8. The method of claim 7, wherein determining a splice feature based on the speech feature, the structural feature, and the text feature comprises:
splicing the voice feature, the structural feature and the text feature to obtain a first splicing feature;
or,
fusing the voice feature, the structural feature and the text feature to obtain a fusion feature of the source language;
and splicing the fusion characteristic of the source language, the voice characteristic, the structure characteristic and the text characteristic to obtain a second spliced characteristic.
9. The method of claim 7, wherein determining the translated target language text corresponding to the source language text based on the splice characteristic comprises:
performing information conversion between the source language and the target language based on the splicing characteristics to obtain fusion characteristics, text characteristics, structural characteristics and voice characteristics of the target language;
and performing text decoding by combining the fusion characteristics and the text characteristics of the target language to obtain a target language text of the target language.
10. The method according to claim 9, wherein the information conversion between the source language and the target language based on the splicing feature to obtain a fusion feature, a text feature, a structural feature and a speech feature of the target language comprises:
coding the splicing characteristics by a translation coder to obtain a coding result;
and decoding the coding result by a translation decoder to obtain the fusion characteristic, the text characteristic, the structural characteristic and the voice characteristic of the decoded target language.
11. The method of claim 9, wherein after text decoding in conjunction with the fused feature and the text feature of the target language, the method further comprises:
performing structure decoding by combining the text decoding result and the structural characteristics of the target language to obtain structural information of the target language text;
and performing voice decoding by combining the text decoding result, the structure decoding result and the voice characteristic of the target language to obtain the voice information of the target language text.
12. The method of claim 1, wherein the process of obtaining the text features, the speech features, and the structural features and translating the source language text is implemented by a pre-trained translation model;
the training process of the translation model comprises the following steps:
inputting a source language training text, voice information of the source language training text and structure information of each text subunit in the source language training text;
the translation model processes the input information and outputs a predicted target language text, structural information of the target language text and voice information of the target language text;
performing iterative training by taking a minimized set loss function as a training target, wherein the set loss function comprises the following steps:
zero, one or more of a text alignment loss, a structural alignment loss and a speech alignment loss between the source language training text and the target language text predicted by the model.
13. A translation apparatus, comprising:
the information acquisition unit is used for acquiring the source language text, the voice information of the source language text and the structure information of each text subunit in the source language text;
the feature acquisition unit is used for acquiring text features corresponding to the source language text, voice features corresponding to the voice information and structural features corresponding to the source language text, wherein the structural features are determined based on structural information of each text subunit in the source language text;
and the feature reference unit is used for translating the source language text based on the text feature, the voice feature and the structural feature to obtain a translated target language text.
14. A translation apparatus, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program and realizing the steps of the translation method according to any one of claims 1-12.
15. A storage medium having stored thereon a computer program for implementing the steps of the translation method according to any of claims 1 to 12 when executed by a processor.
16. A computer program product which, when run on a terminal device, causes the terminal device to perform the steps of the translation method of any of claims 1-12.
CN202011554126.5A 2020-12-24 2020-12-24 Translation method, device, equipment and storage medium Active CN112668346B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011554126.5A CN112668346B (en) 2020-12-24 2020-12-24 Translation method, device, equipment and storage medium
PCT/CN2020/141365 WO2022134164A1 (en) 2020-12-24 2020-12-30 Translation method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011554126.5A CN112668346B (en) 2020-12-24 2020-12-24 Translation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112668346A true CN112668346A (en) 2021-04-16
CN112668346B CN112668346B (en) 2024-04-30

Family

ID=75408619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011554126.5A Active CN112668346B (en) 2020-12-24 2020-12-24 Translation method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112668346B (en)
WO (1) WO2022134164A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114626392A (en) * 2022-01-29 2022-06-14 北京中科凡语科技有限公司 End-to-end text image translation model training method

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034965B (en) * 2023-08-08 2024-03-22 中国科学院自动化研究所 Image text translation method and device based on visual language pre-training

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859994B (en) * 2020-06-08 2024-01-23 北京百度网讯科技有限公司 Machine translation model acquisition and text translation method, device and storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080091407A1 (en) * 2006-09-28 2008-04-17 Kentaro Furihata Apparatus performing translation process from inputted speech
CN108460027A (en) * 2018-02-14 2018-08-28 广东外语外贸大学 A kind of spoken language instant translation method and system
CN109359305A (en) * 2018-09-05 2019-02-19 盛云未来(北京)科技有限公司 A kind of method and apparatus of multilingual intertranslation in unison
CN109408833A (en) * 2018-10-30 2019-03-01 科大讯飞股份有限公司 A kind of interpretation method, device, equipment and readable storage medium storing program for executing
CN109979461A (en) * 2019-03-15 2019-07-05 科大讯飞股份有限公司 A kind of voice translation method and device
CN110210043A (en) * 2019-06-14 2019-09-06 科大讯飞股份有限公司 Text translation method and device, electronic equipment and readable storage medium
CN110263353A (en) * 2019-06-25 2019-09-20 北京金山数字娱乐科技有限公司 A kind of machine translation method and device
CN110348025A (en) * 2019-07-18 2019-10-18 北京香侬慧语科技有限责任公司 A kind of interpretation method based on font, device, storage medium and electronic equipment
CN111090727A (en) * 2019-12-06 2020-05-01 苏州思必驰信息科技有限公司 Language conversion processing method and device and dialect voice interaction system
CN112036195A (en) * 2020-09-16 2020-12-04 北京小米松果电子有限公司 Machine translation method, device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114626392A (en) * 2022-01-29 2022-06-14 北京中科凡语科技有限公司 End-to-end text image translation model training method
CN114626392B (en) * 2022-01-29 2023-02-21 北京中科凡语科技有限公司 End-to-end text image translation model training method

Also Published As

Publication number Publication date
CN112668346B (en) 2024-04-30
WO2022134164A1 (en) 2022-06-30

Similar Documents

Publication Publication Date Title
CN107357789B (en) Neural machine translation method fusing multi-language coding information
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
CN109785824B (en) Training method and device of voice translation model
CN111916067A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN107967262A (en) A kind of neutral net covers Chinese machine translation method
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN113076441A (en) Keyword extraction method and device, electronic equipment and computer readable storage medium
CN111144110A (en) Pinyin marking method, device, server and storage medium
CN109740158B (en) Text semantic parsing method and device
CN113574595A (en) System and method for end-to-end speech recognition with triggered attention
CN110059324B (en) Neural network machine translation method and device based on dependency information supervision
WO2021127817A1 (en) Speech synthesis method, device, and apparatus for multilingual text, and storage medium
CN112668346B (en) Translation method, device, equipment and storage medium
CN114708474A (en) Image semantic understanding algorithm fusing local and global features
CN109979461B (en) Voice translation method and device
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
CN114282555A (en) Translation model training method and device, and translation method and device
CN112287641B (en) Synonym sentence generating method, system, terminal and storage medium
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN116958343A (en) Facial animation generation method, device, equipment, medium and program product
CN113593534B (en) Method and device for multi-accent speech recognition
CN115171647A (en) Voice synthesis method and device with natural pause processing, electronic equipment and computer readable medium
CN115408494A (en) Text matching method integrating multi-head attention alignment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230509

Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: NO.666, Wangjiang West Road, hi tech Zone, Hefei City, Anhui Province

Applicant before: IFLYTEK Co.,Ltd.

GR01 Patent grant
GR01 Patent grant