CN116486421B - Training method of image translation model and related products - Google Patents

Training method of image translation model and related products

Info

Publication number
CN116486421B
CN116486421B CN202310485338.XA CN202310485338A
Authority
CN
China
Prior art keywords
translation
text
image
feature vector
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310485338.XA
Other languages
Chinese (zh)
Other versions
CN116486421A (en)
Inventor
王启萌
吴世伟
高龑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shuhang Technology Beijing Co ltd
Original Assignee
Shuhang Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shuhang Technology Beijing Co ltd filed Critical Shuhang Technology Beijing Co ltd
Priority to CN202310485338.XA priority Critical patent/CN116486421B/en
Publication of CN116486421A publication Critical patent/CN116486421A/en
Application granted granted Critical
Publication of CN116486421B publication Critical patent/CN116486421B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses an image translation method, an image detection method, an image translation model training method, and related products. The method comprises the following steps: acquiring a text instruction Di corresponding to the ith round of translation, the text instruction Di being used to indicate the translation direction of the ith round of translation; acquiring a first feature vector of an image to be translated, the first feature vector being obtained by converting image features of the image to be translated into a text space in the first round of translation; and determining a translation text Ti corresponding to the ith round of translation according to the first feature vector, the text instruction Di, the text instruction corresponding to each round of translation in the previous i-1 rounds of translation, and the translation text corresponding to each of those rounds, the translation text Ti being used to represent the content of the image to be translated in the translation direction. The embodiment of the application thereby realizes translating an image into translated text.

Description

Training method of image translation model and related products
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to an image translation method, an image detection method, an image translation model training method, and related products.
Background
Translation models currently offered by various companies, such as Baidu Translate, Google Translate, and Youdao Translate, are text-based translation models: they translate, for example, Chinese text into English text, or English text into Chinese text.
However, in practical application scenarios, it is often necessary to convert image content into a form that facilitates subsequent processing; for example, captured image content may need to be translated into text in order to clarify what the picture contains. The current image translation approach is to extract the text information contained in an image and translate that text into a target language, thereby "translating" the image.
It can be seen that the current image translation approach does not actually translate the image itself; at the very least, it does not translate the image content into text. How to translate image content into text is therefore a technical problem to be solved.
Disclosure of Invention
The embodiment of the application provides an image translation method, an image detection method, an image translation model training method, and related products. By providing a text instruction that indicates a translation direction for an image, the image can be translated into text corresponding to the image content.
In a first aspect, an embodiment of the present application provides an image translation method, which is characterized by including:
acquiring a text instruction Di corresponding to the ith round of translation, wherein the text instruction Di is used for indicating the translation direction of the ith round of translation;
acquiring a first feature vector of an image to be translated, wherein the first feature vector is obtained by converting image features of the image to be translated into a text space in a first round of translation;
and determining a translation text Ti corresponding to the ith round of translation according to the first feature vector, the text instruction Di, the text instruction corresponding to each round of translation in the previous i-1 round of translation and the translation text corresponding to each round of translation, wherein the translation text Ti is used for representing the content of the image to be translated in the translation direction.
In a second aspect, an embodiment of the present application provides an image translation model training method, including:
constructing a first training data set, wherein the first training data set comprises a plurality of first sample data, each first sample data comprises a first sample image and a first text description corresponding to the first sample image, and the first text description is used for describing the content of the first sample image;
Constructing a second training data set, wherein the second training data set comprises a plurality of second sample data, each second sample data comprises a second sample image, a plurality of first text instructions and a plurality of standard translation texts corresponding to the second sample image, the plurality of first text instructions are in one-to-one correspondence with the plurality of standard translation texts, and one first text instruction of each second sample data is used for indicating a translation direction corresponding to the first text instruction;
and performing model training according to the first training data set and the second training data set to obtain the image translation model.
In a third aspect, an embodiment of the present application provides an image detection method, including:
acquiring an image to be detected;
performing one or more rounds of translation on the image to be detected to obtain one or more translation texts corresponding to the image to be detected, wherein each round of translation on the image to be detected is realized by the method of the first aspect;
detecting whether the image to be detected contains preset content or not according to the one or more translation texts;
if yes, determining that the image to be detected has risks.
In a fourth aspect, an embodiment of the present application provides an image translation apparatus, including: an acquisition unit and a processing unit;
the obtaining unit is configured to obtain a text instruction Di corresponding to an ith round of translation, where the text instruction Di is used to indicate a translation direction of the ith round of translation; acquiring a first feature vector of an image to be translated, wherein the first feature vector is obtained by converting image features of the image to be translated into a text space in a first round of translation;
the processing unit is configured to determine a translation text Ti corresponding to the ith translation according to the first feature vector, the text instruction Di, a text instruction corresponding to each translation in the previous i-1 translation and a translation text corresponding to each translation, where the translation text Ti is used to represent contents of the image to be translated in the translation direction.
In a fifth aspect, an embodiment of the present application provides an image translation model training device, including:
an acquisition unit configured to construct a first training data set, where the first training data set includes a plurality of first sample data, each of the first sample data including a first sample image, and a first text description corresponding to the first sample image, the first text description being used to describe contents of the first sample image;
Constructing a second training data set, wherein the second training data set comprises a plurality of second sample data, each second sample data comprises a second sample image, a plurality of first text instructions and a plurality of standard translation texts corresponding to the second sample image, the plurality of first text instructions are in one-to-one correspondence with the plurality of standard translation texts, and one first text instruction of each second sample data is used for indicating a translation direction corresponding to the first text instruction;
and the processing unit is used for carrying out model training according to the first training data set and the second training data set to obtain the image translation model.
In a sixth aspect, an embodiment of the present application provides an image detection apparatus, including: an acquisition unit and a processing unit;
the acquisition unit is configured to acquire an image to be detected;
the processing unit is configured to perform one or more rounds of translation on the image to be detected to obtain one or more translation texts corresponding to the image to be detected, where each round of translation on the image to be detected is implemented by the method according to the first aspect; detecting whether the image to be detected contains preset content or not according to the one or more translation texts; if yes, determining that the image to be detected has risks.
In a seventh aspect, embodiments of the present application provide an electronic device, including: a processor and a memory, the processor being connected to the memory, the memory being for storing a computer program, the processor being for executing the computer program stored in the memory to cause the electronic device to perform the method according to the first or second or third aspect.
In an eighth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program causing a computer to perform the method according to the first aspect or the second aspect or the third aspect.
In a ninth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method of the first, second, or third aspect.
The implementation of the embodiment of the application has the following beneficial effects:
It can be seen that, in the embodiment of the present application, when the ith round of translation (i.e., any round of translation) is performed on an image to be translated, the text instruction Di corresponding to the ith round of translation is first obtained. The translation text Ti corresponding to the ith round is then determined based on the text instruction Di, the text instruction corresponding to each round in the previous i-1 rounds of translation, the translation text (i.e., translation result) corresponding to each of those rounds, and the feature vector obtained by converting the image to be translated into the text space. In this way, image content is successfully translated into text, and the content indicated by the translation text Ti is the content of the image to be translated in the translation direction indicated by the text instruction Di, which ensures that the translation text output by each round of translation matches the user's expectation and guarantees the accuracy of image translation. Moreover, each round of translation is performed in combination with the translation results and text instructions of the previous rounds, which further ensures the accuracy of that round's translation of the image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of image translation provided in an embodiment of the present application;
Fig. 2 is a flow chart of an image translation model training method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an image translation model according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an image translation scenario provided in an embodiment of the present application;
Fig. 5 is a schematic flow chart of an image translation method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of implementing image translation by human-machine interaction according to an embodiment of the present application;
Fig. 7 is a schematic diagram of an image translation apparatus according to an embodiment of the present application;
Fig. 8 is a schematic diagram of an image translation model training device according to an embodiment of the present application;
Fig. 9 is a schematic diagram of an image detection device according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to facilitate understanding of the technical solutions of the present application, explanation will be first made of related technical terms related to the present application.
Text description: a text description is used to describe the content in an image, i.e., to describe the detailed content of the image.
Text instruction: a text instruction is expressed in text form and is used to indicate the translation direction of an image, i.e., to direct the machine as to how the image should be translated into text. For example, given an image and the text instruction "please describe the content in the image", the translation direction indicated by the text instruction is to translate the content of the image, so the text produced by the machine is directed towards describing the content of the image.
It should be noted that the user may input a text instruction to the image translation device by voice or through peripherals such as a keyboard, a mouse, or a display screen; the method by which the user inputs the text instruction is not limited in this application. The user may also input multiple rounds of text instructions to the image translation device to effect multiple rounds of translation. In addition, a text instruction may be in any language, for example Chinese, English, or French; the language of the text instruction is not limited in this application.
It should be noted that the image to be translated may or may not contain text information (for example, it may contain written text); the type of image is not limited in this application.
The image translation process of the present application will first be described with reference to the structure of the image translation model shown in fig. 1. As shown in fig. 1, the image translation model includes an image encoder, a target image-text semantic alignment module, and a text translation module. The target image-text semantic alignment module may adopt, but is not limited to, a Transformer network or a fully connected network; any neural network model capable of converting image features into text features falls within the protection scope of the application.
Based on the model structure shown in fig. 1, in the first round of translation the image to be translated is first input into the image encoder for feature extraction, yielding a feature map of the image to be translated; the target image-text semantic alignment module then performs feature conversion on this feature map, i.e., converts the image features of the image to be translated into the text space, to obtain a first feature vector.
Then, when the ith round of translation (i.e., any round of translation) is performed on the image to be translated, the text instruction Di corresponding to the ith round is obtained, the text instruction Di indicating the translation direction of the ith round; the first feature vector generated in the first round of translation is also obtained. Next, the text instruction and the translation text corresponding to each round in the previous i-1 rounds of translation are obtained. Finally, the first feature vector, the text instruction Di, and the text instructions and translation texts of the previous i-1 rounds are input into the text translation module to perform the ith round of translation on the image to be translated, yielding the translation text Ti corresponding to the ith round.
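By way of illustration only, the following Python sketch shows one possible realization of the multi-round flow just described; the class name, module interfaces, and caching strategy are assumptions of this description, not the disclosed implementation.
# Hypothetical sketch of the multi-round translation flow; the module objects
# (image_encoder, alignment_module, text_translator) are assumed callables.
class ImageTranslator:
    def __init__(self, image_encoder, alignment_module, text_translator):
        self.image_encoder = image_encoder      # extracts a feature map from the image
        self.alignment = alignment_module       # target image-text semantic alignment module
        self.text_translator = text_translator  # text translation module
        self.first_feature_vector = None        # cached after the first round
        self.history = []                       # (text instruction, translation text) pairs

    def translate_round(self, image, instruction):
        # First round: convert the image features into the text space and cache them.
        if self.first_feature_vector is None:
            feature_map = self.image_encoder(image)
            self.first_feature_vector = self.alignment(feature_map)
        # Round i: combine the cached first feature vector, the current instruction Di,
        # and the instructions/translations of the previous i-1 rounds.
        translated_text = self.text_translator(
            self.first_feature_vector, instruction, self.history
        )
        self.history.append((instruction, translated_text))
        return translated_text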
In order to facilitate understanding of the technical solution of the embodiments of the present application, the training process of the image translation model is described first. The training process is performed by an image translation model training device; the training device and the image translation device described below may or may not be the same device. When they are not the same device, after the image translation model training device has trained the image translation model, the trained model can be deployed to the image translation device to execute the image translation method. The device on which the image translation model is trained is not limited in this application.
Referring to fig. 2, fig. 2 is a flow chart of an image translation model training method according to an embodiment of the present application. The method includes, but is not limited to, the following steps:
201: a first training data set is constructed.
Wherein the first training data set comprises a plurality of first sample data, each first sample data comprising one first sample image, and a first text description corresponding to the first sample image, the first text description being for describing the content of the first sample image, i.e. the first text description corresponding to each first sample image is for describing the details in the first sample image.
202: a second training data set is constructed.
The second training data set comprises a plurality of second sample data, and each second sample data comprises a second sample image, a plurality of first text instructions corresponding to the second sample image and a plurality of standard translation texts.
The plurality of first text instructions corresponding to each second sample image correspond one-to-one with the plurality of standard translation texts. It will be appreciated that a first text instruction of the second sample image is used to indicate the translation direction corresponding to that first text instruction, i.e., to indicate that the machine should translate in that direction when translating the second sample image into text. The standard translation text corresponding to a first text instruction is the standard translation text that should be output after the first text instruction is input; it serves as the label corresponding to that first text instruction.
It should be noted that the plurality of first text instructions in each second sample datum may have a hierarchical relationship, or may be independent of one another with no hierarchical relationship; hierarchical relationships include, but are not limited to, progressive relationships, parallel relationships, association relationships, and so on. A progressive relationship means that a later first text instruction builds on the translated text corresponding to the preceding first text instruction. For example, the preceding first text instruction is "please describe the content in the picture" and the following first text instruction is "please describe the features of a certain object in the picture in detail"; these two first text instructions are in a progressive relationship, as illustrated by the sketch below. Whether the plurality of first text instructions have a hierarchical relationship is not limited in this application.
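For concreteness, one second sample datum might be laid out as follows; the field names and example contents are purely illustrative assumptions, not the notation of this application.
# Illustrative layout of one second sample datum: a second sample image with
# several first text instructions and their one-to-one standard translation texts.
second_sample = {
    "image": "beach_photo.jpg",
    "first_text_instructions": [
        "Please describe the content in the picture.",           # round 1
        "Please describe the people in the picture in detail.",  # progressive follow-up
    ],
    "standard_translation_texts": [
        "A group of people are playing volleyball on a sunny beach.",
        "Two adults and a child are jumping for the ball near the net.",
    ],
}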
203: and performing model training according to the first training data set and the second training data set to obtain an image translation model.
To facilitate an understanding of the training process of the present application, an example will be described below in connection with the specific model structure shown in fig. 3.
As shown in fig. 3, the image translation model includes an image encoder, a first image-text semantic alignment module, a first mapping layer (first embedding layer), and a text translation module, wherein the text translation module includes a second mapping layer (second embedding layer) and a decoder. Optionally, the first image-text semantic alignment module may adopt, but is not limited to, a Transformer network or a fully connected network; any neural network model that can convert image features into text features falls within the protection scope of the application. In this application, the first image-text semantic alignment module is mainly described by taking a Transformer network or a fully connected network as an example.
Training of the image translation model mainly comprises two stages: the first training stage trains the image translation model with the first training data set, and the second training stage trains (which can be understood as fine-tuning) the model obtained in the first stage again with the second training data set to obtain the final image translation model. The image encoder, the first mapping layer, and the text translation module are all pre-trained and do not need to be adjusted during training of the image translation model; in other words, the training process mainly trains the first image-text semantic alignment module. For convenience of distinction, the originally designed image-text semantic alignment module is called the first image-text semantic alignment module, the module obtained after the first training stage is called the second image-text semantic alignment module, and the module obtained after the second training stage is called the target image-text semantic alignment module. That is, the image-text semantic alignment module used in subsequent practical application is the target image-text semantic alignment module.
The training process of the first training phase is described first.
The first image-text semantic alignment module is trained according to a plurality of first sample data in the first training data set, and a second image-text semantic alignment module is obtained.
For each first sample image in the first training data set, a fourth feature vector of the first sample image is obtained, wherein the fourth feature vector corresponding to the first sample image is a feature vector of the first sample image in a text space, that is, image features of the first sample image are converted into the text space, and the fourth feature vector of the first sample image is obtained.
Optionally, when the first image-text semantic alignment module adopts a Transformer network, each first sample image is first partitioned into blocks to obtain a plurality of second image blocks; each second image block is then input into the image encoder for feature extraction, yielding a feature map of each second image block; finally, the feature maps of the second image blocks are converted into the text space to obtain the fourth feature vector corresponding to each first sample image.
Specifically, the feature map of each second image block is flattened to obtain a sixth feature vector corresponding to that second image block. The sixth feature vectors are then input into the first image-text semantic alignment module and cross-attention is performed between them and the first reference vectors, yielding a seventh feature vector corresponding to each first sample image; there are a plurality of first reference vectors, and each first reference vector (i.e., query) is obtained by random initialization. During training, the first reference vectors are continuously adjusted based on the loss; after training of the image translation model is completed, the first reference vectors have been adjusted into target reference vectors, which at that point serve to convert the image features of an image into the text space. Finally, feature extraction is performed on the seventh feature vector corresponding to each first sample image to obtain the fourth feature vector of that image, i.e., the seventh feature vector is processed by the self-attention layers, feed-forward layers, and other network layers of the Transformer to obtain the fourth feature vector corresponding to each first sample image.
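A minimal PyTorch sketch of the Transformer-style alignment just described is given below; the layer sizes, number of reference vectors, and module composition are assumptions for illustration only.
import torch
import torch.nn as nn

class TransformerAlignment(nn.Module):
    """Sketch: learnable reference vectors (queries) cross-attend to the flattened
    image-block features, then pass through self-attention and feed-forward layers
    to produce the fourth feature vector."""
    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        # First reference vectors, randomly initialised and adjusted during training.
        self.reference_vectors = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, block_features):  # block_features: (batch, num_blocks, dim)
        batch = block_features.size(0)
        queries = self.reference_vectors.unsqueeze(0).expand(batch, -1, -1)
        x, _ = self.cross_attn(queries, block_features, block_features)  # seventh feature vector
        x, _ = self.self_attn(x, x, x)
        return self.ffn(x)  # fourth feature vector, already in the text space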
Optionally, when the first image-text semantic alignment module adopts a fully connected network, the feature map corresponding to each first sample image is obtained and then input into the first image-text semantic alignment module to obtain the fourth feature vector corresponding to that first sample image. Illustratively, the feature map corresponding to each first sample image is converted into the fourth feature vector by the feature conversion function of the fully connected network, i.e., a function that converts image features into the text space. Under the constraint of continued training, the fully connected network eventually learns to accurately convert the image features of an image into the text space.
Further, the first text description corresponding to each first sample image is mapped to obtain a fifth feature vector corresponding to that first text description, i.e., the first text description is input into the first mapping layer for mapping processing to obtain the fifth feature vector. Then, a first loss is obtained according to the fourth feature vector corresponding to each first sample image and the fifth feature vector corresponding to the first text description of each first sample image.
It should be appreciated that, since different first sample images contain different content, the content described by the first text descriptions of different first sample images also differs. If the image-text semantic alignment module aligns image features to the text space well, the fourth feature vector of each first sample image and the fifth feature vector of that image's first text description should be very similar, i.e., the similarity between them is large; conversely, the fourth feature vector of a first sample image should be dissimilar to the fifth feature vectors of the first text descriptions of the other sample images, i.e., those similarities are relatively small.
For example, suppose the two first sample images are an image of a cat and an image of a dog, and the two corresponding first text descriptions describe the appearance of the cat and the appearance of the dog, respectively. If the image-text semantic alignment module aligns image features to the text space well, the information represented by the fourth feature vector of the cat image includes the cat's appearance, and the information represented by the fourth feature vector of the dog image includes the dog's appearance. Consequently, the similarity between the fourth feature vector of the cat image and the fifth feature vector of the text description of the cat's appearance is large (i.e., they are similar), while the similarity between the fourth feature vector of the cat image and the fifth feature vector of the text description of the dog's appearance is small (i.e., they are dissimilar).
Based on the above description, the similarity between the fourth feature vector of each first sample image and the fifth feature vectors corresponding to the first text descriptions may be obtained, so as to obtain the similarities corresponding to each first sample image, and the first loss may be determined according to the similarities corresponding to each first sample image.
Specifically, a similarity matrix is constructed from the plurality of similarities corresponding to each first sample image. The elements on the main diagonal of the similarity matrix are the similarities between the fourth feature vector of each first sample image and the fifth feature vector of that image's first text description, and each row of the matrix contains the similarities between the fourth feature vector of one first sample image and the fifth feature vectors corresponding to the plurality of first text descriptions.
For example, in an m x m similarity matrix, the diagonal elements a11, ..., amm are the similarities between the fourth feature vector of each first sample image and the fifth feature vector of that image's own first text description, while a row a11, ..., a1m can be understood as the similarities between the fourth feature vector of the first sample image and the fifth feature vectors of the plurality of first text descriptions.
Further, a label is constructed for each first sample image. For example, the label corresponding to each first sample image may be generated by one-hot encoding: the position of the first text description corresponding to that first sample image is encoded as 1 and the other positions as 0, giving the label of the first sample image; for example, one first sample image may have the label [1 0 ... 0]. The loss corresponding to each first sample image is then determined based on the plurality of similarities corresponding to that image and its label; for example, the similarities corresponding to each first sample image are taken from the similarity matrix to form a vector, and the loss for that image is obtained by comparing this vector with the image's label. Finally, the losses corresponding to the first sample images are averaged to obtain the first loss.
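The first loss described above can be sketched as follows, assuming each first sample image is represented by a single pooled fourth feature vector; the temperature value and the use of cross-entropy against the one-hot label are illustrative assumptions.
import torch
import torch.nn.functional as F

def first_loss(fourth_vectors, fifth_vectors, temperature=0.07):
    """Contrastive sketch: build the m x m similarity matrix between image-side
    fourth feature vectors and text-side fifth feature vectors, then compare each
    row with its one-hot label (the matching text description is on the diagonal)."""
    fourth_vectors = F.normalize(fourth_vectors, dim=-1)   # (m, d)
    fifth_vectors = F.normalize(fifth_vectors, dim=-1)     # (m, d)
    sim = fourth_vectors @ fifth_vectors.t()               # similarity matrix, diagonal = matches
    labels = torch.arange(sim.size(0))                     # index form of the one-hot labels
    per_sample = F.cross_entropy(sim / temperature, labels, reduction="none")
    return per_sample.mean()                               # average over the first sample images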
It can be seen that, in obtaining the first loss, a plurality of first sample images need to be input into the image translation model so that contrastive learning can be performed. It should be noted that, in each computation of the first loss, either all first sample images in the first training data set or only part of them may be input; the number of first sample images can be set according to actual requirements, as long as the first loss can be obtained by contrastive learning.
It can be seen that, through contrastive learning, the similarity between the fourth feature vector of each first sample image and the fifth feature vector of its own first text description is made larger, while the similarity between its fourth feature vector and the fifth feature vectors of the first text descriptions of other first sample images is made smaller. In this way, the image-text semantic alignment module learns to convert features of an image from the image space into text features in the text space with high precision, without producing text features irrelevant to the image content, which improves the precision of converting image features into text features.
The above constrains the conversion accuracy of the first image-text semantic alignment module from the perspective of the contrastive learning task. In addition, whether the text features produced by the first image-text semantic alignment module for any first sample image are accurate can be measured by whether the text description decoded from those text features matches the first text description of that image (i.e., the label in the text-description dimension). Therefore, in the first training stage, the application sets another training task: the second text description corresponding to each first sample image is predicted from its fourth feature vector, i.e., the fourth feature vector is decoded to obtain the second text description, which is the predicted text describing the content of that first sample image. A second loss is then obtained from the second text description and the first text description corresponding to each first sample image; for example, the similarity between the second text description and the first text description of each first sample image may be taken as the loss for that sample, and the losses over the first samples averaged to obtain the second loss. Finally, the first image-text semantic alignment module is trained according to the first loss and the second loss to obtain the second image-text semantic alignment module; for example, the first loss and the second loss are weighted to obtain a first target loss, and the model parameters of the first image-text semantic alignment module are adjusted based on the first target loss to obtain the second image-text semantic alignment module.
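Assuming the second loss is a token-level captioning objective and the two losses are combined by simple weighting, the first target loss might be computed as follows; the weights and loss form are assumptions, not the disclosed choice.
import torch.nn.functional as F

def first_target_loss(l_first, decoded_logits, first_description_token_ids,
                      alpha=0.5, beta=0.5):
    """Sketch: l_first is the contrastive first loss computed above; the second loss
    compares the second text description decoded from the fourth feature vector
    (decoded_logits: batch x tokens x vocab) with the ground-truth first text
    description (token ids: batch x tokens)."""
    l_second = F.cross_entropy(decoded_logits.transpose(1, 2), first_description_token_ids)
    return alpha * l_first + beta * l_second  # weighted first target loss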
It should be noted that, if the first image-text semantic alignment module adopts a Transformer structure, adjusting the model parameters of the first image-text semantic alignment module includes adjusting the initialized first reference vectors; for convenience of distinction, the reference vectors obtained once the second image-text semantic alignment module is obtained are referred to as second reference vectors.
The training process of the second training phase will be described below by taking any one of the second sample images as an example.
For any second sample image in the second training data set, the first text instruction Hj corresponding to the jth round of translation is obtained from the plurality of first text instructions corresponding to that second sample image, where j takes integer values from 1 to n. Specifically, n rounds of translation may be performed on the second sample image; in each round, one first text instruction is selected from the plurality of first text instructions corresponding to the second sample image in order to translate it, and the first text instructions selected in different rounds may be the same or different, which is not limited in this application.
Then, an eighth feature vector corresponding to the second sample image is obtained; the eighth feature vector is obtained by converting the image features of the second sample image into the text space during the first round of translation of that image. It can be understood that the process of obtaining the eighth feature vector of the second sample image is similar to the process of obtaining the fourth feature vector of the first sample image, except that the eighth feature vector is produced by the second image-text semantic alignment module, so it is not described again here.
Then, the eighth feature vector, the first text instruction Hj, the first text instruction corresponding to each round in the previous j-1 rounds of translation together with its standard translation text, and the translation text predicted in each of the previous j-1 rounds are input into the text translation module to predict the translation text Kj corresponding to the jth round of translation.
Specifically, the first text instruction Hj, the first text instruction corresponding to each round in the previous j-1 rounds of translation together with its standard translation text, and the translation text predicted in each of those rounds are spliced to obtain the target text instruction corresponding to the jth round of translation. The target text instruction corresponding to the jth round is then mapped by the text translation module, i.e., it is input into the second mapping layer for mapping, to obtain a ninth feature vector corresponding to the jth round of translation.
For example, in order for the text translation module to better understand the first text instructions and the translated texts of the previous rounds, when the first text instruction Hj, the first text instructions of the previous j-1 rounds, their standard translation texts, and the predicted translation texts are spliced, a preset character sequence is added before each round's first text instruction (including that of the jth round) and before each translated text (when a translated text exists). For example, the characters "input:" may be added before the first text instruction of each round, and the characters "output:" may be added before the translated text. After the text translation module reads these preset characters, it can determine which parts are the first text instructions and translated texts of the previous rounds and which is the text instruction of the current round, so that it can better understand the semantic information described by the target text instruction.
For example, as shown in fig. 3, in the nth round of translation (i.e., j = n), the first text instruction corresponding to each round in the previous n-1 rounds, its standard translation text, and the translation text predicted in each round are spliced, after adding the preset characters, into the target text instruction corresponding to the nth round: "input: text instruction 1 output: translation text 1 standard translation text: standard translation text 1 input: text instruction 2 output: translation text 2 standard translation text: standard translation text 2 ... input: text instruction n output:".
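The splicing rule can be sketched as a simple string-building routine; the exact marker strings and separators are assumptions based on the example above.
def build_target_text_instruction(history, current_instruction):
    """history: list of (first text instruction, predicted translation text,
    standard translation text) tuples for the previous j-1 rounds of translation."""
    parts = []
    for instruction, predicted, standard in history:
        parts.append(f"input: {instruction}")
        parts.append(f"output: {predicted}")
        parts.append(f"standard translation text: {standard}")
    parts.append(f"input: {current_instruction}")
    parts.append("output:")  # the text translation module continues from here
    return " ".join(parts)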
It should be noted that the standard translation texts of the previous rounds are spliced in because the translation text predicted in a given round is not necessarily correct. If only the predicted translation texts of the previous rounds were provided, an incorrect predicted text would be used to guide the later rounds, causing large errors in those rounds and degrading the overall training effect. Once the standard translation texts of the previous rounds are spliced in, even if a previous round's predicted translation text is wrong, the correct translation text (i.e., the standard translation text) is still available in subsequent rounds, so the subsequent translation process is not affected; moreover, the difference between the predicted translation text and the standard translation text of a previous round can be compared, which better guides the subsequent rounds and improves model training efficiency.
It should be noted that, as shown in fig. 3, for the first round of translation there is no preceding translation process, and therefore no translated text or first text instruction from previous rounds to splice. The preset characters are simply added to the first text instruction of the first round, and this instruction is directly input into the second mapping layer for mapping, yielding the feature vector of the first text instruction in the first round of translation.
Optionally, after the ninth feature vector corresponding to the jth round of translation is obtained, as shown in fig. 3, the ninth feature vector may be concatenated (concat) with the eighth feature vector of the second sample image to obtain the target feature vector corresponding to the jth round of translation. The target feature vector is then decoded by the text translation module to predict the translation text Kj corresponding to the jth round, i.e., the target feature vector corresponding to the jth round is input into the decoder of the text translation module for decoding, and the translation text Kj corresponding to the jth round is output.
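One training-time round of the text translation module can then be sketched as below; the tensor shapes and module interfaces are assumptions for illustration.
import torch

def predict_round_j(eighth_vector, target_instruction_ids, second_mapping_layer, decoder):
    """Sketch: map the token ids of the spliced target text instruction to the ninth
    feature vector, concatenate it with the image's eighth feature vector along the
    sequence axis, and decode the result into the predicted translation text Kj."""
    ninth_vector = second_mapping_layer(target_instruction_ids)    # (batch, text_len, dim)
    target_vector = torch.cat([eighth_vector, ninth_vector], dim=1)
    return decoder(target_vector)                                  # predicted translation text Kj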
Further, the loss corresponding to the jth round of translation is determined according to the translation text Kj and the standard translation text corresponding to the first text instruction Hj; for example, the similarity between the translation text Kj and that standard translation text may be used as the loss for the jth round. From the losses of the individual rounds, the loss of the second sample image in each round of translation is obtained, and the loss corresponding to the second sample image is then obtained from its per-round losses; for example, the n losses of the second sample image over the n rounds of translation are averaged to give the loss corresponding to that second sample image. Further, a third loss is obtained from the losses corresponding to the second sample images; for example, the average of the losses of the plurality of second sample images may be taken as the third loss. Finally, the second image-text semantic alignment module is trained according to the third loss to obtain the image translation model, i.e., the model parameters of the second image-text semantic alignment module are adjusted based on the third loss, and the image-text semantic alignment module obtained when the training stop condition is reached is called the target image-text semantic alignment module.
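The aggregation of the third loss from the per-round losses, as described above, reduces to two averaging steps; the sketch below assumes the per-round losses are already available as scalars.
def third_loss(per_round_losses):
    """per_round_losses: one list of n per-round losses for each second sample image.
    Average over the rounds of each image, then over the second sample images."""
    per_image = [sum(rounds) / len(rounds) for rounds in per_round_losses]
    return sum(per_image) / len(per_image)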
It should be noted that, if the target image-text semantic alignment module is a Transformer network, the reference vectors obtained at this point are referred to as target reference vectors.
It should be noted that the first mapping layer and the second mapping layer have the same mapping relationship. This ensures that the mapping applied to the first text descriptions in the first training stage is the same as the mapping applied to the first text instructions and the target text instructions of each round of translation in the second training stage, so that the feature spaces of the image-text semantic alignment module and the text translation module can be aligned accurately.
It should be noted that the first mapping layer is only used during training of the image translation model; after training is completed, the first mapping layer may be removed in practical application.
It should be noted that a semi-automatic construction method may be employed for constructing the first training data set and the second training data set described above. Illustratively, a portion of the first training data set is first constructed manually; the image-text semantic alignment module is then trained with this small first training data set, and after training, the trained module is used to generate another portion of the first training data set, the two portions together serving as the first training data set of the application. Similarly, for the second training data set, a portion may be constructed first, model training performed, and another portion generated with the trained image translation model, the two portions together serving as the second training data set of the application.
Referring to fig. 4, fig. 4 is a schematic diagram of an image translation scenario provided in an embodiment of the present application.
As shown in fig. 4, the image translation device acquires text instructions from the user for the image to be translated. More specifically, the user can carry out multiple rounds of human-machine interaction with the image translation device; each round of interaction inputs one text instruction and yields one translated text, so multiple rounds of interaction realize multiple rounds of translation of the image to be processed. The image translation device may then display the translated text obtained in each round on its display interface for review by the user.
It should be noted that the image translation method of the present application may be deployed locally on the image translation device, or deployed on a server, in which case the image translation device invokes the method from the server to realize the image translation function. When the method is invoked from the server, after the image translation device obtains a text instruction, it sends the text instruction and the image to be processed to the server and requests the server to translate the image based on the text instruction; the process by which the server performs the translation is similar to that performed by the image translation device and is not repeated. This application mainly takes execution by the image translation device as the example.
As shown in fig. 4, in the first round of translation the user inputs text instruction 1 and the image translation device outputs translation text 1; in the second round the user inputs text instruction 2 and the device outputs translation text 2; ...; and in the nth round the user inputs text instruction n and the device outputs translation text n.
Specifically, in the ith round of translation, the image translation device acquires the text instruction Di corresponding to the ith round, the text instruction Di indicating the translation direction of the ith round, where i is an integer greater than or equal to 1 and less than or equal to n. It acquires the first feature vector of the image to be translated, the first feature vector being obtained by converting the image features of the image to be translated into the text space in the first round of translation. It then determines the translation text Ti corresponding to the ith round according to the first feature vector, the text instruction Di, the text instruction corresponding to each round in the previous i-1 rounds of translation, and the translation text corresponding to each of those rounds, the translation text Ti representing the content of the image to be translated in the translation direction. Finally, the image translation device outputs the translation text Ti, for example by displaying it on its display interface for review by the user.
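A possible multi-round interaction loop matching fig. 4 is sketched below; it assumes an image-translation component with the interface sketched earlier in this description, which is itself an assumption.
# Hypothetical driver loop for the interaction in fig. 4; `translator` and `image`
# are assumed to be provided by the surrounding application.
instructions = [
    "Please describe the content in the picture.",
    "Please describe the person in the picture in detail.",
]
for i, instruction in enumerate(instructions, start=1):
    translated_text = translator.translate_round(image, instruction)
    print(f"Round {i} input: {instruction}")
    print(f"Round {i} output: {translated_text}")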
Based on the application scenario of the image translation described above, several more specific application scenarios are described below.
Scene 1: notebook publishing scene
1.1, generating a title and a text for the note:
exemplary, obtaining a note issued by a user, wherein the note contains an image; and performing one or more rounds of translation on the image based on the image translation method to obtain one or more translation texts corresponding to the one or more rounds of translation. A title and body are then generated for the note based on the one or more translated text. Of course, if the note contains text information, one or more of the translated text may be synthesized, and the text information may generate a title and body for the note.
It can be seen that, based on the image translation method of the application, a title and body can be generated for a note even if the note published by the user contains no text information, which facilitates classifying, archiving, and recommending the note.
1.2, automatically generating comments for notes:
for example, a target note is obtained from a database, wherein the target note is a note with comments smaller than a threshold in the database; then, acquiring an image in the target note; and performing one or more rounds of translation on the image based on the image translation method to obtain one or more translation texts corresponding to the one or more rounds of translation. Comments are then automatically generated for the target note based on the one or more translated text. Of course, if the target note contains text information, one or more translated texts may be synthesized, and the text information automatically generates comments for the target note.
It can be seen that, based on the image translation method, comments can be automatically generated for notes with few comments (even when no text information is available) so as to interact with the author, which encourages the author to keep publishing notes.
Scene 2: search scenarios
2.1, note screening:
For example, for a note containing an image, one or more rounds of translation may be performed on the image based on the image translation method of the present application, resulting in one or more translation texts corresponding to the one or more rounds of translation. Then, when the user performs a search, the user's search term is obtained and a plurality of candidate notes are recalled for the user based on the search term. For each candidate note, whether the candidate note is relevant to the search term is determined based on the one or more translation texts of that candidate note; if relevant, the candidate note is retained and successfully recalled to the user; if not, the candidate note is discarded and not recalled to the user.
According to the image translation method, images can be translated into texts, so that the notes recalled during a user search can be screened and only relevant notes are recalled for the user, which improves recall accuracy.
2.2, note recall:
For example, for a note that contains an image but lacks a title and/or body text, one or more rounds of translation may be performed on the image based on the image translation method of the present application to obtain one or more translation texts corresponding to the one or more rounds of translation. A title and body are then generated for the note based on the one or more translation texts. When a user subsequently searches, the search term can be matched against the generated title and body, so that the note can be recalled.
It can be seen that, by translating the image into text based on the image translation method, notes without a title and/or body text can still be recalled, which improves the scope and comprehensiveness of note recall.
2.3, note label extraction:
For example, for a note containing an image, one or more rounds of translation may be performed on the image based on the image translation method of the application to obtain one or more translation texts corresponding to the one or more rounds of translation; tags are then constructed for the note based on the one or more translation texts.
Scene 3: note examination
Illustratively, an image to be detected is obtained, where the image to be detected may be an image in a note to be published; then, one or more rounds of translation are performed on the image based on the image translation method to obtain one or more translation texts corresponding to the one or more rounds of translation; whether the image to be detected contains preset content is then detected according to the one or more translation texts, thereby determining whether the note to be published is at risk, for example whether it violates laws or regulations or constitutes malicious marketing. For example, whether the note to be published is at risk may be determined by identifying whether the one or more translation texts contain a preset keyword.
It can be seen that, based on the translation method of the application, the image to be detected can be converted into text, and whether the image to be detected is at risk is then determined based on its one or more translation texts. Because comparing texts is more efficient and more precise than comparing images, the translation method of the application can improve both the precision and the efficiency of reviewing the image to be detected.
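As a simple illustration of the keyword-based check mentioned above, the following sketch assumes the preset keywords and the translation texts are already available as plain strings; the function name, the exact matching rule, and the sample data are hypothetical.

```python
# Hedged sketch: flag a note as risky if any translation text of its image
# contains any preset keyword (exact substring match is an assumption).
def note_is_at_risk(translation_texts, preset_keywords):
    return any(keyword in text
               for text in translation_texts
               for keyword in preset_keywords)

# Usage with made-up data:
# note_is_at_risk(["a poster advertising a lottery"], ["lottery", "gamble"])
```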
Referring to fig. 5, fig. 5 is a flowchart of an image translation method according to an embodiment of the present application. The method is applied to the image translation device. The method includes, but is not limited to, the following steps:
501: acquiring a text instruction Di corresponding to the ith round of translation.
The text instruction Di is used for indicating the translation direction of the ith round of translation.
Alternatively, the text instruction Di may be input to the image translation apparatus by the user through voice or through peripheral devices such as a keyboard, a mouse, or a display screen.
502: a first feature vector of an image to be translated is obtained.
The first feature vector is obtained by converting the image features of the image to be translated into a text space in the first round of translation.
For example, in the first round of translation, the image features of the image to be translated can be converted into the text space through the image translation model to obtain the first feature vector of the image to be translated.
Optionally, when the target image-text semantic alignment module adopts a Transformer network, the image to be translated is partitioned into a plurality of first image blocks, where the image to be translated may be partitioned according to a preset partition scale; feature extraction is then performed on each first image block to obtain a feature map corresponding to each first image block. Illustratively, each first image block is input into an image encoder to obtain the feature map of each first image block, and the feature map of each first image block is then converted into the text space to obtain the first feature vector. Specifically, the feature map corresponding to each first image block is tiled to obtain a feature vector corresponding to each first image block; attention processing is performed on the feature vector corresponding to each first image block and a target reference vector (namely, the reference vector obtained after model training is completed) to obtain a third feature vector, that is, the feature vector corresponding to each first image block is input into the target image-text semantic alignment module and attention processing with the target reference vector is performed to obtain the third feature vector. Feature extraction is then performed on the third feature vector to obtain the first feature vector, that is, deeper features of the third feature vector are extracted through the target image-text semantic alignment module to obtain the first feature vector.
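The following PyTorch sketch illustrates the Transformer-style conversion described above: tiled patch feature vectors are attended against a learned reference vector and then passed through a further feature-extraction layer. The class name, layer sizes, number of reference vectors, and the use of nn.MultiheadAttention are illustrative assumptions, not the concrete model of this application.

```python
# Minimal sketch of converting patch feature vectors into the text space
# via attention with a learned (target) reference vector.
import torch
import torch.nn as nn

class TextSpaceAligner(nn.Module):
    def __init__(self, patch_dim=768, text_dim=512, num_refs=32, heads=8):
        super().__init__()
        # reference vector(s); after training these become the target reference vector
        self.reference = nn.Parameter(torch.randn(num_refs, patch_dim))
        self.attn = nn.MultiheadAttention(patch_dim, heads, batch_first=True)
        # deeper feature extraction into the text space
        self.extract = nn.Sequential(
            nn.Linear(patch_dim, patch_dim), nn.GELU(),
            nn.Linear(patch_dim, text_dim),
        )

    def forward(self, patch_features):            # (B, num_patches, patch_dim)
        refs = self.reference.unsqueeze(0).expand(patch_features.size(0), -1, -1)
        # attention between the reference vectors and the tiled patch features
        third_feature, _ = self.attn(refs, patch_features, patch_features)
        return self.extract(third_feature)         # first feature vector, in text space

# patch_features would come from tiling the per-block feature maps produced by
# the image encoder; e.g. TextSpaceAligner()(torch.randn(1, 196, 768)) -> (1, 32, 512)
```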
Optionally, when the target image-text semantic alignment module adopts a fully connected network, feature extraction is performed on the image to be translated to obtain a feature map of the image to be translated; for example, the features of the image to be translated are extracted through the image encoder to obtain the feature map. Feature extraction is then performed on the feature map of the image to be translated to obtain the first feature vector, that is, the feature map of the image to be translated is input into the target image-text semantic alignment module, which performs feature extraction on it to obtain the first feature vector.
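For the fully connected variant, a correspondingly minimal sketch is given below; the global average pooling of the feature map and the layer sizes are assumptions made only for illustration.

```python
# Hedged sketch: map the encoder feature map into the text space with
# stacked fully connected layers (pooling is an illustrative assumption).
import torch
import torch.nn as nn

class FullyConnectedAligner(nn.Module):
    def __init__(self, feat_dim=2048, text_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, text_dim),
        )

    def forward(self, feature_map):                # (B, C, H, W) from the encoder
        pooled = feature_map.mean(dim=(2, 3))       # (B, C) global average pooling
        return self.mlp(pooled)                     # first feature vector in text space

# e.g. FullyConnectedAligner()(torch.randn(1, 2048, 7, 7)) -> (1, 512)
```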
It should be noted that, after the first feature vector of the image to be translated is obtained in the first round of translation, it can be directly reused in subsequent rounds of translation without performing feature extraction on the image to be translated again.
503: determining a translation text Ti corresponding to the ith round of translation according to the first feature vector, the text instruction Di, the text instruction corresponding to each of the previous i-1 rounds of translation, and the translation text corresponding to each round of translation.
The translation text Ti is used for representing the content of the image to be translated in the translation direction.
Specifically, the text instruction Di, the text instruction corresponding to each of the previous i-1 rounds of translation, and the translation text corresponding to each of those rounds are spliced to obtain a target text instruction corresponding to the ith round of translation. The target text instruction corresponding to the ith round of translation is then mapped to obtain a second feature vector. Similar to the training process shown in fig. 3, before the target text instruction corresponding to the ith round of translation is mapped, a preset character is added in front of the text instruction and the translation text of each round, and the target text instruction with the preset characters added is input into a second mapping layer for mapping to obtain the second feature vector. The first feature vector and the second feature vector are then spliced to obtain a target feature vector corresponding to the ith round of translation. Finally, the target feature vector corresponding to the ith round of translation is decoded to obtain the translation text Ti corresponding to the ith round of translation, that is, the target feature vector corresponding to the ith round of translation is input into a decoder to obtain the translation text Ti.
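A minimal sketch of this splicing, mapping, and decoding step is given below, including one possible form of the build_target_instruction helper assumed in the earlier loop sketch. The tokenizer, second mapping layer, decoder, and the preset character are placeholders for illustration rather than the actual components of the application.

```python
# Hedged sketch of round i: splice the instruction/translation history,
# map it to the second feature vector, splice with the first feature
# vector, and decode. All component objects are assumed to be provided.
import torch

PRESET = "<r>"   # assumed stand-in for the preset character

def build_target_instruction(instruction_i, history):
    parts = []
    for past_instruction, past_translation in history:
        parts += [PRESET + past_instruction, PRESET + past_translation]
    parts.append(PRESET + instruction_i)
    return " ".join(parts)

def translate_round_i(first_feature_vector, instruction_i, history,
                      tokenizer, second_mapping_layer, decoder):
    target_text_instruction = build_target_instruction(instruction_i, history)
    token_ids = tokenizer(target_text_instruction)             # assumed tokenizer
    second_feature_vector = second_mapping_layer(token_ids)    # (B, L_text, D)
    # splice image-derived and instruction-derived features
    target_feature_vector = torch.cat(
        [first_feature_vector, second_feature_vector], dim=1)
    return decoder(target_feature_vector)                      # translation text Ti
```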
It should be noted that the user may perform m rounds of man-machine interaction with the image translation device, each round of which inputs one text instruction and outputs one translation text, where m is determined by how many interactions the user needs. Specifically, whenever the image translation device receives a text instruction input by the user, it performs the current round of translation in combination with the previous rounds of translation. In addition, the m text instructions input in the m rounds of man-machine interaction may have a hierarchical relationship or may be mutually independent, which is not limited in the present application.
For example, as shown in fig. 6, in the first round of man-machine interaction the text instruction 1 input by the user is: "Please describe the content of this picture." The image translation device responds to text instruction 1 and outputs the translation text 1 of the first round: "In this picture, a woman holds a bundle of pink roses and stands in a green, shaded park." The user then performs a second round of man-machine interaction, and the text instruction 2 of the second round is: "What color is the woman's hair in the picture?" The image translation device responds to text instruction 2 and outputs the translation text 2 of the second round: "In the picture, the woman's hair is dark brown." After the user stops inputting text instructions, the image translation device stops outputting translation texts.
In one embodiment of the present application, after the translated text Ti corresponding to the ith round of translation is acquired, the translated text Ti may be displayed on a visual interface so that the user refers to the translated text Ti.
It can be seen that, in this embodiment of the present application, when the ith round of translation (i.e., any round of translation) is performed on the image to be translated, the text instruction Di corresponding to the ith round may first be input. The translation text Ti corresponding to the ith round is then determined based on the text instruction Di, the text instruction corresponding to each of the previous i-1 rounds of translation, the translation text (i.e., translation result) corresponding to each of those rounds, and the feature vector obtained by converting the image to be translated into the text space. Image content is thus successfully translated into text, and the content indicated by the translation text Ti is the content of the image to be translated in the translation direction indicated by the text instruction Di, which ensures that the translated text of each round matches the user's expectation and guarantees the accuracy of image translation. Moreover, each round of translation is performed in combination with the translation results and text instructions of the previous rounds, which further ensures the accuracy of this round of image translation.
Referring to fig. 7, fig. 7 is a schematic diagram of an image translating apparatus according to an embodiment of the present application. As shown in fig. 7, the image translating apparatus 700 includes an acquisition unit 701 and a processing unit 702;
an obtaining unit 701, configured to obtain a text instruction Di corresponding to an ith round of translation, where the text instruction Di is used to indicate a translation direction of the ith round of translation; acquiring a first feature vector of an image to be translated, wherein the first feature vector is obtained by converting image features of the image to be translated into a text space in a first round of translation;
and the processing unit 702 is configured to determine a translation text Ti corresponding to the ith round of translation according to the first feature vector, the text instruction Di, a text instruction corresponding to each round of translation in the previous i-1 round of translation, and a translation text corresponding to each round of translation, where the translation text Ti is used to represent contents of the image to be translated in the translation direction.
In one embodiment of the present application, in determining the translation text Ti corresponding to the ith round of translation according to the first feature vector, the text instruction Di, the text instruction corresponding to each of the previous i-1 rounds of translation, and the translation text corresponding to each round of translation, the processing unit 702 is specifically configured to:
Splicing the text instruction Di, the text instruction corresponding to each translation in the previous i-1 translation and the translation text corresponding to each translation to obtain a target text instruction corresponding to the ith translation;
mapping the target text instruction corresponding to the ith translation to obtain a second feature vector;
splicing the first feature vector and the second feature vector to obtain a target feature vector corresponding to the ith translation;
and decoding the target feature vector corresponding to the ith translation to obtain the translation text Ti.
In one embodiment of the present application, when i=1, in acquiring the first feature vector of the image to be translated, the obtaining unit 701 is specifically configured to:
acquiring the image to be translated;
partitioning the image to be translated to obtain a plurality of first image blocks;
extracting the characteristics of each first image block to obtain a characteristic diagram corresponding to each first image block;
and converting the feature map corresponding to each first image block into a text space to obtain the first feature vector.
In one embodiment of the present application, in converting the feature map corresponding to each first image block into a text space to obtain the first feature vector, the obtaining unit 701 is specifically configured to:
Tiling the feature map corresponding to each first image block to obtain a feature vector corresponding to each first image block;
performing attention processing on the feature vector corresponding to each first image block and the target reference vector to obtain a third feature vector;
and extracting the characteristics of the third characteristic vector to obtain the first characteristic vector.
In one embodiment of the present application, when i=1, in acquiring the first feature vector of the image to be translated, the acquiring unit 701 is specifically configured to:
acquiring the image to be translated;
extracting features of the image to be translated to obtain a feature map of the image to be translated;
and extracting the characteristics of the characteristic image of the image to be translated to obtain the first characteristic vector.
In one embodiment of the present application, the ith round of translation is implemented by an image translation model.
Referring to fig. 8, fig. 8 is a schematic diagram of an image translation model training device according to an embodiment of the present application. The image translation model training device 800 includes: an acquisition unit 801 and a processing unit 802;
an obtaining unit 801, configured to construct a first training data set, where the first training data set includes a plurality of first sample data, each of the first sample data includes a first sample image, and a first text description corresponding to the first sample image, where the first text description is used to describe content of the first sample image;
Constructing a second training data set, wherein the second training data set comprises a plurality of second sample data, each second sample data comprises a second sample image, a plurality of first text instructions and a plurality of standard translation texts corresponding to the second sample image, the plurality of first text instructions are in one-to-one correspondence with the plurality of standard translation texts, and one first text instruction of each second sample data is used for indicating a translation direction corresponding to the first text instruction;
and a processing unit 802, configured to perform model training according to the first training data set and the second training data set, so as to obtain the image translation model.
In one embodiment of the present application, the image translation model includes a first image-text semantic alignment module; in terms of performing model training according to the first training data set and the second training data set to obtain the image translation model, the processing unit 802 is specifically configured to:
training the first image-text semantic alignment module according to the plurality of first sample data to obtain a second image-text semantic alignment module;
and training the second image-text semantic alignment module according to the second training data set to obtain the image translation model.
In one embodiment of the present application, in training the first image-text semantic alignment module according to the plurality of first sample data to obtain the second image-text semantic alignment module, the processing unit 802 is specifically configured to:
acquiring a fourth feature vector corresponding to each first sample image, wherein the fourth feature vector corresponding to each first sample image is a feature vector of each first sample image in a text space;
mapping the first text description corresponding to each first sample image to obtain a fifth feature vector corresponding to the first text description of each first sample image;
obtaining a first loss according to the fourth feature vector corresponding to each first sample image and the fifth feature vector corresponding to the first text description of each first sample image;
predicting a second text description corresponding to each first sample image according to a fourth feature vector corresponding to each first sample image, wherein the second text description corresponding to each first sample image is a predicted text for describing the content of each first sample image;
obtaining a second loss according to the second text description and the first text description corresponding to each first sample image;
Training the first image-text semantic alignment module according to the first loss and the second loss to obtain a second image-text semantic alignment module.
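The following sketch illustrates one possible first-stage training step along the lines described above: a contrastive loss aligning image features (in the text space) with description features, plus a captioning-style loss on the predicted description. The component objects, the pooling, the temperature value, and the symmetric form of the contrastive loss are assumptions for illustration only.

```python
# Hedged sketch of one first-stage training step (first loss + second loss).
import torch
import torch.nn.functional as F

def first_stage_step(images, descriptions, caption_targets,
                     image_encoder, aligner, text_mapper, caption_head):
    # fourth feature vector: image features converted into the text space
    fourth = aligner(image_encoder(images))            # (B, M, D), assumed shape
    # fifth feature vector: mapped first text description
    fifth = text_mapper(descriptions)                  # (B, D), assumed shape
    img = F.normalize(fourth.mean(dim=1), dim=-1)      # pool patch tokens -> (B, D)
    txt = F.normalize(fifth, dim=-1)
    # first loss: matched image/description pairs lie on the diagonal
    sim = img @ txt.t() / 0.07                         # temperature is an assumption
    labels = torch.arange(sim.size(0), device=sim.device)
    first_loss = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
    # second loss: predict the description tokens from the image features
    logits = caption_head(fourth)                      # (B, L, vocab), assumed head
    second_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                  caption_targets.reshape(-1))   # targets: (B, L) long
    return first_loss + second_loss
```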
In one embodiment of the present application, the image translation model further comprises an image encoder, the image encoder being pre-trained; in acquiring the fourth feature vector corresponding to each first sample image, the processing unit 802 is specifically configured to:
partitioning each first sample image to obtain a plurality of second image blocks;
inputting each second image block into the image encoder to obtain a feature map of each second image block;
and converting the feature map of each second image block into a text space to obtain a fourth feature vector corresponding to each first sample image.
In one embodiment of the present application, in converting the feature map of each second image block into a text space to obtain a fourth feature vector corresponding to each first sample image, the processing unit 802 is specifically configured to:
tiling the feature map of each second image block to obtain a sixth feature vector corresponding to each second image block;
inputting a sixth feature vector corresponding to each second image block into the first image-text semantic alignment module, and performing cross attention processing on the sixth feature vector and the first reference vector to obtain a seventh feature vector corresponding to each first sample image;
And carrying out feature extraction on the seventh feature vector corresponding to each first sample image to obtain a fourth feature vector corresponding to each first sample image.
In one embodiment of the present application, in acquiring the fourth feature vector corresponding to each first sample image, the processing unit 802 is specifically configured to:
acquiring a feature map corresponding to each first sample image;
and inputting the feature images corresponding to each first sample image into the first image-text semantic alignment module for feature extraction to obtain a fourth feature vector corresponding to each first sample image.
In one embodiment of the present application, in obtaining the first loss according to the fourth feature vector corresponding to each first sample image and the fifth feature vector corresponding to the first text description of each first sample image, the processing unit 802 is specifically configured to:
respectively obtaining the similarity between a fourth feature vector of each first sample image and a plurality of fifth feature vectors corresponding to the plurality of first text descriptions, and obtaining a plurality of similarities corresponding to each first sample image;
obtaining the loss of each first sample image according to the multiple similarities corresponding to each first sample image and the labels corresponding to each first sample image;
The first loss is derived from the loss of each first sample image.
In one embodiment of the present application, the image translation model further includes an image encoder and a text translation module, wherein the image encoder and the text translation module are pre-trained; in training the second image-text semantic alignment module according to the second training data set to obtain the image translation model, the processing unit 802 is specifically configured to:
for any second sample image, acquiring a first text instruction Hj corresponding to a jth round of translation from a plurality of first text instructions corresponding to the second sample image;
obtaining an eighth feature vector corresponding to the second sample image, wherein the eighth feature vector is obtained by converting image features of the second sample image into a text space in the first translation;
inputting the eighth feature vector, the first text instruction Hj, the first text instruction corresponding to each translation in the previous j-1 translation round, the predicted translation text of each translation round, and the standard translation text corresponding to the first text instruction into the text translation module, and predicting the translation text Kj corresponding to the jth translation round;
Determining a loss corresponding to the jth round of translation according to the translation text Kj corresponding to the jth round of translation and the standard translation text corresponding to the first text instruction Hj;
obtaining a loss corresponding to the second sample image according to the loss corresponding to the j-th round of translation;
obtaining a third loss according to the loss corresponding to each second sample image;
and training the second image-text semantic alignment module according to the third loss to obtain the image translation model.
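The second-stage objective described above can be sketched as follows. The callables encode_image_to_text_space, predict_round, and round_loss are assumed placeholders; keeping previously predicted texts in the history follows the description above, and averaging the per-round and per-sample losses is an illustrative assumption (the application only states that the sample loss and the third loss are obtained from them).

```python
# Hedged sketch of the third-loss computation over the second training set.
def third_loss(second_samples, encode_image_to_text_space, predict_round,
               round_loss):
    sample_losses = []
    for image, first_text_instructions, standard_texts in second_samples:
        eighth = encode_image_to_text_space(image)   # computed once per sample
        history = []                                  # previous (Hj, predicted Kj)
        per_round = []
        for h_j, standard_j in zip(first_text_instructions, standard_texts):
            k_j = predict_round(eighth, h_j, history, standard_j)  # predicted Kj
            per_round.append(round_loss(k_j, standard_j))          # loss of round j
            history.append((h_j, k_j))
        sample_losses.append(sum(per_round) / len(per_round))
    return sum(sample_losses) / len(sample_losses)
```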
In one embodiment of the present application, the processing unit 802 is specifically configured to, in terms of inputting the eighth feature vector, the first text instruction Hj, the first text instruction corresponding to each translation of the previous j-1 translations, and the predicted translation text of each translation, and the standard translation text corresponding to the first text instruction, to the text translation module, and predicting the translation text Kj corresponding to the jth translation:
splicing the first text instruction Hj, the first text instruction corresponding to each translation in the previous j-1 translation, the predicted translation text of each translation and the standard translation text corresponding to the first text instruction to obtain a target text instruction corresponding to the jth translation;
Mapping the target text instruction corresponding to the jth round of translation through the text translation module to obtain a ninth feature vector corresponding to the jth round of translation;
splicing the ninth feature vector corresponding to the jth translation with the eighth feature vector to obtain a target feature vector corresponding to the jth translation;
and decoding the target feature vector corresponding to the jth round of translation through the text translation module, and predicting a translation text Kj corresponding to the jth round of translation.
In one embodiment of the present application, the mapping relationship for mapping any one of the first text instructions is the same as the mapping relationship for mapping the target text instruction corresponding to the jth round of translation.
Referring to fig. 9, fig. 9 is a schematic diagram of an image detection device according to an embodiment of the present application. As shown in fig. 9, the image detection apparatus 900 includes an acquisition unit 901 and a processing unit 902;
an acquisition unit 901 for acquiring an image to be detected;
the processing unit 902 is configured to perform one or more rounds of translation on the image to be detected to obtain one or more translation texts corresponding to the image to be detected, where each round of translation on the image to be detected is implemented by the image translation model;
And detecting whether the image to be detected contains preset content or not according to the one or more translation texts.
If yes, determining that the image to be detected is at risk.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device 1000 includes a transceiver 1001, a processor 1002, and a memory 1003, which are connected to one another via a bus 1004. The memory 1003 is used to store computer programs and data, and the data stored in the memory 1003 can be transferred to the processor 1002. The electronic device 1000 may be the image translation apparatus 700, the image translation model training apparatus 800, or the image detection apparatus 900.
When the electronic device 1000 is the image translation apparatus 700, the processor 1002 is configured to read the computer program in the memory 1003 to perform the following operations:
controlling the transceiver 1001 to acquire a text instruction Di corresponding to the ith round of translation, wherein the text instruction Di is used for indicating the translation direction of the ith round of translation;
acquiring a first feature vector of an image to be translated, wherein the first feature vector is obtained by converting image features of the image to be translated into a text space in a first round of translation;
And determining a translation text Ti corresponding to the ith round of translation according to the first feature vector, the text instruction Di, the text instruction corresponding to each round of translation in the previous i-1 round of translation and the translation text corresponding to each round of translation, wherein the translation text Ti is used for representing the content of the image to be translated in the translation direction.
In one embodiment of the present application, the processor 1002 is specifically configured to sequentially perform the following operations in determining the translated text Ti corresponding to the ith round of translation according to the first feature vector, the text instruction Di, the text instruction corresponding to each of the previous i-1 rounds of translation, and the translated text corresponding to each round of translation:
splicing the text instruction Di, the text instruction corresponding to each translation in the previous i-1 translation and the translation text corresponding to each translation to obtain a target text instruction corresponding to the ith translation;
mapping the target text instruction corresponding to the ith translation to obtain a second feature vector;
splicing the first feature vector and the second feature vector to obtain a target feature vector corresponding to the ith translation;
and decoding the target feature vector corresponding to the ith translation to obtain the translation text Ti.
In one embodiment of the present application, when i=1, in acquiring the first feature vector of the image to be translated, the processor 1002 is specifically configured to sequentially perform the following operations:
acquiring the image to be translated;
partitioning the image to be translated to obtain a plurality of first image blocks;
extracting the characteristics of each first image block to obtain a characteristic diagram corresponding to each first image block;
and converting the feature map corresponding to each first image block into a text space to obtain the first feature vector.
In one embodiment of the present application, in converting the feature map corresponding to each first image block into a text space to obtain the first feature vector, the processor 1002 is specifically configured to sequentially perform the following operations:
tiling the feature map corresponding to each first image block to obtain a feature vector corresponding to each first image block;
performing attention processing on the feature vector corresponding to each first image block and the target reference vector to obtain a third feature vector;
and extracting the characteristics of the third characteristic vector to obtain the first characteristic vector.
In one embodiment of the present application, when i=1, the processor 1002 is specifically configured to sequentially perform the following operations in acquiring the first feature vector of the image to be translated:
Acquiring the image to be translated;
extracting features of the image to be translated to obtain a feature map of the image to be translated;
and extracting the characteristics of the characteristic image of the image to be translated to obtain the first characteristic vector.
In one embodiment of the present application, the ith round of translation is implemented by an image translation model.
Specifically, the transceiver 1001 may be the acquiring unit 701 of the image translating apparatus 700 of the embodiment shown in fig. 7, and the processor 1002 may be the processing unit 702 of the image translating apparatus 700 of the embodiment shown in fig. 7.
When the electronic device 1000 is the image translation model training apparatus 800, the processor 1002 is configured to read the computer program in the memory 1003 to perform the following operations:
controlling the transceiver 1001 to construct a first training data set, wherein the first training data set includes a plurality of first sample data, each first sample data including a first sample image and a first text description corresponding to the first sample image, the first text description being used to describe the content of the first sample image;
constructing a second training data set, wherein the second training data set comprises a plurality of second sample data, each second sample data comprises a second sample image, a plurality of first text instructions and a plurality of standard translation texts corresponding to the second sample image, the plurality of first text instructions are in one-to-one correspondence with the plurality of standard translation texts, and one first text instruction of each second sample data is used for indicating a translation direction corresponding to the first text instruction;
And performing model training according to the first training data set and the second training data set to obtain the image translation model.
In one embodiment of the present application, the image translation model includes a first image-text semantic alignment module; in terms of model training based on the first training data set and the second training data set to obtain the image translation model, the processor 1002 is specifically configured to sequentially perform the following operations:
training the first image-text semantic alignment module according to the plurality of first sample data to obtain a second image-text semantic alignment module;
and training the second image-text semantic alignment module according to the second training data set to obtain the image translation model.
In one embodiment of the present application, in training the first image-text semantic alignment module according to the plurality of first sample data to obtain the second image-text semantic alignment module, the processor 1002 is specifically configured to sequentially perform the following operations:
acquiring a fourth feature vector corresponding to each first sample image, wherein the fourth feature vector corresponding to each first sample image is a feature vector of each first sample image in a text space;
Mapping the first text description corresponding to each first sample image to obtain a fifth feature vector corresponding to the first text description of each first sample image;
obtaining a first loss according to the fourth feature vector corresponding to each first sample image and the fifth feature vector corresponding to the first text description of each first sample image;
predicting a second text description corresponding to each first sample image according to a fourth feature vector corresponding to each first sample image, wherein the second text description corresponding to each first sample image is a predicted text for describing the content of each first sample image;
obtaining a second loss according to the second text description and the first text description corresponding to each first sample image;
training the first image-text semantic alignment module according to the first loss and the second loss to obtain a second image-text semantic alignment module.
In one embodiment of the present application, the image translation model further comprises an image encoder, the image encoder being pre-trained; in acquiring a fourth feature vector corresponding to each first sample image, the processor 1002 is specifically configured to sequentially perform the following operations:
Partitioning each first sample image to obtain a plurality of second image blocks;
inputting each second image block into the image encoder to obtain a feature map of each second image block;
and converting the feature map of each second image block into a text space to obtain a fourth feature vector corresponding to each first sample image.
In one embodiment of the present application, the processor 1002 is specifically configured to, in converting the feature map of each second image block into a text space, obtain a fourth feature vector corresponding to each first sample image, perform the following operations:
tiling the feature map of each second image block to obtain a sixth feature vector corresponding to each second image block;
inputting a sixth feature vector corresponding to each second image block into the first image-text semantic alignment module, and performing cross attention processing on the sixth feature vector and the first reference vector to obtain a seventh feature vector corresponding to each first sample image;
and carrying out feature extraction on the seventh feature vector corresponding to each first sample image to obtain a fourth feature vector corresponding to each first sample image.
In one embodiment of the present application, in acquiring the fourth feature vector corresponding to each of the first sample images, the processor 1002 is specifically configured to sequentially perform the following operations:
Acquiring a feature map corresponding to each first sample image;
and inputting the feature images corresponding to each first sample image into the first image-text semantic alignment module for feature extraction to obtain a fourth feature vector corresponding to each first sample image.
In one embodiment of the present application, in obtaining the first loss from the fourth feature vector corresponding to each first sample image and the fifth feature vector corresponding to the first text description of each first sample image, the processor 1002 is specifically configured to sequentially perform the following operations:
respectively obtaining the similarity between a fourth feature vector of each first sample image and a plurality of fifth feature vectors corresponding to the plurality of first text descriptions, and obtaining a plurality of similarities corresponding to each first sample image;
obtaining the loss of each first sample image according to the multiple similarities corresponding to each first sample image and the labels corresponding to each first sample image;
the first loss is derived from the loss of each first sample image.
In one embodiment of the present application, the image translation model further includes an image encoder and a text translation module, wherein the image encoder and the text translation module are pre-trained; in training the second image-text semantic alignment module according to the second training data set to obtain the image translation model, the processor 1002 is specifically configured to sequentially perform the following operations:
For any second sample image, acquiring a first text instruction Hj corresponding to a jth round of translation from a plurality of first text instructions corresponding to the second sample image;
obtaining an eighth feature vector corresponding to the second sample image, wherein the eighth feature vector is obtained by converting image features of the second sample image into a text space in the first translation;
inputting the eighth feature vector, the first text instruction Hj, the first text instruction corresponding to each translation in the previous j-1 translation round, the predicted translation text of each translation round, and the standard translation text corresponding to the first text instruction into the text translation module, and predicting the translation text Kj corresponding to the jth translation round;
determining a loss corresponding to the jth round of translation according to the translation text Kj corresponding to the jth round of translation and the standard translation text corresponding to the first text instruction Hj;
obtaining a loss corresponding to the second sample image according to the loss corresponding to the j-th round of translation;
obtaining a third loss according to the loss corresponding to each second sample image;
and training the second image-text semantic alignment module according to the third loss to obtain the image translation model.
In one embodiment of the present application, in inputting the eighth feature vector, the first text instruction Hj, the first text instruction corresponding to each translation of the previous j-1 translations, and the predicted translation text of each translation, and the standard translation text corresponding to the first text instruction, to the text translation module, predicting the translation text Kj corresponding to the jth translation, the processor 1002 is specifically configured to:
splicing the first text instruction Hj, the first text instruction corresponding to each translation in the previous j-1 translation, the predicted translation text of each translation and the standard translation text corresponding to the first text instruction to obtain a target text instruction corresponding to the jth translation;
mapping the target text instruction corresponding to the jth round of translation through the text translation module to obtain a ninth feature vector corresponding to the jth round of translation;
splicing the ninth feature vector corresponding to the jth translation with the eighth feature vector to obtain a target feature vector corresponding to the jth translation;
and decoding the target feature vector corresponding to the jth round of translation through the text translation module, and predicting a translation text Kj corresponding to the jth round of translation.
In one embodiment of the present application, the mapping relationship for mapping any one of the first text descriptions is the same as the mapping relationship for mapping the target text instruction corresponding to the jth round of translation.
Specifically, the transceiver 1001 may be the acquiring unit 801 of the image translation model training device 800 in the embodiment illustrated in fig. 8, and the processor 1002 may be the processing unit 802 of the image translation model training device 800 in the embodiment illustrated in fig. 8.
When the electronic device 1000 is the image detection apparatus 900, the processor 1002 is configured to read the computer program in the memory 1003 to perform the following operations:
controlling the transceiver 1001 to acquire an image to be detected;
performing one or more rounds of translation on the image to be detected to obtain one or more translation texts corresponding to the image to be detected, wherein each round of translation on the image to be detected is realized through the image translation model;
and detecting whether the image to be detected contains preset content or not according to the one or more translation texts.
If yes, determining that the image to be detected has a safety risk.
Specifically, the transceiver 1001 may be the acquiring unit 901 of the image detecting apparatus 900 of the embodiment shown in fig. 9, and the processor 1002 may be the processing unit 902 of the image detecting apparatus 900 of the embodiment shown in fig. 9.
It should be understood that the electronic device in the present application may include a smart phone (such as an Android phone, an iOS phone, or a Windows Phone), a tablet computer, a palmtop computer, a notebook computer, a mobile internet device (MID), or a wearable device. The electronic devices described above are merely examples, not an exhaustive list, and the present application includes but is not limited to them. In practical applications, the electronic device may further include an intelligent vehicle-mounted terminal, a computer device, and the like.
The present application also provides a computer-readable storage medium storing a computer program that is executed by a processor to implement some or all of the steps of any one of the methods as described in the method embodiments above.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the methods described in the method embodiments above.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a division by logical function, and in actual implementation there may be other manners of division, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units described above may be implemented either in hardware or in software program modules.
The integrated units, if implemented in the form of software program modules and sold or used as independent products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer readable memory, which may include: flash disk, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.
The embodiments of the present application have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present application; the above description of the embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, a person skilled in the art may make changes to the specific implementations and the application scope according to the idea of the present application. In view of the above, the contents of this specification should not be construed as limiting the present application.

Claims (12)

1. An image translation model training method, comprising the steps of:
constructing a first training data set, wherein the first training data set comprises a plurality of first sample data, each first sample data comprises a first sample image and a first text description corresponding to the first sample image, and the first text description is used for describing the content of the first sample image;
Constructing a second training data set, wherein the second training data set comprises a plurality of second sample data, each second sample data comprises a second sample image, a plurality of first text instructions and a plurality of standard translation texts corresponding to the second sample image, the plurality of first text instructions are in one-to-one correspondence with the plurality of standard translation texts, and one first text instruction of each second sample data is used for indicating a translation direction corresponding to the first text instruction;
model training is carried out according to the first training data set and the second training data set to obtain the image translation model, and the image translation model comprises a first image-text semantic alignment module; the training of the model according to the first training data set and the second training data set to obtain the image translation model comprises the following steps:
training the first image-text semantic alignment module according to the plurality of first sample data to obtain a second image-text semantic alignment module; and training the second image-text semantic alignment module according to the second training data set to obtain the image translation model.
2. The method according to claim 1, wherein training the first image-text semantic alignment module according to the plurality of first sample data to obtain a second image-text semantic alignment module includes:
Acquiring a fourth feature vector corresponding to each first sample image, wherein the fourth feature vector corresponding to each first sample image is a feature vector of each first sample image in a text space;
mapping the first text description corresponding to each first sample image to obtain a fifth feature vector corresponding to the first text description of each first sample image;
obtaining a first loss according to the fourth feature vector corresponding to each first sample image and the fifth feature vector corresponding to the first text description of each first sample image;
predicting a second text description corresponding to each first sample image according to a fourth feature vector corresponding to each first sample image, wherein the second text description corresponding to each first sample image is a predicted text for describing the content of each first sample image;
obtaining a second loss according to the second text description and the first text description corresponding to each first sample image;
training the first image-text semantic alignment module according to the first loss and the second loss to obtain a second image-text semantic alignment module.
3. The method according to claim 2, wherein
The image translation model further includes an image encoder, the image encoder being pre-trained; the obtaining the fourth feature vector corresponding to each first sample image includes:
partitioning each first sample image to obtain a plurality of second image blocks;
inputting each second image block into the image encoder to obtain a feature map of each second image block;
and converting the feature map of each second image block into a text space to obtain a fourth feature vector corresponding to each first sample image.
4. The method according to claim 3, wherein said converting the feature map of each second image block into text space to obtain a fourth feature vector corresponding to each first sample image comprises:
tiling the feature map of each second image block to obtain a sixth feature vector corresponding to each second image block;
inputting a sixth feature vector corresponding to each second image block into the first image-text semantic alignment module, and performing cross attention processing on the sixth feature vector and the first reference vector to obtain a seventh feature vector corresponding to each first sample image;
and carrying out feature extraction on the seventh feature vector corresponding to each first sample image to obtain a fourth feature vector corresponding to each first sample image.
5. The method according to claim 2, wherein the obtaining a fourth feature vector corresponding to each first sample image comprises:
acquiring a feature map corresponding to each first sample image;
and inputting the feature images corresponding to each first sample image into the first image-text semantic alignment module for feature extraction to obtain a fourth feature vector corresponding to each first sample image.
6. The method according to any one of claims 1-5, wherein the deriving the first loss from the fourth feature vector corresponding to each first sample image and the fifth feature vector corresponding to the first text description of each first sample image comprises:
respectively obtaining the similarity between a fourth feature vector of each first sample image and a plurality of fifth feature vectors corresponding to the plurality of first text descriptions, and obtaining a plurality of similarities corresponding to each first sample image;
obtaining the loss of each first sample image according to the multiple similarities corresponding to each first sample image and the labels corresponding to each first sample image;
the first loss is derived from the loss of each first sample image.
7. The method according to any one of claims 1 to 5, wherein,
The image translation model further comprises an image encoder and a text translation module, wherein the image encoder and the text translation module are trained in advance; training the second image-text semantic alignment module according to the second training data set to obtain the image translation model, wherein the training comprises the following steps:
for any second sample image, acquiring a first text instruction Hj corresponding to a jth round of translation from a plurality of first text instructions corresponding to the second sample image;
obtaining an eighth feature vector corresponding to the second sample image, wherein the eighth feature vector is obtained by converting image features of the second sample image into a text space in the first translation;
inputting the eighth feature vector, the first text instruction Hj, the first text instruction corresponding to each translation in the previous j-1 translation round, the predicted translation text of each translation round, and the standard translation text corresponding to the first text instruction into the text translation module, and predicting the translation text Kj corresponding to the jth translation round;
determining a loss corresponding to the jth round of translation according to the translation text Kj corresponding to the jth round of translation and the standard translation text corresponding to the first text instruction Hj;
Obtaining a loss corresponding to the second sample image according to the loss corresponding to the j-th round of translation;
obtaining a third loss according to the loss corresponding to each second sample image;
and training the second image-text semantic alignment module according to the third loss to obtain the image translation model.
8. The method of claim 7, wherein inputting the eighth feature vector, the first text instruction Hj, the first text instruction corresponding to each of the previous j-1 translations, the predicted translated text for each translation, and the standard translated text corresponding to the first text instruction into the text translation module, and predicting the translated text Kj corresponding to the jth translation comprises:
splicing the first text instruction Hj, the first text instruction corresponding to each translation in the previous j-1 translation, the predicted translation text of each translation and the standard translation text corresponding to the first text instruction to obtain a target text instruction corresponding to the jth translation;
mapping the target text instruction corresponding to the jth round of translation through the text translation module to obtain a ninth feature vector corresponding to the jth round of translation;
splicing the ninth feature vector corresponding to the jth round of translation with the eighth feature vector to obtain a target feature vector corresponding to the jth round of translation;
and decoding the target feature vector corresponding to the jth round of translation through the text translation module to predict the translation text Kj corresponding to the jth round of translation.
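Claim 8 spells out one forward pass of the text translation module. The sketch below mirrors its four steps (splice the text context, map it to the ninth feature vector, splice with the eighth feature vector, decode); the `embed_text` and `decoder` callables, the separator, and the tensor shapes are assumptions rather than details from the patent.

```python
import torch

def predict_round_j(eighth_vec, h_j, history, embed_text, decoder, sep=" "):
    """Sketch of claim 8 for round j. `history` holds (instruction, predicted, standard)
    triples from the previous j-1 rounds; `embed_text` and `decoder` stand in for the
    mapping and decoding parts of the text translation module.
    Assumed shapes: eighth_vec (1, D); embed_text(...) returns (L, D)."""
    # 1) Splice Hj with the previous rounds' instructions, predictions and standard texts
    #    into the target text instruction for round j.
    context = [text for triple in history for text in triple]
    target_instruction = sep.join(context + [h_j])

    # 2) Map the target text instruction to the ninth feature vector.
    ninth_vec = embed_text(target_instruction)               # (L, D)

    # 3) Splice the ninth feature vector with the eighth (image-in-text-space) vector.
    target_vec = torch.cat([eighth_vec, ninth_vec], dim=-2)  # (L + 1, D)

    # 4) Decode the target feature vector into the round-j translation text Kj.
    return decoder(target_vec)
```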
9. The method of claim 8, wherein
the mapping relation for mapping any one of the first text descriptions is the same as the mapping relation for mapping the target text instruction corresponding to the jth round of translation.
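Claim 9 only requires that the mapping applied to the first text descriptions in the first stage and the mapping applied to the target text instructions in the second stage be one and the same. In code this can be expressed by reusing a single text-mapping module in both stages; the embedding-layer choice, vocabulary size, and placeholder token ids below are purely illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical shared mapping: one module, hence one mapping relation, used for both
# the first text descriptions (stage 1) and the target text instructions (stage 2).
shared_text_mapper = nn.Embedding(num_embeddings=32000, embedding_dim=512)

description_tokens = torch.randint(0, 32000, (1, 16))   # placeholder token ids
instruction_tokens = torch.randint(0, 32000, (1, 24))

fifth_like_vecs = shared_text_mapper(description_tokens)  # stage-1 mapping
ninth_like_vecs = shared_text_mapper(instruction_tokens)  # stage-2 mapping (same weights)
```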
10. An image translation model training device, characterized by comprising:
an acquisition unit configured to construct a first training data set, wherein the first training data set includes a plurality of first sample data, each of the first sample data including a first sample image and a first text description corresponding to the first sample image, the first text description being used to describe the content of the first sample image;
and to construct a second training data set, wherein the second training data set comprises a plurality of second sample data, each second sample data comprises a second sample image, a plurality of first text instructions and a plurality of standard translation texts corresponding to the second sample image, the plurality of first text instructions are in one-to-one correspondence with the plurality of standard translation texts, and one first text instruction of each second sample data is used for indicating the translation direction corresponding to that first text instruction; and
a processing unit configured to perform model training according to the first training data set and the second training data set to obtain the image translation model, wherein the image translation model comprises a first image-text semantic alignment module; and the performing model training according to the first training data set and the second training data set to obtain the image translation model comprises:
training the first image-text semantic alignment module according to the plurality of first sample data to obtain a second image-text semantic alignment module; and training the second image-text semantic alignment module according to the second training data set to obtain the image translation model.
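Claim 10 recasts the method as a device split into an acquisition unit and a processing unit. A minimal object-oriented sketch of that split might look as follows; the unit interfaces and method names are assumptions introduced only for illustration.

```python
class ImageTranslationModelTrainingDevice:
    """Sketch of claim 10: an acquisition unit builds the two training data sets,
    and a processing unit runs the two-stage training to obtain the image translation model."""

    def __init__(self, acquisition_unit, processing_unit):
        self.acquisition_unit = acquisition_unit
        self.processing_unit = processing_unit

    def run(self):
        first_set = self.acquisition_unit.build_first_training_set()    # images + text descriptions
        second_set = self.acquisition_unit.build_second_training_set()  # images + instructions + standard texts
        # Stage 1: train the first image-text semantic alignment module -> second alignment module.
        # Stage 2: train the second alignment module on the instruction data -> image translation model.
        return self.processing_unit.train(first_set, second_set)
```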
11. An electronic device, comprising: a processor and a memory, the processor being connected to the memory, the memory being for storing a computer program, the processor being for executing the computer program stored in the memory to cause the electronic device to perform the method of any one of claims 1-9.
12. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which is executed by a processor to implement the method of any one of claims 1-9.
CN202310485338.XA 2023-04-28 2023-04-28 Training method of image translation model and related products Active CN116486421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310485338.XA CN116486421B (en) 2023-04-28 2023-04-28 Training method of image translation model and related products

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202410344421.XA Division CN118262372A (en) 2023-04-28 Image translation and detection method and related products

Publications (2)

Publication Number Publication Date
CN116486421A (en) 2023-07-25
CN116486421B (en) 2024-03-22

Family

ID=87211703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310485338.XA Active CN116486421B (en) 2023-04-28 2023-04-28 Training method of image translation model and related products

Country Status (1)

Country Link
CN (1) CN116486421B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967272A (en) * 2020-06-23 2020-11-20 合肥工业大学 Visual dialog generation system based on semantic alignment
CN113011202A (en) * 2021-03-23 2021-06-22 中国科学院自动化研究所 End-to-end image text translation method, system and device based on multi-task training
CN113159095A (en) * 2021-01-30 2021-07-23 华为技术有限公司 Model training method, image retrieval method and device
WO2022095345A1 (en) * 2020-11-05 2022-05-12 苏州浪潮智能科技有限公司 Multi-modal model training method, apparatus, device, and storage medium
CN114595702A (en) * 2022-03-01 2022-06-07 深圳Tcl新技术有限公司 Text translation model training method, text translation method and related device
CN114943877A (en) * 2022-06-20 2022-08-26 Oppo广东移动通信有限公司 Model training method and device, electronic equipment and storage medium
CN115221897A (en) * 2022-07-26 2022-10-21 腾讯科技(深圳)有限公司 Translation model training method, information translation method and related equipment
CN115331228A (en) * 2022-08-24 2022-11-11 抖音视界有限公司 Image text processing method and device, readable medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CUDA-based Parallel Implementation of IBM Word Alignment Algorithm for Statistical Machine Translation; Si-Yuan Jing et al.; 2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies; full text *
Research on Improved Methods for Intersection-over-Union Prediction in Object Detection (目标检测中交并比预测的改进方法研究); Wang Qimeng; China Excellent Master's Theses Electronic Journals Network; full text *
A Multimodal Machine Translation Model Incorporating Image Attention (融合图像注意力的多模态机器翻译模型); Li Xia; Ma Junteng; Qin Shihao; Journal of Chinese Information Processing, No. 7; full text *

Also Published As

Publication number Publication date
CN116486421A (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN109344413B (en) Translation processing method, translation processing device, computer equipment and computer readable storage medium
CN111950303B (en) Medical text translation method, device and storage medium
CN111079418B (en) Named entity recognition method, device, electronic equipment and storage medium
CN111859940B (en) Keyword extraction method and device, electronic equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN110968725A (en) Image content description information generation method, electronic device, and storage medium
CN114996511A (en) Training method and device for cross-modal video retrieval model
CN115587583A (en) Noise detection method and device and electronic equipment
CN112182167A (en) Text matching method and device, terminal equipment and storage medium
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN113449081A (en) Text feature extraction method and device, computer equipment and storage medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN117093864A (en) Text generation model training method and device
CN116486421B (en) Training method of image translation model and related products
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN116956953A (en) Translation model training method, device, equipment, medium and program product
CN115132182B (en) Data identification method, device, equipment and readable storage medium
CN113051935A (en) Intelligent translation method and device, terminal equipment and computer readable storage medium
CN116069916A (en) Tourist attraction question-answering system
CN113591493B (en) Translation model training method and translation model device
CN118262372A (en) Image translation and detection method and related products
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
Islam et al. Bengali caption generation for images using deep learning
CN112287159A (en) Retrieval method, electronic device and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant