CN114564606A - Data processing method and device, electronic equipment and storage medium

Publication number: CN114564606A
Application number: CN202210173457.7A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: model, image, trained, training, processed
Inventors: 王子豪, 易子立, 刘玮, 何茜, 吴兴龙
Assignee (current and original): Beijing Zitiao Network Technology Co Ltd
Application filed by Beijing Zitiao Network Technology Co Ltd; priority to CN202210173457.7A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval of still image data
    • G06F 16/53: Querying
    • G06F 16/532: Query formulation, e.g. graphical querying
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval characterised by using metadata automatically derived from the content
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

Embodiments of the present disclosure provide a data processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring data to be processed and determining a feature vector to be spliced corresponding to the data to be processed, where the data to be processed includes text to be processed and/or an image to be processed; inputting the feature vector to be spliced into a pre-trained autoregressive sequence generation model to obtain a target coding sequence corresponding to the data to be processed; and reconstructing the target coding sequence to obtain a target image matching the semantic content of the data to be processed.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a data processing method and apparatus, an electronic device, and a storage medium.
Background
At present, image-text generation is usually implemented with a solution-optimization method. This method uses the image-text correlation provided by a large-scale pre-trained image-text model as an optimization index: during image generation, an image search is performed over a generation space based on the input text and the provided image-text correlation, and the candidate image with the highest matching degree to the input text is determined and output, completing image generation for the input text.
However, in the prior art the image search process is time-consuming, so image generation is slow. Moreover, the accuracy of the generated image depends on the images pre-stored in the generation space, and the accuracy is therefore not controllable.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, electronic device, and storage medium to achieve convenient data processing.
In a first aspect, an embodiment of the present disclosure provides a data processing method, where the method includes:
acquiring data to be processed and determining a feature vector to be spliced corresponding to the data to be processed, where the data to be processed includes text to be processed and/or an image to be processed;
inputting the feature vector to be spliced into a pre-trained autoregressive sequence generation model to obtain a target coding sequence corresponding to the data to be processed; and
reconstructing the target coding sequence to obtain a target image matching the semantic content of the data to be processed.
In a second aspect, an embodiment of the present disclosure further provides a data processing apparatus, where the apparatus includes:
a to-be-processed data acquisition module, configured to acquire data to be processed and determine a feature vector to be spliced corresponding to the data to be processed, where the data to be processed includes text to be processed and/or an image to be processed;
a to-be-spliced feature vector input module, configured to input the feature vector to be spliced into a pre-trained autoregressive sequence generation model to obtain a target coding sequence corresponding to the data to be processed; and
a target coding sequence reconstruction module, configured to reconstruct the target coding sequence to obtain a target image matching the semantic content of the data to be processed.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data processing method according to any embodiment of the present disclosure.
In a fourth aspect, embodiments of the present disclosure further provide a storage medium containing computer-executable instructions which, when executed by a computer processor, execute the data processing method according to any embodiment of the present disclosure.
According to the technical solution of the embodiments of the present disclosure, data to be processed is acquired and a corresponding feature vector to be spliced is determined; this feature vector carries the semantic content features of the data to be processed. The feature vector is then input into a pre-trained autoregressive sequence generation model, which outputs a target coding sequence reflecting those semantic content features, and reconstruction of the target coding sequence yields a target image that displays the semantic content of the data to be processed. Because a model directly generates the target image corresponding to the semantic content of the data to be processed, the time wasted in the prior art on searching a large number of images in a space is avoided, the efficiency of generating a target image from the data to be processed is improved, and the semantic content contained in the data to be processed is embodied in the target image.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flowchart of a data processing method according to Embodiment 1 of the present disclosure;
Fig. 2 is a schematic diagram of a cross-modal image-text model according to Embodiment 1 of the present disclosure;
Fig. 3 is a schematic diagram of an autoregressive sequence generation model according to Embodiment 1 of the present disclosure;
Fig. 4 is a schematic flowchart of a data processing method according to Embodiment 2 of the present disclosure;
Fig. 5 is a schematic diagram of an encoding/decoding model training process according to Embodiment 2 of the present disclosure;
Fig. 6 is a block diagram of a data processing apparatus according to Embodiment 3 of the present disclosure;
Fig. 7 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are used only to distinguish different devices, modules, or units, and do not limit the order of, or interdependence between, the functions they perform. References to "a", "an", and "the" in this disclosure are illustrative rather than limiting and should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Embodiment 1
Fig. 1 is a flowchart of a data processing method according to Embodiment 1 of the present disclosure. This embodiment is applicable to situations in which feature information is obtained from data to be processed and a target image representing the semantic content of that data is generated based on the feature information. The method may be executed by a data processing apparatus, which may be implemented in software and/or hardware and, optionally, by an electronic device such as a mobile terminal, a PC, or a server.
Before the technical solution is introduced, example application scenarios may be described. The solution applies to any scenario in which the semantic content of data needs to be displayed as an image. For example, when text is to be presented alongside a matching illustration, the scheme of this embodiment generates, for the existing text, an image whose semantic content is consistent with that expressed by the text, so that the text is displayed through the image. As another example, for materials containing both pictures and text whose content needs to be presented in a unified form, the scheme of this embodiment generates an image presenting the semantics of each picture and an image presenting the semantics of each piece of text, so that the semantic content of the materials is presented uniformly through images.
As shown in fig. 1, the method includes:
and S110, acquiring data to be processed, and determining a feature vector to be spliced corresponding to the data to be processed.
The data to be processed comprises texts to be processed and/or images to be processed. The text to be processed is composed of one or more individual words, and may be a consecutive phrase or a segment of speech. Different language types and different identification symbols can be adopted in the text to be processed for displaying, and the text to be processed is used for expressing information such as scenes, events, characters and the like, for example, the semantics of the text to be processed "a gray bird" is the same as that of the text to be processed "a gray bird"; the semantic contents of the text to be processed 'one' and the text to be processed 'r' are both numbers 1. Similarly, information such as scenes, events, people and the like can be expressed in the image to be processed through the displayed contents such as lines, graphs, sizes, colors and the like.
The feature vector to be spliced is a feature vector used for reflecting semantic content of data to be processed. In this embodiment, when the data to be processed is the text to be processed, the text features in the text to be processed may be extracted, a feature vector for expressing the text semantics of the text to be processed is determined based on the text features, and the feature vector is used as the feature vector to be spliced. When the data to be processed is the image to be processed, the image features in the image to be processed can be extracted, the feature vector for expressing the image expression meaning of the image to be processed is determined based on the image features, and the feature vector is used as the feature vector to be spliced.
When the data to be processed simultaneously comprises the text to be processed and the image to be processed, namely the text to be processed and the image to be processed are input simultaneously, a first feature vector used for representing feature information of the text to be processed can be determined, a second feature vector used for representing feature information of the image to be processed is determined, and a feature vector to be spliced is generated based on the first feature vector and the second feature vector.
In this embodiment, there are two ways to generate the feature vector to be spliced from the first feature vector and the second feature vector. The first is to take the mean vector of the first feature vector and the second feature vector and use it as the feature vector to be spliced; the calculation formula may be:
feature vector to be spliced = 0.5 * (first feature vector + second feature vector)
The second is a weighted sum: a first weight is determined for the first feature vector and a second weight for the second feature vector, the scalar product of each vector with its weight is computed, and the sum of the two scaled vectors is taken as the feature vector to be spliced. The first weight and the second weight may be set according to the relative importance of the semantics expressed by the text to be processed and by the image to be processed.
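The two fusion strategies can be illustrated with a short sketch. The snippet below is a minimal illustration, not code from the patent; the vector names, the 512-dimensional embedding size, and the example weights are assumptions.

```python
import numpy as np

def fuse_mean(first_vec: np.ndarray, second_vec: np.ndarray) -> np.ndarray:
    """First strategy: the mean vector of the text and image feature vectors."""
    return 0.5 * (first_vec + second_vec)

def fuse_weighted(first_vec: np.ndarray, second_vec: np.ndarray,
                  first_weight: float, second_weight: float) -> np.ndarray:
    """Second strategy: a weighted sum, with the weights chosen according to
    the relative importance of the text semantics and the image semantics."""
    return first_weight * first_vec + second_weight * second_vec

# Illustrative usage with 512-dimensional feature vectors (dimension assumed).
text_vec = np.random.randn(512)    # first feature vector (text to be processed)
image_vec = np.random.randn(512)   # second feature vector (image to be processed)
spliced = fuse_weighted(text_vec, image_vec, first_weight=0.7, second_weight=0.3)
```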
In this embodiment, the feature vector to be spliced corresponding to the data to be processed may be determined by inputting the data to be processed into a pre-trained cross-modal image-text model to obtain the feature vector to be spliced.
The cross-modal image-text model processes the data to be processed into a corresponding feature vector. It performs cross-modal feature extraction on any input data to be processed and outputs a feature vector consistent with the semantic content of that data; the feature vector to be spliced is thus a feature vector reflecting the same semantic content as the data to be processed.
In this embodiment, the cross-modal image-text model maps the extracted cross-modal features into a common space, so that the feature information expressed by the data to be processed can be determined accurately. The principle of mapping into the same space is to place features with high semantic correlation close together and features with low semantic correlation far apart, so that the feature vector to be spliced reflecting the semantic content of the data to be processed can be obtained from these distance relations.
For clarity, the determination of the feature vector to be spliced can be described with reference to fig. 2. As shown in fig. 2, the data to be processed includes text to be processed and an image to be processed, and either may be input into the pre-trained cross-modal image-text model. When text to be processed is input, the model extracts text features and generates a feature vector to be spliced that reflects the semantic content of the text. When an image to be processed is input, the model extracts image features and generates a feature vector to be spliced that reflects the meaning of the image. Furthermore, the text to be processed and the image to be processed may be input into the model simultaneously, in which case the model extracts cross-modal features of both and outputs the corresponding feature vector to be spliced based on those features.
S120: Input the feature vector to be spliced into a pre-trained autoregressive sequence generation model to obtain a target coding sequence corresponding to the data to be processed.
The autoregressive sequence generation model converts the input feature vector to be spliced into a coding sequence having the same semantic content as the data to be processed. The target coding sequence is a coding sequence expressing the same semantic content as the data to be processed; it may be composed of a plurality of identifiers, and its sequence length is determined by the string length set when the autoregressive sequence generation model is trained.
The operation of the autoregressive sequence generation model is described in detail below with reference to fig. 3.
As shown in fig. 3, if the feature vector to be spliced is E, then E may be used as the first input of the autoregressive sequence generation model to generate the first output T(1). T(1) is spliced onto E to form the current input vector, which is fed into the model again to output T(2); T(2) is in turn spliced onto the vector formed by E and T(1), and the process repeats, so that each current input vector consists of the previous input vector plus the previous output. If the number of executions of the model is set to n, the model keeps predicting the next output in the sequence: after the (n-1)-th output T(n-1), the current input vector composed of E, T(1), T(2), ..., T(n-1) is input into the model to obtain the output T(n), and the target coding sequence is composed of T(1) through T(n).
In this embodiment, inputting the feature vector to be spliced into the pre-trained autoregressive sequence generation model to obtain the target coding sequence may proceed as follows: a) take the feature vector to be spliced as the current spliced feature vector, input it into the autoregressive sequence generation model to obtain the current identifier to be decoded, and write that identifier into the coding sequence to be used; b) update the current spliced feature vector based on the current identifier to be decoded, take the updated vector as the input of the model, obtain an updated current identifier to be decoded, and write it into the coding sequence to be used; c) repeat step b until the number of identifiers to be decoded in the coding sequence to be used reaches a preset number threshold, and take the coding sequence to be used as the target coding sequence.
As shown in fig. 3, T(1) through T(n) may each serve as an identifier to be decoded output by the autoregressive sequence generation model. Each identifier obtained from the model is written into the coding sequence to be used and spliced onto the feature vector to be spliced, and the resulting vector becomes the current spliced feature vector, which therefore consists of the feature vector to be spliced plus every identifier output so far. This vector is input into the model again, the operation repeats, and when the number of identifiers to be decoded in the coding sequence to be used is detected to reach the preset number threshold, the coding sequence to be used is taken as the target coding sequence.
Furthermore, a target number of identifiers in the target coding sequence can be preset, and the total number of identifiers decoded so far can be counted during the repeated operation. Before each input to the model, whether the current total is smaller than the target number can be checked; if so, the repetition continues, and if the total equals the target number, the repetition stops, the identifiers to be decoded output at each step are spliced together, and the spliced identifier sequence is determined as the target coding sequence.
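Steps a) through c) amount to a greedy autoregressive decoding loop. The following sketch is illustrative only: it assumes a Transformer-style model object whose call returns next-token logits and which exposes a token embedding table, and it uses greedy argmax selection; these interface details are assumptions, not part of the patent.

```python
import torch

@torch.no_grad()
def generate_target_sequence(model, spliced_vec: torch.Tensor,
                             num_tokens: int) -> list[int]:
    """Greedy decoding per steps a)-c): start from the feature vector to be
    spliced, append each decoded identifier to the input, and stop once the
    preset number threshold of identifiers has been written."""
    current = spliced_vec.view(1, 1, -1)        # (batch=1, seq_len=1, dim)
    sequence = []                               # the coding sequence to be used
    for _ in range(num_tokens):                 # c) until the threshold is hit
        logits = model(current)                 # assumed: next-token logits
        token = int(logits.argmax(dim=-1))      # a) current identifier to be decoded
        sequence.append(token)
        tok_emb = model.token_embedding(torch.tensor([[token]]))  # assumed table
        current = torch.cat([current, tok_emb], dim=1)  # b) splice and repeat
    return sequence                             # the target coding sequence
```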
S130: Reconstruct the target coding sequence to obtain a target image matching the semantic content of the data to be processed.
In this embodiment, image reconstruction is performed on the target coding sequence output by the autoregressive sequence generation model, and the resulting target image reflects the semantic content of the data to be processed. Furthermore, the target coding sequence can be reconstructed two or more times to obtain two or more candidate images, which are then compared for consistency. If they are consistent, the reconstruction process ran without error and any candidate image can be determined as the target image. If not, an operation error may have occurred during reconstruction: the candidate images are deleted, the target coding sequence is reconstructed a preset number of additional times to generate new candidates, and if these are still inconsistent, the reconstruction operation is judged incorrect and alarm information is generated as a prompt. The reason for, and benefit of, multiple reconstructions is that under normal conditions the target images reconstructed from the same target coding sequence are identical; inconsistency therefore reveals an error in the image reconstruction process, so the stability and accuracy of that process can be verified in this way.
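The consistency check can be sketched as follows. This is a hedged illustration: the `reconstruct` callable, the retry count, and the exact-equality comparison are all assumptions introduced for the example.

```python
import numpy as np

def verified_reconstruction(reconstruct, target_codes, retries: int = 1):
    """Reconstruct the same target coding sequence twice and compare the
    candidate images; identical results indicate an error-free decoding.
    `reconstruct` is an assumed callable mapping a code sequence to an image."""
    for _ in range(retries + 1):
        first = reconstruct(target_codes)
        second = reconstruct(target_codes)
        if np.array_equal(first, second):
            return first                    # consistent: use either candidate
        # inconsistent: discard both candidates and try again
    raise RuntimeError("inconsistent reconstructions: raising alarm information")
```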
In this embodiment, reconstructing the target coding sequence to obtain the target image matching the semantic content of the data to be processed may include decoding the target coding sequence with an image reconstruction model.
Specifically, the reconstruction of the target coding sequence is a decoding process: the image reconstruction model decodes the target coding sequence, determines the semantic content of the data to be processed, and generates a target image matching that content. The image reconstruction model may include the decoder of a DVAE model or of a VQ-GAN model.
It should be noted that the target image is used to show the semantic content of the data to be processed. When the data to be processed is an image to be processed, the constituent elements of the target image may differ from those of the image to be processed while the semantic content expressed by the two remains consistent.
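As a hedged sketch of this decoding step, assuming a VQ-GAN/DVAE-style decoder with a learned codebook and a 16 × 16 code grid (the argument names and grid size are assumptions):

```python
import torch

@torch.no_grad()
def decode_target_image(decoder, codebook: torch.nn.Embedding,
                        code_ids: torch.Tensor, grid: int = 16) -> torch.Tensor:
    """Map a target coding sequence of discrete identifiers back to an image:
    look each identifier up in the codebook, arrange the code vectors on a
    spatial grid, and run the decoder network over the grid."""
    codes = codebook(code_ids)                                   # (grid*grid, dim)
    feature_map = codes.view(1, grid, grid, -1).permute(0, 3, 1, 2)  # to NCHW
    return decoder(feature_map)                                  # the target image
```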
According to the technical solution of this embodiment, data to be processed is acquired and a corresponding feature vector to be spliced is determined; this feature vector carries the semantic content features of the data to be processed. The feature vector is then input into a pre-trained autoregressive sequence generation model, which outputs a target coding sequence reflecting those semantic content features, and reconstruction of the target coding sequence yields a target image that displays the semantic content of the data to be processed. Because a model directly generates the target image corresponding to the semantic content of the data to be processed, the time wasted in the prior art on searching a large number of images in a space is avoided, the efficiency of generating a target image from the data to be processed is improved, and the semantic content contained in the data to be processed is embodied in the target image.
Embodiment 2
Fig. 4 is a schematic flowchart of a data processing method provided in Embodiment 2 of the present disclosure. Building on the foregoing embodiment, a cross-modal image-text model to be trained, an encoding/decoding model to be trained, and an autoregressive sequence generation model to be trained are trained on multiple training samples. The result is an autoregressive sequence generation model that converts the input feature vector to be spliced into a coding sequence with the same semantic content as the data to be processed, which helps improve the image quality of the generated target image. For a specific implementation, refer to the technical scheme of this embodiment; technical terms identical or corresponding to those of the above embodiment are not repeated here.
As shown in fig. 4, the method specifically includes the following steps:
s210, training the cross-modal image-text model to be trained based on the plurality of first training samples to obtain the cross-modal image-text model.
In order to train the cross-modal graphics-text model to be trained, a plurality of first training samples are firstly acquired. It can be understood that, in the actual application process, in order to improve the accuracy of the model, as many and as abundant first training samples as possible can be obtained for training the cross-modal graphics-text model to be trained.
The first training sample comprises characters to be trained and images to be trained corresponding to the characters to be trained. It can be understood that the semantic content of the image to be trained in the first training sample is the same as that of the word to be trained. For example, if the text to be trained in the first training sample is a "red vehicle", the image to be trained in the first training sample should be an image on which the red vehicle is drawn; and if the character to be trained in the second training sample is 'jumping', the image to be trained in the second training sample is drawn with an image of jumping motion.
The cross-modal image-text model to be trained comprises a cross-modal text sub-model to be trained and a cross-modal image sub-model to be trained. The cross-modal text submodel to be trained is used for extracting character features of characters to be trained, wherein the character features can contain the language type of the characters, the part of speech and the character meaning of each word in the characters and the like, and character feature vectors corresponding to the character features are generated based on the character features. And the cross-mode image sub-model to be trained is used for extracting the image characteristics of the image to be trained, the image characteristics can comprise the shape, size, meaning, color and other contents of all the constituent elements, and the image characteristic vector corresponding to the image characteristics is generated based on the image characteristics. And correcting model parameters in the cross-modal image-text model to be trained based on the character feature vectors and the corresponding image feature vectors, and training to obtain the cross-modal image-text model.
In this embodiment, training the cross-modal image-text model to be trained on the plurality of first training samples to obtain the cross-modal image-text model may proceed as follows: for each first training sample, input the text to be trained and the image to be trained of the current sample into the cross-modal image-text model to be trained, obtain a character feature vector and an image feature vector, and determine an actual feature similarity matrix based on the two; correct the model parameters of the model based on the actual feature similarity matrix and a theoretical feature matrix; and train with convergence of the loss function as the training target to obtain the cross-modal image-text model.
It should be noted that the training process is the same for every first training sample, so the process is described here for a single first training sample as an example.
In a specific implementation, the actual feature similarity matrix is determined from the character feature vector and the image feature vector; through this matrix, the extracted cross-modal features are mapped into a common space for analysis, and the semantic correlation between the text to be trained and the corresponding image to be trained is reflected in the analysis result.
The theoretical feature matrix, in turn, reflects the semantic correlation between the text to be trained and the corresponding image to be trained in the ideal case.
It should be noted that the closer the actual feature similarity matrix is to the theoretical feature matrix, the better the training effect of the cross-modal image-text model to be trained; the larger the gap between the two, the poorer the training effect.
To bring the feature vectors determined by the cross-modal image-text model to be trained closer to the ideal state, a loss value of the model can be determined and the model parameters adjusted based on it, so that the determined actual feature similarity matrix approaches the theoretical feature matrix; once the loss function converges, training can be considered complete.
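This similarity-matrix objective resembles a standard contrastive (CLIP-style) formulation. The sketch below is an assumption-laden illustration, not the patent's exact loss: it treats the theoretical feature matrix as the identity pairing (sample i's text matches sample i's image) and uses cosine similarity with a symmetric cross-entropy.

```python
import torch
import torch.nn.functional as F

def cross_modal_loss(char_feats: torch.Tensor,
                     image_feats: torch.Tensor) -> torch.Tensor:
    """Loss for a batch of (text, image) pairs. The actual feature similarity
    matrix is the pairwise similarity of the normalized character and image
    feature vectors; the ideal (theoretical) matrix pairs row i with column i,
    expressed here as cross-entropy against diagonal targets."""
    char_feats = F.normalize(char_feats, dim=-1)
    image_feats = F.normalize(image_feats, dim=-1)
    sim = char_feats @ image_feats.t()                        # actual similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)    # diagonal = matched pairs
    # symmetric: text-to-image over rows, image-to-text over columns
    return 0.5 * (F.cross_entropy(sim, targets) +
                  F.cross_entropy(sim.t(), targets))
```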
S220: Train the encoding/decoding model to be trained on a plurality of second training samples to obtain the target encoding/decoding model.
Each second training sample includes a second training image. The encoding/decoding model to be trained includes an image coding sub-model to be trained and an image reconstruction model to be trained, and the output of the image coding sub-model is the input of the image reconstruction model. The image coding sub-model encodes the input second training image to generate a discretized code; the image reconstruction model performs image reconstruction on the discretized code output by the coding sub-model and generates an image with the same semantic content as the second training image.
It should be noted that the model parameters of the image coding sub-model to be trained and of the image reconstruction model to be trained start at default values; during training they are continuously adjusted to obtain a target encoding/decoding model with high accuracy.
In this embodiment, training the encoding/decoding model to be trained on the plurality of second training samples to obtain the target encoding/decoding model may proceed as follows: for each second training sample, input the second training image of the current sample into the image coding sub-model to be trained to obtain a discretized code to be reconstructed; input that code into the image reconstruction model to be trained to obtain a reconstructed image; input the reconstructed image and the second training image into a discriminator to be trained to obtain a discrimination result; correct the model parameters of the image coding sub-model, the image reconstruction model, and the discriminator based on the discrimination result; train with convergence of their loss functions as the training target to obtain the encoding/decoding model to be used; and take the image coding sub-model and the image reconstruction model from the encoding/decoding model to be used as the target encoding/decoding model.
In this embodiment, the image coding sub-model to be trained is a neural network with a VQ-GAN structure used to discretize and encode the input second training image, and the discriminator to be trained judges the degree of similarity between the reconstructed image and the second training image. The initial model parameters of the image coding sub-model, the image reconstruction model, and the discriminator are all default values and must be adjusted continuously through training to obtain the target encoding/decoding model.
The encoding/decoding model training process is described in detail below with reference to fig. 5. As shown in fig. 5, for each second training sample, the second training image is input into the image coding sub-model to be trained to generate the discretized code to be reconstructed; for example, each second training image may be discretized into a 16 × 16 code grid. The discretized code to be reconstructed is then input into the image reconstruction model to be trained, which reconstructs the corresponding reconstructed image.
Further, the reconstructed image and the second training image are each input into the discriminator to be trained, and the discrimination result is determined. For example, the result may be "true" or "false", where "true" indicates that the reconstructed image is highly similar to the second training image and "false" indicates low similarity.
In this embodiment, based on the discrimination result, the model parameters of the image coding sub-model and the image reconstruction model are corrected and optimized so that the output reconstructed image comes ever closer to the input second training image. A loss value is determined from the input second training image, the reconstructed image, and the discrimination result, and the parameters of the image coding sub-model, the image reconstruction model, and the discriminator are corrected based on this loss value; once the loss function converges, training can be considered complete.
In this embodiment, convergence may be judged as follows: the loss values obtained after adjusting the model parameters are all smaller than a preset loss value, the trend of the loss function values has stabilized, or the number of training iterations on the current second training image has reached a preset count threshold. If any of these conditions is met, the model can be considered trained and ready for use.
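A single training step of this encode-reconstruct-discriminate scheme might look like the sketch below. It is a simplified assumption: the codebook/commitment losses of a real VQ-GAN are omitted, the hinge loss and the 0.1 adversarial weight are illustrative, and the module interfaces are invented for the example.

```python
import torch
import torch.nn.functional as F

def codec_train_step(encoder, decoder, discriminator,
                     opt_g, opt_d, image: torch.Tensor):
    """One simplified step: encode the second training image into a code grid,
    reconstruct it, and let the discriminator judge real vs. reconstructed."""
    # generator (image coding sub-model + image reconstruction model) update
    codes = encoder(image)                     # e.g. a 16 x 16 discretized code
    recon = decoder(codes)                     # the reconstructed image
    rec_loss = F.l1_loss(recon, image)         # pixel-level reconstruction loss
    adv_loss = -discriminator(recon).mean()    # push the discriminator to "true"
    opt_g.zero_grad()
    (rec_loss + 0.1 * adv_loss).backward()
    opt_g.step()

    # discriminator update: "true" for real images, "false" for reconstructions
    d_real = discriminator(image)
    d_fake = discriminator(recon.detach())
    d_loss = F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()  # hinge
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
```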
S230: Determine the autoregressive sequence generation model based on the target encoding/decoding model, the cross-modal image-text model, the autoregressive sequence generation model to be trained, and a plurality of third training samples.
Each third training sample includes a third training image, which is input into the cross-modal image-text model to obtain the feature vector needed to train the autoregressive sequence generation model to be trained. The autoregressive sequence generation model to be trained may be a Transformer model; it encodes the input feature vector to generate identifiers to be decoded (see the detailed description in Embodiment 1 for how each identifier is generated). The identifiers form a coding sequence corresponding to the input image, but before training the model parameters have not been fully adjusted, so the resulting coding sequence differs from the expected one. The model to be trained is therefore trained with the third training samples, the target encoding/decoding model, and the cross-modal image-text model to improve the accuracy of the coding sequences generated by the autoregressive sequence generation model.
In this embodiment, determining the autoregressive sequence generation model based on the target encoding/decoding model, the cross-modal image-text model, the model to be trained, and the plurality of third training samples may proceed as follows: a) for each third training sample, input the third training image of the current sample into the cross-modal image-text model to obtain a third feature vector, and encode the third training image with the image coding sub-model of the target encoding/decoding model to obtain an image coding sequence; b) take the third feature vector as the current feature vector, input it into the autoregressive sequence generation model to be trained to obtain the current identifier to be corrected, and update the current feature vector based on that identifier; c) input the current feature vector into the model again, obtain an updated current identifier to be corrected, and update the current feature vector based on it; d) repeat step c until the number of current identifiers to be corrected reaches a preset number threshold, yielding the feature sequence to be reconstructed; then determine the cross-entropy loss between the feature sequence to be reconstructed and the image coding sequence, and correct the autoregressive sequence generation model to be trained based on that loss to obtain the autoregressive sequence generation model.
The third feature vector is a feature vector with the same semantic content as the third training image. The image coding sequence, composed of a plurality of coding identifiers, is obtained by encoding the third training image and likewise shares its semantic content. Specifically, for each input third training image the target encoding/decoding model outputs a corresponding image coding sequence, which serves as a coded representation of that image.
It should be noted that the process of training the autoregressive sequence generation model to be trained is the same for every third training sample, so the process is described here for a single third training sample as an example.
In this embodiment, a corresponding third feature vector and image coding sequence are obtained for the third training image of a third training sample. To save training time, the two steps of inputting the third training image into the cross-modal image-text model and into the target encoding/decoding model may be performed simultaneously; those skilled in the art may also order them according to the actual application, which this embodiment does not limit.
The third feature vector is then input into the autoregressive sequence generation model to be trained as the current feature vector, the corresponding current identifier to be corrected is generated, and the current feature vector is updated based on it.
For example, if the third feature vector is e, then e is input into the model as the current feature vector, the current identifier to be corrected t(1) is generated, t(1) is spliced onto e to form the current input vector, and that vector is input into the model again to obtain t(2); t(2) is in turn spliced onto the current input vector composed of e and t(1) to form a new current input vector, and the procedure repeats. If the number of executions is set to n, the identifier output at step n-1 is t(n-1); the current input vector composed of e, t(1), t(2), ..., t(n-1) is input into the model to obtain t(n), and the feature sequence to be reconstructed is composed of t(1) through t(n).
The preset number threshold of the feature sequence to be reconstructed may be determined in advance; for example, the number of elements of the image coding sequence may be used as the threshold. When the image coding sequence has a 16 × 16 structure, the autoregressive sequence generation model to be trained must generate 256 identifiers to be corrected, which form a feature sequence to be reconstructed with the same 16 × 16 structure. When the number of current identifiers to be corrected is detected to reach the preset number threshold, the process of feeding the third feature vector and the accumulated identifiers back into the model stops.
Further, for each current third training sample, the cross-entropy loss between the feature sequence to be reconstructed and the obtained image coding sequence is calculated. The cross-entropy loss reflects the similarity between the two sequences: the smaller the loss, the more similar they are; the larger the loss, the less similar. To ensure training precision, a cross-entropy loss below a preset entropy loss value can be used as the training target: when the target is met, training on the current third training sample stops; when it is not, training continues until the number of iterations reaches a preset training count. Each third training sample trains the autoregressive sequence generation model to be trained in this way.
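One training step of this procedure can be sketched as follows. As a hedged simplification, the loop of steps b) through d) is replaced here by standard teacher forcing (feeding the whole target sequence at once), and the model interface is assumed; a faithful re-implementation would generate token by token as described above.

```python
import torch
import torch.nn.functional as F

def ar_train_step(ar_model, optimizer,
                  third_feature_vec: torch.Tensor,
                  image_code_sequence: torch.Tensor) -> float:
    """`third_feature_vec` comes from the frozen cross-modal image-text model;
    `image_code_sequence` holds the, e.g., 256 code identifiers produced by the
    frozen image coding sub-model for the same third training image. The model
    is assumed to return one row of logits per position of the sequence."""
    logits = ar_model(third_feature_vec, image_code_sequence)  # (256, vocab)
    # cross-entropy between the feature sequence to be reconstructed and the
    # image coding sequence
    loss = F.cross_entropy(logits, image_code_sequence)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```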
On this basis, the feature sequence to be reconstructed can additionally be input into the image reconstruction model to obtain an image to be corrected; the image to be corrected is input into the cross-modal image-text model to obtain a feature vector to be corrected; a similarity value between the feature vector to be corrected and the third feature vector is determined, and the model parameters of the autoregressive sequence generation model to be trained are corrected based on both the similarity value and the cross-entropy loss, with convergence of the loss function as the training target, to obtain the autoregressive sequence generation model.
The image to be corrected is the reconstructed image corresponding to the feature sequence to be reconstructed, and the feature vector to be corrected is the feature vector corresponding to that sequence. It should be noted that, because the accuracy of the autoregressive sequence generation model to be trained is still poor during training, the feature vector to be corrected deviates from the third feature vector; correcting the model parameters based on the similarity between the two improves the accuracy of the autoregressive sequence generation model.
In this embodiment, the more similar the feature vector to be corrected is to the third feature vector, the more similar the image to be corrected is to the third training image, which indicates that the feature sequence to be reconstructed generates the image to be corrected accurately and reflects that the model is accurate; conversely, a large difference between the two vectors reflects a large model error. Correcting the model parameters based on the similarity value and the cross-entropy loss drives the similarity ever higher and the cross-entropy loss ever lower.
Further, the loss function of the autoregressive sequence generation model to be trained is determined from the model parameters, and the parameters are corrected based on the loss value, the similarity value, and the cross-entropy loss so that the similarity between the feature vector to be corrected and the third feature vector keeps increasing. With convergence of the loss function as the training target, training stops once all third training samples reach the target, and the autoregressive sequence generation model is obtained.
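The combined objective can be written as a weighted sum; the sketch below is illustrative, with cosine similarity standing in for the unspecified similarity value and the weights alpha and beta introduced as assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(ce_loss: torch.Tensor,
                  to_correct_vec: torch.Tensor,
                  third_vec: torch.Tensor,
                  alpha: float = 1.0, beta: float = 0.5) -> torch.Tensor:
    """Combine the cross-entropy term with a semantic-consistency term:
    `to_correct_vec` is the feature vector obtained by reconstructing the
    generated sequence into an image and re-encoding it with the cross-modal
    model; raising its similarity to the third feature vector pushes the
    generated image toward the training image's semantics."""
    similarity = F.cosine_similarity(to_correct_vec, third_vec, dim=-1).mean()
    return alpha * ce_loss + beta * (1.0 - similarity)
```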
S240: Acquire data to be processed and determine the feature vector to be spliced corresponding to the data to be processed; the data to be processed includes text to be processed and/or an image to be processed.
S250: Input the feature vector to be spliced into the pre-trained autoregressive sequence generation model to obtain the target coding sequence corresponding to the data to be processed.
S260: Reconstruct the target coding sequence to obtain a target image matching the semantic content of the data to be processed.
According to the technical solution of this embodiment, training samples are used to train the cross-modal image-text model to be trained and the encoding/decoding model to be trained separately, and the trained cross-modal image-text model and target encoding/decoding model are then used to train the autoregressive sequence generation model to be trained, with convergence of the loss function as the training target, yielding an autoregressive sequence generation model with optimal parameters. This improves the accuracy of the coding sequences generated by the autoregressive sequence generation model and, in turn, the quality of the constructed target image.
Embodiment 3
Fig. 6 is a block diagram of a data processing apparatus according to Embodiment 3 of the present disclosure. The apparatus can execute the data processing method of any embodiment of the present disclosure, and it includes functional modules corresponding to that method together with its beneficial effects. As shown in fig. 6, the apparatus includes: a to-be-processed data acquisition module 310, a to-be-spliced feature vector input module 320, and a target coding sequence reconstruction module 330.
a to-be-processed data acquisition module 310, configured to acquire data to be processed and determine a feature vector to be spliced corresponding to the data to be processed, where the data to be processed includes text to be processed and/or an image to be processed;
the to-be-spliced feature vector input module 320 is configured to input the to-be-spliced feature vector into an autoregressive sequence generation model obtained through pre-training, so as to obtain a target coding sequence corresponding to the to-be-processed data;
and the target coding sequence reconstruction module 330 is configured to reconstruct the target coding sequence to obtain a target image matched with the semantic content of the data to be processed.
On the basis of the above technical solution, the to-be-processed data obtaining module 310 includes:
a to-be-processed data input unit, configured to input the data to be processed into a pre-trained cross-modal image-text model to obtain the feature vector to be spliced;
the cross-modal image-text model is used for processing the data to be processed into corresponding feature vectors.
On the basis of the above technical solution, the to-be-spliced feature vector input module 320 includes:
a current splicing feature vector input unit, configured to a) take the feature vector to be spliced as the current splicing feature vector, input the current splicing feature vector into the autoregressive sequence generation model to obtain a current identifier to be decoded, and write the current identifier to be decoded into a coding sequence to be used; b) update the current splicing feature vector based on the current identifier to be decoded, take the updated current splicing feature vector as the input of the autoregressive sequence generation model, update the current identifier to be decoded, and write it into the coding sequence to be used; c) repeat step b until the number of identifiers to be decoded in the coding sequence to be used reaches a preset number threshold, and take the coding sequence to be used as the target coding sequence.
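A minimal sketch of this a)-c) generation loop, assuming a PyTorch-style model that returns per-position logits and exposes an embedding lookup (model, embed and threshold are hypothetical names):

    import torch

    # Sketch of the a)-c) loop; `model` is assumed to return per-position
    # logits of shape (seq_len, vocab) and to expose an embedding lookup.
    def autoregressive_decode(model, spliced_vector, threshold):
        current = spliced_vector               # a) initial current splicing feature vector
        coding_sequence = []                   # the coding sequence to be used
        while len(coding_sequence) < threshold:            # c) preset number threshold
            logits = model(current)[-1]                    # logits for the next position
            next_id = int(logits.argmax())                 # current identifier to be decoded
            coding_sequence.append(next_id)                # write it into the sequence
            emb = model.embed(torch.tensor([next_id]))     # b) embed the identifier ...
            current = torch.cat([current, emb], dim=0)     # ... and extend the input
        return coding_sequence                 # the target coding sequence

Greedy argmax selection is used here purely for brevity; any sampling rule over the logits would fit the same loop.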
On the basis of the above technical solution, the target coding sequence reconstruction module 330 includes:
and the decoding unit is used for decoding the target coding sequence based on the image reconstruction model to obtain a target image matched with the semantic content corresponding to the data to be processed.
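As an illustration of such a decoding unit, the sketch below looks the target coding sequence up in a learned codebook and runs a decoder network over the resulting feature map; the codebook/decoder interface, the NCHW layout and the 16x16 grid are assumptions of the sketch rather than details fixed by this disclosure:

    import torch

    # Hedged sketch of the decoding unit: codebook lookup followed by a decoder
    # network. The (vocab, dim) codebook and 16x16 grid are assumptions.
    def decode_target_sequence(codebook, decoder, target_codes, grid=16):
        vectors = codebook[torch.tensor(target_codes)]   # (grid*grid, dim) code vectors
        feature_map = vectors.T.reshape(1, vectors.size(1), grid, grid)
        return decoder(feature_map)                      # target image tensor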
On the basis of the technical scheme, the device further comprises:
the cross-modal image-text model training module is used for training a cross-modal image-text model to be trained on the basis of a plurality of first training samples to obtain the cross-modal image-text model; the first training sample comprises characters to be trained and images to be trained corresponding to the characters to be trained;
the coding and decoding model training module is used for training the coding and decoding model to be trained on the basis of a plurality of second training samples to obtain a target coding and decoding model; wherein the second training sample comprises a second training image; the coding and decoding model to be trained comprises an image coding sub-model to be trained and an image reconstruction model to be trained; the output of the image coding sub-model to be trained is the input of the image reconstruction model to be trained;
the autoregressive sequence generation model determining module is used for determining the autoregressive sequence generation model based on the target coding and decoding model, the cross-modal image-text model, the autoregressive sequence generation model to be trained and a plurality of third training samples; wherein the third training sample comprises a third training image.
On the basis of the technical scheme, the cross-modal image-text model training module comprises:
the actual feature similarity matrix determining unit is used for inputting, for each first training sample, the characters to be trained and the images to be trained in the current first training sample into the cross-modal image-text model to be trained to obtain character feature vectors and image feature vectors respectively, and for determining an actual feature similarity matrix based on the character feature vectors and the image feature vectors;
the first parameter correction unit is used for correcting the model parameters in the cross-modal image-text model to be trained based on the actual feature similarity matrix and the theoretical feature matrix;
and the cross-modal image-text model training unit is used for training to obtain the cross-modal image-text model by taking the loss function convergence in the cross-modal image-text model to be trained as a training target.
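For illustration, such a contrastive training step might be sketched as follows, with the theoretical feature matrix taken to be the identity (each character sample matches only its own image); PyTorch and the symmetric cross-entropy form are assumptions of the sketch:

    import torch
    import torch.nn.functional as F

    # Sketch of a contrastive step pushing the actual feature similarity matrix
    # toward a theoretical (identity) matrix; temperature is an assumption.
    def contrastive_step(character_feats, image_feats, temperature=0.07):
        t = F.normalize(character_feats, dim=-1)
        v = F.normalize(image_feats, dim=-1)
        actual = t @ v.T / temperature            # actual feature similarity matrix
        targets = torch.arange(t.size(0))         # diagonal = matching text/image pairs
        return (F.cross_entropy(actual, targets) +
                F.cross_entropy(actual.T, targets)) / 2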
On the basis of the technical scheme, the coding and decoding model training module comprises:
the second training image input unit is used for inputting, for each second training sample, a second training image in the current second training sample into the image coding sub-model to be trained to obtain a discretization code to be reconstructed;
the device comprises a to-be-reconstructed discretization code input unit, a to-be-reconstructed discretization code generating unit and a to-be-reconstructed discretization code generating unit, wherein the to-be-reconstructed discretization code input unit is used for inputting the to-be-reconstructed discretization code into an image reconstruction model to be trained to obtain a reconstructed image;
the reconstructed image input unit is used for inputting the reconstructed image and the second training image into a discriminator to be trained to obtain a discrimination result;
the second parameter correction unit is used for correcting the model parameters in the image coding sub-model to be trained, the image reconstruction model to be trained and the discriminator to be trained based on the discrimination result;
the to-be-used coding and decoding model training unit is used for training to obtain the to-be-used coding and decoding model by taking convergence of the loss functions in the image coding sub-model to be trained, the image reconstruction model to be trained and the discriminator to be trained as the training target;
and the image reconstruction model acquisition unit is used for acquiring the image coding sub-model and the image reconstruction model in the to-be-used coding and decoding model to obtain the target coding and decoding model.
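One adversarial training step of this kind might look as follows; the reconstruction term, the hinge-style discriminator losses and the optimizer interfaces are assumptions made for the sake of the sketch, with encoder, decoder and discriminator standing in for the image coding sub-model, the image reconstruction model and the discriminator to be trained:

    import torch.nn.functional as F

    # Sketch of one adversarial step; modules, losses and optimizers are
    # assumptions, not components fixed by this disclosure.
    def codec_step(encoder, decoder, discriminator, opt_g, opt_d, image):
        codes = encoder(image)                    # discretization code to be reconstructed
        recon = decoder(codes)                    # reconstructed image
        # Generator side: reconstruct the second training image and fool the discriminator.
        g_loss = F.l1_loss(recon, image) - discriminator(recon).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        # Discriminator side: hinge loss separating real images from reconstructions.
        d_loss = (F.relu(1.0 - discriminator(image)).mean() +
                  F.relu(1.0 + discriminator(recon.detach())).mean())
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        return g_loss.item(), d_loss.item()

Alternating these two updates corrects all three sets of model parameters from the discrimination result, as described for the second parameter correction unit above.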
On the basis of the technical scheme, the autoregressive sequence generation model determining module comprises:
the encoding processing unit is used for a) inputting, for each third training sample, a third training image in the current third training sample into the cross-modal image-text model to obtain a third feature vector, and coding the third training image based on the image coding sub-model in the target coding and decoding model to obtain an image coding sequence; b) taking the third feature vector as the current feature vector, inputting the current feature vector into the autoregressive sequence generation model to be trained to obtain a current identifier to be corrected, and updating the current feature vector based on the current identifier to be corrected; c) taking the updated current feature vector as the input of the autoregressive sequence generation model to be trained again, updating the current identifier to be corrected, and updating the current feature vector based on the updated current identifier to be corrected; d) repeating step c until the number of current identifiers to be corrected reaches a preset number threshold, obtaining a feature sequence to be reconstructed; and determining a cross entropy loss based on the feature sequence to be reconstructed and the image coding sequence, and correcting the autoregressive sequence generation model to be trained based on the cross entropy loss to obtain the autoregressive sequence generation model.
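A minimal sketch of this rollout-and-score step, under PyTorch-style assumptions (ar_model, its embed lookup and the greedy argmax update are hypothetical simplifications, not the disclosed training procedure):

    import torch
    import torch.nn.functional as F

    # Sketch of steps a)-d): roll the model out for as many identifiers as the
    # image coding sequence contains, then score the rollout with cross entropy.
    def ar_train_step(ar_model, third_feature, image_codes):
        current, step_logits = third_feature, []
        for _ in range(len(image_codes)):              # d) preset number threshold
            logits = ar_model(current)[-1]             # next identifier to be corrected
            step_logits.append(logits)
            next_id = int(logits.argmax())
            emb = ar_model.embed(torch.tensor([next_id]))
            current = torch.cat([current, emb], dim=0) # b)/c) update the current vector
        pred = torch.stack(step_logits)                # one logit row per generated code
        return F.cross_entropy(pred, torch.tensor(image_codes))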
On the basis of the technical scheme, the device further comprises:
the image to be corrected obtaining module is used for inputting the characteristic sequence to be reconstructed into the image reconstruction model to obtain an image to be corrected; inputting the image to be corrected into the cross-mode image-text model to obtain a characteristic vector to be corrected;
a similarity value determination module, configured to determine a similarity value between the feature vector to be modified and the third feature vector, and modify a model parameter in the autoregressive sequence generation model to be trained based on the similarity value and the cross entropy loss;
and the autoregressive sequence generation model obtaining unit is used for taking the loss function convergence in the autoregressive sequence generation model to be trained as a training target to obtain the autoregressive sequence generation model.
According to the technical scheme of the embodiment of the present disclosure, the feature vector to be spliced corresponding to the data to be processed is determined from the acquired data to be processed; the feature vector to be spliced is input into an autoregressive sequence generation model obtained by pre-training to obtain a target coding sequence corresponding to the data to be processed; and the target coding sequence is reconstructed to obtain a target image matched with the semantic content of the data to be processed. Because the target image is obtained by reconstructing the target coding sequence produced by the autoregressive sequence generation model, the time spent on image search operations is shortened and data processing efficiency is improved; and because the target image is determined from the feature vector to be spliced, it embodies the features of the data to be processed, which improves the accuracy of the determined target image.
The data processing device provided by the embodiment of the disclosure can execute the data processing method provided by any embodiment of the disclosure, and has the corresponding functional module and the beneficial effect of executing the data processing method.
It should be noted that, the units and modules included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the embodiments of the present disclosure.
Example Four
Fig. 7 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present disclosure. Referring now to fig. 7, a schematic diagram of an electronic device 400 (e.g., a terminal device or a server) suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player) and a vehicle terminal (e.g., a car navigation terminal), and stationary terminals such as a digital TV and a desktop computer. The electronic device shown in fig. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 7, electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic device 400 are also stored. The processing device 401, the ROM 402 and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 400 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, the processes described above with reference to the flow diagrams may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 409, or from the storage device 408, or from the ROM 402. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 401.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The electronic device provided by the embodiment of the present disclosure and the data processing method provided by the above embodiment belong to the same inventive concept, and technical details that are not described in detail in the embodiment can be referred to the above embodiment, and the embodiment has the same beneficial effects as the above embodiment.
Example Five
The disclosed embodiments provide a computer storage medium on which a computer program is stored, which when executed by a processor implements the data processing method provided by the above-described embodiments.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet) and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquiring data to be processed, and determining a feature vector to be spliced corresponding to the data to be processed; the data to be processed comprises texts to be processed and/or images to be processed;
inputting the feature vector to be spliced into an autoregressive sequence generation model obtained by pre-training to obtain a target coding sequence corresponding to the data to be processed;
and reconstructing the target coding sequence to obtain a target image matched with the semantic content of the data to be processed.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure [ example one ] there is provided a data processing method comprising:
acquiring data to be processed, and determining a feature vector to be spliced corresponding to the data to be processed; the data to be processed comprises texts to be processed and/or images to be processed;
inputting the feature vector to be spliced into an autoregressive sequence generation model obtained by pre-training to obtain a target coding sequence corresponding to the data to be processed;
and reconstructing the target coding sequence to obtain a target image matched with the semantic content of the data to be processed.
According to one or more embodiments of the present disclosure [ example two ] there is provided a data processing method, comprising:
optionally, the determining the feature vector to be spliced corresponding to the data to be processed includes:
inputting the data to be processed into a cross-modal image-text model obtained by pre-training to obtain the feature vector to be spliced;
the cross-modal image-text model is used for processing the data to be processed into corresponding feature vectors.
According to one or more embodiments of the present disclosure [ example three ] there is provided a data processing method, comprising:
optionally, the inputting the feature vector to be spliced into an autoregressive sequence generation model obtained by pre-training to obtain a target coding sequence corresponding to the data to be processed includes:
a) taking the feature vector to be spliced as a current splicing feature vector, inputting the current splicing feature vector into the autoregressive sequence generation model, obtaining a current identifier to be decoded and writing the current identifier into a coding sequence to be used;
b) updating the current splicing feature vector based on the current identifier to be decoded, taking the updated current splicing feature vector as the input of the autoregressive sequence generation model, updating the current identifier to be decoded, and writing the updated current identifier to be decoded into the coding sequence to be used;
c) repeating step b until the number of identifiers to be decoded in the coding sequence to be used reaches a preset number threshold, and taking the coding sequence to be used as the target coding sequence.
According to one or more embodiments of the present disclosure, [ example four ] there is provided a data processing method, comprising:
optionally, the reconstructing the target coding sequence to obtain a target image matched with semantic content corresponding to the data to be processed includes:
and decoding the target coding sequence based on an image reconstruction model to obtain a target image matched with semantic content corresponding to the data to be processed.
According to one or more embodiments of the present disclosure [ example five ] there is provided a data processing method, further comprising:
optionally, training a cross-modal image-text model to be trained based on a plurality of first training samples to obtain the cross-modal image-text model; the first training sample comprises characters to be trained and images to be trained corresponding to the characters to be trained;
training the coding and decoding model to be trained based on a plurality of second training samples to obtain a target coding and decoding model; wherein the second training sample comprises a second training image; the coding and decoding model to be trained comprises an image coding sub-model to be trained and an image reconstruction model to be trained; the output of the image coding sub-model to be trained is the input of the image reconstruction model to be trained;
determining the autoregressive sequence generation model based on the target coding and decoding model, the cross-modal image-text model, the autoregressive sequence generation model to be trained and a plurality of third training samples;
wherein the third training sample comprises a third training image.
According to one or more embodiments of the present disclosure [ example six ] there is provided a data processing method, comprising:
optionally, the training the cross-modal image-text model to be trained based on the plurality of first training samples to obtain the cross-modal image-text model includes:
inputting, for each first training sample, the characters to be trained and the images to be trained in the current first training sample into the cross-modal image-text model to be trained to obtain character feature vectors and image feature vectors respectively, and determining an actual feature similarity matrix based on the character feature vectors and the image feature vectors;
correcting the model parameters in the cross-modal image-text model to be trained based on the actual feature similarity matrix and the theoretical feature matrix;
and taking convergence of the loss function in the cross-modal image-text model to be trained as the training target, and training to obtain the cross-modal image-text model.
According to one or more embodiments of the present disclosure, [ example seven ] there is provided a data processing method, comprising:
optionally, the training the coding and decoding model to be trained based on the plurality of second training samples to obtain a target coding and decoding model includes:
for each second training sample, inputting a second training image in the current second training sample into the to-be-trained image coding sub-model to obtain a to-be-reconstructed discretization code;
inputting the discretization code to be reconstructed into an image reconstruction model to be trained to obtain a reconstructed image;
inputting the reconstructed image and the second training image into a discriminator to be trained to obtain a discrimination result;
correcting the model parameters in the image coding sub-model to be trained, the image reconstruction model to be trained and the discriminator to be trained based on the discrimination result;
taking convergence of the loss functions in the image coding sub-model to be trained, the image reconstruction model to be trained and the discriminator to be trained as the training target, and training to obtain a coding and decoding model to be used;
and acquiring an image coding sub-model and an image reconstruction model in the coding and decoding model to be used to obtain the target coding and decoding model.
According to one or more embodiments of the present disclosure, [ example eight ] there is provided a data processing method, comprising:
optionally, the determining the autoregressive sequence generation model based on the target coding and decoding model, the cross-modal image-text model, the autoregressive sequence generation model to be trained and a plurality of third training samples includes:
a) inputting, for each third training sample, a third training image in the current third training sample into the cross-modal image-text model to obtain a third feature vector; and coding the third training image based on the image coding sub-model in the target coding and decoding model to obtain an image coding sequence;
b) taking the third feature vector as a current feature vector, inputting the current feature vector into the autoregressive sequence generation model to be trained to obtain a current identifier to be corrected, and updating the current feature vector based on the current identifier to be corrected;
c) taking the updated current feature vector as the input of the autoregressive sequence generation model to be trained again, updating the current identifier to be corrected, and updating the current feature vector based on the updated current identifier to be corrected;
d) repeating step c until the number of current identifiers to be corrected reaches a preset number threshold, and obtaining a feature sequence to be reconstructed;
and determining cross entropy loss based on the characteristic sequence to be reconstructed and the image coding sequence, and correcting the autoregressive sequence generation model to be trained based on the cross entropy loss to obtain the autoregressive sequence generation model. And determining cross entropy loss based on the characteristic sequence to be reconstructed and the image coding sequence, and correcting the autoregressive sequence generation model to be trained based on the cross entropy loss to obtain the autoregressive sequence generation model.
According to one or more embodiments of the present disclosure, [ example nine ] there is provided a data processing method, further comprising:
optionally, inputting the feature sequence to be reconstructed into the image reconstruction model to obtain an image to be corrected, and inputting the image to be corrected into the cross-modal image-text model to obtain a feature vector to be corrected;
determining a similarity value between the feature vector to be corrected and the third feature vector, and correcting the model parameters in the autoregressive sequence generation model to be trained based on the similarity value and the cross entropy loss;
and taking convergence of the loss function in the autoregressive sequence generation model to be trained as the training target, to obtain the autoregressive sequence generation model.
According to one or more embodiments of the present disclosure, [ example ten ] there is provided a data processing apparatus comprising:
the device comprises a to-be-processed data acquisition module, a splicing module and a splicing module, wherein the to-be-processed data acquisition module is used for acquiring to-be-processed data and determining to-be-spliced characteristic vectors corresponding to the to-be-processed data; the data to be processed comprises texts to be processed and/or images to be processed;
the to-be-spliced feature vector input module is used for inputting the to-be-spliced feature vector into an autoregressive sequence generation model obtained by pre-training to obtain a target coding sequence corresponding to the to-be-processed data;
and the target coding sequence reconstruction module is used for reconstructing the target coding sequence to obtain a target image matched with the semantic content of the data to be processed.
The foregoing description is only a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions in which the above features are replaced with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (12)

1. A data processing method, comprising:
acquiring data to be processed, and determining a feature vector to be spliced corresponding to the data to be processed; the data to be processed comprises texts to be processed and/or images to be processed;
inputting the feature vectors to be spliced into an autoregressive sequence generation model obtained by pre-training to obtain a target coding sequence corresponding to the data to be processed;
and reconstructing the target coding sequence to obtain a target image matched with the semantic content of the data to be processed.
2. The method of claim 1, wherein the determining the feature vector to be spliced corresponding to the data to be processed comprises:
inputting the data to be processed into a cross-modal image-text model obtained by pre-training to obtain the feature vector to be spliced;
the cross-modal image-text model is used for processing the data to be processed into corresponding feature vectors.
3. The method according to claim 1, wherein the inputting the feature vector to be spliced into an autoregressive sequence generation model obtained by pre-training to obtain a target coding sequence corresponding to the data to be processed comprises:
a) taking the feature vector to be spliced as a current splicing feature vector, inputting the current splicing feature vector into the autoregressive sequence generation model, obtaining a current identifier to be decoded and writing the current identifier into a coding sequence to be used;
b) updating the current splicing feature vector based on the current identifier to be decoded, taking the updated current splicing feature vector as the input of the autoregressive sequence generation model, updating the current identifier to be decoded, and writing the updated current identifier to be decoded into the coding sequence to be used;
c) repeating step b until the number of identifiers to be decoded in the coding sequence to be used reaches a preset number threshold, and taking the coding sequence to be used as the target coding sequence.
4. The method according to claim 1, wherein the reconstructing the target coding sequence to obtain a target image matching semantic content corresponding to the data to be processed comprises:
and decoding the target coding sequence based on an image reconstruction model to obtain a target image matched with semantic content corresponding to the data to be processed.
5. The method of any of claims 1-4, further comprising:
training a cross-modal image-text model to be trained based on a plurality of first training samples to obtain the cross-modal image-text model; the first training sample comprises characters to be trained and images to be trained corresponding to the characters to be trained;
training the coding and decoding model to be trained based on a plurality of second training samples to obtain a target coding and decoding model; wherein the second training sample comprises a second training image; the coding and decoding model to be trained comprises an image coding sub-model to be trained and an image reconstruction model to be trained; the output of the image coding sub-model to be trained is the input of the image reconstruction model to be trained;
determining the autoregressive sequence generation model based on the target coding and decoding model, the cross-modal image-text model, the autoregressive sequence generation model to be trained and a plurality of third training samples;
wherein the third training sample comprises a third training image.
6. The method of claim 5, wherein the training the cross-modal image-text model to be trained based on the plurality of first training samples to obtain the cross-modal image-text model comprises:
inputting, for each first training sample, the characters to be trained and the images to be trained in the current first training sample into the cross-modal image-text model to be trained to obtain character feature vectors and image feature vectors respectively, and determining an actual feature similarity matrix based on the character feature vectors and the image feature vectors;
correcting the model parameters in the cross-modal image-text model to be trained based on the actual feature similarity matrix and the theoretical feature matrix;
and taking convergence of the loss function in the cross-modal image-text model to be trained as the training target, and training to obtain the cross-modal image-text model.
7. The method according to claim 5, wherein the training the coding and decoding model to be trained based on the plurality of second training samples to obtain the target coding and decoding model comprises:
inputting, for each second training sample, a second training image in the current second training sample into the image coding sub-model to be trained to obtain a discretization code to be reconstructed;
inputting the discretization code to be reconstructed into an image reconstruction model to be trained to obtain a reconstructed image;
inputting the reconstructed image and the second training image into a discriminator to be trained to obtain a discrimination result;
correcting the model parameters in the image coding sub-model to be trained, the image reconstruction model to be trained and the discriminator to be trained based on the discrimination result;
taking convergence of the loss functions in the image coding sub-model to be trained, the image reconstruction model to be trained and the discriminator to be trained as the training target, and training to obtain a coding and decoding model to be used;
and acquiring an image coding sub-model and an image reconstruction model in the coding and decoding model to be used to obtain the target coding and decoding model.
8. The method of claim 5, wherein the determining the autoregressive sequence generation model based on the target coding and decoding model, the cross-modal image-text model, the autoregressive sequence generation model to be trained and a plurality of third training samples comprises:
a) inputting, for each third training sample, a third training image in the current third training sample into the cross-modal image-text model to obtain a third feature vector; and coding the third training image based on the image coding sub-model in the target coding and decoding model to obtain an image coding sequence;
b) taking the third feature vector as a current feature vector, inputting the current feature vector into the autoregressive sequence generation model to be trained to obtain a current identifier to be corrected, and updating the current feature vector based on the current identifier to be corrected;
c) taking the updated current feature vector as the input of the autoregressive sequence generation model to be trained again, updating the current identifier to be corrected, and updating the current feature vector based on the updated current identifier to be corrected;
d) repeating step c until the number of current identifiers to be corrected reaches a preset number threshold, and obtaining a feature sequence to be reconstructed;
and determining a cross entropy loss based on the feature sequence to be reconstructed and the image coding sequence, and correcting the autoregressive sequence generation model to be trained based on the cross entropy loss to obtain the autoregressive sequence generation model.
9. The method of claim 8, further comprising:
inputting the feature sequence to be reconstructed into the image reconstruction model to obtain an image to be corrected; and inputting the image to be corrected into the cross-modal image-text model to obtain a feature vector to be corrected;
determining a similarity value between the feature vector to be corrected and the third feature vector, and correcting the model parameters in the autoregressive sequence generation model to be trained based on the similarity value and the cross entropy loss;
and taking convergence of the loss function in the autoregressive sequence generation model to be trained as the training target, to obtain the autoregressive sequence generation model.
10. A data processing apparatus, comprising:
the device comprises a to-be-processed data acquisition module, a splicing module and a splicing module, wherein the to-be-processed data acquisition module is used for acquiring to-be-processed data and determining to-be-spliced characteristic vectors corresponding to the to-be-processed data; the data to be processed comprises texts to be processed and/or images to be processed;
the to-be-spliced feature vector input module is used for inputting the to-be-spliced feature vector into an autoregressive sequence generation model obtained by pre-training to obtain a target coding sequence corresponding to the to-be-processed data;
and the target coding sequence reconstruction module is used for reconstructing the target coding sequence to obtain a target image matched with the semantic content of the data to be processed.
11. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement a data processing method as claimed in any one of claims 1-9.
12. A storage medium containing computer-executable instructions for performing the data processing method of any one of claims 1-9 when executed by a computer processor.
CN202210173457.7A 2022-02-24 2022-02-24 Data processing method and device, electronic equipment and storage medium Pending CN114564606A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210173457.7A CN114564606A (en) 2022-02-24 2022-02-24 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210173457.7A CN114564606A (en) 2022-02-24 2022-02-24 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114564606A true CN114564606A (en) 2022-05-31

Family

ID=81716384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210173457.7A Pending CN114564606A (en) 2022-02-24 2022-02-24 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114564606A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292455A (en) * 2022-10-08 2022-11-04 有米科技股份有限公司 Training method and device of image-text matching model
CN118504623A (en) * 2024-07-17 2024-08-16 广东石油化工学院 Bearing vibration data generation method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN113470619B (en) Speech recognition method, device, medium and equipment
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110969012B (en) Text error correction method and device, storage medium and electronic equipment
CN110413812B (en) Neural network model training method and device, electronic equipment and storage medium
CN114418834A (en) Character generation method and device, electronic equipment and storage medium
CN113327599B (en) Voice recognition method, device, medium and electronic equipment
CN114564606A (en) Data processing method and device, electronic equipment and storage medium
CN112270200B (en) Text information translation method and device, electronic equipment and storage medium
JP7520246B2 (en) Method and apparatus for generating text - Patents.com
CN114330236A (en) Character generation method and device, electronic equipment and storage medium
CN112883968A (en) Image character recognition method, device, medium and electronic equipment
US20240062253A1 (en) Advertisement title rewriting method, apparatus and device, and storage medium
CN115967833A Video generation method, device, equipment and storage medium
CN115908640A (en) Method and device for generating image, readable medium and electronic equipment
CN113191257B (en) Order of strokes detection method and device and electronic equipment
CN114625876B (en) Method for generating author characteristic model, method and device for processing author information
CN116644180A (en) Training method and training system for text matching model and text label determining method
CN114495112B (en) Method and device for processing text in image, readable medium and electronic equipment
CN112530416B (en) Speech recognition method, apparatus, device and computer readable medium
CN111737572B (en) Search statement generation method and device and electronic equipment
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment
CN110852043B (en) Text transcription method, device, equipment and storage medium
CN113986958A (en) Text information conversion method and device, readable medium and electronic equipment
CN112328751A (en) Method and device for processing text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination