WO2024011815A1 - 图文互检模型训练方法及装置、图文互检方法、设备 - Google Patents

图文互检模型训练方法及装置、图文互检方法、设备 Download PDF

Info

Publication number
WO2024011815A1
WO2024011815A1 PCT/CN2022/134092 CN2022134092W WO2024011815A1 WO 2024011815 A1 WO2024011815 A1 WO 2024011815A1 CN 2022134092 W CN2022134092 W CN 2022134092W WO 2024011815 A1 WO2024011815 A1 WO 2024011815A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
image
data
training
features
Prior art date
Application number
PCT/CN2022/134092
Other languages
English (en)
French (fr)
Inventor
李仁刚
王立
郭振华
范宝余
Original Assignee
苏州元脑智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司 filed Critical 苏州元脑智能科技有限公司
Publication of WO2024011815A1 publication Critical patent/WO2024011815A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • This application relates to the field of retrieval technology, in particular to image and text mutual inspection model training methods and devices, image and text mutual inspection methods and equipment.
  • the purpose of this application is to provide image and text mutual inspection model training methods and devices, image and text mutual inspection methods and equipment, which can improve the performance of the image and text mutual inspection model, and thereby improve the accuracy of image and text mutual inspection.
  • the specific plan is as follows:
  • this application discloses a method for training a picture-text mutual inspection model, which includes:
  • the training data pair includes text training data and image training data
  • the text training data includes long text data
  • the long text data is text data containing multiple target texts
  • the target text is a sentence or phrase
  • the text coding module includes a multi-layer LSTM network, and multiple The layer LSTM network includes a first LSTM network layer and a second LSTM network layer.
  • the first LSTM network layer is used to obtain the characteristics of each target text based on the characteristics of each word in each target text;
  • the second LSTM network layer is used to obtain the characteristics of each target text based on The features of each target text obtain the features of long text data;
  • the initial model after parameter adjustment is determined as the image-text mutual detection model.
  • the first LSTM network layer includes multiple BiLSTM networks, and each BiLSTM network includes multiple BiLSTM units; different BiLSTM units are used to extract features of different words, and different BiLSTM networks output features of different target texts;
  • the second LSTM network layer includes multiple BiLSTM units, and the input of the BiLSTM unit is the feature of the target text output by the corresponding BiLSTM network in the first LSTM network layer.
  • the text training data includes multiple long text data.
  • the text encoding module includes multiple multi-layer LSTM networks, and each multi-layer LSTM network is used to obtain the characteristics of a long text data.
  • the text training data also includes short text data, which is text data containing a target text; correspondingly, the text encoding module also includes a short text feature extraction module for extracting features of the short text data.
  • use the text encoding module in the initial model to extract text encoding features of the text training data including:
  • the features of multiple long text data and the features of short text data are spliced to obtain the text encoding features of the text training data.
  • training data pairs including:
  • Each type of text data is determined as the text training data in the training data pair, and the image data is determined as the image training data in the training data pair.
  • the image training data is an image sequence
  • the image coding module includes a backbone network and a BiLSTM network.
  • the image coding module is used to extract the image coding features of the image training data, including:
  • the image coding module also includes an attention structure.
  • the image features are input into the BiLSTM network to obtain image coding features, including:
  • the final features of each image feature are determined based on the attention weight, and the final features are input into the BiLSTM network to obtain the image encoding features.
  • calculate training loss based on text encoding features and image encoding features including:
  • the coding feature pairs are the coding pairs composed of the text coding features and image coding features of the training data pairs
  • the anchor sample is any text encoding feature or image encoding feature in the N encoding feature pairs
  • the positive sample is another encoding feature paired with the anchor sample
  • the negative sample is the other encoding feature in the N encoding feature pair. All coding features except;
  • the training loss is calculated based on the anchor sample and the positive and negative samples corresponding to the anchor sample.
  • the training data pairs into the initial model before inputting the training data pairs into the initial model, it also includes:
  • the target long text data is long text data with temporal relationships between sentences in the text training data
  • the target long text data will be scrambled, otherwise the target long text data will not be scrambled;
  • the label indicates whether the target long text data has been scrambled.
  • Optional also includes:
  • the temporal constraint loss is calculated based on the characteristics of the target long text data and the labels of the target long text data.
  • calculate the training loss based on the anchor sample and the positive and negative samples corresponding to the anchor sample including:
  • the training loss is calculated using the target triplet loss and the temporal constraint loss.
  • this application discloses a picture-text mutual inspection method, which includes:
  • target data is image data or text data
  • all data encoding features are text encoding features; if the target data is text data, all data encoding features are image encoding features.
  • this application discloses a picture-text mutual inspection model training device, which includes:
  • the training data acquisition module is used to obtain training data pairs; the training data pairs include text training data and image training data.
  • the text training data includes long text data.
  • the long text data is text data containing multiple target texts.
  • the target text is a sentence or phrase;
  • the feature extraction module is used to input the training data pairs into the initial model, and respectively uses the text coding module and the image coding module in the initial model to extract the text coding features of the text training data and the image coding features of the image training data;
  • the text coding module includes Multi-layer LSTM network
  • the multi-layer LSTM network includes a first LSTM network layer and a second LSTM network layer.
  • the first LSTM network layer is used to obtain the characteristics of each target text based on the characteristics of each word in each target text
  • the second The LSTM network layer is used to obtain the features of long text data based on the features of each target text;
  • Loss calculation module used to calculate training loss based on text encoding features and image encoding features
  • Parameter adjustment module used to adjust parameters of the initial model based on training loss
  • the image-text mutual detection model determination module is used to determine the initial model after parameter adjustment as the image-text mutual detection model if the training loss meets the convergence condition.
  • this application discloses an electronic device, including a memory and a processor, wherein:
  • Memory used to hold computer programs
  • the processor is configured to execute a computer program to implement the aforementioned image-text mutual inspection model training method, and/or the aforementioned image-text mutual inspection method.
  • the present application discloses a non-volatile readable storage medium for storing a computer program, wherein when the computer program is executed by a processor, the aforementioned image-text mutual inspection model training method is implemented, and/or, the aforementioned The picture and text mutual inspection method.
  • this application obtains training data pairs; the training data pairs include text training data and image training data, the text training data includes long text data, the long text data is text data containing multiple target texts, and the target text is a sentence or phrase; then Input the training data pairs into the initial model, and use the text coding module and image coding module in the initial model to extract the text coding features of the text training data and the image coding features of the image training data; among them, the text coding module includes a multi-layer LSTM network, and multiple The layer LSTM network includes a first LSTM network layer and a second LSTM network layer.
  • the first LSTM network layer is used to obtain the characteristics of each target text based on the characteristics of each word in each target text; the second LSTM network layer is used to obtain the characteristics of each target text based on The features of each target text are obtained from the features of the long text data. Then the training loss is calculated based on the text encoding features and image encoding features, and the parameters of the initial model are adjusted based on the training loss. If the training loss meets the convergence conditions, the adjusted parameters are The initial model is determined to be a picture-text mutual inspection model.
  • a multi-layer LSTM network is used to extract the features of text training data, and the first LSTM network layer is first used to obtain the features of each sentence or phrase based on the features of each word in each target text, and then the second LSTM network layer is used to obtain the features of each sentence or phrase.
  • the LSTM network layer obtains the characteristics of long text data based on the characteristics of each sentence or phrase. In this way, it solves the problem of information forgetting between words, sentences or phrases that are far apart in long text data, and obtains richer text information, which can Improve the performance of the image-text mutual inspection model, thereby improving the accuracy of image-text mutual inspection.
  • Figure 1 is a flow chart of a training method for a picture-text mutual inspection model disclosed in this application;
  • Figure 2 is a schematic diagram of a text encoding module disclosed in this application.
  • Figure 3 is a schematic diagram of a specific image coding module disclosed in this application.
  • FIG. 4 is a schematic diagram of an attention module disclosed in this application.
  • Figure 5 is a schematic diagram of a specific positive and negative sample disclosed in this application.
  • Figure 6 is a flow chart of a specific image-text mutual inspection model training method disclosed in this application.
  • Figure 7 is a schematic diagram of a specific image-text mutual inspection model training disclosed in this application.
  • Figure 8 is a flow chart of a graphic-text interactive method disclosed in this application.
  • Figure 9 is a schematic structural diagram of a picture-text mutual inspection model training device disclosed in this application.
  • Figure 10 is a structural diagram of an electronic device disclosed in this application.
  • Figure 11 is a schematic structural diagram of a non-volatile readable storage medium disclosed in this application.
  • this application provides a picture-text mutual check model training and picture-text mutual check scheme, which can improve the performance of the picture-text mutual check model and thereby improve the accuracy of picture-text mutual check.
  • a picture-text mutual inspection model training method includes:
  • Step S11 Obtain a training data pair; the training data pair includes text training data and image training data, the text training data includes long text data, the long text data is text data containing multiple target texts, and the target text is a sentence or a phrase.
  • Step S12 Input the training data pairs into the initial model, and use the text coding module and the image coding module in the initial model to extract the text coding features of the text training data and the image coding features of the image training data;
  • the text coding module includes a multi-layer LSTM Network
  • the multi-layer LSTM network includes a first LSTM network layer and a second LSTM network layer.
  • the first LSTM network layer is used to obtain the characteristics of each target text based on the characteristics of each word in each target text
  • the second LSTM network layer Used to obtain features of long text data based on the features of each target text.
  • the first LSTM network layer includes multiple BiLSTM (Bi-directional Long-Short Term Memory) networks, and each BiLSTM network includes multiple BiLSTM units; different BiLSTM units are used to extract different Characteristics of words, different BiLSTM networks output the characteristics of different target texts; the second LSTM network layer includes multiple BiLSTM units, and the input of the BiLSTM units is the characteristics of the target text output by the corresponding BiLSTM network in the first LSTM network layer.
  • BiLSTM Bi-directional Long-Short Term Memory
  • the text encoding module also includes a word encoding layer, which is used to encode each word in the text training data, and the encoding of each word of different target texts in the long text data is input into the BiLSTM in different BiLSTM networks in the first LSTM network layer. unit.
  • the word encoding layer can be a transformer layer or a word 2 vector (that is, words are converted into vectors) layer. That is, the encoding of words in different target texts is input into different BiLSTM networks.
  • the second LSTM network layer may include two sub-network layers, the first sub-network layer includes multiple BiLSTM networks, and each BiLSTM network includes multiple BiLSTM unit, the input of the BiLSTM unit is the characteristics of the target text output by the corresponding BiLSTM network in the first LSTM network layer, and different BiLSTM networks output the characteristics of different sections of text.
  • the second sub-network layer includes multiple BiLSTM units. The input of the BiLSTM unit is the feature of the text segment output by the corresponding BiLSTM network in the first sub-network layer. The output of the second sub-network layer is the feature of the long text data.
  • the text training data can include multiple long text data.
  • the text encoding module includes multiple multi-layer LSTM networks, and each multi-layer LSTM network is used to obtain the characteristics of a long text data.
  • the text training data may also include short text data, which is text data containing a target text; correspondingly, the text encoding module also includes a short text feature extraction module for extracting features of the short text data.
  • short text data is text data containing a target text
  • the text encoding module also includes a short text feature extraction module for extracting features of the short text data.
  • one-word text can also be used as short text.
  • the embodiment of the present application splices the features of multiple long text data and the features of short text data to obtain the text coding features of the text training data.
  • text data and image data in the same paper can be extracted.
  • extract text data and image data from medical papers classify the text data based on semantics, and obtain various types of text data, including: abstract, keywords, and titles.
  • the abstract includes multiple sentences, and the keywords include multiple phrases. , are determined to be long text data, and the title is a sentence, which is determined to be short text data.
  • medical reports include many types, such as medical papers and so on.
  • Medical papers include the paper title, paper abstract, paper keywords and paper body. You can select the title, abstract, and keywords of a medical paper as the main components of text data, and images of medical records or images in the paper as image data.
  • Figure 2 is a schematic diagram of a text encoding module provided by an embodiment of the present application.
  • the abstract, keywords and title of the medical paper are the first text information, the second text information and the third text information respectively. Since the first text information is a paragraph composed of multiple sentences, and the second text information is composed of multiple sentence phrases, in order to encode the first text information and the second text information, this application proposes a cascaded LSTM structure, That is, a multi-layer LSTM network.
  • the input text data of the model includes the first text information, the second text information and the third text information. All words are encoded through the transformer layer.
  • the transformer can encode each word into a feature vector and become the representation of the word. . Different text information can correspond to different transformer layers.
  • the first text information is encoded through the transformer layer, and then for each sentence of the first text information, it is input into a different BiLSTM network.
  • the encoding of different words is input into different BiLSTM units in the BiLSTM network.
  • the first BiLSTM network layer The BiLSTM network in is used to extract the feature representation of each sentence of the first text information.
  • the feature of the first word or the feature of the last word of each sentence can be selected as the feature representation of the entire sentence.
  • Representation method for example, take the mean value of the character features output by all BiLSTM units in the BiLSTM network as the feature of the entire sentence. In this way, the feature representation of each sentence is obtained and combined into a new sequence.
  • the features of each sentence are respectively Input the BiLSTM unit in the second LSTM network layer to finally obtain the overall feature expression of the first text information.
  • a row of BiLSTM units forms a BiLSTM network.
  • the same strategy as for the first text message is adopted.
  • the basic transformer model is used to directly obtain features. In this way, three different types of text features are obtained, and the features of all text information are spliced.
  • e ttl , e ins , and e ing represent the characteristics of the third text information, the first text information, and the second text information respectively.
  • [] represents feature splicing, that is, features are connected end to end.
  • e rec represents the spliced features.
  • the spliced features are mapped through a fully connected layer to obtain a vector with the same dimension as the word.
  • the dimension of the word is the length of the encoding (vector) of the word, and the text encoding features of the text training data are obtained. Subsequently used to match with image coding features.
  • the formula is as follows:
  • e′ rec represents the text encoding feature of the text training data
  • fc represents the fully connected layer processing
  • the image training data is an image sequence;
  • the image coding module includes a backbone network and a BiLSTM network.
  • the embodiment of the present application can use the backbone network to extract the features of each image in the image sequence to obtain image features; input each image feature into the BiLSTM network, Get image coding features.
  • the image coding module also includes an attention structure.
  • each image feature can be input into the attention structure to obtain the attention weight of each image feature; and the weight of each image feature is determined based on the attention weight.
  • final features and input the final features into the BiLSTM network to obtain image encoding features.
  • FIG. 3 is a schematic diagram of a specific image encoding module disclosed in an embodiment of the present application.
  • the ResNet i.e. Residual Network, residual network
  • the ResNet backbone network can be used to extract the image features of each image, and the features of the ResNet network in the previous layer of the classification layer are obtained as each image. image features.
  • the image features are input into the BiLSTM network, and each image is input into a BiLSTM unit in the BiLSTM network to obtain the overall characteristics of the image sequence, that is, the image coding characteristics.
  • the formula is as follows:
  • image sequence also includes reverse order and sequential order. All contain temporal semantic information.
  • the embodiment of this application uses the above formula to encode it.
  • BiLSTM represents each BiLSTM unit of the BiLSTM network.
  • is the output of the i-th BiLSTM unit. represents the image feature, i represents the i-th image, ⁇ represents the order, ⁇ represents the reverse order, I represents the BiLSTM network including I images, ⁇ att () represents the attention structure, and fc represents the fully connected layer.
  • the average feature encoding output of the BiLSTM unit can be taken as the output of the BiLSTM network. The formula is as follows:
  • e csi represents image feature encoding.
  • the attention module designed in this application includes a fully connected layer, an attention module, a softmax layer, a multiplication module, and an addition module.
  • the attention module is shown in Figure 4 and includes two fully connected layers FC and one ReLU (i.e. Linear rectification function, linear rectification function) layer.
  • FC i.e. Linear rectification function, linear rectification function
  • the image obtains embedded features after passing through the backbone network, and the embedded features pass through a fully connected layer to obtain the final embedded feature e of each image.
  • the final embedded feature e will pass through the attention module to calculate the weight of each feature.
  • the weight is a number and is normalized through the sigmoid layer.
  • the weights of the features of all images will be uniformly entered into the softmax layer to determine which image is important.
  • the feature weight of the image after the softmax layer will be multiplied by the corresponding final embedded feature e of each image. That is to say, in the embodiment of the present application, the embodiment of the present application can use the backbone network to extract the features of each image in the image sequence to obtain the image features, input the image features into a fully connected layer to obtain the embedded features, and input each embedded feature into the attention layer. force module, obtain the attention weight of each image feature, and then process it through the softmax layer.
  • the final features of each image feature are determined, and the final features are input into the BiLSTM network to obtain the image coding features.
  • the embodiment of this application introduces the idea of residual network. For each image, the output of its attention structure is as follows:
  • Step S13 Calculate training loss based on text encoding features and image encoding features, and adjust parameters of the initial model based on the training loss.
  • the positive samples and negative samples corresponding to the anchor point samples can be determined for the N coding feature pairs of the N training data pairs in a batch; wherein the coding feature pairs are the text coding features and images of the training data pairs.
  • the anchor sample is any text coding feature or image coding feature among the N coding feature pairs.
  • the positive sample is another coding feature paired with the anchor sample.
  • the negative sample is N coding features. For all coding features except another coding feature, the training loss is calculated based on the anchor sample and the positive and negative samples corresponding to the anchor sample. The formula used is as follows:
  • a text encoding feature corresponds to an image encoding feature.
  • this application will traverse each image encoding feature and text encoding feature to find the average loss.
  • a total of N times are traversed, and N represents a total of N paired samples in this batch.
  • First encode the image features Traverse (N in total), traversing the selected image feature encoding is called a represents anchor (anchor sample).
  • anchor sample an anchor sample
  • the text feature encoding paired with the anchor sample is denoted as p stands for positive. In the same way, in this batch, All remaining unpaired samples are recorded as s np .
  • is a hyperparameter, fixed during training, and can be set to 0.4 in this application.
  • the same traversal operation is also performed for text feature encoding, represents the sample selected in the traversal, and its corresponding positive image feature encoding sample is recorded as Those that do not correspond are recorded as s np .
  • Figure 5 which is a specific schematic diagram of positive and negative samples disclosed in the embodiment of the present application. min represents the minimum value operation, is the target triplet loss,
  • Step S14 If the training loss meets the convergence condition, the initial model after parameter adjustment is determined as the image-text mutual detection model.
  • the embodiment of the present application obtains a training data pair; the training data pair includes text training data and image training data, the text training data includes long text data, the long text data is text data containing multiple target texts, and the target text is a sentence or a phrase. ; Then input the training data pairs into the initial model, and use the text coding module and image coding module in the initial model to extract the text coding features of the text training data and the image coding features of the image training data; among them, the text coding module includes a multi-layer LSTM network , the multi-layer LSTM network includes the first LSTM network layer and the second LSTM network layer.
  • the first LSTM network layer is used to obtain the characteristics of each target text based on the characteristics of each word in each target text; the second LSTM network layer is used To obtain the features of long text data based on the characteristics of each target text, and then calculate the training loss based on the text encoding features and image encoding features, and adjust the parameters of the initial model based on the training loss. If the training loss meets the convergence conditions, the parameters will be adjusted.
  • the final initial model is determined as the picture-text mutual inspection model.
  • a multi-layer LSTM network is used to extract the features of text training data, and the first LSTM network layer is first used to obtain the features of each sentence or phrase based on the features of each word in each target text, and then the features of each sentence or phrase are obtained using The second LSTM network layer obtains the characteristics of long text data based on the characteristics of each sentence or phrase.
  • the embodiment of the present application discloses a specific image-text mutual inspection model training method, which includes:
  • Step S21 Obtain a training data pair; the training data pair includes text training data and image training data, the text training data includes long text data, the long text data is text data containing multiple target texts, and the target text is a sentence or a phrase.
  • Step S22 Determine whether to shuffle the target long text data based on the preset probability; wherein the target long text data is long text data with a temporal relationship between sentences in the text training data.
  • Step S23 If it is determined that the target long text data is to be scrambled, then the target long text data is scrambled. Otherwise, the target long text data is not scrambled.
  • sentences with a preset proportion can be selected, and the positions of the selected sentences can be swapped to implement shuffling.
  • the first text information that is, the summary, in the foregoing embodiments usually has a context or time sequence relationship. If you scramble the sentences, you may not know what the abstract is about.
  • the text information is randomly selected to be scrambled or not scrambled with a probability of 50%. If the first text information is selected to be scrambled, 30% of the sentences of the first text information are randomly selected. The selected sentences in the first text information exchange positions with each other, and the unselected sentences remain unchanged in their original positions. New first text information can be obtained through the above replacement steps. That is, the first text information after scrambling.
  • Step S24 Add a label to the target long text data, where the label indicates whether the target long text data has been scrambled.
  • Step S25 Input the training data pairs into the initial model, and respectively use the text encoding module and the image encoding module in the initial model to extract the text encoding features of the text training data and the image encoding features of the image training data;
  • the text encoding module includes a multi-layer LSTM Network
  • the multi-layer LSTM network includes a first LSTM network layer and a second LSTM network layer.
  • the first LSTM network layer is used to obtain the characteristics of each target text based on the characteristics of each word in each target text
  • the second LSTM network layer Used to obtain features of long text data based on the features of each target text.
  • Step S26 Calculate training loss based on text encoding features and image encoding features, and adjust parameters of the initial model based on the training loss.
  • training loss calculation specifically includes the following steps:
  • Step 260 Calculate the training loss based on the anchor sample and the positive and negative samples corresponding to the anchor sample to calculate the target triplet loss.
  • step 260 Regarding the specific calculation process of the above-mentioned step 260, reference may be made to the content disclosed in the foregoing embodiments, and details will not be described again here.
  • Step 261 Calculate the timing constraint loss based on the characteristics of the target long text data and the labels of the target long text data.
  • B represents the batch size (batch processing size)
  • yi ⁇ 0, 1 ⁇ represents the true value label of whether the target long text data is scrambled
  • pi represents the probability value of using the characteristics of the target long text data to determine whether the target long text data is scrambled. Represents timing constraint loss.
  • Step 262 Calculate the training loss using the target triplet loss and the timing constraint loss.
  • the training loss is formulated as follows:
  • L represents the total training loss
  • Step S27 If the training loss meets the convergence condition, the initial model after parameter adjustment is determined as the image-text mutual detection model.
  • FIG. 7 is a schematic diagram of a specific image-text mutual detection model training disclosed in an embodiment of the present application.
  • the network is trained according to the above loss function to make it converge.
  • the network training process is divided into two stages. The first stage is the stage where data is propagated from low level to high level, that is, the forward propagation stage.
  • Another stage is the stage where the error is propagated from the high level to the bottom level, that is, the back propagation stage, when the results obtained by the forward propagation are not in line with expectations.
  • the specific training process is as follows: all network layer weights are initialized, generally using random initialization; the input image and text data are forward propagated through the neural network, convolution layer, downsampling layer, fully connected layer and other layers to obtain the output value; find Find the output value of the network and find the loss of the output value of the network. The loss is transmitted back to the network, and the back propagation error of each layer of the network is obtained in turn.
  • Each layer of the network adjusts all weight coefficients in the network based on the backpropagation error of each layer, that is, updates the weights.
  • the embodiment of the present application discloses a method for mutual inspection of images and texts, which includes:
  • Step S31 Obtain target data; where the target data is image data or text data;
  • Step S32 Input the target data into the image-text mutual detection model, so that the image-text mutual detection model can extract the target encoding features of the target data; wherein the image-text mutual detection model is obtained based on the image-text mutual detection model training method of the aforementioned embodiment.
  • Step S33 Match all data coding features of the data set to be retrieved to obtain the retrieval results.
  • the vector distance such as the Euclidean distance
  • the distance between the target encoding feature and all data encoding features can be calculated, and the data encoding feature with the smallest distance is determined as the retrieval result.
  • all data encoding features are text encoding features; if the target data is text data, all data encoding features are image encoding features.
  • the image-text mutual detection model is used to extract features from medical texts or medical images and store them in the data set to be retrieved.
  • the user gives any medical text data or medical image data, which is called query data.
  • Medical data source channels have diversified characteristics.
  • a large-scale medical multi-modal database is constructed and the data query mode in the medical field is optimized.
  • doctors use the database to query information, they can screen the desired information with a simple description, which makes the query method more convenient and saves labor costs and time costs.
  • the embodiments of this application are except for the field of medical paper retrieval. It can also be adapted to any multi-text type retrieval tasks, such as manual retrieval.
  • a picture-text mutual inspection model training device which includes:
  • the training data acquisition module 11 is used to acquire training data pairs;
  • the training data pairs include text training data and image training data, the text training data includes long text data, the long text data is text data containing multiple target texts, and the target text is a sentence. or phrase;
  • the feature extraction module 12 is used to input the training data pairs into the initial model, respectively using the text coding module and the image coding module in the initial model to extract the text coding features of the text training data and the image coding features of the image training data; wherein, the text coding module Including a multi-layer LSTM network, the multi-layer LSTM network includes a first LSTM network layer and a second LSTM network layer, the first LSTM network layer is used to obtain the characteristics of each target text based on the characteristics of each word in each target text; the first The second LSTM network layer is used to obtain the features of long text data based on the features of each target text;
  • Loss calculation module 13 used to calculate training loss based on text encoding features and image encoding features
  • Parameter adjustment module 14 used to adjust parameters of the initial model based on training loss
  • the image-text mutual detection model determination module 15 is used to determine the initial model after parameter adjustment as the image-text mutual detection model if the training loss satisfies the convergence condition.
  • the embodiment of the present application obtains a training data pair; the training data pair includes text training data and image training data, the text training data includes long text data, the long text data is text data containing multiple target texts, and the target text is a sentence or a phrase. ; Then input the training data pairs into the initial model, and use the text coding module and image coding module in the initial model to extract the text coding features of the text training data and the image coding features of the image training data; among them, the text coding module includes a multi-layer LSTM network , the multi-layer LSTM network includes the first LSTM network layer and the second LSTM network layer.
  • the first LSTM network layer is used to obtain the characteristics of each target text based on the characteristics of each word in each target text; the second LSTM network layer is used To obtain the features of long text data based on the characteristics of each target text, and then calculate the training loss based on the text encoding features and image encoding features, and adjust the parameters of the initial model based on the training loss. If the training loss meets the convergence conditions, the parameters will be adjusted.
  • the final initial model is determined as the picture-text mutual inspection model.
  • a multi-layer LSTM network is used to extract the features of text training data, and the first LSTM network layer is first used to obtain the features of each sentence or phrase based on the features of each word in each target text, and then the features of each sentence or phrase are obtained using The second LSTM network layer obtains the characteristics of long text data based on the characteristics of each sentence or phrase.
  • the first LSTM network layer includes multiple BiLSTM networks, and each BiLSTM network includes multiple BiLSTM units; different BiLSTM units are used to extract features of different words, and different BiLSTM networks output features of different target texts;
  • the second LSTM network layer includes There are multiple BiLSTM units, and the input of the BiLSTM unit is the feature of the target text output by the corresponding BiLSTM network in the first LSTM network layer.
  • the text training data includes multiple long text data.
  • the text encoding module includes multiple multi-layer LSTM networks, and each multi-layer LSTM network is used to obtain the characteristics of a long text data.
  • the text training data also includes short text data, which is text data containing a target text; correspondingly, the text encoding module also includes a short text feature extraction module, which is used to extract features of the short text data.
  • the feature extraction module 12 is specifically used to splice features of multiple long text data and features of short text data to obtain text coding features of the text training data.
  • the training data acquisition module 11 is specifically used to extract text data and image data in the same paper. Classify text data based on semantics to obtain various types of text data; determine each type of text data as long text data or short text data based on the number of target texts; determine each type of text data as text training in training data pairs data, and identifying the image data as the image training data in the training data pair.
  • the image training data is an image sequence
  • the image coding module includes a backbone network and a BiLSTM network.
  • the feature extraction module 12 is specifically used to extract features of each image in the image sequence using the backbone network to obtain image features. ; Input each image feature into the BiLSTM network to obtain image coding features.
  • the image coding module also includes an attention structure.
  • the feature extraction module 12 is specifically used to input each image feature into the attention structure to obtain the attention weight of each image feature; determine based on the attention weight
  • the final features of each image feature are input into the BiLSTM network to obtain the image encoding features.
  • the loss calculation module 13 is specifically used to determine the positive samples and negative samples corresponding to the anchor point samples for the N coding feature pairs of a batch of N training data pairs; wherein the coding feature pairs are training data A coding pair composed of a pair of text coding features and image coding features.
  • the anchor sample is any text coding feature or image coding feature among the N coding feature pairs.
  • the positive sample is another coding feature paired with the anchor sample.
  • the negative samples are all coding features in the N coding feature pairs except the other coding feature; the training loss is calculated based on the anchor sample and the positive and negative samples corresponding to the anchor sample.
  • the apparatus further includes:
  • the shuffling processing module is used to determine whether to shuffle the target long text data based on a preset probability; wherein the target long text data is long text data with a temporal relationship between sentences in the text training data; if it is determined that the target long text data If the data is scrambled, the target long text data will be scrambled, otherwise the target long text data will not be scrambled; a label will be added to the target long text data, and the label indicates whether the target long text data has been scrambled.
  • the loss calculation module 13 is specifically configured to calculate the timing constraint loss based on the characteristics of the target long text data and the tags of the target long text data. Calculate the training loss based on the anchor sample and the positive and negative samples corresponding to the anchor sample to calculate the target triplet loss; use the target triplet loss and timing constraint loss to calculate the training loss.
  • the embodiment of the present application discloses an electronic device 20, which includes a processor 21 and a memory 22; the memory 22 is used to save a computer program; the processor 21 is used to execute the computer program.
  • the aforementioned embodiment The disclosed image-text mutual inspection model training method, and/or the aforementioned image-text mutual inspection method.
  • the memory 22, as a carrier for resource storage may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., and the storage method may be short-term storage or permanent storage.
  • the electronic device 20 also includes a power supply 23, a communication interface 24, an input and output interface 25 and a communication bus 26; the power supply 23 is used to provide operating voltage for each hardware device on the electronic device 20; the communication interface 24 can provide the electronic device 20 with working voltage.
  • a data transmission channel with external devices and the communication protocol it follows is any communication protocol that can be applied to the technical solution of this application, which is not specifically limited here; the input and output interface 25 is used to obtain external input data or To output data to the outside world, the specific interface type can be selected according to specific application needs, and is not specifically limited here.
  • the embodiment of the present application also discloses a non-volatile readable storage medium 30 for storing the computer program 31, wherein the computer program 31 implements the disclosure of the foregoing embodiment when executed by the processor.
  • the image-text mutual inspection model training method, and/or the aforementioned image-text mutual inspection method are examples of the image-text mutual inspection model training method, and/or the aforementioned image-text mutual inspection method.
  • RAM random access memory
  • ROM read-only memory
  • electrically programmable ROM electrically erasable programmable ROM
  • registers hard disks, removable disks, CD-ROMs, or anywhere in the field of technology. any other known form of storage media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

本申请公开了图文互检模型训练方法及装置、图文互检方法、设备,应用于检索技术领域,包括:获取训练数据对(S11);将训练数据对输入初始模型,分别利用初始模型中的文本编码模块和图像编码模块提取文本训练数据的文本编码特征以及图像训练数据的图像编码特征(S12);基于文本编码特征和图像编码特征计算训练损失,并基于训练损失对初始模型进行参数调节(S13);若训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型(S14)。

Description

图文互检模型训练方法及装置、图文互检方法、设备
相关申请的交叉引用
本申请要求于2022年07月15日提交中国专利局,申请号为202210829134.9,申请名称为“图文互检模型训练方法及装置、图文互检方法、设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及检索技术领域,特别涉及图文互检模型训练方法及装置、图文互检方法、设备。
背景技术
随着信息时代的到来,检索面对的数据是海量的,并且,在海量的数据中,多模态的数据间也往往存在关联,比如文本数据和图像数据,而在一些场景中也存在基于文本数据检索图像数据,或基于图像数据检索文本数据的需求。
发明内容
有鉴于此,本申请的目的在于提供图文互检模型训练方法及装置、图文互检方法、设备,能够提升图文互检模型性能,进而提升图文互检的准确度。其具体方案如下:
第一方面,本申请公开了一种图文互检模型训练方法,包括:
获取训练数据对;训练数据对包括文本训练数据和图像训练数据,文本训练数据包括长文本数据,长文本数据为包含多个目标文本的文本数据,目标文本为句子或短语;
将训练数据对输入初始模型,分别利用初始模型中的文本编码模块和图像编码模块提取文本训练数据的文本编码特征以及图像训练数据的图像编码特征;其中,文本编码模块包括多层LSTM网络,多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,第一LSTM网络层用于基于每个目标文本中每个字的特征获取每个目标文本的特征;第二LSTM网络层用于基于每个目标文本的特征获取长文本数据的特征;
基于文本编码特征和图像编码特征计算训练损失,并基于训练损失对初始模型进行参数调节;
若训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。
可选的,第一LSTM网络层包括多个BiLSTM网络,每个BiLSTM网络包括多个BiLSTM单元;不同BiLSTM单元用于提取不同字的特征,不同BiLSTM网络输出不同目标文本的特征;
第二LSTM网络层包括多个BiLSTM单元,BiLSTM单元的输入为第一LSTM网络层中相应BiLSTM网络输出的目标文本的特征。
可选的,文本训练数据包括多个长文本数据,相应的,文本编码模块包括多个多层LSTM网络,每个多层LSTM网络用于获取一个长文本数据的特征。
可选的,文本训练数据还包括短文本数据,短文本数据为包含一个目标文本的文本数据;相应的,文本编码模块还包括短文本特征提取模块,用于提取短文本数据的特征。
可选的,利用初始模型中的文本编码模块提取文本训练数据的文本编码特征,包括:
对多个长文本数据的特征以及短文本数据的特征进行拼接,得到文本训练数据的文本编码特征。
可选的,获取训练数据对,包括:
提取同一论文中的文本数据和图像数据。
基于语义对文本数据分类,得到各类型的文本数据;
基于目标文本的数量将各类型的文本数据确定为长文本数据或短文本数据;
将各类型的文本数据确定为训练数据对中的文本训练数据,以及将图像数据确定为训练数据对中的图像训练数据。
可选的,图像训练数据为图像序列;图像编码模块包括骨干网络和BiLSTM网络,相应的,利用图像编码模块提取图像训练数据的图像编码特征,包括:
利用骨干网络提取图像序列中每张图像的特征,得到图像特征;
将各图像特征输入BiLSTM网络,得到图像编码特征。
可选的,图像编码模块还包括注意力结构,相应的,将图像特征输入BiLSTM网络,得到图像编码特征,包括:
将各图像特征输入注意力结构,得到每个图像特征的注意力权重;
基于注意力权重确定各图像特征的最终特征,并将最终特征输入BiLSTM网络,得到图像编码特征。
可选的,基于文本编码特征和图像编码特征计算训练损失,包括:
针对一个批次N个训练数据对的N个编码特征对,确定锚点样本对应的正样本和负样本;其中,编码特征对为训练数据对的文本编码特征和图像编码特征组成的编码对,锚点样本为N个编码特征对中的任一文本编码特征或图像编码特征,正样本为与锚点样本成对的另一编码特征,负样本为N个编码特征对中除另一编码特征外的所有编码特征;
基于锚点样本,以及锚点样本对应的正样本和负样本计算训练损失。
可选的,将训练数据对输入初始模型之前,还包括:
基于预设概率确定是否对目标长文本数据进行打乱处理;其中,目标长文本数据为文本训练数据中句子间具有时序关系的长文本数据;
若确定对目标长文本数据进行打乱处理,则对目标长文本数据进行打乱处理,否则不对目标长文本数据进行打乱处理;
为目标长文本数据添加标签,标签表征目标长文本数据是否经过打乱处理。
可选的,还包括:
基于目标长文本数据的特征以及目标长文本数据的标签计算时序约束损失。
可选的,基于锚点样本,以及锚点样本对应的正样本和负样本计算训练损失,包括:
基于锚点样本,以及锚点样本对应的正样本和负样本计算训练损失计算目标三元组损失;
利用目标三元组损失和时序约束损失计算训练损失。
第二方面,本申请公开了一种图文互检方法,包括:
获取目标数据;其中,目标数据为图像数据或文本数据;
将目标数据输入图文互检模型,以便图文互检模型提取出目标数据的目标编码特征;其中,图文互检模型基于前述的图文互检模型训练方法得到;
在待检索数据集的所有数据编码特征进行匹配,得到检索结果;
其中,若目标数据为图像数据,则所有数据编码特征均为文本编码特征,若目标数据为文本数据,则所有数据编码特征均为图像编码特征。
第三方面,本申请公开了一种图文互检模型训练装置,包括:
训练数据获取模块,用于获取训练数据对;训练数据对包括文本训练数据和图像训练数据,文本训练数据包括长文本数据,长文本数据为包含多个目标文本的文本数据,目标文本为句子或短语;
特征提取模块,用于将训练数据对输入初始模型,分别利用初始模型中的文本编码模块和图像编码模块提取文本训练数据的文本编码特征以及图像训练数据的图像编码特征;其中,文本编码模块包括多层LSTM网络,多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,第一LSTM网络层用于基于每个目标文本中每个字的特征获取每个目标文本的特征;第二LSTM网络层用于基于每个目标文本的特征获取长文本数据的特征;
损失计算模块,用于基于文本编码特征和图像编码特征计算训练损失;
参数调节模块,用于基于训练损失对初始模型进行参数调节;
图文互检模型确定模块,用于若训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。
第四方面,本申请公开了一种电子设备,包括存储器和处理器,其中:
存储器,用于保存计算机程序;
处理器,用于执行计算机程序,以实现前述的图文互检模型训练方法,和/或,前述的图文互检方法。
第五方面,本申请公开了一种非易失性可读存储介质,用于保存计算机程序,其中,计算机程序被处理器执行时实现前述的图文互检模型训练方法,和/或,前述的图文互检方法。
可见,本申请获取训练数据对;训练数据对包括文本训练数据和图像训练数据,文本训练数据包括长文本数据,长文本数据为包含多个目标文本的文本数据,目标文本为句子或短语;然后将训练数据对输入初始模型,分别利用初始模型中的文本编码模块和图像编码模块提取文本训练数据的文本编码特征以及图像训练数据的图像编码特征;其中,文本编码模块包括多层LSTM网络,多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,第一LSTM网络层用于基于每个目标文本中每个字的特征获取每个目标文本的特征;第二LSTM网络层用于基于每个目标文本的特征获取长文本数据的特征,之后基于文本编码特征和图像编码特征计算训练损失,并基于训练损失对初始模型进行参数调节,若训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。也即,本申请中利用多层LSTM网络提取文本训练数据的特征,并且先利用第一LSTM网络层基于每个目标文本中每个字的特征获取每个句子或短语的特征,然后利用第二LSTM网络层基于每个句子或短语的特征获取长文本数据的特征,这样,解决了长文本数据中距离较远的字、句子或短语之间的信息遗忘问题,得到更丰富的文本信息,能够提升图文互检模型性能,进而提升图文互检的准确度。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请公开的一种图文互检模型训练方法流程图;
图2为本申请公开的一种文本编码模块示意图;
图3为本申请公开的一种具体的图像编码模块示意图;
图4为本申请公开的一种注意力模块示意图;
图5为本申请公开的一种具体的正负样本示意图;
图6为本申请公开的一种具体的图文互检模型训练方法流程图;
图7为本申请公开的一种具体的图文互检模型训练示意图;
图8为本申请公开的一种图文互方法流程图;
图9为本申请公开的一种图文互检模型训练装置结构示意图;
图10为本申请公开的一种电子设备结构图;
图11为本申请公开的一种非易失性可读存储介质的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
随着信息时代的到来,检索面对的数据是海量的,并且,在海量的数据中,多模态的数据间也往往存在关联,比如文本数据和图像数据,而在一些场景中也存在基于文本数据检索图像数据,或基于图像数据检索文本数据的需求,因此,如何准确的进行图文互检是目前需要解决的问题。为此,本申请提供了一种图文互检模型训练、图文互检方案,能够提升图文互检模型性能,进而提升图文互检的准确度。
参见图1所示,一种图文互检模型训练方法,包括:
步骤S11:获取训练数据对;训练数据对包括文本训练数据和图像训练数据,文本训练数据包括长文本数据,长文本数据为包含多个目标文本的文本数据,目标文本为句子或短语。
步骤S12:将训练数据对输入初始模型,分别利用初始模型中的文本编码模块和图像编码模块提取文本训练数据的文本编码特征以及图像训练数据的图像编码特征;其中,文本编码模块包括多层LSTM网络,多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,第一LSTM网络层用于基于每个目标文本中每个字的特征获取每个目标文本的特征;第二LSTM网络层用于基于每个目标文本的特征获取长文本数据的特征。
在一些实施方式中,第一LSTM网络层包括多个BiLSTM(即Bi-directional Long-Short Term Memory,双向长短时记忆)网络,每个BiLSTM网络包括多个BiLSTM单元;不同BiLSTM单元用于提取不同字的特征,不同BiLSTM网络输出不同目标文本的特征;第二 LSTM网络层包括多个BiLSTM单元,BiLSTM单元的输入为第一LSTM网络层中相应BiLSTM网络输出的目标文本的特征。
并且,文本编码模块还包括字编码层,用于将文本训练数据中的每个字编码,长文本数据中不同目标文本的每个字的编码输入第一LSTM网络层中不同BiLSTM网络中的BiLSTM单元。其中,字编码层可以为transformer层或word 2 vector(即词转化为向量)层。也即,不同目标文本的字的编码输入不同BiLSTM网络。
在另一些实施方式中,若长文本数据为包括多段文本的文本数据,则第二LSTM网络层可以包括两个子网络层,第一子网络层包括多个BiLSTM网络,每个BiLSTM网络包括多个BiLSTM单元,BiLSTM单元的输入为第一LSTM网络层中相应BiLSTM网络输出的目标文本的特征,不同BiLSTM网络输出不同段文本的特征。第二子网络层包括多个BiLSTM单元,BiLSTM单元的输入为第一子网络层中相应BiLSTM网络输出的成段文本的特征,第二子网络层的输出为长文本数据的特征。
并且,文本训练数据可以包括多个长文本数据,相应的,文本编码模块包括多个多层LSTM网络,每个多层LSTM网络用于获取一个长文本数据的特征。
进一步的,文本训练数据还可以包括短文本数据,短文本数据为包含一个目标文本的文本数据;相应的,文本编码模块还包括短文本特征提取模块,用于提取短文本数据的特征。当然,在一些实施例中,一个字的文本也可以作为短文本。
相应的,本申请实施例对多个长文本数据的特征以及短文本数据的特征进行拼接,得到文本训练数据的文本编码特征。
并且,在一些实施方式中,可以提取同一论文中的文本数据和图像数据。基于语义对文本数据分类,得到各类型的文本数据;基于目标文本的数量将各类型的文本数据确定为长文本数据或短文本数据;将各类型的文本数据确定为训练数据对中的文本训练数据,以及将图像数据确定为训练数据对中的图像训练数据。
比如,对提取医学论文中的文本数据和图像数据,基于语义对文本数据分类,得到各类型的文本数据,包括:摘要、关键词和题目,将摘要包括多个句子,关键词包括多个短语,均确定为长文本数据,题目为一个句子,确定为短文本数据。可以理解的是医学报告包含众多种类,例如医学论文等等。医学论文包括论文题目、论文摘要、论文关键字和论文主体。可以选取医学论文的论文题目、论文摘要、论文关键字作为文字数据主要组成部分,病历图像或者论文中的图像作为图像数据。
例如,参见图2所示,图2为本申请实施例提供的一种文本编码模块示意图。医学论文的摘要、关键词和题目分别为第一文本信息、第二文本信息、第三文本信息。由于第一文本信息是由多句话组成的一段话,第二文本信息由多句短语组成,为实现对第一文本信息、第二文本信息的编码,本申请提出一种级联LSTM结构,也即多层LSTM网络。模型的输入文本数据包括第一文本信息、第二文本信息和第三文本信息,对于所有的字,均通过transformer层进行编码,transformer可以将每个字编码成为一个特征向量,成为该字的表示。不同的文本信息可以对应不同的transformer层。第一文本信息通过transformer层进行编码,之后对于第一文本信息的每一句话,将其输入到不同的BiLSTM网络中,不同字的编码输入BiLSTM网络中的不同BiLSTM单元,第一个BiLSTM网络层中的BiLSTM网络用于抽取第一文本信息的每一句话的特征表示,可以选取每句话的第一个字的特征或最后一个字 的特征作为整句话的特征表示,当然也有其它的特征表示方法,例如,取BiLSTM网络中所有BiLSTM单元输出的字的特征的均值作为整句话的特征,这样获取每句话的特征表示,将其组合成一个新的序列,每句话的特征分别输入第二个LSTM网络层中的BiLSTM单元,最终获取到第一文本信息总的特征表达。图2中一行BiLSTM单元组成一个BiLSTM网络。对于第二文本信息,采用与第一文本信息相同的策略。将第二文本信息送入其transformer layer,获取每个第二文本信息的embedding特征。依次送到第二文本信息对应的多层LSTM网络来获取第二文本信息的特征。对于第三文本信息,使用基本的transformer模型直接获取特征。这样获取了3种不同类型的文本特征,对所有文本信息的特征进行特征拼接,如图2所示,对不同的特征向量进行首尾连接,拼接成一个更长的向量。最后,拼接后的向量通过一个全连接层,进行特征映射,映射到合适的维度,也即字的编码的维度,得到文本编码特征,用于与图像数据的图像编码特征进行损失计算,来对模型进行训练。公式如下:
e rec=[e ttl,e ins,e ing]
其中,e ttl,e ins,e ing分别表示第三文本信息、第一文本信息、第二文本信息的特征。[]代表特征拼接,即特征首尾相连。e rec表示拼接后特征,拼接后特征经过一个全连接层进行特征映射,得到与字的维度相同的向量,字的维度即字的编码(向量)的长度,得到文本训练数据的文本编码特征,后续用于和图像编码特征进行匹配。公式如下:
e′ rec=fc(e rec)
其中,e′ rec表示文本训练数据的文本编码特征,fc表示全连接层处理。
进一步的,图像训练数据为图像序列;图像编码模块包括骨干网络和BiLSTM网络,本申请实施例可以利用骨干网络提取图像序列中每张图像的特征,得到图像特征;将各图像特征输入BiLSTM网络,得到图像编码特征。
并且,在一些实施方式中,图像编码模块还包括注意力结构,本申请实施例可以将各图像特征输入注意力结构,得到每个图像特征的注意力权重;基于注意力权重确定各图像特征的最终特征,并将最终特征输入BiLSTM网络,得到图像编码特征。
例如,参见图3所示,图3为本申请实施例公开的一种具体的图像编码模块示意图。在具体的实施方式中,可以采用ResNet(即Residual Network,残差网络)骨干网络(backbone)提取每一张图像的图像特征,获取ResNet网络在分类层前一层的特征做为每一张图像的图像特征。将图像特征输入到BiLSTM网络,每张图像输入BiLSTM网络中的一个BiLSTM单元,获取图像序列的总体特征即图像编码特征。公式如下:
h_i^→ = BiLSTM(fc(φ_att(v_i)), h_(i-1)^→),  i = 1, …, I
h_i^← = BiLSTM(fc(φ_att(v_i)), h_(i+1)^←),  i = 1, …, I
同上,图像序列也包含顺序和逆序两种遍历方向,都隐含着时序语义信息,本申请实施例用如上公式对其进行编码。其中,BiLSTM代表BiLSTM网络的每一个BiLSTM单元,h_i为第i个BiLSTM单元的输出,v_i表示图像特征,i表示第i张图像,→表示顺序,←表示逆序,I表示BiLSTM网络的输入包括I张图像,φ_att()表示注意力结构,fc表示全连接层。在一些实施方式中,可以取BiLSTM单元的特征编码输出平均值作为BiLSTM网络的输出。公式如下:
e_csi = (1/I)·Σ_{i=1}^{I} h_i
其中,e_csi表示图像特征编码,h_i为第i个BiLSTM单元的输出(包含顺序与逆序两个方向)。
进一步的,在实际应用中,图像数据很多都是序列图像,比如医学图像,序列图像中图像的重要性不同,本申请设计注意力结构,对图像序列进行筛选,从而使BiLSTM能更集中于有用的信息。本申请设计的attention(注意力)模块包括全连接层、注意力模块、softmax层、乘法模块、加法模块,注意力模块如图4所示,包含两个全连接层FC和一个ReLU(即Linear rectification function,线性整流函数)层。在本申请中,图像经过骨干网络backbone后获得嵌入式特征,嵌入式特征经过一个全连接层以后获得每张图像的最终的嵌入特征e。最终的嵌入特征e会经过attention(注意力)模块,计算每个特征的权重,该权重是一个数,经过sigmoid层进行归一化。所有图像的特征的权重会统一进入softmax层,来判别哪一个图像是重要的。最终,经过softmax层后的图像的特征权重会与对应的每张图像的最终的嵌入特征e相乘。也即,本申请实施例中,可以利用骨干网络提取图像序列中每张图像的特征,得到图像特征,将图像特征输入一个全连接层,得到嵌入特征,将各嵌入特征输入注意力模块,得到每个图像特征的注意力权重,然后经过softmax层处理,基于softmax层处理后的注意力权重确定各图像特征的最终特征,并将最终特征输入BiLSTM网络,得到图像编码特征。本申请实施例引入了残差网络的思想,对于每个图像而言,其注意力结构的输出如下公式所示:
e′_i = w_i·e_i + e_i
其中,e_i为第i张图像的嵌入特征,w_i为经softmax层处理后的注意力权重。然后通过fc(即全连接层),即有:
e″_i = fc(e′_i)
然后输入BiLSTM网络,得到图像编码特征。
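为便于理解上述"骨干网络+注意力+BiLSTM"的图像编码流程,下面给出一个基于PyTorch的简化示意(骨干网络以torchvision的ResNet18为例,注意力模块按"FC-ReLU-FC"加sigmoid与softmax的描述搭建,维度等均为示例性假设,并非本申请的正式实现):

import torch
import torch.nn as nn
from torchvision import models

class ImageSeqEncoder(nn.Module):
    # ResNet骨干提取每张图像特征 -> 注意力加权(残差) -> BiLSTM -> 均值汇聚
    def __init__(self, embed_dim=256, hidden_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])   # 取分类层前一层的特征
        self.fc_embed = nn.Linear(512, embed_dim)                         # 得到每张图像的嵌入特征e
        self.attn = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # FC-ReLU-FC注意力模块
        self.fc_out = nn.Linear(embed_dim, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, images):
        # images: [I, 3, H, W] 的图像序列
        feat = self.backbone(images).flatten(1)   # [I, 512]
        e = self.fc_embed(feat)                   # [I, embed_dim]
        w = torch.sigmoid(self.attn(e))           # 每张图像的注意力权重(标量),sigmoid归一化
        w = torch.softmax(w, dim=0)               # 在序列维度上做softmax,判别哪张图像更重要
        e_att = self.fc_out(w * e + e)            # 残差思想:加权特征与原特征相加,再过全连接层
        out, _ = self.bilstm(e_att.unsqueeze(0))  # [1, I, 2*hidden_dim]
        return out.mean(dim=1)                    # 取各BiLSTM单元输出的均值作为图像编码特征

encoder = ImageSeqEncoder()
dummy_images = torch.randn(5, 3, 224, 224)        # 5张图像组成的序列(随机数据)
img_feature = encoder(dummy_images)               # torch.Size([1, 256])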
步骤S13:基于文本编码特征和图像编码特征计算训练损失,并基于训练损失对初始模型进行参数调节。
在一些实施方式中,可以针对一个批次N个训练数据对的N个编码特征对,确定锚点样本对应的正样本和负样本;其中,编码特征对为训练数据对的文本编码特征和图像编码特征组成的编码对,锚点样本为N个编码特征对中的任一文本编码特征或图像编码特征,正样本为与锚点样本成对的另一编码特征,负样本为N个编码特征对中除另一编码特征外的所有编码特征;基于锚点样本,以及锚点样本对应的正样本和负样本计算训练损失。采用公式如下:
L_trip = (1/N)·Σ_{a=1}^{N} [ max(0, Δ + ||s_a^img − s_p^txt|| − min_np ||s_a^img − s_np||) + max(0, Δ + ||s_a^txt − s_p^img|| − min_np ||s_a^txt − s_np||) ]
本申请中的训练数据是成对出现的,一个文本编码特征对应一个图像编码特征。在loss函数设计中,对于这种成对的数据,本申请会遍历每一个图像编码特征和文本编码特征求取损失的平均值,如上公式所示。共遍历N次,N代表在本batch(批次)中,共有N个成对的样本。首先对图像特征编码进行遍历(共N个),遍历选中的图像特征编码称为s_a^img,a代表anchor(锚点样本);与锚点样本成对的文本特征编码记为s_p^txt,p代表positive;在本batch中与s_a^img不配对的其余所有样本记为s_np。Δ是超参数,在训练时固定,本申请可以设置为0.4。同理,对于文本特征编码也做相同的遍历操作,s_a^txt代表遍历中被选中的那个样本,与其对应的正图像特征编码样本记为s_p^img,不对应的记为s_np。参见图5所示,图5为本申请实施例公开的一种具体的正负样本示意图。min表示求最小值运算,L_trip为目标三元组损失,|| ||表示求距离。在一些实施方式中,目标三元组损失即为训练损失,可以用以上loss函数在训练中进行梯度反传,对级联Transformer、BiLSTM、ResNet网络参数进行更新。
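为便于理解上述泛三元组损失的计算方式,下面给出一个基于PyTorch的简化示意(以欧氏距离、batch内最难负样本为例,Δ取0.4;该写法是按前述符号含义整理的假设性实现,并非本申请公式的逐字对应):

import torch

def triplet_loss(img_feats, txt_feats, delta=0.4):
    # img_feats, txt_feats: [N, D],第i行互为正样本对
    N = img_feats.size(0)
    dist = torch.cdist(img_feats, txt_feats)                  # [N, N] 两两欧氏距离
    pos = dist.diag()                                         # 与锚点成对的正样本距离
    mask = torch.eye(N, dtype=torch.bool, device=dist.device)
    neg_i2t = dist.masked_fill(mask, float('inf')).min(dim=1).values  # 图像为锚点时的最难负样本
    neg_t2i = dist.masked_fill(mask, float('inf')).min(dim=0).values  # 文本为锚点时的最难负样本
    loss = torch.clamp(delta + pos - neg_i2t, min=0) + torch.clamp(delta + pos - neg_t2i, min=0)
    return loss.mean()

img = torch.randn(8, 256)
txt = torch.randn(8, 256)
print(triplet_loss(img, txt))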
步骤S14:若训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。
在一些实施方式中,可以判断训练损失是否小于预设阈值,若训练损失小于预设阈值,则判定训练损失满足收敛条件;若训练损失不小于预设阈值,则判定训练损失没有满足收敛条件。
可见,本申请实施例获取训练数据对;训练数据对包括文本训练数据和图像训练数据,文本训练数据包括长文本数据,长文本数据为包含多个目标文本的文本数据,目标文本为句子或短语;然后将训练数据对输入初始模型,分别利用初始模型中的文本编码模块和图像编码模块提取文本训练数据的文本编码特征以及图像训练数据的图像编码特征;其中,文本编码模块包括多层LSTM网络,多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,第一LSTM网络层用于基于每个目标文本中每个字的特征获取每个目标文本的特征;第二LSTM网络层用于基于每个目标文本的特征获取长文本数据的特征,之后基于文本编码特征和图像编码特征计算训练损失,并基于训练损失对初始模型进行参数调节,若训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。也即,本申请实施例中利用多层LSTM网络提取文本训练数据的特征,并且先利用第一LSTM网络层基于每个目标文本中每个字的特征获取每个句子或短语的特征,然后利用第二LSTM网络层基于每个句子或短语的特征获取长文本数据的特征,这样,解决了长文本数据中距离较远的字、句子或短语之间的信息遗忘问题,得到更丰富的文本信息,能够提升图文互检模型性能,进而提升图文互检的准确度。
参见图6所示,本申请实施例公开了一种具体的图文互检模型训练方法,包括:
步骤S21:获取训练数据对;训练数据对包括文本训练数据和图像训练数据,文本训练数据包括长文本数据,长文本数据为包含多个目标文本的文本数据,目标文本为句子或短语。
步骤S22:基于预设概率确定是否对目标长文本数据进行打乱处理;其中,目标长文本数据为文本训练数据中句子间具有时序关系的长文本数据。
步骤S23:若确定对目标长文本数据进行打乱处理,则对目标长文本数据进行打乱处理,否则不对目标长文本数据进行打乱处理。
在一些实施方式中,可以选择预设比例的句子,对选择的句子进行位置调换,实现打乱处理。
需要指出的是,对于前述实施例中的第一文本信息也即摘要,通常是有上下文或者时间先后关系的,如果打乱句子,可能无法知道摘要的具体内容是什么。在一些实施方式中,对于第一文本信息,以50%的概率随机选择该文本信息被打乱或者不打乱。若第一文本信息被选择为打乱,则从第一文本信息的句子中随机抽取30%的句子,被抽中的句子相互调换位置,未被抽中的句子在原位置不动。通过上面的调换步骤可以获得新的第一文本信息,也即打乱处理后的第一文本信息。
步骤S24:为目标长文本数据添加标签,标签表征目标长文本数据是否经过打乱处理。
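为便于理解上述打乱与打标签的过程,下面给出一个Python示意(其中50%、30%等比例沿用前文示例,函数名与数据均为假设性示例,并非本申请限定的实现):

import random

def maybe_shuffle(sentences, shuffle_prob=0.5, ratio=0.3):
    # sentences: 句子间具有时序关系的长文本(句子列表);返回(新句子列表, 是否被打乱的标签)
    sents = list(sentences)
    if len(sents) < 2 or random.random() >= shuffle_prob:
        return sents, 0                                   # 不打乱,标签记为0
    k = min(len(sents), max(2, round(len(sents) * ratio)))
    idx = random.sample(range(len(sents)), k)             # 随机抽取约30%的句子位置
    for src, dst in zip(idx, idx[1:] + idx[:1]):
        sents[dst] = sentences[src]                       # 被抽中的句子相互调换位置,未被抽中的句子不动
    return sents, 1                                       # 打乱,标签记为1

new_abstract, label = maybe_shuffle(["句子A。", "句子B。", "句子C。", "句子D。"])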
步骤S25:将训练数据对输入初始模型,分别利用初始模型中的文本编码模块和图像编码模块提取文本训练数据的文本编码特征以及图像训练数据的图像编码特征;其中,文本编码模块包括多层LSTM网络,多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,第一LSTM网络层用于基于每个目标文本中每个字的特征获取每个目标文本的特征;第二LSTM网络层用于基于每个目标文本的特征获取长文本数据的特征。
步骤S26:基于文本编码特征和图像编码特征计算训练损失,并基于训练损失对初始模型进行参数调节。
在一些实施方式中,训练损失计算具体包括以下步骤:
步骤260:基于锚点样本,以及锚点样本对应的正样本和负样本计算目标三元组损失。
关于上述步骤260的具体计算过程可以参考前述实施例公开的内容,在此不再进行赘述。
步骤261:基于目标长文本数据的特征以及目标长文本数据的标签计算时序约束损失。
在一些实施方式中,采用的公式如下:
L_tem = −(1/B)·Σ_{i=1}^{B} [ y_i·log(p_i) + (1−y_i)·log(1−p_i) ]
其中,B代表batchsize(批处理尺寸),y_i∈{0,1}代表目标长文本数据是否被打乱的真值标签,p_i代表用目标长文本数据的特征来判断目标长文本数据是否被打乱的概率值,L_tem表示时序约束损失。
步骤262:利用目标三元组损失和时序约束损失计算训练损失。
在一些实施方式中,训练损失的公式如下:
L = L_trip + L_tem
其中,L代表总的训练损失,L_trip为目标三元组损失,L_tem为时序约束损失。
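时序约束损失本质上是对"目标长文本数据是否被打乱"这一二分类问题的交叉熵损失,总训练损失可按如下方式示意(两项损失此处按直接相加示意,是否加权求和为假设):

import torch
import torch.nn.functional as F

def temporal_loss(probs, labels):
    # probs: [B] 由长文本特征预测"被打乱"的概率;labels: [B] 真值标签(0或1)
    return F.binary_cross_entropy(probs, labels.float())

def total_loss(trip_loss, tem_loss):
    return trip_loss + tem_loss   # 假设两项损失直接相加

probs = torch.sigmoid(torch.randn(8))
labels = torch.randint(0, 2, (8,))
print(temporal_loss(probs, labels))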
步骤S27:若训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。
例如,参见图7所示,图7为本申请实施例公开的一种具体的图文互检模型训练示意图。构建基于级联LSTM的图像文本检索网络,包括文本编码模块和图像特征编码模块;建立泛三元组损失也即目标三元组损失;建立时序约束损失函数;根据如上损失函数对网络进行训练,使其收敛。网络训练过程分为两个阶段:第一个阶段是数据由低层次向高层次传播的阶段,即前向传播阶段;另外一个阶段是,当前向传播得出的结果与预期不相符时,将误差从高层次向低层次进行传播训练的阶段,即反向传播阶段。训练过程具体为:所有网络层权值进行初始化,一般采用随机初始化;输入图像和文本数据经过神经网络、卷积层、下采样层、全连接层等各层的前向传播得到输出值;求取网络输出值的损失;将损失反向传回网络中,依次求得网络各层的反向传播误差;网络各层根据各层的反向传播误差对网络中的所有权重系数进行调整,即进行权重的更新;重新随机选取新的batch的图像文本数据,经网络前向传播得到输出值;如此往复迭代,当网络输出值对应的损失小于某个阈值,或者迭代次数超过某个阈值时,结束训练;保存训练好的所有层的网络参数。
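上述"前向传播—求损失—反向传播—更新权重—判断收敛"的训练流程,可用如下简化的PyTorch训练循环示意(其中text_encoder、image_encoder、triplet_loss等均沿用前文示意中的假设组件,并假设编码器已封装为可直接接收batch输入,阈值与学习率为示例取值):

import torch

def train(text_encoder, image_encoder, loader, epochs=10, threshold=0.05, lr=1e-4):
    params = list(text_encoder.parameters()) + list(image_encoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)   # 各层权值由PyTorch默认随机初始化
    for epoch in range(epochs):
        for texts, images in loader:              # 随机选取新的batch
            txt_feat = text_encoder(texts)        # 前向传播得到文本编码特征
            img_feat = image_encoder(images)      # 前向传播得到图像编码特征
            loss = triplet_loss(img_feat, txt_feat)   # 示例仅含三元组损失,可再加上时序约束损失
            optimizer.zero_grad()
            loss.backward()                       # 误差反向传播
            optimizer.step()                      # 更新各层权重系数
            if loss.item() < threshold:           # 损失小于阈值则结束训练
                return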
可以理解的是,通过本申请提供的时序约束损失函数,能够进一步捕捉句子之间的上下文关系和时序依赖,能够更抽象的抽取句子之间的逻辑关系,得到更丰富的文本信息,从而进一步提升图文互检模型性能,提升图文互检的准确度。
参见图8所示,本申请实施例公开了一种图文互检方法,包括:
步骤S31:获取目标数据;其中,目标数据为图像数据或文本数据;
步骤S32:将目标数据输入图文互检模型,以便图文互检模型提取出目标数据的目标编码特征;其中,图文互检模型基于前述实施例的图文互检模型训练方法得到。
步骤S33:将目标编码特征与待检索数据集的所有数据编码特征进行匹配,得到检索结果。
在一些实施例中,可以计算目标编码特征与所有数据编码特征之间的向量距离,比如欧氏距离,将距离最小的数据编码特征确定为检索结果。
其中,若目标数据为图像数据,则所有数据编码特征均为文本编码特征,若目标数据为文本数据,则所有数据编码特征均为图像编码特征。
例如,利用图文互检模型对医学文本或医学图像进行特征提取,存入待检索数据集中。用户给定任意医学文本数据或医学图像数据,称为query数据。利用图文互检模型提取query数据的特征,将query数据的特征与待检索数据集中所有样本特征进行距离匹配,即求向量距离,比如求欧氏距离。若query数据是医学文本数据,就与待检索数据集中所有的医学图像特征求距离;同理,若query数据是医学图像数据,则与待检索数据集中所有的医学文本特征求欧氏距离。距离最小的样本即为推荐样本,进行输出。
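检索阶段的距离匹配可用如下Python示意理解(以欧氏距离、返回距离最小的样本为例,gallery特征与query特征假设为前述模型输出的向量):

import torch

def retrieve(query_feat, gallery_feats, gallery_items):
    # query_feat: [1, D];gallery_feats: [M, D];gallery_items: 与特征一一对应的样本列表
    dists = torch.cdist(query_feat, gallery_feats).squeeze(0)   # 与待检索集中所有样本求欧氏距离
    best = int(torch.argmin(dists))                              # 距离最小的样本即为推荐样本
    return gallery_items[best], dists[best].item()

gallery = torch.randn(100, 256)
items = [f"doc_{i}" for i in range(100)]
query = torch.randn(1, 256)
print(retrieve(query, gallery, items))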
需要指出的是,医学图像图文数据库及图文报告系统对于信息检索、人才培养、数据挖掘和保护具有重要的价值。随着信息时代的到来,数字化、标准化、网络化作业已经进入医学影像界,全新的数字化影像技术陆续应用于临床,如CT(即Computed Tomography,电子计算机断层扫描)、MR(即Magnetic Resonance,磁共振)、DSA(即Digital Subtraction Angiography,数字减影血管造影)、PET(即Positron Emission Computed Tomography,正电子发射型计算机断层显像)、CR(即Computed Radiography,计算机X射线摄影)、DR(即Digital Radiography,数字X线摄影术)等,医学影像诊断设备的网络化,影像诊断报告的计算机化、标准化、规范化,已逐步成为医学影像检查科室的必然发展趋势。实现基于海量医学报告的简洁易用的医学影像报告和数据管理系统,让更多的医学影像医师体验高新技术和现代化设备带来的方便快捷,使其能够方便地查阅和找寻病历、学习众多疑难影像学知识,具有重要的价值。医学数据来源渠道存在多样化的特点,通过本申请提供的方案,可以构建大型的医疗多模态数据库,优化医学领域的资料查询模式:医生利用数据库查询资料时,只需简单描述就能筛查到想要的资料,这就使得查询方法更为便捷,节省人力成本和时间成本。并且,本申请实施例除了医学论文检索领域外,还可以适用于任何多文本类型的检索任务中,如说明书检索。
参见图9所示,本申请实施例公开了一种图文互检模型训练装置,包括:
训练数据获取模块11,用于获取训练数据对;训练数据对包括文本训练数据和图像训练数据,文本训练数据包括长文本数据,长文本数据为包含多个目标文本的文本数据,目标文本为句子或短语;
特征提取模块12,用于将训练数据对输入初始模型,分别利用初始模型中的文本编码模块和图像编码模块提取文本训练数据的文本编码特征以及图像训练数据的图像编码特征;其中,文本编码模块包括多层LSTM网络,多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,第一LSTM网络层用于基于每个目标文本中每个字的特征获取每个目标文本的特征;第二LSTM网络层用于基于每个目标文本的特征获取长文本数据的特征;
损失计算模块13,用于基于文本编码特征和图像编码特征计算训练损失;
参数调节模块14,用于基于训练损失对初始模型进行参数调节;
图文互检模型确定模块15,用于若训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。
可见,本申请实施例获取训练数据对;训练数据对包括文本训练数据和图像训练数据,文本训练数据包括长文本数据,长文本数据为包含多个目标文本的文本数据,目标文本为句子或短语;然后将训练数据对输入初始模型,分别利用初始模型中的文本编码模块和图像编码模块提取文本训练数据的文本编码特征以及图像训练数据的图像编码特征;其中,文本编码模块包括多层LSTM网络,多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,第一LSTM网络层用于基于每个目标文本中每个字的特征获取每个目标文本的特征;第二LSTM网络层用于基于每个目标文本的特征获取长文本数据的特征,之后基于文本编码特征和图像编码特征计算训练损失,并基于训练损失对初始模型进行参数调节,若训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。也即,本申请实施例中利用多层LSTM网络提取文本训练数据的特征,并且先利用第一LSTM网络层基于每个目标文本中每个 字的特征获取每个句子或短语的特征,然后利用第二LSTM网络层基于每个句子或短语的特征获取长文本数据的特征,这样,解决了长文本数据中距离较远的字、句子或短语之间的信息遗忘问题,得到更丰富的文本信息,能够提升图文互检模型性能,进而提升图文互检的准确度。
其中,第一LSTM网络层包括多个BiLSTM网络,每个BiLSTM网络包括多个BiLSTM单元;不同BiLSTM单元用于提取不同字的特征,不同BiLSTM网络输出不同目标文本的特征;第二LSTM网络层包括多个BiLSTM单元,BiLSTM单元的输入为第一LSTM网络层中相应BiLSTM网络输出的目标文本的特征。
其中,文本训练数据包括多个长文本数据,相应的,文本编码模块包括多个多层LSTM网络,每个多层LSTM网络用于获取一个长文本数据的特征。
并且,文本训练数据还包括短文本数据,短文本数据为包含一个目标文本的文本数据;相应的,文本编码模块还包括短文本特征提取模块,用于提取短文本数据的特征。
进一步的,特征提取模块12,具体用于对多个长文本数据的特征以及短文本数据的特征进行拼接,得到文本训练数据的文本编码特征。
在一些实施方式中,训练数据获取模块11,具体用于提取同一论文中的文本数据和图像数据。基于语义对文本数据分类,得到各类型的文本数据;基于目标文本的数量将各类型的文本数据确定为长文本数据或短文本数据;将各类型的文本数据确定为训练数据对中的文本训练数据,以及将图像数据确定为训练数据对中的图像训练数据。
在一些实施方式中,图像训练数据为图像序列;图像编码模块包括骨干网络和BiLSTM网络,相应的,特征提取模块12,具体用于利用骨干网络提取图像序列中每张图像的特征,得到图像特征;将各图像特征输入BiLSTM网络,得到图像编码特征。
在一些实施方式中,图像编码模块还包括注意力结构,相应的,特征提取模块12,具体用于将各图像特征输入注意力结构,得到每个图像特征的注意力权重;基于注意力权重确定各图像特征的最终特征,并将最终特征输入BiLSTM网络,得到图像编码特征。
在一些实施方式中,损失计算模块13,具体用于针对一个批次N个训练数据对的N个编码特征对,确定锚点样本对应的正样本和负样本;其中,编码特征对为训练数据对的文本编码特征和图像编码特征组成的编码对,锚点样本为N个编码特征对中的任一文本编码特征或图像编码特征,正样本为与锚点样本成对的另一编码特征,负样本为N个编码特征对中除另一编码特征外的所有编码特征;基于锚点样本,以及锚点样本对应的正样本和负样本计算训练损失。
在另外一些实施方式中,所述装置还包括:
打乱处理模块,用于基于预设概率确定是否对目标长文本数据进行打乱处理;其中,目标长文本数据为文本训练数据中句子间具有时序关系的长文本数据;若确定对目标长文本数据进行打乱处理,则对目标长文本数据进行打乱处理,否则不对目标长文本数据进行打乱处理;为目标长文本数据添加标签,标签表征目标长文本数据是否经过打乱处理。
相应的,损失计算模块13,具体用于基于目标长文本数据的特征以及目标长文本数据的标签计算时序约束损失;基于锚点样本,以及锚点样本对应的正样本和负样本计算目标三元组损失;利用目标三元组损失和时序约束损失计算训练损失。
参见图10所示,本申请实施例公开了一种电子设备20,包括处理器21和存储器22;其中,存储器22,用于保存计算机程序;处理器21,用于执行计算机程序,以实现前述实施例公开的图文互检模型训练方法,和/或,前述的图文互检方法。
关于上述图文互检模型训练方法,和/或,前述的图文互检方法的具体过程可以参考前述实施例中公开的相应内容,在此不再进行赘述。
并且,存储器22作为资源存储的载体,可以是只读存储器、随机存储器、磁盘或者光盘等,存储方式可以是短暂存储或者永久存储。
另外,电子设备20还包括电源23、通信接口24、输入输出接口25和通信总线26;其中,电源23用于为电子设备20上的各硬件设备提供工作电压;通信接口24能够为电子设备20创建与外界设备之间的数据传输通道,其所遵循的通信协议是能够适用于本申请技术方案的任意通信协议,在此不对其进行具体限定;输入输出接口25,用于获取外界输入数据或向外界输出数据,其具体的接口类型可以根据具体应用需要进行选取,在此不进行具体限定。
进一步的,参见图11所示,本申请实施例还公开了一种非易失性可读存储介质30,用于保存计算机程序31,其中,计算机程序31被处理器执行时实现前述实施例公开的图文互检模型训练方法,和/或,前述的图文互检方法。
关于上述图文互检模型训练方法,和/或,前述的图文互检方法的具体过程可以参考前述实施例中公开的相应内容,在此不再进行赘述。
本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其它实施例的不同之处,各个实施例之间相同或相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。
结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块,或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。
以上对本申请所提供的图文互检模型训练方法及装置、图文互检方法、设备进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (20)

  1. 一种图文互检模型训练方法,其特征在于,包括:
    获取训练数据对;所述训练数据对包括文本训练数据和图像训练数据,所述文本训练数据包括长文本数据,所述长文本数据为包含多个目标文本的文本数据,所述目标文本为句子或短语;
    将所述训练数据对输入初始模型,分别利用所述初始模型中的文本编码模块和图像编码模块提取所述文本训练数据的文本编码特征以及所述图像训练数据的图像编码特征;其中,所述文本编码模块包括多层LSTM网络,所述多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,所述第一LSTM网络层用于基于每个所述目标文本中每个字的特征获取每个所述目标文本的特征;所述第二LSTM网络层用于基于每个所述目标文本的特征获取所述长文本数据的特征;
    基于所述文本编码特征和所述图像编码特征计算训练损失,并基于所述训练损失对所述初始模型进行参数调节;
    若所述训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。
  2. 根据权利要求1所述的图文互检模型训练方法,其特征在于,所述第一LSTM网络层包括多个BiLSTM网络,每个BiLSTM网络包括多个BiLSTM单元;不同所述BiLSTM单元用于提取不同字的特征,不同BiLSTM网络输出不同目标文本的特征;
    所述第二LSTM网络层包括多个BiLSTM单元,BiLSTM单元的输入为所述第一LSTM网络层中相应BiLSTM网络输出的目标文本的特征。
  3. 根据权利要求2所述的图文互检模型训练方法,其特征在于,所述文本编码模块还包括字编码层,用于将所述长文本数据中不同目标文本的每个字的编码输入所述第一LSTM网络层中不同BiLSTM网络中的BiLSTM单元。
  4. 根据权利要求2所述的图文互检模型训练方法,其特征在于,所述长文本数据为包括多段文本的文本数据,所述第二LSTM网络层包括两个子网络层,第一子网络层包括多个BiLSTM网络,每个所述第一子网络层中相应BiLSTM网络包括多个BiLSTM单元,所述第一子网络层中相应BiLSTM单元的输入为所述第一LSTM网络层中相应BiLSTM网络输出的目标文本的特征,不同所述第一子网络层中相应BiLSTM网络输出不同段文本的特征;第二子网络层包括多个BiLSTM单元,所述第二子网络层中相应BiLSTM单元的输入为所述第一子网络层中相应BiLSTM网络输出的成段文本的特征,所述第二子网络层的输出为所述长文本数据的特征。
  5. 根据权利要求1所述的图文互检模型训练方法,其特征在于,所述文本训练数据包括多个所述长文本数据,相应的,所述文本编码模块包括多个多层LSTM网络,每个所述多层LSTM网络用于获取一个所述长文本数据的特征。
  6. 根据权利要求5所述的图文互检模型训练方法,其特征在于,所述文本训练数据还包括短文本数据,所述短文本数据为包含一个所述目标文本的文本数据;相应的,所述文本编码模块还包括短文本特征提取模块,用于提取所述短文本数据的特征。
  7. 根据权利要求6所述的图文互检模型训练方法,其特征在于,利用所述初始模型中的文本编码模块提取所述文本训练数据的文本编码特征,包括:
    对多个所述长文本数据的特征以及所述短文本数据的特征进行拼接,得到所述文本 训练数据的文本编码特征。
  8. 根据权利要求6所述的图文互检模型训练方法,其特征在于,所述获取训练数据对,包括:
    提取同一论文中的文本数据和图像数据;
    基于语义对所述文本数据分类,得到各类型的文本数据;
    基于所述目标文本的数量将所述各类型的文本数据确定为长文本数据或短文本数据;
    将所述各类型的文本数据确定为所述训练数据对中的文本训练数据,以及将所述图像数据确定为所述训练数据对中的图像训练数据。
  9. 根据权利要求1所述的图文互检模型训练方法,其特征在于,所述图像训练数据为图像序列;所述图像编码模块包括骨干网络和BiLSTM网络,相应的,利用图像编码模块提取所述图像训练数据的图像编码特征,包括:
    利用骨干网络提取所述图像序列中每张图像的特征,得到图像特征;
    将各所述图像特征输入所述BiLSTM网络,得到图像编码特征。
  10. 根据权利要求9所述的图文互检模型训练方法,其特征在于,所述图像编码模块还包括注意力结构,相应的,所述将所述图像特征输入所述BiLSTM网络,得到图像编码特征,包括:
    将各所述图像特征输入所述注意力结构,得到每个所述图像特征的注意力权重;
    基于所述注意力权重确定各所述图像特征的最终特征,并将所述最终特征输入所述BiLSTM网络,得到图像编码特征。
  11. 根据权利要求1所述的图文互检模型训练方法,其特征在于,所述基于所述文本编码特征和所述图像编码特征计算训练损失,包括:
    针对一个批次N个训练数据对的N个编码特征对,确定锚点样本对应的正样本和负样本;其中,所述编码特征对为所述训练数据对的文本编码特征和图像编码特征组成的编码对,所述锚点样本为所述N个编码特征对中的任一文本编码特征或图像编码特征,所述正样本为与所述锚点样本成对的另一编码特征,所述负样本为所述N个编码特征对中除所述另一编码特征外的所有编码特征;
    基于所述锚点样本,以及所述锚点样本对应的正样本和负样本计算训练损失。
  12. 根据权利要求11所述的图文互检模型训练方法,其特征在于,所述将所述训练数据对输入初始模型之前,还包括:
    基于预设概率确定是否对目标长文本数据进行打乱处理;其中,所述目标长文本数据为所述文本训练数据中句子间具有时序关系的长文本数据;
    若确定对所述目标长文本数据进行打乱处理,则对所述目标长文本数据进行打乱处理,否则不对所述目标长文本数据进行打乱处理;
    为所述目标长文本数据添加标签,所述标签表征所述目标长文本数据是否经过打乱处理。
  13. 根据权利要求12所述的图文互检模型训练方法,其特征在于,还包括:
    基于所述目标长文本数据的特征以及所述目标长文本数据的所述标签计算时序约束损失。
  14. 根据权利要求13所述的图文互检模型训练方法,其特征在于,所述基于所述锚 点样本,以及所述锚点样本对应的正样本和负样本计算训练损失,包括:
    基于所述锚点样本,以及所述锚点样本对应的正样本和负样本计算目标三元组损失;
    利用所述目标三元组损失和所述时序约束损失计算训练损失。
  15. 根据权利要求12所述的图文互检模型训练方法,其特征在于,所述对所述目标长文本数据进行打乱处理包括:
    选择预设比例的句子,对选择的句子进行位置调换,实现打乱处理。
  16. 根据权利要求1所述的图文互检模型训练方法,其特征在于,所述若所述训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型之前,还包括:
    判断所述训练损失是否小于预设阈值,若所述训练损失小于所述预设阈值,则判定所述训练损失满足收敛条件;若所述训练损失大于所述预设阈值,则判定所述训练损失没有满足收敛条件。
  17. 一种图文互检方法,其特征在于,包括:
    获取目标数据;其中,所述目标数据为图像数据或文本数据;
    将所述目标数据输入图文互检模型,以便所述图文互检模型提取出所述目标数据的目标编码特征;其中,所述图文互检模型基于如权利要求1至16任一项所述的图文互检模型训练方法得到;
    将所述目标编码特征与待检索数据集的所有数据编码特征进行匹配,得到检索结果;
    其中,若所述目标数据为图像数据,则所述所有数据编码特征均为文本编码特征,若所述目标数据为文本数据,则所述所有数据编码特征均为图像编码特征。
  18. 一种图文互检模型训练装置,其特征在于,包括:
    训练数据获取模块,用于获取训练数据对;所述训练数据对包括文本训练数据和图像训练数据,所述文本训练数据包括长文本数据,所述长文本数据为包含多个目标文本的文本数据,所述目标文本为句子或短语;
    特征提取模块,用于将所述训练数据对输入初始模型,分别利用所述初始模型中的文本编码模块和图像编码模块提取所述文本训练数据的文本编码特征以及所述图像训练数据的图像编码特征;其中,所述文本编码模块包括多层LSTM网络,所述多层LSTM网络包括第一LSTM网络层和第二LSTM网络层,所述第一LSTM网络层用于基于每个所述目标文本中每个字的特征获取每个所述目标文本的特征;所述第二LSTM网络层用于基于每个所述目标文本的特征获取所述长文本数据的特征;
    损失计算模块,用于基于所述文本编码特征和所述图像编码特征计算训练损失;
    参数调节模块,用于基于所述训练损失对所述初始模型进行参数调节;
    图文互检模型确定模块,用于若所述训练损失满足收敛条件,则将参数调节后的初始模型确定为图文互检模型。
  19. 一种电子设备,其特征在于,包括存储器和处理器,其中:
    所述存储器,用于保存计算机程序;
    所述处理器,用于执行所述计算机程序,以实现如权利要求1至16任一项所述的图文互检模型训练方法,和/或,如权利要求17所述的图文互检方法。
  20. 一种非易失性可读存储介质,其特征在于,用于保存计算机程序,其中,所述计算机程序被处理器执行时实现如权利要求1至16任一项所述的图文互检模型训练方 法,和/或,如权利要求17所述的图文互检方法。
PCT/CN2022/134092 2022-07-15 2022-11-24 图文互检模型训练方法及装置、图文互检方法、设备 WO2024011815A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210829134.9A CN114896373B (zh) 2022-07-15 2022-07-15 图文互检模型训练方法及装置、图文互检方法、设备
CN202210829134.9 2022-07-15

Publications (1)

Publication Number Publication Date
WO2024011815A1 true WO2024011815A1 (zh) 2024-01-18

Family

ID=82730090

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134092 WO2024011815A1 (zh) 2022-07-15 2022-11-24 图文互检模型训练方法及装置、图文互检方法、设备

Country Status (2)

Country Link
CN (1) CN114896373B (zh)
WO (1) WO2024011815A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117975472A (zh) * 2024-04-01 2024-05-03 鹏城实验室 物体定位方法、装置、设备及介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896373B (zh) * 2022-07-15 2022-12-09 苏州浪潮智能科技有限公司 图文互检模型训练方法及装置、图文互检方法、设备
CN115618043B (zh) * 2022-11-08 2023-04-07 苏州浪潮智能科技有限公司 文本操作图互检方法及模型训练方法、装置、设备、介质
CN115438169A (zh) * 2022-11-08 2022-12-06 苏州浪潮智能科技有限公司 一种文本与视频的互检方法、装置、设备及存储介质
CN115455228A (zh) * 2022-11-08 2022-12-09 苏州浪潮智能科技有限公司 一种多模态数据互检方法、装置、设备及可读存储介质
CN116049459B (zh) * 2023-03-30 2023-07-14 浪潮电子信息产业股份有限公司 跨模态互检索的方法、装置、服务器及存储介质
CN116226319B (zh) * 2023-05-10 2023-08-04 浪潮电子信息产业股份有限公司 一种混合异构模型训练方法、装置、设备及可读存储介质
CN116246288B (zh) * 2023-05-10 2023-08-04 浪潮电子信息产业股份有限公司 一种文本编码方法、模型训练方法、模型匹配方法及装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10373022B1 (en) * 2018-02-28 2019-08-06 Konica Minolta Laboratory U.S.A., Inc. Text image processing using stroke-aware max-min pooling for OCR system employing artificial neural network
CN112529986A (zh) * 2019-09-19 2021-03-19 百度在线网络技术(北京)有限公司 图文相关性的计算模型建立方法、计算方法及装置
US20210271707A1 (en) * 2020-02-27 2021-09-02 Adobe Inc. Joint Visual-Semantic Embedding and Grounding via Multi-Task Training for Image Searching
CN113435529A (zh) * 2021-07-06 2021-09-24 北京百度网讯科技有限公司 模型预训练方法、模型训练方法及图像处理方法
CN113836333A (zh) * 2021-09-18 2021-12-24 北京百度网讯科技有限公司 图文匹配模型的训练方法、实现图文检索的方法、装置
CN114722224A (zh) * 2022-04-13 2022-07-08 西安电子科技大学 基于联合特征的图文跨模态检索方法
CN114896373A (zh) * 2022-07-15 2022-08-12 苏州浪潮智能科技有限公司 图文互检模型训练方法及装置、图文互检方法、设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647350A (zh) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 一种基于双通道网络的图文关联检索方法
CN111581510B (zh) * 2020-05-07 2024-02-09 腾讯科技(深圳)有限公司 分享内容处理方法、装置、计算机设备和存储介质
AU2021104218A4 (en) * 2021-07-16 2021-09-09 Jaspal Bagga A system for identification of personality traits and a method thereof

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10373022B1 (en) * 2018-02-28 2019-08-06 Konica Minolta Laboratory U.S.A., Inc. Text image processing using stroke-aware max-min pooling for OCR system employing artificial neural network
CN112529986A (zh) * 2019-09-19 2021-03-19 百度在线网络技术(北京)有限公司 图文相关性的计算模型建立方法、计算方法及装置
US20210271707A1 (en) * 2020-02-27 2021-09-02 Adobe Inc. Joint Visual-Semantic Embedding and Grounding via Multi-Task Training for Image Searching
CN113435529A (zh) * 2021-07-06 2021-09-24 北京百度网讯科技有限公司 模型预训练方法、模型训练方法及图像处理方法
CN113836333A (zh) * 2021-09-18 2021-12-24 北京百度网讯科技有限公司 图文匹配模型的训练方法、实现图文检索的方法、装置
CN114722224A (zh) * 2022-04-13 2022-07-08 西安电子科技大学 基于联合特征的图文跨模态检索方法
CN114896373A (zh) * 2022-07-15 2022-08-12 苏州浪潮智能科技有限公司 图文互检模型训练方法及装置、图文互检方法、设备

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117975472A (zh) * 2024-04-01 2024-05-03 鹏城实验室 物体定位方法、装置、设备及介质

Also Published As

Publication number Publication date
CN114896373A (zh) 2022-08-12
CN114896373B (zh) 2022-12-09

Similar Documents

Publication Publication Date Title
WO2024011815A1 (zh) 图文互检模型训练方法及装置、图文互检方法、设备
US10679345B2 (en) Automatic contour annotation of medical images based on correlations with medical reports
WO2022155994A1 (zh) 基于注意力的深度跨模态哈希检索方法、装置及相关设备
US10565508B2 (en) Inferred facts discovered through knowledge graph derived contextual overlays
US7831111B2 (en) Method and mechanism for retrieving images
US20180165328A1 (en) Apply Corrections to an Ingested Corpus
US11042712B2 (en) Simplifying and/or paraphrasing complex textual content by jointly learning semantic alignment and simplicity
US11687570B2 (en) System and method for efficient multi-relational entity understanding and retrieval
CN107145485B (zh) 用于压缩主题模型的方法和装置
WO2024011814A1 (zh) 一种图文互检方法、系统、设备及非易失性可读存储介质
WO2022188584A1 (zh) 基于预训练语言模型的相似语句生成方法和装置
US11934781B2 (en) Systems and methods for controllable text summarization
WO2022089267A1 (zh) 样本数据获取方法、图像分割方法、装置、设备和介质
WO2021208444A1 (zh) 电子病例自动生成方法、装置、设备及存储介质
EP4174714A1 (en) Text sequence generation method, apparatus and device, and medium
CN110807086A (zh) 文本数据标注方法及装置、存储介质、电子设备
CN108563645B (zh) His系统的元数据翻译方法和装置
WO2024098525A1 (zh) 视频文本互检方法及其模型训练方法、装置、设备、介质
CN115994212B (zh) 视觉问答处理方法、视觉问答模型的训练方法及装置
CN111507405A (zh) 图片标注方法、装置、电子设备及计算机可读存储介质
US20230094828A1 (en) Audio file annotation
CN112507721B (zh) 生成文本主题的方法、装置、设备和计算机可读存储介质
US20210073480A1 (en) Automatic preprocessing for black box translation
US9940320B2 (en) Plugin tool for collecting user generated document segmentation feedback
CN115357710B (zh) 表格描述文本生成模型的训练方法、装置及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22950923

Country of ref document: EP

Kind code of ref document: A1