WO2023168811A1 - Picture-text model generation method and apparatus based on multiple experts, and device and medium - Google Patents


Info

Publication number
WO2023168811A1
WO2023168811A1 · PCT/CN2022/089730 · CN2022089730W
Authority
WO
WIPO (PCT)
Prior art keywords
text
picture
sample
vector
model
Prior art date
Application number
PCT/CN2022/089730
Other languages
French (fr)
Chinese (zh)
Inventor
谯轶轩
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023168811A1 publication Critical patent/WO2023168811A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53 Querying
    • G06F 16/532 Query formulation, e.g. graphical querying

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a multi-expert-based graphic and text model generation method and device, storage media, and computer equipment.
  • the image retrieval task includes two types: retrieval of pictures based on pictures and retrieval of text based on pictures.
  • the text retrieval task includes two types: retrieving text based on text and retrieving pictures based on text.
  • however, in the existing technology, pre-trained graphic and text models are usually single-expert models, with different personnel responsible for training, deployment and maintenance, which increases the training and maintenance costs of the models and occupies a large amount of computer resources.
  • in view of this, this application provides a multi-expert-based graphic and text model generation method and device, a storage medium and a computer device, which enable the initial picture expert module, the initial text expert module and the initial picture text expert module to be trained jointly, thereby saving model training and maintenance costs and effectively reducing the occupation of computer resources.
  • a multi-expert based graphic and text model generation method including:
  • the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture;
  • according to the first target vector and the second target vector, a picture text target vector is determined, the picture text target vector is input into the initial picture text expert module of the preset picture text model, and a first prediction score between the sample picture and the sample text is obtained based on the output result and a fully connected layer;
  • a multi-expert-based graphic and text model generation device including:
  • a sample acquisition module, used to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
  • the first input module is used to determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector to the initial picture expert module of the preset picture text model to obtain the first target vector;
  • a second input module, configured to determine an initial text vector based on the sample text in any of the training samples, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector;
  • a prediction module, configured to determine a picture text target vector based on the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer;
  • a model training module, configured to determine a model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • a computer-readable storage medium on which computer-readable instructions are stored.
  • a multi-expert-based graphic model generation method is implemented, including:
  • the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture;
  • determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture text model to obtain a first target vector; determine an initial text vector based on the sample text in the training sample, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector; determine a picture text target vector according to the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer; determine a model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • a computer device including a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor.
  • the processor executes the computer-readable instructions.
  • a multi-expert-based graphic model generation method is implemented, including:
  • the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture;
  • determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture text model to obtain a first target vector; determine an initial text vector based on the sample text in the training sample, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector; determine a picture text target vector according to the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer; determine a model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • the multi-expert based graphic and text model generation method and device, storage medium and computer equipment provided by this application can enable the initial picture expert module, the initial text expert module and the initial picture text expert module to achieve joint training , which can save model training and maintenance costs and effectively reduce the occupation of computer resources.
  • Figure 1 shows a schematic flow chart of a multi-expert-based graphic and text model generation method provided by an embodiment of the present application
  • Figure 2 shows a schematic flowchart of another multi-expert-based graphic and text model generation method provided by an embodiment of the present application
  • Figure 3 shows a schematic structural diagram of a multi-expert-based graphic and text model generation device provided by an embodiment of the present application.
  • a multi-expert-based graphic model generation method is provided, as shown in Figure 1.
  • the method includes:
  • Step 101 Obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
  • the multi-expert-based graphic and text model generation method enables the initial picture expert module, the initial text expert module and the initial picture text expert module to be trained jointly, which can save model training and maintenance costs and effectively reduce the occupation of computer resources.
  • the preset picture text model of this application mainly consists of three parts, namely the initial picture expert module, the initial text expert module and the initial picture text expert module.
  • when training is completed, a target picture expert module, a target text expert module and a target picture text expert module can be obtained correspondingly.
  • the training sample set can include multiple training samples, and each training sample can include a sample picture and a sample text.
  • the sample text can also include a real label indicating the relationship with the sample picture.
  • for example, if the sample text is a positive sample of the sample picture, that is, the sample text is an explanation of the sample picture, the real label can be 1; if the sample text is a negative sample of the sample picture, that is, the sample text is not an explanation of the sample picture, the real label can be 0.
  • Step 102 Determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture text model to obtain the first target vector;
  • the sample picture in the training sample can be converted to obtain an initial picture vector corresponding to the sample picture. Then, the initial picture vector can be input into the initial picture expert module in the preset picture text model, and then the first target vector can be output.
  • Step 103 Determine an initial text vector based on the sample text in any of the training samples, and input the initial text vector into the initial text expert module of the preset image text model to obtain the second target vector;
  • the initial text vector of the sample text corresponding to the sample picture in the training sample may also be determined. Then, the initial text vector can be input into the initial text expert module in the preset image text model, and then the second target vector can be output.
  • Step 104 Determine a picture text target vector according to the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer;
  • the picture text target vector can be further determined based on the first target vector and the second target vector.
  • the picture text target vector can then be used as the input of the initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module can be passed through the fully connected layer to output the first prediction score between the sample picture and the sample text. The first prediction score reflects the correlation between the sample text and the sample picture.
  • Step 105 Determine the model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • the model loss value of the preset picture text model can be determined based on the first prediction score and the real label of each training sample. Then, the preset picture text model can be trained based on the model loss value. After training, a multi-expert graphic and text model based on a picture expert, a text expert and a picture-text expert can be obtained.
  • a training sample set can be obtained.
  • the training sample set can include multiple training samples, where each training sample can include a sample picture and a sample text.
  • the sample text can also include a real label indicating the relationship to the sample image.
  • the sample image in the training sample can be converted to obtain the initial image vector corresponding to the sample image.
  • the initial picture vector can be input into the initial picture expert module in the preset picture text model, and then the first target vector can be output.
  • the initial text vector of the sample text corresponding to the sample picture in the training sample can also be determined.
  • the initial text vector can be input into the initial text expert module in the preset image text model, and then the second target vector can be output.
  • the picture text target vector can be further determined based on the first target vector and the second target vector.
  • the picture text target vector can be used as the input of the initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module is passed through the fully connected layer to output the first prediction score between the sample picture and the sample text.
  • the model loss value of the preset picture text model can be determined based on the first prediction score and the real label of each training sample, and the preset picture text model can be trained based on the model loss value. After training, a multi-expert graphic and text model based on a picture expert, a text expert and a picture-text expert can be obtained.
  • the embodiments of the present application can enable the initial picture expert module, the initial text expert module and the initial picture text expert module to realize joint training, which can save the training and maintenance costs of the model and effectively reduce the occupation of computer resources.
  • the method includes:
  • Step 201 Obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
  • a training sample set may be obtained.
  • the training sample set may include multiple training samples, where each training sample may include a sample picture and a sample text.
  • the sample text can also include a real label indicating the relationship with the sample picture. For example, if the sample text is a positive sample of the sample picture, that is, the sample text is an explanation of the sample picture, the real label can be 1; if the sample text is a negative sample of the sample picture, that is, the sample text is not an explanation of the sample picture, the real label can be 0.
  • Step 202 Determine the picture dimensions of the sample picture, where the picture dimensions include picture height and/or picture width;
  • the picture dimensions can be determined for the sample pictures in each training sample.
  • the picture dimensions can include picture height and picture width, and can also include the number of picture channels.
  • the image dimensions corresponding to the sample image can be H x W x C, where H represents the image height of the sample image, W represents the image width of the sample image, and C represents the number of image channels of the sample image.
  • Step 203 Divide the picture height and/or picture width of the sample picture based on the preset division size to obtain sub-sample pictures corresponding to the sample picture;
  • after the picture dimensions of the sample picture are determined, the sample picture can be divided according to the preset division size.
  • the sample picture can be divided only along the picture height, with the picture width unchanged; or only along the picture width, with the picture height unchanged.
  • the sample picture can also be divided along both the picture height and the picture width at the same time.
  • multiple sub-sample images corresponding to the sample image can be obtained.
  • for example, if the picture dimension of the sample picture is H x W x C, the sample picture can be divided into multiple sub-sample pictures of P x P x C according to the preset division size; that is, the picture dimension corresponding to each sub-sample picture is P x P x C.
  • Step 204 Convert the sub-sample pictures into the initial picture vector corresponding to each of the sub-sample pictures through a preset conversion tool;
  • each sub-sample picture can be converted into an initial picture vector corresponding to that sub-sample picture through a preset conversion tool; that is, each sub-sample picture can be directly represented by the initial picture vector corresponding to that sub-sample picture.
  • the preset conversion tool can be reshape. For example, if the picture dimension corresponding to each sub-sample picture is P x P x C, each sub-sample picture can be converted through the preset conversion tool into a vector of dimension P²C, and this P²C-dimensional vector can be used as the initial picture vector.
  • the P²C-dimensional vector corresponding to each sub-sample picture can also be converted into a one-dimensional vector of a specified dimension through dimensionality reduction, and the converted vector used as the initial picture vector.
  • Obtaining the initial image vector through dimensionality reduction can make the initial image vector more conveniently participate in subsequent operations, reduce the difficulty of subsequent operations, and increase the efficiency of operations.
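  • As an illustration of steps 202 to 204, the following minimal sketch (assuming PyTorch, with all dimension values chosen purely for illustration) divides a sample picture of dimension H x W x C into P x P x C sub-sample pictures, reshapes each into a P²C-dimensional initial picture vector, and applies an optional linear projection standing in for the dimensionality-reduction step described above:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
H, W, C = 224, 224, 3   # picture height, picture width, number of picture channels
P = 16                  # preset division size
D = 768                 # specified dimension after dimensionality reduction

picture = torch.randn(C, H, W)  # one sample picture (channel-first)

# Divide the picture along height and width into P x P x C sub-sample pictures.
patches = picture.unfold(1, P, P).unfold(2, P, P)       # (C, H/P, W/P, P, P)
patches = patches.permute(1, 2, 0, 3, 4).contiguous()   # (H/P, W/P, C, P, P)
patches = patches.view(-1, C, P, P)                     # (num_patches, C, P, P)

# "reshape" conversion: each sub-sample picture becomes a P*P*C-dimensional vector.
initial_picture_vectors = patches.reshape(patches.size(0), -1)  # (num_patches, P*P*C)

# Optional dimensionality reduction to a vector of the specified dimension D.
reduce = nn.Linear(P * P * C, D)
initial_picture_vectors = reduce(initial_picture_vectors)       # (num_patches, D)
```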
  • Step 205 Input the initial image vector to the initial image expert module of the preset image text model to obtain the first target vector;
  • Step 206 Determine, from the preset word vector database, the word vector corresponding to each word in the sample text, and splice the word vectors corresponding to the words in the sample text to obtain the initial text vector;
  • the initial picture vector corresponding to each sub-sample picture is input into the initial picture expert module of the preset picture text model, and the first target vector can be output accordingly.
  • the word vector corresponding to each word can be found in the preset word vector database, and the word vectors are then spliced according to the order of the words in the sample text to obtain the initial text vector corresponding to each sample text.
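  • A minimal sketch of this lookup-and-splice step is shown below. The dictionary-style word vector database, the embedding size and the function name are assumptions for illustration; splicing is interpreted here as stacking the per-word vectors in text order:

```python
import torch

# Hypothetical preset word-vector database: word -> embedding (illustrative values).
word_vector_db = {
    "a": torch.randn(300),
    "dog": torch.randn(300),
    "runs": torch.randn(300),
}

def build_initial_text_vector(sample_text: str) -> torch.Tensor:
    """Look up the word vector of each word and splice them in text order."""
    words = sample_text.lower().split()
    vectors = [word_vector_db[w] for w in words if w in word_vector_db]
    # The spliced result is the initial text vector fed to the initial text expert module.
    return torch.stack(vectors, dim=0)   # (num_words, 300)

initial_text_vector = build_initial_text_vector("a dog runs")
```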
  • Step 207 Input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector
  • Step 208 Splice the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector, and splice the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture text target vector;
  • the initial text vector can be input into the initial text expert module in the preset picture text model, and the second target vector can then be output. After the multiple first target vectors corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, they can be spliced to further determine the picture text target vector.
  • Step 209 Input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer;
  • the picture text target vector is used as the input of the initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module can then be passed through the fully connected layer to output the first prediction score between the sample picture and the sample text. The first prediction score reflects the correlation between the sample text and the sample picture.
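  • The splicing and scoring of steps 208 and 209 might look roughly like the following hedged sketch. The single Transformer encoder layer is only a stand-in for the initial picture text expert module, and the mean pooling and sigmoid are assumptions, not the applicant's exact architecture:

```python
import torch
import torch.nn as nn

D = 768  # hidden dimension, illustrative

# Stand-in for the initial picture text expert module (one Transformer layer).
picture_text_expert = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
fc = nn.Linear(D, 1)  # fully connected layer producing the prediction score

def first_prediction_score(first_target_vectors, second_target_vectors):
    # Splice the first target vectors of the sub-sample pictures (picture splicing vector)
    # with the second target vector of the sample text to get the picture text target vector.
    picture_text_target = torch.cat([first_target_vectors, second_target_vectors], dim=1)
    out = picture_text_expert(picture_text_target)   # (batch, seq, D)
    pooled = out.mean(dim=1)                         # simple pooling, an assumption
    return torch.sigmoid(fc(pooled)).squeeze(-1)     # score between 0 and 1

# Example shapes: 196 picture tokens, 12 text tokens, batch size 2.
score = first_prediction_score(torch.randn(2, 196, D), torch.randn(2, 12, D))
```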
  • Step 210 Determine the model loss value of the preset picture text model through a preset cross-entropy loss function based on the first prediction score corresponding to each training sample in the training sample set and the real label;
  • the model loss value of the preset image text model can be calculated through the preset cross-entropy loss function based on the first prediction score and the corresponding real label.
  • the preset cross-entropy loss function can be Loss = -(1/N) Σ_{i=1}^{N} [y_i · log(p_i) + (1 - y_i) · log(1 - p_i)], where y_i is the real label between the sample picture and the sample text, which can be 0 or 1, p_i is the first prediction score between the sample picture and the sample text, and N is the number of training samples in the training sample set.
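  • A minimal sketch of this binary cross-entropy computation is given below; the function name and the clamping constant are assumptions, and the built-in PyTorch loss noted in the comment is an equivalent shortcut:

```python
import torch

def model_loss(prediction_scores: torch.Tensor, real_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    p = prediction_scores.clamp(1e-7, 1 - 1e-7)   # numerical safety
    y = real_labels.float()
    return -(y * p.log() + (1 - y) * (1 - p).log()).mean()

# Equivalent built-in form:
# torch.nn.functional.binary_cross_entropy(prediction_scores, real_labels.float())
```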
  • Step 211 When the model loss value is greater than the preset loss threshold, adjust, according to the model loss value, the module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture text expert module in the preset picture text model to obtain an updated preset picture text model, obtain a second prediction score between each sample picture and the sample text through the updated preset picture text model and the fully connected layer, and calculate the model loss value again;
  • if the model loss value is less than or equal to the preset loss threshold, the preset picture text model can be used directly as the final multi-expert-based graphic and text model.
  • if the model loss value is greater than the preset loss threshold, it means that the accuracy of the preset picture text model has not reached expectations.
  • in this case, the parameters of the preset picture text model can be further adjusted; specifically, the parameters of one or several of the initial picture expert module, the initial text expert module and the initial picture text expert module can be adjusted. After the parameter adjustment, an updated preset picture text model can be obtained.
  • based on the training sample set, the second prediction score corresponding to each training sample can then be obtained, and the model loss value of the updated preset picture text model can be calculated again through the preset cross-entropy loss function based on the second prediction scores and the corresponding real labels.
  • the relationship between the model loss value and the preset loss threshold can then be judged again. When the model loss value is still greater than the preset loss threshold, the parameters of the updated preset picture text model are adjusted once more, third prediction scores are calculated, and the model loss value is computed from the third prediction scores and the real labels. This process of adjusting the model parameters of the preset picture text model and calculating the model loss value is repeated until the model loss value is less than or equal to the preset loss threshold.
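  • The iterative procedure of steps 210 to 212 can be summarized by the following hedged sketch. The optimizer, learning rate, threshold value, round limit and the model's forward signature are all assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def train_until_threshold(model, dataloader, loss_threshold=0.05, max_rounds=100):
    """Adjust module parameters until the model loss value is no longer above the threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(max_rounds):
        total_loss = 0.0
        for sample_pictures, sample_texts, real_labels in dataloader:
            scores = model(sample_pictures, sample_texts)      # prediction scores in (0, 1)
            loss = F.binary_cross_entropy(scores, real_labels.float())
            optimizer.zero_grad()
            loss.backward()                                    # adjust module parameters
            optimizer.step()
            total_loss += loss.item()
        model_loss_value = total_loss / len(dataloader)
        if model_loss_value <= loss_threshold:                 # accuracy has reached expectations
            break
    return model
```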
  • Step 212 When the model loss value is less than or equal to the preset loss threshold, the multi-expert based graphic and text model is obtained.
  • when the model loss value is less than or equal to the preset loss threshold, it means that the accuracy of the model has reached expectations, and the multi-expert-based graphic and text model is obtained.
  • this application simultaneously trains the initial image expert module, the initial text expert module and the initial image text expert module.
  • each module is equivalent to a Transformer layer of the original BERT model; the initial picture expert module and the initial text expert module can correspond to F of the layers, and the initial picture text expert module corresponds to the remaining (L-F) layers.
  • the embodiment of this application can flexibly configure the sizes of L and F during training according to the resource and time requirements of the actual business, so that the training of the model is closer to actual business needs. In addition, the initial picture expert module and the initial text expert module share the parameters of the multi-head attention layer during training, which greatly reduces the number of parameters of the model and reduces the GPU memory required when the model is deployed.
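  • The layer split described above (F expert layers with shared multi-head attention, plus L-F picture-text layers) might be configured roughly as in the sketch below. The class name, the separate feed-forward branches and the simplified residual connections are hypothetical and omit details such as layer normalization:

```python
import torch.nn as nn

class ExpertLayer(nn.Module):
    """One Transformer-style layer in which the picture expert and the text expert
    share the multi-head attention parameters but keep separate feed-forward parts."""
    def __init__(self, d_model=768, nhead=8):
        super().__init__()
        self.shared_attention = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.picture_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                         nn.Linear(4 * d_model, d_model))
        self.text_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                      nn.Linear(4 * d_model, d_model))

    def forward(self, x, modality: str):
        attn_out, _ = self.shared_attention(x, x, x)
        ffn = self.picture_ffn if modality == "picture" else self.text_ffn
        return x + ffn(x + attn_out)   # simplified residual connections

# L total layers: the first F serve the picture/text experts, the rest the picture-text expert.
L, F = 12, 6   # freely configurable according to resource and time requirements
expert_layers = nn.ModuleList(ExpertLayer() for _ in range(F))
picture_text_layers = nn.ModuleList(nn.TransformerEncoderLayer(768, 8, batch_first=True)
                                    for _ in range(L - F))
```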
  • the method further includes: receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based graphic and text model according to the format of the object to be analyzed, wherein the target analysis module includes at least one of a target picture expert module, a target text expert module and a target picture text expert module; converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so as to obtain a target result through the target output vector.
  • one or more modules can be directly determined and used from the multi-expert-based graphic model according to the object to be analyzed.
  • an object to be analyzed can be received.
  • the object to be analyzed can be a picture or text.
  • the format of the object to be analyzed can be analyzed, and the selected module can be determined based on the format of the object to be analyzed.
  • the object to be analyzed can be converted into the corresponding target input vector, and then the target input vector can be input into the target analysis module, and the target output vector corresponding to the object to be analyzed can be output.
  • the target result can be obtained later by using the target output vector.
  • for example, when the object to be analyzed is in text format, after the target output vector corresponding to the object to be analyzed is obtained, the most similar vectors can be retrieved through a corresponding similarity index to find texts or pictures similar to the object to be analyzed.
  • when converting the object to be analyzed, a picture can be divided into sub-pictures that are then converted into the corresponding target input vectors, or, for text, the word vector corresponding to each word can be found and the word vectors spliced together and converted into the target input vector.
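  • As a usage illustration of the similarity lookup mentioned above, the target output vector could be compared against an index of pre-computed candidate vectors with cosine similarity; the function name, the candidate index and top_k are assumptions:

```python
import torch
import torch.nn.functional as F

def most_similar(target_output_vector: torch.Tensor,
                 candidate_vectors: torch.Tensor, top_k: int = 5):
    """Return indices of the candidates most similar to the object to be analyzed."""
    sims = F.cosine_similarity(target_output_vector.unsqueeze(0), candidate_vectors, dim=-1)
    return sims.topk(top_k).indices

# candidate_vectors would hold pre-computed vectors of the pictures or texts in the library.
```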
  • determining the corresponding target analysis module from the multi-expert-based graphic and text model according to the format of the object to be analyzed specifically includes: when the format of the object to be analyzed is a picture format, using the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, using the target text expert module as the target analysis module; and when the format of the object to be analyzed includes both a picture format and a text format, using the target picture expert module, the target text expert module and the target picture text expert module as the target analysis modules.
  • the target analysis module can be determined according to the format of the object to be analyzed.
  • when the format of the object to be analyzed is a picture format, the target picture expert module in the multi-expert-based graphic and text model can be used as the target analysis module;
  • when the format of the object to be analyzed is a text format, the target text expert module in the multi-expert-based graphic and text model can be used as the target analysis module;
  • when the format of the object to be analyzed includes both a picture format and a text format, the target picture expert module, the target text expert module and the target picture text expert module in the multi-expert-based graphic and text model are all used as target analysis modules.
  • in this case, after the object to be analyzed in text format is converted into a target input vector, the corresponding output vector is obtained through the target text expert module; after the object to be analyzed in picture format is converted into a target input vector, the corresponding output vector is obtained through the target picture expert module. Finally, the output vector of the target text expert module and the output vector of the target picture expert module are spliced and used as the input of the target picture text expert module to obtain the target output vector.
  • that is, the vector corresponding to the object to be analyzed in picture format is first output through the target picture expert module, and the vector corresponding to the object to be analyzed in text format is output through the target text expert module; the two vectors are then spliced and input into the target picture text expert module, which can improve the accuracy of the target output vector of the target picture text expert module and help improve subsequent use effects.
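  • For an object to be analyzed containing both a picture and a text, the flow just described might be sketched as follows; the function and argument names are placeholders, and each expert is assumed to be a callable module:

```python
import torch

def analyze_mixed_object(picture_input_vector, text_input_vector,
                         picture_expert, text_expert, picture_text_expert):
    """Hedged sketch: run each modality through its expert, splice, then fuse."""
    picture_out = picture_expert(picture_input_vector)   # target picture expert module
    text_out = text_expert(text_input_vector)            # target text expert module
    spliced = torch.cat([picture_out, text_out], dim=1)  # splice the two output vectors
    return picture_text_expert(spliced)                  # target output vector
```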
  • the embodiment of the present application provides a multi-expert-based graphic and text model generation device, as shown in Figure 3, the device includes:
  • a sample acquisition module, used to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
  • the first input module is used to determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector to the initial picture expert module of the preset picture text model to obtain the first target vector;
  • a second input module, configured to determine an initial text vector based on the sample text in any of the training samples, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector;
  • a prediction module, configured to determine a picture text target vector based on the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer;
  • a model training module, configured to determine a model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • the first input module is specifically used for:
  • determine the picture dimensions of the sample picture, where the picture dimensions include picture height and/or picture width; divide the picture height and/or picture width of the sample picture based on the preset division size to obtain the sub-sample pictures corresponding to the sample picture; and convert the sub-sample pictures into the initial picture vector corresponding to each of the sub-sample pictures through a preset conversion tool.
  • the second input module is specifically used for:
  • determine, from the preset word vector database, the word vector corresponding to each word in the sample text, and splice the word vectors corresponding to the words in the sample text to obtain the initial text vector.
  • the prediction module is specifically used for:
  • the first target vector corresponding to each of the sub-sample pictures is spliced to obtain a picture splicing vector; the picture splicing vector is spliced with the second target vector corresponding to the sample text to obtain the picture text target vector.
  • the model training module is specifically used for:
  • determining the model loss value of the preset picture text model through a preset cross-entropy loss function based on the first prediction score corresponding to each training sample in the training sample set and the real label; when the model loss value is greater than the preset loss threshold, adjusting, according to the model loss value, the module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture text expert module in the preset picture text model to obtain an updated preset picture text model, obtaining a second prediction score between each sample picture and the sample text through the updated preset picture text model and the fully connected layer, and calculating the model loss value again; and when the model loss value is less than or equal to the preset loss threshold, obtaining the multi-expert-based graphic and text model.
  • the device also includes:
  • a receiving module, configured to receive an object to be analyzed after the multi-expert-based graphic and text model is obtained, and determine the corresponding target analysis module from the multi-expert-based graphic and text model according to the format of the object to be analyzed.
  • the target analysis module includes at least one of a target picture expert module, a target text expert module, and a target picture text expert module;
  • a third input module, used to convert the object to be analyzed into the corresponding target input vector, and input the target input vector into the target analysis module to obtain the target output vector corresponding to the object to be analyzed, so as to obtain the target result through the target output vector.
  • the receiving module is specifically used for:
  • when the format of the object to be analyzed is a picture format, use the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, use the target text expert module as the target analysis module;
  • when the format of the object to be analyzed includes both a picture format and a text format, use the target picture expert module, the target text expert module and the target picture text expert module as the target analysis modules.
  • embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores computer-readable instructions. When the computer-readable instructions are executed by the processor, the above-mentioned multi-expert-based graphic and text model generation method shown in Figures 1 to 2 is implemented.
  • the technical solution of this application can be embodied in the form of a software product.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.), or a volatile storage medium.
  • the storage medium includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in each implementation scenario of this application.
  • embodiments of the present application also provide a computer device, which includes a processor, a memory and computer-readable instructions stored in the memory and executable on the processor, wherein the memory and the processor are both arranged on a bus, and when the processor executes the computer-readable instructions, the above-mentioned multi-expert-based graphic and text model generation method shown in Figures 1 to 2 is implemented.
  • the computer device may also include a user interface, a network interface, a camera, a radio frequency (Radio Frequency, RF) circuit, a sensor, an audio circuit, a WI-FI module, etc.
  • the user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), etc.
  • optionally, the user interface may also include a USB interface, a card reader interface, etc.
  • Optional network interfaces may include standard wired interfaces, wireless interfaces (such as Bluetooth interfaces, WI-FI interfaces), etc.
  • it can be understood that the above structure does not constitute a limitation on the computer device, which may include more or fewer components, or combine certain components, or have a different arrangement of components.
  • the storage medium may also include an operating system and a network communication module.
  • an operating system is a program that manages and maintains the hardware and software resources of the computer device and supports the operation of information processing programs and other software and/or programs.
  • the network communication module is used to implement communication between components within the storage medium, as well as communication with other hardware and software in the physical device.
  • a training sample set can be obtained.
  • the training sample set can include multiple training samples, and each training sample can include a sample picture and a sample text.
  • the sample text can also include a real label indicating the relationship to the sample image.
  • the sample image in the training sample can be converted to obtain the initial image vector corresponding to the sample image.
  • the initial picture vector can be input into the initial picture expert module in the preset picture text model, and then the first target vector can be output.
  • the initial text vector of the sample text corresponding to the sample picture in the training sample can also be determined. Then, the initial text vector can be input into the initial text expert module in the preset picture text model, and the second target vector can be output. After the first target vector corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, the picture text target vector can be further determined based on the first target vector and the second target vector. The picture text target vector can then be used as the input of the initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module is passed through the fully connected layer to output the first prediction score between the sample picture and the sample text.
  • the model loss value of the preset picture text model can be determined based on the first prediction score and the real label of each training sample, and the preset picture text model can be trained based on the model loss value. After training, a multi-expert graphic and text model based on a picture expert, a text expert and a picture-text expert can be obtained.
  • the embodiments of the present application can enable the initial picture expert module, the initial text expert module and the initial picture text expert module to realize joint training, which can save the training and maintenance costs of the model and effectively reduce the occupation of computer resources.
  • the accompanying drawings are only schematic diagrams of preferred implementation scenarios, and the modules or processes in the accompanying drawings are not necessarily required for implementing the present application.
  • the modules in the devices in the implementation scenario can be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or can be correspondingly changed and located in one or more devices different from the implementation scenario.
  • the modules of the above implementation scenarios can be combined into one module or further split into multiple sub-modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application relates to the field of artificial intelligence. Disclosed are a picture-text model generation method and apparatus based on multiple experts, and a storage medium and a computer device. The method comprises: acquiring a training sample set; determining an initial picture vector on the basis of a sample picture in a training sample, and inputting the initial picture vector into an initial picture expert module, so as to obtain a first target vector; determining an initial text vector on the basis of sample text in the training sample, and inputting the initial text vector into an initial text expert module, so as to obtain a second target vector; determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module, and obtaining a first predicted score on the basis of an output result and a fully-connected layer; and determining a model loss value of a preset picture-text model on the basis of the first predicted score and a real label, and training the preset picture-text model on the basis of the model loss value, so as to obtain a picture-text model based on multiple experts.

Description

A multi-expert-based picture-text model generation method, apparatus, device and medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 9, 2022, with application number 202210232059.8 and entitled "A multi-expert-based picture-text model generation method, apparatus, device and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a multi-expert-based picture-text model generation method and apparatus, a storage medium, and a computer device.
Background Art
At present, large-scale picture-text pre-training is usually used to solve the following types of problems: picture retrieval tasks, text retrieval tasks, and complex picture-text reasoning tasks. The picture retrieval task includes two types, retrieving pictures based on pictures and retrieving text based on pictures; the text retrieval task likewise includes retrieving text based on text and retrieving pictures based on text.
However, the inventor found that, in the prior art, pre-trained picture-text models are usually single-expert models, with different personnel responsible for training, deployment and maintenance, which increases the training and maintenance costs of the models and occupies a large amount of computer resources.
Summary of the Invention
In view of this, this application provides a multi-expert-based picture-text model generation method and apparatus, a storage medium, and a computer device, which enable the initial picture expert module, the initial text expert module and the initial picture-text expert module to be trained jointly, thereby saving model training and maintenance costs and effectively reducing the occupation of computer resources.
According to one aspect of this application, a multi-expert-based picture-text model generation method is provided, including:
obtaining a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
determining an initial picture vector based on the sample picture in any of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector;
determining an initial text vector based on the sample text in the training sample, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector;
determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
According to another aspect of this application, a multi-expert-based picture-text model generation apparatus is provided, including:
a sample acquisition module, used to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
a first input module, used to determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector;
a second input module, used to determine an initial text vector based on the sample text in the training sample, and input the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector;
a prediction module, used to determine a picture-text target vector according to the first target vector and the second target vector, input the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
a model training module, used to determine a model loss value of the preset picture-text model based on the first prediction score and the real label, and train the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
According to yet another aspect of this application, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, a multi-expert-based picture-text model generation method is implemented, including:
obtaining a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture; determining an initial picture vector based on the sample picture in any of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector; determining an initial text vector based on the sample text in the training sample, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector; determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer; determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
According to still another aspect of this application, a computer device is provided, including a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor. When the processor executes the computer-readable instructions, a multi-expert-based picture-text model generation method is implemented, including:
obtaining a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture; determining an initial picture vector based on the sample picture in any of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector; determining an initial text vector based on the sample text in the training sample, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector; determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer; determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
Through the above technical solution, the multi-expert-based picture-text model generation method and apparatus, storage medium and computer device provided by this application enable the initial picture expert module, the initial text expert module and the initial picture-text expert module to be trained jointly, which can save model training and maintenance costs and effectively reduce the occupation of computer resources.
The above description is only an overview of the technical solutions of this application. In order to understand the technical means of this application more clearly so that they can be implemented according to the content of the description, and in order to make the above and other objects, features and advantages of this application more obvious and understandable, specific embodiments of this application are set out below.
Brief Description of the Drawings
The drawings described here are used to provide a further understanding of this application and constitute a part of this application. The illustrative embodiments of this application and their descriptions are used to explain this application and do not constitute an improper limitation of this application. In the drawings:
Figure 1 shows a schematic flowchart of a multi-expert-based picture-text model generation method provided by an embodiment of this application;
Figure 2 shows a schematic flowchart of another multi-expert-based picture-text model generation method provided by an embodiment of this application;
Figure 3 shows a schematic structural diagram of a multi-expert-based picture-text model generation apparatus provided by an embodiment of this application.
Detailed Description
This application will be described in detail below with reference to the drawings and in conjunction with embodiments. It should be noted that, as long as there is no conflict, the embodiments of this application and the features in the embodiments can be combined with each other.
在本实施例中提供了一种基于多专家的图文模型生成方法,如图1所示,该方法包括:In this embodiment, a multi-expert-based graphic model generation method is provided, as shown in Figure 1. The method includes:
Step 101: obtain a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture.
The multi-expert-based picture-text model generation method provided by the embodiments of this application allows the initial picture expert module, the initial text expert module, and the initial picture-text expert module to be trained jointly, which saves model training and maintenance costs and effectively reduces the occupation of computer resources. The preset picture-text model of this application mainly consists of three parts: the initial picture expert module, the initial text expert module, and the initial picture-text expert module; after training, the corresponding target picture expert module, target text expert module, and target picture-text expert module are generated. First, a training sample set is obtained. The training sample set may include multiple training samples, and each training sample may include one sample picture and one sample text. In addition, the sample text carries a real label indicating its relationship with the sample picture. For example, if the sample text is a positive sample of the sample picture, i.e., the sample text is a description of the sample picture, the real label may be 1; if the sample text is a negative sample of the sample picture, i.e., the sample text is not a description of the sample picture, the real label may be 0.
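As a concrete illustration of this data layout, the following minimal Python sketch shows one way such labelled picture-text pairs could be organised; the class name, field names, and file names are illustrative assumptions and do not come from the original disclosure.

```python
from dataclasses import dataclass

@dataclass
class TrainingSample:
    """One training sample: a picture, a candidate text, and a 0/1 real label."""
    image_path: str   # path to the sample picture
    text: str         # candidate description of the picture
    label: int        # 1 if the text describes the picture (positive), 0 otherwise

# A tiny illustrative training set: one positive and one negative picture-text pair.
training_set = [
    TrainingSample("dog.jpg", "a dog running on the grass", 1),
    TrainingSample("dog.jpg", "a plate of pasta on a table", 0),
]
```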
Step 102: determine an initial picture vector based on the sample picture in any one of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture-text model to obtain a first target vector.
In this embodiment, for each training sample in the training sample set, the sample picture in the training sample is converted to obtain the initial picture vector corresponding to the sample picture. The initial picture vector is then input into the initial picture expert module of the preset picture-text model, which outputs the first target vector.
Step 103: determine an initial text vector based on the sample text in said training sample, and input the initial text vector into the initial text expert module of the preset picture-text model to obtain a second target vector.
In this embodiment, the initial text vector of the sample text corresponding to the sample picture in the training sample is also determined. The initial text vector is then input into the initial text expert module of the preset picture-text model, which outputs the second target vector.
Step 104: determine a picture-text target vector according to the first target vector and the second target vector, input the picture-text target vector into the initial picture-text expert module of the preset picture-text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer.
In this embodiment, after the first target vector corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, the picture-text target vector is further determined on the basis of the two. The picture-text target vector is then fed into the initial picture-text expert module of the preset picture-text model, and the output of that module is passed through the fully connected layer to produce the first prediction score between the sample picture and the sample text. The first prediction score reflects the degree of association between the sample text and the sample picture.
Step 105: determine a model loss value of the preset picture-text model based on the first prediction score and the real label, and train the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
In this embodiment, after the first prediction score between the sample picture and the sample text of each training sample is determined, the model loss value of the preset picture-text model is determined according to the first prediction score and the real label of each training sample. The preset picture-text model is then trained on the basis of this model loss value, and after training a multi-expert picture-text model based on a picture expert, a text expert, and a picture-text expert is obtained.
By applying the technical solution of this embodiment, a training sample set is first obtained, in which each training sample includes one sample picture and one sample text, and the sample text carries a real label indicating its relationship with the sample picture. For each training sample, the sample picture is converted into its corresponding initial picture vector, which is input into the initial picture expert module of the preset picture-text model to output the first target vector; the initial text vector of the corresponding sample text is likewise determined and input into the initial text expert module to output the second target vector. The picture-text target vector is then determined from the first and second target vectors and input into the initial picture-text expert module, whose output is passed through the fully connected layer to obtain the first prediction score between the sample picture and the sample text. The model loss value of the preset picture-text model is determined from the first prediction score and the real label of each training sample, and the model is trained on this loss value to obtain the multi-expert picture-text model based on a picture expert, a text expert, and a picture-text expert. The embodiments of this application thus allow the initial picture expert module, the initial text expert module, and the initial picture-text expert module to be trained jointly, saving model training and maintenance costs and effectively reducing the occupation of computer resources.
Further, as a refinement and extension of the specific implementation of the above embodiment, and in order to fully explain the specific implementation process of this embodiment, another multi-expert-based picture-text model generation method is provided. As shown in Figure 2, the method includes:
Step 201: obtain a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture.
In this embodiment, a training sample set is first obtained. The training sample set may include multiple training samples, and each training sample may include one sample picture and one sample text. In addition, the sample text carries a real label indicating its relationship with the sample picture; for example, if the sample text is a positive sample of the sample picture, i.e., a description of the sample picture, the real label may be 1, and if it is a negative sample, i.e., not a description of the sample picture, the real label may be 0.
Step 202: determine the picture dimensions of the sample picture, wherein the picture dimensions include a picture height and/or a picture width.
In this embodiment, the picture dimensions of the sample picture in each training sample are determined. Here the picture dimensions may include the picture height and the picture width, and may further include the number of picture channels. For example, the picture dimensions of a sample picture may be H x W x C, where H is the picture height, W is the picture width, and C is the number of picture channels of the sample picture.
Step 203: divide the picture height and/or picture width of the sample picture based on a preset division size to obtain sub-sample pictures corresponding to the sample picture.
In this embodiment, after the picture dimensions of the sample picture are determined, the sample picture is divided according to the preset division size. The sample picture may be divided only along the picture height, with the picture width unchanged; only along the picture width, with the picture height unchanged; or along both the picture height and the picture width at the same time. After the division, multiple sub-sample pictures corresponding to the sample picture are obtained. For example, if the picture dimensions of the sample picture are H x W x C, the sample picture may be divided according to the preset division size into multiple sub-sample pictures of P x P x C, i.e., the picture dimensions of each sub-sample picture are P x P x C.
Step 204: convert the sub-sample pictures into the initial picture vector corresponding to each sub-sample picture through a preset conversion tool.
In this embodiment, after the multiple sub-sample pictures corresponding to each sample picture are obtained, each sub-sample picture is converted by a preset conversion tool into the initial picture vector corresponding to that sub-sample picture, i.e., each sub-sample picture is directly represented by its initial picture vector. Here the preset conversion tool may be reshape. For example, if the picture dimensions of each sub-sample picture are P x P x C, each sub-sample picture may be converted by the preset conversion tool into a vector of dimension P²C, and this P²C vector may serve as the initial picture vector. In addition, the P²C vector of each sub-sample picture may be further converted by dimensionality reduction into a one-dimensional vector of a specified dimension, and the converted one-dimensional vector is used as the initial picture vector. Obtaining the initial picture vector through dimensionality reduction allows it to participate more conveniently in subsequent operations, reducing the difficulty and increasing the efficiency of those operations.
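The patch-and-reshape procedure described in steps 202 to 204 can be sketched as follows. This is a minimal NumPy illustration assuming square patches of side P and a channel-last image layout; the patch size of 16 and the image size are arbitrary examples, not values fixed by the application.

```python
import numpy as np

def image_to_patch_vectors(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an H x W x C image into P x P x C sub-sample pictures and flatten
    each one into a vector of length P*P*C (the initial picture vectors)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (num_patches, P*P*C)
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch * patch * C)

image = np.random.rand(224, 224, 3)          # toy stand-in for a sample picture
vectors = image_to_patch_vectors(image, 16)  # shape: (196, 768), one vector per patch
```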
Step 205: input the initial picture vector into the initial picture expert module of the preset picture-text model to obtain the first target vector.
Step 206: based on a preset word-vector database, determine from the preset word-vector database the word vector corresponding to each word of the sample text, and splice the word vectors corresponding to the words of the sample text to obtain the initial text vector.
In this embodiment, the initial picture vector corresponding to each sub-sample picture is input into the initial picture expert module of the preset picture-text model, which outputs the corresponding first target vector. In addition, for each word of the sample text, the word vector corresponding to that word is looked up in the preset word-vector database, and the word vectors are then spliced in the order of the words in the sample text to obtain the initial text vector corresponding to each sample text.
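A minimal sketch of the word-vector lookup and splicing in step 206 is given below. The toy dictionary, the embedding size d, and whitespace tokenisation are illustrative assumptions; the application itself simply looks up each word of the sample text in the preset word-vector database.

```python
import numpy as np

d = 8  # illustrative word-vector dimension
# Toy stand-in for the preset word-vector database.
word_vectors = {w: np.random.rand(d) for w in ["a", "dog", "running", "on", "the", "grass"]}
unk = np.zeros(d)  # fallback for tokens missing from the database

def text_to_initial_vector(text: str) -> np.ndarray:
    """Look up the vector of every token and splice them in their original order."""
    return np.concatenate([word_vectors.get(tok, unk) for tok in text.split()])

initial_text_vector = text_to_initial_vector("a dog running on the grass")  # shape: (6 * d,)
```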
Step 207: input the initial text vector into the initial text expert module of the preset picture-text model to obtain the second target vector.
Step 208: splice the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector, and splice the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
In this embodiment, the initial text vector is input into the initial text expert module of the preset picture-text model, which outputs the second target vector. After the multiple first target vectors corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, the first target vectors and the second target vector are spliced to further determine the picture-text target vector.
Step 209: input the picture-text target vector into the initial picture-text expert module of the preset picture-text model, and obtain the first prediction score between the sample picture and the sample text based on the output result and the fully connected layer.
In this embodiment, the picture-text target vector is input into the initial picture-text expert module of the preset picture-text model, and the output of that module is passed through the fully connected layer to output the first prediction score between the sample picture and the sample text, which reflects the degree of association between the sample text and the sample picture.
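The flow of steps 205 to 209 can be sketched as follows. The sketch is only illustrative: each expert is stood in for by a single PyTorch TransformerEncoderLayer rather than a full stack of layers, the hidden size, head count, and sequence lengths are arbitrary, and a sigmoid is assumed so that the fully connected layer yields a score between 0 and 1.

```python
import torch
import torch.nn as nn

hidden = 256  # illustrative hidden size shared by all expert modules

def expert():
    # Stand-in for one expert module (a stack of Transformer layers in the described model).
    return nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)

picture_expert = expert()                 # initial picture expert module
text_expert    = expert()                 # initial text expert module
fusion_expert  = expert()                 # initial picture-text expert module
score_head     = nn.Linear(hidden, 1)     # fully connected layer -> prediction score

patch_tokens = torch.randn(1, 196, hidden)  # initial picture vectors, one per sub-sample picture
text_tokens  = torch.randn(1, 32, hidden)   # initial text vector as a token sequence

first_target  = picture_expert(patch_tokens)                      # first target vectors
second_target = text_expert(text_tokens)                          # second target vector
fused_input   = torch.cat([first_target, second_target], dim=1)   # picture-text target vector
fused_output  = fusion_expert(fused_input)
score = torch.sigmoid(score_head(fused_output[:, 0]))             # first prediction score in (0, 1)
```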
Step 210: based on the first prediction score corresponding to each training sample in the training sample set and the real label, determine the model loss value of the preset picture-text model through a preset cross-entropy loss function.
In this embodiment, after the first prediction score corresponding to each training sample is obtained, the model loss value of the preset picture-text model is calculated from the first prediction scores and the corresponding real labels through the preset cross-entropy loss function. Here the preset cross-entropy loss function may be

Loss = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

where y_i is the real label between the sample picture and the sample text and may be 0 or 1, p_i is the first prediction score between the sample picture and the sample text, and N is the number of training samples in the training sample set.
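A direct implementation of this loss is sketched below; the clamping constant is a numerical-stability assumption not stated in the application.

```python
import torch

def cross_entropy_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Loss = -(1/N) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]"""
    eps = 1e-7                        # keep log() away from 0 and 1
    p = scores.clamp(eps, 1 - eps)
    return -(labels * p.log() + (1 - labels) * (1 - p).log()).mean()

scores = torch.tensor([0.9, 0.2, 0.7])  # first prediction scores for N = 3 training samples
labels = torch.tensor([1.0, 0.0, 1.0])  # real labels
loss = cross_entropy_loss(scores, labels)
# Equivalent to torch.nn.functional.binary_cross_entropy(scores, labels).
```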
Step 211: when the model loss value is greater than a preset loss threshold, adjust, according to the model loss value, module parameters of at least one of the initial picture expert module, the initial text expert module, and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model; obtain, through the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text; and calculate the model loss value again.
In this embodiment, after the model loss value is calculated, if it is less than or equal to the preset loss threshold, the preset picture-text model can be taken directly as the final multi-expert-based picture-text model. If the model loss value is greater than the preset loss threshold, the accuracy of the preset picture-text model has not yet reached expectations and its parameters are adjusted further; specifically, the parameters of one or more of the initial picture expert module, the initial text expert module, and the initial picture-text expert module may be adjusted, yielding an updated preset picture-text model. After the parameter adjustment, a second prediction score for each training sample is obtained from the training sample set, and the model loss value of the updated preset picture-text model is calculated again from the second prediction scores and the corresponding real labels through the preset cross-entropy loss function. The model loss value is then compared with the preset loss threshold again; if it is still greater than the threshold, the parameters of the updated preset picture-text model are updated once more, a third prediction score is computed, and the model loss value is calculated from the third prediction score and the real labels. The process of adjusting the model parameters of the preset picture-text model and calculating the model loss value is repeated until the model loss value is less than or equal to the preset loss threshold.
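The iterative procedure of steps 210 to 212 can be sketched as a conventional training loop; the optimiser choice, learning rate, threshold value, and the model/loader interfaces below are all assumptions made for illustration, not details fixed by the application.

```python
import torch

def train_until_threshold(model, loader, loss_fn, threshold=0.05, max_epochs=100, lr=1e-4):
    """Hypothetical loop: `model` scores (picture, text) pairs and `loader`
    yields (picture_batch, text_batch, label_batch); both are assumed interfaces."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for pictures, texts, labels in loader:
            scores = model(pictures, texts)     # prediction scores for this batch
            loss = loss_fn(scores, labels)      # model loss value
            optimizer.zero_grad()
            loss.backward()                     # adjust the module parameters
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(loader)
        if epoch_loss <= threshold:             # loss value <= preset loss threshold
            break                               # training is finished
    return model
```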
Step 212: when the model loss value is less than or equal to the preset loss threshold, obtain the multi-expert-based picture-text model.
In this embodiment, when the model loss value is less than or equal to the preset loss threshold, the model accuracy has reached expectations and the multi-expert-based picture-text model is obtained; at this point the model includes the trained target picture expert module, target text expert module, and target picture-text expert module. When training the preset picture-text model, this application trains the initial picture expert module, the initial text expert module, and the initial picture-text expert module at the same time. Each module corresponds to Transformer layers of the original BERT model: the initial picture expert module and the initial text expert module correspond to F layers, and the initial picture-text expert module corresponds to (L-F) layers. The sizes of L and F can therefore be configured flexibly and freely during training according to the resource and time requirements of the actual business situation, so that training is closer to the actual business needs. Moreover, the initial picture expert module and the initial text expert module share the parameters of the multi-head attention layers during training, which greatly reduces the number of model parameters and lowers the GPU memory required when the model is deployed.
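One possible way to realise the F shared expert layers and the (L-F) fusion layers is sketched below. This is only an interpretation under stated assumptions (a single shared nn.MultiheadAttention per expert layer with branch-specific feed-forward networks, layer normalisation omitted for brevity, and L = 12, F = 6 as arbitrary example values); it is not the application's definitive architecture.

```python
import torch.nn as nn

class SharedAttentionExpertLayer(nn.Module):
    """One expert layer: the picture branch and the text branch keep separate
    feed-forward networks but share the multi-head attention parameters."""
    def __init__(self, hidden=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)  # shared weights
        self.picture_ffn = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))
        self.text_ffn    = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))

    def forward(self, x, branch: str):
        attended, _ = self.attn(x, x, x)
        ffn = self.picture_ffn if branch == "picture" else self.text_ffn
        return x + ffn(attended)

L, F = 12, 6  # total layers L and expert layers F, both freely configurable per deployment
expert_layers = nn.ModuleList([SharedAttentionExpertLayer() for _ in range(F)])        # picture/text experts
fusion_layers = nn.ModuleList([nn.TransformerEncoderLayer(256, 4, batch_first=True)
                               for _ in range(L - F)])                                  # picture-text expert
```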
In the embodiment of this application, optionally, after step 212 the method further includes: receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module includes at least one of the target picture expert module, the target text expert module, and the target picture-text expert module; converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so that a target result is obtained through the target output vector.
In this embodiment, after the multi-expert-based picture-text model is obtained, one or more modules can subsequently be selected from it for use directly according to the object to be analyzed. Specifically, the object to be analyzed, which may be a picture or a text, is first received. Its format is then analyzed and the module to be used is determined according to that format. After the module is determined, the object to be analyzed is converted into the corresponding target input vector, which is input into the target analysis module to output the target output vector corresponding to the object to be analyzed. The target result can then be obtained by using the target output vector. For example, when the object to be analyzed is in text format, after the target output vector is obtained, the most similar vectors can be found through a corresponding similarity metric, so as to retrieve text or pictures similar to the object to be analyzed. When the object to be analyzed is converted into the target input vector, the same methods described above may be used: dividing a picture into sub-pictures and converting them into the corresponding target input vectors, or looking up the word vector of each word of a text and splicing the word vectors into the target input vector.
In the embodiment of this application, optionally, determining the corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed specifically includes: when the format of the object to be analyzed is a picture format, using the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, using the target text expert module as the target analysis module; and when the format of the object to be analyzed includes both a picture format and a text format, using the target picture expert module, the target text expert module, and the target picture-text expert module as the target analysis modules.
In this embodiment, the target analysis module is determined according to the format of the object to be analyzed. When the format is a picture format, the target picture expert module of the multi-expert-based picture-text model serves as the target analysis module; when the format is a text format, the target text expert module serves as the target analysis module; and when the format includes both a picture format and a text format, the target picture expert module, the target text expert module, and the target picture-text expert module all serve as target analysis modules. In the latter case, the text-format part of the object to be analyzed is converted into a target input vector and passed through the target text expert module to obtain one output vector, the picture-format part is converted into a target input vector and passed through the target picture expert module to obtain another output vector, and the two output vectors are spliced and used as the input of the target picture-text expert module to obtain the target output vector. Producing the vector for the picture-format part through the target picture expert module and the vector for the text-format part through the target text expert module first, and then splicing them as the input of the target picture-text expert module, improves the accuracy of the target output vector of the target picture-text expert and benefits subsequent use.
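The format-based selection of target analysis modules can be sketched as a simple dispatch function; the module names and the picture/text arguments are illustrative placeholders rather than identifiers defined by the application.

```python
def choose_target_modules(picture=None, text=None):
    """Select which trained expert modules analyse the input, by input format."""
    if picture is not None and text is not None:
        return ["target_picture_expert", "target_text_expert", "target_picture_text_expert"]
    if picture is not None:
        return ["target_picture_expert"]
    if text is not None:
        return ["target_text_expert"]
    raise ValueError("the object to be analyzed contains neither a picture nor a text")

choose_target_modules(text="a dog running on the grass")   # -> ["target_text_expert"]
```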
Further, as a specific implementation of the method of Figure 1, an embodiment of this application provides a multi-expert-based picture-text model generation apparatus. As shown in Figure 3, the apparatus includes:
a sample acquisition module, configured to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture;
a first input module, configured to determine an initial picture vector based on the sample picture in any one of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture-text model to obtain a first target vector;
a second input module, configured to determine an initial text vector based on the sample text in said training sample, and input the initial text vector into the initial text expert module of the preset picture-text model to obtain a second target vector;
a prediction module, configured to determine a picture-text target vector according to the first target vector and the second target vector, input the picture-text target vector into the initial picture-text expert module of the preset picture-text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
a model training module, configured to determine a model loss value of the preset picture-text model based on the first prediction score and the real label, and train the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
Optionally, the first input module is specifically configured to:
determine the picture dimensions of the sample picture, wherein the picture dimensions include a picture height and/or a picture width; divide the picture height and/or picture width of the sample picture based on a preset division size to obtain sub-sample pictures corresponding to the sample picture; and convert the sub-sample pictures through a preset conversion tool into the initial picture vector corresponding to each sub-sample picture.
Optionally, the second input module is specifically configured to:
based on a preset word-vector database, determine from the preset word-vector database the word vector corresponding to each word of the sample text, and splice the word vectors corresponding to the words of the sample text to obtain the initial text vector.
Optionally, the prediction module is specifically configured to:
splice the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector; and splice the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
Optionally, the model training module is specifically configured to:
based on the first prediction score corresponding to each training sample in the training sample set and the real label, determine the model loss value of the preset picture-text model through a preset cross-entropy loss function; when the model loss value is greater than a preset loss threshold, adjust, according to the model loss value, module parameters of at least one of the initial picture expert module, the initial text expert module, and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model, obtain, through the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text, and calculate the model loss value again; and when the model loss value is less than or equal to the preset loss threshold, obtain the multi-expert-based picture-text model.
Optionally, the apparatus further includes:
a receiving module, configured to, after the multi-expert-based picture-text model is obtained, receive an object to be analyzed and determine a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module includes at least one of the target picture expert module, the target text expert module, and the target picture-text expert module;
a third input module, configured to convert the object to be analyzed into a corresponding target input vector, and input the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so that a target result is obtained through the target output vector.
Optionally, the receiving module is specifically configured to:
when the format of the object to be analyzed is a picture format, use the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, use the target text expert module as the target analysis module; and when the format of the object to be analyzed includes both a picture format and a text format, use the target picture expert module, the target text expert module, and the target picture-text expert module as the target analysis modules.
It should be noted that for other corresponding descriptions of the functional units involved in the multi-expert-based picture-text model generation apparatus provided by the embodiments of this application, reference may be made to the corresponding descriptions of the methods of Figures 1 and 2, which are not repeated here.
Based on the methods shown in Figures 1 and 2 above, an embodiment of this application correspondingly further provides a computer-readable storage medium, which may be non-volatile or volatile. Computer-readable instructions are stored on the computer-readable storage medium, and when executed by a processor, the computer-readable instructions implement the multi-expert-based picture-text model generation method shown in Figures 1 and 2.
Based on this understanding, the technical solution of this application may be embodied in the form of a software product. The software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, and the like) or a volatile storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the implementation scenarios of this application.
Based on the methods shown in Figures 1 and 2 and the virtual apparatus embodiment shown in Figure 3, to achieve the above purpose, an embodiment of this application further provides a computer device, which includes a processor, a memory, and computer-readable instructions stored on the memory and executable on the processor, wherein both the memory and the processor are arranged on a bus, and when executing the computer-readable instructions the processor implements the multi-expert-based picture-text model generation method shown in Figures 1 and 2.
Optionally, the computer device may further include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display (Display), an input unit such as a keyboard (Keyboard), and the like, and an optional user interface may further include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface or a Wi-Fi interface), and the like.
Those skilled in the art can understand that the computer device structure provided in this embodiment does not constitute a limitation on the computer device, and it may include more or fewer components, combine certain components, or use a different arrangement of components.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages and saves the hardware and software resources of the computer device and supports the operation of the information processing program and other software and/or programs. The network communication module is used to implement communication between the components inside the storage medium and communication with other hardware and software in the physical device.
Through the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by software plus a necessary general hardware platform, or by hardware. With the technical solution of this application, a training sample set is first obtained, in which each training sample includes one sample picture and one sample text, and the sample text carries a real label indicating its relationship with the sample picture. For each training sample, the sample picture is converted into its corresponding initial picture vector, which is input into the initial picture expert module of the preset picture-text model to output the first target vector; the initial text vector of the corresponding sample text is determined and input into the initial text expert module to output the second target vector. The picture-text target vector is then determined from the first and second target vectors and input into the initial picture-text expert module, whose output is passed through the fully connected layer to obtain the first prediction score between the sample picture and the sample text. The model loss value of the preset picture-text model is determined from the first prediction score and the real label of each training sample, and the model is trained on this loss value to obtain the multi-expert picture-text model based on a picture expert, a text expert, and a picture-text expert. The embodiments of this application thus allow the initial picture expert module, the initial text expert module, and the initial picture-text expert module to be trained jointly, saving model training and maintenance costs and effectively reducing the occupation of computer resources.
Those skilled in the art can understand that the drawings are only schematic diagrams of preferred implementation scenarios, and the modules or processes in the drawings are not necessarily required for implementing this application. Those skilled in the art can also understand that the modules in the apparatus of an implementation scenario may be distributed in the apparatus as described, or may be changed correspondingly and located in one or more apparatuses different from that of the present implementation scenario. The modules of the above implementation scenarios may be combined into one module, or may be further split into multiple sub-modules.
The above serial numbers of this application are for description only and do not represent the superiority or inferiority of the implementation scenarios. What is disclosed above is only several specific implementation scenarios of this application; however, this application is not limited thereto, and any variation conceivable by those skilled in the art shall fall within the protection scope of this application.

Claims (20)

1. A multi-expert-based picture-text model generation method, comprising:
obtaining a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture;
determining an initial picture vector based on the sample picture in any one of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector;
determining an initial text vector based on the sample text in said training sample, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector;
determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
2. The method according to claim 1, wherein determining an initial picture vector based on the sample picture in any one of the training samples specifically comprises:
determining the picture dimensions of the sample picture, wherein the picture dimensions include a picture height and/or a picture width;
dividing the picture height and/or picture width of the sample picture based on a preset division size to obtain sub-sample pictures corresponding to the sample picture;
converting the sub-sample pictures through a preset conversion tool into the initial picture vector corresponding to each sub-sample picture.
3. The method according to claim 1, wherein determining an initial text vector based on the sample text in said training sample specifically comprises:
based on a preset word-vector database, determining from the preset word-vector database the word vector corresponding to each word of the sample text, and splicing the word vectors corresponding to the words of the sample text to obtain the initial text vector.
4. The method according to claim 2 or 3, wherein determining a picture-text target vector according to the first target vector and the second target vector specifically comprises:
splicing the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector;
splicing the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
5. The method according to claim 1, wherein determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model, specifically comprises:
based on the first prediction score corresponding to each training sample in the training sample set and the real label, determining the model loss value of the preset picture-text model through a preset cross-entropy loss function;
when the model loss value is greater than a preset loss threshold, adjusting, according to the model loss value, module parameters of at least one of the initial picture expert module, the initial text expert module, and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model, obtaining, through the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text, and calculating the model loss value again;
when the model loss value is less than or equal to the preset loss threshold, obtaining the multi-expert-based picture-text model.
6. The method according to claim 1, wherein after obtaining the multi-expert-based picture-text model, the method further comprises:
receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module includes at least one of a target picture expert module, a target text expert module, and a target picture-text expert module;
converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so as to obtain a target result through the target output vector.
7. The method according to claim 6, wherein determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed specifically comprises:
when the format of the object to be analyzed is a picture format, using the target picture expert module as the target analysis module;
when the format of the object to be analyzed is a text format, using the target text expert module as the target analysis module;
when the format of the object to be analyzed includes a picture format and a text format, using the target picture expert module, the target text expert module, and the target picture-text expert module as the target analysis modules.
8. A multi-expert-based picture-text model generation apparatus, comprising:
a sample acquisition module, configured to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture;
a first input module, configured to determine an initial picture vector based on the sample picture in any one of the training samples, and input the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector;
a second input module, configured to determine an initial text vector based on the sample text in said training sample, and input the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector;
a prediction module, configured to determine a picture-text target vector according to the first target vector and the second target vector, input the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
a model training module, configured to determine a model loss value of the preset picture-text model based on the first prediction score and the real label, and train the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
  9. 一种计算机可读存储介质,其上存储有计算机可读指令,其中,所述计算机可读指令被处理器执行时实现基于多专家的图文模型生成方法,包括:A computer-readable storage medium on which computer-readable instructions are stored, wherein when the computer-readable instructions are executed by a processor, a multi-expert-based graphic model generation method is implemented, including:
    获取训练样本集合,其中,所述训练样本集合包括多个训练样本,每个所述训练样本包括样本图片和样本文本,所述样本文本带有指示与所述样本图片之间关系的真实标签;基于任一所述训练样本中的所述样本图片,确定初始图片向量,并将所述初始图片向量输入至预设图片文本模型的初始图片专家模块,得到第一目标向量;基于所述任一所述训练样本中的所述样本文本,确定初始文本向量,并将所述初始文本向量输入至所述预设图片文本模型的初始文本专家模块,得到第二目标向量;依据所述第一目标向量以及所述第二目标向量,确定图片文本目标向量,将所述图片文本目标向量输入至所述预设图片文本模型的初始图片文本专家模块,并基于输出结果以及全连接层,得到所述样本图片与所述样本文本之间的第一预测分值;基于所述第一预测分值以及所述真实标签,确定所述预设图片文本模型的模型损失值,并基于所述模型损失值对所述预设图片文本模型进行训练,得到所述基于多专家的图文模型。Obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture; Based on the sample picture in any of the training samples, determine an initial picture vector, and input the initial picture vector to the initial picture expert module of the preset picture text model to obtain the first target vector; based on any of the Determine an initial text vector for the sample text in the training sample, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector; according to the first target vector and the second target vector, determine the picture text target vector, input the picture text target vector to the initial picture text expert module of the preset picture text model, and based on the output result and the fully connected layer, obtain the The first prediction score between the sample picture and the sample text; based on the first prediction score and the real label, determine the model loss value of the preset picture text model, and based on the model loss value The preset picture and text model is trained to obtain the multi-expert based picture and text model.
  10. The computer-readable storage medium according to claim 9, wherein when the computer-readable instructions are executed by the processor, the determining an initial picture vector based on the sample picture in any one of the training samples specifically comprises:
    determining picture dimensions of the sample picture, wherein the picture dimensions comprise a picture height and/or a picture width; dividing the picture height and/or picture width of the sample picture on the basis of a preset division size to obtain sub-sample pictures corresponding to the sample picture; and converting the sub-sample pictures, by means of a preset conversion tool, into the initial picture vector corresponding to each of the sub-sample pictures.
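    (As a concrete illustration of claim 10, the sketch below splits a sample picture into fixed-size sub-sample pictures and flattens each one into an initial picture vector. The 16-pixel division size and the flatten-style conversion stand in for the "preset conversion tool" and are assumptions, not requirements of the claim.)

        import torch

        def split_into_subsamples(picture: torch.Tensor, patch: int = 16) -> torch.Tensor:
            """Divide the picture height and width by a preset division size into sub-sample pictures."""
            c, h, w = picture.shape  # picture dimensions: channels, picture height, picture width
            patches = picture.unfold(1, patch, patch).unfold(2, patch, patch)  # (c, h//patch, w//patch, patch, patch)
            patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
            return patches  # one flattened initial picture vector per sub-sample picture

        # Hypothetical usage: a 224x224 RGB sample picture yields 196 initial picture vectors of length 768.
        # A linear projection could then map each vector to the expert module's input dimension.
        picture = torch.rand(3, 224, 224)
        initial_picture_vectors = split_into_subsamples(picture)  # shape (196, 768)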
  11. The computer-readable storage medium according to claim 9, wherein when the computer-readable instructions are executed by the processor, the determining an initial text vector based on the sample text in said any one of the training samples specifically comprises:
    determining, from a preset word vector database, the word vector corresponding to each word in the sample text, and splicing the word vectors corresponding to the words in the sample text to obtain the initial text vector.
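    (A minimal sketch of claim 11: each word of the sample text is looked up in a preset word vector database and the per-word vectors are spliced into the initial text vector. The dictionary-backed database, the whitespace tokenization, and the zero vector for out-of-vocabulary words are illustrative assumptions.)

        import numpy as np

        # Hypothetical preset word vector database: word -> fixed-length vector.
        DIM = 2
        word_vector_db = {"cat": np.array([0.1, 0.3]), "runs": np.array([0.4, 0.1])}

        def build_initial_text_vector(sample_text: str) -> np.ndarray:
            """Look up the word vector for each word and splice them into the initial text vector."""
            vectors = [word_vector_db.get(w, np.zeros(DIM)) for w in sample_text.split()]
            return np.concatenate(vectors) if vectors else np.zeros(0)

        initial_text_vector = build_initial_text_vector("cat runs")  # shape (4,)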
  12. The computer-readable storage medium according to claim 10 or 11, wherein when the computer-readable instructions are executed by the processor, the determining a picture-text target vector according to the first target vector and the second target vector specifically comprises:
    splicing the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector; and splicing the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
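    (Claim 12 reduces to two concatenations, sketched below; the ordering, with the picture splicing vector placed before the text target vector, is an assumption the claim itself leaves open.)

        import torch
        from typing import List

        def build_picture_text_target(first_targets: List[torch.Tensor], second_target: torch.Tensor) -> torch.Tensor:
            """Splice the per-sub-sample-picture first target vectors, then splice the result with the second target vector."""
            picture_splicing_vector = torch.cat(first_targets, dim=0)
            return torch.cat([picture_splicing_vector, second_target], dim=0)  # picture-text target vector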
  13. The computer-readable storage medium according to claim 9, wherein when the computer-readable instructions are executed by the processor, the determining a model loss value of the preset picture-text model on the basis of the first prediction score and the real label and training the preset picture-text model on the basis of the model loss value to obtain the multi-expert-based picture-text model specifically comprises:
    determining the model loss value of the preset picture-text model by means of a preset cross-entropy loss function on the basis of the first prediction score and the real label corresponding to each training sample in the training sample set; when the model loss value is greater than a preset loss threshold, adjusting, according to the model loss value, module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model, obtaining, by means of the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text, and calculating the model loss value again; and when the model loss value is less than or equal to the preset loss threshold, obtaining the multi-expert-based picture-text model.
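    (The training procedure in claim 13 amounts to an iterate-until-threshold loop over the training sample set with a cross-entropy loss. The sketch below assumes the binary form of that loss, a gradient-based parameter update, and the PresetPictureTextModel sketched earlier; all three are assumptions, not details fixed by the claim.)

        import torch
        import torch.nn.functional as F

        def train_until_threshold(model, training_samples, loss_threshold=0.1, lr=1e-4, max_rounds=100):
            """Adjust module parameters until the model loss value falls to or below the preset loss threshold."""
            optimizer = torch.optim.Adam(model.parameters(), lr=lr)
            for _ in range(max_rounds):
                scores, labels = [], []
                for picture_vectors, text_vectors, real_label in training_samples:
                    scores.append(model(picture_vectors, text_vectors))  # prediction score for this sample
                    labels.append(real_label)
                # Preset cross-entropy loss over all samples in the training sample set.
                loss = F.binary_cross_entropy(torch.cat(scores).squeeze(-1), torch.stack(labels).float())
                if loss.item() <= loss_threshold:
                    break  # multi-expert-based picture-text model obtained
                optimizer.zero_grad()
                loss.backward()   # adjust module parameters according to the model loss value
                optimizer.step()
            return model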
  14. The computer-readable storage medium according to claim 9, wherein when the computer-readable instructions are executed by the processor, after the multi-expert-based picture-text model is obtained, the method further comprises:
    receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module comprises at least one of a target picture expert module, a target text expert module and a target picture-text expert module; and converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so as to obtain a target result by means of the target output vector.
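    (Claim 14 describes format-based routing of an object to be analyzed to the matching expert module inside the trained model. The sketch below routes by file extension; the extension-to-module mapping and the attribute names, carried over from the earlier model sketch, are hypothetical.)

        def select_target_module(model, object_path: str):
            """Pick the target analysis module according to the format of the object to be analyzed."""
            path = object_path.lower()
            if path.endswith((".jpg", ".jpeg", ".png", ".bmp")):
                return model.picture_expert        # picture-format object -> target picture expert module
            if path.endswith((".txt",)):
                return model.text_expert           # text-format object -> target text expert module
            return model.picture_text_expert       # mixed picture-text object -> target picture-text expert module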
  15. A computer device, comprising a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor, wherein when the processor executes the computer-readable instructions, a multi-expert-based picture-text model generation method is implemented, comprising:
    obtaining a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample comprises a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture; determining an initial picture vector based on the sample picture in any one of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector; determining an initial text vector based on the sample text in said any one of the training samples, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector; determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text on the basis of the output result and a fully connected layer; and determining a model loss value of the preset picture-text model on the basis of the first prediction score and the real label, and training the preset picture-text model on the basis of the model loss value to obtain the multi-expert-based picture-text model.
  16. The computer device according to claim 15, wherein when the processor executes the computer-readable instructions, the determining an initial picture vector based on the sample picture in any one of the training samples specifically comprises:
    determining picture dimensions of the sample picture, wherein the picture dimensions comprise a picture height and/or a picture width; dividing the picture height and/or picture width of the sample picture on the basis of a preset division size to obtain sub-sample pictures corresponding to the sample picture; and converting the sub-sample pictures, by means of a preset conversion tool, into the initial picture vector corresponding to each of the sub-sample pictures.
  17. The computer device according to claim 15, wherein when the processor executes the computer-readable instructions, the determining an initial text vector based on the sample text in said any one of the training samples specifically comprises:
    determining, from a preset word vector database, the word vector corresponding to each word in the sample text, and splicing the word vectors corresponding to the words in the sample text to obtain the initial text vector.
  18. The computer device according to claim 16 or 17, wherein when the processor executes the computer-readable instructions, the determining a picture-text target vector according to the first target vector and the second target vector specifically comprises:
    splicing the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector; and splicing the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
  19. The computer device according to claim 15, wherein when the processor executes the computer-readable instructions, the determining a model loss value of the preset picture-text model on the basis of the first prediction score and the real label and training the preset picture-text model on the basis of the model loss value to obtain the multi-expert-based picture-text model specifically comprises:
    determining the model loss value of the preset picture-text model by means of a preset cross-entropy loss function on the basis of the first prediction score and the real label corresponding to each training sample in the training sample set; when the model loss value is greater than a preset loss threshold, adjusting, according to the model loss value, module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model, obtaining, by means of the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text, and calculating the model loss value again; and when the model loss value is less than or equal to the preset loss threshold, obtaining the multi-expert-based picture-text model.
  20. The computer device according to claim 15, wherein when the processor executes the computer-readable instructions, after the multi-expert-based picture-text model is obtained, the method further comprises:
    receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module comprises at least one of a target picture expert module, a target text expert module and a target picture-text expert module; and converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so as to obtain a target result by means of the target output vector.
PCT/CN2022/089730 2022-03-09 2022-04-28 Picture-text model generation method and apparatus based on multiple experts, and device and medium WO2023168811A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210232059.8 2022-03-09
CN202210232059.8A CN114610919A (en) 2022-03-09 2022-03-09 Multi-expert-based image-text model generation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2023168811A1 (en)

Family

ID=81861502

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089730 WO2023168811A1 (en) 2022-03-09 2022-04-28 Picture-text model generation method and apparatus based on multiple experts, and device and medium

Country Status (2)

Country Link
CN (1) CN114610919A (en)
WO (1) WO2023168811A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098138A1 (en) * 2015-10-06 2017-04-06 Adobe Systems Incorporated Font Attributes for Font Recognition and Similarity
CN110781633A (en) * 2019-10-30 2020-02-11 广东博智林机器人有限公司 Image-text design quality detection method, device and system based on deep learning model
CN111310041A (en) * 2020-02-12 2020-06-19 腾讯科技(深圳)有限公司 Image-text publishing method, model training method and device and storage medium
CN113283551A (en) * 2021-07-22 2021-08-20 智者四海(北京)技术有限公司 Training method and training device of multi-mode pre-training model and electronic equipment

Also Published As

Publication number Publication date
CN114610919A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
KR102401942B1 (en) Method and apparatus for evaluating translation quality
JP7331171B2 (en) Methods and apparatus for training image recognition models, methods and apparatus for recognizing images, electronic devices, storage media, and computer programs
US20220230420A1 (en) Artificial intelligence-based object detection method and apparatus, device, and storage medium
WO2020134571A1 (en) Page display method and apparatus, terminal device and storage medium
CN110555795A (en) High resolution style migration
CN111027563A (en) Text detection method, device and recognition system
CN111666416B (en) Method and device for generating semantic matching model
CN109902763B (en) Method and device for generating feature map
WO2024036847A1 (en) Image processing method and apparatus, and electronic device and storage medium
CN109948699B (en) Method and device for generating feature map
CN116127020A (en) Method for training generated large language model and searching method based on model
CN116127046A (en) Training method for generating large language model and man-machine voice interaction method based on model
CN116127045A (en) Training method for generating large language model and man-machine voice interaction method based on model
US11195048B2 (en) Generating descriptions of image relationships
CN112084920B (en) Method, device, electronic equipment and medium for extracting hotwords
US20220215177A1 (en) Method and system for processing sentence, and electronic device
WO2023168812A1 (en) Optimization method and apparatus for search system, and storage medium and computer device
US20230289402A1 (en) Joint perception model training method, joint perception method, device, and storage medium
CN116244416A (en) Training method for generating large language model and man-machine voice interaction method based on model
CN112084959A (en) Crowd image processing method and device
CN111738010B (en) Method and device for generating semantic matching model
US20210279589A1 (en) Electronic device and control method thereof
WO2023168811A1 (en) Picture-text model generation method and apparatus based on multiple experts, and device and medium
CN113361384A (en) Face recognition model compression method, device, medium, and computer program product
CN113140221A (en) Language model fusion method, device, medium and computer program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22930440

Country of ref document: EP

Kind code of ref document: A1