WO2023168811A1 - Picture-text model generation method and apparatus based on multiple experts, and device and medium - Google Patents


Info

Publication number
WO2023168811A1
WO2023168811A1 · PCT/CN2022/089730 · CN2022089730W
Authority
WO
WIPO (PCT)
Prior art keywords
text
picture
sample
vector
model
Prior art date
Application number
PCT/CN2022/089730
Other languages
French (fr)
Chinese (zh)
Inventor
谯轶轩
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023168811A1 publication Critical patent/WO2023168811A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53 Querying
    • G06F 16/532 Query formulation, e.g. graphical querying

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a multi-expert-based graphic and text model generation method and device, storage media, and computer equipment.
  • the image retrieval task includes two types: retrieval of pictures based on pictures and retrieval of text based on pictures.
  • the text retrieval task includes two types: retrieving text based on text and retrieving pictures based on text.
  • however, in the existing technology, pre-trained graphic and text models are usually single-expert models, with different personnel responsible for training, deployment and maintenance, which increases the training and maintenance costs of the models and occupies a large amount of computer resources.
  • in view of this, this application provides a multi-expert-based graphic and text model generation method and device, a storage medium and a computer device, which enable the initial picture expert module, the initial text expert module and the initial picture text expert module to be trained jointly, thereby saving model training and maintenance costs and effectively reducing the occupation of computer resources.
  • a multi-expert based graphic and text model generation method including:
  • the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture;
  • according to the first target vector and the second target vector, a picture text target vector is determined, the picture text target vector is input into the initial picture text expert module of the preset picture text model, and a first prediction score between the sample picture and the sample text is obtained based on the output result and a fully connected layer;
  • a multi-expert-based graphic and text model generation device including:
  • a sample acquisition module, used to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
  • the first input module is used to determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector to the initial picture expert module of the preset picture text model to obtain the first target vector;
  • a second input module, configured to determine an initial text vector based on the sample text in any of the training samples, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector;
  • a prediction module, configured to determine a picture text target vector based on the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer;
  • a model training module, configured to determine a model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • a computer-readable storage medium on which computer-readable instructions are stored.
  • a multi-expert-based graphic model generation method is implemented, including:
  • the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture;
  • determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture text model to obtain a first target vector; determine an initial text vector based on the sample text in the training sample, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector; determine a picture text target vector according to the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer; determine a model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • a computer device including a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor.
  • the processor executes the computer-readable instructions.
  • a multi-expert-based graphic model generation method is implemented, including:
  • the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture;
  • determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture text model to obtain a first target vector; determine an initial text vector based on the sample text in the training sample, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector; determine a picture text target vector according to the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer; determine a model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • the multi-expert based graphic and text model generation method and device, storage medium and computer equipment provided by this application can enable the initial picture expert module, the initial text expert module and the initial picture text expert module to achieve joint training , which can save model training and maintenance costs and effectively reduce the occupation of computer resources.
  • Figure 1 shows a schematic flow chart of a multi-expert-based graphic and text model generation method provided by an embodiment of the present application
  • Figure 2 shows a schematic flowchart of another multi-expert-based graphic and text model generation method provided by an embodiment of the present application
  • Figure 3 shows a schematic structural diagram of a multi-expert-based graphic and text model generation device provided by an embodiment of the present application.
  • a multi-expert-based graphic model generation method is provided, as shown in Figure 1.
  • the method includes:
  • Step 101 Obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
  • the multi-expert-based graphic and text model generation method enables the initial picture expert module, the initial text expert module and the initial picture text expert module to be trained jointly, which can save model training and maintenance costs and effectively reduce the occupation of computer resources.
  • the preset picture text model of this application mainly consists of three parts, namely the initial picture expert module, the initial text expert module and the initial picture text expert module.
  • when training is completed, a target picture expert module, a target text expert module and a target picture text expert module can be obtained correspondingly.
  • the training sample set can include multiple training samples, and each training sample can include a sample picture and a sample text.
  • the sample text can also include a real label indicating the relationship with the sample picture.
  • for example, if the sample text is a positive sample of the sample picture, that is, the sample text is an explanation of the sample picture, the real label can be 1; if the sample text is a negative sample of the sample picture, that is, the sample text is not an explanation of the sample picture, the real label can be 0.
  • Step 102 Determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture text model to obtain the first target vector;
  • the sample picture in the training sample can be converted to obtain an initial picture vector corresponding to the sample picture. Then, the initial picture vector can be input into the initial picture expert module in the preset picture text model, and then the first target vector can be output.
  • Step 103 Determine an initial text vector based on the sample text in any of the training samples, and input the initial text vector into the initial text expert module of the preset image text model to obtain the second target vector;
  • the initial text vector of the sample text corresponding to the sample picture in the training sample may also be determined. Then, the initial text vector can be input into the initial text expert module in the preset image text model, and then the second target vector can be output.
  • Step 104 Determine a picture text target vector according to the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer;
  • the picture text target vector can be further determined based on the first target vector and the second target vector.
  • the picture text target vector can then be used as the input of the initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module can be passed through the fully connected layer to output the first prediction score between the sample picture and the sample text. The first prediction score reflects the correlation between the sample text and the sample picture.
  • Step 105 Determine the model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • the model loss value of the preset picture text model can be determined based on the first prediction score and the real label of each training sample. Then, the preset picture text model can be trained based on the model loss value. After training, a multi-expert graphic and text model based on a picture expert, a text expert and a picture-text expert can be obtained.
  • a training sample set can be obtained.
  • the training sample set can include multiple training samples, where each training sample can include a sample picture and a sample text.
  • the sample text can also include a real label indicating the relationship to the sample image.
  • the sample image in the training sample can be converted to obtain the initial image vector corresponding to the sample image.
  • the initial picture vector can be input into the initial picture expert module in the preset picture text model, and then the first target vector can be output.
  • the initial text vector of the sample text corresponding to the sample picture in the training sample can also be determined.
  • the initial text vector can be input into the initial text expert module in the preset image text model, and then the second target vector can be output.
  • the picture text target vector can be further determined based on the first target vector and the second target vector.
  • the picture text target vector can be used as the input of the initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module is passed through the fully connected layer to output the first prediction score between the sample picture and the sample text.
  • the model loss value of the preset picture text model can be determined based on the first prediction score and the real label of each training sample, and the preset picture text model can be trained based on the model loss value. After training, a multi-expert graphic and text model based on a picture expert, a text expert and a picture-text expert can be obtained.
  • the embodiments of the present application can enable the initial picture expert module, the initial text expert module and the initial picture text expert module to realize joint training, which can save the training and maintenance costs of the model and effectively reduce the occupation of computer resources.
  • the method includes:
  • Step 201 Obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
  • a training sample set may be obtained.
  • the training sample set may include multiple training samples, where each training sample may include a sample picture and a sample text.
  • the sample text can also include a real label indicating the relationship with the sample picture. For example, if the sample text is a positive sample of the sample picture, that is, the sample text is an explanation of the sample picture, the real label can be 1; if the sample text is a negative sample of the sample picture, that is, the sample text is not an explanation of the sample picture, the real label can be 0.
  • Step 202 Determine the picture dimensions of the sample picture, where the picture dimensions include picture height and/or picture width;
  • the picture dimensions can be determined for the sample pictures in each training sample.
  • the picture dimensions can include picture height and picture width, and can also include the number of picture channels.
  • the image dimensions corresponding to the sample image can be H x W x C, where H represents the image height of the sample image, W represents the image width of the sample image, and C represents the number of image channels of the sample image.
  • Step 203 Divide the picture height and/or picture width of the sample picture based on the preset division size to obtain sub-sample pictures corresponding to the sample picture;
  • after the picture dimensions of the sample picture are determined, the sample picture can be divided according to the preset division size.
  • the sample picture can be divided only along the picture height, with the picture width unchanged; or only along the picture width, with the picture height unchanged.
  • the sample picture can also be divided along both the picture height and the picture width at the same time.
  • multiple sub-sample images corresponding to the sample image can be obtained.
  • for example, if the picture dimension of the sample picture is H x W x C, the sample picture can be divided into multiple sub-sample pictures of P x P x C according to the preset division size; that is, the picture dimension corresponding to each sub-sample picture is P x P x C.
  • Step 204 Convert the sub-sample pictures into the initial picture vector corresponding to each of the sub-sample pictures through a preset conversion tool;
  • each sub-sample picture can be converted into an initial picture vector corresponding to that sub-sample picture through a preset conversion tool; that is, each sub-sample picture can be directly represented by the initial picture vector corresponding to that sub-sample picture.
  • the preset conversion tool can be reshape. For example, if the picture dimension corresponding to each sub-sample picture is P x P x C, each sub-sample picture can be converted through the preset conversion tool into a vector of dimension P²C, and this P²C-dimensional vector can be used as the initial picture vector.
  • the P²C-dimensional vector corresponding to each sub-sample picture can also be converted into a one-dimensional vector of a specified dimension through dimensionality reduction, and the converted vector used as the initial picture vector.
  • Obtaining the initial image vector through dimensionality reduction can make the initial image vector more conveniently participate in subsequent operations, reduce the difficulty of subsequent operations, and increase the efficiency of operations.
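  • As an illustration of steps 202 to 204, the following minimal sketch (assuming PyTorch, with all dimension values chosen purely for illustration) divides a sample picture of dimension H x W x C into P x P x C sub-sample pictures, reshapes each into a P²C-dimensional initial picture vector, and applies an optional linear projection standing in for the dimensionality-reduction step described above:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
H, W, C = 224, 224, 3   # picture height, picture width, number of picture channels
P = 16                  # preset division size
D = 768                 # specified dimension after dimensionality reduction

picture = torch.randn(C, H, W)  # one sample picture (channel-first)

# Divide the picture along height and width into P x P x C sub-sample pictures.
patches = picture.unfold(1, P, P).unfold(2, P, P)       # (C, H/P, W/P, P, P)
patches = patches.permute(1, 2, 0, 3, 4).contiguous()   # (H/P, W/P, C, P, P)
patches = patches.view(-1, C, P, P)                     # (num_patches, C, P, P)

# "reshape" conversion: each sub-sample picture becomes a P*P*C-dimensional vector.
initial_picture_vectors = patches.reshape(patches.size(0), -1)  # (num_patches, P*P*C)

# Optional dimensionality reduction to a vector of the specified dimension D.
reduce = nn.Linear(P * P * C, D)
initial_picture_vectors = reduce(initial_picture_vectors)       # (num_patches, D)
```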
  • Step 205 Input the initial image vector to the initial image expert module of the preset image text model to obtain the first target vector;
  • Step 206 Determine, from the preset word vector database, the word vector corresponding to each word in the sample text, and splice the word vectors corresponding to the words in the sample text to obtain the initial text vector;
  • the initial picture vector corresponding to each sub-sample picture is input into the initial picture expert module of the preset picture text model, and the first target vector can be output accordingly.
  • the word vector corresponding to each word can be found in the preset word vector database, and the word vectors are then spliced according to the order of the words in the sample text to obtain the initial text vector corresponding to each sample text.
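  • A minimal sketch of this lookup-and-splice step is shown below. The dictionary-style word vector database, the embedding size and the function name are assumptions for illustration; splicing is interpreted here as stacking the per-word vectors in text order:

```python
import torch

# Hypothetical preset word-vector database: word -> embedding (illustrative values).
word_vector_db = {
    "a": torch.randn(300),
    "dog": torch.randn(300),
    "runs": torch.randn(300),
}

def build_initial_text_vector(sample_text: str) -> torch.Tensor:
    """Look up the word vector of each word and splice them in text order."""
    words = sample_text.lower().split()
    vectors = [word_vector_db[w] for w in words if w in word_vector_db]
    # The spliced result is the initial text vector fed to the initial text expert module.
    return torch.stack(vectors, dim=0)   # (num_words, 300)

initial_text_vector = build_initial_text_vector("a dog runs")
```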
  • Step 207 Input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector
  • Step 208 Splice the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector, and splice the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture text target vector;
  • the initial text vector can be input into the initial text expert module in the preset picture text model, and the second target vector can then be output. After the multiple first target vectors corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, they can be spliced to further determine the picture text target vector.
  • Step 209 Input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer;
  • the picture text target vector is used as the input of the initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module can then be passed through the fully connected layer to output the first prediction score between the sample picture and the sample text. The first prediction score reflects the correlation between the sample text and the sample picture.
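  • The splicing and scoring of steps 208 and 209 might look roughly like the following hedged sketch. The single Transformer encoder layer is only a stand-in for the initial picture text expert module, and the mean pooling and sigmoid are assumptions, not the applicant's exact architecture:

```python
import torch
import torch.nn as nn

D = 768  # hidden dimension, illustrative

# Stand-in for the initial picture text expert module (one Transformer layer).
picture_text_expert = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
fc = nn.Linear(D, 1)  # fully connected layer producing the prediction score

def first_prediction_score(first_target_vectors, second_target_vectors):
    # Splice the first target vectors of the sub-sample pictures (picture splicing vector)
    # with the second target vector of the sample text to get the picture text target vector.
    picture_text_target = torch.cat([first_target_vectors, second_target_vectors], dim=1)
    out = picture_text_expert(picture_text_target)   # (batch, seq, D)
    pooled = out.mean(dim=1)                         # simple pooling, an assumption
    return torch.sigmoid(fc(pooled)).squeeze(-1)     # score between 0 and 1

# Example shapes: 196 picture tokens, 12 text tokens, batch size 2.
score = first_prediction_score(torch.randn(2, 196, D), torch.randn(2, 12, D))
```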
  • Step 210 Determine the model loss value of the preset picture text model through a preset cross-entropy loss function based on the first prediction score corresponding to each training sample in the training sample set and the real label;
  • the model loss value of the preset image text model can be calculated through the preset cross-entropy loss function based on the first prediction score and the corresponding real label.
  • the preset cross-entropy loss function can be Loss = -(1/N) Σ_{i=1}^{N} [y_i · log(p_i) + (1 - y_i) · log(1 - p_i)], where y_i is the real label between the sample picture and the sample text, which can be 0 or 1, p_i is the first prediction score between the sample picture and the sample text, and N is the number of training samples in the training sample set.
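  • A minimal sketch of this binary cross-entropy computation is given below; the function name and the clamping constant are assumptions, and the built-in PyTorch loss noted in the comment is an equivalent shortcut:

```python
import torch

def model_loss(prediction_scores: torch.Tensor, real_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    p = prediction_scores.clamp(1e-7, 1 - 1e-7)   # numerical safety
    y = real_labels.float()
    return -(y * p.log() + (1 - y) * (1 - p).log()).mean()

# Equivalent built-in form:
# torch.nn.functional.binary_cross_entropy(prediction_scores, real_labels.float())
```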
  • Step 211 When the model loss value is greater than the preset loss threshold, adjust, according to the model loss value, the module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture text expert module in the preset picture text model to obtain an updated preset picture text model, obtain a second prediction score between each sample picture and the sample text through the updated preset picture text model and the fully connected layer, and calculate the model loss value again;
  • if the model loss value is less than or equal to the preset loss threshold, the preset picture text model can be used directly as the final multi-expert-based graphic and text model.
  • if the model loss value is greater than the preset loss threshold, it means that the accuracy of the preset picture text model has not reached expectations.
  • in this case, the parameters of the preset picture text model can be further adjusted; specifically, the parameters of one or several of the initial picture expert module, the initial text expert module and the initial picture text expert module can be adjusted. After the parameter adjustment, an updated preset picture text model can be obtained.
  • based on the training sample set, the second prediction score corresponding to each training sample can then be obtained, and the model loss value of the updated preset picture text model can be calculated again through the preset cross-entropy loss function based on the second prediction scores and the corresponding real labels.
  • the relationship between the model loss value and the preset loss threshold can then be judged again. When the model loss value is still greater than the preset loss threshold, the parameters of the updated preset picture text model are adjusted once more, third prediction scores are calculated, and the model loss value is computed from the third prediction scores and the real labels. This process of adjusting the model parameters of the preset picture text model and calculating the model loss value is repeated until the model loss value is less than or equal to the preset loss threshold.
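  • The iterative procedure of steps 210 to 212 can be summarized by the following hedged sketch. The optimizer, learning rate, threshold value, round limit and the model's forward signature are all assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def train_until_threshold(model, dataloader, loss_threshold=0.05, max_rounds=100):
    """Adjust module parameters until the model loss value is no longer above the threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(max_rounds):
        total_loss = 0.0
        for sample_pictures, sample_texts, real_labels in dataloader:
            scores = model(sample_pictures, sample_texts)      # prediction scores in (0, 1)
            loss = F.binary_cross_entropy(scores, real_labels.float())
            optimizer.zero_grad()
            loss.backward()                                    # adjust module parameters
            optimizer.step()
            total_loss += loss.item()
        model_loss_value = total_loss / len(dataloader)
        if model_loss_value <= loss_threshold:                 # accuracy has reached expectations
            break
    return model
```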
  • Step 212 When the model loss value is less than or equal to the preset loss threshold, the multi-expert based graphic and text model is obtained.
  • when the model loss value is less than or equal to the preset loss threshold, it means that the accuracy of the model has reached expectations, and the multi-expert-based graphic and text model is obtained.
  • this application simultaneously trains the initial image expert module, the initial text expert module and the initial image text expert module.
  • each module is equivalent to a Transformer layer of the original BERT model; the initial picture expert module and the initial text expert module can correspond to F of the layers, and the initial picture text expert module corresponds to the remaining (L-F) layers.
  • the embodiment of this application can flexibly configure the sizes of L and F during training according to the resource and time requirements of the actual business, so that the training of the model is closer to actual business needs. In addition, the initial picture expert module and the initial text expert module share the parameters of the multi-head attention layer during training, which greatly reduces the number of parameters of the model and reduces the GPU memory required when the model is deployed.
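  • The layer split described above (F expert layers with shared multi-head attention, plus L-F picture-text layers) might be configured roughly as in the sketch below. The class name, the separate feed-forward branches and the simplified residual connections are hypothetical and omit details such as layer normalization:

```python
import torch.nn as nn

class ExpertLayer(nn.Module):
    """One Transformer-style layer in which the picture expert and the text expert
    share the multi-head attention parameters but keep separate feed-forward parts."""
    def __init__(self, d_model=768, nhead=8):
        super().__init__()
        self.shared_attention = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.picture_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                         nn.Linear(4 * d_model, d_model))
        self.text_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                      nn.Linear(4 * d_model, d_model))

    def forward(self, x, modality: str):
        attn_out, _ = self.shared_attention(x, x, x)
        ffn = self.picture_ffn if modality == "picture" else self.text_ffn
        return x + ffn(x + attn_out)   # simplified residual connections

# L total layers: the first F serve the picture/text experts, the rest the picture-text expert.
L, F = 12, 6   # freely configurable according to resource and time requirements
expert_layers = nn.ModuleList(ExpertLayer() for _ in range(F))
picture_text_layers = nn.ModuleList(nn.TransformerEncoderLayer(768, 8, batch_first=True)
                                    for _ in range(L - F))
```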
  • the method further includes: receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based graphic and text model according to the format of the object to be analyzed, wherein the target analysis module includes at least one of a target picture expert module, a target text expert module and a target picture text expert module; converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so as to obtain a target result through the target output vector.
  • one or more modules can be directly determined and used from the multi-expert-based graphic model according to the object to be analyzed.
  • an object to be analyzed can be received.
  • the object to be analyzed can be a picture or text.
  • the format of the object to be analyzed can be analyzed, and the selected module can be determined based on the format of the object to be analyzed.
  • the object to be analyzed can be converted into the corresponding target input vector, and then the target input vector can be input into the target analysis module, and the target output vector corresponding to the object to be analyzed can be output.
  • the target result can be obtained later by using the target output vector.
  • for example, when the object to be analyzed is in text format, after the target output vector corresponding to the object to be analyzed is obtained, the most similar vectors can be retrieved through a corresponding similarity index to find texts or pictures similar to the object to be analyzed.
  • when converting the object to be analyzed, a picture can be divided into sub-pictures that are then converted into the corresponding target input vectors, or, for text, the word vector corresponding to each word can be found and the word vectors spliced together and converted into the target input vector.
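  • As a usage illustration of the similarity lookup mentioned above, the target output vector could be compared against an index of pre-computed candidate vectors with cosine similarity; the function name, the candidate index and top_k are assumptions:

```python
import torch
import torch.nn.functional as F

def most_similar(target_output_vector: torch.Tensor,
                 candidate_vectors: torch.Tensor, top_k: int = 5):
    """Return indices of the candidates most similar to the object to be analyzed."""
    sims = F.cosine_similarity(target_output_vector.unsqueeze(0), candidate_vectors, dim=-1)
    return sims.topk(top_k).indices

# candidate_vectors would hold pre-computed vectors of the pictures or texts in the library.
```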
  • determining the corresponding target analysis module from the multi-expert-based graphic and text model according to the format of the object to be analyzed specifically includes: when the format of the object to be analyzed is a picture format, using the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, using the target text expert module as the target analysis module; and when the format of the object to be analyzed includes both a picture format and a text format, using the target picture expert module, the target text expert module and the target picture text expert module as the target analysis modules.
  • the target analysis module can be determined according to the format of the object to be analyzed.
  • when the format of the object to be analyzed is a picture format, the target picture expert module in the multi-expert-based graphic and text model can be used as the target analysis module;
  • when the format of the object to be analyzed is a text format, the target text expert module in the multi-expert-based graphic and text model can be used as the target analysis module;
  • when the format of the object to be analyzed includes both a picture format and a text format, the target picture expert module, the target text expert module and the target picture text expert module in the multi-expert-based graphic and text model are all used as target analysis modules.
  • in this case, after the object to be analyzed in text format is converted into a target input vector, the corresponding output vector is obtained through the target text expert module; after the object to be analyzed in picture format is converted into a target input vector, the corresponding output vector is obtained through the target picture expert module. Finally, the output vector of the target text expert module and the output vector of the target picture expert module are spliced and used as the input of the target picture text expert module to obtain the target output vector.
  • that is, the vector corresponding to the object to be analyzed in picture format is first output through the target picture expert module, and the vector corresponding to the object to be analyzed in text format is output through the target text expert module; the two vectors are then spliced and input into the target picture text expert module, which can improve the accuracy of the target output vector of the target picture text expert module and help improve subsequent use effects.
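  • For an object to be analyzed containing both a picture and a text, the flow just described might be sketched as follows; the function and argument names are placeholders, and each expert is assumed to be a callable module:

```python
import torch

def analyze_mixed_object(picture_input_vector, text_input_vector,
                         picture_expert, text_expert, picture_text_expert):
    """Hedged sketch: run each modality through its expert, splice, then fuse."""
    picture_out = picture_expert(picture_input_vector)   # target picture expert module
    text_out = text_expert(text_input_vector)            # target text expert module
    spliced = torch.cat([picture_out, text_out], dim=1)  # splice the two output vectors
    return picture_text_expert(spliced)                  # target output vector
```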
  • the embodiment of the present application provides a multi-expert-based graphic and text model generation device, as shown in Figure 3, the device includes:
  • a sample acquisition module, used to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
  • the first input module is used to determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector to the initial picture expert module of the preset picture text model to obtain the first target vector;
  • a second input module, configured to determine an initial text vector based on the sample text in any of the training samples, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector;
  • a prediction module, configured to determine a picture text target vector based on the first target vector and the second target vector, input the picture text target vector into the initial picture text expert module of the preset picture text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and the fully connected layer;
  • a model training module, configured to determine a model loss value of the preset picture text model based on the first prediction score and the real label, and train the preset picture text model based on the model loss value to obtain the multi-expert-based graphic and text model.
  • the first input module is specifically used for:
  • determine the picture dimensions of the sample picture, where the picture dimensions include picture height and/or picture width; divide the picture height and/or picture width of the sample picture based on the preset division size to obtain the sub-sample pictures corresponding to the sample picture; and convert the sub-sample pictures into the initial picture vector corresponding to each of the sub-sample pictures through a preset conversion tool.
  • the second input module is specifically used for:
  • determine, from the preset word vector database, the word vector corresponding to each word in the sample text, and splice the word vectors corresponding to the words in the sample text to obtain the initial text vector.
  • the prediction module is specifically used for:
  • the first target vector corresponding to each of the sub-sample pictures is spliced to obtain a picture splicing vector; the picture splicing vector is spliced with the second target vector corresponding to the sample text to obtain the picture text target vector.
  • the model training module is specifically used for:
  • determining the model loss value of the preset picture text model through a preset cross-entropy loss function based on the first prediction score corresponding to each training sample in the training sample set and the real label; when the model loss value is greater than the preset loss threshold, adjusting, according to the model loss value, the module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture text expert module in the preset picture text model to obtain an updated preset picture text model, obtaining a second prediction score between each sample picture and the sample text through the updated preset picture text model and the fully connected layer, and calculating the model loss value again; and when the model loss value is less than or equal to the preset loss threshold, obtaining the multi-expert-based graphic and text model.
  • the device also includes:
  • a receiving module, configured to receive an object to be analyzed after the multi-expert-based graphic and text model is obtained, and determine the corresponding target analysis module from the multi-expert-based graphic and text model according to the format of the object to be analyzed.
  • the target analysis module includes at least one of a target picture expert module, a target text expert module, and a target picture text expert module;
  • a third input module, used to convert the object to be analyzed into the corresponding target input vector, and input the target input vector into the target analysis module to obtain the target output vector corresponding to the object to be analyzed, so as to obtain the target result through the target output vector.
  • the receiving module is specifically used for:
  • when the format of the object to be analyzed is a picture format, use the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, use the target text expert module as the target analysis module;
  • when the format of the object to be analyzed includes both a picture format and a text format, use the target picture expert module, the target text expert module and the target picture text expert module as the target analysis modules.
  • embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores computer-readable instructions. When the computer-readable instructions are executed by the processor, the above-mentioned multi-expert-based graphic and text model generation method shown in Figures 1 to 2 is implemented.
  • the technical solution of this application can be embodied in the form of a software product.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.), or a volatile storage medium.
  • the storage medium includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in each implementation scenario of this application.
  • embodiments of the present application also provide a computer device, which includes a processor, a memory and computer-readable instructions stored in the memory and executable on the processor, wherein the memory and the processor are both arranged on a bus, and when the processor executes the computer-readable instructions, the above-mentioned multi-expert-based graphic and text model generation method shown in Figures 1 to 2 is implemented.
  • the computer device may also include a user interface, a network interface, a camera, a radio frequency (Radio Frequency, RF) circuit, a sensor, an audio circuit, a WI-FI module, etc.
  • the user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), etc.
  • optionally, the user interface may also include a USB interface, a card reader interface, etc.
  • Optional network interfaces may include standard wired interfaces, wireless interfaces (such as Bluetooth interfaces, WI-FI interfaces), etc.
  • it can be understood that the above structure does not constitute a limitation on the computer device, which may include more or fewer components, or combine certain components, or have a different arrangement of components.
  • the storage medium may also include an operating system and a network communication module.
  • an operating system is a program that manages and maintains the hardware and software resources of the computer device and supports the operation of information processing programs and other software and/or programs.
  • the network communication module is used to implement communication between components within the storage medium, as well as communication with other hardware and software in the physical device.
  • a training sample set can be obtained.
  • the training sample set can include multiple training samples, and each training sample can include a sample picture and a sample text.
  • the sample text can also include a real label indicating the relationship to the sample image.
  • the sample image in the training sample can be converted to obtain the initial image vector corresponding to the sample image.
  • the initial picture vector can be input into the initial picture expert module in the preset picture text model, and then the first target vector can be output.
  • the initial text vector of the sample text corresponding to the sample picture in the training sample can also be determined. Then, the initial text vector can be input into the initial text expert module in the preset picture text model, and the second target vector can be output. After the first target vector corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, the picture text target vector can be further determined based on the first target vector and the second target vector. The picture text target vector can then be used as the input of the initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module is passed through the fully connected layer to output the first prediction score between the sample picture and the sample text.
  • the model loss value of the preset picture text model can be determined based on the first prediction score and the real label of each training sample, and the preset picture text model can be trained based on the model loss value. After training, a multi-expert graphic and text model based on a picture expert, a text expert and a picture-text expert can be obtained.
  • the embodiments of the present application can enable the initial picture expert module, the initial text expert module and the initial picture text expert module to realize joint training, which can save the training and maintenance costs of the model and effectively reduce the occupation of computer resources.
  • the accompanying drawings are only schematic diagrams of preferred implementation scenarios, and the modules or processes in the accompanying drawings are not necessarily required for implementing the present application.
  • the modules in the devices in the implementation scenario can be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or can be correspondingly changed and located in one or more devices different from the implementation scenario.
  • the modules of the above implementation scenarios can be combined into one module or further split into multiple sub-modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application relates to the field of artificial intelligence. Disclosed are a picture-text model generation method and apparatus based on multiple experts, and a storage medium and a computer device. The method comprises: acquiring a training sample set; determining an initial picture vector on the basis of a sample picture in a training sample, and inputting the initial picture vector into an initial picture expert module, so as to obtain a first target vector; determining an initial text vector on the basis of sample text in the training sample, and inputting the initial text vector into an initial text expert module, so as to obtain a second target vector; determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module, and obtaining a first predicted score on the basis of an output result and a fully-connected layer; and determining a model loss value of a preset picture-text model on the basis of the first predicted score and a real label, and training the preset picture-text model on the basis of the model loss value, so as to obtain a picture-text model based on multiple experts.

Description

A multi-expert-based picture-text model generation method, apparatus, device and medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 9, 2022, with application number 202210232059.8 and entitled "A multi-expert-based picture-text model generation method, apparatus, device and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a multi-expert-based picture-text model generation method and apparatus, a storage medium, and a computer device.
Background Art
At present, large-scale picture-text pre-training is usually used to solve the following types of problems: picture retrieval tasks, text retrieval tasks, and complex picture-text reasoning tasks. The picture retrieval task includes two types, retrieving pictures based on pictures and retrieving text based on pictures; the text retrieval task likewise includes retrieving text based on text and retrieving pictures based on text.
However, the inventor found that, in the prior art, pre-trained picture-text models are usually single-expert models, with different personnel responsible for training, deployment and maintenance, which increases the training and maintenance costs of the models and occupies a large amount of computer resources.
Summary of the Invention
In view of this, this application provides a multi-expert-based picture-text model generation method and apparatus, a storage medium, and a computer device, which enable the initial picture expert module, the initial text expert module and the initial picture-text expert module to be trained jointly, thereby saving model training and maintenance costs and effectively reducing the occupation of computer resources.
According to one aspect of this application, a multi-expert-based picture-text model generation method is provided, including:
obtaining a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
determining an initial picture vector based on the sample picture in any of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector;
determining an initial text vector based on the sample text in the training sample, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector;
determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
According to another aspect of this application, a multi-expert-based picture-text model generation apparatus is provided, including:
a sample acquisition module, used to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture;
a first input module, used to determine an initial picture vector based on the sample picture in any of the training samples, and input the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector;
a second input module, used to determine an initial text vector based on the sample text in the training sample, and input the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector;
a prediction module, used to determine a picture-text target vector according to the first target vector and the second target vector, input the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
a model training module, used to determine a model loss value of the preset picture-text model based on the first prediction score and the real label, and train the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
According to yet another aspect of this application, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, a multi-expert-based picture-text model generation method is implemented, including:
obtaining a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture; determining an initial picture vector based on the sample picture in any of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector; determining an initial text vector based on the sample text in the training sample, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector; determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer; determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
According to still another aspect of this application, a computer device is provided, including a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor. When the processor executes the computer-readable instructions, a multi-expert-based picture-text model generation method is implemented, including:
obtaining a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating the relationship with the sample picture; determining an initial picture vector based on the sample picture in any of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector; determining an initial text vector based on the sample text in the training sample, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector; determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer; determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
Through the above technical solution, the multi-expert-based picture-text model generation method and apparatus, storage medium and computer device provided by this application enable the initial picture expert module, the initial text expert module and the initial picture-text expert module to be trained jointly, which can save model training and maintenance costs and effectively reduce the occupation of computer resources.
The above description is only an overview of the technical solutions of this application. In order to understand the technical means of this application more clearly so that they can be implemented according to the content of the description, and in order to make the above and other objects, features and advantages of this application more obvious and understandable, specific embodiments of this application are set out below.
Brief Description of the Drawings
The drawings described here are used to provide a further understanding of this application and constitute a part of this application. The illustrative embodiments of this application and their descriptions are used to explain this application and do not constitute an improper limitation of this application. In the drawings:
Figure 1 shows a schematic flowchart of a multi-expert-based picture-text model generation method provided by an embodiment of this application;
Figure 2 shows a schematic flowchart of another multi-expert-based picture-text model generation method provided by an embodiment of this application;
Figure 3 shows a schematic structural diagram of a multi-expert-based picture-text model generation apparatus provided by an embodiment of this application.
Detailed Description
This application will be described in detail below with reference to the drawings and in conjunction with embodiments. It should be noted that, as long as there is no conflict, the embodiments of this application and the features in the embodiments can be combined with each other.
在本实施例中提供了一种基于多专家的图文模型生成方法,如图1所示,该方法包括:In this embodiment, a multi-expert-based graphic model generation method is provided, as shown in Figure 1. The method includes:
Step 101: obtain a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture.
The multi-expert-based picture-text model generation method provided by the embodiments of this application allows the initial picture expert module, the initial text expert module, and the initial picture-text expert module to be trained jointly, which saves model training and maintenance costs and effectively reduces the occupation of computer resources. The preset picture-text model of this application mainly consists of three parts: the initial picture expert module, the initial text expert module, and the initial picture-text expert module; after training, the corresponding target picture expert module, target text expert module, and target picture-text expert module are generated. First, a training sample set is obtained. The training sample set may include multiple training samples, and each training sample may include one sample picture and one sample text. In addition, the sample text carries a real label indicating its relationship with the sample picture. For example, if the sample text is a positive sample of the sample picture, i.e., the sample text is a description of the sample picture, the real label may be 1; if the sample text is a negative sample of the sample picture, i.e., the sample text is not a description of the sample picture, the real label may be 0.
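As a concrete illustration of this data layout, the following minimal Python sketch shows one way such labelled picture-text pairs could be organised; the class name, field names, and file names are illustrative assumptions and do not come from the original disclosure.

```python
from dataclasses import dataclass

@dataclass
class TrainingSample:
    """One training sample: a picture, a candidate text, and a 0/1 real label."""
    image_path: str   # path to the sample picture
    text: str         # candidate description of the picture
    label: int        # 1 if the text describes the picture (positive), 0 otherwise

# A tiny illustrative training set: one positive and one negative picture-text pair.
training_set = [
    TrainingSample("dog.jpg", "a dog running on the grass", 1),
    TrainingSample("dog.jpg", "a plate of pasta on a table", 0),
]
```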
Step 102: determine an initial picture vector based on the sample picture in any one of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture-text model to obtain a first target vector.
In this embodiment, for each training sample in the training sample set, the sample picture in the training sample is converted to obtain the initial picture vector corresponding to the sample picture. The initial picture vector is then input into the initial picture expert module of the preset picture-text model, which outputs the first target vector.
Step 103: determine an initial text vector based on the sample text in said training sample, and input the initial text vector into the initial text expert module of the preset picture-text model to obtain a second target vector.
In this embodiment, the initial text vector of the sample text corresponding to the sample picture in the training sample is also determined. The initial text vector is then input into the initial text expert module of the preset picture-text model, which outputs the second target vector.
Step 104: determine a picture-text target vector according to the first target vector and the second target vector, input the picture-text target vector into the initial picture-text expert module of the preset picture-text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer.
In this embodiment, after the first target vector corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, the picture-text target vector is further determined on the basis of the two. The picture-text target vector is then fed into the initial picture-text expert module of the preset picture-text model, and the output of that module is passed through the fully connected layer to produce the first prediction score between the sample picture and the sample text. The first prediction score reflects the degree of association between the sample text and the sample picture.
Step 105: determine a model loss value of the preset picture-text model based on the first prediction score and the real label, and train the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
In this embodiment, after the first prediction score between the sample picture and the sample text of each training sample is determined, the model loss value of the preset picture-text model is determined according to the first prediction score and the real label of each training sample. The preset picture-text model is then trained on the basis of this model loss value, and after training a multi-expert picture-text model based on a picture expert, a text expert, and a picture-text expert is obtained.
By applying the technical solution of this embodiment, a training sample set is first obtained, in which each training sample includes one sample picture and one sample text, and the sample text carries a real label indicating its relationship with the sample picture. For each training sample, the sample picture is converted into its corresponding initial picture vector, which is input into the initial picture expert module of the preset picture-text model to output the first target vector; the initial text vector of the corresponding sample text is likewise determined and input into the initial text expert module to output the second target vector. The picture-text target vector is then determined from the first and second target vectors and input into the initial picture-text expert module, whose output is passed through the fully connected layer to obtain the first prediction score between the sample picture and the sample text. The model loss value of the preset picture-text model is determined from the first prediction score and the real label of each training sample, and the model is trained on this loss value to obtain the multi-expert picture-text model based on a picture expert, a text expert, and a picture-text expert. The embodiments of this application thus allow the initial picture expert module, the initial text expert module, and the initial picture-text expert module to be trained jointly, saving model training and maintenance costs and effectively reducing the occupation of computer resources.
Further, as a refinement and extension of the specific implementation of the above embodiment, and in order to fully explain the specific implementation process of this embodiment, another multi-expert-based picture-text model generation method is provided. As shown in Figure 2, the method includes:
Step 201: obtain a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture.
In this embodiment, a training sample set is first obtained. The training sample set may include multiple training samples, and each training sample may include one sample picture and one sample text. In addition, the sample text carries a real label indicating its relationship with the sample picture; for example, if the sample text is a positive sample of the sample picture, i.e., a description of the sample picture, the real label may be 1, and if it is a negative sample, i.e., not a description of the sample picture, the real label may be 0.
Step 202: determine the picture dimensions of the sample picture, wherein the picture dimensions include a picture height and/or a picture width.
In this embodiment, the picture dimensions of the sample picture in each training sample are determined. Here the picture dimensions may include the picture height and the picture width, and may further include the number of picture channels. For example, the picture dimensions of a sample picture may be H x W x C, where H is the picture height, W is the picture width, and C is the number of picture channels of the sample picture.
Step 203: divide the picture height and/or picture width of the sample picture based on a preset division size to obtain sub-sample pictures corresponding to the sample picture.
In this embodiment, after the picture dimensions of the sample picture are determined, the sample picture is divided according to the preset division size. The sample picture may be divided only along the picture height, with the picture width unchanged; only along the picture width, with the picture height unchanged; or along both the picture height and the picture width at the same time. After the division, multiple sub-sample pictures corresponding to the sample picture are obtained. For example, if the picture dimensions of the sample picture are H x W x C, the sample picture may be divided according to the preset division size into multiple sub-sample pictures of P x P x C, i.e., the picture dimensions of each sub-sample picture are P x P x C.
Step 204: convert the sub-sample pictures into the initial picture vector corresponding to each sub-sample picture through a preset conversion tool.
In this embodiment, after the multiple sub-sample pictures corresponding to each sample picture are obtained, each sub-sample picture is converted by a preset conversion tool into the initial picture vector corresponding to that sub-sample picture, i.e., each sub-sample picture is directly represented by its initial picture vector. Here the preset conversion tool may be reshape. For example, if the picture dimensions of each sub-sample picture are P x P x C, each sub-sample picture may be converted by the preset conversion tool into a vector of dimension P²C, and this P²C vector may serve as the initial picture vector. In addition, the P²C vector of each sub-sample picture may be further converted by dimensionality reduction into a one-dimensional vector of a specified dimension, and the converted one-dimensional vector is used as the initial picture vector. Obtaining the initial picture vector through dimensionality reduction allows it to participate more conveniently in subsequent operations, reducing the difficulty and increasing the efficiency of those operations.
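The patch-and-reshape procedure described in steps 202 to 204 can be sketched as follows. This is a minimal NumPy illustration assuming square patches of side P and a channel-last image layout; the patch size of 16 and the image size are arbitrary examples, not values fixed by the application.

```python
import numpy as np

def image_to_patch_vectors(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an H x W x C image into P x P x C sub-sample pictures and flatten
    each one into a vector of length P*P*C (the initial picture vectors)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (num_patches, P*P*C)
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch * patch * C)

image = np.random.rand(224, 224, 3)          # toy stand-in for a sample picture
vectors = image_to_patch_vectors(image, 16)  # shape: (196, 768), one vector per patch
```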
Step 205: input the initial picture vector into the initial picture expert module of the preset picture-text model to obtain the first target vector.
Step 206: based on a preset word-vector database, determine from the preset word-vector database the word vector corresponding to each word of the sample text, and splice the word vectors corresponding to the words of the sample text to obtain the initial text vector.
In this embodiment, the initial picture vector corresponding to each sub-sample picture is input into the initial picture expert module of the preset picture-text model, which outputs the corresponding first target vector. In addition, for each word of the sample text, the word vector corresponding to that word is looked up in the preset word-vector database, and the word vectors are then spliced in the order of the words in the sample text to obtain the initial text vector corresponding to each sample text.
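A minimal sketch of the word-vector lookup and splicing in step 206 is given below. The toy dictionary, the embedding size d, and whitespace tokenisation are illustrative assumptions; the application itself simply looks up each word of the sample text in the preset word-vector database.

```python
import numpy as np

d = 8  # illustrative word-vector dimension
# Toy stand-in for the preset word-vector database.
word_vectors = {w: np.random.rand(d) for w in ["a", "dog", "running", "on", "the", "grass"]}
unk = np.zeros(d)  # fallback for tokens missing from the database

def text_to_initial_vector(text: str) -> np.ndarray:
    """Look up the vector of every token and splice them in their original order."""
    return np.concatenate([word_vectors.get(tok, unk) for tok in text.split()])

initial_text_vector = text_to_initial_vector("a dog running on the grass")  # shape: (6 * d,)
```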
Step 207: input the initial text vector into the initial text expert module of the preset picture-text model to obtain the second target vector.
Step 208: splice the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector, and splice the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
In this embodiment, the initial text vector is input into the initial text expert module of the preset picture-text model, which outputs the second target vector. After the multiple first target vectors corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, the first target vectors and the second target vector are spliced to further determine the picture-text target vector.
Step 209: input the picture-text target vector into the initial picture-text expert module of the preset picture-text model, and obtain the first prediction score between the sample picture and the sample text based on the output result and the fully connected layer.
In this embodiment, the picture-text target vector is input into the initial picture-text expert module of the preset picture-text model, and the output of that module is passed through the fully connected layer to output the first prediction score between the sample picture and the sample text, which reflects the degree of association between the sample text and the sample picture.
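The flow of steps 205 to 209 can be sketched as follows. The sketch is only illustrative: each expert is stood in for by a single PyTorch TransformerEncoderLayer rather than a full stack of layers, the hidden size, head count, and sequence lengths are arbitrary, and a sigmoid is assumed so that the fully connected layer yields a score between 0 and 1.

```python
import torch
import torch.nn as nn

hidden = 256  # illustrative hidden size shared by all expert modules

def expert():
    # Stand-in for one expert module (a stack of Transformer layers in the described model).
    return nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)

picture_expert = expert()                 # initial picture expert module
text_expert    = expert()                 # initial text expert module
fusion_expert  = expert()                 # initial picture-text expert module
score_head     = nn.Linear(hidden, 1)     # fully connected layer -> prediction score

patch_tokens = torch.randn(1, 196, hidden)  # initial picture vectors, one per sub-sample picture
text_tokens  = torch.randn(1, 32, hidden)   # initial text vector as a token sequence

first_target  = picture_expert(patch_tokens)                      # first target vectors
second_target = text_expert(text_tokens)                          # second target vector
fused_input   = torch.cat([first_target, second_target], dim=1)   # picture-text target vector
fused_output  = fusion_expert(fused_input)
score = torch.sigmoid(score_head(fused_output[:, 0]))             # first prediction score in (0, 1)
```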
Step 210: based on the first prediction score corresponding to each training sample in the training sample set and the real label, determine the model loss value of the preset picture-text model through a preset cross-entropy loss function.
In this embodiment, after the first prediction score corresponding to each training sample is obtained, the model loss value of the preset picture-text model is calculated from the first prediction scores and the corresponding real labels through the preset cross-entropy loss function. Here the preset cross-entropy loss function may be

Loss = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

where y_i is the real label between the sample picture and the sample text and may be 0 or 1, p_i is the first prediction score between the sample picture and the sample text, and N is the number of training samples in the training sample set.
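A direct implementation of this loss is sketched below; the clamping constant is a numerical-stability assumption not stated in the application.

```python
import torch

def cross_entropy_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Loss = -(1/N) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]"""
    eps = 1e-7                        # keep log() away from 0 and 1
    p = scores.clamp(eps, 1 - eps)
    return -(labels * p.log() + (1 - labels) * (1 - p).log()).mean()

scores = torch.tensor([0.9, 0.2, 0.7])  # first prediction scores for N = 3 training samples
labels = torch.tensor([1.0, 0.0, 1.0])  # real labels
loss = cross_entropy_loss(scores, labels)
# Equivalent to torch.nn.functional.binary_cross_entropy(scores, labels).
```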
Step 211: when the model loss value is greater than a preset loss threshold, adjust, according to the model loss value, module parameters of at least one of the initial picture expert module, the initial text expert module, and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model; obtain, through the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text; and calculate the model loss value again.
In this embodiment, after the model loss value is calculated, if it is less than or equal to the preset loss threshold, the preset picture-text model can be taken directly as the final multi-expert-based picture-text model. If the model loss value is greater than the preset loss threshold, the accuracy of the preset picture-text model has not yet reached expectations and its parameters are adjusted further; specifically, the parameters of one or more of the initial picture expert module, the initial text expert module, and the initial picture-text expert module may be adjusted, yielding an updated preset picture-text model. After the parameter adjustment, a second prediction score for each training sample is obtained from the training sample set, and the model loss value of the updated preset picture-text model is calculated again from the second prediction scores and the corresponding real labels through the preset cross-entropy loss function. The model loss value is then compared with the preset loss threshold again; if it is still greater than the threshold, the parameters of the updated preset picture-text model are updated once more, a third prediction score is computed, and the model loss value is calculated from the third prediction score and the real labels. The process of adjusting the model parameters of the preset picture-text model and calculating the model loss value is repeated until the model loss value is less than or equal to the preset loss threshold.
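The iterative procedure of steps 210 to 212 can be sketched as a conventional training loop; the optimiser choice, learning rate, threshold value, and the model/loader interfaces below are all assumptions made for illustration, not details fixed by the application.

```python
import torch

def train_until_threshold(model, loader, loss_fn, threshold=0.05, max_epochs=100, lr=1e-4):
    """Hypothetical loop: `model` scores (picture, text) pairs and `loader`
    yields (picture_batch, text_batch, label_batch); both are assumed interfaces."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for pictures, texts, labels in loader:
            scores = model(pictures, texts)     # prediction scores for this batch
            loss = loss_fn(scores, labels)      # model loss value
            optimizer.zero_grad()
            loss.backward()                     # adjust the module parameters
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(loader)
        if epoch_loss <= threshold:             # loss value <= preset loss threshold
            break                               # training is finished
    return model
```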
Step 212: when the model loss value is less than or equal to the preset loss threshold, obtain the multi-expert-based picture-text model.
In this embodiment, when the model loss value is less than or equal to the preset loss threshold, the model accuracy has reached expectations and the multi-expert-based picture-text model is obtained; at this point the model includes the trained target picture expert module, target text expert module, and target picture-text expert module. When training the preset picture-text model, this application trains the initial picture expert module, the initial text expert module, and the initial picture-text expert module at the same time. Each module corresponds to Transformer layers of the original BERT model: the initial picture expert module and the initial text expert module correspond to F layers, and the initial picture-text expert module corresponds to (L-F) layers. The sizes of L and F can therefore be configured flexibly and freely during training according to the resource and time requirements of the actual business situation, so that training is closer to the actual business needs. Moreover, the initial picture expert module and the initial text expert module share the parameters of the multi-head attention layers during training, which greatly reduces the number of model parameters and lowers the GPU memory required when the model is deployed.
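One possible way to realise the F shared expert layers and the (L-F) fusion layers is sketched below. This is only an interpretation under stated assumptions (a single shared nn.MultiheadAttention per expert layer with branch-specific feed-forward networks, layer normalisation omitted for brevity, and L = 12, F = 6 as arbitrary example values); it is not the application's definitive architecture.

```python
import torch.nn as nn

class SharedAttentionExpertLayer(nn.Module):
    """One expert layer: the picture branch and the text branch keep separate
    feed-forward networks but share the multi-head attention parameters."""
    def __init__(self, hidden=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)  # shared weights
        self.picture_ffn = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))
        self.text_ffn    = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))

    def forward(self, x, branch: str):
        attended, _ = self.attn(x, x, x)
        ffn = self.picture_ffn if branch == "picture" else self.text_ffn
        return x + ffn(attended)

L, F = 12, 6  # total layers L and expert layers F, both freely configurable per deployment
expert_layers = nn.ModuleList([SharedAttentionExpertLayer() for _ in range(F)])        # picture/text experts
fusion_layers = nn.ModuleList([nn.TransformerEncoderLayer(256, 4, batch_first=True)
                               for _ in range(L - F)])                                  # picture-text expert
```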
In the embodiment of this application, optionally, after step 212 the method further includes: receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module includes at least one of the target picture expert module, the target text expert module, and the target picture-text expert module; converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so that a target result is obtained through the target output vector.
In this embodiment, after the multi-expert-based picture-text model is obtained, one or more modules can subsequently be selected from it for use directly according to the object to be analyzed. Specifically, the object to be analyzed, which may be a picture or a text, is first received. Its format is then analyzed and the module to be used is determined according to that format. After the module is determined, the object to be analyzed is converted into the corresponding target input vector, which is input into the target analysis module to output the target output vector corresponding to the object to be analyzed. The target result can then be obtained by using the target output vector. For example, when the object to be analyzed is in text format, after the target output vector is obtained, the most similar vectors can be found through a corresponding similarity metric, so as to retrieve text or pictures similar to the object to be analyzed. When the object to be analyzed is converted into the target input vector, the same methods described above may be used: dividing a picture into sub-pictures and converting them into the corresponding target input vectors, or looking up the word vector of each word of a text and splicing the word vectors into the target input vector.
In the embodiment of this application, optionally, determining the corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed specifically includes: when the format of the object to be analyzed is a picture format, using the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, using the target text expert module as the target analysis module; and when the format of the object to be analyzed includes both a picture format and a text format, using the target picture expert module, the target text expert module, and the target picture-text expert module as the target analysis modules.
In this embodiment, the target analysis module is determined according to the format of the object to be analyzed. When the format is a picture format, the target picture expert module of the multi-expert-based picture-text model serves as the target analysis module; when the format is a text format, the target text expert module serves as the target analysis module; and when the format includes both a picture format and a text format, the target picture expert module, the target text expert module, and the target picture-text expert module all serve as target analysis modules. In the latter case, the text-format part of the object to be analyzed is converted into a target input vector and passed through the target text expert module to obtain one output vector, the picture-format part is converted into a target input vector and passed through the target picture expert module to obtain another output vector, and the two output vectors are spliced and used as the input of the target picture-text expert module to obtain the target output vector. Producing the vector for the picture-format part through the target picture expert module and the vector for the text-format part through the target text expert module first, and then splicing them as the input of the target picture-text expert module, improves the accuracy of the target output vector of the target picture-text expert and benefits subsequent use.
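The format-based selection of target analysis modules can be sketched as a simple dispatch function; the module names and the picture/text arguments are illustrative placeholders rather than identifiers defined by the application.

```python
def choose_target_modules(picture=None, text=None):
    """Select which trained expert modules analyse the input, by input format."""
    if picture is not None and text is not None:
        return ["target_picture_expert", "target_text_expert", "target_picture_text_expert"]
    if picture is not None:
        return ["target_picture_expert"]
    if text is not None:
        return ["target_text_expert"]
    raise ValueError("the object to be analyzed contains neither a picture nor a text")

choose_target_modules(text="a dog running on the grass")   # -> ["target_text_expert"]
```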
Further, as a specific implementation of the method of Figure 1, an embodiment of this application provides a multi-expert-based picture-text model generation apparatus. As shown in Figure 3, the apparatus includes:
a sample acquisition module, configured to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture;
a first input module, configured to determine an initial picture vector based on the sample picture in any one of the training samples, and input the initial picture vector into the initial picture expert module of the preset picture-text model to obtain a first target vector;
a second input module, configured to determine an initial text vector based on the sample text in said training sample, and input the initial text vector into the initial text expert module of the preset picture-text model to obtain a second target vector;
a prediction module, configured to determine a picture-text target vector according to the first target vector and the second target vector, input the picture-text target vector into the initial picture-text expert module of the preset picture-text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
a model training module, configured to determine a model loss value of the preset picture-text model based on the first prediction score and the real label, and train the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
Optionally, the first input module is specifically configured to:
determine the picture dimensions of the sample picture, wherein the picture dimensions include a picture height and/or a picture width; divide the picture height and/or picture width of the sample picture based on a preset division size to obtain sub-sample pictures corresponding to the sample picture; and convert the sub-sample pictures through a preset conversion tool into the initial picture vector corresponding to each sub-sample picture.
Optionally, the second input module is specifically configured to:
based on a preset word-vector database, determine from the preset word-vector database the word vector corresponding to each word of the sample text, and splice the word vectors corresponding to the words of the sample text to obtain the initial text vector.
Optionally, the prediction module is specifically configured to:
splice the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector; and splice the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
Optionally, the model training module is specifically configured to:
based on the first prediction score corresponding to each training sample in the training sample set and the real label, determine the model loss value of the preset picture-text model through a preset cross-entropy loss function; when the model loss value is greater than a preset loss threshold, adjust, according to the model loss value, module parameters of at least one of the initial picture expert module, the initial text expert module, and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model, obtain, through the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text, and calculate the model loss value again; and when the model loss value is less than or equal to the preset loss threshold, obtain the multi-expert-based picture-text model.
Optionally, the apparatus further includes:
a receiving module, configured to, after the multi-expert-based picture-text model is obtained, receive an object to be analyzed and determine a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module includes at least one of the target picture expert module, the target text expert module, and the target picture-text expert module;
a third input module, configured to convert the object to be analyzed into a corresponding target input vector, and input the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so that a target result is obtained through the target output vector.
Optionally, the receiving module is specifically configured to:
when the format of the object to be analyzed is a picture format, use the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, use the target text expert module as the target analysis module; and when the format of the object to be analyzed includes both a picture format and a text format, use the target picture expert module, the target text expert module, and the target picture-text expert module as the target analysis modules.
It should be noted that for other corresponding descriptions of the functional units involved in the multi-expert-based picture-text model generation apparatus provided by the embodiments of this application, reference may be made to the corresponding descriptions of the methods of Figures 1 and 2, which are not repeated here.
Based on the methods shown in Figures 1 and 2 above, an embodiment of this application correspondingly further provides a computer-readable storage medium, which may be non-volatile or volatile. Computer-readable instructions are stored on the computer-readable storage medium, and when executed by a processor, the computer-readable instructions implement the multi-expert-based picture-text model generation method shown in Figures 1 and 2.
Based on this understanding, the technical solution of this application may be embodied in the form of a software product. The software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, and the like) or a volatile storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the implementation scenarios of this application.
Based on the methods shown in Figures 1 and 2 and the virtual apparatus embodiment shown in Figure 3, to achieve the above purpose, an embodiment of this application further provides a computer device, which includes a processor, a memory, and computer-readable instructions stored on the memory and executable on the processor, wherein both the memory and the processor are arranged on a bus, and when executing the computer-readable instructions the processor implements the multi-expert-based picture-text model generation method shown in Figures 1 and 2.
Optionally, the computer device may further include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display (Display), an input unit such as a keyboard (Keyboard), and the like, and an optional user interface may further include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface or a Wi-Fi interface), and the like.
Those skilled in the art can understand that the computer device structure provided in this embodiment does not constitute a limitation on the computer device, and it may include more or fewer components, combine certain components, or use a different arrangement of components.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages and saves the hardware and software resources of the computer device and supports the operation of the information processing program and other software and/or programs. The network communication module is used to implement communication between the components inside the storage medium and communication with other hardware and software in the physical device.
Through the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by software plus a necessary general hardware platform, or by hardware. With the technical solution of this application, a training sample set is first obtained, in which each training sample includes one sample picture and one sample text, and the sample text carries a real label indicating its relationship with the sample picture. For each training sample, the sample picture is converted into its corresponding initial picture vector, which is input into the initial picture expert module of the preset picture-text model to output the first target vector; the initial text vector of the corresponding sample text is determined and input into the initial text expert module to output the second target vector. The picture-text target vector is then determined from the first and second target vectors and input into the initial picture-text expert module, whose output is passed through the fully connected layer to obtain the first prediction score between the sample picture and the sample text. The model loss value of the preset picture-text model is determined from the first prediction score and the real label of each training sample, and the model is trained on this loss value to obtain the multi-expert picture-text model based on a picture expert, a text expert, and a picture-text expert. The embodiments of this application thus allow the initial picture expert module, the initial text expert module, and the initial picture-text expert module to be trained jointly, saving model training and maintenance costs and effectively reducing the occupation of computer resources.
Those skilled in the art can understand that the drawings are only schematic diagrams of preferred implementation scenarios, and the modules or processes in the drawings are not necessarily required for implementing this application. Those skilled in the art can also understand that the modules in the apparatus of an implementation scenario may be distributed in the apparatus as described, or may be changed correspondingly and located in one or more apparatuses different from that of the present implementation scenario. The modules of the above implementation scenarios may be combined into one module, or may be further split into multiple sub-modules.
The above serial numbers of this application are for description only and do not represent the superiority or inferiority of the implementation scenarios. What is disclosed above is only several specific implementation scenarios of this application; however, this application is not limited thereto, and any variation conceivable by those skilled in the art shall fall within the protection scope of this application.

Claims (20)

1. A multi-expert-based picture-text model generation method, comprising:
obtaining a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture;
determining an initial picture vector based on the sample picture in any one of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector;
determining an initial text vector based on the sample text in said training sample, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector;
determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
2. The method according to claim 1, wherein determining an initial picture vector based on the sample picture in any one of the training samples specifically comprises:
determining the picture dimensions of the sample picture, wherein the picture dimensions include a picture height and/or a picture width;
dividing the picture height and/or picture width of the sample picture based on a preset division size to obtain sub-sample pictures corresponding to the sample picture;
converting the sub-sample pictures through a preset conversion tool into the initial picture vector corresponding to each sub-sample picture.
3. The method according to claim 1, wherein determining an initial text vector based on the sample text in said training sample specifically comprises:
based on a preset word-vector database, determining from the preset word-vector database the word vector corresponding to each word of the sample text, and splicing the word vectors corresponding to the words of the sample text to obtain the initial text vector.
4. The method according to claim 2 or 3, wherein determining a picture-text target vector according to the first target vector and the second target vector specifically comprises:
splicing the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector;
splicing the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
5. The method according to claim 1, wherein determining a model loss value of the preset picture-text model based on the first prediction score and the real label, and training the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model, specifically comprises:
based on the first prediction score corresponding to each training sample in the training sample set and the real label, determining the model loss value of the preset picture-text model through a preset cross-entropy loss function;
when the model loss value is greater than a preset loss threshold, adjusting, according to the model loss value, module parameters of at least one of the initial picture expert module, the initial text expert module, and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model, obtaining, through the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text, and calculating the model loss value again;
when the model loss value is less than or equal to the preset loss threshold, obtaining the multi-expert-based picture-text model.
6. The method according to claim 1, wherein after obtaining the multi-expert-based picture-text model, the method further comprises:
receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module includes at least one of a target picture expert module, a target text expert module, and a target picture-text expert module;
converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so as to obtain a target result through the target output vector.
7. The method according to claim 6, wherein determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed specifically comprises:
when the format of the object to be analyzed is a picture format, using the target picture expert module as the target analysis module;
when the format of the object to be analyzed is a text format, using the target text expert module as the target analysis module;
when the format of the object to be analyzed includes a picture format and a text format, using the target picture expert module, the target text expert module, and the target picture-text expert module as the target analysis modules.
8. A multi-expert-based picture-text model generation apparatus, comprising:
a sample acquisition module, configured to obtain a training sample set, wherein the training sample set includes a plurality of training samples, each training sample includes a sample picture and a sample text, and the sample text carries a real label indicating its relationship with the sample picture;
a first input module, configured to determine an initial picture vector based on the sample picture in any one of the training samples, and input the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector;
a second input module, configured to determine an initial text vector based on the sample text in said training sample, and input the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector;
a prediction module, configured to determine a picture-text target vector according to the first target vector and the second target vector, input the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtain a first prediction score between the sample picture and the sample text based on the output result and a fully connected layer;
a model training module, configured to determine a model loss value of the preset picture-text model based on the first prediction score and the real label, and train the preset picture-text model based on the model loss value to obtain the multi-expert-based picture-text model.
  9. 一种计算机可读存储介质,其上存储有计算机可读指令,其中,所述计算机可读指令被处理器执行时实现基于多专家的图文模型生成方法,包括:A computer-readable storage medium on which computer-readable instructions are stored, wherein when the computer-readable instructions are executed by a processor, a multi-expert-based graphic model generation method is implemented, including:
    获取训练样本集合,其中,所述训练样本集合包括多个训练样本,每个所述训练样本包括样本图片和样本文本,所述样本文本带有指示与所述样本图片之间关系的真实标签;基于任一所述训练样本中的所述样本图片,确定初始图片向量,并将所述初始图片向量输入至预设图片文本模型的初始图片专家模块,得到第一目标向量;基于所述任一所述训练样本中的所述样本文本,确定初始文本向量,并将所述初始文本向量输入至所述预设图片文本模型的初始文本专家模块,得到第二目标向量;依据所述第一目标向量以及所述第二目标向量,确定图片文本目标向量,将所述图片文本目标向量输入至所述预设图片文本模型的初始图片文本专家模块,并基于输出结果以及全连接层,得到所述样本图片与所述样本文本之间的第一预测分值;基于所述第一预测分值以及所述真实标签,确定所述预设图片文本模型的模型损失值,并基于所述模型损失值对所述预设图片文本模型进行训练,得到所述基于多专家的图文模型。Obtain a training sample set, wherein the training sample set includes a plurality of training samples, each of the training samples includes a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture; Based on the sample picture in any of the training samples, determine an initial picture vector, and input the initial picture vector to the initial picture expert module of the preset picture text model to obtain the first target vector; based on any of the Determine an initial text vector for the sample text in the training sample, and input the initial text vector into the initial text expert module of the preset picture text model to obtain a second target vector; according to the first target vector and the second target vector, determine the picture text target vector, input the picture text target vector to the initial picture text expert module of the preset picture text model, and based on the output result and the fully connected layer, obtain the The first prediction score between the sample picture and the sample text; based on the first prediction score and the real label, determine the model loss value of the preset picture text model, and based on the model loss value The preset picture and text model is trained to obtain the multi-expert based picture and text model.
  10. The computer-readable storage medium according to claim 9, wherein when the computer-readable instructions are executed by the processor, the determining an initial picture vector based on the sample picture in any one of the training samples specifically comprises:
    determining picture dimensions of the sample picture, wherein the picture dimensions comprise a picture height and/or a picture width; dividing the picture height and/or picture width of the sample picture on the basis of a preset division size to obtain sub-sample pictures corresponding to the sample picture; and converting the sub-sample pictures, by means of a preset conversion tool, into the initial picture vector corresponding to each of the sub-sample pictures.
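    (As a concrete illustration of claim 10, the sketch below splits a sample picture into fixed-size sub-sample pictures and flattens each one into an initial picture vector. The 16-pixel division size and the flatten-style conversion stand in for the "preset conversion tool" and are assumptions, not requirements of the claim.)

        import torch

        def split_into_subsamples(picture: torch.Tensor, patch: int = 16) -> torch.Tensor:
            """Divide the picture height and width by a preset division size into sub-sample pictures."""
            c, h, w = picture.shape  # picture dimensions: channels, picture height, picture width
            patches = picture.unfold(1, patch, patch).unfold(2, patch, patch)  # (c, h//patch, w//patch, patch, patch)
            patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
            return patches  # one flattened initial picture vector per sub-sample picture

        # Hypothetical usage: a 224x224 RGB sample picture yields 196 initial picture vectors of length 768.
        # A linear projection could then map each vector to the expert module's input dimension.
        picture = torch.rand(3, 224, 224)
        initial_picture_vectors = split_into_subsamples(picture)  # shape (196, 768)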
  11. The computer-readable storage medium according to claim 9, wherein when the computer-readable instructions are executed by the processor, the determining an initial text vector based on the sample text in said any one of the training samples specifically comprises:
    determining, from a preset word vector database, the word vector corresponding to each word in the sample text, and splicing the word vectors corresponding to the words in the sample text to obtain the initial text vector.
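    (A minimal sketch of claim 11: each word of the sample text is looked up in a preset word vector database and the per-word vectors are spliced into the initial text vector. The dictionary-backed database, the whitespace tokenization, and the zero vector for out-of-vocabulary words are illustrative assumptions.)

        import numpy as np

        # Hypothetical preset word vector database: word -> fixed-length vector.
        DIM = 2
        word_vector_db = {"cat": np.array([0.1, 0.3]), "runs": np.array([0.4, 0.1])}

        def build_initial_text_vector(sample_text: str) -> np.ndarray:
            """Look up the word vector for each word and splice them into the initial text vector."""
            vectors = [word_vector_db.get(w, np.zeros(DIM)) for w in sample_text.split()]
            return np.concatenate(vectors) if vectors else np.zeros(0)

        initial_text_vector = build_initial_text_vector("cat runs")  # shape (4,)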
  12. The computer-readable storage medium according to claim 10 or 11, wherein when the computer-readable instructions are executed by the processor, the determining a picture-text target vector according to the first target vector and the second target vector specifically comprises:
    splicing the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector; and splicing the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
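    (Claim 12 reduces to two concatenations, sketched below; the ordering, with the picture splicing vector placed before the text target vector, is an assumption the claim itself leaves open.)

        import torch
        from typing import List

        def build_picture_text_target(first_targets: List[torch.Tensor], second_target: torch.Tensor) -> torch.Tensor:
            """Splice the per-sub-sample-picture first target vectors, then splice the result with the second target vector."""
            picture_splicing_vector = torch.cat(first_targets, dim=0)
            return torch.cat([picture_splicing_vector, second_target], dim=0)  # picture-text target vector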
  13. The computer-readable storage medium according to claim 9, wherein when the computer-readable instructions are executed by the processor, the determining a model loss value of the preset picture-text model on the basis of the first prediction score and the real label and training the preset picture-text model on the basis of the model loss value to obtain the multi-expert-based picture-text model specifically comprises:
    determining the model loss value of the preset picture-text model by means of a preset cross-entropy loss function on the basis of the first prediction score and the real label corresponding to each training sample in the training sample set; when the model loss value is greater than a preset loss threshold, adjusting, according to the model loss value, module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model, obtaining, by means of the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text, and calculating the model loss value again; and when the model loss value is less than or equal to the preset loss threshold, obtaining the multi-expert-based picture-text model.
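    (The training procedure in claim 13 amounts to an iterate-until-threshold loop over the training sample set with a cross-entropy loss. The sketch below assumes the binary form of that loss, a gradient-based parameter update, and the PresetPictureTextModel sketched earlier; all three are assumptions, not details fixed by the claim.)

        import torch
        import torch.nn.functional as F

        def train_until_threshold(model, training_samples, loss_threshold=0.1, lr=1e-4, max_rounds=100):
            """Adjust module parameters until the model loss value falls to or below the preset loss threshold."""
            optimizer = torch.optim.Adam(model.parameters(), lr=lr)
            for _ in range(max_rounds):
                scores, labels = [], []
                for picture_vectors, text_vectors, real_label in training_samples:
                    scores.append(model(picture_vectors, text_vectors))  # prediction score for this sample
                    labels.append(real_label)
                # Preset cross-entropy loss over all samples in the training sample set.
                loss = F.binary_cross_entropy(torch.cat(scores).squeeze(-1), torch.stack(labels).float())
                if loss.item() <= loss_threshold:
                    break  # multi-expert-based picture-text model obtained
                optimizer.zero_grad()
                loss.backward()   # adjust module parameters according to the model loss value
                optimizer.step()
            return model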
  14. The computer-readable storage medium according to claim 9, wherein when the computer-readable instructions are executed by the processor, after the multi-expert-based picture-text model is obtained, the method further comprises:
    receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module comprises at least one of a target picture expert module, a target text expert module and a target picture-text expert module; and converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so as to obtain a target result by means of the target output vector.
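    (Claim 14 describes format-based routing of an object to be analyzed to the matching expert module inside the trained model. The sketch below routes by file extension; the extension-to-module mapping and the attribute names, carried over from the earlier model sketch, are hypothetical.)

        def select_target_module(model, object_path: str):
            """Pick the target analysis module according to the format of the object to be analyzed."""
            path = object_path.lower()
            if path.endswith((".jpg", ".jpeg", ".png", ".bmp")):
                return model.picture_expert        # picture-format object -> target picture expert module
            if path.endswith((".txt",)):
                return model.text_expert           # text-format object -> target text expert module
            return model.picture_text_expert       # mixed picture-text object -> target picture-text expert module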
  15. A computer device, comprising a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor, wherein when the processor executes the computer-readable instructions, a multi-expert-based picture-text model generation method is implemented, comprising:
    obtaining a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample comprises a sample picture and a sample text, and the sample text carries a real label indicating a relationship with the sample picture; determining an initial picture vector based on the sample picture in any one of the training samples, and inputting the initial picture vector into an initial picture expert module of a preset picture-text model to obtain a first target vector; determining an initial text vector based on the sample text in said any one of the training samples, and inputting the initial text vector into an initial text expert module of the preset picture-text model to obtain a second target vector; determining a picture-text target vector according to the first target vector and the second target vector, inputting the picture-text target vector into an initial picture-text expert module of the preset picture-text model, and obtaining a first prediction score between the sample picture and the sample text on the basis of the output result and a fully connected layer; and determining a model loss value of the preset picture-text model on the basis of the first prediction score and the real label, and training the preset picture-text model on the basis of the model loss value to obtain the multi-expert-based picture-text model.
  16. The computer device according to claim 15, wherein when the processor executes the computer-readable instructions, the determining an initial picture vector based on the sample picture in any one of the training samples specifically comprises:
    determining picture dimensions of the sample picture, wherein the picture dimensions comprise a picture height and/or a picture width; dividing the picture height and/or picture width of the sample picture on the basis of a preset division size to obtain sub-sample pictures corresponding to the sample picture; and converting the sub-sample pictures, by means of a preset conversion tool, into the initial picture vector corresponding to each of the sub-sample pictures.
  17. The computer device according to claim 15, wherein when the processor executes the computer-readable instructions, the determining an initial text vector based on the sample text in said any one of the training samples specifically comprises:
    determining, from a preset word vector database, the word vector corresponding to each word in the sample text, and splicing the word vectors corresponding to the words in the sample text to obtain the initial text vector.
  18. The computer device according to claim 16 or 17, wherein when the processor executes the computer-readable instructions, the determining a picture-text target vector according to the first target vector and the second target vector specifically comprises:
    splicing the first target vectors corresponding to the sub-sample pictures to obtain a picture splicing vector; and splicing the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture-text target vector.
  19. The computer device according to claim 15, wherein when the processor executes the computer-readable instructions, the determining a model loss value of the preset picture-text model on the basis of the first prediction score and the real label and training the preset picture-text model on the basis of the model loss value to obtain the multi-expert-based picture-text model specifically comprises:
    determining the model loss value of the preset picture-text model by means of a preset cross-entropy loss function on the basis of the first prediction score and the real label corresponding to each training sample in the training sample set; when the model loss value is greater than a preset loss threshold, adjusting, according to the model loss value, module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture-text expert module in the preset picture-text model to obtain an updated preset picture-text model, obtaining, by means of the updated preset picture-text model and the fully connected layer, a second prediction score between each sample picture and the corresponding sample text, and calculating the model loss value again; and when the model loss value is less than or equal to the preset loss threshold, obtaining the multi-expert-based picture-text model.
  20. The computer device according to claim 15, wherein when the processor executes the computer-readable instructions, after the multi-expert-based picture-text model is obtained, the method further comprises:
    receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based picture-text model according to the format of the object to be analyzed, wherein the target analysis module comprises at least one of a target picture expert module, a target text expert module and a target picture-text expert module; and converting the object to be analyzed into a corresponding target input vector, and inputting the target input vector into the target analysis module to obtain a target output vector corresponding to the object to be analyzed, so as to obtain a target result by means of the target output vector.
PCT/CN2022/089730 2022-03-09 2022-04-28 Picture-text model generation method and apparatus based on multiple experts, and device and medium WO2023168811A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210232059.8 2022-03-09
CN202210232059.8A CN114610919A (en) 2022-03-09 2022-03-09 Multi-expert-based image-text model generation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2023168811A1 (en)

Family

ID=81861502

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089730 WO2023168811A1 (en) 2022-03-09 2022-04-28 Picture-text model generation method and apparatus based on multiple experts, and device and medium

Country Status (2)

Country Link
CN (1) CN114610919A (en)
WO (1) WO2023168811A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098138A1 (en) * 2015-10-06 2017-04-06 Adobe Systems Incorporated Font Attributes for Font Recognition and Similarity
CN110781633A (en) * 2019-10-30 2020-02-11 广东博智林机器人有限公司 Image-text design quality detection method, device and system based on deep learning model
CN111310041A (en) * 2020-02-12 2020-06-19 腾讯科技(深圳)有限公司 Image-text publishing method, model training method and device and storage medium
CN113283551A (en) * 2021-07-22 2021-08-20 智者四海(北京)技术有限公司 Training method and training device of multi-mode pre-training model and electronic equipment

Also Published As

Publication number Publication date
CN114610919A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
KR102401942B1 (en) Method and apparatus for evaluating translation quality
JP7331171B2 (en) Methods and apparatus for training image recognition models, methods and apparatus for recognizing images, electronic devices, storage media, and computer programs
US20220230420A1 (en) Artificial intelligence-based object detection method and apparatus, device, and storage medium
WO2020134571A1 (en) Page display method and apparatus, terminal device and storage medium
CN110555795A (en) High resolution style migration
CN111027563A (en) Text detection method, device and recognition system
CN111666416B (en) Method and device for generating semantic matching model
CN109902763B (en) Method and device for generating feature map
WO2024036847A1 (en) Image processing method and apparatus, and electronic device and storage medium
CN109948699B (en) Method and device for generating feature map
CN116127020A (en) Method for training generated large language model and searching method based on model
CN116127046A (en) Training method for generating large language model and man-machine voice interaction method based on model
CN116127045A (en) Training method for generating large language model and man-machine voice interaction method based on model
US11195048B2 (en) Generating descriptions of image relationships
CN112084920B (en) Method, device, electronic equipment and medium for extracting hotwords
US20220215177A1 (en) Method and system for processing sentence, and electronic device
WO2023168812A1 (en) Optimization method and apparatus for search system, and storage medium and computer device
US20230289402A1 (en) Joint perception model training method, joint perception method, device, and storage medium
CN116244416A (en) Training method for generating large language model and man-machine voice interaction method based on model
CN112084959A (en) Crowd image processing method and device
CN111738010B (en) Method and device for generating semantic matching model
US20210279589A1 (en) Electronic device and control method thereof
WO2023168811A1 (en) Picture-text model generation method and apparatus based on multiple experts, and device and medium
CN113361384A (en) Face recognition model compression method, device, medium, and computer program product
CN113140221A (en) Language model fusion method, device, medium and computer program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22930440

Country of ref document: EP

Kind code of ref document: A1