CN116484217A - Intelligent decision method and system based on multi-modal pre-training large model - Google Patents

Intelligent decision method and system based on multi-modal pre-training large model

Info

Publication number
CN116484217A
Authority
CN
China
Prior art keywords
training
decision
data
model
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310407938.4A
Other languages
Chinese (zh)
Inventor
刘应波
杜宇
刘应玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Yuanmatrix Technology Co ltd
Original Assignee
Yunnan Yuanmatrix Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan Yuanmatrix Technology Co ltd filed Critical Yunnan Yuanmatrix Technology Co ltd
Priority to CN202310407938.4A
Publication of CN116484217A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The invention discloses an intelligent decision method and system based on a multi-modal pre-training large model. The method comprises: acquiring a decision problem, making an intelligent decision through a preset multi-modal pre-training model, generating a decision result, and storing a decision case; obtaining the decision cases of the multi-modal pre-training model and constructing decision tag data; and performing supervised training with the decision tag data to adjust the model parameters of the multi-modal pre-training model. The invention uses the cases generated by decisions as training tag data to fine-tune the model parameters, which helps improve the decision capability of the model for specific case types.

Description

Intelligent decision method and system based on multi-modal pre-training large model
Technical Field
The invention relates to the technical field of multi-modal data processing, and in particular to an intelligent decision method and system based on a multi-modal pre-training large model.
Background
In recent years, researchers have made great progress in both computer vision and natural language processing, so multi-modal deep learning, which combines the two, is receiving more and more attention. Existing multi-modal pre-training models perform deep learning on combined multi-modal data, which improves the model's understanding of the raw data and thereby the decision accuracy.
However, as society progresses, the decision problems to be handled gradually change. During the use of a pre-trained large model, the model cannot adapt to newly emerging decision problems, so it cannot improve in either the breadth of decision problems it covers or the accuracy of its decision results; meanwhile, the pre-trained large model cannot meet personalized requirements in specific scenes.
Therefore, how to enhance the decision capability of a multi-modal pre-training model while it is in use is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides an intelligent decision method and system based on a multi-modal pre-training large model, which use the cases generated by decisions as training tag data to fine-tune the model parameters, helping to improve the decision capability of the model for problems of a specific case type, so that a wider range of problems can be solved in the specific scene.
In order to achieve the above purpose, the present invention adopts the following technical solution:
acquiring a decision problem, making an intelligent decision through a preset multi-modal pre-training model, generating a decision result, and storing a decision case;
obtaining the decision cases of the multi-modal pre-training model and constructing decision tag data; and performing supervised training with the decision tag data to adjust the model parameters of the multi-modal pre-training model.
Further, the pre-training step of the multi-modal pre-training model includes:
acquiring training data of multiple modalities;
extracting training features of the training data corresponding to each modality, uniformly encoding the training features, generating a tuple sequence corresponding to each modality, and constructing a multi-modal data set;
and performing joint training on a pre-constructed multi-modal data processing model through the tuple sequences corresponding to the multiple modalities to generate the multi-modal pre-training model.
Further, the multi-modal training data includes one or more of image data, video data, and text data.
Further, extracting the training features of the training data corresponding to each modality, uniformly encoding the training features, and generating the tuple sequence corresponding to each modality is specifically:
for image data, the feature information is recorded as a tuple F1 = (C, O, P, R, …);
where C is the data modality type, O represents an object in the image, P is the position of the object in the image, and R represents other features, such as geometry, shape, amplitude, histogram, color, or local binary pattern;
for video data, images are extracted frame by frame to form an image set, and each image in the set is encoded as a tuple F2 = (C, O, P, R, T, …), where element T records the time information of the current frame;
for text data, features are extracted by natural language processing, and the text data tuple F3 can be encoded as (C, S, E, …), where S is the feature level and E is the environment information;
the multi-modal data set is Dstd = {D1, D2, D3, …, DN}.
Further, the joint training specifically includes:
acquiring training data of different data modality types in the multi-modal data set and merging them;
training a pre-constructed multi-modal data processing model with the merged data.
Further, the merging methods include modality embedding, attention mechanisms, multi-view learning, or multi-task learning. For example, using modality embedding (Modality Embedding): the different input modalities in Dstd are converted into vector representations in a shared space, which are then concatenated into a multi-modal vector Dem = [F1, F2, F3, …, FN]; for instance, images are encoded with a convolutional neural network (CNN) and text with a recurrent neural network (RNN), and the resulting vectors are concatenated to form Dem, which is then used to train the model.
Further, the step of constructing the decision tag data includes:
creating a decision problem text;
calculating, according to text similarity, the distance l_m between the text of each decision case e_m and the decision problem text, where m is the sequence number of the decision case;
constructing a data Tag vector Tag = {(e_1, R_1, l_1), (e_2, R_2, l_2), (e_3, R_3, l_3), …}, where R is the decision case result;
and obtaining the training data format mapping rule map of the multi-modal pre-training model, and converting the Tag vector into the pre-training data format according to the mapping rule to form the decision tag data.
An intelligent decision system based on a multi-modal pre-training large model, comprising:
a user data acquisition device for a user to input a decision problem;
a data processor configured with the multi-modal pre-training model, for making intelligent decisions through the multi-modal pre-training model according to the decision problem, generating decision results, and storing decision cases; and
an intelligent optimization module for acquiring the decision cases of the multi-modal pre-training model and constructing decision tag data, and for performing supervised training on the multi-modal pre-training model with the decision tag data to adjust the model parameters.
Further, the user data acquisition device is an electronic input device or a voice acquisition device.
Further, the system further includes a visual operation device for a user to perform visual model evaluation through human-machine interaction.
The invention has the following beneficial effects:
Compared with the prior art, the intelligent decision method and system based on a multi-modal pre-training large model provided by the invention use the cases generated during the decision process of the pre-trained large model as training tag data to fine-tune the model parameters, which improves the decision capability of the model for specific case types. Meanwhile, the pre-trained large model is trained with data from multiple scenes and therefore has decision capability across multiple scenes; combined with model fine-tuning, its capability can be improved in the specific scene corresponding to a given decision problem.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an intelligent decision method based on a multi-modal pre-training large model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an intelligent decision system based on a multi-modal pre-training large model according to another embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in FIG. 1, an embodiment of the invention discloses an intelligent decision method based on a multi-modal pre-training large model, which comprises the following steps:
S1: acquiring a decision problem, making an intelligent decision through a preset multi-modal pre-training model, generating a decision result, and storing a decision case.
In one embodiment, the pre-training steps of the multi-modal pre-training model are:
S11: acquiring training data of multiple modalities; the training data may include one or more of image data, video data, and text data. The data are also cleaned, labeled, and converted into a common format.
S12: extracting training features of the training data corresponding to each modality, uniformly encoding the training features, generating a tuple sequence corresponding to each modality, and constructing a multi-modal data set. This step is preprocessing; preprocessing is defined as a function f(x), where x is the multi-modal data, and processing by f yields a standardized data set, namely the multi-modal data set.
Specifically, for image data, feature information of an image can be extracted using a feature extraction technique, for example a CNN or ViT model, and the feature information is recorded as a tuple F1 = (C, O, P, R, …); C is the data modality type (1 for image, 2 for video, 3 for text); O represents an object in the image, P is the position of the object in the image, and R represents other features, such as geometry, shape, amplitude, histogram, color, or local binary pattern. For video data, images are extracted frame by frame to form an image set, and each image in the set is encoded as a tuple F2 = (C, O, P, R, T, …), where element T records the time information of the current frame. For text data, features are extracted by natural language processing, and the text data tuple F3 can be encoded as (C, S, E, …), where S is the feature level and E is the environment information. The multi-modal data set is Dstd = {D1, D2, D3, …, DN}. Data of other modality types, such as audio data, may also be included, with features extracted by a speech recognition model.
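By way of illustration only, the tuple encoding of step S12 might be sketched in Python as follows; the function names, field values, and the choice to return plain Python tuples are assumptions made for the sketch and are not part of the disclosed method. The feature values passed in would come from an image detector, frame extraction, and an NLP pipeline, which are not shown.

```python
# Illustrative sketch of the tuple encodings F1/F2/F3 and the data set Dstd.

def encode_image(objects, positions, other_features):
    # F1 = (C, O, P, R, ...): C = 1 marks the image modality
    return (1, objects, positions, other_features)

def encode_video_frame(objects, positions, other_features, frame_time):
    # F2 = (C, O, P, R, T, ...): C = 2, element T records the time of the current frame
    return (2, objects, positions, other_features, frame_time)

def encode_text(feature_level, environment_info):
    # F3 = (C, S, E, ...): C = 3, S = feature level, E = environment information
    return (3, feature_level, environment_info)

def build_dstd(encoded_tuples):
    """Assemble the standardized multi-modal data set Dstd = {D1, D2, ..., DN}."""
    return list(encoded_tuples)

# Usage sketch with dummy feature values:
dstd = build_dstd([
    encode_image(["vehicle"], [(12, 40, 80, 120)], {"color_hist": [0.2, 0.5, 0.3]}),
    encode_video_frame(["pedestrian"], [(5, 9, 30, 60)], {"shape": "upright"}, frame_time=0.04),
    encode_text(feature_level=2, environment_info="rainy intersection"),
])
```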
S13: performing joint training on the pre-constructed multi-modal data processing model through the tuple sequences corresponding to the multiple modalities to obtain the model parameters of the multi-modal pre-training model. For the construction of the multi-modal data processing model, a suitable multi-modal pre-training model can be selected and is not limited to a specific model, such as a Transformer or BERT, or more specifically models such as OpenAI's DALL-E or CLIP. These models are typically composed of multiple neural networks for processing image, video, and text input data.
In one embodiment, in S13, the specific steps of the joint training include:
acquiring training data of different data modality types in the multi-modal data set and merging them; the merging methods include modality embedding, attention mechanisms, multi-view learning, multi-task learning, and the like. The invention uses, for example, modality embedding (Modality Embedding): the different input modalities in Dstd are converted into vector representations in a shared space, which are then concatenated into a multi-modal vector Dem = [F1, F2, F3, …, FN]; for instance, images are encoded with a convolutional neural network (CNN) and text with a recurrent neural network (RNN), and the two vectors are concatenated to form Dem. A pre-constructed multi-modal data processing model is then trained with the merged data. Pre-training uses joint training: the model is trained on multiple unlabeled tasks so that it learns richer semantic information. Pre-training is typically performed with self-supervised learning, for example by predicting the rotation angle of an image, or by dividing an image into blocks, shuffling them, and predicting the original image. The goal of these tasks is to train different parts of the model by exploiting information from multiple modalities.
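A minimal Python (PyTorch) sketch of the modality-embedding step is given below; the encoder architectures, dimensions, and input shapes are illustrative assumptions rather than the disclosed design.

```python
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    """Project each modality into a shared space and concatenate the vectors into Dem."""

    def __init__(self, dim=256):
        super().__init__()
        # CNN encoder for images (an assumed architecture, not fixed by the disclosure)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim),
        )
        # RNN (GRU) encoder for pre-embedded text tokens
        self.text_encoder = nn.GRU(input_size=128, hidden_size=dim, batch_first=True)

    def forward(self, images, text_embeddings):
        f_img = self.image_encoder(images)            # image vector in the shared space
        _, h_last = self.text_encoder(text_embeddings)
        f_txt = h_last[-1]                            # text vector in the shared space
        return torch.cat([f_img, f_txt], dim=-1)      # concatenated multi-modal vector Dem

# Usage sketch with random tensors standing in for a batch of 4 samples:
dem = ModalityEmbedding()(torch.randn(4, 3, 224, 224), torch.randn(4, 32, 128))
```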
S2: obtaining the decision cases of the multi-modal pre-training model and constructing decision tag data; and performing supervised training with the decision tag data to adjust the model parameters of the multi-modal pre-training model.
In one embodiment, the step of constructing decision tag data includes:
s21: creating a decision problem text;
S22: calculating, according to text similarity, the distance l_m between the text of each decision case e_m and the decision problem text; the text distance may be calculated by a natural language processing method for similarity, such as cosine similarity (a sketch is given after step S24 below).
S23: constructing a data Tag vector Tag = {(e_1, R_1, l_1), (e_2, R_2, l_2), (e_3, R_3, l_3), …}, where R is the decision case result;
S24: obtaining the training data format mapping rule map of the multi-modal pre-training model, and converting the Tag vector into the pre-training data format according to the mapping rule to form the decision tag data.
A complete case should contain: the decision context, the decision problem, and the final result. A decision case is illustrated as follows, for decision making in an intelligent vehicle control system. Case context: on a certain date, a large vehicle of model XXX performed avoidance at intersection XX; other parameters of this case include weather, traffic flow, scene pictures, and so on. On another date, a small vehicle of model XXX performed avoidance at a certain location; other parameters of this case include weather, traffic volume, and so on. Decision problem: while now crossing a road, is it necessary to avoid pedestrians? The decision result is "yes" or "no".
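A small sketch of how a stored decision case and the Tag vector of steps S23 and S24 might be represented; the field names and the distance_fn and map_rule parameters are illustrative assumptions, not the disclosed data format.

```python
from dataclasses import dataclass

@dataclass
class DecisionCase:
    context: str   # decision background: weather, traffic flow, scene pictures, ...
    problem: str   # the decision problem text of the case
    result: str    # final decision result R, e.g. "yes" / "no"

def build_decision_tag_data(problem_text, cases, distance_fn, map_rule):
    """Tag = {(e_1, R_1, l_1), (e_2, R_2, l_2), ...}; map_rule converts each entry
    into the pre-training data format of the multi-modal model (step S24)."""
    distances = distance_fn(problem_text, [c.problem for c in cases])
    tag = [(case, case.result, dist) for case, dist in zip(cases, distances)]
    return [map_rule(e, r, l) for (e, r, l) in tag]
```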
In this embodiment, the decision tag data are input into the model for supervised training, a back-propagation algorithm is used to update the large-model parameters, and accuracy, precision, recall, and F1 score are used to evaluate the training results.
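The supervised fine-tuning and the evaluation metrics described above could be sketched as follows; the model, data loader, optimizer choice, and label encoding are placeholders rather than the disclosed training configuration.

```python
import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def finetune(model, tag_loader, epochs=3, lr=1e-5):
    """Supervised fine-tuning on decision tag data, updating parameters by back-propagation."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, labels in tag_loader:        # decision tag data in the pre-training format
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()                      # back-propagation
            optimizer.step()

def training_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 score used to evaluate the training results."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
    }
```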
Evaluating the performance of the fine-tuned model: the fine-tuned model is evaluated on the evaluation data set, using tools such as a confusion matrix or an ROC curve to visualize the results.
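A sketch of this visual evaluation step, assuming binary decision labels and scikit-learn 1.0 or later; the argument names are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

def visualize_evaluation(y_true, y_pred, y_score):
    """Confusion matrix and ROC curve for the fine-tuned decision model (binary labels)."""
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
    RocCurveDisplay.from_predictions(y_true, y_score)   # y_score: predicted probability of "yes"
    plt.show()
```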
Example 2
As shown in FIG. 2, the invention also discloses an intelligent decision system based on a multi-modal pre-training large model, comprising a user data acquisition device for a user to input a decision problem;
a data processor configured with the multi-modal pre-training model, for making intelligent decisions through the multi-modal pre-training model according to the decision problem, generating decision results, and storing decision cases; and
an intelligent optimization module for acquiring the decision cases of the multi-modal pre-training model and constructing decision tag data, and for performing supervised training on the multi-modal pre-training model with the decision tag data to adjust the model parameters.
In one embodiment, the user data acquisition device is an electronic input device or a voice acquisition device.
In one embodiment, the system further comprises a visual operation device for a user to perform visual model evaluation through human-machine interaction.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts, reference may be made between the embodiments. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and for the relevant points reference may be made to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An intelligent decision method based on a multi-modal pre-training large model, characterized by comprising the following steps:
acquiring a decision problem, making an intelligent decision through a preset multi-modal pre-training model, generating a decision result, and storing a decision case;
obtaining the decision cases of the multi-modal pre-training model and constructing decision tag data; and performing supervised training with the decision tag data to adjust the model parameters of the multi-modal pre-training model.
2. The intelligent decision method based on a multi-modal pre-training large model according to claim 1, wherein the pre-training step of the multi-modal pre-training model comprises:
acquiring training data of multiple modalities;
extracting training features of the training data corresponding to each modality, uniformly encoding the training features, generating a tuple sequence corresponding to each modality, and constructing a multi-modal data set;
and performing joint training on a pre-constructed multi-modal data processing model through the tuple sequences corresponding to the modalities to obtain the model parameters of the multi-modal pre-training model.
3. The intelligent decision method based on a multi-modal pre-training large model according to claim 2, wherein the multi-modal training data comprises one or more of image data, video data, and text data.
4. The intelligent decision method based on a multi-modal pre-training large model according to claim 3, wherein extracting the training features of the training data corresponding to each modality, uniformly encoding the training features, and generating the tuple sequence corresponding to each modality is specifically:
for image data, the feature information is recorded as a tuple F1 = (C, O, P, R, …);
wherein C is the data modality type, O represents an object in the image, P is the position of the object in the image, and R represents other features, such as geometry, shape, amplitude, histogram, color, or local binary pattern;
for video data, images are extracted frame by frame to form an image set, and each image in the set is encoded as a tuple F2 = (C, O, P, R, T, …), where element T records the time information of the current frame;
for text data, features are extracted by natural language processing, and the text data tuple F3 can be encoded as (C, S, E, …), where S is the feature level and E is the environment information;
the multi-modal data set is Dstd = {D1, D2, D3, …, DN}.
5. The intelligent decision method based on a multi-modal pre-training large model according to claim 2, wherein the joint training specifically comprises:
acquiring training data of different data modality types in the multi-modal data set and merging them;
training a pre-constructed multi-modal data processing model with the merged data.
6. The intelligent decision method based on a multi-modal pre-training large model according to claim 5, wherein the merging methods comprise modality embedding, attention mechanisms, multi-view learning, or multi-task learning.
7. The intelligent decision method based on a multi-modal pre-training large model according to claim 1, wherein the step of constructing the decision tag data comprises:
creating a decision problem text;
calculating, according to text similarity, the distance l_m between the text of each decision case e_m and the decision problem text;
constructing a data Tag vector Tag = {(e_1, R_1, l_1), (e_2, R_2, l_2), (e_3, R_3, l_3), …}, where R is the decision case result;
and obtaining the training data format mapping rule of the multi-modal pre-training model, and converting the Tag vector into the pre-training data format according to the mapping rule to form the decision tag data.
8. An intelligent decision system based on a multi-modal pre-training large model, characterized by comprising:
a user data acquisition device for a user to input a decision problem;
a data processor configured with the multi-modal pre-training model, for making intelligent decisions through the multi-modal pre-training model according to the decision problem, generating decision results, and storing decision cases; and
an intelligent optimization module for acquiring the decision cases of the multi-modal pre-training model and constructing decision tag data, and for performing supervised training on the multi-modal pre-training model with the decision tag data to adjust the model parameters.
9. The intelligent decision system based on a multi-modal pre-training large model according to claim 8, wherein the user data acquisition device is an electronic input device or a voice acquisition device.
10. The intelligent decision system based on a multi-modal pre-training large model according to claim 8, further comprising a visual operation device for a user to perform visual model evaluation through human-machine interaction.
CN202310407938.4A 2023-04-17 2023-04-17 Intelligent decision method and system based on multi-mode pre-training large model Pending CN116484217A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310407938.4A CN116484217A (en) 2023-04-17 2023-04-17 Intelligent decision method and system based on multi-mode pre-training large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310407938.4A CN116484217A (en) 2023-04-17 2023-04-17 Intelligent decision method and system based on multi-mode pre-training large model

Publications (1)

Publication Number Publication Date
CN116484217A true CN116484217A (en) 2023-07-25

Family

ID=87213167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310407938.4A Pending CN116484217A (en) 2023-04-17 2023-04-17 Intelligent decision method and system based on multi-mode pre-training large model

Country Status (1)

Country Link
CN (1) CN116484217A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114250A (en) * 2023-10-24 2023-11-24 广州知韫科技有限公司 Intelligent decision-making system based on large model
CN117114250B (en) * 2023-10-24 2024-02-02 广州知韫科技有限公司 Intelligent decision-making system based on large model
CN117290462A (en) * 2023-11-27 2023-12-26 北京滴普科技有限公司 Intelligent decision system and method for large data model
CN117290462B (en) * 2023-11-27 2024-04-05 北京滴普科技有限公司 Intelligent decision system and method for large data model

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN116484217A (en) Intelligent decision method and system based on multi-mode pre-training large model
CN110929092B (en) Multi-event video description method based on dynamic attention mechanism
CN111563508A (en) Semantic segmentation method based on spatial information fusion
CN105787458A (en) Infrared behavior identification method based on adaptive fusion of artificial design feature and depth learning feature
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
JP2016062610A (en) Feature model creation method and feature model creation device
CN113792177B (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN110569359B (en) Training and application method and device of recognition model, computing equipment and storage medium
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
EP3884426B1 (en) Action classification in video clips using attention-based neural networks
CN113627266B (en) Video pedestrian re-recognition method based on transform space-time modeling
CN110853656B (en) Audio tampering identification method based on improved neural network
Alam et al. Two dimensional convolutional neural network approach for real-time bangla sign language characters recognition and translation
CN113837290A (en) Unsupervised unpaired image translation method based on attention generator network
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
Jiang et al. Hadamard product perceptron attention for image captioning
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN116958700A (en) Image classification method based on prompt engineering and contrast learning
CN114357221B (en) Self-supervision active learning method based on image classification
CN110135253A (en) A kind of finger vena identification method based on long-term recursive convolution neural network
CN115270917A (en) Two-stage processing multi-mode garment image generation method
CN115470799A (en) Text transmission and semantic understanding integrated method for network edge equipment
CN108921911B (en) Method for automatically converting structured picture into source code
Lu et al. Automatic lipreading based on optimized OLSDA and HMM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination