CN117786058A - Method for constructing multi-mode large model knowledge migration framework

Info

Publication number
CN117786058A
Authority
CN
China
Prior art keywords
visual
external
text
features
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311627552.0A
Other languages
Chinese (zh)
Inventor
陈伯瑜
王亚立
乔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Shanghai AI Innovation Center
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS, Shanghai AI Innovation Center filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202311627552.0A
Publication of CN117786058A
Legal status: Pending

Abstract

The application provides a method, an apparatus, an electronic device, and a storage medium for constructing a multi-modal large model knowledge migration framework, and relates to the field of video recognition. The method comprises: obtaining external visual features based on visual enhancement of an original video; inputting the original video into a visual-language large model for dialogue and extracting text features, the text features being textual descriptions of the original video; and inserting an adaptive module into a classification network, fusing the external visual features and the text features into the training of the classification network, and determining the knowledge migration framework after the training of the classification network converges. The method and the apparatus address the problem in the related art that multi-modal large models perform poorly on open-domain data.

Description

Method for constructing multi-mode large model knowledge migration framework
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for constructing a multi-mode large model knowledge migration framework, an electronic device, and a storage medium.
Background
With the rapid development of deep learning, many deep video understanding networks have been proposed in recent years. However, these recognition models are mainly evaluated on classical video benchmarks collected under ideal conditions. In many real-world applications, the shooting environment is far more complex: target resolution is low, illumination is poor, and video scenes are atypical. In such open-world scenarios, most existing video models generalize poorly and do not work well because they lack knowledge of the external world.
The development of large natural language models has spurred the development of many multi-modal models. Through pre-training on a large amount of web data, these foundation models acquire rich semantic knowledge and can adapt and generalize to downstream tasks. Open-world video recognition remains a difficult problem: how to use the knowledge in these large models for open-world video recognition has not been fully explored, and directly applying multi-modal large models to open-domain data does not work well, which needs further improvement.
Disclosure of Invention
The application provides a method and an apparatus for constructing a multi-modal large model knowledge migration framework, an electronic device, and a storage medium, which can solve the problem in the related art that multi-modal large models perform poorly on open-domain data. The technical scheme is as follows:
According to one aspect of the application, a method for constructing a multi-modal large model knowledge migration framework includes: obtaining external visual features based on visual enhancement of an original video; inputting the original video into a visual-language large model for dialogue and extracting text features, the text features being textual descriptions of the original video; and inserting an adaptive module into a classification network, fusing the external visual features and the text features into the training of the classification network, and determining the knowledge migration framework after the training of the classification network converges.
According to one aspect of the application, an apparatus for constructing a multi-modal large model knowledge migration framework includes: an external visual feature acquisition module, configured to obtain external visual features based on visual enhancement of an original video; a text feature extraction module, configured to input the original video into a visual-language large model for dialogue and extract text features, the text features being textual descriptions of the original video; and a framework determining module, configured to insert an adaptive module into a classification network, fuse the external visual features and the text features into the training of the classification network, and determine the knowledge migration framework after the training of the classification network converges.
In an exemplary embodiment, the apparatus further includes, but is not limited to: a visual data acquisition module, configured to preprocess the original video with a visual enhancement model to obtain corresponding visual data; and an external visual feature determining module, configured to input the visual data into a video base model to determine the external visual features.
In an exemplary embodiment, the apparatus further includes, but is not limited to: a category confidence score acquisition module, configured to obtain a category confidence score corresponding to the visual data based on the video base model; a confidence threshold retrieving module, configured to retrieve a confidence threshold; a confidence judging module, configured to judge whether the category confidence score is greater than the confidence threshold; an external description information acquisition module which, if so, determines the category corresponding to the category confidence score as an external tag and obtains external description information of the external tag; an external video title acquisition module which, otherwise, inputs the original video into a dialogue model to obtain an external video title; and a text feature extraction module, which inputs the external description information or the external video title into a BERT model to extract the text features.
In an exemplary embodiment, the apparatus further includes, but is not limited to: a word length data acquisition module, configured to obtain word length data corresponding to the external description information; a word length threshold retrieving module, configured to retrieve a word length threshold; and a word length judging module, configured to judge whether the word length data exceeds the word length threshold and, if so, abbreviate the external description information to the word length threshold.
In an exemplary embodiment, the apparatus further includes, but is not limited to: an adaptive feature acquisition module, configured to obtain visual adaptive features corresponding to the external visual features and text adaptive features corresponding to the text features; a middle-layer feature acquisition module, configured to obtain middle-layer features corresponding to the base model; and a combining module, configured to combine the visual adaptive features, the text adaptive features, and the middle-layer features through weighted residual addition.
In an exemplary embodiment, the apparatus further includes, but is not limited to: an external visual feature input module, configured to input the external visual features into a cross-attention mechanism to obtain the visual adaptive features; and a text feature input module, configured to input the text features into a cross-attention mechanism to obtain the text adaptive features.
According to one aspect of the application, an electronic device comprises at least one processor and at least one memory, wherein the memory has computer readable instructions stored thereon; the computer readable instructions are executed by one or more of the processors to cause an electronic device to implement a method of building a multimodal large model knowledge migration framework as described above.
According to one aspect of the application, a storage medium has stored thereon computer readable instructions that are executed by one or more processors to implement the method of constructing a multimodal large model knowledge migration framework as described above.
According to one aspect of the application, a computer program product includes computer-readable instructions stored in a storage medium, one or more processors of an electronic device reading the computer-readable instructions from the storage medium, loading and executing the computer-readable instructions, causing the electronic device to implement a method of constructing a multimodal large model knowledge migration framework as described above.
The technical scheme provided by the application brings the following beneficial effects. First, in the perception stage, the original open-world video is enhanced, which reduces the domain gap between open-world data and laboratory data; the visual features of the enhanced video are then extracted as external visual features. Second, in the chat stage, text features related to a predicted tag or a video caption are generated through visual-language understanding, so that the video is described in detail by diversified external textual knowledge. Finally, in the adaptation stage, the external visual and textual knowledge is fused by flexibly inserting a multi-modal knowledge adaptation module into the video backbone, thereby achieving open-world recognition.
In the technical scheme, external visual features are obtained based on visual enhancement of the original video; the original video is input into a visual-language large model for dialogue and text features are extracted; an adaptive module is inserted into the classification network, the external visual features and the text features are fused into the training of the classification network, and the knowledge migration framework is determined after the training converges. By combining the external visual features with a reduced domain gap and the text features that describe the original video in diverse ways, and by flexibly inserting the multi-modal knowledge adaptation module into the video backbone in the adaptation stage, external visual and textual knowledge is fused and open-world recognition is achieved, which effectively solves the problem in the related art that multi-modal large models perform poorly on open-domain data.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the teachings of the present application;
FIG. 2 is a flowchart illustrating a method of building a multimodal, large model knowledge migration framework, in accordance with an illustrative embodiment;
FIG. 3 is a flowchart of S111 through S115 in a method of constructing a multi-modal large model knowledge migration framework, according to an example embodiment;
FIG. 4 is a flowchart of S1141 through S1143 in a method of constructing a multi-modal large model knowledge migration framework, according to an example embodiment;
FIG. 5 is a block diagram of an apparatus for building a multimodal large model knowledge migration framework in accordance with an illustrative embodiment;
fig. 6 is a block diagram illustrating a configuration of an electronic device according to an exemplary embodiment.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The following is an introduction and explanation of several terms involved in this application:
multimode: multimodal techniques refer to techniques for information processing and interaction using a variety of different perceptual modalities (e.g., visual, text, audio, etc.). By combining a plurality of modal data, the multi-modal technology can obtain more comprehensive and accurate information, so that performances in the aspects of man-machine interaction, data analysis, intelligent recognition and the like are improved. One of the key challenges of multi-modal technology is how to fuse and coordinate information of different sensing modalities. This involves techniques such as data fusion, modality alignment, feature extraction, etc. In addition, multi-modal techniques also need to address the issues of heterogeneity and uncertainty between different perceptual modalities, and how to efficiently utilize multi-modal information for analysis and decision making.
Large model: a large model refers to a model with a large number of parameters and a complex structure. The main goal of using large models is to improve the performance and capabilities of the model to better address complex tasks and problems. The application of the large model technology is very wide, for example, in the field of natural language processing, the large model can be used for tasks such as machine translation, text generation, question-answering systems and the like; in the field of computer vision, large models can be used for tasks such as classification, object detection, and image generation. Large models can bring more accurate and powerful results over complex tasks.
And (3) knowledge migration: knowledge migration refers to the process of applying knowledge or models learned on one task to another related task. The method utilizes the learned model or characteristic representation to help solve the new task, thereby reducing the requirement for a large amount of annotation data, accelerating the training speed of the model and improving the performance of the model. The basic assumption of knowledge migration is that knowledge learned by one task can be reused on other tasks. This is because there may be some shared features or potential associations between many tasks so that knowledge learned on one task can help solve other tasks.
As described above, how to use the knowledge of these large models for open-world video recognition has not been fully explored, and directly applying multi-modal large models to open-domain data does not work well.
Therefore, the present application provides a method for constructing a multi-modal large model knowledge migration framework that can effectively improve recognition on open-domain data. Accordingly, the method is applicable to an apparatus for constructing a multi-modal large model knowledge migration framework, and the apparatus can be deployed in an electronic device.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation environment involved in a method of constructing a multi-modal large model knowledge migration framework. The implementation environment comprises a terminal, a server and a service system configured with a member association database.
Specifically, the terminal may be used for a client that provides a video description function to run, and may be an electronic device such as a desktop computer, a notebook computer, a tablet computer, a smart phone, and the like, which is not limited herein.
The client may provide video description functions, for example, a media player, a browser, etc., and may be in the form of an application program or a web page, and accordingly, a user interface for playing video by the client may be in the form of a program window or a web page, which is not limited herein.
The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. The server is an electronic device that provides a background service; for example, in the present implementation environment, the server provides a cloud storage service for audio and video data for the terminal.
The server establishes communication connection with the service system in advance in a wired or wireless mode and the like, and realizes linkage with the service system through the communication connection. The service system may be one server or may be a server cluster composed of a plurality of servers.
Through the interaction between the terminal and the server, the client running on the terminal initiates a resource use invitation to the server, requests the server to determine the position of the resource and the resource use time through resource allocation, and sends the invitation accordingly.
For the server, the resource allocation process is executed for the client of the inviter through the resource usage invitation linkage service system, and an invitation result indicating the position of the resource and the resource usage time is returned to the client of the inviter, so that the client of the inviter further confirms whether to send out the invitation of the resource usage to the client of the invitee.
Of course, the servers and the service system may be integrated in the same server cluster according to actual operation requirements, so that resource allocation is completed by the same server cluster.
In the following method embodiments, for convenience of description, the execution subject of each step of the method is described as an electronic device, but this configuration is not particularly limited.
As shown in fig. 2, a method for constructing a multi-modal large model knowledge migration framework includes:
S100, obtaining external visual characteristics based on visual enhancement of an original video;
After the original video of the open world is acquired, the perception operation is performed first; that is, the original video V is enhanced by the visual enhancement model F. Enhancing the original video reduces the domain gap between the open-world video and laboratory data and enriches its visual information. The enhancement can be expressed as V' = F(V), and the visual features of the enhanced video V' can then be extracted by the base model as the external visual features.
Different visual enhancement models can be adopted for different original video datasets. For example, for the TinyVIRAT dataset, the RealBasicVSR super-resolution model is used to enhance the original video, yielding clearer visual information and enriching some important details in the video; for the ARID dark-scene dataset, a gamma-transform brightening model is used to restore details that are difficult to recognize in dark scenes; and for a video pipe dataset, a segmentation rendering model is used to add a semantic segmentation mask on top of the original video. The visually enhanced data V' is obtained in this way.
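A minimal sketch of this perception stage in PyTorch, treating the dataset-specific enhancement model and the video base model as opaque callables (all names here, including `perceive`, `enhance_fn`, and `gamma_brighten`, are illustrative assumptions rather than components named by the patent):

```python
import torch
from typing import Callable

def perceive(video: torch.Tensor,
             enhance_fn: Callable[[torch.Tensor], torch.Tensor],
             base_model: torch.nn.Module) -> torch.Tensor:
    """Perception stage: enhance the open-world video, then extract
    external visual features with a frozen video base model.

    `enhance_fn` stands in for a dataset-specific enhancement model
    (super-resolution, brightening, segmentation overlay, ...).
    """
    with torch.no_grad():
        enhanced = enhance_fn(video)   # V' = F(V)
        f_v = base_model(enhanced)     # F_V = B(V')
    return f_v

def gamma_brighten(video: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Illustrative brightening transform for dark-scene videos;
    assumes pixel values normalized to [0, 1] (gamma < 1 brightens)."""
    return video.clamp(0.0, 1.0) ** gamma
```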
S110, inputting the original video into a visual-language large model for dialogue, and extracting text features;
Real-scene videos with complex environmental variables are difficult to perceive from simple visual features alone, so text information is combined to assist recognition: a dialogue process is introduced to acquire rich text descriptions that supplement the visual features perceived from the open-world original video. As shown in fig. 3, the specific method includes the following steps:
S111, obtaining category confidence scores corresponding to visual data based on a video basic model;
For the category confidence score, the video base model B is used to extract the external visual features F_V of the enhanced open-world original video and to predict a score S, where the prediction score S is the category confidence score referred to in this embodiment.
S112, calling a confidence threshold;
wherein the confidence threshold σ may be manually set in advance.
S113, judging whether the category confidence score S is larger than the confidence threshold σ.
S114, if yes, determining that the category corresponding to the category confidence score S is the external tag of the open-world original video; after the external tag is acquired, it is input into ChatGPT together with the prompt method P for semantic expansion, thereby obtaining the external description information T_p of the external tag. Otherwise, if none of the category confidence scores S exceeds the confidence threshold σ, the enhanced video is not well suited to perception by the base model, so the original video is input into the video chat model, denoted C, to obtain an external video title T_c.
The external description information T_p and the external video title T_c can thus be expressed as T_p = ChatGPT(P(tag)) and T_c = C(V), where P denotes the prompt method and C denotes the video chat model.
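A minimal sketch of this branching logic, assuming PyTorch tensors for the prediction scores and treating the tag-expansion and video-chat models as opaque callables (the function and parameter names are illustrative, not from the patent):

```python
import torch
from typing import Callable, Sequence

def generate_text_knowledge(video: torch.Tensor,
                            scores: torch.Tensor,          # class logits from the base model B
                            class_names: Sequence[str],
                            sigma: float,
                            expand_tag: Callable[[str], str],
                            video_chat: Callable[[torch.Tensor], str]) -> str:
    """Dialogue stage: pick between tag expansion (T_p) and video captioning (T_c).

    `expand_tag` stands in for the ChatGPT + prompt expansion of the external
    tag, and `video_chat` for the video chat model C; both interfaces are
    assumptions, not APIs specified by the patent.
    """
    conf, idx = scores.softmax(dim=-1).max(dim=-1)
    if conf.item() > sigma:
        tag = class_names[idx.item()]   # external tag with a confident prediction
        return expand_tag(tag)          # T_p: semantic expansion of the tag
    return video_chat(video)            # T_c: external video title
```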
in the process of obtaining the external description information of the external tag, as shown in fig. 4, the method further includes the following steps:
S1141, acquiring word length data corresponding to the external description information;
wherein the word length data is the word length of the description generated for the external tag.
S1142, calling a word length threshold value;
In this embodiment, the word length threshold is 160 words; in other embodiments, the word length threshold may be set to other values.
S1143, judging whether the word length data is larger than the word length threshold; if so, that is, if the word length of the generated external description information exceeds 160 words, the external description information is input into ChatGPT to be abbreviated, i.e., the word length of the external description information is shortened to 160 words. One or more models such as LLaMA, MiniGPT-4, or Vicuna may also be used for this step.
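A minimal sketch of this length check, where `abbreviate` is a placeholder for whichever chat model is used to shorten the description; the plain-truncation fallback is an assumption added only to keep the sketch self-contained:

```python
from typing import Callable, Optional

WORD_LIMIT = 160  # word length threshold used in this embodiment

def enforce_word_limit(description: str,
                       abbreviate: Optional[Callable[[str, int], str]] = None,
                       limit: int = WORD_LIMIT) -> str:
    """Shorten the external description if it exceeds the threshold.

    `abbreviate` is a placeholder for a chat model (ChatGPT, LLaMA,
    MiniGPT-4, Vicuna, ...) prompted to rewrite the text within `limit`
    words; plain truncation is used as a fallback in this sketch.
    """
    words = description.split()
    if len(words) <= limit:
        return description
    if abbreviate is not None:
        return abbreviate(description, limit)
    return " ".join(words[:limit])
```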
S115, inputting the external description information T_p or the external video title T_c into a BERT model to extract the text features F_T.
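A minimal sketch of this step using the Hugging Face transformers library; the specific checkpoint and the mean-pooling readout are assumptions, since the patent only specifies "a BERT model":

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "bert-base-uncased" is an illustrative checkpoint, not one named by the patent.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def extract_text_features(text: str) -> torch.Tensor:
    """Encode T_p or T_c into text features F_T with a frozen BERT."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1)                   # mean pooling over tokens is an assumption
```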
Through the above steps, diversified text descriptions of the open-world original video can be obtained.
S120, inserting an adaptive module into the classification network, fusing the external visual features and the text features into the training of the classification network, and determining the knowledge migration framework after the training of the classification network converges;
To integrate the external visual features F_V and the text features F_T into the training process on the open-world original video, an adaptation process is introduced. On the basis of the pre-trained classification model, an adaptation module is embedded into the intermediate layers of the classification model. Specifically, the text adaptive features G_T and the visual adaptive features G_V are combined with the intermediate-layer features F_B of the classification model through weighted residual addition, which can be expressed as F_B' = F_B + λ_V·G_V + λ_T·G_T, where λ_V and λ_T are the residual weights.
G_T and G_V are obtained from the compressed features F̂_T and F̂_V through a cross-attention mechanism. In the cross-attention operation, F_B serves as the query and F̂_T (respectively F̂_V) serves as the key-value pair; the whole process is given by G_T = CrossAttention(F_B, F̂_T, F̂_T) and G_V = CrossAttention(F_B, F̂_V, F̂_V).
F̂_T is obtained by compressing and encoding F_T with a set of learnable prompt tokens P_L: the cross-attention mechanism CA first compresses F_T, and the result is then encoded by the self-attention mechanism SA and the feed-forward network FFN, i.e. F̂_T = FFN(SA(CA(P_L, F_T))).
F̂_V is obtained from F_V by the same method: F̂_V = FFN(SA(CA(P_L, F_V))).
P_L is randomly initialized from a normal distribution and layer-normalized: P_L = LN(P_L).
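A minimal PyTorch sketch of this adaptation module, following the reconstruction above; the embedding dimension, number of latent prompt tokens, attention heads, and the learnable weights lam_v / lam_t are illustrative assumptions, not values given by the patent:

```python
import torch
import torch.nn as nn

class KnowledgeCompressor(nn.Module):
    """Compress external features F (text or visual) into a few latent tokens:
    hat_F = FFN(SA(CA(P_L, F))), with P_L initialized from a normal
    distribution and layer-normalized. Residual/norm details are omitted."""
    def __init__(self, dim: int = 768, num_latents: int = 8, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))      # P_L ~ N(0, 1)
        self.ln_latents = nn.LayerNorm(dim)                             # P_L = LN(P_L)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:             # feats: (B, N, dim)
        p = self.ln_latents(self.latents).expand(feats.size(0), -1, -1)
        x, _ = self.cross_attn(p, feats, feats)                          # CA(P_L, F)
        x, _ = self.self_attn(x, x, x)                                   # SA(.)
        return self.ffn(x)                                               # hat_F

class MultimodalKnowledgeAdapter(nn.Module):
    """Fuse compressed visual/text knowledge into the backbone's intermediate
    features F_B via cross-attention (F_B as query) and weighted residual
    addition: F_B' = F_B + lam_v * G_V + lam_t * G_T."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.compress_v = KnowledgeCompressor(dim, heads=heads)
        self.compress_t = KnowledgeCompressor(dim, heads=heads)
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lam_v = nn.Parameter(torch.tensor(0.1))                     # weights are assumptions
        self.lam_t = nn.Parameter(torch.tensor(0.1))

    def forward(self, f_b, f_v, f_t):                                    # all (B, N, dim)
        hat_v = self.compress_v(f_v)
        hat_t = self.compress_t(f_t)
        g_v, _ = self.attn_v(f_b, hat_v, hat_v)                          # query = F_B
        g_t, _ = self.attn_t(f_b, hat_t, hat_t)
        return f_b + self.lam_v * g_v + self.lam_t * g_t
```

One would typically insert such an adapter after one or more intermediate blocks of the classification backbone; which layers to adapt, and whether the backbone is fine-tuned or frozen, are design choices the patent text leaves open.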
the following is an embodiment of the device of the present application, which may be used to execute the method for constructing the multi-modal large model knowledge migration framework related to the present application. For details not disclosed in the device embodiments of the present application, please refer to a method embodiment of a method for constructing a multi-modal large model knowledge migration framework related to the present application.
Referring to fig. 5, in an embodiment of the present application, a device for constructing a multi-modal large model knowledge migration framework is provided, including but not limited to:
in an exemplary embodiment, the apparatus further includes, but is not limited to: an external visual feature acquisition module 200, a text feature extraction module 210, and a frame determination module 220;
the external visual feature acquisition module 200 is used for acquiring external visual features based on the visual enhancement of the original video;
the text feature extraction module 210 performs dialogue based on inputting the original video into the visual language big model, and is used for extracting text features, wherein the text features are text descriptions of the original video;
the framework determining module 220 inserts an adaptive module in the classification network, merges the external visual feature and the text feature into training of the classification network, and is used for determining a knowledge migration framework after the training of the classification network is converged.
In an exemplary embodiment, the apparatus further includes, but is not limited to: a visual data acquisition module 300 and an external visual feature determination module 310;
the visual data acquisition module 300 acquires a visual enhancement model to preprocess an original video, and is used for acquiring corresponding visual data;
the external visual feature determination module 310 inputs the visual data into the video base model for determining external visual features.
In an exemplary embodiment, the apparatus further includes, but is not limited to: a category confidence score acquisition module 400, a confidence threshold retrieval module 410, an external descriptive information acquisition module 420, an external video title acquisition module 430, and a text feature extraction module 440;
the category confidence score obtaining module 400 is configured to obtain a category confidence score corresponding to the visual data based on the video base model;
a confidence threshold retrieving module 410, configured to retrieve a confidence threshold; the confidence coefficient judging module is used for judging whether the category confidence coefficient score is larger than a confidence coefficient threshold value or not;
if yes, the external description information obtaining module 420 determines that the category corresponding to the category confidence score is an external tag, and is used for obtaining external description information of the external tag;
otherwise, the external video title obtaining module 430 inputs the original video into the dialogue model for obtaining an external video title;
the text feature extraction module 440 inputs the external description information or the external video title into the BERT model for extracting text features.
In an exemplary embodiment, the apparatus further includes, but is not limited to: a word length data acquisition module 500, a word length threshold value retrieval module 510 and a word length judgment module 520;
the word length data obtaining module 500 is configured to obtain word length data corresponding to the external description information;
a word length threshold value retrieving module 510, configured to retrieve a word length threshold value of a word;
the word length judging module 520 is configured to judge whether the word length data is greater than a word length threshold, and if yes, write the word length data of the external description information to the word length threshold.
In an exemplary embodiment, the apparatus further includes, but is not limited to: an adaptive feature acquisition module 600, an inter-layer feature acquisition module 610, and a combining module 620;
the adaptive feature obtaining module 600 is configured to obtain a visual adaptive feature corresponding to the external visual feature and a text adaptive feature corresponding to the text feature; in (a)
The interlayer feature acquisition module 610 is configured to acquire an interlayer feature corresponding to the base model;
a combining module 620, configured to combine the visual adaptive feature, the text adaptive feature and the middle layer feature by weighted residual addition.
In an exemplary embodiment, the apparatus further includes, but is not limited to: an external visual feature input module 700 and a text feature input module 710;
the external visual feature input module 700 is configured to input the external visual feature into a cross-attention mechanism, and obtain a visual adaptive feature;
the text feature input module 710 is configured to input the text feature into a cross-attention mechanism, and obtain a text adaptive feature.
It should be noted that, when the multi-mode large model knowledge migration framework building apparatus provided in the foregoing embodiments performs multi-mode large model knowledge migration framework building, only the division of the foregoing functional modules is used for illustration, and in practical application, the foregoing functional allocation may be completed by different functional modules according to needs, that is, the internal structure of the multi-mode large model knowledge migration framework building apparatus may be divided into different functional modules, so as to complete all or part of the functions described above.
In addition, the device for constructing the multi-mode large model knowledge migration framework provided in the above embodiment belongs to the same concept as the embodiment of the method for constructing the multi-mode large model knowledge migration framework, and the specific manner in which each module performs the operation has been described in detail in the method embodiment, which is not described herein again.
Referring to fig. 6, in an embodiment of the present application, an electronic device 4000 is provided. The electronic device 4000 may include a desktop computer, a notebook computer, a server, and the like.
In fig. 6, the electronic device 4000 includes at least one processor 4001 and at least one memory 4003.
Data interaction between the processor 4001 and the memory 4003 may be achieved through at least one communication bus 4002. The communication bus 4002 may include a path for transferring data between the processor 4001 and the memory 4003. The communication bus 4002 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, among others, and can be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean that there is only one bus or only one type of bus.
Optionally, the electronic device 4000 may further comprise a transceiver 4004, the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit ), general purpose processor, DSP (Digital Signal Processor, data signal processor), ASIC (Application Specific Integrated Circuit ), FPGA (Field Programmable Gate Array, field programmable gate array) or other programmable logic device, transistor logic device, hardware components, or any combination thereof. Which may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, e.g., comprising one or more microprocessor combinations, a combination of a DSP and a microprocessor, etc.
Memory 4003 may be, but is not limited to, ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, EEPROM (Electrically Erasable Programmable Read Only Memory), CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program instructions or code in the form of instructions or data structures and that can be accessed by the electronic device 4000.
The memory 4003 has computer readable instructions stored thereon, and the processor 4001 can read the computer readable instructions stored in the memory 4003 through the communication bus 4002.
The computer-readable instructions are executed by the one or more processors 4001 to implement the method of constructing a multi-modal large model knowledge migration framework in the above embodiments.
Furthermore, in an embodiment of the present application, a storage medium is provided, on which computer readable instructions are stored, where the computer readable instructions are executed by one or more processors to implement a method for constructing a multimodal big model knowledge migration framework as described above.
In an embodiment of the present application, a computer program product is provided, where the computer program product includes computer readable instructions, where the computer readable instructions are stored in a storage medium, and where one or more processors of an electronic device read the computer readable instructions from the storage medium, load and execute the computer readable instructions, so that the electronic device implements a method for constructing a multi-modal large model knowledge migration framework as described above.
Through the construction of the multi-modal large model knowledge migration framework, the model's ability to understand open-domain videos is improved by using three processes: perception, dialogue, and adaptation. In the perception process, the model in the embodiments of the application can be customized with different modules according to the actual situation; for example, a super-resolution model can preprocess low-resolution data to reduce the domain gap, and for other open-domain datasets the preprocessing can be set manually according to the scene, e.g., foggy, noisy, or dark and blurry night data can be handled by a module that reduces the corresponding data defects. In the dialogue process, the model used to extract text features can also be selected through ablation experiments. The adaptation module can be inserted into current mainstream Transformer and CNN models in a plug-and-play manner, which facilitates model training and testing; by introducing the plug-and-play knowledge adaptation module in the final adaptation stage, external multi-modal large model knowledge is flexibly integrated into various video backbones to assist the training and recognition of the backbone network. The fusion of external visual and textual knowledge enables open-world recognition, which effectively solves the problem in the related art that multi-modal large models perform poorly on open-domain data.
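As an illustration of this plug-and-play insertion, one possible wiring uses a PyTorch forward hook to fuse the adapter output into an intermediate backbone block; this wiring and the attribute names (e.g. `backbone.blocks[6]`) are assumptions, not the patent's prescribed mechanism:

```python
import torch.nn as nn

class AdapterHook:
    """Plug-and-play wiring (illustrative): registered on an intermediate
    backbone block so that its output F_B is replaced by the adapter's fused
    output; the external features are set once per batch."""
    def __init__(self, adapter: nn.Module):
        self.adapter = adapter
        self.f_v = None   # external visual features for the current batch
        self.f_t = None   # external text features for the current batch

    def __call__(self, module, inputs, output):
        if self.f_v is None or self.f_t is None:
            return output                                 # nothing to fuse
        return self.adapter(output, self.f_v, self.f_t)   # F_B' replaces F_B

# Assumed usage (module paths are hypothetical):
# hook = AdapterHook(adapter)
# handle = backbone.blocks[6].register_forward_hook(hook)
# per batch: hook.f_v, hook.f_t = f_v, f_t; logits = backbone(video)
# handle.remove()  # detach the adapter when it is no longer needed
```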
The multi-modal large model knowledge migration framework constructed by the scheme in the embodiments of the application has a wide application range. For example, in the field of video surveillance and security, video behavior recognition can be applied to monitoring systems, dim-light scenes, low-resolution footage, and the like, so that behaviors in real scenes can be detected and recognized. In the field of intelligent transportation systems, video behavior recognition can be used for traffic monitoring and management: by recognizing traffic flow, vehicle violations, traffic congestion, and so on, real-time traffic condition information can be provided to help optimize traffic flow and improve traffic safety. In the field of media content analysis, video behavior recognition can be used for automatic analysis and tagging of media content, for example automatically tagging video content on a video website and detecting offensive or harmful content, which provides a better user experience and better content management. In the field of urban construction, it can assist in inspecting sewer pipes or other infrastructure, checking operating conditions, and detecting damage.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing is only a partial embodiment of the present application, and it should be noted that, for a person skilled in the art, several improvements and modifications can be made without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. The method for constructing the multi-mode large model knowledge migration framework is characterized by comprising the following steps of:
obtaining external visual characteristics based on visual enhancement of the original video;
based on inputting the original video into the visual language big model to carry out dialogue, extracting text features, wherein the text features are text descriptions of the original video;
and inserting an adaptive module into the classification network, fusing the external visual features and the text features into training of the classification network, and determining a knowledge migration framework after convergence of training of the classification network.
2. The method of claim 1, wherein in the process of obtaining external visual features based on visual enhancement of the original video, the method further comprises:
the method comprises the steps of obtaining a visual enhancement model to preprocess an original video, and obtaining corresponding visual data:
the visual data is input into a video base model to determine external visual features.
3. The method of claim 1, wherein in the process of extracting text features based on inputting an original video into a visual language big model for dialogue, the method further comprises:
acquiring category confidence scores corresponding to the visual data based on the video basic model;
invoking a confidence threshold;
judging whether the category confidence score is larger than a confidence threshold value or not;
if yes, determining the category corresponding to the category confidence score as an external tag, and acquiring external description information of the external tag; otherwise, inputting the original video into a dialogue model to acquire an external video title;
inputting the external description information or the external video title into a BERT model to extract text features.
4. The method of claim 3, wherein in the process of obtaining the external description information of the external tag, the method further comprises:
acquiring word length data corresponding to the external description information;
invoking a word length threshold; and judging whether the word length data is larger than the word length threshold, and if so, abbreviating the external description information to the word length threshold.
5. The method of claim 1, wherein during the fusing of the external visual feature and the text feature into training of a classification network, the method further comprises:
acquiring visual self-adaptive features corresponding to the external visual features and text self-adaptive features corresponding to the text features;
obtaining middle layer characteristics corresponding to the basic model;
combining the visual adaptive feature, the text adaptive feature and the intermediate layer feature by weighted residual addition.
6. The method of claim 5, wherein in the process of obtaining visual and text adaptation features, the method further comprises:
inputting the external visual characteristics into a cross attention mechanism to acquire visual self-adaptive characteristics;
and inputting the text characteristics into a cross-attention mechanism to acquire text self-adaptive characteristics.
7. The device for constructing the multi-mode large model knowledge migration framework is characterized by comprising the following components:
the external visual characteristic acquisition module is used for acquiring external visual characteristics based on the visual enhancement of the original video;
the text feature extraction module is used for carrying out dialogue based on inputting the original video into the visual language big model and extracting text features, wherein the text features are text descriptions of the original video;
and the framework determining module is used for inserting the self-adaptive module into the classification network, fusing the external visual features and the text features into training of the classification network, and determining a knowledge migration framework after the training of the classification network is converged.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the visual data acquisition module is used for acquiring a visual enhancement model to preprocess the original video and acquiring corresponding visual data;
and the external visual characteristic determining module is used for inputting the visual data into the video basic model for determining external visual characteristics.
9. An electronic device, comprising: at least one processor, and at least one memory, wherein,
the memory has computer readable instructions stored thereon;
the computer readable instructions, executed by one or more of the processors, cause an electronic device to implement the method of constructing a multi-modal large model knowledge migration framework of any one of claims 1 to 6.
10. A storage medium having stored thereon computer readable instructions, the computer readable instructions being executable by one or more processors to implement the method of constructing a multi-modal large model knowledge migration framework of any one of claims 1 to 6.
CN202311627552.0A 2023-11-30 2023-11-30 Method for constructing multi-mode large model knowledge migration framework Pending CN117786058A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311627552.0A CN117786058A (en) 2023-11-30 2023-11-30 Method for constructing multi-mode large model knowledge migration framework

Publications (1)

Publication Number Publication Date
CN117786058A true CN117786058A (en) 2024-03-29



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination