CN117633707A - Fine-grained multi-modal Chinese large language model construction method and computer storage medium - Google Patents

Fine-grained multi-modal Chinese large language model construction method and computer storage medium

Info

Publication number
CN117633707A
CN117633707A
Authority
CN
China
Prior art keywords
language model
mode
fine
large language
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311630540.3A
Other languages
Chinese (zh)
Other versions
CN117633707B (en
Inventor
孙腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Royole Technologies Co Ltd
Original Assignee
Shenzhen Royole Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Royole Technologies Co Ltd filed Critical Shenzhen Royole Technologies Co Ltd
Priority to CN202311630540.3A priority Critical patent/CN117633707B/en
Publication of CN117633707A publication Critical patent/CN117633707A/en
Application granted granted Critical
Publication of CN117633707B publication Critical patent/CN117633707B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256: Fusion techniques of classification results relating to different input data, e.g. multimodal recognition
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Learning methods
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a fine-grained multi-modal Chinese large language model construction method and a computer storage medium, belonging to the field of computers. The method comprises the following steps: planning the architecture of a fine-grained multi-modal Chinese large language model, wherein the fine-grained multi-modal Chinese large language model comprises a multi-modal information extraction and fusion module, a core large language model, and a multi-modal content generation module. Because the model comprises these three modules, the Chinese large language model serves as the central understanding and generation hub of the model system and can execute a series of multi-modal content understanding and content generation tasks according to user instructions. Compared with current multi-modal large model technology, the method has the advantages of fewer hallucination problems, many extensible functions, low training cost, and deep understanding of complex multi-modal scenes.

Description

Fine-grained multi-modal Chinese large language model construction method and computer storage medium
Technical Field
The application relates to the field of computers, and in particular to a fine-grained multi-modal Chinese large language model construction method and a computer storage medium.
Background
With the explosive growth of digital media, a large amount of multi-modal data is generated in daily life. Multi-modal content is emerging in social media, online video, virtual reality, and augmented reality applications, and contains varied information such as text, images, sound, and video. It has therefore become critical to develop techniques that can understand and analyze such multi-modal data. At present, large language models for Chinese have become a research hotspot, but owing to problems such as limited data resources, the prior art still has certain deficiencies in multi-modal integration and model generalization, and the hard-to-avoid hallucination problem of large models restricts their application. Hallucination arises because the granularity of the extracted multi-modal information is not fine enough and the multi-modal information is not well aligned.
Disclosure of Invention
The fine-grained multi-modal Chinese large language model construction method and computer storage medium of the present application solve the problems that, in the prior art, large language models still have certain deficiencies in multi-modal integration and model generalization, and that the hard-to-avoid hallucination problem restricts their application. The method improves the capability of large language models in multi-modal integration and model generalization, reduces hallucination, and improves applicability. The application provides a fine-grained multi-modal Chinese large language model construction method comprising the following steps: planning the architecture of a fine-grained multi-modal Chinese large language model, wherein the model comprises a multi-modal information extraction and fusion module, a core large language model, and a multi-modal content generation module; constructing the multi-modal information extraction and fusion module; constructing the core large language model; constructing the multi-modal content generation module; and training and tuning.
In another aspect, the present application also provides a computer storage medium storing a program for executing the above fine-grained multi-modal Chinese large language model construction method.
The technical scheme provided by the application has at least the following technical effects or advantages:
Because the fine-grained multi-modal Chinese large language model comprises a multi-modal information extraction and fusion module, a core large language model, and a multi-modal content generation module, the Chinese large language model serves as the central understanding and generation hub of the model system and can execute a series of multi-modal content understanding and content generation tasks according to user instructions. Compared with current multi-modal large model technology, the method has the advantages of fewer hallucination problems, many extensible functions, low training cost, and deep understanding of complex multi-modal scenes.
Drawings
FIG. 1 is a flowchart of a fine-grained multi-modal Chinese large language model construction method according to an embodiment of the present application;
FIG. 2 is a flowchart of constructing the multi-modal information extraction and fusion module in the fine-grained multi-modal Chinese large language model construction method according to an embodiment of the application;
FIG. 3 is a flowchart of constructing the core large language model in the fine-grained multi-modal Chinese large language model construction method according to an embodiment of the application;
FIG. 4 is a flowchart of constructing the multi-modal content generation module in the fine-grained multi-modal Chinese large language model construction method according to an embodiment of the application.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
To solve the technical problems described in the background, the embodiments of the present application make three innovations relative to the traditional multi-modal large model architecture. 1. For feature extraction, a multi-modal information extraction and fusion method is proposed: fine-grained multi-modal high-level semantic features are extracted by means of the SIA-CNN model, while the multi-modal high-level semantic extractor, the MMSSE model, extracts and aligns the multi-modal high-level semantics. 2. For the text large model, a sound-shape-meaning integrated large language model training method is adopted, so that the text large model has better compatibility when reasoning over aligned multi-modal information. 3. For multi-modal output, a decoupled multi-modal content generation method is proposed, so that the model can accurately convert the high-level multi-modal semantics obtained after reasoning into the specified multi-modal output.
referring to fig. 1-4, an embodiment of the present application provides a method for constructing a fine-grained multi-mode chinese large language model, including S1, planning a framework of a fine-grained multi-mode chinese large language model, where the fine-grained multi-mode chinese large language model includes a multi-mode information extraction and fusion module, a core large language model, and a multi-mode content generation module; s2, constructing a multi-mode information extraction and fusion module; s3, constructing a core large language model; s4, constructing a multi-mode content generation module; s5, training and optimizing;
the method in an exemplary embodiment of the present application is composed of three major parts, namely a multi-modal information extraction and fusion method, a phonetic sense integrated large language model training method and a decoupled multi-modal content generation method:
aiming at the defect that the large voice model has certain defects in multi-modal integration and model generalization, unavoidable pseudoscopic problems restrict the applicable problem of the model, in an exemplary embodiment, S1, the architecture of a fine-grained multi-modal Chinese large language model is planned, and the fine-grained multi-modal Chinese large language model comprises a multi-modal information extraction and fusion module, a core large language model and a multi-modal content generation module;
s2: the multi-mode information extraction and fusion module is composed of a fine-grain visual feature extractor and a multi-mode high-level semantic extractor. The fine-grained visual feature extractor consists of Vision Transformer (ViT) and an autonomously developed spatial information adapter SIA-CNN based on a convolutional neural network; and ViT, fully interacting and fusing the global visual features extracted by ViT with the local visual features extracted by SIA-CNN to obtain fine-grained visual representation of the visual input signals. The MMSSE model of the multi-mode high-level semantic extractor extracts and aligns semantic information on the fine-grain visual representation according to a user instruction to obtain fine-grain multi-mode high-level semantic features, the multi-mode information can be represented on a plurality of fine-grain layers, and the understanding capability of the multi-mode large model on input is improved.
In an exemplary embodiment, the step S2 multi-modal information extraction and fusion method comprises: S21: first constructing a fine-grained visual feature extractor, using a Vision Transformer (ViT) as the visual-modality extraction backbone.
S22: Spatial information adapters (SIA-CNN) are added into ViT; each SIA-CNN layer consists of a 3-segment residual convolutional neural network, and one SIA-CNN layer is inserted into each transformer layer of ViT. Within SIA-CNN, features are first projected down to a smaller dimension through multiple convolutional layers, passed through a nonlinear activation function, and then projected back up to the original dimension through multiple transposed convolutional layers. In addition, a residual connection is placed between the input and the output of the whole SIA-CNN to strengthen the robustness of the model structure. The newly added parameters of this module account for about 17% of the original model, allowing ViT to fully extract the depth and spatial information of the picture and enhancing the ability of the ViT output vector to represent fine-grained information.
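The bottleneck-with-residual shape of the adapter described in S22 can be sketched as follows. This is a minimal illustration, not the patent's implementation: plain matrices stand in for the (transposed) convolutional layers, and the dimensions and GELU activation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 64, 16  # assumed dimensions

# Stand-ins for the down-projecting convolutions and up-projecting
# transposed convolutions described in S22.
W_down = rng.normal(0, 0.02, (d_model, d_bottleneck))
W_up = rng.normal(0, 0.02, (d_bottleneck, d_model))

def gelu(x):
    """Tanh approximation of the GELU nonlinearity (assumed activation)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def sia_adapter(x):
    """Down-project, activate, up-project, then add the residual input."""
    h = gelu(x @ W_down)   # project to the smaller dimension and activate
    return x + h @ W_up    # residual connection around the whole adapter

tokens = rng.normal(size=(10, d_model))  # 10 visual tokens from a ViT layer
out = sia_adapter(tokens)
print(out.shape)  # (10, 64): output keeps the original dimension
```

The residual connection means an adapter initialized near zero barely perturbs the frozen ViT features, which is the usual reason such adapters train stably.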
S23: The MMSSE model of the multi-modal high-level semantic extractor is a dual-tower model formed by two 7-layer transformers. One 7-layer transformer tower takes the instruction information input by the user; the other takes a learned random instruction vector together with the multi-modal information input. Through joint training on an image-text contrastive learning task, an image-text generation task, an image-text matching task, and a hard negative mining strategy, the multi-modal high-level semantic extractor can extract semantic information from the fine-grained visual representation according to user instructions, obtaining fine-grained multi-modal high-level semantic features guided by the user instruction.
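The image-text contrastive objective mentioned in S23 is commonly realized as a symmetric InfoNCE loss over a batch of paired embeddings. The sketch below shows that form under stated assumptions (batch size, dimension, and temperature are illustrative; the patent does not specify its exact loss):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 32
img = rng.normal(size=(batch, dim))
txt = img + 0.1 * rng.normal(size=(batch, dim))  # pretend pairs are roughly aligned

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(img, txt, temperature=0.07):
    """Symmetric contrastive loss: matched pairs are positives, the rest negatives."""
    logits = l2norm(img) @ l2norm(txt).T / temperature  # (batch, batch) similarities
    labels = np.arange(batch)

    def ce(lg):  # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))  # image-to-text plus text-to-image

loss = info_nce(img, txt)
print(loss > 0)  # True: loss is positive until pairs dominate their negatives
```

Hard negative mining, as named in S23, would bias the off-diagonal entries toward the most confusable mismatched pairs rather than uniformly sampled ones.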
S24: The multi-modal high-level semantic features from the multi-modal high-level semantic extractor undergo a linear transformation through a linear layer and are mapped into the semantic space of the text, so that the large model can understand the multi-modal information and reason over it.
S3: The large language model provides reasoning and understanding capability and is the core of the multi-modal large model architecture, with 11B parameters in total; it adopts a unified sound-shape-meaning representation as input to enhance its understanding of multi-modal information. The ultra-large-scale dataset used by the large language model has undergone seven rounds of multi-dimensional, multi-level deep cleaning and totals 1500B characters.
In an exemplary embodiment, the step S3 of the method for training a large-scale phonetic meaning model includes: s31: constructing a pre-training basic data set, collecting and arranging 40GB small red book data, 120G headline news data, 50G WeChat public number data, 110G known data, 50G encyclopedia data, 70G webpage data, 420G MNBVC Chinese data, 200G code data and 900G English data, constructing an automatic cleaning script, and cleaning the data according to specified rules.
S32: Constructing the large language model architecture: a decoder-only structure is adopted, with 28 transformer layers as the model body, and rotary position encoding is used as the position-coding scheme to maintain the model's performance over long contexts. Assume that the inner product between the query vector and the key vector can be expressed by a function g of the word embedding vectors x_m and x_n and their relative position m-n:

⟨f_q(x_m, m), f_k(x_n, n)⟩ = g(x_m, x_n, m-n)

where f_q(x_m, m) is the query vector and f_k(x_n, n) is the key vector. An equivalent position encoding satisfying this condition can be constructed so that the formula above still holds:

f_q(x_m, m) = (W_q x_m) e^{imθ}
f_k(x_n, n) = (W_k x_n) e^{inθ}
g(x_m, x_n, m-n) = Re[(W_q x_m)(W_k x_n)* e^{i(m-n)θ}]

Based on the geometric meaning of complex multiplication, this transformation corresponds to a rotation of the vector, hence the name rotary position encoding. Meanwhile, at the embedding stage, the input of the large language model uses a three-in-one vector representation of sound, shape, and meaning; the three features are deeply fused and then trained in a unified way, so that the large language model can understand multi-modal information more deeply.
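The rotation property behind the formulas above can be checked numerically in the two-dimensional (complex) case: multiplying by e^{i·pos·θ} rotates a feature, and the resulting attention score depends only on the relative position m-n. The value of θ and the dimensions here are illustrative assumptions, not the patent's settings.

```python
import numpy as np

theta = 0.1  # assumed rotation frequency

def rotate(x, pos):
    """Apply the rotation e^{i * pos * theta} to a complex 2-D feature."""
    return x * np.exp(1j * pos * theta)

rng = np.random.default_rng(0)
q = complex(*rng.normal(size=2))  # query feature as a complex number
k = complex(*rng.normal(size=2))  # key feature

def score(m, n):
    # Re[(q e^{im θ}) conj(k e^{in θ})] = Re[q conj(k) e^{i(m-n) θ}]
    return (rotate(q, m) * np.conj(rotate(k, n))).real

# Shifting both positions by the same offset leaves the score unchanged:
print(np.isclose(score(5, 2), score(105, 102)))  # True
```

This relative-position invariance is what lets rotary position encoding generalize across absolute positions in long contexts.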
S33: Integrated multi-card cluster training: a distributed deep-learning training framework is used to accelerate model training. To handle GPU failures in long-running training scenarios, the compute scheduling scheme is partitioned using hierarchical management of communication nodes: some server nodes are designated as communication nodes and are responsible for uniformly collecting the logs and values of their subordinate compute workers during computation. When some of the compute workers subordinate to a communication node fail, the communication node tries to wake up the failed GPU; if repeated attempts fail, the results are saved directly and training is stopped, so that the compute workers subordinate to other communication nodes are not affected.
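The fault-handling policy in S33 can be sketched as a small supervision loop. This is a hypothetical illustration: the worker representation, `try_wake` stand-in, and retry count are assumptions, not from the patent.

```python
MAX_RETRIES = 3  # assumed number of wake-up attempts

def try_wake(worker):
    """Stand-in for one attempt to revive a failed accelerator."""
    return worker.get("recoverable", False)

def supervise(workers, save_checkpoint):
    """One pass of a communication node over its subordinate workers."""
    for worker in workers:
        if worker["healthy"]:
            continue  # healthy workers keep computing
        if any(try_wake(worker) for _ in range(MAX_RETRIES)):
            worker["healthy"] = True  # wake-up succeeded, training continues
        else:
            save_checkpoint()         # save results, then stop this node's training
            return "stopped"
    return "running"

log = []
workers = [{"healthy": True}, {"healthy": False, "recoverable": True}]
print(supervise(workers, lambda: log.append("ckpt")))  # "running": worker revived
```

Because each communication node supervises only its own workers, a node that stops after exhausting retries does not interrupt the workers under other nodes, matching the isolation goal stated above.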
S4: The multi-modal content generation module consists of several downstream sub-models, including an image generation model based on the Stable Diffusion model framework and an audio generation model based on the EnCodec model framework. Through joint training, the downstream models can directly use the hidden vectors of the large language model to generate the specified content, enhancing the extensibility of the multi-modal large model framework to downstream tasks.
In an exemplary embodiment, the step S4 decoupled multi-modal content generation method comprises:
s41: based on the large language model combined training StableDiffuse model, freezing the parameters of the large predictive model in the training process, using an alignment model of a 5-layer transformation structure to align the hidden vector in the large language model to the condition vector space in the StableDiffuse model, and mutually decoupling the StableDiffuse model and the large language model, so that the large language model can regulate and control the generation result of the StableDiffuse model at any time.
S42: The EnCodec audio generation model is jointly trained with the large language model; the parameters of the pretrained large model are frozen during training, a 3-layer perceptron structure applies a nonlinear transformation to the hidden vectors of the large language model, and the output is superimposed, according to a weight, onto the latent-space features of the EnCodec audio generation model, realizing the conversion from text output to audio output.
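The conditioning step in S42 reduces to a small MLP on the LLM hidden vector plus a weighted addition onto the audio model's latent features. The sketch below shows that shape; the layer sizes, tanh activation, and mixing weight are illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_llm, d_latent = 64, 32  # assumed hidden and latent dimensions

# Weights of the 3-layer perceptron (shapes are assumptions).
W1 = rng.normal(0, 0.05, (d_llm, 48))
W2 = rng.normal(0, 0.05, (48, 40))
W3 = rng.normal(0, 0.05, (40, d_latent))

def mlp(h):
    """3-layer perceptron: nonlinear transform of the LLM hidden vector."""
    h = np.tanh(h @ W1)
    h = np.tanh(h @ W2)
    return h @ W3

llm_hidden = rng.normal(size=d_llm)      # hidden vector from the frozen LLM
audio_latent = rng.normal(size=d_latent)  # latent feature of the audio model

weight = 0.3  # assumed mixing weight
conditioned = audio_latent + weight * mlp(llm_hidden)  # weighted superposition
print(conditioned.shape)  # (32,)
```

Because only the small MLP (and, in S41, the alignment model) is trained while both large models stay frozen, the two generators remain decoupled and cheap to retarget, which is the training-cost advantage the application claims.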
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
The foregoing is merely a preferred embodiment of the present application, but the scope of protection of the present application is not limited thereto. Any equivalent replacements and modifications readily conceived by a person skilled in the art within the scope disclosed by the present application shall fall within the scope of protection of the present application.

Claims (10)

1. A fine-grained multi-modal Chinese large language model construction method, characterized by comprising the following steps:
S1, planning the architecture of a fine-grained multi-modal Chinese large language model, wherein the fine-grained multi-modal Chinese large language model comprises a multi-modal information extraction and fusion module, a core large language model, and a multi-modal content generation module;
S2, constructing the multi-modal information extraction and fusion module;
S3, constructing the core large language model;
S4, constructing the multi-modal content generation module;
S5, training and tuning.
2. The method for constructing a fine-grained multi-modal Chinese large language model according to claim 1, wherein S2 comprises the following steps:
S21: constructing a fine-grained visual feature extractor, using a Vision Transformer as the visual-modality extraction backbone;
S22: adding a spatial information adapter based on a convolutional neural network into the Vision Transformer, one spatial information adapter layer being inserted into each transformer layer of the Vision Transformer;
S23: the multi-modal high-level semantic extractor is a dual-tower model formed by two 7-layer transformers, wherein one 7-layer transformer tower takes the instruction information input by the user, and the other takes a learned random instruction vector together with the multi-modal information input; through joint training on an image-text contrastive learning task, an image-text generation task, an image-text matching task, and a hard negative mining strategy, the multi-modal high-level semantic extractor can extract semantic information from the fine-grained visual representation according to user instructions, obtaining fine-grained multi-modal high-level semantic features guided by the user instruction;
S24: the multi-modal high-level semantic features of the multi-modal high-level semantic extractor undergo a linear transformation through a linear layer and are mapped into the semantic space of the text, so that the large model can understand the multi-modal information and reason over it.
3. The method for constructing a fine-grained multi-modal Chinese large language model according to claim 2, wherein in S22 the spatial information adapter projects features down to a smaller dimension through multiple convolutional layers, then through a nonlinear activation function, and then back up to the original dimension through multiple transposed convolutional layers.
4. The method for constructing a fine-grained multi-modal Chinese large language model according to claim 2, wherein in S22 a residual connection is further provided between the input and the output of the whole spatial information adapter to enhance the robustness of the model structure.
5. The method for constructing a fine-grained multi-modal Chinese large language model according to claim 1, wherein S3 comprises the following steps:
S31: constructing a pre-training base dataset, collecting and organizing a variety of internet data, building automatic cleaning scripts, and cleaning the data according to specified rules;
S32: constructing the core large language model architecture, adopting a decoder-only structure with 28 transformer layers as the model body;
S33: integrated multi-card cluster training, using a distributed training framework to accelerate model training.
6. The method for constructing a fine-grained multi-modal Chinese large language model according to claim 5, wherein in S33 the compute scheduling scheme is partitioned using hierarchical management of communication nodes: some server nodes are designated as communication nodes and are responsible for uniformly collecting the logs and values of their subordinate compute workers during computation; when some of the compute workers subordinate to a communication node fail, the communication node tries to wake up the failed GPU, and if repeated attempts fail, the results are saved directly and training is stopped.
7. The fine-grained multi-modal Chinese large language model construction method according to claim 5, wherein S32 uses rotary position encoding as the position-coding scheme.
8. The method for constructing a fine-grained multi-modal Chinese large language model according to claim 5, wherein in S32 the input of the core large language model uses a three-in-one vector representation of sound, shape, and meaning; the three features are deeply fused and then trained in a unified way, so that the large language model can understand multi-modal information more deeply.
9. The method for constructing a fine-grained multi-modal Chinese large language model according to claim 1, wherein S4 comprises the following steps: S41: jointly training a Stable Diffusion model with the large language model, freezing the parameters of the pretrained large model during training, and using an alignment model with a 5-layer transformer structure to align the hidden vectors of the large language model into the condition-vector space of the Stable Diffusion model, so that the large language model can regulate the generation result of the Stable Diffusion model at any time; S42: jointly training an EnCodec audio generation model with the large language model, freezing the parameters of the pretrained large model during training, using a 3-layer perceptron structure to apply a nonlinear transformation to the hidden vectors of the large language model, and superimposing the output, according to a weight, onto the latent-space features of the EnCodec audio generation model, realizing the conversion from text output to audio output.
10. A computer storage medium storing a program capable of executing the fine-grained multi-modal Chinese large language model construction method according to any one of claims 1 to 9.
CN202311630540.3A 2023-12-01 2023-12-01 Fine-grained multi-modal Chinese large language model construction method and computer storage medium Active CN117633707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311630540.3A CN117633707B (en) 2023-12-01 2023-12-01 Fine-grained multi-modal Chinese large language model construction method and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311630540.3A CN117633707B (en) 2023-12-01 2023-12-01 Fine-grained multi-modal Chinese large language model construction method and computer storage medium

Publications (2)

Publication Number Publication Date
CN117633707A true CN117633707A (en) 2024-03-01
CN117633707B CN117633707B (en) 2024-08-23

Family

ID=90033405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311630540.3A Active CN117633707B (en) 2023-12-01 2023-12-01 Fine-grained multi-modal Chinese large language model construction method and computer storage medium

Country Status (1)

Country Link
CN (1) CN117633707B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118039057A (en) * 2024-04-11 2024-05-14 湖南超能机器人技术有限公司 Household health service robot based on multi-mode large model and intelligent interaction method
CN118364433A (en) * 2024-06-20 2024-07-19 清华大学 Multi-mode image-text interleaving generation model based on dynamic characteristic synchronizer

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116259075A (en) * 2023-01-16 2023-06-13 安徽大学 Pedestrian attribute identification method based on prompt fine tuning pre-training large model
CN116628490A (en) * 2023-04-07 2023-08-22 中国科学院自动化研究所 Graphic-audio multi-mode pre-training model method, device, electronic equipment and medium
CN116821457A (en) * 2023-08-30 2023-09-29 环球数科集团有限公司 Intelligent consultation and public opinion processing system based on multi-mode large model
CN116994171A (en) * 2023-06-01 2023-11-03 无锡动视宫原科技有限公司 Video understanding method and device
CN117079299A (en) * 2023-10-12 2023-11-17 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment and storage medium
CN117094419A (en) * 2023-10-16 2023-11-21 华南理工大学 Multi-modal content output-oriented large language model training method, device and medium
WO2023226309A1 (en) * 2022-05-24 2023-11-30 华为云计算技术有限公司 Model training method and related device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GONGWEI CHEN et al.: "LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge", ARXIV, 26 November 2023 (2023-11-26), pages 3 *
WANG Chong et al.: "A video captioning model based on Vision Transformer and semantic learning", Printing and Digital Media Technology Study, 31 October 2023 (2023-10-31) *

Also Published As

Publication number Publication date
CN117633707B (en) 2024-08-23

Similar Documents

Publication Publication Date Title
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN117633707B (en) Fine-grained multi-mode Chinese large language model construction method and computer storage medium
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN111581966A (en) Context feature fusion aspect level emotion classification method and device
CN110196928B (en) Fully parallelized end-to-end multi-turn dialogue system with domain expansibility and method
CN113408284A (en) Training method and device of text processing model, electronic equipment and storage medium
CN116681810B (en) Virtual object action generation method, device, computer equipment and storage medium
CN111026852B (en) Financial event-oriented hybrid causal relationship discovery method
CN118014086B (en) Data processing method, device, equipment, storage medium and product
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
Li et al. [Retracted] Multimedia Data Processing Technology and Application Based on Deep Learning
CN112800339B (en) Information stream searching method, device and equipment
Madan et al. Foundation Models for Video Understanding: A Survey
CN117934803A (en) Visual positioning method based on multi-modal feature alignment
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN116975347A (en) Image generation model training method and related device
Liu [Retracted] Research on Virtual Interactive Animation Design System Based on Deep Learning
CN113409769B (en) Data identification method, device, equipment and medium based on neural network model
Zhang et al. [Retracted] Cloud Application in the Construction of English Virtual Teaching Resources Based on Digital Three‐Dimensional Technology
CN114333069A (en) Object posture processing method, device, equipment and storage medium
CN118469025B (en) Modality expansion method of pre-training multi-modality language reasoning model based on continuous learning and transfer learning
CN117076090B (en) Task model construction method, device, equipment and computer readable storage medium
CN115587160B (en) Phrase-level text image generation method and system based on self-attention mechanism
CN114580443B (en) Text translation method, text translation device, kernel function combination method, server and medium
CN116127051B (en) Dialogue generation method based on deep learning, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant