CN117633707A - Fine-grained multimodal Chinese large language model construction method and computer storage medium - Google Patents
Fine-grained multimodal Chinese large language model construction method and computer storage medium
- Publication number: CN117633707A (application CN202311630540.3A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/256: Fusion techniques of classification results relating to different input data, e.g. multimodal recognition
- G06F18/213: Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/0455: Auto-encoder networks; encoder-decoder networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/08: Learning methods
- G06N5/04: Inference or reasoning models
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a fine-grained multimodal Chinese large language model construction method and a computer storage medium, belonging to the field of computers. The method comprises the following steps: planning the architecture of a fine-grained multimodal Chinese large language model, wherein the model comprises a multimodal information extraction and fusion module, a core large language model, and a multimodal content generation module. Because the model is organized around these three modules, the Chinese large language model serves as the central understanding and generation hub of the model system and can execute a series of multimodal content understanding and content generation tasks according to user instructions. Compared with current multimodal large-model technology, the method offers advantages such as fewer hallucination problems, more extensible functions, lower training cost, and deeper understanding of complex multimodal scenes.
Description
Technical Field
The application relates to the field of computers, and in particular to a fine-grained multimodal Chinese large language model construction method and a computer storage medium therefor.
Background
With the explosive growth of digital media, a large amount of multimodal data is generated in daily life. Multimodal content is emerging in social media, online video, virtual reality, and augmented reality applications, and carries varied information such as text, images, sound, and video. Developing techniques that can understand and analyze this multimodal data has therefore become critical. Large language models for Chinese have become a research hotspot, but owing to limited data resources and similar problems, the prior art still has certain deficiencies in multimodal integration and model generalization, and the hard-to-avoid hallucination problem of large models restricts their application. Large-model hallucination arises because the extracted multimodal information is not fine-grained enough and the modalities are not well aligned.
Disclosure of Invention
The fine-grained multimodal Chinese large language model construction method and computer storage medium provided herein address the prior-art problems that large language models still have certain deficiencies in multimodal integration and model generalization, and that the hard-to-avoid hallucination problem restricts model application. The method improves the capability of a large language model in multimodal integration and model generalization, reduces hallucination, and improves applicability. The application provides a fine-grained multimodal Chinese large language model construction method comprising the following steps: planning the architecture of a fine-grained multimodal Chinese large language model, wherein the model comprises a multimodal information extraction and fusion module, a core large language model, and a multimodal content generation module; constructing the multimodal information extraction and fusion module; constructing the core large language model; constructing the multimodal content generation module; and training and tuning.
In another aspect, the present application also provides a computer storage medium storing a program for executing the above fine-grained multimodal Chinese large language model construction method.
The technical scheme provided by the application has at least the following technical effects or advantages:
Because the fine-grained multimodal Chinese large language model comprises a multimodal information extraction and fusion module, a core large language model, and a multimodal content generation module, the Chinese large language model serves as the central understanding and generation hub of the model system and can execute a series of multimodal content understanding and content generation tasks according to user instructions. Compared with current multimodal large-model technology, the method offers advantages such as fewer hallucination problems, more extensible functions, lower training cost, and deeper understanding of complex multimodal scenes.
Drawings
FIG. 1 is a flowchart of the fine-grained multimodal Chinese large language model construction method according to an embodiment of the application;
FIG. 2 is a flowchart of constructing the multimodal information extraction and fusion module in the method according to an embodiment of the application;
FIG. 3 is a flowchart of constructing the core large language model in the method according to an embodiment of the application;
FIG. 4 is a flowchart of constructing the multimodal content generation module in the method according to an embodiment of the application.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
To solve the technical problems described in the background, the embodiments of the application make three innovations over the traditional multimodal large-model architecture: 1. for feature extraction, a multimodal information extraction and fusion method is proposed that extracts fine-grained multimodal high-level semantic features by means of the SIA-CNN model, while a multimodal high-level semantic extractor, the MMSSE model, extracts and aligns multimodal high-level semantics; 2. for the text large model, a sound-shape-meaning integrated large language model training method is adopted, so that the text large model is better able to reason over aligned multimodal information; 3. for multimodal output, a decoupled multimodal content generation method is proposed, so that the model can accurately convert the reasoned high-level multimodal semantics into the specified multimodal output.
Referring to FIGS. 1-4, an embodiment of the present application provides a method for constructing a fine-grained multimodal Chinese large language model, comprising: S1, planning the architecture of a fine-grained multimodal Chinese large language model, wherein the model comprises a multimodal information extraction and fusion module, a core large language model, and a multimodal content generation module; S2, constructing the multimodal information extraction and fusion module; S3, constructing the core large language model; S4, constructing the multimodal content generation module; S5, training and tuning.
the method in an exemplary embodiment of the present application is composed of three major parts, namely a multi-modal information extraction and fusion method, a phonetic sense integrated large language model training method and a decoupled multi-modal content generation method:
aiming at the defect that the large voice model has certain defects in multi-modal integration and model generalization, unavoidable pseudoscopic problems restrict the applicable problem of the model, in an exemplary embodiment, S1, the architecture of a fine-grained multi-modal Chinese large language model is planned, and the fine-grained multi-modal Chinese large language model comprises a multi-modal information extraction and fusion module, a core large language model and a multi-modal content generation module;
s2: the multi-mode information extraction and fusion module is composed of a fine-grain visual feature extractor and a multi-mode high-level semantic extractor. The fine-grained visual feature extractor consists of Vision Transformer (ViT) and an autonomously developed spatial information adapter SIA-CNN based on a convolutional neural network; and ViT, fully interacting and fusing the global visual features extracted by ViT with the local visual features extracted by SIA-CNN to obtain fine-grained visual representation of the visual input signals. The MMSSE model of the multi-mode high-level semantic extractor extracts and aligns semantic information on the fine-grain visual representation according to a user instruction to obtain fine-grain multi-mode high-level semantic features, the multi-mode information can be represented on a plurality of fine-grain layers, and the understanding capability of the multi-mode large model on input is improved.
In an exemplary embodiment, the multimodal information extraction and fusion method of step S2 includes: S21: first construct a fine-grained visual feature extractor, using a Vision Transformer (ViT) as the visual modality extraction base.
S22: spatial information adapters (SIA-CNN) are added to the ViT. Each SIA-CNN layer consists of a three-stage residual convolutional neural network, and one SIA-CNN layer is inserted into each Transformer layer of the ViT. Within the adapter, features are projected down to a smaller dimension by a multi-layer convolutional neural network, passed through one layer of nonlinear activation, and then projected back up to the original dimension by a multi-layer transposed convolutional neural network. In addition, a residual connection runs between the input and the output of the whole SIA-CNN to strengthen the robustness of the model structure. The newly added parameters account for about 17% of the original model, allowing the ViT to fully extract depth and spatial information from the picture and enhancing the ability of the ViT output vectors to represent fine-grained information.
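The bottleneck-with-residual shape of the SIA-CNN adapter described in S22 can be sketched as follows. This is a minimal illustration, not the patented implementation: dense projections stand in for the multi-layer convolutional and transposed-convolutional stacks, and ReLU is assumed as the nonlinear activation.

```python
import numpy as np

def sia_adapter(x, w_down, w_up):
    """Bottleneck adapter sketch (per S22): project down to a smaller
    dimension, apply one nonlinear activation, project back up to the
    original dimension, and add a residual connection input -> output."""
    h = x @ w_down              # stand-in for the down-projecting conv stack
    h = np.maximum(h, 0.0)      # one layer of nonlinear activation (ReLU assumed)
    h = h @ w_up                # stand-in for the up-projecting transposed-conv stack
    return x + h                # residual connection for structural robustness

rng = np.random.default_rng(0)
d, r = 64, 16                   # hypothetical ViT width and bottleneck size
x = rng.standard_normal((4, d))
w_down = rng.standard_normal((d, r)) * 0.02
w_up = rng.standard_normal((r, d)) * 0.02
y = sia_adapter(x, w_down, w_up)
assert y.shape == x.shape       # the adapter preserves the token dimensions
```

With near-zero projection weights the adapter starts close to an identity map, which is one common reason adapters of this shape train stably when inserted into a frozen backbone.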
S23: the multimodal high-level semantic extractor, the MMSSE model, is a two-tower model formed by two 7-layer Transformers: the input of one tower is the instruction information input by the user, and the input of the other tower is a learned random instruction vector together with the multimodal information input. Through joint training on an image-text contrastive learning task, an image-text generation task, an image-text matching task, and a hard negative mining strategy, the multimodal high-level semantic extractor can extract semantic information from the fine-grained visual representation according to user instructions, obtaining fine-grained multimodal high-level semantic features guided by the user instructions.
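One of the joint-training objectives named above, the image-text contrastive learning task, is commonly implemented as an InfoNCE-style loss. The sketch below shows that form under the assumption (not stated in the source) that the extractor is trained like a standard two-tower contrastive model.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """InfoNCE-style image-text contrastive loss sketch: the i-th image and
    the i-th text are a matched pair (diagonal targets); matched pairs are
    pulled together and mismatched pairs pushed apart."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.diag(probs)))      # cross-entropy on the diagonal

emb = np.eye(4)  # toy embeddings: perfectly aligned image/text pairs
# aligned pairs yield a lower loss than deliberately shuffled pairs:
assert info_nce(emb, emb) < info_nce(emb, np.roll(emb, 1, axis=0))
```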
S24: the multimodal high-level semantic features from the multimodal high-level semantic extractor undergo a linear transformation through a linear layer and are mapped into the semantic space of the text, so that the large model can understand multimodal information and perform reasoning.
S3: the large language model provides the reasoning and understanding capability and is the core of the multimodal large-model architecture, with 11B parameters overall; it adopts a unified representation of sound, shape, and meaning as input to enhance its understanding of Chinese and multimodal information. The ultra-large-scale dataset used by the large language model underwent seven rounds of multi-dimensional, multi-level deep cleaning and totals 1500B characters.
In an exemplary embodiment, the sound-shape-meaning integrated large language model training method of step S3 includes: S31: construct a pre-training base dataset by collecting and organizing 40 GB of Xiaohongshu data, 120 GB of Toutiao news data, 50 GB of WeChat Official Account data, 110 GB of Zhihu data, 50 GB of encyclopedia data, 70 GB of web page data, 420 GB of MNBVC Chinese data, 200 GB of code data, and 900 GB of English data; build an automatic cleaning script and clean the data according to specified rules.
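The patent does not spell out the cleaning rules, so the sketch below only illustrates what such an automatic cleaning script might do: exact deduplication by content hash, a minimum-length rule, and a banned-substring rule (all three rules are assumptions).

```python
import hashlib

def clean_corpus(docs, min_chars=20, banned=("lorem ipsum", "click to subscribe")):
    """Toy cleaning pass over raw text documents: drop short documents,
    drop documents containing banned substrings, drop exact duplicates."""
    seen, kept = set(), []
    for doc in docs:
        text = doc.strip()
        if len(text) < min_chars:
            continue                                   # too short for pre-training
        if any(b in text.lower() for b in banned):
            continue                                   # rule-based filtering
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                                   # exact duplicate
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "A sufficiently long and clean document for pre-training.",
    "A sufficiently long and clean document for pre-training.",  # duplicate
    "too short",
    "Lorem ipsum dolor sit amet, filler text that should be dropped away.",
]
assert clean_corpus(docs) == ["A sufficiently long and clean document for pre-training."]
```

Real pipelines typically add near-duplicate detection (e.g. MinHash) and quality scoring on top of rules like these.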
S32: construct the large language model architecture, adopting a decoder-only structure with 28 Transformer layers as the model body and rotary position embedding as the positional encoding scheme, to maintain model performance over long contexts. Suppose the inner product between the query vector and the key vector can be expressed by a function $g'$ of the embeddings and their relative position:

$$\langle f_q(x_m, m), f_k(x_n, n) \rangle = g'(x_m, x_n, m-n)$$

where $f_q(x_m, m)$ is the query vector, $f_k(x_n, n)$ is the key vector, and $g'(\cdot)$ is a function of the word embedding vectors $x_m$, $x_n$ and their relative position $m-n$. An equivalent positional encoding satisfying this condition can be constructed so that the identity above still holds:

$$f_q(x_m, m) = (W_q x_m) e^{im\theta}$$

$$f_k(x_n, n) = (W_k x_n) e^{in\theta}$$

$$g'(x_m, x_n, m-n) = \mathrm{Re}\left[(W_q x_m)(W_k x_n)^{*} e^{i(m-n)\theta}\right]$$

By the geometric meaning of complex multiplication, this transformation corresponds to rotating a vector, hence the name rotary position embedding. Meanwhile, at the embedding stage, the input to the large language model uses a unified three-in-one vector representation of sound, shape, and meaning; the three kinds of features are deeply fused and then trained jointly, so that the large language model can understand multimodal information more deeply.
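The relative-position property that motivates rotary position embedding can be checked numerically. The sketch below rotates toy 2-dimensional query/key vectors (the real model applies this pairwise across the head dimension, which is an implementation detail not covered here) and verifies that their inner product depends only on the offset m - n.

```python
import numpy as np

def rope(vec, pos, theta=0.1):
    """Rotate a 2-D vector by the angle pos * theta: the real-valued view
    of multiplying by e^(i * pos * theta) in the rotary construction."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]])

q = np.array([1.0, 2.0])    # toy query, standing in for W_q x_m
k = np.array([0.5, -1.0])   # toy key, standing in for W_k x_n

# inner products agree whenever the relative offset m - n is the same:
a = rope(q, 7) @ rope(k, 3)       # m=7,   n=3   -> offset 4
b = rope(q, 104) @ rope(k, 100)   # m=104, n=100 -> offset 4
assert np.isclose(a, b)
```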
S33: multi-card cluster integrated training, using a deep-learning training framework to accelerate model training. To cope with graphics card failures during long-running training, a hierarchical communication-node management scheme partitions the compute scheduling: some server nodes are designated as communication nodes, responsible for unifying and collecting the logs and metric values of their subordinate compute nodes during computation. When part of the compute capacity subordinate to a communication node fails, the communication node attempts to wake the faulty graphics card; if repeated attempts fail, it directly saves the current results and stops training on that node, so that the compute subordinate to other communication nodes is unaffected.
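The communication node's fault-handling policy described in S33 can be sketched as a small control loop. All names and data structures here are assumptions for illustration; the patent does not specify an API.

```python
def supervise_node(gpus, try_wake, save_checkpoint, max_retries=3):
    """Fault handling at one communication node (per S33): for each faulty
    subordinate GPU, attempt a wake-up at most max_retries times; on
    repeated failure, save the current results and stop training on this
    node only, leaving other communication nodes unaffected."""
    for gpu in gpus:
        if gpu["healthy"]:
            continue
        for _ in range(max_retries):
            if try_wake(gpu):
                gpu["healthy"] = True   # wake-up succeeded, keep training
                break
        else:
            save_checkpoint()           # persist results before halting
            return "stopped"            # halt this node's training only
    return "running"

# a node whose faulty GPU wakes up keeps running:
assert supervise_node([{"healthy": False}], lambda g: True, lambda: None) == "running"
# a node whose GPU never wakes saves a checkpoint and stops:
saved = []
assert supervise_node([{"healthy": False}], lambda g: False, lambda: saved.append(1)) == "stopped"
assert saved == [1]
```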
S4: the multimodal content generation module consists of several downstream sub-models, including an image generation model based on the Stable Diffusion model architecture and an audio generation model based on the EnCodec model architecture. Through joint training, the downstream models can directly use the hidden vectors of the large language model to generate the specified content, enhancing the extensibility of the multimodal large-model architecture to downstream tasks.
In an exemplary embodiment, the decoupled multimodal content generation method of step S4 includes:
S41: jointly train a Stable Diffusion model with the large language model, freezing the large language model's parameters during training. An alignment model with a 5-layer Transformer structure maps the hidden vectors of the large language model into the condition vector space of the Stable Diffusion model. The Stable Diffusion model and the large language model are thereby decoupled from each other, so the large language model can regulate the Stable Diffusion model's generation results at any time.
S42: jointly train an EnCodec audio generation model with the large language model, freezing the large language model's parameters during training. A 3-layer perceptron nonlinearly transforms the hidden vectors of the large language model, and the output is superimposed, by weight, onto the latent-space features of the EnCodec audio generation model, realizing the conversion from text output to audio output.
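Step S42's mapping, a small perceptron over frozen LLM hidden vectors whose output is superimposed by weight onto the audio model's latent features, can be sketched as below. All dimensions, weights, and the tanh nonlinearity are assumptions for illustration.

```python
import numpy as np

def overlay_audio_latent(llm_hidden, audio_latent, weight, w1, w2, w3):
    """Per S42 (shapes and nonlinearity assumed): a 3-layer perceptron
    nonlinearly transforms the frozen LLM's hidden vector, and the result
    is superimposed, scaled by `weight`, onto the audio model's latent
    features, steering audio generation from text-side semantics."""
    h = np.tanh(llm_hidden @ w1)        # perceptron layer 1
    h = np.tanh(h @ w2)                 # perceptron layer 2
    h = h @ w3                          # layer 3: project to the latent width
    return audio_latent + weight * h    # weighted superposition

rng = np.random.default_rng(1)
d_llm, d_latent = 32, 8                 # hypothetical widths
llm_hidden = rng.standard_normal(d_llm)
latent = rng.standard_normal(d_latent)
w1 = rng.standard_normal((d_llm, 16)) * 0.1
w2 = rng.standard_normal((16, 16)) * 0.1
w3 = rng.standard_normal((16, d_latent)) * 0.1
out = overlay_audio_latent(llm_hidden, latent, 0.5, w1, w2, w3)
assert out.shape == latent.shape
```

Setting `weight` to zero recovers the unmodified audio latent, which is consistent with the decoupling the section describes: the frozen generator still works on its own.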
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
The foregoing is merely a preferred embodiment of the present application, and the scope of the present application is not limited thereto. Any modification or equivalent substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the protection scope of the present application.
Claims (10)
1. A fine-grained multimodal Chinese large language model construction method, characterized by comprising the following steps:
S1, planning the architecture of a fine-grained multimodal Chinese large language model, wherein the fine-grained multimodal Chinese large language model comprises a multimodal information extraction and fusion module, a core large language model, and a multimodal content generation module;
S2, constructing the multimodal information extraction and fusion module;
S3, constructing the core large language model;
S4, constructing the multimodal content generation module;
S5, training and tuning.
2. The fine-grained multimodal Chinese large language model construction method according to claim 1, wherein S2 comprises the following steps:
S21: constructing a fine-grained visual feature extractor, using a Vision Transformer as the visual modality extraction base;
S22: adding a spatial information adapter based on a convolutional neural network into the Vision Transformer, with one spatial information adapter layer inserted into each Transformer layer of the Vision Transformer;
S23: the multimodal high-level semantic extractor is a two-tower model formed by two 7-layer Transformers, wherein the input of one 7-layer Transformer tower is the instruction information input by the user, and the input of the other 7-layer Transformer tower is a learned random instruction vector together with the multimodal information input; through joint training on an image-text contrastive learning task, an image-text generation task, an image-text matching task, and a hard negative mining strategy, the multimodal high-level semantic extractor can extract semantic information from the fine-grained visual representation according to user instructions, obtaining fine-grained multimodal high-level semantic features guided by the user instructions;
S24: the multimodal high-level semantic features from the multimodal high-level semantic extractor undergo a linear transformation through a linear layer and are mapped into the semantic space of the text, so that the large model can understand multimodal information and perform reasoning.
3. The fine-grained multimodal Chinese large language model construction method according to claim 2, wherein in S22 the spatial information adapter projects features down to a smaller dimension through a multi-layer convolutional neural network, then through one layer of nonlinear activation, and then back up to the original dimension through a multi-layer transposed convolutional neural network.
4. The fine-grained multimodal Chinese large language model construction method according to claim 2, wherein in S22 a residual connection is further provided between the input and the output of the entire spatial information adapter to enhance the robustness of the model structure.
5. The fine-grained multimodal Chinese large language model construction method according to claim 1, wherein S3 comprises the following steps:
S31: constructing a pre-training base dataset, collecting and organizing various internet data, building an automatic cleaning script, and cleaning the data according to specified rules;
S32: constructing the core large language model architecture, adopting a decoder-only structure with 28 Transformer layers as the model body;
S33: multi-card cluster integrated training, using a deep-learning training framework to accelerate model training.
6. The fine-grained multimodal Chinese large language model construction method according to claim 5, wherein in S33 a hierarchical communication-node management scheme partitions the compute scheduling: some server nodes are set as communication nodes, responsible for unifying and collecting the logs and metric values of their subordinate compute nodes during computation; when part of the compute capacity subordinate to a communication node fails, the communication node tries to wake the faulty graphics card, and if repeated attempts fail, it directly saves the results and stops training.
7. The fine-grained multimodal Chinese large language model construction method according to claim 5, wherein S32 uses rotary position embedding as the positional encoding scheme.
8. The fine-grained multimodal Chinese large language model construction method according to claim 5, wherein in S32 the input of the core large language model uses a unified three-in-one vector representation of sound, shape, and meaning; the three kinds of features are deeply fused and then trained jointly, so that the large language model can understand multimodal information more deeply.
9. The fine-grained multimodal Chinese large language model construction method according to claim 1, wherein S4 comprises the following steps: S41: jointly training a Stable Diffusion model with the large language model, freezing the large language model's parameters during training, and using an alignment model with a 5-layer Transformer structure to map hidden vectors of the large language model into the condition vector space of the Stable Diffusion model, so that the large language model can regulate the Stable Diffusion model's generation results at any time; S42: jointly training an EnCodec audio generation model with the large language model, freezing the large language model's parameters during training, using a 3-layer perceptron to nonlinearly transform the hidden vectors of the large language model, and superimposing the output, by weight, onto the latent-space features of the EnCodec audio generation model, realizing the conversion from text output to audio output.
10. A computer storage medium storing a program capable of executing the fine-grained multimodal Chinese large language model construction method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311630540.3A CN117633707B (en) | 2023-12-01 | 2023-12-01 | Fine-grained multi-mode Chinese large language model construction method and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117633707A true CN117633707A (en) | 2024-03-01 |
CN117633707B CN117633707B (en) | 2024-08-23 |
Family
ID=90033405
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311630540.3A Active CN117633707B (en) | 2023-12-01 | 2023-12-01 | Fine-grained multi-mode Chinese large language model construction method and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117633707B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118039057A (en) * | 2024-04-11 | 2024-05-14 | Hunan Chaoneng Robot Technology Co., Ltd. | Household health service robot based on multi-mode large model and intelligent interaction method
CN118364433A (en) * | 2024-06-20 | 2024-07-19 | Tsinghua University | Multi-mode image-text interleaving generation model based on dynamic characteristic synchronizer
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116259075A (en) * | 2023-01-16 | 2023-06-13 | Anhui University | Pedestrian attribute identification method based on prompt fine tuning pre-training large model
CN116628490A (en) * | 2023-04-07 | 2023-08-22 | Institute of Automation, Chinese Academy of Sciences | Graphic-audio multi-mode pre-training model method, device, electronic equipment and medium
CN116821457A (en) * | 2023-08-30 | 2023-09-29 | Global Digital Group Co., Ltd. | Intelligent consultation and public opinion processing system based on multi-mode large model
CN116994171A (en) * | 2023-06-01 | 2023-11-03 | Wuxi Dongshi Gongyuan Technology Co., Ltd. | Video understanding method and device
CN117079299A (en) * | 2023-10-12 | 2023-11-17 | Tencent Technology (Shenzhen) Co., Ltd. | Data processing method, device, electronic equipment and storage medium
CN117094419A (en) * | 2023-10-16 | 2023-11-21 | South China University of Technology | Multi-modal content output-oriented large language model training method, device and medium
WO2023226309A1 (en) * | 2022-05-24 | 2023-11-30 | Huawei Cloud Computing Technologies Co., Ltd. | Model training method and related device
Non-Patent Citations (2)
Title |
---|
GONGWEI CHEN et al.: "LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge", ARXIV, 26 November 2023 (2023-11-26), pages 3 *
WANG Chong et al.: "Video Captioning Model Based on Vision Transformer and Semantic Learning", Printing and Digital Media Technology Research, 31 October 2023 (2023-10-31) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113762322B (en) | Video classification method, device and equipment based on multi-modal representation and storage medium | |
CN117633707B (en) | Fine-grained multi-mode Chinese large language model construction method and computer storage medium | |
CN110234018B (en) | Multimedia content description generation method, training method, device, equipment and medium | |
CN111581966A (en) | Context feature fusion aspect level emotion classification method and device | |
CN110196928B (en) | Fully parallelized end-to-end multi-turn dialogue system with domain expansibility and method | |
CN113408284A (en) | Training method and device of text processing model, electronic equipment and storage medium | |
CN116681810B (en) | Virtual object action generation method, device, computer equipment and storage medium | |
CN111026852B (en) | Financial event-oriented hybrid causal relationship discovery method | |
CN118014086B (en) | Data processing method, device, equipment, storage medium and product | |
CN116258147A (en) | Multimode comment emotion analysis method and system based on heterogram convolution | |
Li et al. | [Retracted] Multimedia Data Processing Technology and Application Based on Deep Learning | |
CN112800339B (en) | Information stream searching method, device and equipment | |
Madan et al. | Foundation Models for Video Understanding: A Survey | |
CN117934803A (en) | Visual positioning method based on multi-modal feature alignment | |
CN116913278B (en) | Voice processing method, device, equipment and storage medium | |
CN116975347A (en) | Image generation model training method and related device | |
Liu | [Retracted] Research on Virtual Interactive Animation Design System Based on Deep Learning | |
CN113409769B (en) | Data identification method, device, equipment and medium based on neural network model | |
Zhang et al. | [Retracted] Cloud Application in the Construction of English Virtual Teaching Resources Based on Digital Three‐Dimensional Technology | |
CN114333069A (en) | Object posture processing method, device, equipment and storage medium | |
CN118469025B (en) | Modality expansion method of pre-training multi-modality language reasoning model based on continuous learning and transfer learning | |
CN117076090B (en) | Task model construction method, device, equipment and computer readable storage medium | |
CN115587160B (en) | Phrase-level text image generation method and system based on self-attention mechanism | |
CN114580443B (en) | Text translation method, text translation device, kernel function combination method, server and medium | |
CN116127051B (en) | Dialogue generation method based on deep learning, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||