CN117351331A - Method and device for adding adapter for large visual model - Google Patents

Method and device for adding adapter for large visual model

Info

Publication number
CN117351331A
Authority
CN
China
Prior art keywords
visual
model
adapters
large model
adapter
Prior art date
Legal status
Pending
Application number
CN202311385817.0A
Other languages
Chinese (zh)
Inventor
吕伊凯
周吴夏朗
杜晓祥
Current Assignee
Beijing Yunshang Technology Co., Ltd.
Original Assignee
Beijing Yunshang Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Yunshang Technology Co., Ltd.
Priority to CN202311385817.0A
Publication of CN117351331A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/96 Management of image or video recognition tasks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components, by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters, with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects, using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method and a device for adding adapters to a large visual model, relating to the technical field of deep learning models. The large visual model is extracted on its own from the image-text multimodal large model used in a business scenario, and, while the recognition capability of the original large visual model is preserved, a separate adapter is trained for each business scenario, so that the large visual model gains general recognition capability across a variety of scenarios. Centralized or distributed deployment can be chosen according to the online requirements: in centralized deployment the large visual model and all adapters share one inference graph, which suits a single-node server; in distributed deployment the large visual model and the adapters are placed on different nodes and exchange data through protocol communication, which suits a multi-node cluster. Deployment cost is reduced while recognition speed is maintained.

Description

Method and device for adding adapter for large visual model
Technical Field
The application relates to the technical field of deep learning models, and in particular to a method and a device for adding adapters to a large visual model.
Background
The image-text multimodal large model is a deep learning model that makes combined use of image and text information. By processing image and text data simultaneously, it fuses the information of the two modalities and lets them interact, so as to improve understanding and reasoning on complex tasks. An image-text multimodal large model typically contains two main components, a large visual model and a large text model:
Large visual model (Vision Model): the large visual model processes the image data and extracts and learns feature representations of the images from it. It may be a convolutional neural network (CNN) based model, such as ResNet or Inception, used for feature extraction and representation learning on images.
Large text model (Text Model): the large text model processes the text data and learns feature representations of the text. It may be based on a recurrent neural network (RNN) or on the Transformer, such as LSTM or BERT, used for encoding and feature extraction of text.
As the number of parameters of deep learning models and the scale of the training data keep growing, image-text multimodal large models have gained very strong capabilities for recognizing and generating images and text. Because the large text model supplies rich semantic information related to the pictures, the large visual model inside an image-text multimodal large model recognizes and generates pictures markedly better than an ordinary visual model.
However, the growing number of parameters of the large visual model also means higher deployment cost and difficulty, and in real business scenarios different models often have to be deployed for different business requirements. Deploying several large visual models would greatly increase cost and slow down online model recognition.
Disclosure of Invention
Therefore, the application provides a method and a device for adding adapters to a large visual model, to solve the prior-art problems of poor recognition capability and low recognition speed when a large visual model is used to recognize business scenarios.
In order to achieve the above object, the present application provides the following technical solutions:
In a first aspect, a method for adding adapters to a large visual model comprises:
step 1: extracting the large visual model from an original image-text multimodal large model;
step 2: building a plurality of adapters and training each adapter for a different business scenario;
step 3: converting the large visual model and the plurality of adapters into the same file format;
step 4: merging the large visual model and the plurality of adapters and deploying them to one server; or deploying the large visual model and the adapters on a plurality of servers as required, where the large visual model acts as the server, the adapters act as clients, and data is exchanged between the large visual model and the adapters through protocol communication.
Preferably, the adapter comprises a multi-scale feature extraction module, a feature interaction module and a classifier module, wherein the multi-scale feature extraction module consists of a plurality of convolutional layers, the feature interaction module consists of a cross-attention layer and a convolutional layer, and the classifier module consists of a linear layer.
Preferably, in step 2, the parameters of the large visual model are kept fixed while the plurality of adapters are trained, and the adapter parameters are updated on business-related data.
Preferably, in step 3, the file format is the ONNX exchange format.
Preferably, in step 4, the NVIDIA Triton deep learning inference engine is used to deploy the large visual model and the plurality of adapters.
Preferably, in step 4, the protocol communication uses the HTTP/gRPC protocols.
In a second aspect, an apparatus for adding adapters to a large visual model comprises:
a large visual model extraction module, used to extract the large visual model from the original image-text multimodal large model;
an adapter training module, used to build a plurality of adapters and to train each adapter for a different business scenario;
a format conversion module, used to convert the large visual model and the plurality of adapters into the same file format; and
a deployment module, used to merge the large visual model and the plurality of adapters and deploy them to one server, or to deploy the large visual model and the adapters on a plurality of servers as required, where the large visual model acts as the server, the adapters act as clients, and data is exchanged between the large visual model and the adapters through protocol communication.
In a third aspect, a computer device comprises a memory storing a computer program and a processor that, when executing the computer program, implements the steps of the method for adding adapters to a large visual model.
In a fourth aspect, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method for adding adapters to a large visual model.
Compared with the prior art, the present application has the following beneficial effects:
the application provides a method and a device for adding adapters to a large visual model. The large visual model is extracted on its own from the image-text multimodal large model used in a business scenario, and, while the recognition capability of the original large visual model is preserved, a separate adapter is trained for each business scenario, so that the large visual model gains general recognition capability across a variety of scenarios. Centralized or distributed deployment can be chosen according to the online requirements: in centralized deployment the large visual model and all adapters share one inference graph, which suits a single-node server; in distributed deployment the large visual model and the adapters are placed on different nodes and exchange data through protocol communication, which suits a multi-node cluster. Deployment cost is reduced while recognition speed is maintained.
Drawings
For a more intuitive description of the prior art and of the present application, several exemplary drawings are presented below. It should be understood that the specific shapes and configurations shown in the drawings are not, in general, to be taken as limiting the practice of the present application; for example, based on the technical concepts and exemplary drawings disclosed herein, those skilled in the art can easily make conventional adjustments or further optimizations concerning the addition, removal or division of certain units (components), their specific shapes, positional relationships, connection modes, dimensional proportions, and the like.
FIG. 1 is a flow chart of a method for adding adapters to a large visual model according to embodiment one of the present application;
FIG. 2 is a schematic diagram of a method for adding adapters to a large visual model according to embodiment one of the present application;
FIG. 3 is a schematic diagram of the overall structure of the large visual model plus adapter according to embodiment one of the present application;
FIG. 4 is a schematic structural diagram of the different modules of the adapter according to embodiment one of the present application.
Detailed Description
The present application is further described in detail below with reference to the attached drawings.
In the description of the present application: unless otherwise indicated, "a plurality of" means two or more. Terms such as "first", "second" and "third" are used in this application to distinguish the objects they refer to, without any special technical connotation (for example, they should not be construed as emphasizing a degree or an order of importance). Expressions such as "comprising", "including" and "having" also mean "not limited to" (certain units, components, materials, steps, etc.).
The terms such as "upper", "lower", "left", "right", "middle", and the like, as used in this application, are generally used for the purpose of facilitating an intuitive understanding with reference to the drawings and are not intended to be an absolute limitation of the positional relationship in actual products.
Example 1
Referring to fig. 1, a method for adding adapters to a large visual model includes:
s1: extracting a visual large model from an original graph-text multi-mode large model;
referring to fig. 2, the graph-text multi-modal large model is generally composed of a visual large model and a language large model, and the visual large model exhibits stronger recognition and generation capabilities than a general visual model because the language large model provides rich text information to the visual large model.
The original image-text multimodal large model is stored in PyTorch format, and the large visual model is extracted from it on its own, using an appropriate tool, to serve as the infrastructure model.
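As a purely illustrative sketch of such an extraction (the application does not name the tool; the checkpoint file names and the "visual." key prefix below are assumptions), the vision tower of a PyTorch multimodal checkpoint could be split off by filtering its state dict:

```python
import torch

# Hypothetical checkpoint layout: the keys of the vision tower carry the prefix "visual."
ckpt = torch.load("multimodal_model.pth", map_location="cpu")
visual_state = {k[len("visual."):]: v
                for k, v in ckpt.items()
                if k.startswith("visual.")}
torch.save(visual_state, "visual_backbone.pth")  # standalone large visual model weights
```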
S2: building a plurality of adapters and training each adapter for a different business scenario;
In order to preserve the capability of the large visual model, this embodiment keeps the parameters of the large visual model fixed and builds and trains an independent adapter for each business scenario.
The adapters are built with the PyTorch deep learning library. A built adapter is composed of several kinds of neural network layers, such as convolutional layers (Convolutional Layer) and Transformer layers (Transformer Layer).
Specifically, referring to fig. 3 and fig. 4, the built adapter comprises a multi-scale feature extraction module, a feature interaction module and a classifier module. The multi-scale feature extraction module consists of a plurality of convolutional layers; feeding the original picture into this module yields a feature pyramid at three resolutions (1/8, 1/16 and 1/32), and the pyramid levels are flattened and concatenated into a multi-scale feature sequence. The feature interaction module consists of a cross-attention layer and a convolutional layer; the multi-scale feature sequence and the output sequence of the large visual model are fed into this module, so that multi-scale features are extracted from the single-scale output of the large visual model. The classifier module consists of a linear layer; the final extracted sequence is fed into this module to obtain the final classification result.
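A minimal PyTorch sketch of this three-module adapter is given below. All sizes (channel width, number of attention heads, number of classes) are illustrative assumptions, and the backbone token dimension is assumed to equal the adapter width; the application fixes only the module decomposition, not these hyperparameters.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        # Multi-scale feature extraction: stacked stride-2 convolutions yield
        # feature maps at 1/8, 1/16 and 1/32 of the input resolution.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),  # 1/8
        )
        self.down16 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)    # 1/16
        self.down32 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)    # 1/32
        # Feature interaction: cross-attention plus a pointwise convolution.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)
        # Classifier: a single linear layer.
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, image, backbone_tokens):
        f8 = self.stem(image)
        f16 = self.down16(f8)
        f32 = self.down32(f16)
        # Flatten each scale to (B, HW, C) and concatenate into one token sequence.
        tokens = torch.cat(
            [f.flatten(2).transpose(1, 2) for f in (f8, f16, f32)], dim=1)
        # The multi-scale tokens query the single-scale backbone output sequence.
        fused, _ = self.cross_attn(tokens, backbone_tokens, backbone_tokens)
        fused = self.conv(fused.transpose(1, 2)).transpose(1, 2)
        return self.classifier(fused.mean(dim=1))  # final classification result
```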
The training process of an adapter is as follows: sample data for the specific task are collected and augmented with regular transformations such as rotation, geometric transformation and color transformation; they are then fed into the large visual model and the adapter of the corresponding business for inference; the loss between the model output and the ground-truth labels is computed, and the gradients of the model parameters are obtained by back-propagating the loss. The parameters of the large visual model are fixed and excluded from the update; only the adapter parameters are updated on the business-related data. Each business therefore trains its own adapter while all businesses share one large visual model, which improves the generality of the large visual model across business scenarios while preserving its capability. In this embodiment the adapters are trained on an NVIDIA RTX 3090 graphics card.
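The following sketch shows what one such training step could look like, assuming the `Adapter` class above, a backbone that returns a (batch, sequence, channel) token sequence, and an already-built DataLoader `loader` of augmented business samples (all of which are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

for p in backbone.parameters():   # the large visual model stays fixed
    p.requires_grad = False
backbone.eval()

adapter = Adapter(dim=256, num_classes=10)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

for images, labels in loader:                 # augmented task samples
    with torch.no_grad():
        tokens = backbone(images)             # single-scale output sequence
    logits = adapter(images, tokens)
    loss = F.cross_entropy(logits, labels)    # loss against ground-truth labels
    optimizer.zero_grad()
    loss.backward()                           # gradients reach the adapter only
    optimizer.step()
```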
S3: converting the large visual model and the plurality of adapters into the same file format;
specifically, the visual large model and the adapter are respectively converted into ONNX exchange formats.
S4: merging the large visual model and the plurality of adapters and deploying them to one server; or deploying the large visual model and the plurality of adapters on a plurality of servers as required, where the large visual model acts as the server, the plurality of adapters act as clients, and data is exchanged between the large visual model and the adapters through protocol communication.
Specifically, the online deployment of this embodiment can be either centralized or distributed.
In centralized deployment, the large visual model and an adapter are merged into one ONNX file with an appropriate tool, and the merged ONNX file is deployed on a single-node server with the NVIDIA Triton deep learning inference engine.
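The application does not name the merging tool; one way such a merge might be realized, sketched here under the naming assumptions of the export example above, is with onnx.compose from the ONNX library:

```python
import onnx
from onnx import compose

backbone = onnx.load("visual_backbone.onnx")
adapter = onnx.load("adapter_scene_a.onnx")
# Prefix the adapter graph so its internal names cannot collide with the backbone's.
adapter = compose.add_prefix(adapter, prefix="adapter_")
# Wire the backbone's "tokens" output into the adapter's (now prefixed) "tokens"
# input; the adapter's own image input stays a free input of the merged graph.
merged = compose.merge_models(backbone, adapter,
                              io_map=[("tokens", "adapter_tokens")])
onnx.save(merged, "visual_backbone_plus_adapter.onnx")
```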
Distributed deployment uses the NVIDIA Triton inference framework, which suits NVIDIA GPU platforms, is easy to deploy, offers high concurrency and low latency, and integrates a variety of inference performance optimizations. Because the large visual model occupies a large amount of resources and has high throughput requirements, in distributed deployment it is deployed on its own as infrastructure, while each adapter is deployed on the server of its corresponding business; an adapter occupies few resources, infers efficiently, and does not take too many resources away from the other services on the business server. In this embodiment the large visual model and each adapter are compiled from the ONNX exchange format into the TensorRT format supported by NVIDIA Triton; the large visual model acts as the server and each adapter acts as a client. An adapter requests the inference result of the large visual model over the HTTP/gRPC protocol and, once the result is obtained, feeds it into its own downstream inference computation. The adapters can issue concurrent requests without affecting one another, which improves the picture recognition efficiency of the whole online service.
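As an illustration of the client side of this pattern (the model, tensor and endpoint names are assumptions; the HTTP variant is shown, and tritonclient.grpc offers the equivalent gRPC interface), an adapter could fetch the backbone's inference result from Triton like this:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="backbone-node:8000")

image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in picture
inp = httpclient.InferInput("image", list(image.shape), "FP32")
inp.set_data_from_numpy(image)

result = client.infer(model_name="visual_backbone", inputs=[inp],
                      outputs=[httpclient.InferRequestedOutput("tokens")])
tokens = result.as_numpy("tokens")
# `tokens` is then fed, together with the picture, into the local adapter
# for the downstream inference computation.
```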
According to the method for adding adapters to a large visual model provided by this embodiment, the large visual model is extracted on its own from the image-text multimodal large model used in a business scenario, and, while the recognition capability of the original large visual model is preserved, a separate adapter is trained for each business scenario, so that the large visual model gains general recognition capability across a variety of scenarios. Centralized or distributed deployment can be chosen according to the online requirements: in centralized deployment the large visual model and all adapters share one inference graph, which suits a single-node server; in distributed deployment the large visual model and the adapters are placed on different nodes and exchange data through protocol communication, which suits a multi-node cluster. Deployment cost is reduced while recognition speed is maintained.
Example 2
This embodiment provides an apparatus for adding adapters to a large visual model, comprising:
a large visual model extraction module, used to extract the large visual model from the original image-text multimodal large model;
an adapter training module, used to build a plurality of adapters and to train each adapter for a different business scenario;
a format conversion module, used to convert the large visual model and the plurality of adapters into the same file format; and
a deployment module, used to merge the large visual model and the plurality of adapters and deploy them to one server, or to deploy the large visual model and the adapters on a plurality of servers as required, where the large visual model acts as the server, the adapters act as clients, and data is exchanged between the large visual model and the adapters through protocol communication.
For specific limitations of the apparatus for adding adapters to a large visual model, reference may be made to the above description of the method for adding adapters to a large visual model, which is not repeated here.
Example 3
This embodiment provides a computer device comprising a memory storing a computer program and a processor that, when executing the computer program, implements the steps of the above method for adding adapters to a large visual model.
Example 4
This embodiment provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above method for adding adapters to a large visual model are implemented.
The technical features of the above embodiments may be combined arbitrarily, as long as the combination contains no contradiction. For brevity of description, not all possible combinations of the technical features of the above embodiments have been described; nevertheless, the combinations that are not explicitly written out should also be considered within the scope of this description.

Claims (9)

1. A method for adding adapters to a large visual model, comprising:
step 1: extracting the large visual model from an original image-text multimodal large model;
step 2: building a plurality of adapters and training each adapter for a different business scenario;
step 3: converting the large visual model and the plurality of adapters into the same file format;
step 4: merging the large visual model and the plurality of adapters and deploying them to one server; or deploying the large visual model and the adapters on a plurality of servers as required, wherein the large visual model acts as a server, the adapters act as clients, and data is exchanged between the large visual model and the adapters through protocol communication.
2. The method for adding adapters to a large visual model according to claim 1, wherein each adapter comprises a multi-scale feature extraction module, a feature interaction module and a classifier module, wherein the multi-scale feature extraction module consists of a plurality of convolutional layers, the feature interaction module consists of a cross-attention layer and a convolutional layer, and the classifier module consists of a linear layer.
3. The method for adding adapters to a large visual model according to claim 1, wherein in step 2 the parameters of the large visual model are kept fixed while the plurality of adapters are trained, and the adapter parameters are updated on business-related data.
4. The method for adding adapters to a large visual model according to claim 1, wherein in step 3 the file format is the ONNX exchange format.
5. The method for adding adapters to a large visual model according to claim 1, wherein in step 4 the NVIDIA Triton deep learning inference engine is used to deploy the large visual model and the plurality of adapters.
6. The method for adding adapters to a large visual model according to claim 1, wherein in step 4 the protocol communication uses the HTTP/gRPC protocols.
7. An apparatus for adding adapters to a large visual model, comprising:
a large visual model extraction module, used to extract the large visual model from an original image-text multimodal large model;
an adapter training module, used to build a plurality of adapters and to train each adapter for a different business scenario;
a format conversion module, used to convert the large visual model and the plurality of adapters into the same file format; and
a deployment module, used to merge the large visual model and the plurality of adapters and deploy them to one server, or to deploy the large visual model and the adapters on a plurality of servers as required, wherein the large visual model acts as a server, the adapters act as clients, and data is exchanged between the large visual model and the adapters through protocol communication.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202311385817.0A 2023-10-24 2023-10-24 Method and device for adding adapter for large visual model Pending CN117351331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311385817.0A CN117351331A (en) 2023-10-24 2023-10-24 Method and device for adding adapter for large visual model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311385817.0A CN117351331A (en) 2023-10-24 2023-10-24 Method and device for adding adapter for large visual model

Publications (1)

Publication Number Publication Date
CN117351331A 2024-01-05

Family

ID=89364661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311385817.0A Pending CN117351331A (en) 2023-10-24 2023-10-24 Method and device for adding adapter for large visual model

Country Status (1)

Country Link
CN (1) CN117351331A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543290A (en) * 2018-09-04 2019-12-06 谷歌有限责任公司 Multimodal response
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
US20200311798A1 (en) * 2019-03-25 2020-10-01 Board Of Trustees Of The University Of Illinois Search engine use of neural network regressor for multi-modal item recommendations based on visual semantic embeddings
US20210209513A1 (en) * 2020-01-02 2021-07-08 Intuit Inc. Method for serving parameter efficient nlp models through adaptive architectures
CN114270434A (en) * 2019-12-04 2022-04-01 谷歌有限责任公司 Two-pass end-to-end speech recognition
KR20220133141A (en) * 2022-03-10 2022-10-04 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Text extraction method, text extraction model training method, apparatus and device
US20220366318A1 (en) * 2021-05-17 2022-11-17 Google Llc Machine Learning Hyperparameter Tuning
WO2022258666A1 (en) * 2021-06-08 2022-12-15 Deepmind Technologies Limited Multimodal few-shot learning with frozen language models
CN115565038A (en) * 2022-09-19 2023-01-03 广州市网星信息技术有限公司 Content audit, content audit model training method and related device
US20230214605A1 (en) * 2021-12-30 2023-07-06 Naver Corporation Multilingual unsupervised neural machine translation with denoising adapters
US20230325725A1 (en) * 2022-04-12 2023-10-12 Google Llc Parameter Efficient Prompt Tuning for Efficient Models at Scale

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN110543290A (en) * 2018-09-04 2019-12-06 谷歌有限责任公司 Multimodal response
US20200311798A1 (en) * 2019-03-25 2020-10-01 Board Of Trustees Of The University Of Illinois Search engine use of neural network regressor for multi-modal item recommendations based on visual semantic embeddings
CN114270434A (en) * 2019-12-04 2022-04-01 谷歌有限责任公司 Two-pass end-to-end speech recognition
US20210209513A1 (en) * 2020-01-02 2021-07-08 Intuit Inc. Method for serving parameter efficient nlp models through adaptive architectures
US20220366318A1 (en) * 2021-05-17 2022-11-17 Google Llc Machine Learning Hyperparameter Tuning
WO2022258666A1 (en) * 2021-06-08 2022-12-15 Deepmind Technologies Limited Multimodal few-shot learning with frozen language models
KR20230152741A (en) * 2021-06-08 2023-11-03 딥마인드 테크놀로지스 리미티드 Multi-modal few-shot learning using fixed language models
US20230214605A1 (en) * 2021-12-30 2023-07-06 Naver Corporation Multilingual unsupervised neural machine translation with denoising adapters
KR20220133141A (en) * 2022-03-10 2022-10-04 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Text extraction method, text extraction model training method, apparatus and device
US20230325725A1 (en) * 2022-04-12 2023-10-12 Google Llc Parameter Efficient Prompt Tuning for Efficient Models at Scale
CN115565038A (en) * 2022-09-19 2023-01-03 广州市网星信息技术有限公司 Content audit, content audit model training method and related device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHONG MOU et al.: "T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models", Computer Vision and Pattern Recognition, 28 February 2023 (2023-02-28), pages 3-7 *
RENRUI ZHANG et al.: "LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention", Computer Vision and Pattern Recognition, 14 June 2023 (2023-06-14), pages 1-5 *
LIU Xiaming (刘夏鸣): "An Analysis of a Visual Multi-Task Model Based on Transfer Learning" (一种基于迁移学习的视觉多任务模型探析), Science and Technology Innovation (科学技术创新), 31 December 2022 (2022-12-31) *

Similar Documents

Publication Publication Date Title
JP2022056316A (en) Character structuring extraction method and device, electronic apparatus, storage medium, and computer program
US20240273932A1 (en) Method for recognizing text, and apparatus
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
JP7412847B2 (en) Image processing method, image processing device, server, and computer program
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN110852256A (en) Method, device and equipment for generating time sequence action nomination and storage medium
CN110852295B (en) Video behavior recognition method based on multitasking supervised learning
CN113658189B (en) Cross-scale feature fusion real-time semantic segmentation method and system
CN109523558A (en) A kind of portrait dividing method and system
Lobry et al. Visual question answering on remote sensing images
CN118155231B (en) Document identification method, device, equipment, medium and product
CN108959664A (en) Distributed file system based on picture processor
CN114863407A (en) Multi-task cold start target detection method based on visual language depth fusion
CN116309913A (en) Method for generating image based on ASG-GAN text description of generation countermeasure network
CN117216536A (en) Model training method, device and equipment and storage medium
CN115131801A (en) Multi-modal-based document recognition method, device, equipment and storage medium
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
CN113536798A (en) Multi-instance document key information extraction method and system
CN117934803A (en) Visual positioning method based on multi-modal feature alignment
CN115937615B (en) Topic label classification method and device based on multi-mode pre-training model
CN117671460A (en) Cross-modal image-text emotion analysis method based on hybrid fusion
CN113516972A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN117351331A (en) Method and device for adding adapter for large visual model
CN114937277B (en) Image-based text acquisition method and device, electronic equipment and storage medium
CN116311455A (en) Expression recognition method based on improved Mobile-former

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination