WO2023136417A1 - Method and device for building a transformer model for video story question answering - Google Patents
Method and device for building a transformer model for video story question answering
- Publication number
- WO2023136417A1 (PCT/KR2022/012050)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- story
- question
- transformer model
- answer
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/732—Query formulation
- G06F16/7328—Query by example, e.g. a complete video frame or video sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2477—Temporal data queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/732—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
Definitions
- Embodiments disclosed herein relate to an apparatus and method for building a transformer model for video story question answering, and more particularly, to an apparatus and method for building a transformer model for video story question answering that learns a video story by considering the context of the video clips included in the video data.
- Video question answering measures video comprehension ability by the accuracy achieved on multiple-choice questions posed in natural language.
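As a purely illustrative sketch (not part of the disclosed embodiments), multiple-choice accuracy of the kind referred to above can be computed as follows; the data layout, i.e., parallel lists of predicted and correct answer indices, is an assumption of the example:

```python
from typing import List

def multiple_choice_accuracy(predicted: List[int], correct: List[int]) -> float:
    """Fraction of questions whose predicted answer index matches the answer key."""
    assert len(predicted) == len(correct) and len(correct) > 0
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)

# e.g., 5 questions, each with one of 5 candidate answers (indices 0-4)
print(multiple_choice_accuracy([2, 0, 4, 1, 3], [2, 0, 3, 1, 3]))  # 0.8
```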
- (Patent Document 1) Korean Patent Publication No. 10-2020-0144417 (published on December 29, 2020)
- Embodiments disclosed in this specification are intended to provide an apparatus and method for building a transformer model for video story question answering that learns a video story in consideration of the context of the video clips included in the video data.
- According to one embodiment, an apparatus for building a transformer model for video story question answering includes: an input/output unit that receives video data including a plurality of consecutive video clips and question data for video question answering, and outputs a video story question answering result; a storage unit that stores programs and data for performing video story question answering; and a control unit that includes at least one processor and builds a transformer model for video story question answering by executing the program, wherein the control unit learns a video story from the video data including the plurality of consecutive video clips in consideration of the contexts of the preceding and following video clips that are adjacent in temporal order.
- According to another embodiment, a method of building a transformer model for video story question answering, performed by an apparatus for building a transformer model for video story question answering, includes: receiving video data including a plurality of consecutive video clips and question data for video question answering; and learning a video story from the video data including the plurality of consecutive video clips in consideration of the contexts of the preceding and following video clips that are adjacent in temporal order.
- According to another embodiment, there is provided a computer-readable recording medium on which a program for performing the method of building a transformer model for video story question answering is recorded, the method including: receiving video data including a plurality of consecutive video clips and question data for video question answering; and learning a video story from the video data including the plurality of consecutive video clips in consideration of the contexts of the preceding and following video clips that are adjacent in temporal order.
- According to another embodiment, there is provided a computer program that is executed by an apparatus for building a transformer model for video story question answering and stored in a recording medium to perform the method of building a transformer model for video story question answering, the method including: receiving video data including a plurality of consecutive video clips and question data for video question answering; and learning a video story from the video data in consideration of the contexts of the preceding and following video clips that are adjacent in temporal order.
- According to any one of the above-described solutions, by building a transformer that considers the context of the video clips included in the video data when performing video story question answering, a long video can be processed effectively without incurring a large computational cost.
- FIG. 1 is a diagram for explaining a transformer model according to the prior art.
- FIG. 2 is a diagram for explaining a transformer model according to an exemplary embodiment.
- FIG. 3 is a functional block diagram of a device for building a transformer model for video story question and answer according to an embodiment.
- FIG. 4 is a flowchart illustrating a method of constructing a transformer model for video story question and answer according to an exemplary embodiment.
- FIG. 1 is a diagram for explaining a transformer model according to the prior art.
- FIG. 1 shows a transformer model according to the prior art, namely the structure of a transformer model used for video representation learning.
- the transformer model may be a vanilla transformer.
- the transformer shown in FIG. 1 may be configured such that the encoder 100 is separated for each layer over all video frames.
- the encoder 100 separated for each section (S1, S2, S3) in the transformer shown in FIG. 1 may be a temporal transformer.
- Video story question answering can be performed using the transformer model shown in FIG. 1. However, because the transformer shown in FIG. 1 does not consider the context of the video clips included in the input video data, its computational cost increases exponentially as the length of the video increases, so it has been used only for short video story question answering.
- Accordingly, a transformer capable of processing a long video more effectively has been required, and a transformer that considers the context of the video clips included in the video data has been built.
- A transformer that considers the context of the video clips included in video data according to an exemplary embodiment is described in detail below with reference to FIGS. 2 and 3.
- FIG. 2 is a diagram for explaining a transformer model according to an exemplary embodiment.
- the transformer model may be a contextual transformer.
- the transformer shown in FIG. 2 may be configured such that the encoder 200 is separated for each layer over all video frames.
- the encoder 200 separated for each section (S1, S2, S3) learns the video story by considering the context of the video clips included in the input video data, and the number of preceding and following video clips that can be considered may vary as the layers go up.
- a video clip may mean a short recorded video.
- the video data may include a plurality of continuous video clips.
- the above-described video clips may include a plurality of visual tokens and text tokens.
- the encoder 200 separated for each section (S1, S2, S3) may be a cross-modal transformer, and the above-described cross-modal transformer may receive the visual tokens and text tokens corresponding to each section (S1, S2, S3) as inputs.
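For illustration only, the following PyTorch sketch shows one way such a per-section cross-modal encoder could consume the visual tokens and text tokens of a single clip; the class name, the dimensions, and the use of a plain transformer encoder layer are assumptions of this example, not the patented implementation:

```python
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    """Toy cross-modal encoder for one section S_t: it attends jointly over
    the N visual tokens and M text tokens of a single video clip."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, d), text_tokens: (B, M, d)
        tokens = torch.cat([visual_tokens, text_tokens], dim=1)  # (B, N+M, d)
        return self.encoder(tokens)  # hidden representation of shape (B, N+M, d)

# One encoder per section receives the tokens of its own clip.
B, N, M, d = 2, 8, 6, 256
encoder = CrossModalEncoder(d_model=d)
h_t = encoder(torch.randn(B, N, d), torch.randn(B, M, d))
```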
- the above-described transformer of FIG. 2 may be built by an apparatus for building a transformer model for video story question and answer shown in FIG. 3 .
- FIG. 3 is a functional block diagram of a device for building a transformer model for video story question and answer according to an embodiment.
- an apparatus 300 for constructing a transformer model for video story question and answer includes an input/output unit 310, a storage unit 320, and a control unit 330.
- the input/output unit 310 may include an input unit for receiving input from a user and an output unit for displaying information such as a task execution result or the status of the apparatus 300 for building a transformer model for video story question answering. That is, the input/output unit 310 is a component that receives video data including a plurality of consecutive video clips and question data for video question answering, and outputs a video story question answering result.
- the video clip may include a plurality of visual tokens and text tokens.
- the storage unit 320 is a component capable of storing files and programs, and may be configured through various types of memories.
- the storage unit 320 may store data and programs that enable the controller 330 to build a transformer model for video story question and answer according to an algorithm presented below.
- the controller 330 is a component including at least one processor, such as a CPU or GPU, and can control the overall operation of the apparatus 300 for building a transformer model for video story question answering. That is, the controller 330 may control the other elements included in the apparatus 300 so as to perform video story question answering.
- the control unit 330 may perform an operation to build a transformer model for video story question and answer according to an algorithm presented below by executing a program stored in the storage unit 320 . A method for the controller 330 to perform an operation to build a transformer model for answering a video story question will be described later.
- the controller 330 may learn a video story from video data including a plurality of continuous video clips by considering contexts of video clips before and after that are adjacent to each other in temporal order.
- the video clip may include a plurality of visual tokens and text tokens.
- video data input through the input/output unit 310 can be expressed as T consecutive video clips $V = \{v_1, v_2, \ldots, v_T\}$.
- each video clip $v_t$ may include N visual tokens and M text tokens.
- when a transformer having a general structure according to the prior art is used, a hidden representation $h_t \in \mathbb{R}^{(N+M) \times d}$ can be generated for each video clip $v_t$. In this case, $d$ may mean the hidden dimension.
- the hidden representation can be modified and used as shown in Equation 1 below:

  [Equation 1]

  $\tilde{h}_t^{\,n-1} = \left[\, \mathrm{SG}\!\left(h_{t-1}^{\,n-1}\right) \circ h_t^{\,n-1} \,\right]$, $\quad q_t^{\,n},\, k_t^{\,n},\, v_t^{\,n} = h_t^{\,n-1} W_q^{\top},\, \tilde{h}_t^{\,n-1} W_k^{\top},\, \tilde{h}_t^{\,n-1} W_v^{\top}$, $\quad h_t^{\,n} = \mathrm{TransformerLayer}\!\left(q_t^{\,n}, k_t^{\,n}, v_t^{\,n}\right)$

- here, $q_t^n$, $k_t^n$, and $v_t^n$ correspond to the query, key, and value of the transformer structure, respectively, and $m$ may mean the memory length, i.e., how many hidden states of the preceding clips are retained in the extended context. Meanwhile, $\tilde{h}_t^{n-1}$ is the extended context, and feeding it only to the keys and values can be a difference from transformers according to the prior art. Also, $W_q$, $W_k$, and $W_v$ are the linear projection parameters to be learned, and $\mathrm{SG}(\cdot)$ may mean stop-gradient. On the other hand, if the recurrent transformer according to Equation 1 described above is modified and expressed in consideration of the context of both the preceding and following clips, it can be expressed as Equation 2 below:

  [Equation 2]

  $\tilde{h}_t^{\,n-1} = \left[\, \mathrm{SG}\!\left(h_{t-1}^{\,n-1}\right) \circ h_t^{\,n-1} \circ \mathrm{SG}\!\left(h_{t+1}^{\,n-1}\right) \,\right]$
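Purely as an illustrative sketch of Equation 2 (the module names and sizes are assumptions, a single attention layer stands in for a full transformer layer, and `detach()` plays the role of the stop-gradient SG), the contextual extension could look as follows:

```python
import torch
import torch.nn as nn

class ContextualLayer(nn.Module):
    """One contextual-transformer layer: queries come from the current clip,
    while keys/values also cover the gradient-stopped hidden states of the
    neighboring clips, mirroring the extended context of Equation 2."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_prev, h_curr, h_next):
        # SG(.) = stop-gradient: detach the neighboring clips' hidden states.
        context = torch.cat([h_prev.detach(), h_curr, h_next.detach()], dim=1)
        out, _ = self.attn(query=h_curr, key=context, value=context)
        return self.norm(h_curr + out)

# Toy usage: three adjacent clips, each with N + M = 14 tokens of width 256.
B, L, d = 2, 14, 256
layer = ContextualLayer(d_model=d)
h1, h2, h3 = (torch.randn(B, L, d) for _ in range(3))
h2_refined = layer(h1, h2, h3)  # clip 2 refined using clips 1 and 3 as context
```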
- the controller 330 may build a transformer model for video story question and answer using Equation 2 described above.
- the control unit 330 may receive, through each of the separated encoders, the visual tokens and text tokens included in each video clip corresponding to a preset section as inputs, calculate hidden representations of the lower layers of the temporally adjacent preceding and following video clips, and learn the video story by calculating a representation of the video data that considers the context using the calculated hidden representations.
- the controller 330 may learn a temporal order for each video clip using a masked modality model (hereinafter referred to as MMM).
- MMM may be an extension of the token-based masking technique proposed in the earlier masked language model, in which all tokens of one modality in a given section are masked.
- the masked modality model allows one modality to be generated from the other modality while preventing the encoders from generating the masked tokens too easily from the surrounding tokens, so that the alignment between the modalities can be learned.
- here, the modalities may be video and text. Accordingly, when the above-described learning is performed using the contextual transformer according to an embodiment, the content of a segment (e.g., video data separated by section) can be predicted based on the context before and after it, so that the natural flow of the story can be learned.
- the masked modality model may be learned through contrastive learning with negative samples.
- the masked modality model can be expressed as Equation 3 below:

  [Equation 3]

  $\mathcal{L}_{\mathrm{MMM}} = -\sum_{i \in \mathcal{M}} \log \dfrac{\exp\!\left(\hat{z}_i^{\top} z_i\right)}{\sum_{j} \exp\!\left(\hat{z}_i^{\top} z_j\right)}$

  where $\mathcal{M}$ is the set of masked positions, $\hat{z}_i$ is the embedding predicted for masked position $i$, $z_i$ is the ground-truth token embedding, and the $z_j$ are the candidate token embeddings including the negatives.
- according to Equation 3, the predicted token is pulled closer to the ground-truth token embedding and pushed away from the other tokens.
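As an illustration of the contrastive objective of Equation 3 (the in-batch choice of negatives, the normalization, and the function name are assumptions of this sketch):

```python
import torch
import torch.nn.functional as F

def mmm_contrastive_loss(predicted: torch.Tensor, ground_truth: torch.Tensor) -> torch.Tensor:
    """Contrastive loss in the spirit of Equation 3: each predicted embedding
    for a masked position is pulled toward its ground-truth token embedding
    (the diagonal) and pushed away from the other tokens in the batch.
    predicted, ground_truth: (K, d) for K masked positions."""
    logits = predicted @ ground_truth.t()      # (K, K) similarity matrix
    targets = torch.arange(predicted.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 5 masked positions with 256-dimensional embeddings.
pred = F.normalize(torch.randn(5, 256), dim=-1)
gold = F.normalize(torch.randn(5, 256), dim=-1)
loss = mmm_contrastive_loss(pred, gold)
```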
- FIG. 4 is a flowchart illustrating a method of constructing a transformer model for video story question and answer according to an exemplary embodiment.
- the method of building a transformer model for video story question answering according to the embodiment shown in FIG. 4 includes steps processed time-sequentially by the apparatus 300 for building a transformer model for video story question answering shown in FIGS. 2 and 3. Therefore, even where omitted below, the above description of the apparatus 300 shown in FIGS. 2 and 3 also applies to the method of building a transformer model for video story question answering according to the embodiment shown in FIG. 4.
- the apparatus 300 for building a transformer model for video story question answering may receive video data including a plurality of consecutive video clips and question data for video question answering (S410).
- the video clip may include a plurality of visual tokens and text tokens.
- the apparatus 300 for building a transformer model for video story question answering may learn a video story from the video data including the plurality of consecutive video clips input in step S410 by considering the contexts of the preceding and following video clips that are adjacent in temporal order (S420).
- in this case, the apparatus 300 for building a transformer model for video story question answering may receive, through the separated encoders, the visual tokens and text tokens included in each video clip corresponding to a preset section as inputs, calculate hidden representations of the lower layers of the temporally adjacent preceding and following video clips, and learn the video story by calculating a representation of the video data that considers the context using the calculated hidden representations.
- the apparatus 300 for building a transformer model for video story question answering can learn a temporal order for each video clip by using the masked modality model (MMM).
- as described above, MMM may be an extension of the token-based masking technique proposed in the earlier masked language model, in which all tokens of one modality in a given section are masked.
- MMM allows one modality to be generated from the other modality while preventing the encoders from generating the masked tokens too easily from the surrounding tokens, so that the alignment between the modalities can be learned. Meanwhile, the masked modality model may be learned through contrastive learning with negative samples. In this case, the masked modality model can be expressed as Equation 3 described above.
- The term '~unit' used in the above embodiments means software or a hardware component such as a field-programmable gate array (FPGA) or an ASIC, and a '~unit' performs certain roles.
- However, '~unit' is not limited to software or hardware.
- A '~unit' may be configured to reside in an addressable storage medium and may be configured to run on one or more processors. Thus, as an example, '~unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
- Components and '~units' may be implemented to run on one or more CPUs in a device or a secure multimedia card.
- the method for building a transformer model for video story question and answer may be implemented in the form of a computer-readable medium storing instructions and data executable by a computer.
- instructions and data may be stored in the form of program codes, and when executed by a processor, a predetermined program module may be generated to perform a predetermined operation.
- computer-readable media can be any available media that can be accessed by a computer and include both volatile and non-volatile media, and removable and non-removable media.
- a computer-readable medium may be a computer recording medium, which includes volatile and non-volatile media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data.
- the computer recording medium may be a magnetic storage medium such as HDD and SSD, an optical recording medium such as CD, DVD, and Blu-ray disc, or a memory included in a server accessible through a network.
- the method of building a transformer model for video story question and answer may be implemented as a computer program (or computer program product) including instructions executable by a computer.
- a computer program includes programmable machine instructions processed by a processor and may be implemented in a high-level programming language, object-oriented programming language, assembly language, or machine language.
- the computer program may be recorded on a tangible computer-readable recording medium (eg, a memory, a hard disk, a magnetic/optical medium, or a solid-state drive (SSD)).
- a computing device may include at least some of a processor, a memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and a storage device.
- Each of these components is connected to the others using various buses and may be mounted on a common motherboard or in any other suitable manner.
- the processor may process instructions within the computing device, for example instructions stored in the memory or the storage device, in order to display graphic information for providing a graphical user interface (GUI) on an external input/output device, such as a display connected to the high-speed interface.
- multiple processors and/or multiple buses may be used along with multiple memories and memory types as appropriate.
- the processor may be implemented as a chipset comprising chips including a plurality of independent analog and/or digital processors.
- Memory also stores information within the computing device.
- the memory may consist of a volatile memory unit or a collection thereof.
- the memory may be composed of a non-volatile memory unit or a collection thereof.
- Memory may also be another form of computer readable medium, such as, for example, a magnetic or optical disk.
- a storage device may provide a large amount of storage space to the computing device.
- a storage device may be a computer-readable medium or a component including such a medium, and may include, for example, devices in a storage area network (SAN) or other components, such as a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory, or another similar semiconductor memory device or device array.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Library & Information Science (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Fuzzy Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
Disclosed herein are a device and a method for building a transformer model for video story question answering. The device for building a transformer model for video story question answering comprises: an input/output unit for receiving video data comprising a plurality of consecutive video clips and question data for video question answering, and for outputting the result of executing operations on the video data and the question data; a storage unit in which a program and data for video story question answering are stored; and a control unit which comprises at least one processor and which builds a transformer model for video story question answering by executing the program, the control unit learning a video story from the video data comprising the plurality of consecutive video clips in consideration of the context of chronologically consecutive video clips.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2022-0005770 | 2022-01-14 | ||
KR1020220005770A KR20230109931A (ko) | 2022-01-14 | 2022-01-14 | Apparatus and method for building a transformer model for video story question answering
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023136417A1 (fr) | 2023-07-20
Family
ID=87279250
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2022/012050 WO2023136417A1 (fr) | Method and device for building a transformer model for video story question answering
Country Status (3)
Country | Link |
---|---|
JP (1) | JP2023103966A (fr) |
KR (1) | KR20230109931A (fr) |
WO (1) | WO2023136417A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117439800A (zh) * | 2023-11-21 | 2024-01-23 | Hebei Normal University | Network security situation prediction method, system and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101369270B1 (ko) * | 2012-03-29 | 2014-03-10 | Seoul National University R&DB Foundation | Video stream analysis method using multi-channel analysis |
KR20190056940A (ko) * | 2017-11-17 | 2019-05-27 | Samsung Electronics Co., Ltd. | Method and apparatus for learning multimodal data |
KR102211939B1 (ko) * | 2018-12-07 | 2021-02-04 | Seoul National University R&DB Foundation | Question answering apparatus and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102276728B1 (ko) | 2019-06-18 | 2021-07-13 | 빅펄 주식회사 | Multimodal content analysis system and method therefor |
- 2022
- 2022-01-14 KR KR1020220005770A patent/KR20230109931A/ko unknown
- 2022-08-11 WO PCT/KR2022/012050 patent/WO2023136417A1/fr unknown
- 2022-12-15 JP JP2022199912A patent/JP2023103966A/ja active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101369270B1 (ko) * | 2012-03-29 | 2014-03-10 | Seoul National University R&DB Foundation | Video stream analysis method using multi-channel analysis |
KR20190056940A (ko) * | 2017-11-17 | 2019-05-27 | Samsung Electronics Co., Ltd. | Method and apparatus for learning multimodal data |
KR102211939B1 (ko) * | 2018-12-07 | 2021-02-04 | Seoul National University R&DB Foundation | Question answering apparatus and method |
Non-Patent Citations (3)
Title |
---|
CHOI, SEONGHO ET AL.: "Multi-modal Contextual Transformer for Video Question Answering", PROCEEDINGS OF KOREA SOFTWARE CONGRESS 2021, December 2021 (2021-12-01), pages 801 - 803, XP009547739 * |
IVANO LAURIOLA; ALESSANDRO MOSCHITTI: "Context-based Transformer Models for Answer Sentence Selection", ARXIV.ORG, 1 June 2020 (2020-06-01), XP081690019 * |
XU HU, GHOSH GARGI, HUANG PO-YAO, ARORA PRAHAL, AMINZADEH MASOUMEH, FEICHTENHOFER CHRISTOPH, METZE FLORIAN, ZETTLEMOYER LUKE: "VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding", FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL-IJCNLP 2021, ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, STROUDSBURG, PA, USA, 1 January 2021 (2021-01-01), Stroudsburg, PA, USA, pages 4227 - 4239, XP093078913, DOI: 10.18653/v1/2021.findings-acl.370 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117439800A (zh) * | 2023-11-21 | 2024-01-23 | Hebei Normal University | Network security situation prediction method, system and device |
CN117439800B (zh) | 2023-11-21 | 2024-06-04 | Hebei Normal University | Network security situation prediction method, system and device |
Also Published As
Publication number | Publication date |
---|---|
KR20230109931A (ko) | 2023-07-21 |
JP2023103966A (ja) | 2023-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Burns et al. | A dataset for interactive vision-language navigation with unknown command feasibility | |
JP6267711B2 (ja) | Modernization of legacy software systems based on modeled dependencies | |
WO2017164478A1 (fr) | Method and apparatus for recognizing micro-expressions by means of deep learning analysis of micro-facial dynamics | |
US10553207B2 (en) | Systems and methods for employing predication in computational models | |
US10664659B2 (en) | Method for modifying segmentation model based on artificial intelligence, device and storage medium | |
CN102741859A (zh) | Method and device for reducing power consumption in a pattern recognition processor | |
US20190130270A1 (en) | Tensor manipulation within a reconfigurable fabric using pointers | |
WO2020231005A1 (fr) | Image processing device and operation method thereof | |
WO2022163996A1 (fr) | Device for predicting drug-target interaction by using self-attention-based deep neural network model, and method therefor | |
WO2023136417A1 (fr) | Method and device for building a transformer model for video story question answering | |
WO2022059969A1 (fr) | Deep neural network pre-training method for classifying electrocardiogram data | |
WO2018056613A1 (fr) | Multi-threaded processor and control method therefor | |
WO2022080582A1 (fr) | Goal-oriented reinforcement learning method and device for carrying out same | |
CN110647360A (zh) | Method, apparatus and device for processing device execution code of a coprocessor, and computer-readable storage medium | |
US20210264247A1 (en) | Activation function computation for neural networks | |
WO2022025357A1 (fr) | Block coding processing method for programming education | |
WO2023068463A1 (fr) | Storage device system for quantum circuit simulation | |
WO2021045434A1 (fr) | Electronic device and control method therefor | |
WO2023106466A1 (fr) | Artificial intelligence cloud learning device and method based on learning cloud type | |
WO2021020848A2 (fr) | Matrix operator and matrix operation method for artificial neural network | |
Chamunorwa et al. | Embedded system learning platform for developing economies | |
WO2023101112A1 (fr) | Method for offline meta-reinforcement learning of multiple tasks and computing device for performing same | |
CN111340043B (zh) | Keypoint detection method, system, device and storage medium | |
US20210019592A1 (en) | Cooperative Neural Network for Recommending Next User Action | |
US20200184369A1 (en) | Machine learning in heterogeneous processing systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22920750 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |