KR20210114074A

KR20210114074A - Method, apparatus, device and medium for generating caption information of multimedia data

Info

Publication number: KR20210114074A
Application number: KR1020217028919A
Authority: KR
Inventors: 케 린; 쥬오신 간; 잉잉 지앙
Original assignee: 삼성전자주식회사
Priority date: 2019-03-21
Filing date: 2020-03-23
Publication date: 2021-09-17
Also published as: KR102593440B1; CN111723937A

Abstract

Embodiments of the present disclosure provide a method, an apparatus, a device, and a medium for generating captioning information of multimedia data. The method includes: extracting characteristic information of multimedia data to be processed, the multimedia data including a video or an image; and generating a text caption of the multimedia data based on the extracted characteristic information. The method provided in the embodiments of the present disclosure can effectively improve the accuracy of the generated text caption of the multimedia data.

Description

Method, apparatus, device and medium for generating caption information of multimedia data

The present disclosure relates to the field of computer technology and, in particular, to a method, an apparatus, an electronic device, and a storage medium for generating captioning information of multimedia data.

In computer vision, video captioning or image captioning refers to outputting a text caption for a given video or image. For example, for a video of a child cleaning the floor, video captioning can automatically output the text caption "A child is cleaning the floor." Video captioning lies at the intersection of computer vision and natural language processing.

Existing video captioning methods generally select frames from a video, extract full-image (full-graph) features from the selected frames, decode these features, and generate a text caption of the video based on the maximum likelihood probability. Image captioning works on a similar principle. As this description shows, existing video captioning models generally adopt an encoder-decoder structure: the encoder extracts the features of the video frames, and the decoder decodes those features and generates the text caption. Although many methods for generating video captioning information exist, the accuracy of the generated captioning information still needs to be improved.

Embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for generating video captioning information, in order to improve the accuracy of the generated video captioning information.

An object of the embodiments of the present disclosure is to provide a method, an apparatus, an electronic device, and a storage medium for generating video captioning information that improve the accuracy of the generated video captioning information. The solutions provided by the embodiments of the present disclosure are as follows.

According to a first aspect of the present disclosure, a method for generating captioning information of multimedia data is provided. The method includes: extracting characteristic information of multimedia data to be processed, the multimedia data including a video or an image; and generating a text caption of the multimedia data based on the extracted characteristic information.

According to a second aspect of the present disclosure, a device for generating captioning information of multimedia data is provided. The device includes: a characteristic information extraction module that extracts characteristic information of multimedia data to be processed, the multimedia data including a video or an image; and a captioning information generation module that generates a text caption of the multimedia data based on the extracted characteristic information.

According to a third aspect of the present disclosure, an electronic device is provided. The electronic device includes a memory and a processor; the memory stores a computer program, and the processor is configured to execute the computer program to perform the methods provided by the embodiments of the present disclosure.

According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The storage medium stores a computer program that, when executed by a processor, performs the methods provided by the embodiments of the present disclosure.

The beneficial effects achieved by the technical solutions of the embodiments of the present disclosure are described in detail in the following description of specific implementations, in combination with the various alternative embodiments, and are not described here.

To describe the technical solutions in the embodiments of the present disclosure more clearly, the drawings used in describing the embodiments are briefly introduced below.
FIG. 1 is a schematic diagram of exemplary image captioning.
FIG. 2 is a schematic diagram of exemplary video captioning.
FIG. 3 is a schematic diagram of an existing video captioning algorithm.
FIG. 4 is a schematic diagram of the training process of an existing video captioning algorithm based on supervised learning.
FIG. 5 is a schematic flowchart of a method for generating captioning information of multimedia data according to an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of obtaining semantic features through a semantic prediction network according to an embodiment of the present disclosure.
FIG. 7A is a schematic diagram of a spatial scene graph according to an embodiment of the present disclosure.
FIG. 7B is a schematic diagram of a spatial scene graph according to another embodiment of the present disclosure.
FIG. 8 is a diagram illustrating the principle of obtaining relationship features through a relationship prediction network according to an embodiment of the present disclosure.
FIG. 9 is a diagram illustrating the principle of obtaining attribute features through an attribute prediction network according to an embodiment of the present disclosure.
FIG. 10 is a schematic diagram of a spatio-temporal scene graph according to an embodiment of the present disclosure.
FIG. 11 is a schematic diagram of a spatio-temporal scene graph according to another embodiment of the present disclosure.
FIG. 12 is a schematic diagram of a feature selection network according to an embodiment of the present disclosure.
FIG. 13A is a schematic structural diagram of a self-attention-based codec according to an embodiment of the present disclosure.
FIG. 13B is a schematic structural diagram of a self-attention-based codec according to an embodiment of the present disclosure.
FIGS. 14, 15, and 16 are schematic diagrams of methods for generating video captioning information according to three embodiments of the present disclosure, respectively.
FIGS. 17A and 17B are diagrams illustrating the principle of obtaining video captioning information according to two alternative embodiments of the present disclosure.
FIGS. 18 and 19 are schematic flowcharts of methods for generating video captioning information according to two other alternative embodiments of the present disclosure.
FIG. 20 is a diagram illustrating the principle of obtaining video captioning information according to an embodiment of the present disclosure.
FIG. 21 is a schematic flowchart of a method for generating image captioning information according to an embodiment of the present disclosure.
FIGS. 22 and 23 are schematic structural diagrams of codecs according to two alternative embodiments of the present disclosure.
FIG. 24 is a schematic flowchart of a method for training a multimedia data captioning model according to an embodiment of the present disclosure.
FIG. 25 is a schematic diagram of a sample video with a video captioning label (that is, an original captioning label) according to an embodiment of the present disclosure.
FIG. 26 is a diagram illustrating the principle of a method for training a video captioning model according to an embodiment of the present disclosure.
FIG. 27 is a schematic flowchart of a method for obtaining augmented multimedia data captioning information according to an embodiment of the present disclosure.
FIGS. 28A and 28B are schematic structural diagrams of two codecs according to two alternative embodiments of the present disclosure.
FIG. 29 is a schematic flowchart of a method for training an image captioning model according to an embodiment of the present disclosure.
FIGS. 30 and 31 are diagrams illustrating the principle of methods for generating video captioning information according to two embodiments of the present disclosure.
FIG. 32 is a schematic structural diagram of an apparatus for generating captioning information of multimedia data according to an embodiment of the present disclosure.
FIG. 33 is a schematic structural diagram of an electronic device applicable to an embodiment of the present disclosure.

Embodiments of the present disclosure are described in detail below. The embodiments described with reference to the drawings are exemplary and serve only to explain the present disclosure, so as to support a comprehensive understanding of the embodiments as defined by the claims and their equivalents. The description includes various specific details to aid understanding, but these examples and details are illustrative only and should not be construed as limiting the present disclosure. Accordingly, those skilled in the art will recognize that changes and modifications may be made to the described embodiments without departing from the scope and spirit of the present disclosure. In addition, descriptions of some well-known functions and structures may be omitted below for clarity and conciseness.

Those skilled in the art should understand that the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise. It should further be understood that the terms "comprising" and "including", as used in this specification, indicate the presence of the stated features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or combinations thereof. When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or wireless coupling. As used herein, the phrase "and/or" includes any one of, and all combinations of, one or more of the associated listed items.

First, it should be noted that the method for generating captioning information of multimedia data provided in the embodiments of the present disclosure may be used to generate captioning information of a video that includes multiple frames of images, and may also be used to generate captioning information of an image. The source of the image is not limited in the embodiments of the present disclosure. For example, it may be an image that has been captured, downloaded, or received, or it may be an image in a video, such as a key frame image or a designated frame image. That is, the generating method in an embodiment of the present disclosure may be a method for generating captioning information of a video or a method for generating captioning information of an image.

To better understand and explain the solutions of the embodiments of the present disclosure, some techniques involved in the embodiments are briefly described below.

In computer vision, video/image captioning refers to outputting a text caption for a given video or image, and it lies at the intersection of computer vision and natural language processing. Compared with other computer vision tasks such as object detection and image segmentation, video/image captioning is more challenging: it requires not only a more comprehensive understanding of the video or image, but also the ability to express the content of the video or image in natural language. As shown in FIG. 1, given the image shown in the figure, the text caption "A boy is playing tennis" may be output automatically. As shown in FIG. 2, given the video containing the multiple frames shown in the figure, the text caption "A child is cleaning the floor" may be output automatically.

At present, existing image captioning models generally adopt an encoder-decoder structure. The encoder is usually designed based on a Convolutional Neural Network (CNN) and is responsible for extracting the features of the image. The decoder is usually designed based on a Recurrent Neural Network (RNN) and is responsible for decoding the features of the image to generate the text caption.

Similarly, existing video captioning models generally select frames from a video, extract the full-image (full-graph) features of the selected frames using a CNN, then decode the features of all the frames using an RNN and generate a text caption of the video based on the maximum likelihood probability. Existing video captioning algorithms therefore mostly use an encoder-decoder structure. The CNN encodes the video frames and is responsible for extracting their features, so it may be referred to as an encoder or a CNN encoder. The RNN decodes the features of the video frames and generates the text caption, so it may be referred to as a decoder or an RNN decoder. The RNN is currently often a Long Short-Term Memory (LSTM) network, in which case it may be referred to as an LSTM decoder.

For example, a schematic diagram of an existing video captioning model is shown in FIG. 3. The frames shown in FIG. 3 are selected from a video (the ellipses in the figure stand for frames that are omitted and not shown). Each selected frame is processed individually by the CNN encoder to extract its features, and the extracted features are decoded by the LSTM decoder to generate the corresponding text caption, "A man is putting a pizza in the oven."
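The encoder-decoder flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `mean_pool_encoder` is a stand-in for the CNN encoder, `toy_next_token_probs` is a hypothetical stand-in for the LSTM decoder step, and greedy selection of the most probable token is used as one simple way to decode by maximum likelihood.

```python
from typing import Callable, Dict, List, Sequence

Features = List[float]
# Hypothetical decoder-step signature: (frame features, tokens so far) -> token probabilities.
NextTokenFn = Callable[[Features, List[str]], Dict[str, float]]


def mean_pool_encoder(frames: Sequence[Sequence[float]]) -> Features:
    """Stand-in for the CNN encoder: one scalar feature per frame (mean pixel value)."""
    return [sum(frame) / len(frame) for frame in frames]


def greedy_decode(features: Features, next_token_probs: NextTokenFn,
                  max_len: int = 20, eos: str = "<eos>") -> List[str]:
    """Pick the most probable token at each step until the end token or max_len."""
    caption: List[str] = []
    for _ in range(max_len):
        probs = next_token_probs(features, caption)
        token = max(probs, key=probs.get)
        if token == eos:
            break
        caption.append(token)
    return caption


def toy_next_token_probs(features: Features, prefix: List[str]) -> Dict[str, float]:
    """Hypothetical decoder step that ignores the features and follows a fixed script."""
    script = ["a", "man", "is", "cooking"]
    vocab = script + ["<eos>"]
    target = script[len(prefix)] if len(prefix) < len(script) else "<eos>"
    return {tok: (0.9 if tok == target else 0.1 / (len(vocab) - 1)) for tok in vocab}


frames = [[0.0, 1.0], [1.0, 1.0]]        # two tiny "frames"
features = mean_pool_encoder(frames)     # encoder: per-frame features
caption = greedy_decode(features, toy_next_token_probs)
```

A real decoder would condition each step on the encoded features; the toy step function only keeps the sketch self-contained.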

Although the prior art can generate text captions for videos or images, it has at least the following technical problems.

(1) Existing decoders such as RNNs have a recurrent structure and must be trained step by step. As a result, they train slowly and inefficiently, and they have difficulty learning long-range dependencies, which leads to problems such as insufficient decoding capability.

(2) In the datasets currently in common use for video/image captioning, each training sample (that is, a sample video or sample image) has only a small amount of captioning information. For example, a sample image typically has only five captioning labels. It is often difficult to express the information in an image completely with only five captioning labels, and because natural language is diverse, the same meaning can be expressed in many ways. The poor diversity of the captioning information of training samples is therefore another problem that hinders further development of the field.

(3) For a video containing multiple frames of images, the prior art does not consider intra-frame information, even though this information is very important for generating more accurate video captions. The problem of how to make full use of intra-frame information therefore needs to be solved.

(4) The prior art does not consider the semantic information of the video or image, even though this information is very important for generating more accurate video captions.

(5) Existing video or image captioning models are generally based on supervised learning. For a video captioning algorithm, for example, each training video corresponds to one or more labeled video captions. As shown in FIG. 4, for data labeled with video captions, the video data P in the labeled data is input to a video captioning model K, which analyzes and processes the video data P to generate a corresponding video caption. The value of a loss function Tmark(α) is then calculated based on the video caption Q in the labeled data and the generated video caption, and the training of the video captioning model K is guided by the loss function Tmark(α). However, labeling videos with captions requires a great deal of labor and time. This limits the number of samples in existing video captioning datasets, which in turn reduces the accuracy and precision of video captioning models trained on those datasets.
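As a rough illustration of how a supervised loss can compare a generated caption with a labeled caption Q, the sketch below computes an average negative log-likelihood. The passage above does not give the form of Tmark(α); cross-entropy over caption tokens is assumed here only because it is a common choice, and the function name `caption_nll` and its inputs are hypothetical.

```python
import math
from typing import Dict, List


def caption_nll(step_probs: List[Dict[str, float]], label_tokens: List[str]) -> float:
    """Average negative log-likelihood of the labeled caption.

    step_probs[t] is the model's (assumed) token distribution at step t;
    label_tokens[t] is the t-th token of the labeled caption Q.
    """
    if not label_tokens or len(step_probs) != len(label_tokens):
        raise ValueError("need one non-empty distribution per label token")
    total = 0.0
    for probs, token in zip(step_probs, label_tokens):
        # Clamp to avoid log(0) when the model assigns no mass to the label token.
        total -= math.log(max(probs.get(token, 0.0), 1e-12))
    return total / len(label_tokens)


# Toy example: the model is unsure about the first token and certain about the second.
loss = caption_nll([{"a": 0.5, "the": 0.5}, {"child": 1.0}], ["a", "child"])
```

The loss falls toward zero as the model assigns more probability to the labeled tokens, which is the sense in which such a loss guides training.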

(6) In existing methods for generating video or image captioning information, the length of the generated captioning information cannot be controlled, so users' needs for captioning information of different lengths in different application scenarios cannot be met. For example, when a user posts an image or video, long captioning information is needed in order to share more detail, whereas when a user is driving a car, short captioning information is needed. The existing technology cannot meet these requirements.

To solve at least one of the technical problems of the prior art described above, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for generating captioning information of multimedia data, where the multimedia data may be a video or an image.

To make the objects, technical solutions, and advantages of the present disclosure clearer, the alternative implementations of the present disclosure and the technical solutions of its embodiments are described in detail below with reference to specific embodiments and the drawings. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present disclosure are described below with reference to the drawings.

FIG. 5 is a schematic flowchart of a method for generating captioning information of multimedia data according to an embodiment of the present disclosure. As shown in the figure, the method may mainly include the following steps.

Step S101: Characteristic information of the multimedia data to be processed is extracted.

Step S102: A text caption of the multimedia data is generated based on the extracted characteristic information.

In an embodiment of the present disclosure, the multimedia data includes a video or an image. Based on the method provided in the embodiments of the present disclosure, video captioning information or image captioning information may be generated.

In an alternative embodiment of the present disclosure, extracting the characteristic information of the multimedia data to be processed includes at least one of the following: extracting local visual features of the targets contained in the respective target regions of each image in the multimedia data; extracting semantic features of the multimedia data; extracting spatio-temporal visual features of the multimedia data when the multimedia data is a video; extracting global visual features of the multimedia data; extracting attribute features of the targets contained in the respective target regions of each image in the multimedia data; and extracting global attribute features of each image in the multimedia data.

That is, the characteristic information of the multimedia data may include one or more of local visual features, semantic features, spatio-temporal visual features, global visual features, local attribute features (that is, the attribute features of targets), and global attribute features. A local visual feature is the visual feature of a target region in an image; the target region is local relative to the image to which it belongs.
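The two steps S101 and S102, together with the "one or more of" feature set listed above, can be sketched as a small data structure and a driver function. All names here (`CharacteristicInfo`, `generate_caption`, and the `extract` and `caption` callables) are hypothetical and chosen only for illustration; the feature extractors and the captioning model themselves are not modeled.

```python
from dataclasses import dataclass, fields
from typing import Any, Callable, List, Optional

Feature = List[float]


@dataclass
class CharacteristicInfo:
    """Container for the six kinds of features; any subset may be present."""
    local_visual: Optional[List[Feature]] = None      # one vector per target region
    semantic: Optional[Feature] = None
    spatio_temporal_visual: Optional[Feature] = None  # only meaningful for video
    global_visual: Optional[Feature] = None
    local_attribute: Optional[List[Feature]] = None   # attribute features of targets
    global_attribute: Optional[Feature] = None

    def available(self) -> List[str]:
        """Names of the feature kinds that were actually extracted."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


def generate_caption(media: Any,
                     extract: Callable[[Any], CharacteristicInfo],
                     caption: Callable[[CharacteristicInfo], str]) -> str:
    info = extract(media)                 # step S101: extract characteristic information
    if not info.available():
        raise ValueError("at least one kind of feature is required")
    return caption(info)                  # step S102: generate the text caption


# Toy usage with placeholder callables.
info = CharacteristicInfo(semantic=[0.2], global_visual=[0.1])
text = generate_caption("image", lambda m: info,
                        lambda i: "uses " + ",".join(i.available()))
```

Keeping every field optional mirrors the "at least one of" wording: a captioning model can be fed whichever features a given embodiment extracts.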

When the multimedia data is an image, "each image in the multimedia data" mentioned above is that image itself. When the multimedia data is a video, each image in the multimedia data may be every frame of the video or several frames selected from the video. For example, several image frames (that is, key frames) may be selected from a given video at equal intervals, or by a key frame algorithm, or by a neural network. In the description below, for a video, each frame selected from the video is used as the example of each image in the multimedia data.
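Equal-interval frame selection, the simplest of the options mentioned above, can be sketched as follows. The function name `select_frames_uniform` and the choice to take the frame at the center of each interval are assumptions made for illustration; the source says only that frames are selected at equal intervals.

```python
from typing import List


def select_frames_uniform(num_frames: int, num_selected: int) -> List[int]:
    """Return indices of frames sampled at equal intervals from a video.

    The video is split into num_selected equal segments and the frame at the
    center of each segment is taken (an assumed convention).
    """
    if num_frames <= 0 or num_selected <= 0:
        raise ValueError("num_frames and num_selected must be positive")
    if num_selected >= num_frames:
        return list(range(num_frames))    # fewer frames than requested: keep them all
    step = num_frames / num_selected
    return [int(step * i + step / 2) for i in range(num_selected)]


indices = select_frames_uniform(100, 4)
```

For a 100-frame video and four selected frames, this picks indices 12, 37, 62, and 87. A key frame algorithm or a learned selector would replace this function without changing the rest of the pipeline.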

A visual feature reflects the pixel information in an image, and an attribute feature reflects the attribute information of each target in an image. The visual features and attribute features can therefore be used to generate the captioning information of the video or image. The local visual features reflect the information of the respective target regions in each image more accurately and at a finer granularity. They thus make full use of the intra-frame information of each image, which allows more accurate text captions to be generated for videos or images.

In practical applications, in addition to the visual features and attribute features of the image, some other features also help in captioning multimedia data. For example, the spatio-temporal visual features of a video effectively reflect the spatial and temporal dynamics of the video, and the semantic features of a video or image reflect the semantic information of its content. Therefore, when generating the text caption of a video or image, the spatio-temporal visual features of the video, the semantic features of the video or image, and so on may also be extracted through neural networks. Incorporating a more diverse set of features into caption generation improves the expressiveness of the generated caption and the accuracy of the captioning information.

Optionally, for each image, a feature extraction network may be used to obtain several target regions and the regional features of each target region (that is, the local visual features above, which may also be called target features). The attribute features may be obtained through an attribute prediction network. For example, Faster R-CNN (Faster Region-based Convolutional Neural Network) may be applied to each image to extract several target regions and the local visual features (that is, local or regional features) of each target region. The Faster R-CNN may be pre-trained on sample datasets such as ImageNet or Visual Genome.

Note that extracting the target regions and the local features of each target region with Faster R-CNN is only an example. The disclosure is not limited to it, and the feature extraction network may be implemented by any other suitable neural network.

The spatio-temporal visual features of a video may be obtained using a spatio-temporal feature extraction network. For example, they may be extracted with a 3D visual feature extraction model such as ECO (Efficient Convolutional Network for Online Video Understanding) or a 3D-CNN. Other spatio-temporal visual feature extraction networks may of course also be used. The semantic features may be extracted through a semantic prediction network.

Specifically, a trained semantic prediction network may be used to obtain the semantic features of the whole video or image. As an example, FIG. 6 shows the structure of a semantic prediction network. As shown in the figure, the semantic prediction network may include a CNN and a multi-classification structure (the multi-classification shown in the figure). Taking the video in FIG. 6 as an example, the frames selected from the video are input to the semantic prediction network, the CNN extracts video features from each frame, and the multi-classification structure performs a multi-classification operation on the extracted video features to obtain the probabilities corresponding to a plurality of predefined semantic features of the video. Finally, one or more of the predefined semantic features are output according to these probabilities. As shown in FIG. 6, based on the input frames (the images containing a person and a dog shown in the figure), the semantic prediction network yields probabilities for semantic features such as person, dog and road; in the figure, for example, the probability that a person is present is 0.95 and the probability that a dog is present is 0.8. Based on the predicted probabilities and a preset semantic feature filtering rule, either the semantic features whose probability exceeds a set threshold or a set number of semantic features with the highest probabilities may be output.
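The two filtering rules for semantic features (a probability threshold, or a fixed number of most probable features) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `select_semantic_features`, the threshold of 0.5, and the probability assigned to "road" are assumptions introduced here; only the values 0.95 and 0.8 come from the FIG. 6 example.

```python
from typing import Dict, List, Optional


def select_semantic_features(
    probs: Dict[str, float],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> List[str]:
    """Rank predefined semantic features by probability and filter them."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Rule 1: keep features whose probability exceeds the threshold.
        ranked = [(label, p) for label, p in ranked if p > threshold]
    if top_k is not None:
        # Rule 2: keep a set number of the most probable features.
        ranked = ranked[:top_k]
    return [label for label, _ in ranked]


# "person" and "dog" follow the FIG. 6 example; the "road" value is made up.
frame_probs = {"person": 0.95, "dog": 0.8, "road": 0.3}
above_threshold = select_semantic_features(frame_probs, threshold=0.5)
top_one = select_semantic_features(frame_probs, top_k=1)
```

With the assumed threshold of 0.5, "person" and "dog" are kept and "road" is dropped.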

In an optional embodiment of the present disclosure, the characteristic information of the multimedia data includes the local visual features of the targets contained in the respective target regions of each image of the multimedia data. Generating the text caption of the multimedia data based on the extracted characteristic information includes: obtaining relationship features between the targets based on the local visual features of each target in the image; constructing a scene graph of the image based on the local visual features and the relationship features between the targets; obtaining graph convolution features of the image based on the scene graph of the image; and generating the text caption of the multimedia data based on the graph convolution features of each image of the multimedia data.

A scene graph is a graph structure that represents the local visual features, the attribute features (described in detail below) and the relationship features of the target regions in an image. The scene graph may include multiple nodes and multiple edges. Each node represents either the target feature (that is, the local visual feature above) or an attribute feature of the target (that is, the object) contained in a target region, and each edge represents a relationship feature between nodes.

As an example, FIG. 7A shows the scene graph corresponding to one frame image. As shown in FIG. 7A, some nodes in the scene graph represent the local visual features of the targets in the frame image: the nodes "person", "dog" and "skateboard" represent the feature vectors of the person, the dog and the skateboard. Other nodes represent attribute features of the targets: the node "blue" represents an attribute feature of the target "person", and "black" represents an attribute feature of "dog". The edges between nodes in the scene graph represent the relationship between the two connected nodes, that is, the relationship between two targets. For example, the relationship between the node "person" and the node "dog" is "holding", and the relationship between "person" and "skateboard" is "skating". FIG. 7A shows that the scene graph can reflect the spatial relationships between the targets in the image, so it may also be called a spatial scene graph.
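The spatial scene graph of FIG. 7A can be written down as a small data structure. The sketch below is illustrative only: the class `SceneGraph` and its field names are hypothetical, and plain string labels stand in for the feature vectors that the nodes carry in the described method.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    # Labels stand in for feature vectors in this sketch.
    target_nodes: List[str] = field(default_factory=list)
    attribute_nodes: List[str] = field(default_factory=list)
    # (subject, relation, object) triples between target nodes.
    relation_edges: List[Tuple[str, str, str]] = field(default_factory=list)
    # (attribute, target) pairs linking an attribute node to its target node.
    attribute_edges: List[Tuple[str, str]] = field(default_factory=list)

    def relations_of(self, subject: str) -> List[Tuple[str, str]]:
        """Return (relation, object) pairs whose subject is the given target."""
        return [(rel, obj) for subj, rel, obj in self.relation_edges if subj == subject]

    def attributes_of(self, target: str) -> List[str]:
        """Return the attribute labels attached to the given target."""
        return [attr for attr, tgt in self.attribute_edges if tgt == target]


# Nodes and edges named in the FIG. 7A example.
graph = SceneGraph(
    target_nodes=["person", "dog", "skateboard"],
    attribute_nodes=["blue", "black"],
    relation_edges=[("person", "holding", "dog"), ("person", "skating", "skateboard")],
    attribute_edges=[("blue", "person"), ("black", "dog")],
)
```

Keeping relation edges and attribute edges in separate lists is a design choice of this sketch; it makes it easy to build the graph with or without attribute features.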

In practical applications, for each image, the feature extraction network may be used to obtain several target regions and the regional features of each target region (that is, the local visual features above), and a relationship prediction network may be used to obtain the relationship features between the targets. Specifically, after the local features of the target regions of an image are obtained through the feature extraction network, a pre-trained relationship prediction network may be used to obtain the relationship features between the target regions from the extracted regional features. The regional features of the target regions and their relationship features are then represented as a graph, which gives the scene graph of each frame.

The relationship prediction network is a classification network used to predict the association relationships between target regions. Its specific network structure may be selected according to actual needs. As an optional solution, the relationship prediction network may include a fully connected layer, a feature concatenation layer and a softmax layer. The fully connected layer may be used not only to extract the local visual features of a target region but also to reduce the feature dimension so that the features suit the subsequent feature concatenation layer and softmax layer. The relationship prediction network may be trained on a sample dataset, for example Visual Genome, a commonly used dataset for learning relationships and attributes that contains a large number of object attribute and relationship labels.

Specifically, when the relationship features between target regions are obtained through the relationship prediction network above, the fully connected layer extracts features from the regional features of at least two of the target regions. The features extracted for the respective target regions are input to the feature concatenation layer, where they are concatenated, and the result is then input to the softmax layer. The relationship feature between the at least two target regions is obtained according to the probability that the softmax layer outputs for each relationship.

As an example, FIG. 8 is a schematic diagram of predicting relationship features with the relationship prediction network. As shown in FIG. 8, the regional features of the targets "person" and "dog" in an image frame are each input to the fully connected layer of the relationship prediction network. The features output by the fully connected layer are concatenated and then input to the softmax layer, which gives the probabilities of the various predefined relationship features between the person and the dog. Finally, one predefined relationship feature is output according to these probabilities. For example, if the probability of "on" is 0.12, the probability of "left" is 0.20, the probability of "holding" is 0.40, and so on, the relationship feature finally output is "holding", which has the highest probability. In practical applications, one relationship feature may correspond to the at least two target regions, and, as in this example, the relationship feature with the maximum probability may be selected.
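The data flow of FIG. 8 (fully connected layer, feature concatenation, softmax) can be sketched in plain Python. This is an untrained toy under several assumptions that are not in the text: the class `RelationPredictor`, the relation label set, the feature sizes, the random weights, the sharing of one fully connected layer between both targets, and the linear projection placed before the softmax are all introduced here for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

RELATIONS = ["on", "left", "holding", "other"]  # hypothetical label set

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def linear(x: Sequence[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng: random.Random, out_dim: int, in_dim: int) -> Layer:
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class RelationPredictor:
    def __init__(self, feat_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Assumption: both targets share one dimension-reducing layer.
        self.fc = random_layer(rng, hidden_dim, feat_dim)
        # Assumption: a linear projection to relation logits precedes the softmax.
        self.out = random_layer(rng, len(RELATIONS), 2 * hidden_dim)

    def predict(self, subj_feat: Sequence[float], obj_feat: Sequence[float]) -> List[float]:
        h_subj = linear(subj_feat, *self.fc)
        h_obj = linear(obj_feat, *self.fc)
        joint = h_subj + h_obj  # feature concatenation (list concatenation)
        return softmax(linear(joint, *self.out))


model = RelationPredictor(feat_dim=8, hidden_dim=4)
probs = model.predict([0.5] * 8, [0.2] * 8)  # made-up regional features
best = RELATIONS[probs.index(max(probs))]    # relation with the highest probability
```

Because the weights are random, `best` carries no meaning here; a trained network would be needed to reproduce an output such as "holding".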

In the method provided in an embodiment of the present disclosure, the text caption of a video or image is obtained based on the local visual features (also called regional features or local features) of the target regions (also called target candidate regions) of each image. Because the local visual features of the target regions reflect the information of local areas in each image more accurately and at a finer granularity, the method makes full use of the intra-frame information of each image and can therefore generate a more accurate text caption of the video or image.

In addition, the method in the embodiment of the present disclosure determines the relationship features between the target regions based on the local features and constructs a spatial scene graph of each image (for a single image, one graph) based on the relationship features. The graph convolution features of each image are obtained from the spatial scene graph, and the text caption of the video or image is obtained from the graph convolution features of each image. The spatial scene graph reflects the targets in the image and the relationships between them well, and these relationships are very useful for understanding and describing the image content, so graph convolution features based on the spatial scene graph can improve the accuracy of the text caption of the video or image.

For convenience of description, in the following, for a target region, the node corresponding to the local visual feature of the target region is called a target node, the local visual feature represented by the target node is called a target feature, and the node for an attribute feature of the target region is called an attribute node.

In an optional embodiment of the present disclosure, the characteristic information of the multimedia data may include the attribute features of the targets contained in the respective target regions of each image in the multimedia data. In this case, constructing the scene graph of the image based on the local visual features of each target and the relationship features between the targets may include constructing the scene graph of the image based on the local visual features of each target, the relationship features between the targets, and the attribute features of each target, where one node in the scene graph represents the local visual features or the attribute features of one target corresponding to a target region.

In this optional solution, the local visual features, attribute features and relationship features corresponding to the target regions are used to construct the scene graph, so the scene graph may include target nodes and attribute nodes. The nodes in the scene graph then represent the local visual features or the attribute features of the target regions, that is, the target features and the attributes of the targets contained in the target regions. The edges accordingly include edges that represent relationships between targets (for example, the edge between "person" and "dog" in the spatial scene graph of FIG. 7A) and edges that represent the relationship between a target and its attribute (for example, the edge between "person" and "blue" in the spatial scene graph of FIG. 7A).

In practical applications, to reduce redundant connections, the attribute node and the target node of the same target region may also be merged. That is, for a target region, the node representing its local visual features and the node representing its attribute features may be the same node. For example, merging the target nodes and attribute nodes of the same target in the scene graph of FIG. 7A gives the scene graph of FIG. 7B. "Person" and "blue" in FIG. 7A represent the target feature and the attribute feature of the person in the image, so they may be merged into the node "person in blue" in FIG. 7B, which reflects both the target feature and the attribute feature of the person in that target region. Similarly, "dog" and "black" in FIG. 7A represent the target feature and the attribute feature of the dog in the image, so they may be merged into the node "black dog" in FIG. 7B.
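Merging attribute nodes into their target nodes, as in the step from FIG. 7A to FIG. 7B, can be sketched on node labels. The helper `merge_attribute_nodes` and the "attribute target" label format are assumptions of this sketch (FIG. 7B itself uses labels such as "person in blue"); how the underlying feature vectors are fused is not specified in this passage and is not modelled here.

```python
from typing import Dict, List, Tuple


def merge_attribute_nodes(
    targets: List[str],
    attribute_edges: List[Tuple[str, str]],
    relation_edges: List[Tuple[str, str, str]],
) -> Tuple[List[str], List[Tuple[str, str, str]]]:
    """Fold each attribute node into its target node and rewire the relation edges."""
    attrs: Dict[str, List[str]] = {target: [] for target in targets}
    for attribute, target in attribute_edges:
        attrs[target].append(attribute)
    # One merged node per target; targets without attributes keep their label.
    merged = {target: " ".join(attrs[target] + [target]) for target in targets}
    new_edges = [(merged[subj], rel, merged[obj]) for subj, rel, obj in relation_edges]
    return list(merged.values()), new_edges


nodes, edges = merge_attribute_nodes(
    targets=["person", "dog", "skateboard"],
    attribute_edges=[("blue", "person"), ("black", "dog")],
    relation_edges=[("person", "holding", "dog"), ("person", "skating", "skateboard")],
)
```

After merging, the graph has three nodes instead of five and no attribute edges, which is the reduction in connections that the merge is meant to achieve.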

The attribute features may be obtained from the local visual features of the target region. This optional solution also takes the attribute features of the target regions in the image into account when constructing the scene graph. Because the attributes of each target in the image are also very helpful for describing the image content, a scene graph that incorporates the attribute features allows the video to be described more accurately.

Optionally, the attribute features corresponding to the target regions may be obtained from the extracted local visual features using an attribute prediction network. The attribute prediction network is a multi-classification network and may be trained on a sample dataset such as Visual Genome. In practical applications, its specific network structure may be selected according to actual needs.

In an optional embodiment of the present disclosure, the attribute prediction network may include multiple attribute classifiers, where each classifier corresponds to the prediction of one type of attribute.

The attribute types may be divided according to actual needs. As one option, the attribute type may specifically refer to the part of speech of the attribute. For example, the parts of speech may include nouns, verbs and adjectives, together with a further type for relatively rare attributes. The attribute features are then predicted using classifiers that cover these multiple attribute types.

In the conventional way of constructing a spatial scene graph, the attributes of objects (that is, targets) are not differentiated, and all attributes are classified by a single classifier, so the accuracy of the obtained attributes is low. In contrast, when the spatial scene graph is constructed from attribute features predicted by type as described above, more specific target attribute information can be obtained, including noun attributes (such as clothes or glasses), verb attributes (such as standing or walking), adjective attributes (such as tall or fat) and relatively rare attributes (such as blue whale or pony). Because different classifiers are used for different attribute types, the obtained attributes are more accurate and more diverse, and more accurate captioning information can be generated from the predicted attribute features. In addition, to reduce redundancy, attribute nodes and target nodes may be merged, which improves data processing efficiency.

As an optional solution, the attribute prediction network may include a fully connected layer and a multi-classification layer. The fully connected layer may be used not only to extract the attribute features of a target region but also to reduce the feature dimension so that the features suit the subsequent multi-classification layer. Optionally, the multi-classification layer may be implemented by multiple sigmoid functions. When the attribute prediction network is used to obtain the attribute features of the target regions, the input of the network is the local visual feature of a target region, and the output is one or more of several predefined attribute features.

As an example, FIG. 9 illustrates the principle of predicting the attribute features of a target region with the attribute prediction network. As shown in FIG. 9, the local visual features (the local features in the figure) of the target "person" in an image frame are input to the fully connected layer of the attribute prediction network. The features output by the fully connected layer are input to the multi-classification layer, which performs multi-classification operations to obtain the probability of each predefined attribute feature of the person. Finally, some of the predefined attribute features are output according to these probabilities, for example the attribute feature "blue" with probability 0.92 and the attribute feature "tall" with probability 0.54. In practical applications, a target region may have one or more attribute features. For example, the attribute feature with the maximum probability may be output as the attribute feature of the target region; alternatively, the attribute features whose probability exceeds a set threshold, or a set number of attribute features with the highest probabilities, may be output as the attribute features of the target region.
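Because the multi-classification layer may use one sigmoid per attribute, each attribute receives an independent probability and several attributes can be output for the same target. The sketch below illustrates the three output rules on the "blue" and "tall" example; the function names, the 0.5 threshold, and the "short" attribute with its probability are assumptions, and the logits are back-computed from the quoted probabilities purely for demonstration.

```python
import math
from typing import Dict, List, Optional


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Inverse of sigmoid, used only to build demo inputs."""
    return math.log(p / (1.0 - p))


def select_attributes(
    logits: Dict[str, float],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> List[str]:
    """Apply one sigmoid per attribute, then pick attributes by one of three rules."""
    probs = {name: sigmoid(z) for name, z in logits.items()}
    ranked = sorted(probs, key=lambda name: probs[name], reverse=True)
    if threshold is not None:
        return [name for name in ranked if probs[name] > threshold]
    if top_k is not None:
        return ranked[:top_k]
    return ranked[:1]  # default: the single most probable attribute


# 0.92 and 0.54 follow the FIG. 9 example; "short" at 0.10 is made up.
person_logits = {"blue": logit(0.92), "tall": logit(0.54), "short": logit(0.10)}
most_probable = select_attributes(person_logits)
over_threshold = select_attributes(person_logits, threshold=0.5)
top_two = select_attributes(person_logits, top_k=2)
```

Unlike a softmax, the sigmoid outputs need not sum to one, which is why both "blue" and "tall" can pass the same threshold.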

As an optional solution of the present disclosure, the step of obtaining the attribute features corresponding to the target regions may be performed or omitted. That is, the scene graph may be constructed from the local visual features, the relationship features and the attribute features, or from the local visual features and relationship features alone, without the attribute features. When the step of obtaining the attribute features is omitted, the nodes of the constructed scene graph represent the target features corresponding to the target regions, that is, the local visual features. When the scene graph is constructed based on the relationship features and the attribute features, each node in the scene graph represents a target feature and/or an attribute corresponding to a target region, and each edge represents a relationship between the targets corresponding to the nodes.

Taking FIG. 7A as an example, after the target features, the attribute features and the relationship features of the person and the dog are extracted from the image, the scene graph of the image can be constructed. In the scene graph of FIG. 7A, each block represents the target feature or an attribute feature of a target region, and a line between blocks that represent target features represents the relationship between those targets, that is, the corresponding relationship feature. Each elliptical block represents a relationship feature between target regions, and the arrow (that is, the direction of the edge) indicates which target is the subject and which is the object. In the relationship between "person" and "dog" in FIG. 7A, for example, the direction of the edge indicates that "person" is the subject and "dog" is the object. A line between a node that represents a target feature and a node that represents an attribute feature represents the attribute relationship between them. In the relationship between "person" and "blue" in FIG. 7A, "blue" is an attribute of the person, and the direction of the arrow indicates the attribute relationship, that is, that the attribute "blue" belongs to the person. The scene graph of FIG. 7A clearly shows each target in the image together with the relative positions, attributes and behavioral relationships of the targets.

In an optional embodiment of the present disclosure, if the multimedia data is a video, the images of the multimedia data are a plurality of frames selected from the video. When target regions of two adjacent frames contain the same target, the scene graphs of the two adjacent frames have a temporal edge between the nodes (target nodes) corresponding to that target; that is, the multiple edges include temporal edges.

To make better use of temporal information, this optional solution further considers the temporal information between adjacent frames among the frames selected from the video, and adds it to the scene graph of each frame to obtain a spatio-temporal scene graph. Specifically, if target regions in two adjacent frames correspond to the same target, a temporal edge is added between the target nodes of those target regions in the scene graphs of the two frames. A scene graph with temporal edges reflects both the spatial and the temporal relationships between the targets and may be called a spatio-temporal scene graph.

As an example, FIG. 10 shows a schematic diagram of a spatio-temporal scene graph, in which temporal edges are added to the scene graph of each frame. In the scene graphs of two adjacent frames, if target regions belonging to the two scene graphs have the same target class, a temporal edge is added between the two target regions. For example, in the scene graphs of the first and second frames shown in FIG. 10, the target classes person, oven, and pizza appear in both frames, so temporal edges are added between the corresponding target regions: between the two persons, between the two ovens, and between the two pizzas.
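The class-matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each frame's scene graph is reduced to a hypothetical mapping from node id to target class, and the function name `add_temporal_edges` and the node ids are invented for the example, not taken from the disclosure.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified representation: one dict per frame,
# mapping a node id to the target class detected for that node.
FrameNodes = Dict[str, str]


def add_temporal_edges(frames: List[FrameNodes]) -> List[Tuple[int, str, int, str]]:
    """Return temporal edges as (frame index, node id, next frame index, node id)."""
    edges = []
    for t in range(len(frames) - 1):
        current, following = frames[t], frames[t + 1]
        for node_a, class_a in current.items():
            for node_b, class_b in following.items():
                # A temporal edge links target nodes of the same class
                # in two adjacent frames.
                if class_a == class_b:
                    edges.append((t, node_a, t + 1, node_b))
    return edges


frames = [
    {"n0": "person", "n1": "oven", "n2": "pizza"},
    {"m0": "person", "m1": "oven", "m2": "pizza"},
    {"k0": "person", "k1": "pizza"},
]
temporal_edges = add_temporal_edges(frames)
```

Matching by class alone is the simplest reading of the rule; if a frame contained several targets of the same class, a tracker would be needed to decide which pairs are actually the same target.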

Compared with a spatial scene graph, the spatio-temporal scene graph adds relationships between objects (that is, targets) in the temporal dimension, and can therefore better describe the spatio-temporal information of the video. In addition, the spatio-temporal scene graph may further include behavior information of the target corresponding to a temporal edge (described below), in order to improve the accuracy of behavior captioning.

In this alternative solution of the embodiment of the present disclosure, inter-frame temporal information is also considered; that is, when the scene graph is built, temporal edges are incorporated to obtain a spatio-temporal scene graph. Because temporal edges are established between identical targets in adjacent frames, the solution takes the correlation between frames fully into account. Therefore, when graph convolution features are extracted from the scene graph, the continuity of a target across different images can be learned better, and a better video caption can be obtained by making full use of both the intra-frame and the inter-frame information.

In an alternative embodiment of the present disclosure, obtaining the graph convolution features of an image based on the scene graph of the image includes: encoding the nodes and edges in the scene graph to obtain feature vectors of a target dimension; and obtaining the graph convolution features from the obtained feature vectors using a graph convolution network.

In practice, if the obtained local visual features, attribute features, and relationship features already have the same dimension, the encoding step may be either performed or omitted.

Specifically, when the nodes in the constructed scene graph represent the target features or attribute features of target regions, the relationship features and attribute features obtained for the target regions may be encoded into feature vectors of the target dimension. A graph convolution network is then applied to the encoded feature vectors to learn the relationships between adjacent nodes and edges in the scene graph, yielding the graph-convolved features (that is, the graph convolution features) of each node in the scene graph. The features learned by the graph convolution network are obtained on the basis of the graph structure (the scene graph), so the graph convolution features may include target features, attributes, and relationship information.

When the nodes in the constructed scene graph represent only target features, the graph convolution features may include target features and relationship information. In this case, the attribute features of the target regions are not obtained from the extracted local features of the targets using the attribute prediction network.

For example, a fully connected matrix may be used to encode all or some of the nodes (such as the target nodes) and the edges in the scene graph into feature vectors of the target dimension (a fixed dimension related to the feature dimension of the input vectors of the subsequent stage). For example, when the relationship features of the scene graph have dimension 512 and the target dimension is 1024, a 512×1024 matrix may be applied to the relationship features to obtain vectors of the target dimension 1024.
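As a rough illustration of the dimension change in this example, the following sketch multiplies a 512-dimensional relationship feature by a 512×1024 matrix. The random initialisation, the helper name `project`, and the absence of a bias term are assumptions made for the example; in the described method the matrix would be learned.

```python
import random


def project(feature, weight):
    """Compute feature @ weight for a row vector and a (d_in x d_out) matrix."""
    d_in, d_out = len(weight), len(weight[0])
    assert len(feature) == d_in
    return [sum(feature[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]


random.seed(0)
REL_DIM, TARGET_DIM = 512, 1024

# Stand-in for a learned fully connected matrix (randomly initialised here).
W_rel = [[random.gauss(0.0, 0.02) for _ in range(TARGET_DIM)] for _ in range(REL_DIM)]

# Stand-in for one relationship feature taken from the scene graph.
relation_feature = [random.random() for _ in range(REL_DIM)]

encoded = project(relation_feature, W_rel)  # now has the target dimension
```

Node features of other dimensions would each use their own matrix of shape (own dimension × 1024), so that every vector entering the graph convolution has the same length.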

After the feature vectors of the target dimension are obtained, a graph convolution formula may be applied to each node to obtain its graph convolution features. The simplest, unweighted graph convolution formula is given by Equation 1 below:

Here, v_i denotes the feature vector of node i in the scene graph, that is, a target feature vector or an attribute feature vector; N(v_i) denotes the set of nodes adjacent to node i in the scene graph (that is, within the same frame); and v_j is the feature vector of a node j adjacent to node i in the same frame (that is, the same scene graph). The neighbors of node i generally do not include node i itself. W is a learnable weight parameter of the graph convolution network, v_i⁽¹⁾ denotes the graph-convolved features (graph convolution features) of node i, and σ denotes a nonlinear activation function.
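Since the equation itself is not reproduced in this text, the following is only one plausible reading of the symbol definitions above: each node's new feature is σ applied to the sum of W·v_j over its neighbors. The choice of ReLU for σ, the use of a plain sum rather than an average, and all names and numbers are assumptions for illustration.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    """Example nonlinearity standing in for sigma."""
    return [max(0.0, x) for x in v]


def graph_conv(features, neighbors, W):
    """Unweighted graph convolution: sigma(sum over neighbors j of W v_j)."""
    updated = {}
    for i in features:
        total = [0.0] * len(W)
        for j in neighbors[i]:  # N(v_i) excludes node i itself here
            total = [a + b for a, b in zip(total, matvec(W, features[j]))]
        updated[i] = relu(total)
    return updated


# Toy scene graph loosely modelled on the person / dog / blue example.
features = {"person": [1.0, 0.0], "dog": [0.0, 1.0], "blue": [1.0, 1.0]}
neighbors = {"person": ["dog", "blue"], "dog": ["person"], "blue": ["person"]}
W = [[1.0, -1.0], [0.5, 0.5]]

updated = graph_conv(features, neighbors, W)
```

Because a single matrix W is shared by all edges, this form cannot distinguish the subject from the object of a relationship, which is what the weighted variants described next address.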

In practice, the relationships between different nodes in the scene graph differ in importance to the image; that is, the relationships between different targets differ in importance to the final captioning information of the multimedia data. The edges in the scene graph may therefore be weighted edges, and, as another alternative, the graph convolution features of each node may be obtained using Equation 2 below.

Here, v_j and N(v_i) have the same meanings as in Equation 1, and N(v_i) may include v_i itself; W and b denote the learnable weight and bias parameters of the graph convolution network; dir(v_i, v_j) denotes the direction of the edge and takes one of two values, from v_i to v_j (that is, the edge is directed from node i to node j) or from v_j to v_i; accordingly, W_dir(v_i,v_j) takes one of two values, one for each direction; label(v_i, v_j) denotes the relationship between v_i and v_j, with a different bias value for each relationship; σ denotes a nonlinear activation function; and v_i⁽¹⁾ denotes the graph convolution feature of node i.
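The description above suggests a direction-dependent weight matrix and a relationship-dependent bias. The sketch below is one reading of that description, not the equation from the disclosure: for every neighbor j of node i it adds W_dir·v_j plus the bias for the edge label, then applies ReLU as a stand-in for σ. The dictionary keys `"i_to_j"` and `"j_to_i"` and all numeric values are hypothetical.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def directed_graph_conv(features, edges, W_dir, b_label):
    """edges: list of (subject node, object node, relationship label)."""
    dim = len(next(iter(features.values())))
    sums = {i: [0.0] * dim for i in features}
    for src, dst, label in edges:
        # Seen from src, the edge runs v_i -> v_j; seen from dst, v_j -> v_i.
        for i, j, direction in ((src, dst, "i_to_j"), (dst, src, "j_to_i")):
            msg = matvec(W_dir[direction], features[j])
            sums[i] = [s + m + b for s, m, b in zip(sums[i], msg, b_label[label])]
    # ReLU stands in for the nonlinear activation sigma.
    return {i: [max(0.0, x) for x in s] for i, s in sums.items()}


features = {"person": [1.0, 0.0], "dog": [0.0, 1.0]}
edges = [("person", "dog", "holding")]
W_dir = {
    "i_to_j": [[1.0, 0.0], [0.0, 1.0]],
    "j_to_i": [[2.0, 0.0], [0.0, 2.0]],
}
b_label = {"holding": [0.5, -0.5]}

out = directed_graph_conv(features, edges, W_dir, b_label)
```

With these toy values the "person" node receives the dog's feature through the `i_to_j` matrix, while the "dog" node receives the person's feature through the `j_to_i` matrix, so the same edge contributes differently to its two endpoints.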

As yet another alternative, Equation 2 may be extended to the form shown in Equation 3 below. In this alternative, a weight is applied to the node itself, since different nodes also differ in importance; two further weights are used for neighboring nodes according to their membership in the relationship; and two weights are used for the relationship features.

Here, v_i⁽¹⁾, σ, v_i, v_j, and N(v_i) have the same meanings as the corresponding parameters above and are not described again. W_s, W_(sub,obj), W_(in,out), and W_a are learnable parameter weight matrices. Specifically, W_s is the parameter weight matrix for v_i, and W_(sub,obj) is the parameter weight matrix for the features of an adjacent node (that is, v_j), where (sub, obj) indicates membership in the relationship and takes one of two values, indicating whether v_i is the subject or the object of the relationship. For example, in the scene graph shown in FIG. 7A, in the relationship between "person" and "dog", "person" is the subject and "dog" is the object. If v_j is the subject and v_i is the object, W_obj is used (for example, "person" is the subject relative to "dog" in FIG. 7A); otherwise, W_sub is used.

W_(in,out) is the parameter weight matrix for the relationship, where (in, out) indicates the direction of the edge and takes one of two values, indicating whether the edge is output from node i or input to node i. For the edge between "person" and "dog" shown in FIG. 7A, the edge is output from the "person" node and input to the "dog" node. If v_i is the subject and v_j is the object, W_in is used; otherwise, W_out is used. e^r denotes the relationship feature vector between the two corresponding nodes. e^r_{(v_i,v_j),(v_j,v_i)} denotes the feature vector of the relationship between node i and node j, and it may be the same or different depending on the direction of the edge, that is, on the subject-object relationship. For example, in the relationship between "person" and "dog", the relationship is "holding" from the perspective of "person" but "held" from the perspective of "dog". A_vi,vj denotes an attention layer, specifically an attention parameter matrix, and W_a denotes the weights of the attention layer.

In another alternative, for a video, if the constructed scene graph is a spatio-temporal scene graph, inter-frame information may be considered in addition to intra-frame information for each frame. That is, the graph convolution features of a frame may be obtained based on the scene graph of that frame and the scene graphs of its adjacent frames. Specifically, all or some of the nodes and the edges in the spatio-temporal scene graph are encoded to obtain feature vectors of the target dimension, and the obtained feature vectors are then passed through a graph convolution network to obtain the graph convolution features.

Alternatively, the graph convolution features may be obtained using Equation 4 below.

Here, the parameters of Equation 4 that also appear in Equation 3 have the same meanings. N_b(v_i) denotes the set of nodes in the frames adjacent to the current frame that have the same class as node i, that is, the set of identical targets in adjacent frames. In the example shown in FIG. 10, if the "person" node in the second frame is v_i, then N_b(v_i) is the set of "person" nodes in the first frame and/or the third frame. W_(pre,aft) is the parameter weight matrix for the same target in adjacent frames, where (pre, aft) indicates the temporal order between the current frame, to which node i belongs, and the frame to which node j in N_b(v_i) belongs. It takes one of two values, indicating whether the current frame comes before or after the adjacent frame, that is, whether the frame containing node i is the earlier or the later frame in the time sequence. If v_j belongs to the previous frame, W_pre is used; otherwise, W_aft is used.
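The selection between W_pre and W_aft can be illustrated in isolation. The sketch below computes only a temporal contribution for one node, summing W_pre·v_j for same-class nodes in the previous frame and W_aft·v_j for those in the following frame. How this term is combined with the intra-frame terms of Equation 4 is not shown, and the function name and values are assumptions.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def temporal_term(frame_i, temporal_neighbors, W_pre, W_aft):
    """temporal_neighbors: list of (frame index of node j, feature v_j)."""
    total = [0.0] * len(W_pre)
    for frame_j, v_j in temporal_neighbors:
        # v_j in the previous frame uses W_pre; otherwise W_aft is used.
        W = W_pre if frame_j < frame_i else W_aft
        total = [t + m for t, m in zip(total, matvec(W, v_j))]
    return total


W_pre = [[1.0, 0.0], [0.0, 1.0]]
W_aft = [[0.0, 2.0], [2.0, 0.0]]

# "person" node in frame 1 with same-class nodes in frames 0 and 2.
person_neighbors = [(0, [1.0, 0.0]), (2, [0.0, 1.0])]
term = temporal_term(1, person_neighbors, W_pre, W_aft)
```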

In an alternative embodiment of the present disclosure, when constructing the spatio-temporal scene graph, the method may further include determining the behavior features of the target corresponding to each temporal edge.

In this case, for each frame, the spatio-temporal scene graph may be built based on the local visual features, the attribute features (optional), and the relationship features corresponding to the target regions of the frame, together with the behavior features of the targets corresponding to the temporal edges.

In other words, the behavior features of each target node corresponding to a temporal edge may also be added to the scene graph. Alternatively, an object tracking method may be used to identify the same target across adjacent frames. For that target, a pre-trained behavior classifier (behavior detector) may be used to identify the behavior class (also called the behavior relationship) of the target in the image, and the feature vector of the behavior class may be used as the behavior feature of the target.

As shown in the spatio-temporal scene graph of FIG. 11, the targets common to each frame are "person", "pizza", and "oven". The behavior corresponding to "person" is "opening"; that is, in the spatio-temporal scene graph, the value of the temporal edge connecting this target in adjacent frames is "opening". The behavior corresponding to "pizza" is "held", and the behavior corresponding to "oven" is "opened". Compared with the scene graph of FIG. 10, this scene graph further includes behavior information for the same target in adjacent frames. When captioning information is generated from this scene graph, more image detail can be used, further improving the accuracy of the generated captioning information.

Corresponding to this alternative solution, Equation 5 below may be used to compute the graph convolution features of each node in the scene graph.

For the parameters that also appear in Equation 4, refer to the description above. W_T denotes the parameter weight matrix for the behavior relationship (that is, the behavior class) of the same target in adjacent frames, and e^a_(vi,vj) denotes the behavior class (specifically, it may be the feature vector of the behavior class), that is, the behavior feature of the same target in adjacent frames. As shown in FIG. 11, for the target "person" appearing in adjacent frames, the corresponding behavior relationship is "opening"; for the scene graph of the first frame, the behavior class corresponding to the node "oven" in that scene graph (node i) and the node "oven" in the scene graph of the second frame (node j in the adjacent frame) is the feature vector of the behavior "opened". W_T is a learnable weight matrix and takes different weight values for different behavior classes. A_vi,vj is an attention parameter matrix that can assign different weights to different targets. As shown in FIG. 7B, when the features of the "dog" node are updated, the "person" node is more closely related to the "dog" node than the "skateboard" node is, so the "person" node is given a higher weight.

Equation 5 can be understood intuitively as follows. For the feature v_i of a given object (that is, a target node), such as the "person" node in FIG. 11, the updated feature obtained after applying the graph convolution network (that is, the graph convolution feature) incorporates the feature v_i of the target itself, the features of objects related to the target (for example, the oven and the pizza), and the features of the target nodes in adjacent frames.

The way of obtaining graph convolution features provided by the embodiment of the present disclosure differs from existing graph convolution feature extraction in two respects. (1) When node features are updated, attention (that is, the attention weight A_vi,vj) is added, so that different weights can be assigned to a node's neighbors. As in the example above, when the features of the "dog" node are updated, the "person" node is more closely related to the "dog" node than the "skateboard" node is, so the "person" node is given a higher weight. (2) For two adjacent objects, the features may be updated using different weight parameter matrices depending on which is the subject and which is the object, and on their order in time. For example, in the relationship between the "person" node and the "skateboard" node, "person" is the subject and "skateboard" is the object; when the "person" node is updated, the weight parameter matrix is W_sub, and when the "skateboard" node is updated, the weight parameter matrix is W_obj.
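To make the first point concrete, the sketch below assigns attention weights to the neighbors of the "dog" node and aggregates their features. The text does not state how A_vi,vj is computed, so the dot-product score followed by softmax used here is an assumption, as are the toy feature values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(node, neighbors):
    """Score each neighbor by its dot product with the node, then normalise."""
    scores = [sum(a * b for a, b in zip(node, vec)) for vec in neighbors.values()]
    return dict(zip(neighbors.keys(), softmax(scores)))


dog = [1.0, 0.5]
neighbors = {"person": [0.9, 0.6], "skateboard": [0.1, 0.2]}

weights = attention_weights(dog, neighbors)

# Attention-weighted sum of neighbor features used to update the "dog" node.
aggregated = [
    sum(weights[name] * vec[k] for name, vec in neighbors.items())
    for k in range(len(dog))
]
```

With these values the "person" neighbor scores higher than "skateboard", so it contributes more to the update, which mirrors the behavior described for FIG. 7B.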

In practice, the node representing an attribute feature and the node representing the target feature of the same target may be merged in the scene graph; that is, a node in the scene graph may represent both a target feature and an attribute feature. With the alternative methods above, the graph convolution features of each node in the scene graph of each image can be obtained through the graph convolution network, and the graph convolution features of an image are the graph convolution features of all nodes in its scene graph. In addition, the scene graph may contain both nodes representing attribute features and nodes representing the target features of the same target; that is, the nodes in the scene graph may be target nodes and attribute nodes. In this case, the graph convolution features of each target node and each attribute node can be obtained. When the graph convolution features of nodes are obtained by the alternative methods above, if one or more terms of the equations do not exist for a particular node, a preset value, such as a zero vector or another preset feature vector, may be used instead.

In an alternative embodiment of the present disclosure, if the characteristic information of the multimedia data includes at least two of local visual features, semantic features, spatio-temporal visual features, and global features, generating the text caption of the multimedia data based on the extracted characteristic information includes: determining a weight for each type of characteristic information; weighting each type of characteristic information by its weight; and generating the text caption of the multimedia data based on the weighted characteristic information.

In practice, for different multimedia data (such as different videos and different images), the importance of each type of characteristic information is likely to differ. Because different types of characteristic information can be given different weights, different features can play different roles, and the solution of the embodiment can adapt to generating captioning information for different videos; that is, for different videos, the various types of characteristic information can each play a different role.

Optionally, a feature selection network may be used to determine the weight of each type of characteristic information. Once trained, the feature selection network can select which characteristic information to rely on when generating captioning information for different multimedia data; that is, for given multimedia data, the respective weights of the different types of characteristic information can be determined by the feature selection network.

As an example, FIG. 12 illustrates a schematic diagram of a feature selection network. As shown in FIG. 12, the characteristic information in this example consists of the graph convolution feature (V_GCN), the spatio-temporal visual feature (V_3d), and the semantic feature (V_SF). In this example, a specific implementation of the feature selection network can be expressed by Equation 6 below.

Here, a_t denotes the set of weight values output by the feature selection network at time t, that is, the weight of each type of characteristic information at time t. E_1:t-1 denotes the embeddings of the words from time 1 to time t-1; that is, when the t-th word of the video captioning information is being decoded, the feature vectors of the first t-1 words have already been decoded. W_3d, W_GCN, W_SF, and W_e are parameter weight matrices of the network. Each feature is transformed to the same dimension by its parameter weight matrix, and the results are added. After passing through the nonlinear tanh layer, the sum is transformed into a 3×1 vector using the parameter weight matrix W_a^T and is finally normalized by softmax. Each element of the resulting vector is the weight of one feature, and the elements sum to 1. Intuitively, the formula performs an attention operation over the features to obtain an attention weight for each feature.
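The steps described above (project each input to a common dimension, add, apply tanh, map to three scores with W_a^T, apply softmax) can be sketched as follows. This is a reading of the prose, not the equation from the disclosure; in particular, summarising E_1:t-1 as a single vector `e_prev`, the hidden size, and the random parameter values are assumptions.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feature_gate(v_3d, v_gcn, v_sf, e_prev, params):
    """Return a_t: one weight per feature type, summing to 1."""
    projected = [
        matvec(params["W_3d"], v_3d),
        matvec(params["W_GCN"], v_gcn),
        matvec(params["W_SF"], v_sf),
        matvec(params["W_e"], e_prev),  # summary of previously decoded words
    ]
    # Add the projections element-wise, then apply the tanh nonlinearity.
    hidden = [math.tanh(sum(parts)) for parts in zip(*projected)]
    # W_a^T maps the hidden vector to three scores, one per feature type.
    return softmax(matvec(params["W_a_T"], hidden))


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
HIDDEN = 4
params = {
    "W_3d": rand_matrix(HIDDEN, 2),
    "W_GCN": rand_matrix(HIDDEN, 3),
    "W_SF": rand_matrix(HIDDEN, 2),
    "W_e": rand_matrix(HIDDEN, 2),
    "W_a_T": rand_matrix(3, HIDDEN),
}
a_t = feature_gate([0.2, 0.1], [0.5, 0.3, 0.9], [0.7, 0.4], [0.1, 0.0], params)
```

Because `e_prev` changes as more words are decoded, the gate can produce a different weight vector at each decoding step.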

In the example shown in FIG. 12, the weights of the spatio-temporal visual feature, the graph convolution feature, and the semantic feature are 0.3, 0.2, and 0.5, respectively.

It should be understood that time 1, time t-1, and time t above are all relative notions of time. They are the relative decoding times at which the decoder obtains the 1st word, the (t-1)-th word, and the t-th word of the video captioning information when decoding and outputting that information. At each time other than time 1, the weight of each type of characteristic information may be obtained from the characteristic information and the words decoded before the current time.

The solution provided by the embodiment of the present disclosure represents video information using several different features. In addition to the spatio-temporal scene graph feature (that is, the graph convolution feature), it can use spatio-temporal visual features and semantic features, for three types of features in total. The graph convolution feature is more concerned with the relationships between targets and their attributes, the spatio-temporal visual feature is more concerned with temporal information, and the semantic feature is more concerned with the overall semantic information contained in the video. The feature selection network (for example, a feature selection gate) can select different features for different videos. The output of the feature selection gate is a set of weight values a_t, each representing the weight of one feature. For example, some videos are longer and temporal information matters more, so the weight of the spatio-temporal visual feature is higher. Other videos are shorter and contain more objects, so the relationships between objects and the attributes of the objects matter more, and the weight of the graph convolution feature is higher.

After the weight of each piece of characteristic information is obtained, the subsequent processing may proceed in one of two optional ways. In the first, each piece of characteristic information is weighted by its own weight, and the weighted features are used separately in subsequent processing. In the second, the pieces of characteristic information are weighted and integrated according to their weights to obtain an integrated feature, and the integrated feature is used in subsequent processing. In the example shown in FIG. 12, the integrated feature (that is, 0.3 * spatio-temporal visual features + 0.2 * graph convolution features + 0.5 * semantic features) may be used for subsequent processing, or the features may be processed separately with their weights to obtain the weighted features 0.3 * spatio-temporal visual features, 0.2 * graph convolution features, and 0.5 * semantic features. By adaptively assigning different weights to different features, different types of features can carry different importance when the text caption is generated, according to the characteristics of the multimedia data itself.
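The two options described above can be illustrated with a minimal Python sketch. The function names, the toy two-dimensional vectors, and the assumption that all three features share the same dimension are hypothetical and chosen only for illustration; only the weights 0.3, 0.2, and 0.5 come from the FIG. 12 example.

```python
def weight_features(features, weights):
    """Option 1: scale each feature vector by its own weight and keep them separate."""
    return {name: [weights[name] * v for v in vec] for name, vec in features.items()}


def fuse_features(features, weights):
    """Option 2: sum the weighted feature vectors into one integrated feature.

    Assumes every feature vector has the same dimension (an illustrative assumption).
    """
    weighted = weight_features(features, weights)
    dim = len(next(iter(weighted.values())))
    return [sum(vec[i] for vec in weighted.values()) for i in range(dim)]


# Toy feature vectors (hypothetical values).
features = {
    "visual": [1.0, 0.0],    # spatio-temporal visual feature
    "graph": [0.0, 1.0],     # graph convolution feature
    "semantic": [2.0, 2.0],  # semantic feature
}
# Weights from the FIG. 12 example.
weights = {"visual": 0.3, "graph": 0.2, "semantic": 0.5}

separate = weight_features(features, weights)
fused = fuse_features(features, weights)
```

In practice the weights would come from the feature selection gate rather than being fixed constants.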

In another embodiment of the present disclosure, generating the text caption of the multimedia data based on the extracted characteristic information includes: encoding the obtained characteristic information using a self-attention-based encoder; and inputting the encoded characteristic information into a decoder to generate the text caption of the multimedia data. When the multimedia data is an image, the self-attention-based encoder is a self-attention-based intra-frame encoder. When the multimedia data is a video, the self-attention-based encoder includes a self-attention-based intra-frame encoder and/or a self-attention-based inter-frame encoder.

In other words, the self-attention-based intra-frame encoder may be used to encode each piece of obtained characteristic information (the weighted features if weighting is performed, or the weighted integrated feature if weighted integration is performed) to obtain deeper and richer characteristic information, and the encoded features are input into the decoder to generate the corresponding text caption. For a video, a self-attention-based inter-frame encoder may also be used to encode the obtained video features, and the encoded features are input into the decoder to generate the text caption of the video. The decoder may be a self-attention-based decoder. For example, for an image, it may be an attention-based intra-frame decoder, so that intra-frame information is learned better during decoding. For a video, it may be a self-attention-based intra-frame decoder and/or a self-attention-based inter-frame decoder, so that intra-frame information and/or inter-frame information is learned better during decoding and a more accurate text caption of the video is obtained.

Take as an example a text caption of a video obtained based on the graph convolution features (as can be understood, the spatio-temporal visual features and/or the semantic features may also be used together with the graph convolution features). For each selected frame, a vector for the graph convolution feature may be obtained. These feature vectors are input into the self-attention-based decoder so that it learns the inter-frame information between frames.

When generating the text caption of the video based on the graph convolution features, the decoder outputs, at each time step, the words that may be output and their output probabilities, according to the decoder input and the graph convolution features. As an example, the self-attention-based decoder may be implemented as a transformer decoder.

In the process of generating the text caption of the video, the inputs of the decoder at the first time step are the global feature and the start token. The output of the self-attention-based decoder is a set of probability values, each of which represents the probability of one word, and the word with the maximum probability is selected as the output of the first time step. The input at the second time step is the global feature, the start token, and the output of the first time step, and the word with the maximum probability is again selected as the output. The input at the third time step is the global feature, the start token, the output of the first time step, and the output of the second time step. This loop continues until, at some time step, the word with the maximum probability is the stop token, at which point the loop ends and a sentence sequence is obtained. That is, the output of the self-attention-based decoder is the final sentence sequence of the video caption.

For example, suppose the vocabulary is a, b, c, d, e. At the first time step, the transformer decoder may output the words of the vocabulary with their output probabilities: word a (for example, 60%), word b (for example, 40%), and words c, d, and e (0%). At the second time step, it may output word c (for example, 60%), word d (for example, 20%), word e (for example, 20%), and words a and b (0%), and so on. In this case, according to an embodiment of the present disclosure, the video captioning sentence may be generated by a greedy decoding algorithm, that is, by combining in chronological order the words that have the maximum output probability at each time step. However, the present disclosure is not limited thereto, and other decoding methods may be used to generate the video captioning sentence.
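The greedy procedure can be sketched in Python as follows. This is an illustrative sketch, not the disclosed implementation: `step_fn` stands in for the decoder, the token strings `<bos>` and `<eos>` are hypothetical, and the third table entry (the stop token winning at the third time step) is an assumption added so that the toy loop terminates. The first two table entries reuse the probabilities of the example above.

```python
def greedy_decode(step_fn, start_token="<bos>", stop_token="<eos>", max_len=20):
    """Pick the most probable word at each step until the stop token wins."""
    tokens = [start_token]
    while len(tokens) - 1 < max_len:
        probs = step_fn(tokens)            # dict: word -> output probability
        word = max(probs, key=probs.get)   # greedy choice: maximum probability
        if word == stop_token:
            break
        tokens.append(word)                # the output is fed back as input
    return tokens[1:]                      # drop the start token


# Toy stand-in for the decoder: one probability table per time step.
TABLE = [
    {"a": 0.6, "b": 0.4, "c": 0.0, "d": 0.0, "e": 0.0, "<eos>": 0.0},
    {"a": 0.0, "b": 0.0, "c": 0.6, "d": 0.2, "e": 0.2, "<eos>": 0.0},
    {"<eos>": 1.0},  # assumed third step, not part of the original example
]


def toy_step(tokens):
    """Return the table for the current step (len(tokens) - 1 words decoded so far)."""
    return TABLE[min(len(tokens) - 1, len(TABLE) - 1)]


caption = greedy_decode(toy_step)
```

With these tables the loop selects a, then c, and then stops. A beam search would be one of the "other decoding methods" that could replace the `max` call.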

According to an embodiment of the present disclosure, the video caption sentence may be obtained by combining, in chronological order, the words that have the maximum output probability at each time step, until the stop token is the output with the maximum probability. The role of the self-attention-based decoder is to learn inter-frame information. The self-attention-based decoder has a structure based on the self-attention mechanism that includes a multi-head attention layer, a layer normalization layer, and a feed-forward network layer. Compared with a decoder with an RNN structure, the self-attention-based decoder has the advantages of faster training, fewer parameters, and easier learning of long-range dependencies.

As an example, FIG. 13A shows a schematic structural diagram of a self-attention-based codec model provided by an embodiment of the present disclosure. As shown in FIG. 13A, the self-attention-based codec model of the present disclosure is divided into two parts: a self-attention-based encoder and a self-attention-based decoder. Optionally, the self-attention-based codec may be implemented as a transformer codec. In this example, the graph convolution feature is again used as the characteristic information of the video. As shown in FIG. 13A, the self-attention-based encoder of this example includes a multi-head attention layer, a feed-forward network, and a layer normalization layer. The self-attention-based decoder may include a masked multi-head attention layer, a multi-head attention layer, a layer normalization layer, and a feed-forward network layer.

The encoder shown in FIG. 13A, whose structure is based on the self-attention mechanism, may have a multi-block structure, and the structures of the blocks may be the same or different. The blocks may be cascaded in sequence, that is, the output of the current block is the input of the next block. In an optional solution, the encoder may, for example, include six blocks with the same structure (one of which is shown in FIG. 13A). Each block mainly includes two parts, a multi-head attention layer and a position-wise fully connected feed-forward network, and the feed-forward network may be implemented by two linear projection layers that include a ReLU activation operation. The multi-head attention layer and the feed-forward network in each block may each be followed by their own layer normalization layer. Specifically, as shown in FIG. 13A, each block may include, in order, a multi-head attention layer, a layer normalization layer, a feed-forward network layer, and a layer normalization layer. The blocks are stacked to obtain the encoder. The input of the encoder is the graph convolution feature.

When the graph convolution features are encoded with the self-attention-based encoder, encoder embedding may first be performed on the graph convolution features. Its purpose is to change the dimension of the characteristic information so that it is suitable for the subsequent encoder to process. The characteristic information output by the embedding is then input into the encoder to be encoded.

The processing of the first block of the encoder is used below as an illustrative example. The graph convolution feature is first processed by the multi-head attention layer, the output is integrated with the output of the encoder embedding (for example, by addition), and layer normalization is performed on the integrated result. The normalized result is processed by the feed-forward network, then integrated with the output of the preceding layer normalization layer (for example, by addition), and layer normalization is performed again to obtain the output of the first block. The output of the first block is used as the input of the second block, and the encoding process is carried out block by block to obtain the output of the encoder (that is, the encoder output in FIG. 13).
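The data flow of one encoder block (attention, addition, layer normalization, feed-forward network, addition, layer normalization) can be sketched in plain Python. Everything below is a simplified illustration under stated assumptions: the attention is single-head with identity projections rather than multi-head with learned projections, the feed-forward network is a fixed ReLU-and-scale stand-in for two learned linear layers, and layer normalization has no learned gain or bias.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(rows):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(rows[0])
    out = []
    for q in rows:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in rows])
        out.append([sum(wj * v[i] for wj, v in zip(w, rows)) for i in range(d)])
    return out


def feed_forward(x):
    """Toy position-wise feed-forward step: ReLU followed by a fixed scaling."""
    return [0.5 * max(0.0, v) for v in x]


def encoder_block(rows):
    attended = self_attention(rows)
    # Residual addition with the block input, then layer normalization.
    h = [layer_norm([a + b for a, b in zip(r, att)]) for r, att in zip(rows, attended)]
    # Feed-forward, residual addition with the previous normalized output, then normalization.
    return [layer_norm([a + b for a, b in zip(x, feed_forward(x))]) for x in h]


frames = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.5]]  # hypothetical per-frame feature vectors
out = encoder_block(frames)
```

Stacking six such calls, each taking the previous output as input, mirrors the cascade of blocks described above.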

The decoder shown in FIG. 13A, whose structure is based on the self-attention mechanism, may also have a multi-block structure. The structures of the blocks may be the same or different, and the blocks may be cascaded in sequence. For example, the decoder may include six blocks with the same structure, and each block may include three parts: masked multi-head attention, multi-head self-attention corresponding to the features, and a feed-forward network. The multi-head attention layers and the feed-forward network in each block may each be followed by their own layer normalization layer. Specifically, each block may include, in order, a masked multi-head attention layer, a layer normalization layer, a multi-head attention layer, a layer normalization layer, a feed-forward network layer, and a layer normalization layer, and the blocks are stacked to obtain the decoder.

The decoder inputs in the figure are the global characteristic information and the word vectors. The feature vector of a target region extracted from each frame by the feature extraction network may be referred to as a local feature or a regional feature, so the regional features of several target regions can be obtained for a frame. The global feature corresponding to the frame may be obtained by averaging the regional features, or by other methods (such as weighting). In addition, the start token and the word vectors already predicted during the iterative prediction process are obtained (for the first prediction of the iteration, only the start token is obtained; when the model is being trained, all word vectors may be input). Decoder embedding may be performed on the above decoder inputs (that is, the global feature, the start token, and the word vectors predicted during the iterative prediction process) to change the dimension of the characteristic information so that it is suitable for the subsequent decoder processing. The global characteristic information, start token, and word vectors output by the embedding may then be input into the decoder for decoding.
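Obtaining a frame's global feature by averaging its regional features can be sketched as below. The function name and the toy region vectors are hypothetical; the averaging itself is the method named in the text.

```python
def global_feature(regional_features):
    """Average the regional feature vectors of one frame, element by element."""
    n = len(regional_features)
    dim = len(regional_features[0])
    return [sum(f[i] for f in regional_features) / n for i in range(dim)]


# Three target regions detected in one frame (hypothetical values).
regions = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
g = global_feature(regions)
```

A weighted variant would multiply each region vector by its weight before summing, which corresponds to the "other methods (such as weighting)" mentioned above.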

The processing of the first block is used below as an illustrative example. The global characteristic information, start token, and word vectors output by the embedding are processed by the masked multi-head attention layer, the result is integrated with the output of the decoder embedding (for example, by addition), and layer normalization is then performed. The normalized result and the output of the encoder are processed together by the multi-head attention layer. If the encoder includes an inter-frame encoder, the output of the encoder is the result output by the inter-frame encoder. If the encoder has only intra-frame encoders, the output of the encoder may be a result obtained by integrating the results output by the intra-frame encoders, for example by concatenating them. The output of the multi-head attention layer is then integrated with the output of the preceding layer normalization layer (for example, by addition), and layer normalization is performed. The normalized result is processed by the feed-forward network, integrated with the output of the preceding layer normalization layer, and normalized again, and the result is the output of the first block. The output of the first block is used as the input of the second block, and the decoding process is carried out block by block to obtain the output of the decoder.

The output of the decoder is linearly transformed by a linear layer and then processed by a softmax layer, which outputs the word vectors that may be output at the current time step (that is, the prediction of this iteration) together with their output probabilities, for example word a and word b and the output probabilities of word a and word b. The decoder, the linear layer, and the softmax layer repeat the above iterative prediction process until the stop token is the output with the maximum probability, and the captioning information corresponding to the video can be obtained from the word vectors obtained in each iteration.

It can be understood that the above examples use the graph convolution feature only for illustration. In practical applications, the characteristic information may include, in addition to the graph convolution features, the spatio-temporal visual features and/or the semantic features of the video. In this case, the encoding process may encode each piece of characteristic information separately, and the decoder may decode the feature obtained by integrating the encoded features. Here, the feature selection network described above may be used to determine the weights of the encoded features, the encoded features are integrated according to the weights to form the output of the encoder, and the decoder obtains the text caption of the video based on the integrated feature. It is also possible to input the encoded features, according to the weights, into different cross-attention layers of the decoder to be processed separately, after which the decoder obtains the text caption of the video.

In another embodiment of the present disclosure, generating the text caption of the multimedia data based on the extracted characteristic information includes: inputting the extracted characteristic information into each of a plurality of decoders; and generating the text caption of the multimedia data based on the decoding results of the decoders.

To improve the decoding capability and the expressive power of the captioning information, in the solution of this embodiment of the present disclosure, a decoder bank including a plurality of decoders is used when the encoded results are processed. Each decoder decodes the encoded results separately, which improves the overall decoding capability, and the final text caption information is obtained based on the decoding results of the individual decoders. For example, the final output may be obtained by averaging the decoding results of the decoders. The decoder bank may include two or more decoders, and the types of decoders included in the decoder bank are not limited in the embodiments of the present disclosure. For example, the decoder bank may include an LSTM-based decoder, a gated recurrent unit (GRU)-based decoder, a self-attention-based decoder, and so on, and the outputs of the decoders are averaged to obtain the final output result.
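Averaging the outputs of a decoder bank can be sketched as follows. The sketch assumes that each decoder emits a probability distribution over the same vocabulary at a given time step and that "averaging the decoding results" means an element-wise mean of those distributions; the variable names and the numbers are hypothetical.

```python
def average_distributions(distributions):
    """Element-wise mean of per-decoder word distributions over a shared vocabulary."""
    n = len(distributions)
    words = distributions[0].keys()
    return {w: sum(d[w] for d in distributions) / n for w in words}


# Outputs of three decoders at one time step (hypothetical values).
p_lstm = {"a": 0.6, "b": 0.4}
p_gru = {"a": 0.2, "b": 0.8}
p_attn = {"a": 0.4, "b": 0.6}

avg = average_distributions([p_lstm, p_gru, p_attn])
best = max(avg, key=avg.get)  # word emitted by the bank at this step
```

Here the LSTM-based decoder alone would emit a, but the bank emits b because the averaged distribution favors it.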

Tests show that as the number of decoders in the decoder bank increases from 2, the results keep improving, but once the number exceeds 4 the improvement in decoding performance levels off. Increasing the number of decoders also increases the complexity of the system, so in practical applications the number of decoders needs to be chosen as a trade-off between performance and complexity. Optionally, two or three decoders are generally chosen. For example, two decoders may be chosen for an on-device system, and three or more decoders may be chosen for use on the cloud side.

As for selecting the decoders in the decoder bank, many decoder banks may be pre-trained. The decoders included in different decoder banks may differ, and the number of decoders in each bank may differ. In practical applications, the decoder bank that performs best on the validation set or the test set may be selected from among these decoder banks. When selecting a decoder bank, two aspects may be considered: its decoding efficiency and its decoding performance.

When several decoders are used to decode separately, a consistency loss needs to be added as a constraint on the decoder outputs while the decoder bank is trained, so that the outputs of the decoders stay close to the ground truth. This prevents the performance of the different decoders in the bank from diverging widely, which would otherwise make the decoder bank perform worse than a single decoder. Assume that the decoder bank has two decoders whose outputs are the two probability distributions p1 and p2, respectively. The consistency loss is defined as in Equation 7 below:

Here, D_KL denotes the K-L divergence.
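Equation 7 itself is not reproduced in this text, so its exact form is not known here. As a hedged illustration only, the sketch below assumes a symmetric consistency loss, the sum of the two K-L divergences between p1 and p2. The symmetric form, the function names, and the `eps` smoothing term are all assumptions and may differ from the equation in the original publication.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def consistency_loss(p1, p2):
    """Assumed symmetric form: D_KL(p1 || p2) + D_KL(p2 || p1)."""
    return kl_divergence(p1, p2) + kl_divergence(p2, p1)


same = consistency_loss([0.6, 0.4], [0.6, 0.4])     # identical outputs
differ = consistency_loss([0.6, 0.4], [0.2, 0.8])   # diverging outputs
```

The loss is zero when the two decoders agree and grows as their output distributions diverge, which is the behavior the constraint described above relies on.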

In each of the optional solutions in the embodiments of the present disclosure, an attention-based neural network may be used for encoding or decoding, because it can capture the global dependencies between different input positions and target positions at the same time, so that long-term dependencies are learned better, and because this type of neural network allows more efficient parallel computation when processing data. In addition, for self-attention-based encoders or decoders, and in particular for self-attention-based decoders, multiple cross-attention layers may be stacked. This is because a self-attention-based neural network learns the association information between elements of the same feature vector well, while cross-attention (whose core is multi-head attention) learns the association information between different feature vectors well. Therefore, by adding cross-attention layers to the self-attention-based decoder, the decoder can learn well not only the associations between the elements of a feature vector but also the associations between different features, so that it can better handle many different types of features and obtain better captioning information.

As an example, FIG. 13B shows a schematic structural diagram of a self-attention-based decoder provided in an embodiment of the present disclosure. In this example, the encoder part may include a spatio-temporal visual encoder (a spatio-temporal feature extraction network) and a semantic encoder (a semantic prediction network). Through these two encoders, the encoder output may include the spatio-temporal visual features and the semantic features. When the decoder of FIG. 13A is used for decoding, the output of the semantic encoder is input into the semantic cross-attention layer, and the output of the spatio-temporal visual feature encoder is input into the spatio-temporal visual cross-attention layer. The masked self-attention layer masks the inputs of later time steps, which ensures that during training an earlier time step does not receive information from later time steps. The input of the masked self-attention layer may correspond to the decoder input shown in FIG. 13A, which includes the start token; for example, it may be the feature vector processed by the decoder embedding layer.
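The masking performed by the masked self-attention layer can be sketched with a causal mask. This is a generic illustration of the technique, not the disclosed implementation: the function names are hypothetical, and setting masked scores to negative infinity before the softmax is one common way to realize such a mask.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_scores(scores):
    """Replace attention scores for later positions with -inf so softmax ignores them."""
    mask = causal_mask(len(scores))
    return [
        [s if mask[i][j] else float("-inf") for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


mask3 = causal_mask(3)
masked = masked_scores([[1.0, 2.0], [3.0, 4.0]])
```

Row i of the mask corresponds to time step i, so the first time step sees only itself while the last time step sees every earlier position.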

In an optional embodiment of the present disclosure, generating the text caption of the multimedia data based on the extracted characteristic information includes: obtaining length information of the text caption to be generated; and generating the text caption of the video based on the length information and the extracted characteristic information.

To solve the problem that the prior art cannot generate video captions or image captions of different lengths for users, the solution provided herein obtains length information of the text caption to be generated and generates a text caption of the corresponding length, so as to meet the needs of different application scenarios. The length information may be relative length information such as "long" (for example, the generated captioning information has 20 words or more), "medium" (for example, the generated captioning information has between 10 and 20 words), or "short" (for example, the generated captioning information has fewer than 10 words). The length information may be obtained from the user. For example, a prompt may be sent asking the user whether long or short captioning information should be generated, and the user may give the corresponding instruction in response to the prompt. The length information may also be obtained by analyzing the video. When the video is captured in real time, the video may be analyzed to determine the current application scenario, and different length information may be determined for different application scenarios.

In this solution, unlike the prior art, the start token used when training the decoder may be a start token that carries length information. For each training sample, the start identifier may be a start token indicating that a longer caption or a shorter caption needs to be generated, with different start tokens corresponding to different sample captioning label information. When the decoder is trained on such samples, it can learn the mapping between the start token corresponding to each piece of length information and the corresponding length of the captioning information. When decoding is performed with the trained decoder, the start token corresponding to the obtained length information is used as the start token of decoding, so that a video caption or image caption meeting the length requirement is generated.

That is, in the solution of the embodiment of the present disclosure, the "BOS" (Begin of Sentence) token of the existing approach is replaced during training with length information such as "short", "medium" or "long" in order to control the length of the output captioning information. In actual training, different length identifiers may be used for different length information. Specifically, when the training target is short captioning information, the start token "short" is input; medium captioning information corresponds to the start token "medium", and long captioning information corresponds to the start token "long". In this way, sentence lengths become associated with "short", "medium" and "long" during training. During online use, "short", "medium" or "long" may be input according to the user's needs to obtain captioning information of the corresponding length.
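The length-controlled start token described above can be illustrated with a minimal sketch. The token spellings (`<short>`, `<medium>`, `<long>`), the function names, and the treatment of a caption of exactly 20 words as "medium" are assumptions made for illustration; the disclosure only gives the example word ranges and states that the length information replaces "BOS".

```python
from typing import List

# Hypothetical token spellings; the disclosure only names "short", "medium", "long".
LENGTH_TOKENS = {"short": "<short>", "medium": "<medium>", "long": "<long>"}


def length_bucket(caption: str) -> str:
    """Map a caption to a relative length label using the example word ranges."""
    n_words = len(caption.split())
    if n_words < 10:
        return "short"
    if n_words <= 20:  # assumption: exactly 20 words is treated as "medium"
        return "medium"
    return "long"


def training_decoder_input(caption: str) -> List[str]:
    """Training: prefix the reference caption with its length token instead of BOS."""
    return [LENGTH_TOKENS[length_bucket(caption)]] + caption.split()


def inference_start_token(requested_length: str) -> str:
    """Online use: the requested length selects the start token of decoding."""
    return LENGTH_TOKENS[requested_length]
```

With this arrangement the decoder sees the same kind of first token during training and at inference, which is what would let it associate the token with a caption length.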

Based on the methods in the alternative embodiments of the present disclosure, the intra-frame information of an image or of each frame of a video (such as the objects, attributes and relationships in the image, as well as the semantic information and spatio-temporal visual features of the video or image) can be analyzed in detail, and the image information can be fully used to generate a more accurate text caption. It can be seen from the foregoing description that, based on the method for generating video captioning information provided in the embodiments of the present disclosure, various specific implementations may be selected in practice according to actual application requirements.

In addition, in the solution provided in the embodiment of the present disclosure, when extracting the characteristic information of the multimedia data, besides using a feature extraction network to extract the features of each region of the image, an encoder that learns the relationships between the features of the regions (i.e., a relation prediction network) may be added. The encoder may be implemented as a self-attention-based encoder (e.g., a Transformer encoder), which improves feature encoding and therefore the quality of the obtained video or image captioning information. Furthermore, when obtaining captioning information in an embodiment of the present disclosure, a self-attention-based decoder (e.g., a Transformer decoder) may be used instead of the conventional RNN decoder structure. Compared with a conventional RNN, a self-attention-based decoder trains faster, has fewer parameters, and learns long-range dependencies more easily.

Taking video as an example, the method for generating captioning information of multimedia data provided by the embodiments of the present disclosure is described below in combination with several alternative embodiments.

Example 1

FIG. 14 shows a schematic flowchart of a method for generating video captioning information provided by an alternative embodiment of the present disclosure. As shown in FIG. 14, the method for generating video captioning information may include the following steps.

Step S301: Frames are selected from the video.

Step S302: A scene graph is constructed separately for each of the frames.

Step S303: A graph convolution network is used to obtain the graph convolution features of each frame based on the constructed scene graph.

Step S304: A text caption of the video is generated based on the obtained graph convolution features.

Optionally, after the graph convolution features of each frame are obtained, the text caption of the video may be obtained based on them. For example, the obtained graph convolution features may be input to a decoder, and the text caption of the given video is obtained by decoding them. As an alternative, a self-attention-based decoder may be used to generate the text caption of the video from the graph convolution features of each frame, but the present disclosure is not limited thereto.

Example 2

FIG. 15 shows a schematic flowchart of an alternative method for generating video captioning information given in this example. As shown in FIG. 15, this alternative implementation may include the following steps.

Step S1201: Frames are selected from a given video, as at 501 in FIG. 17A and 1001 in FIG. 17B.

Step S1202: A feature extraction network is used to obtain several target regions and their features (e.g., regional features or local features) from each of the selected frames, as at 502 in FIG. 17A and 1002 in FIG. 17B. For each image frame, the Faster R-CNN algorithm may be used to extract the target regions in the frame and the features of those target regions.

Step S1203: A relation prediction network is applied to the extracted regional features of the respective target regions to obtain the relation features between the target regions, as shown in the example of FIG. 8.

Step S1204: A scene graph of each image frame is constructed based on the obtained relation features between the respective target regions, as at 503 in FIG. 17A.

Step S1205: A graph convolution network is used to obtain the graph convolution features of each frame based on the nodes and edges in the scene graph of that frame, as shown at 504 in FIG. 17A.

Step S1206: A text caption of the video is generated based on the graph convolution features. Optionally, a self-attention-based decoder may be used to learn the inter-frame information of the selected frames from the obtained graph convolution features in order to generate the text caption of the given video. For example, a graph convolution feature vector may be obtained for each selected frame, and these vectors are input to the self-attention-based decoder, which learns the inter-frame information between the frames, as shown at 505 in FIG. 17A. The graph convolution feature vector of each frame may be input to the inter-frame Transformer decoder, and the text caption of the video, for example "A person is putting a pizza in the oven", is obtained based on the output of the decoder.
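As a rough illustration of step S1205, the sketch below applies one generic graph convolution layer to a toy scene graph. It is not the graph convolution network of the disclosure: it assumes undirected edges, ignores relation labels on the edges, and uses mean aggregation followed by a linear map and ReLU. The node names, the function name and the weight matrix are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def graph_conv_layer(node_feats: Dict[str, Vector],
                     edges: List[Tuple[str, str]],
                     weight: List[Vector]) -> Dict[str, Vector]:
    """One simplified graph convolution: mean over self and neighbours, then W and ReLU."""
    # Neighbour lists with a self-loop on every node; edges are treated as undirected.
    neighbours = {node: [node] for node in node_feats}
    for src, dst in edges:
        neighbours[src].append(dst)
        neighbours[dst].append(src)

    out: Dict[str, Vector] = {}
    for node, nbrs in neighbours.items():
        dim = len(node_feats[node])
        # Mean-aggregate the features of the node and its neighbours.
        agg = [sum(node_feats[m][k] for m in nbrs) / len(nbrs) for k in range(dim)]
        # Each row of `weight` produces one output dimension; ReLU clips negatives.
        out[node] = [max(0.0, sum(w * a for w, a in zip(row, agg))) for row in weight]
    return out


# Toy scene graph for one frame: person - pizza - oven.
node_feats = {"person": [1.0, 0.0], "pizza": [0.0, 1.0], "oven": [1.0, 1.0]}
edges = [("person", "pizza"), ("pizza", "oven")]
identity = [[1.0, 0.0], [0.0, 1.0]]
features = graph_conv_layer(node_feats, edges, identity)
```

After the layer, each node feature mixes in information from the objects it is related to, which is the property the per-frame graph convolution features rely on.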

Example 3

FIG. 16 shows a schematic flowchart of an alternative method for generating video captioning information given in this example. Comparing FIG. 15 and FIG. 16, the first three steps of this example (FIG. 16) are the same as those of Example 2 (FIG. 15) above and are not described again. This example differs from Example 2 in that step S1304 is added, in which an attribute prediction network is applied to the extracted features of the respective target regions to obtain the attribute features of the respective target regions. For example, the attribute prediction network may be trained on the Visual Genome dataset, after which the attribute features of the respective target regions may be obtained using the trained attribute prediction network, as shown in the example of FIG. 9.

Accordingly, in step S1305, the scene graph of each frame may be constructed based on both the obtained attribute features of the respective target regions and the relation features between the respective target regions. Then, in step S1306, the graph convolution features of each frame are obtained from the scene graph constructed based on the attribute features and the relation features.

It should be noted that the order of steps S1303 and S1304 may be reversed, or steps S1303 and S1304 may be performed simultaneously.

After the graph convolution features of each frame are obtained, the text caption of the video may be generated in step S1307 from the obtained graph convolution features. A self-attention-based decoder may be used to learn the inter-frame information of the selected frames in order to generate the text caption of the given video.

For example, as shown at 505 in FIG. 17A, a self-attention-based decoder (e.g., the inter-frame Transformer decoder shown in the figure) may be used to learn the inter-frame information and generate the text caption of the video from the graph convolution features.

As another example, as shown at 1005, 1006 and 1007 in FIG. 17B, the obtained graph convolution features may be encoded separately, and the encoded features may be processed to obtain feature vectors of a target dimension. The cross-connection lines between scene graph construction and graph convolution features in FIG. 17B indicate that the scene graph used may be a spatio-temporal scene graph, that is, inter-frame information may be taken into account when constructing the scene graph; a purely spatial scene graph may of course also be used. Specifically, a self-attention-based intra-frame encoder (e.g., the intra-frame Transformer encoder shown in FIG. 17B) may be used to encode the graph convolution features of each frame separately. The function of the self-attention-based intra-frame encoder is to learn intra-frame information, that is, to use the self-attention mechanism to further learn the association information between the objects within a frame. Optionally, the self-attention-based intra-frame encoder has a structure built on the self-attention mechanism, which includes a multi-head attention layer, a layer normalization layer and a feed-forward network layer. Next, the output of the self-attention-based intra-frame encoder is processed to obtain, for each frame, a feature vector of the target dimension.

For example, assume that the sequence output by the self-attention-based intra-frame encoder has dimension T*C, where T is the number of nodes in the scene graph and C is the feature dimension of the feature vector corresponding to each node. The self-attention-based intra-frame encoder uses the self-attention mechanism to learn how each element of the sequence relates to the other information and outputs the learned sequence, whose size equals that of the input sequence, T*C. The output is averaged to obtain a feature vector of dimension 1*C. In this way, a 1*C feature vector is obtained for each frame.
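The averaging step above amounts to a mean over the T node vectors. A minimal sketch, with a hypothetical function name and toy numbers (T = 3 nodes, C = 4 feature dimensions):

```python
from typing import List


def mean_pool(sequence: List[List[float]]) -> List[float]:
    """Average a T x C sequence over its T rows to get one C-dimensional vector."""
    t = len(sequence)
    c = len(sequence[0])
    return [sum(row[k] for row in sequence) / t for k in range(c)]


# Toy encoder output for one frame: T = 3 scene-graph nodes, C = 4 features each.
encoded_nodes = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
frame_vector = mean_pool(encoded_nodes)  # one 1 x C vector for the frame
```

Because the mean does not depend on T, frames whose scene graphs have different numbers of nodes still yield vectors of the same dimension C.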

An encoded feature vector may thus be obtained for each selected frame. These encoded feature vectors are input to a self-attention-based inter-frame encoder (e.g., the inter-frame Transformer encoder shown in FIG. 17B) and encoded again to obtain feature vectors of the target dimension.

After that, based on the encoded features, a self-attention-based decoder (e.g., the inter-frame Transformer decoder shown in FIG. 17B) is used to learn the inter-frame information and generate the text caption of the given video. The encoded features are input to the self-attention-based inter-frame decoder, which learns the inter-frame information between the frames, and the text caption of the given video is generated by learning from the input features.

As another example, only the self-attention-based intra-frame encoder may be used to encode the graph convolution features of each obtained frame separately, and the encoded features are input to the decoder to generate the text caption of the video. Alternatively, only the self-attention-based inter-frame encoder may be used to encode the graph convolution features of the obtained frames, and the encoded features are input to the decoder to generate the text caption of the video. That is, only operation 1005 of FIG. 17B or only operation 1006 of FIG. 17B may be performed, or operations 1005 and 1006 of FIG. 17B may be performed together.

After this series of processing is performed on the given video, the text caption of the given video is obtained. As shown in FIG. 17B, the text caption "A man is putting a pizza in the oven" may be generated from the selected frames.

Example 4

FIG. 18 shows a schematic flowchart of an alternative method for generating video captioning information given in this example. Comparing FIG. 18 and FIG. 16, it can be seen that this example differs from Example 3 in step S1505: in this example, time information is also considered when constructing the scene graph of each frame. Specifically, a spatial scene graph may be constructed for each image based on the obtained attribute features of the respective target regions and the relation features between the respective target regions, and time information is then added between the scene graphs of the frames to obtain a spatio-temporal scene graph, as in the example shown in FIG. 10.

This example may also be implemented on the basis of Example 1, that is, step S1504 may be omitted and the attribute features of the target regions may be left out of consideration.

After the spatio-temporal scene graph of each frame is obtained, in step S1506 a graph convolution network is used to obtain the graph convolution features of each frame based on the nodes and edges in the spatio-temporal scene graph. Then, in step S1507, the text caption of the given video is generated from the obtained graph convolution features. For example, a self-attention-based encoder may be used to encode the graph convolution features, and a self-attention-based decoder is then used to learn the inter-frame information from the encoded graph convolution features and generate the text caption of the video.

Example 5

FIG. 19 shows a schematic flowchart of an alternative method for generating video captioning information given in this example. Comparing FIG. 19 and FIG. 16, it can be seen that this example (FIG. 19) differs from Example 3 (FIG. 16) above in the following respects.

Step S1602: For each of the selected frames, a feature extraction network is used to obtain several target regions and the features of the respective target regions (i.e., regional features or local features), as well as the spatio-temporal visual features of the video.

Compared with the steps of extracting regional features in the examples above, the features extracted in this step may also include the spatio-temporal visual features of the video. Optionally, as shown in FIG. 20, the spatio-temporal visual features may be obtained through a spatio-temporal feature extraction network.

Step S1603: Based on the selected frames, the semantic features of the video are extracted through a semantic feature extraction network. As shown in FIG. 20, the semantic features may be obtained from the frames through a semantic prediction network.

Steps S1604 to S1607 correspond to the steps in the previous examples of obtaining the relation features and attribute features, constructing the scene graph (a spatial scene graph or a spatio-temporal scene graph), and extracting the graph convolution features, and are not described again here.

Step S1608: The text caption of the video is generated from the graph convolution features of each frame together with the spatio-temporal visual features and the semantic features described above. Specifically, a plurality of decoders (the decoder bank shown in FIG. 20) may be used to learn the inter-frame information and generate the text caption of the video from the spatio-temporal visual features, the semantic features and the graph convolution features.

Each decoder may be a self-attention-based decoder or an RNN-based decoder. Specifically, the spatio-temporal visual features, the semantic features and the graph convolution features may be input to the decoder bank, which learns the inter-frame information between the frames, and the results of the individual decoders are then averaged to obtain the final decoding result. The text caption of the given video is generated by learning from the input features.
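The averaging over the decoder bank can be sketched as follows. The disclosure does not state which quantity is averaged; this sketch assumes, purely for illustration, that each decoder emits a word-probability distribution for the current step and that these distributions are averaged word by word. The function name and the toy numbers are hypothetical.

```python
from typing import Dict, List


def average_decoder_outputs(distributions: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-word probabilities over all decoders in the bank."""
    vocab = set().union(*distributions)  # every word seen by any decoder
    n = len(distributions)
    # A word missing from one decoder's output counts as probability 0.
    return {word: sum(d.get(word, 0.0) for d in distributions) / n for word in vocab}


# Toy outputs of three decoders for one decoding step.
bank_outputs = [
    {"man": 0.6, "woman": 0.4},
    {"man": 0.5, "woman": 0.5},
    {"man": 0.7, "woman": 0.3},
]
averaged = average_decoder_outputs(bank_outputs)
chosen_word = max(averaged, key=averaged.get)
```

Averaging in this way lets decoders driven by different features vote on each word, so a single unreliable decoder has limited influence on the result.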

The method for generating video captioning information provided in the present disclosure addresses the low accuracy of existing video captioning algorithms, which results from ignoring intra-frame information, and proposes an improved video captioning technique. When the method provided herein is implemented, features may be obtained with a graph convolution network, and the text caption of the video may be obtained from the decoding output of a self-attention structure. Specifically, after the graph convolution features are obtained, they may be input directly to a self-attention-based decoder, which decodes them and outputs the text caption of the given video. Alternatively, an encoding operation may first be performed on the obtained graph convolution features, and the encoded features are then input to a self-attention-based codec for inter-frame encoding and decoding, which outputs the text caption of the given video.

The accuracy of the captioning information generated by the present disclosure can be further improved by using a self-attention-based intra-frame encoder and a self-attention-based inter-frame encoder. Alternatively, the self-attention-based intra-frame encoder may be used to encode the obtained graph convolution features of each frame separately, and the encoded features are combined and input to the decoder to generate the text caption of the video; or the self-attention-based inter-frame encoder may be used to encode the obtained graph convolution features of the frames, and the encoded features are input to the decoder to generate the text caption of the video. That is, the intra-frame encoder and the inter-frame encoder may be used selectively. The solution provided by the embodiments of the present disclosure can make full use of intra-frame information and/or inter-frame information, and can therefore generate a more accurate text caption for a given video.

Taking an image as an example, several alternative implementations of generating image captioning information are described below.

Example 6

FIG. 21 shows a schematic flowchart of a method for generating image captioning information according to an embodiment of the present disclosure. As shown in the figure, the method includes the following steps.

Step S10: Characteristic information corresponding to the image is extracted.

Step S20: Captioning information corresponding to the image is obtained based on the extracted characteristic information.

As needed, the image may be obtained from local storage or a local database, or may be received from an external data source (such as the Internet, a server or a database) via an input device or a transmission medium.

Specifically, the characteristic information corresponding to the image may be extracted through a feature extraction network. As an alternative, the local features of the respective target regions may be extracted by a trained Faster R-CNN; for example, the feature vector obtained by average pooling the feature map of the Pool5 (i.e., region of interest (RoI)) layer may be taken as the feature.

After the characteristic information of the image is obtained, the captioning information corresponding to the image may be obtained through a decoder according to the extracted characteristic information. The specific structure of the decoder is not limited in the embodiments of the present disclosure. For example, the decoder may be implemented as a self-attention-based decoder (e.g., a Transformer decoder). Specifically, based on the extracted characteristic information and the input word vectors (which may include the start token and the word vectors predicted during the iterative prediction process), the decoder may output the words that can be output at each time step and their output probabilities (which may be normalized probabilities). As an alternative solution, the self-attention-based decoder may include a masked multi-head attention layer, a multi-head attention layer, a layer normalization layer and a feed-forward network layer.
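The text states only that the output probabilities may be normalized. One common way to normalize decoder scores, shown here as an assumption rather than as the method of the disclosure, is a softmax over the vocabulary. The scores and the function name below are hypothetical.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw per-word scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {word: math.exp(score - peak) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


# Hypothetical raw decoder scores for one time step.
probs = softmax({"a": 2.0, "b": 1.0, "c": 0.0})
```

The ordering of the scores is preserved, so the word with the highest raw score is also the word with the highest probability.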

For example, suppose the vocabulary is a, b, c, d, e. At a first time step, the decoder may output the words of the vocabulary that can be output together with their output probabilities: word a (e.g., 60%), word b (e.g., 40%), and words c, d and e (0%). At a second time step, it may output word c (e.g., 60%), word d (e.g., 20%), word e (e.g., 20%), and words a and b (0%), and so on. In this case, according to an embodiment of the present disclosure, the image captioning sentence may be generated by a greedy decoding algorithm, that is, by combining, in time order, the word with the maximum output probability at each time step. Alternatively, according to another exemplary embodiment of the present disclosure, the image captioning sentence may be generated by a Monte Carlo sampling method, that is, by performing Monte Carlo sampling according to the output probabilities of the words that can be output at each time step.
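The two decoding strategies can be compared on the example distributions above. This is a minimal sketch using only the Python standard library; the function names are illustrative, and the per-step distributions are given as fixed inputs, whereas a real decoder would condition each step on the previously chosen words.

```python
import random
from typing import Dict, List

# Per-step output probabilities from the example in the text.
steps: List[Dict[str, float]] = [
    {"a": 0.6, "b": 0.4, "c": 0.0, "d": 0.0, "e": 0.0},
    {"a": 0.0, "b": 0.0, "c": 0.6, "d": 0.2, "e": 0.2},
]


def greedy_decode(step_dists: List[Dict[str, float]]) -> List[str]:
    """Pick the most probable word at every time step."""
    return [max(dist, key=dist.get) for dist in step_dists]


def monte_carlo_decode(step_dists: List[Dict[str, float]],
                       rng: random.Random) -> List[str]:
    """Sample one word per time step in proportion to its output probability."""
    sentence = []
    for dist in step_dists:
        words = list(dist)
        weights = [dist[word] for word in words]
        sentence.append(rng.choices(words, weights=weights, k=1)[0])
    return sentence


greedy_sentence = greedy_decode(steps)
sampled_sentence = monte_carlo_decode(steps, random.Random(0))
```

Greedy decoding always returns the same sentence for the same distributions, while sampling can return different sentences on different runs and never selects a word whose probability is 0.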

Correspondingly, when generating the captioning information of an image, the image captioning sentence may be obtained by combining, in chronological order, the words with the maximum output probability at each moment, until the stop character is the output with the maximum probability.
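The stopping rule can be sketched as a loop that ends as soon as the stop character has the highest probability. The token name `<eos>`, the lookup table standing in for the decoder, and the `max_len` safety limit are all assumptions made for illustration.

```python
STOP = "<eos>"  # hypothetical name for the stop character

# Toy stand-in for the decoder: maps the words generated so far to the
# output distribution for the next moment.
TOY_TABLE = {
    (): {"a": 0.6, "b": 0.4, STOP: 0.0},
    ("a",): {"c": 0.6, "d": 0.2, STOP: 0.2},
    ("a", "c"): {"e": 0.1, STOP: 0.9},
}

def next_word_probs(prefix):
    """Return the output distribution given the words generated so far."""
    return TOY_TABLE[tuple(prefix)]

def greedy_caption(max_len=10):
    """Append the most probable word per moment until the stop character wins."""
    words = []
    for _ in range(max_len):
        dist = next_word_probs(words)
        best = max(dist, key=dist.get)
        if best == STOP:
            break
        words.append(best)
    return words

caption = greedy_caption()
```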

Alternatively, step S10 may further include obtaining a global feature corresponding to the image.

Correspondingly, in step S20, obtaining the text caption of the image may be obtaining the text caption of the image based on the obtained local features and global feature.

To obtain a more accurate image caption, after the local features of the respective target regions of the image are obtained, global features of the image may further be obtained based on those local features, so that more accurate image captioning information can be obtained from both the local and the global characteristic information.

Alternatively, obtaining the global feature corresponding to the image may include: obtaining the global features of the image based on the local features of the image, or extracting the global features from the image through a feature extraction network.

Specifically, local characteristic information corresponding to each target candidate region of the image may be extracted through the feature extraction network, and global characteristic information corresponding to the image may be obtained based on the local characteristic information. Correspondingly, the decoder may be used to obtain the captioning information corresponding to the image according to the local characteristic information and the global characteristic information. As one alternative, the global characteristic information may be obtained by averaging the local characteristic information corresponding to the respective target candidate regions of the image. As another alternative, the global characteristic information may be obtained by applying a feature extraction network (e.g., a CNN) to the image; for example, the feature maps of the respective layers (i.e., the respective channels) of the image may be extracted through ResNet and then averaged and pooled.
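The first alternative, averaging the local features, amounts to an element-wise mean over the region vectors. A minimal sketch, with a hypothetical function name and made-up three-dimensional features:

```python
def mean_pool(local_features):
    """Average a list of equal-length feature vectors element-wise."""
    n = len(local_features)
    dim = len(local_features[0])
    return [sum(vec[i] for vec in local_features) / n for i in range(dim)]

# One feature vector per target candidate region (illustrative values).
local_features = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
]
global_feature = mean_pool(local_features)
```

The result has the same dimension as each local feature, so it can pass through the same kind of embedding as the region vectors.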

In an alternative embodiment of the present disclosure, the local feature may include a local image feature and/or a local attribute feature, and the global feature may include a global image feature and/or a global attribute feature. Correspondingly, the global image feature corresponding to the image may be obtained based on the local image feature, and/or the global attribute feature corresponding to the image may be obtained based on the local attribute feature.

That is, the obtained local characteristic information may include local text attribute information in addition to local image characteristic information. Therefore, when local features are extracted through the feature extraction network, the feature extraction network may further include an attribute prediction network. The attribute prediction network may be a multi-label classification network. Alternatively, the attribute prediction network may be obtained using a weakly-supervised training method such as noisy-OR. In practical applications, attributes may be finely divided into nouns, verbs, adjectives, relatively rare words, and topics. Each kind of attribute is obtained by its own attribute prediction network (e.g., one based on Multiple Instance Learning (MIL)). Finally, the various attribute features may be concatenated to obtain the final text attribute features.
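The text names noisy-OR without giving a formula. The sketch below uses the standard noisy-OR form from multiple instance learning, in which an attribute is present in the image if it is present in at least one region; whether the disclosure uses exactly this form is an assumption. The function names and the grouping into three attribute categories are hypothetical.

```python
import math

def noisy_or(region_probs):
    """Image-level probability that an attribute appears in at least one region."""
    return 1.0 - math.prod(1.0 - p for p in region_probs)

def concat_attribute_features(*feature_groups):
    """Concatenate per-category attribute vectors into one text attribute feature."""
    merged = []
    for group in feature_groups:
        merged.extend(group)
    return merged

# Two regions, each predicting the same attribute with probability 0.5.
image_prob = noisy_or([0.5, 0.5])

# Illustrative noun, verb, and adjective attribute scores.
text_attr = concat_attribute_features([0.9, 0.1], [0.3], [0.7, 0.2])
```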

Alternatively, when the obtained local characteristic information includes local image characteristic information and local text attribute information, the obtained global characteristic information may likewise include global image characteristic information and global text attribute information. Similarly, the global image characteristic information and the global text attribute information may be obtained based on the corresponding local characteristic information; that is, the global image characteristic information corresponding to the image may be obtained based on the local image characteristic information, and the global text attribute information corresponding to the image may be obtained based on the local text attribute information. The global image characteristic information and the global text attribute information may also be extracted from the image through a neural network. Of course, of the local characteristic information and the global characteristic information, one may include both image characteristic information and text attribute information while the other includes only image characteristic information or only text attribute information; this may be configured according to application requirements.

In an alternative embodiment of the present disclosure, obtaining the text caption of the image according to the local features and the global feature includes: encoding each local feature individually, based on all the extracted local features, to obtain encoded local features; and obtaining the text caption of the image based on the encoded local features and the global feature.

That is, after the local features of the image are obtained, an encoder may be used to encode the local characteristic information according to all the extracted local characteristic information, yielding encoded local characteristic information. Based on all the extracted local characteristic information, the encoder may be used to learn the relationships between the local features of the respective target regions.

As an alternative solution, the encoder may be implemented as a self-attention-based encoder, which may encode each piece of local characteristic information based on all the extracted local characteristic information to obtain encoded local characteristic information. Correspondingly, when the decoder obtains the captioning information corresponding to the image according to the local characteristic information and the global characteristic information, the decoder may output the words that can be output at each moment and their output probabilities (which may be normalized probabilities), based on the encoded local characteristic information, the global characteristic information, and the input word vectors (which may include the start token and the word vectors predicted during the iterative prediction process). The image captioning sentence may be obtained by combining, in chronological order, the words with the maximum output probability at each moment, until the stop character has the maximum output probability.

As an alternative solution, the self-attention-based encoder may include a multi-head attention layer, a layer normalization layer, and a feed-forward network layer cascaded in sequence.

As an example, FIG. 22 shows a schematic structural diagram of a codec provided in an embodiment of the present disclosure. As shown in the figure, for the image to be processed (the image displayed in the lower right corner of the figure), the local features of the respective target regions of the image may be extracted through a feature extraction network (e.g., the Faster R-CNN shown in the figure) to obtain the input information of the encoder. Specifically, the feature extraction network may divide the input image into several target candidate regions (i.e., target regions) and obtain a feature vector (i.e., a local feature) from each target candidate region, thereby obtaining a plurality of feature vectors such as {v_j} shown in FIG. 22. Each vector in {v_j} represents the local characteristic information of one target candidate region (the region corresponding to a rectangular block shown in the lower left of the figure).

Region feature embedding may be performed on the local characteristic information extracted by the feature extraction network, in order to change the dimension of the characteristic information so that it is suitable for subsequent encoder processing. The local characteristic information output after the embedding is input to the encoder for encoding.

Alternatively, the encoder shown in FIG. 22 may have one or more blocks. When the encoder has a multi-block structure, the blocks may have the same structure or different structures. For example, assume that the encoder includes six blocks of the same structure, cascaded in sequence. Each block may mainly include two parts: a multi-head attention layer and a position-wise fully connected feed-forward network. Alternatively, the feed-forward network may be implemented by two linear prediction layers, which may include a ReLU activation operation. The multi-head attention layer and the feed-forward network in each block may each be followed by a corresponding layer normalization layer. As shown in FIG. 22, each block in this example may include, in order, a multi-head attention layer, a layer normalization layer, a feed-forward network layer, and a layer normalization layer. The encoder may be obtained by stacking the blocks.
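A position-wise feed-forward network built from two linear layers can be sketched as follows. Placing the ReLU between the two layers follows the usual Transformer arrangement and is an assumption here, since the text only says that the layers may include a ReLU operation; the weights and dimensions are made up.

```python
def linear(x, weight, bias):
    """Compute W x + b for one position; weight is given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    """Apply linear -> ReLU -> linear to a single position's vector."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

w1 = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]  # projects 2 -> 3 dimensions
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]     # projects 3 -> 2 dimensions
b2 = [0.0, 0.0]

out = feed_forward([2.0, 1.0], w1, b1, w2, b2)
```

The same weights are applied independently to the vector at every position, which is what "position-wise" refers to.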

The processing of the first block is described below as an example. The local characteristic information is first processed by the multi-head attention layer, and the result is combined (e.g., by addition) with the output of the region feature embedding and then layer-normalized. The normalized result is processed by the feed-forward network, combined (e.g., by addition) with the output of the preceding layer normalization layer, and then layer-normalized again to obtain the output of the first block. The output of the first block is used as the input of the second block, and the encoding proceeds block by block in this way to obtain the output of the encoder (i.e., the encoder output in FIG. 25).
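The add-then-normalize flow of one encoder block can be sketched as below. The attention and feed-forward sublayers are passed in as callables; the uniform-averaging "attention" and the ReLU "feed-forward" used in the example are crude placeholders chosen only so that the sketch runs, not the layers of the disclosure. Layer normalization is shown without learned scale and shift parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(seq, sublayer_seq):
    """Add each sublayer output to its input, then layer-normalize per position."""
    return [layer_norm([a + b for a, b in zip(x, s)]) for x, s in zip(seq, sublayer_seq)]

def encoder_block(seq, attention, feed_forward):
    """Attention -> add & norm -> feed-forward -> add & norm."""
    h = add_and_norm(seq, attention(seq))
    return add_and_norm(h, [feed_forward(v) for v in h])

def uniform_attention(seq):
    """Placeholder: every position attends equally to all positions."""
    mean = [sum(col) / len(seq) for col in zip(*seq)]
    return [list(mean) for _ in seq]

def relu_ffn(x):
    """Placeholder feed-forward sublayer."""
    return [max(0.0, v) for v in x]

embedded_regions = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
block_out = encoder_block(embedded_regions, uniform_attention, relu_ffn)
```

Stacking six such blocks, each taking the previous block's output as input, gives the cascaded encoder described in the text.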

Alternatively, as shown in FIG. 22, the global characteristic information may further be obtained based on the local characteristic information extracted by the feature extraction network. For example, the local characteristic information may be averaged to obtain the global characteristic information, which is shown as a feature vector in FIG. 22. In addition, the start token and the word vectors predicted during the iterative prediction process may also be obtained (for the first prediction of the iteration, only the start token is obtained); the start token and the predicted word vectors are shown as w in FIG. 22. When training the model, all word vectors corresponding to the sample may be input.

Decoder embedding may be performed on the global characteristic information, the start token, and the word vectors w predicted during the iterative prediction process, in order to change the dimension of the characteristic information so that it is suitable for subsequent decoder processing. The global characteristic information, start token, and word vectors output after the embedding may be input to the decoder for decoding.

Alternatively, the decoder of FIG. 22 may have one or more blocks, and the blocks may have the same structure. For example, the decoder may include six blocks of the same structure. As an alternative structure, each block may mainly include a masked multi-head attention layer, a multi-head self-attention layer corresponding to the features, and a feed-forward network. The multi-head attention layers and the feed-forward network in each block may each be followed by a corresponding layer normalization layer. Each block may include, in order, a masked multi-head attention layer, a layer normalization layer, a multi-head attention layer, a layer normalization layer, a feed-forward network layer, and a layer normalization layer. The decoder may be obtained by stacking the blocks.

The processing of the first block is described below as an example. The global characteristic information, the start token, and the word vectors w are first processed by the masked multi-head attention layer, and the result is combined (e.g., by addition) with the output of the decoder embedding and then layer-normalized. The normalized result and the output of the encoder are processed together by the multi-head attention layer, combined (e.g., by addition) with the output of the preceding layer normalization layer, and then layer-normalized. That normalized result is processed by the feed-forward network, combined with the output of the preceding layer normalization layer, and layer-normalized once more; the result is the output of the first block. The output of the first block is used as the input of the second block, and the decoding proceeds block by block in this way to obtain the output of the decoder.
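The three stages of this decoder block can be summarized in equations. The symbols are notation introduced here for illustration and do not appear in the disclosure: x is the output of the decoder embedding, E is the encoder output, and attention arguments are written in (query, key, value) order. The sketch assumes that "combining" means element-wise addition.

```latex
\begin{aligned}
h_1 &= \mathrm{LayerNorm}\bigl(x + \mathrm{MaskedMultiHead}(x, x, x)\bigr) \\
h_2 &= \mathrm{LayerNorm}\bigl(h_1 + \mathrm{MultiHead}(h_1, E, E)\bigr) \\
y   &= \mathrm{LayerNorm}\bigl(h_2 + \mathrm{FFN}(h_2)\bigr)
\end{aligned}
```

Here y is the output of the first block and serves as x for the second block.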

The output of the decoder is processed by the linear layer and then by the softmax layer, which output the word vectors that can be output at the current moment (i.e., in this prediction iteration) and the corresponding output probabilities, for example words a and b together with the output probability of word a and the output probability of word b. The decoder, the linear layer, and the softmax layer repeat the above iterative prediction process until the stop character has the maximum output probability. The captioning information corresponding to the input image may be obtained from the word vectors obtained in each iteration.
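The linear and softmax stages can be sketched as a projection of the decoder output onto the vocabulary followed by normalization. The vocabulary, the `<eos>` token name, and the weights below are invented for illustration.

```python
import math

def softmax(logits):
    """Turn scores into probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def word_probabilities(decoder_out, weight, bias, vocab):
    """Linear layer (one row of weights per word) followed by softmax."""
    logits = [sum(w * h for w, h in zip(row, decoder_out)) + b
              for row, b in zip(weight, bias)]
    return dict(zip(vocab, softmax(logits)))

vocab = ["a", "b", "<eos>"]
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

probs = word_probabilities([2.0, 1.0], weight, bias, vocab)
best = max(probs, key=probs.get)
```

When `best` is the stop character, the iteration ends; otherwise the word is appended to the decoder input for the next iteration.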

When the local features include local image features and local text attribute information, the encoder used may include an image feature encoder and an attribute feature encoder, which encode the local image characteristic information and the local text attribute information, respectively. Specifically, the image feature encoder may encode each piece of local image characteristic information according to all the extracted local image characteristic information to obtain encoded local image characteristic information, and the attribute feature encoder may encode each piece of local text attribute information according to all the extracted local text attribute information to obtain encoded local text attribute information. In this case, the decoder may be used to obtain the captioning information corresponding to the image according to the encoded local image characteristic information, the encoded local text attribute information, the global image characteristic information, and the global text attribute information.

As another example, FIG. 23 shows a schematic structural diagram of a codec provided in an embodiment of the present disclosure. As can be seen from FIGS. 22 and 23, the structure of the decoder of FIG. 23 may be similar to that of the decoder of FIG. 22. In this example, the encoder may include an image feature encoder and an attribute feature encoder, and the two parts may have the same structure or different structures.

As shown in FIG. 23, for the image to be processed (the image shown in the lower left of the figure), local image feature vectors of several target candidate regions, such as {v_j} shown in FIG. 22, may be extracted through a feature extraction network (e.g., the Faster R-CNN shown in the figure). Each vector in {v_j} represents the local image feature vector of one target candidate region (the region corresponding to a rectangular block marked in the image processed by the Faster R-CNN shown in FIG. 23). A plurality of local text attribute vectors, such as {a_j} shown in FIG. 23, may also be obtained. Each vector in {a_j} represents the local text attribute vector of one target candidate region (the region corresponding to a rectangular block marked in the lower left of the image).

Region image feature embedding may be performed on the extracted local image characteristic information. The local image characteristic information output after the embedding is input to the image feature encoder for encoding. Region attribute feature embedding may be performed on the extracted local text attribute information. The local text attribute information output after the embedding is input to the attribute feature encoder for encoding.

In this example, the structures of the image feature encoder and the attribute feature encoder shown in FIG. 23 are described by taking the structure of the encoder shown in FIG. 22 as an example. For instance, each may include six blocks of the same structure. The structure of each block is shown in FIG. 23, and the blocks may be stacked to obtain the encoder. For the processing flow of each block, refer to the description of the block structure of the encoder of FIG. 22 above. When the image feature encoder and the attribute feature encoder perform feature encoding, the only difference is their input: the input of the image feature encoder is the feature obtained after region image feature embedding is performed on the local image characteristic information, and the input of the attribute feature encoder is the feature obtained after region attribute feature embedding is performed on the local text attribute information. After encoding, the outputs of the encoders are obtained (i.e., the output of the image feature encoder and the output of the attribute feature encoder in FIG. 23).

In addition, the local image characteristic information may be averaged to obtain the global image characteristic information, shown as a feature vector in FIG. 23, and the local text attribute information may be averaged to obtain the global text attribute information, likewise shown as a feature vector in FIG. 23. The start token and the word vectors predicted during the iterative prediction process may further be obtained (for the first prediction of the iteration, only the start token is obtained). The start token and the predicted word vectors are shown as w in FIG. 23. When training the model, all word vectors of the sample may be input.

Decoder embedding may be performed on the global image characteristic information, the global text attribute information, the start token, and the word vectors w predicted during the iterative prediction process, in order to change the dimension of the characteristic information so that it is suitable for subsequent decoder processing. The global image characteristic information, global text attribute information, start token, and word vectors output after the embedding may be input to the decoder for decoding.

As an alternative structure, the decoder shown in FIG. 23 may have one or more blocks. When the decoder has a multi-block structure, the blocks may have the same structure or different structures. For example, it may include six blocks of the same structure, and each block may include a masked multi-head attention layer, a multi-head attention layer corresponding to the image features, a multi-head attention layer corresponding to the attribute features, and a feed-forward network. The multi-head attention layers and the feed-forward network in each block may each be followed by a corresponding layer normalization layer. Specifically, as shown in FIG. 23, each block may include, in order, a masked multi-head attention layer, a layer normalization layer, a multi-head attention layer, a layer normalization layer, a multi-head attention layer, a layer normalization layer, a feed-forward network layer, and a layer normalization layer. The decoder may be obtained by stacking the blocks.

The processing of the first block is described below as an example. The global image characteristic information, the global text attribute information, the start token, and the word vectors w predicted during the iterative prediction process are first processed by the masked multi-head attention layer, and the result is combined (e.g., by addition) with the output of the decoder embedding and then layer-normalized. The normalized result and the output of the image feature encoder are processed together by a multi-head attention layer, combined (e.g., by addition) with the output of the preceding layer normalization layer, and then layer-normalized. That normalized result and the output of the attribute feature encoder are processed together by a further multi-head attention layer, combined (e.g., by addition) with the output of the preceding layer normalization layer, and then layer-normalized. The normalized result is then processed by the feed-forward network, combined with the output of the preceding layer normalization layer, and layer-normalized; the result is the output of the first block. The output of the first block is used as the input of the second block, and the decoding proceeds block by block in this way to obtain the output of the decoder.
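The four stages of this decoder block can be summarized in equations. The symbols are notation introduced here for illustration and do not appear in the disclosure: x is the output of the decoder embedding, E_img and E_attr are the outputs of the image feature encoder and the attribute feature encoder, and attention arguments are written in (query, key, value) order. The sketch assumes that "combining" means element-wise addition.

```latex
\begin{aligned}
h_1 &= \mathrm{LayerNorm}\bigl(x + \mathrm{MaskedMultiHead}(x, x, x)\bigr) \\
h_2 &= \mathrm{LayerNorm}\bigl(h_1 + \mathrm{MultiHead}(h_1, E_{\mathrm{img}}, E_{\mathrm{img}})\bigr) \\
h_3 &= \mathrm{LayerNorm}\bigl(h_2 + \mathrm{MultiHead}(h_2, E_{\mathrm{attr}}, E_{\mathrm{attr}})\bigr) \\
y   &= \mathrm{LayerNorm}\bigl(h_3 + \mathrm{FFN}(h_3)\bigr)
\end{aligned}
```

Compared with the decoder block of FIG. 22, the only structural change is the second attention stage over the attribute encoder output.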

The output of the decoder is processed by the linear layer and then by the softmax layer, which output the word vectors that can be output at the current moment (i.e., in this prediction iteration) and the corresponding output probabilities, for example words a and b together with the output probability of word a and the output probability of word b. The decoder, the linear layer, and the softmax layer repeat the above iterative prediction process until the stop character has the maximum output probability. The captioning information corresponding to the input image may be obtained from the word vectors obtained in each iteration.

From the description of the method for generating image captioning information in the foregoing alternative embodiments, it can be seen that the method may be implemented by an image captioning model; that is, an image may be input to the image captioning model, and the text caption of the image may be obtained based on the model output. The specific neural network structure of the image captioning model is not limited in the embodiments of the present disclosure. For example, a codec network structure including the self-attention-based encoder and the self-attention-based decoder shown in FIG. 22 or FIG. 23 may be used, but the structure is not limited thereto.

It should be understood that the solutions in the foregoing examples are merely examples of some alternative implementations of the present disclosure and are not intended to limit it. In addition, the above examples are suitable for generating captioning information of an image and also for generating captioning information of a video. Generating captioning information of an image differs from generating captioning information of a video in the following respect: because there is only one image, there is no need to consider inter-frame information, that is, information between adjacent frames, such as the temporal edges and inter-frame encoders described above.

From the foregoing description of the methods for generating captioning information of multimedia data provided in the alternative embodiments of the present disclosure, it can be seen that the generation of captioning information may be implemented using a multimedia data captioning model. For videos, the multimedia data captioning model is a video captioning model, and for images, it is an image captioning model. The video captioning model and the image captioning model may be different models or the same model; that is, a single model may be suitable for generating both image captions and video captions. Optionally, the video captioning model may be an RNN-based model or a model based on another network structure, such as a Transformer-based model. In practical applications, the specific structure of the model may be set according to actual needs and is not limited in the embodiments of the present disclosure.

For a video whose captioning information is to be obtained, the video, or frames selected from the video, may be input to the video captioning model, and the text caption of the video may be obtained based on the output of the video captioning model. The specific model structure of the video captioning model is not limited in the embodiments of the present disclosure. An original video captioning model may be trained using video samples, and the trained video captioning model is then used to generate video captioning information.

Specifically, in an alternative embodiment of the present disclosure, the text caption of the multimedia data is obtained using a multimedia data captioning model, and the multimedia data captioning model is obtained by training as follows: obtaining training samples, where the training samples include first sample multimedia data with captioning labels; training an original captioning model based on the first sample multimedia data until the model loss function converges; and taking the trained captioning model as the multimedia data captioning model.

It can be understood that, for a video captioning model, the sample multimedia data is a sample video and the captioning label is a video captioning label; for an image captioning model, the sample multimedia data is a sample image and the captioning label is an image captioning label. The specific form of the model loss function may be set according to actual needs. For example, a model loss function commonly used when training a video captioning model or an image captioning model may be selected. During training, the value of the model loss function represents the difference between the captioning information predicted by the model for the multimedia data and the captioning label information, or indicates whether the predicted captioning information meets other preset termination conditions. Through continued training, the captioning information predicted by the model approaches the captioning label information or meets the other preset conditions.

To improve the accuracy of the generated captioning information of multimedia data, FIG. 24 illustrates a method of training a multimedia data captioning model provided in an alternative embodiment of the present disclosure. As shown in the figure, the training samples in this method also include second sample multimedia data without captioning labels, and the model loss function includes a first loss function and a second loss function. Training the original captioning model based on the first sample multimedia data may then include the following steps S201 to S203.

Step S201: A preset multimedia data captioning model is trained based on the first sample multimedia data to obtain the value of the first loss function, and the captioning model is trained based on the second sample multimedia data to obtain the value of the second loss function.

Specifically, in an embodiment of the present disclosure, the first sample multimedia data with captioning labels and the second sample multimedia data without captioning labels may be used together to train the preset video captioning model.

The sources of the first sample multimedia data and the second sample multimedia data are not limited in the embodiments of the present disclosure. Taking video as an example, the original video caption corresponding to the first sample video data may be labeled manually by a technician; for the video shown in FIG. 25, for instance, the technician may label the video with the caption "a child is cleaning the ground". The second sample video data may be any acquired video without a video caption, for example, a video obtained from a video website or a video shot by a user. The specific forms of the first loss function and the second loss function are not limited in the embodiments of the present disclosure and may be set according to actual application requirements.

Step S202: The value of the model loss function of the captioning model is obtained based on the value of the first loss function and the value of the second loss function.

Optionally, each loss function may have its own weight, so that different loss functions carry different importance in the training process. For example, because the first sample multimedia data has original captioning labels and the second sample multimedia data does not, the captioning label information of the first sample multimedia data (i.e., the original captioning labels) is highly accurate, and the weight of the first loss function may therefore be greater than that of the second loss function. When the loss functions have their own weights, the final loss function of the multimedia data captioning model may be determined based on the weight corresponding to each loss function. For example, the final loss function may be a weighted sum of the individual loss functions.

That is, obtaining the value of the model loss function (also called the final loss function) based on the value of the first loss function and the value of the second loss function may include the following steps.

The value of a corresponding target first loss function is obtained based on a preset weight of the first loss function, and the value of a corresponding target second loss function is obtained based on a preset weight of the second loss function.

The sum of the value of the target first loss function and the value of the target second loss function is taken as the value of the final loss function.

Specifically, the value of the final loss function can be calculated by the following Equation 8:

$$J(\theta) = J_{label}(\theta) + \epsilon\, J_{unlabel}(\theta) \qquad (8)$$

where $\epsilon$ is a hyperparameter. In this example, $J_{label}(\theta)$ is the first loss function and $J_{unlabel}(\theta)$ is the second loss function. The weight of the first loss function may be set to 1, and the weight of the second loss function is $\epsilon$. Accordingly, the product of the first loss function and its weight is the target first loss function, and the product of the second loss function and its weight is the target second loss function. The sum of the target first loss function and the target second loss function is the final loss function.
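The weighted sum described above can be illustrated with a few lines of code. This is a minimal sketch; the function name `final_loss` and the numeric values are assumptions chosen only for the example.

```python
def final_loss(j_label, j_unlabel, epsilon, label_weight=1.0):
    """Weighted sum of the first (labeled) and second (unlabeled) loss values.

    label_weight defaults to 1, as in the text; epsilon is the hyperparameter
    that weights the second loss function.
    """
    target_first = label_weight * j_label   # target first loss value
    target_second = epsilon * j_unlabel     # target second loss value
    return target_first + target_second

# Hypothetical loss values, for illustration only.
value = final_loss(j_label=2.0, j_unlabel=4.0, epsilon=0.5)
```

Choosing epsilon smaller than 1 gives the labeled loss more influence, which matches the remark that the first loss function may carry the larger weight.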

Step S203: The captioning model is trained based on the value of the final loss function until the final loss function converges, to obtain a trained multimedia data captioning model.

Specifically, after the final loss function of the video captioning model is obtained, the model parameters of the video captioning model are updated based on the final loss function until the final loss function converges to a minimum, to obtain the trained video captioning model. The final loss function of the video captioning model is determined by the first loss function and the second loss function. Convergence of the final loss function to a minimum may mean that the final loss function itself converges to a minimum, or that the first loss function and the second loss function both converge to a minimum at the same time.
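The idea of updating parameters until the final loss converges can be sketched with a toy example. Everything here is an assumption made for illustration: a single scalar parameter, plain gradient descent, quadratic stand-ins for the two loss functions, and a convergence test on the change in loss value. The disclosure does not prescribe any of these choices.

```python
def train_until_converged(loss_fn, grad_fn, theta, lr=0.1, tol=1e-8, max_iters=10_000):
    """Gradient descent that stops once the loss value stops changing."""
    prev = loss_fn(theta)
    for _ in range(max_iters):
        theta = theta - lr * grad_fn(theta)
        cur = loss_fn(theta)
        if abs(prev - cur) < tol:  # treated as "converged" in this sketch
            break
        prev = cur
    return theta

# Toy stand-ins: J_label = (theta - 1)^2, J_unlabel = (theta - 3)^2, epsilon = 0.5.
EPS = 0.5

def loss(theta):
    return (theta - 1.0) ** 2 + EPS * (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 1.0) + EPS * 2.0 * (theta - 3.0)

theta_star = train_until_converged(loss, grad, theta=0.0)
```

For these toy losses the minimum of the weighted sum lies at theta = 5/3, between the minima of the two individual losses and closer to that of the more heavily weighted one.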

In an embodiment of the present disclosure, when first sample multimedia data with a captioning label is received, the preset multimedia data captioning model is trained based on the first sample multimedia data and the captioning label to obtain the value of the first loss function. When second sample multimedia data without a captioning label is received, the captioning model is trained based on the second sample multimedia data to obtain the value of the second loss function. The value of the final loss function of the multimedia data captioning model is obtained based on the first loss function and the second loss function, and the multimedia data captioning model is trained based on the final loss function until the final loss function converges to a minimum, to obtain the trained multimedia data captioning model.

In this way, in addition to training the multimedia data captioning model with sample multimedia data that has captioning labels, this alternative embodiment of the present disclosure can simultaneously use sample multimedia data without captioning labels to train the captioning model. In particular, when the amount of sample multimedia data is large, the labor and time required to label the sample multimedia data with captioning information can be greatly reduced. Moreover, because the amount of sample multimedia data increases, the accuracy and precision of the multimedia data captioning model are also improved. In addition, the algorithm in the embodiments of the present disclosure can be applied to different models, such as the RNN-based model or the Transformer-based model; it is a general training method.

In an alternative embodiment of the present disclosure, in step S201, obtaining the value of the first loss function based on the first sample multimedia data includes: inputting the first sample multimedia data into the video captioning model to obtain predicted target captioning information; and obtaining the value of the first loss function based on the target captioning information and the corresponding captioning label.

The value of the first loss function represents the difference between the target captioning information obtained based on the model output and the corresponding labeled captioning information.

As an example, FIG. 26 shows a schematic diagram of a method for training a multimedia data captioning model provided in an embodiment of the present disclosure. The example is described using video. The labeled data shown in the figure corresponds to the video data in the first sample video data, and the unlabeled data shown in the figure corresponds to the video data in the second sample video data. The training method is described below with reference to FIG. 26.

As shown in FIG. 26, the labeled video data V may be input to the video captioning model M, which analyzes and processes the video data to generate a corresponding target video caption. The value of the first loss function is then calculated based on the original video caption in the first sample video data (corresponding to the label y in FIG. 26) and the target video caption. In this example, the first loss function may be a cross-entropy loss function, as shown in Equation 9:

$$J_{label}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{1:t-1}, V\right) \qquad (9)$$

where $J_{label}(\theta)$ denotes the cross-entropy loss, $\theta$ denotes the model parameters of the video captioning model, $t$ denotes the current time, $T$ denotes the maximum time, $y_t$ denotes the ground truth corresponding to the current time, $y_{1:t-1}$ denotes the ground truth corresponding to times 1 through $t-1$, $V$ denotes the video, and $p_\theta$ denotes the probability that the output word is the ground truth. Specifically, $p_\theta(y_t \mid y_{1:t-1}, V)$ represents the probability that the word predicted by the model at the current time is the corresponding labeled word. The loss function means that, when the inputs at each time before the current time are the correct words, the probability that the output at the current time is the correct word is maximized.
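The cross-entropy loss described above can be computed from per-step probabilities as in the following sketch. The per-step distributions are invented for illustration; in a real model each would be the output distribution of the captioning model given the video and the ground-truth words before that step.

```python
import math

def cross_entropy_loss(step_probs, targets):
    """Negative sum over time steps of log p(y_t | y_1:t-1, V).

    step_probs[t] maps each candidate word to its probability at step t,
    assuming the ground-truth words before step t were fed to the model.
    targets[t] is the ground-truth (labeled) word at step t.
    """
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, targets))

# Hypothetical distributions for the first three words of a caption.
step_probs = [
    {"a": 0.9, "the": 0.1},
    {"child": 0.8, "dog": 0.2},
    {"is": 0.5, "was": 0.5},
]
targets = ["a", "child", "is"]
loss_value = cross_entropy_loss(step_probs, targets)
```

The loss falls as the model assigns higher probability to each labeled word, and it is zero only if every labeled word receives probability 1.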

For example, suppose the video shown in FIG. 25 is analyzed and processed by the video captioning model, the current time is t = 2, the word $y_0$ output at the initial time t = 0 is "a", and the word $y_1$ output at time t = 1 is "child". Then, given that the word $y_1$ "child" output at time t = 1 is the correct word, the probability that the output $y_2$ at the current time t = 2 is "is" is maximized.

Suppose the video caption obtained by analyzing and processing the video shown in FIG. 25 with the video captioning model is "a child is sweeping the ground"; the video captioning model is then trained based on "a child is sweeping the ground" and the original video caption "a child is cleaning the ground".

In practical applications, an embodiment of the present disclosure may preset a lexicon, from which the word output at each time is determined. The word $y_0$ output at the initial time t = 0 may be determined based on the start token of the video. For example, for the video shown in FIG. 25, the word $y_0$ "a" output at time t = 0 is determined based on the start token of the video. Of course, in practical applications, other methods may be used to determine the first word of the video caption, which is not limited in the embodiments of the present disclosure.

In an alternative embodiment of the present disclosure, training the captioning model based on the second sample multimedia data to obtain the value of the second loss function includes: performing data augmentation on the second sample multimedia data at least once to obtain third sample multimedia data; inputting the second sample multimedia data into the captioning model to obtain at least one multimedia caption; determining a score of each multimedia caption based on the second sample multimedia data and the third sample multimedia data; and obtaining the value of the second loss function based on the scores of the multimedia captions.

That is, when the model is trained based on the second sample multimedia data without captioning labels, the second sample multimedia data may be augmented to obtain the third sample multimedia data; based on the third sample multimedia data and the second sample multimedia data, a score is determined for each piece of captioning information that the model produces from the second sample multimedia data; and the value of the second loss function is obtained based on the respective scores.

Optionally, for example, the second sample multimedia data may be input into the captioning model to obtain first captioning information and second captioning information. Based on the second sample multimedia data and the third sample multimedia data, a first score of the first captioning information and a second score of the second captioning information are determined, and the value of the second loss function is obtained based on the first score and the second score.

In the solution of the embodiments of the present disclosure, when the multimedia data captioning model is trained using sample multimedia data without captioning labels, the second sample multimedia data undergoes data augmentation K times (K ≥ 1) to obtain the third sample multimedia data. The captioning model is trained based on the second sample multimedia data and the third sample multimedia data. Because the second sample multimedia data and the third sample multimedia data are the same or similar, the captioning information of the second sample multimedia data and that of the third sample multimedia data should also be the same or similar. The second loss function can be calculated on this basis and used to train the captioning model, which further improves the accuracy and precision of the captioning model.

In an alternative embodiment of the present disclosure, inputting the second sample multimedia data into the multimedia data captioning model to obtain the corresponding first captioning information and second captioning information specifically includes: inputting the second sample multimedia data into the captioning model and determining the first captioning information from the output of the captioning model using a greedy algorithm; and/or inputting the second sample multimedia data into the captioning model and determining the second captioning information from the output of the captioning model based on probability sampling.

A greedy algorithm (also called greedy search) always makes the choice that is best at the current step when solving a problem. That is, instead of considering the globally optimal solution, it produces a solution that is locally optimal in some sense.

Optionally, taking video as an example, the first captioning information is a first video caption and the second captioning information is a second video caption. The first video caption $c_g$ obtained by the greedy algorithm may be expressed as shown in Equation 10:

$$c_g = \{c_g(1), c_g(2), \ldots, c_g(T)\} \qquad (10)$$

where $c_g(t)$ $(t = 1, 2, \ldots, T)$ denotes the word output at the current time $t$. Optionally, $c_g(t) = \arg\max_{y \in Y} p_\theta(y \mid c_{g,1:t-1}, V)$, where $V$ denotes the second sample video data and $c_{g,1:t-1}$ denotes the sequence of words output from the initial time to time $t-1$, that is, the words output before the current time. Here, $c_g(t)$ indicates that the word with the highest probability at the current time $t$ is selected as the output word. The output probabilities of the candidate words at the current time are determined based on the words output at each time before the current time and on the video $V$, and the final output word at the current time is the candidate word with the highest probability.

After the words output at each time are obtained, they are arranged in output order to obtain the first video caption $c_g$.

For the second video captioning information, probability sampling means that each unit of the population is selected with the same probability; it is also called Monte Carlo sampling. Probability sampling draws samples based on probability theory and the principle of randomness, so that every unit of the population is sampled with a known, non-zero probability. The probability that a population unit is sampled is specified by the sampling design and can be realized through a randomization operation, although a random sample generally does not match the population exactly.

Optionally, the second video caption $c_s$ obtained based on probability sampling may be expressed as shown in Equation 11:

$$c_s = \{c_s(1), c_s(2), \ldots, c_s(T)\} \qquad (11)$$

where $c_s(t)$ $(t = 1, 2, \ldots, T)$ denotes the word output at the current time $t$. Optionally, $c_s(t) = \mathrm{multinomial}_{y \in Y}\, p_\theta(y \mid c_{s,1:t-1}, V)$, where $c_{s,1:t-1}$ denotes the sequence of words output from the initial time to time $t-1$, that is, the words output before the current time, and $V$ denotes the second sample video data. Here, $c_s(t)$ is obtained by sampling according to the output probability of each word at the current time and using the sampled result as the output. The output probabilities of the candidate words at the current time are determined based on the words output at each time before the current time and on the video $V$, and the final output word at the current time is drawn by Monte Carlo sampling from those output probabilities.
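A sampled caption of this kind can be sketched as follows. This is an illustrative sketch only: `toy_step_probs` is a hypothetical stand-in for the captioning model, the `<eos>` stop token and the three-word limit are invented, and Python's `random.choices` is used here as one way to draw from a multinomial distribution.

```python
import random

def sample_word(probs, rng):
    """Draw one word according to its output probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def sample_caption(step_probs, rng, stop_token="<eos>", max_steps=20):
    """Build a caption by sampling one word per time step."""
    caption = []
    for _ in range(max_steps):
        word = sample_word(step_probs(caption), rng)
        if word == stop_token:
            break
        caption.append(word)
    return caption

def toy_step_probs(prefix):
    # Hypothetical stand-in for the model: the same three candidates at every
    # step, then a forced stop after three words.
    if len(prefix) >= 3:
        return {"<eos>": 1.0}
    return {"cleaning": 0.5, "sweeping": 0.2, "organizing": 0.3}

c_s = sample_caption(toy_step_probs, random.Random(0))
```

Unlike greedy selection, repeated runs with different seeds can yield different captions, because lower-probability words are still chosen some of the time.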

Taking FIG. 26 as an illustrative example, for the second sample video data, that is, the unlabeled data shown in FIG. 26, data augmentation may be performed K times to obtain the third sample video data (such as the augmented video data V' in FIG. 26). The data augmentation may randomly remove frames from the video, or apply transformations such as rotation and cropping to each frame of the video; other augmentation methods may also be used. In practice, any method of performing data augmentation on video data is applicable to the embodiments of the present disclosure, which are not limited in this respect. The second sample video data is input to the video captioning model M to obtain the corresponding first video caption c_g and second video caption c_s. For example, the first video caption c_g may be obtained by Equation 2 above, and the second video caption c_s may be obtained by Equation 3 above.
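The frame-removal variant of the augmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names, the drop probability of 0.2 and the use of integers as stand-ins for decoded frames are all hypothetical.

```python
import random

def drop_frames(frames, drop_prob, rng):
    """Return a copy of `frames` in which each frame is removed with probability `drop_prob`."""
    kept = [f for f in frames if rng.random() >= drop_prob]
    # Keep at least one frame so that the augmented video is never empty.
    return kept if kept else [frames[0]]

def augment_k_times(frames, k, drop_prob=0.2, seed=0):
    """Produce K augmented versions V' of one unlabeled video V."""
    rng = random.Random(seed)
    return [drop_frames(frames, drop_prob, rng) for _ in range(k)]

video = list(range(16))                  # stand-in for 16 decoded frames
augmented = augment_k_times(video, k=3)  # three augmented videos V'
```

Frame order is preserved, so each V' remains a temporally consistent subsequence of V; rotation or cropping would be applied per frame in the same loop.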

Specifically, for example, by using the video captioning model to analyze and process a video that has the same content as shown in FIG. 25 but no original captioning information, the words (output results) corresponding to five time steps may be obtained as follows: "a" (corresponding to c_g(1)), "child" (corresponding to c_g(2)), "is" (corresponding to c_g(3)), "cleaning", "sweeping" and "organizing" (each a candidate for c_g(4)), and "ground" (corresponding to c_g(5)). At time step c_g(4), the output probabilities of the three candidate words are 0.5 (cleaning), 0.2 (sweeping) and 0.3 (organizing). Without considering the output probabilities of other words, the greedy algorithm takes "cleaning", which has the maximum output probability, as the output word at c_g(4), and the final c_g generated by the greedy algorithm is "a child is cleaning the ground".

As another example, continuing the example above, the three candidate words determined at time step c_g(4) are "sweeping", "organizing" and "cleaning", with output probabilities of 0.2, 0.3 and 0.5 respectively. Without considering the output probabilities of other words, three video captions can be generated: "a child is sweeping the ground" with an output probability of 0.2, "a child is organizing the ground" with an output probability of 0.3, and "a child is cleaning the ground" with an output probability of 0.5. The final c_s generated by probability sampling may be any one of these three video captions.

That is, in the above situation, the second sample video data is input to the video captioning model and an output result is obtained. If the video caption is generated 10 times with the greedy algorithm, the obtained caption may be "a child is cleaning the ground" all 10 times. If the video caption is generated 10 times with probability sampling, it is possible to obtain "a child is sweeping the ground" twice, "a child is organizing the ground" three times, and "a child is cleaning the ground" five times.
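The contrast between the two decoding strategies in the example above can be reproduced with a short sketch. It uses only the three candidate words and probabilities given in the text; the function names and the fixed random seed are illustrative assumptions.

```python
import random
from collections import Counter

# Output probabilities of the three candidate words at time step 4 (from the example above).
probs = {"cleaning": 0.5, "sweeping": 0.2, "organizing": 0.3}

def greedy_word(p):
    """Greedy algorithm: always take the word with the highest output probability."""
    return max(p, key=p.get)

def sampled_word(p, rng):
    """Probability sampling: draw one word from the multinomial distribution defined by p."""
    words = list(p)
    return rng.choices(words, weights=[p[w] for w in words], k=1)[0]

rng = random.Random(0)
greedy_runs = Counter(greedy_word(probs) for _ in range(10))
sampled_runs = Counter(sampled_word(probs, rng) for _ in range(10))
```

Here `greedy_runs` contains only "cleaning", while `sampled_runs` generally spreads over the three words in rough proportion to their probabilities; the exact counts depend on the random draws.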

In an alternative embodiment of the present disclosure, obtaining the first score of the first captioning information and the second score of the second captioning information based on the second sample multimedia data and the third sample multimedia data may specifically include the following.

The first captioning information is input to the captioning model together with the second sample multimedia data and with the third sample multimedia data, respectively, to obtain a first output probability distribution for the second sample multimedia data and a second output probability distribution for the third sample multimedia data. The first score of the first captioning information is then obtained based on the first output probability distribution and the second output probability distribution.

The second captioning information is input to the captioning model together with the second sample multimedia data and with the third sample multimedia data, respectively, to obtain a third output probability distribution for the second sample multimedia data and a fourth output probability distribution for the third sample multimedia data. The second score of the second captioning information is then obtained based on the third output probability distribution and the fourth output probability distribution.

Specifically, taking video as an example, the first video caption is treated as the ground truth and input to the video captioning model together with the second sample video data, to obtain a first output probability distribution at each time step of the first video caption. Likewise, the first video caption is treated as the ground truth and input to the video captioning model together with the third sample video data, to obtain a second output probability distribution at each time step of the first video caption. The KL divergence between the first output probability distribution and the second output probability distribution is then calculated, and the first score r_g is obtained based on these KL divergences.

As an alternative solution, the KL divergences may be multiplied by time-domain weights and then by -1 to obtain the first score r_g of the first video caption, as shown in Equation 12:

Here, W_t = T/t is the time-domain weight. To reduce the effect of error accumulation, a higher weight is given to the first words of the first video caption and a lower weight to its last words. Since the augmented video V' may include K items of data, K values of r_g can be obtained, and the final first score r'_g can be obtained from these K values. For example, the K values of r_g may be averaged to obtain the first score r'_g. The score may also be obtained in other ways, such as a weighted average in which videos V' obtained with different augmentation methods are given different weights.
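Equation 12 is not reproduced in this text. From the description above (per-step KL divergences multiplied by W_t = T/t, then by -1, then averaged over the K augmented videos), a consistent reconstruction would be as follows. The argument order of the KL divergence and the index k over the augmented videos are assumptions.

```latex
r_g^{(k)} = -\sum_{t=1}^{T} W_t \, D_{\mathrm{KL}}\!\Big( p_\theta\big(\cdot \mid c_{g,1:t-1},\, V\big) \,\Big\|\, p_\theta\big(\cdot \mid c_{g,1:t-1},\, V'_k\big) \Big),
\qquad W_t = \frac{T}{t}

r'_g = \frac{1}{K} \sum_{k=1}^{K} r_g^{(k)}
```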

Similarly, the second video caption is treated as the ground truth and input to the video captioning model together with the second sample video data, to obtain a third output probability distribution at each time step of the second video caption. Likewise, the second video caption is treated as the ground truth and input to the video captioning model together with the third sample video data, to obtain a fourth output probability distribution at each time step of the second video caption. The KL divergences between the third output probability distribution and the fourth output probability distribution are then calculated, multiplied by the time-domain weights, and then multiplied by -1 to obtain the second score r_s of the second video caption, as shown in Equation 13:

Since V' contains K items of data, the K values of r_s may be averaged to obtain the second score r'_s.
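The score computation described above (time-weighted KL divergences, negated, then averaged over K augmented videos) can be sketched in plain Python. This is an illustration under assumptions: the distributions are given as lists of per-step probability vectors, the KL direction is KL(original || augmented), and all function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def consistency_score(dists_v, dists_v_aug):
    """Negative time-weighted sum of KL divergences over T time steps, with W_t = T / t."""
    T = len(dists_v)
    total = 0.0
    for t, (p, q) in enumerate(zip(dists_v, dists_v_aug), start=1):
        total += (T / t) * kl_divergence(p, q)
    return -total

def averaged_score(dists_v, dists_per_augmentation):
    """Plain average of the K per-augmentation scores (r' in the text)."""
    scores = [consistency_score(dists_v, d) for d in dists_per_augmentation]
    return sum(scores) / len(scores)

# Toy two-step example over a two-word vocabulary.
p0, q0 = [0.9, 0.1], [0.5, 0.5]
early_mismatch = consistency_score([p0, p0], [q0, p0])  # disagreement at t = 1
late_mismatch = consistency_score([p0, p0], [p0, q0])   # disagreement at t = 2
```

Because W_t decreases with t, a disagreement at the first word lowers the score more than the same disagreement at the last word, which is the stated purpose of the weighting.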

After r'_g and r'_s are obtained, the second loss function of the video captioning model can be calculated from both r'_g and r'_s.

In the example shown in FIG. 26, after the first video caption c_g is obtained, c_g and the second sample video data V are input to the video captioning model M, and the first output probability distribution at each time step of the first video caption is obtained from the model output. For each augmented video V' (one V' is shown in FIG. 21), c_g and the video V' are input to the video captioning model M to obtain the second output probability distribution at each time step of the first video caption corresponding to that V'. The r_g corresponding to the first output probability distribution and each second output probability distribution is calculated through Equation 12 above, and r'_g (the KL divergence corresponding to c_g in the figure) is obtained by averaging or by other methods. Using the same calculation principle, r'_s (the KL divergence corresponding to c_s in the figure) can be obtained through Equation 13.

In an alternative embodiment of the present disclosure, obtaining the value of the second loss function based on the first score and the second score includes: using the difference between the first score and the second score as a reward value; and obtaining the second loss function of the captioning model based on the reward value and the second captioning information.

Specifically, the second loss function may be a policy gradient loss function, as shown in Equation 14:

Here, (r'_s - r'_g) is the reward value, that is, the second score minus the first score, and ∇θ denotes the gradient with respect to θ. After the second loss function is obtained, the policy gradient is used to train the captioning model. As follows from the description above, if the words obtained by sampling are more accurate, the KL divergences between the third output probability distribution and the fourth output probability distribution are smaller and the reward is larger, so the probability of outputting those words becomes larger after the model is updated. Conversely, if the words obtained by sampling are relatively poor, the KL divergences between the third output probability distribution and the fourth output probability distribution are larger and the reward is smaller, so the probability of outputting those words becomes smaller after the model is updated.
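Equation 14 is not reproduced in this text. A common form of a policy gradient loss with reward (r'_s - r'_g), which matches the behaviour described above, is sketched below; whether the original sums the log-probabilities over time steps in exactly this way is an assumption.

```latex
\nabla_\theta L(\theta) = -\,\big(r'_s - r'_g\big)\, \nabla_\theta \sum_{t=1}^{T} \log p_\theta\!\big( c_s(t) \mid c_{s,1:t-1},\, V \big)
```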

Still taking video as an example, as in the example shown in FIG. 26, a reward value (the reward shown in the figure) can be obtained based on the difference between r'_g (the KL divergence corresponding to c_g in the figure) and r'_s (the KL divergence corresponding to c_s in the figure). The value of the policy gradient loss can then be calculated from the reward value through Equation 14 above, and the value of the final loss function is obtained based on the value of the first loss function (that is, the cross-entropy loss shown in FIG. 26) and the value of the second loss function (that is, the policy gradient loss shown in FIG. 26).

In addition, as can be seen from the preceding description, it is also possible to generate only one type of captioning information based on the second sample multimedia data. In this case, the value of the second loss function is obtained based on that captioning information alone. Taking Equation 14 above as an example, r'_g may be removed from Equation 14, so that Equation 14 is rewritten as Equation 15:

That is, a corresponding score (for example, the second score in the example above) may be obtained based on the second captioning information alone, and the value of the second loss function may be obtained based on the second captioning information and that score.
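The two loss variants described above (reward with and without the greedy baseline r'_g) can be illustrated with a scalar sketch. It assumes the loss takes the usual policy gradient surrogate form, minus the reward times the summed log-probability of the sampled words; the function name and the numeric values are hypothetical.

```python
import math

def policy_gradient_loss(sampled_word_probs, score_sampled, score_greedy=0.0):
    """Surrogate loss: -(reward) * sum_t log p(c_s(t)), with reward = r'_s - r'_g.

    Leaving `score_greedy` at 0.0 corresponds to the variant in which r'_g is removed.
    """
    reward = score_sampled - score_greedy
    log_prob = sum(math.log(p) for p in sampled_word_probs)
    return -reward * log_prob

word_probs = [0.5, 0.2]  # model probabilities of the sampled words (illustrative)
loss_with_baseline = policy_gradient_loss(word_probs, score_sampled=-0.1, score_greedy=-0.4)
loss_without_baseline = policy_gradient_loss(word_probs, score_sampled=-0.1)
```

With a positive reward, minimizing the loss raises the log-probability of the sampled words; with a negative reward it lowers it. In a real model the gradient would be taken by an automatic differentiation framework rather than computed on scalars.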

Currently, in commonly used video captioning or image captioning datasets, each video or image generally has only a few captioning labels. For example, a training sample image typically has only five captioning labels, and it is often difficult to fully express the information in the image with only five labels. To increase the diversity of the captioning annotations of training samples, a method for obtaining captioning of multimedia data is provided in an embodiment of the present disclosure. Based on this method, the captioning labels of the sample multimedia data can be subjected to data augmentation to obtain augmented captioning information, which increases the number of captions of the sample data, so that a better multimedia data captioning model can be obtained by training on the sample data with the augmented captioning information.

Accordingly, in an alternative embodiment of the present disclosure, the captioning labels of the first sample multimedia data may include at least one original captioning label of the first sample multimedia data and the augmented captioning labels corresponding to each original captioning label.

FIG. 27 is a schematic flowchart of a method for obtaining captioning of multimedia data provided in an embodiment of the present disclosure. As shown in the figure, the method may include the following steps.

Step S2501: at least one original captioning label corresponding to the multimedia data is obtained.

That is, for the first sample multimedia data, the original captioning labels of the first sample multimedia data are obtained.

The multimedia data may be sample data in a training image dataset or training video dataset obtained from local storage or a local database as required, or sample data in a training dataset received from an external data source through an input device or a transmission medium. Taking an image as an example, the training image may have N predetermined image captioning labels, where N is a positive integer greater than or equal to 1. For example, the image in this solution may be a training image in a training image dataset commonly used in the image captioning field (for example, the MS-COCO dataset). An image in a commonly used training image dataset usually has five image captioning labels. The five image captioning labels of the same training image differ from one another but have similar meanings.

Step S2502: augmented captioning information corresponding to each original captioning label is generated based on each original captioning label corresponding to the multimedia data.

Specifically, a generator may generate the augmented captioning information corresponding to each original captioning label according to each original captioning label corresponding to the multimedia data. The generator is used to generate captioning sentences that differ from the original captioning labels but have similar meanings. That is, when a sentence of an original captioning label is input to the generator, the generator can generate, based on that sentence, a captioning sentence that is different from it but similar in meaning.

The process by which the generator produces a sentence is a time-sequential process. As an alternative, a decoding method may be used to generate the sentence: at the first time step, the input word vector is the start token and the output is the first word, which has the maximum predicted output probability; at the second time step, the input is the start token together with the output of the first time step, and the output is the second word, which has the maximum predicted output probability; and so on, until the output word is the stop token.
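The step-by-step decoding described above can be sketched as a loop that starts from a start token and stops at a stop token. The lookup table below is a hypothetical stand-in for the generator: for brevity it conditions only on the previous word, whereas the generator described here conditions on the start token and all previously output words.

```python
START, STOP = "[start]", "[stop]"

# Hypothetical next-word probabilities keyed by the previous word.
NEXT_WORD = {
    START: {"a": 0.9, "the": 0.1},
    "a": {"child": 0.8, "kid": 0.2},
    "child": {"sweeps": 0.6, "cleans": 0.4},
    "sweeps": {STOP: 0.7, "floors": 0.3},
}

def greedy_decode(next_word, max_len=10):
    """Emit the most probable word at each time step until the stop token appears."""
    tokens = [START]
    while len(tokens) <= max_len:
        dist = next_word[tokens[-1]]
        word = max(dist, key=dist.get)
        if word == STOP:
            break
        tokens.append(word)
    return tokens[1:]  # drop the start token

sentence = greedy_decode(NEXT_WORD)
```

The `max_len` guard is an added safety assumption so that decoding terminates even if the stop token is never the most probable word.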

The specific network structure of the generator is not limited in the embodiments of the present disclosure. As an alternative solution, the generator may be implemented using a self-attention-based encoder and a self-attention-based decoder.

As two examples, FIGS. 28A and 28B show schematic diagrams of the network structure of a generator provided in an embodiment of the present disclosure. As shown in FIGS. 28A and 28B, the generator may include a self-attention-based encoder (for example, a transformer encoder) and a self-attention-based decoder (for example, a transformer decoder). The encoder in the generator may include a multi-head attention layer, a layer normalization layer and a feed-forward network layer, and is used to encode the input original captioning labels. The decoder in the generator may include a masked multi-head attention layer, a multi-head attention layer, a layer normalization layer and a feed-forward network layer, and is used to decode the encoded image captioning label or video captioning label to obtain augmented image captioning information or augmented video captioning information. For a detailed description of the structure of each part of the encoder and decoder shown in FIGS. 28A and 28B, refer to the corresponding description of the encoder and decoder shown in FIG. 13, FIG. 22 or FIG. 23 above.

It should be noted that the network structure of the generator in the embodiments of the present disclosure may include, but is not limited to, the structures shown in the above examples, and any other available encoder and decoder may be used to implement the generator.

To ensure the accuracy of the augmented image captioning information generated by the generator, the generator also needs to be trained. Alternatively, the generator may be obtained by training in the following manner.

A training dataset is obtained, where the training dataset includes multiple items of training sample data and each item of training sample data includes N original captioning labels, N being a positive integer greater than or equal to 1.

The generator is trained based on the original captioning labels of the multiple items of training sample data in the training dataset, where the generator is used to generate captioning information that differs from the original captioning labels but has a similar meaning.

In addition, to improve the performance of the generator, as an alternative solution, a discriminator may be introduced when training the generator, and the generator is then trained in an adversarial manner. Specifically, training the generator may include the following.

The generator and the discriminator are trained alternately until the similarity value of the captioning information generated by the generator for each original captioning label of each item of training sample data satisfies a preset condition, where the discriminator is specifically used to determine the probability that captioning information generated by the generator is a true original captioning label.

The specific network structure of the discriminator may be set according to actual needs. It will be understood that the discriminator also needs to be trained. When the trained discriminator determines that a captioning sentence generated by the generator has a high probability of being a true original captioning label (for example, a probability exceeding a predetermined threshold), this means that the generated captioning sentence is close enough to the captioning of a real sample (that is, a true original captioning label) to "fool" the trained discriminator. In this case, such a captioning sentence can be used as augmented captioning information in the training process to increase sample diversity.
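The selection rule described above, keeping a generated sentence only when the discriminator assigns it a probability above a threshold, can be sketched as a simple filter. The discriminator is replaced here by a hypothetical lookup of precomputed probabilities, and the threshold of 0.5 is an assumed value.

```python
def select_augmented(candidates, discriminator_prob, threshold=0.5):
    """Keep generated captions that the discriminator rates as likely to be true labels."""
    return [c for c in candidates if discriminator_prob(c) > threshold]

# Hypothetical discriminator outputs for three generated captions.
fake_scores = {
    "a kid is sweeping the floor": 0.82,
    "a child cleans the ground": 0.64,
    "ground the a cleaning": 0.07,
}
kept = select_augmented(list(fake_scores), fake_scores.get)
```

Only the captions in `kept` would be added to the training labels as augmented captioning information.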

Specifically, during training, the generator and the discriminator may be trained alternately until the similarity value of the captioning information generated by the generator for each original captioning label of each item of training sample data satisfies the preset condition. The specific method of calculating the similarity value is not limited in the embodiments of the present disclosure.

As an alternative solution, the similarity value may be a CIDEr value. CIDEr is an evaluation metric commonly used to evaluate captioning performance. The higher the CIDEr value, the more similar the generated captioning sentence is to the true original captioning labels. The CIDEr metric treats each sentence as a "document" and represents it as a tf-idf vector. The cosine similarity between a reference (that is, ground-truth) captioning sentence and the generated captioning sentence is calculated and used as a score from which the CIDEr value is produced. Therefore, according to an exemplary embodiment of the present disclosure, the CIDEr value of a generated captioning sentence (an image captioning sentence or a video captioning sentence) may be calculated based on the similarity between that sentence and the N original captioning labels of the training sample data to which the original captioning label used to generate the sentence belongs. For example, taking an image as an example, when the CIDEr value of the image captioning sentence generated by the generator for each original captioning label of each training image satisfies the preset condition, this means that the generator can generate image captioning sentences that are very similar to true image captioning labels, that is, the training of the generator and the discriminator is complete.
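The idea behind the similarity score, tf-idf vectors compared by cosine similarity, can be sketched as follows. This is a deliberately simplified unigram illustration and not the full CIDEr metric, which also uses higher-order n-grams and corpus-level statistics; the smoothed idf formula and all function names are assumptions.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """Term frequency times a smoothed inverse document frequency for each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: (c / total) * (math.log((1 + num_docs) / (1 + doc_freq.get(w, 0))) + 1)
            for w, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def unigram_similarity(candidate, references):
    """Average cosine similarity between a generated caption and the reference labels."""
    docs = [r.split() for r in references]
    doc_freq = Counter(w for d in docs for w in set(d))
    cand = tfidf_vector(candidate.split(), doc_freq, len(docs))
    sims = [cosine(cand, tfidf_vector(d, doc_freq, len(docs))) for d in docs]
    return sum(sims) / len(sims)

labels = ["a child is cleaning the ground", "a kid sweeps the floor"]
close = unigram_similarity("a child is sweeping the ground", labels)
unrelated = unigram_similarity("dogs run fast", labels)
```

A caption that shares most words with the labels scores higher than an unrelated one, which is the property the stopping condition relies on.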

The preset condition may include: the similarity value of the captioning information generated for each original captioning label of each item of training sample data reaching a predetermined threshold; or the average similarity value of the captioning information generated for the original captioning labels of each item of training sample data reaching a predetermined threshold. The preset condition may be a system default or may be set by the user according to need or experience. In addition, whether the training of the generator and the discriminator is complete may be judged according to the user's need or experience. For example, once the generator and the discriminator have been trained to a certain degree, the user can test the generator with a batch of training sample data and observe whether its output is satisfactory. When the output of the generator is satisfactory, the training of the generator and the discriminator can be ended.

In an alternative embodiment of the present disclosure, alternately training the generator and the discriminator includes: training the discriminator with the generator parameters fixed; and training the generator with the parameters of the trained discriminator fixed.

즉, 생성기와 판별기가 교대로 트레이닝될 때에, 고정된 생성기 파라미터들로 상기 판별기가 먼저 트레이닝되며, 그 후에, 고정된 트레이닝된 식별자 파라미터들로 상기 생성기가 트레이닝된다. 상이한 트레이닝 샘플 데이터세트들에 대해, 위의 트레이닝 과정이 반복적으로 수행될 수 있다. 예를 들어, 이미지를 예로 들면, 상기 제1 트레이닝 이미지 세트에 대해, 생성기 및 판별기의 원래 파라미터들 (즉, 네트워크 구조 파라미터들)을 기반으로 상기 판별기와 생성기가 일단 트레이닝된다. 이어서, 상기 제2 트레이닝 이미지 세트에 대해, 상기 제1 트레이닝 이미지 세트에 대해 트레이닝된 상기 판별기 및 생성기의 파라미터들을 기반으로 상기 판별기 및 생성기가 다시 한 번 트레이닝된다. 그런 다음, 상기 제3 트레이닝 이미지 세트에 대해, 상기 제2 트레이닝 이미지 세트에 대해 트레이닝된 상기 판별기 및 생성기의 파라미터를 기반으로 상기 판별기 및 생성기가 다시 한 번 트레이닝되며, 이는 각 트레이닝 이미지의 각각의 이미지 캡셔닝 라벨들을 위해 상기 생성기에 의해 생성된 이미지 캡셔닝 정보의 유사도 값이 미리 설정된 조건을 충족시키거나 또는 상기 생성기의 출력 결과가 상기 사용자에 의해 테스트된 이후에 만족스러울 때까지 계속된다.That is, when the generator and the discriminator are alternately trained, the discriminator is first trained with fixed generator parameters, and then the generator is trained with the fixed trained identifier parameters. For different training sample datasets, the above training process may be iteratively performed. For example, taking an image as an example, for the first set of training images, the discriminator and the generator are trained once based on the original parameters (ie, network structure parameters) of the generator and the discriminator. Then, for the second set of training images, the discriminator and generator are trained once again based on the parameters of the discriminator and generator trained on the first set of training images. Then, for the third set of training images, the discriminator and generator are trained once again based on the parameters of the discriminator and generator trained on the second set of training images, which each This is continued until the similarity value of the image captioning information generated by the generator for the image captioning labels of either satisfies a preset condition or the output result of the generator is satisfied after being tested by the user.
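The alternating schedule above can be sketched as a small driver loop. This is only a sketch: `train_discriminator`, `train_generator` and `is_done` are hypothetical callbacks supplied by the caller, and how each one freezes the other network's parameters is left to those callbacks.

```python
def alternate_training(datasets, train_discriminator, train_generator, is_done):
    """Alternate discriminator and generator updates over successive datasets.

    All three callbacks are hypothetical hooks supplied by the caller:
      train_discriminator(batch): updates the discriminator, generator frozen
      train_generator(batch):     updates the generator, discriminator frozen
      is_done():                  True once the preset condition is satisfied
    Returns the number of datasets consumed.
    """
    rounds = 0
    for batch in datasets:
        train_discriminator(batch)  # step 1: generator parameters fixed
        train_generator(batch)      # step 2: trained discriminator fixed
        rounds += 1
        if is_done():
            break
    return rounds
```

Each round starts from the parameters left by the previous round, which matches the first, second and third training image sets in the description.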

In an alternative embodiment of the present disclosure, the discriminator may be trained by the following operations.

The following operations are performed for each original captioning label of each training sample data (in this mode, the number of original captioning labels of the sample data is greater than 1, that is, N is greater than 1).

The original captioning label is paired with each of the other N-1 original captioning labels of the training sample data to generate N-1 first pairs. The original captioning label is also input to the generator, which generates captioning information, and the generated captioning information is paired with the original captioning label to generate a second pair. Based on the N-1 first pairs and the second pair, the discriminator can be trained using a cross-entropy loss function, where the output of the discriminator is the probability that a given pair consists of two true original captioning labels.

That is, one original captioning label (also referred to as a baseline label) is paired with each of the other N-1 original captioning labels to obtain N-1 reference pairs (that is, sample pairs). Based on the baseline label, the generator may generate N-1 pieces of captioning information, and the baseline label is paired with each of the N-1 pieces of generated captioning information to obtain prediction pairs. The value of the loss function is calculated from the sample pairs and the corresponding prediction pairs, and the network parameters of the discriminator are adjusted based on that value until preset conditions are met. For example, in the case of an image, the discriminator outputs, for each prediction pair, that the probability of the pair being two true image captioning labels (that is, a reference pair) is greater than a set threshold.
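The pair construction and the cross-entropy loss can be sketched as follows. The sketch assumes a target of 1 for a pair of two original labels and 0 for a pair that contains a generated caption, which follows from the discriminator's output being the probability that a pair consists of two true labels. The function names and the single generated caption per baseline label are illustrative assumptions.

```python
import math

def build_pairs(labels, baseline_index, generated_caption):
    """Return (pair, target) tuples for one baseline label.

    Pairs of two original labels get target 1.0 (the "first pairs");
    the pair of the baseline label with a generated caption gets
    target 0.0 (the "second pair").
    """
    baseline = labels[baseline_index]
    pairs = [((baseline, other), 1.0)
             for i, other in enumerate(labels) if i != baseline_index]
    pairs.append(((baseline, generated_caption), 0.0))
    return pairs

def cross_entropy(probabilities, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and targets."""
    total = 0.0
    for p, t in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)
```

With N = 3 labels, `build_pairs` yields two first pairs and one second pair; the discriminator's probabilities for those pairs and the targets would then be passed to `cross_entropy`.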

In an alternative embodiment of the present disclosure (which may be referred to as scheme 1), with the parameters of the trained discriminator fixed, training the generator may include performing the following operations for each original captioning label of each training sample data.

The original captioning label is input to the generator to generate captioning information by a greedy decoding method, and the following operations are performed on the generated captioning information.

A similarity value corresponding to the generated captioning information is calculated based on the generated captioning information and the N original captioning labels of the corresponding training image. The generated captioning information is paired with the original captioning label to generate a second pair, and the trained discriminator is used to obtain a probability value that the second pair consists of two true original captioning labels. The calculated similarity value and the obtained probability value are weighted and summed to obtain a reward, and the parameters of the generator are adjusted according to the obtained reward.

In another alternative embodiment of the present disclosure (which may be referred to as scheme 2), with the parameters of the trained discriminator fixed, training the generator may include performing the following operations for each original captioning label of each training sample data.

The original captioning label is input to the generator to generate first captioning information by a greedy decoding method.

The original captioning label is input to the generator to generate second captioning information by a Monte Carlo sampling method.

The following operations are performed on the generated first captioning information.

A first similarity value corresponding to the generated first captioning information is calculated based on the generated first captioning information and the N original captioning labels of the corresponding training image. The generated first captioning information is paired with the original captioning label to generate a second pair, and the trained discriminator is used to obtain a first probability value that this pair consists of two true original captioning labels. The calculated first similarity value and the obtained first probability value are weighted and summed to obtain a first reward.

The following operations are performed on the generated second captioning information.

A second similarity value corresponding to the generated second captioning information is calculated based on the generated second captioning information and the N original captioning labels of the corresponding training image. The generated second captioning information is paired with the original captioning label to generate a second pair, and the trained discriminator is used to obtain a second probability value that this pair consists of two true original captioning labels. The calculated second similarity value and the obtained second probability value are weighted and summed to obtain a second reward.

The parameters of the generator are adjusted according to a final reward, which is the difference between the first reward and the second reward.

In practical applications, because text data is discrete, it is difficult to propagate the gradient of the discriminator back to the generator. To solve this problem, a policy gradient method may be adopted as an alternative. The reward is calculated based on the captioning sentence generated by the generator. The higher the reward, the better the currently generated captioning sentence, and the more strongly the parameters of the generator are adjusted in that direction. In the traditional method, the reward includes only the output of the discriminator, whereas the reward in the alternative embodiments above may include two parts, the output of the discriminator and a similarity value (for example, a CIDEr value), which are weighted and summed into the final reward. Using more varied data to determine the reward for adjusting the generator's parameters lets the generator learn more information effectively and generate augmented captioning information that is similar to, but different from, the original captioning labels. The trained generator thus provides better augmented image captioning information, and sample data containing the augmented captioning information provides a larger and better data basis for training the multimedia data captioning model.

To better understand and explain the training scheme of the generator provided in an embodiment of the present disclosure, the training scheme is described in more detail below with reference to FIGS. 28A and 28B. In this example, the multimedia data is an image. It can be understood that the principle of this example also applies to video.

As an alternative example, as shown in FIG. 28A, for scheme 1 above, the following operations may be performed for each image captioning label of each training image when training the generator. The image captioning label (for example, X1:T shown in the figure) is input to the generator to generate image captioning information by the greedy decoding method, and the following operations are performed on the generated image captioning information (for example, Y1:T). A similarity value corresponding to the generated image captioning information is calculated based on the generated image captioning information and the N image captioning labels of the corresponding training image. The generated image captioning information is paired with the image captioning label to generate a second pair (for example, (X1:T, Y1:T)), and the trained discriminator is used to obtain a probability value that the second pair consists of two true image captioning labels. The calculated similarity value and the obtained probability value are weighted and summed to obtain a reward, and the parameters of the generator are adjusted according to the obtained reward.

Specifically, as shown in FIG. 28A, the CIDEr value calculated for the image captioning sentence y^b generated by the generator with the greedy decoding method and the probability value obtained by the discriminator for the same sentence y^b are weighted and summed to obtain the reward r(y^b). The weighted sum is given by Equation 16:

Here, r (corresponding to r(y^b) in FIG. 26A) is the reward, τ is the weighting coefficient, D_φ is the probability value output by the discriminator, and C is the CIDEr value (corresponding to the CIDEr score in FIG. 26A).
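Equation 16 is not reproduced in this text. One form consistent with the symbol definitions above is shown below, assuming that the single weighting coefficient τ splits the weight between the two terms; this split is an assumption, and the original equation may distribute the weights differently.

```latex
r(y^{b}) = \tau \, D_{\phi}(y^{b}) + (1 - \tau) \, C(y^{b}), \qquad 0 \le \tau \le 1
```

Under this assumed form, τ = 1 recovers the traditional reward that uses only the discriminator output, and τ = 0 uses only the CIDEr value.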

In the example shown in FIG. 28A, the discriminator may have a CNN-based structure; for example, it may include convolutional layers, a max pooling layer, and the like. Specifically, each pair of an image captioning label and the corresponding generated image captioning information, that is, each second pair, may be processed by embedding to obtain a corresponding feature vector. Convolutional layers with several different convolution parameters perform convolution on the feature vector, each convolution result is pooled by the max pooling layer, the pooling results are concatenated, and the probability value corresponding to each second pair is predicted from the concatenated vector.
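The embed, convolve, max-pool, concatenate and predict pipeline can be sketched in plain Python as below. This is a toy sketch, not the disclosed network: the embeddings are random rather than learned, there is one filter per kernel width, the two sentences are joined with a hypothetical `<sep>` token, and a sigmoid over a linear layer is assumed for the final prediction.

```python
import math
import random

def embed(tokens, dim, rng):
    """Map each token to a vector (random here; learned in practice)."""
    table = {}
    out = []
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1, 1) for _ in range(dim)]
        out.append(table[tok])
    return out

def conv_max_pool(vectors, kernel):
    """1-D convolution over time with one filter, followed by max pooling.

    kernel is a list of `width` weight vectors; the filter slides over
    every window of `width` consecutive token vectors.
    """
    width = len(kernel)
    if len(vectors) < width:
        raise ValueError("sequence shorter than kernel width")
    responses = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        responses.append(sum(w * x
                             for wv, xv in zip(kernel, window)
                             for w, x in zip(wv, xv)))
    return max(responses)

def pair_probability(label, caption, kernels, out_weights, dim=4, seed=0):
    """Probability that (label, caption) is a pair of two true labels."""
    rng = random.Random(seed)
    tokens = label.split() + ["<sep>"] + caption.split()
    vectors = embed(tokens, dim, rng)
    # One pooled feature per kernel; the list plays the role of the
    # concatenated vector.
    pooled = [conv_max_pool(vectors, k) for k in kernels]
    logit = sum(w * f for w, f in zip(out_weights, pooled))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid
```

Kernels of different widths (for example 2 and 3 tokens) correspond to the "different convolution parameters" in the description.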

In addition, according to the other alternative scheme, that is, scheme 2 above, the generator may also be trained with a self-critical mechanism to further improve the effect. That is, the difference between the reward of the image captioning sentence obtained by Monte Carlo sampling and the reward of the image captioning sentence obtained by greedy decoding is used as the final reward, and the parameters of the generator are adjusted according to the obtained reward.

In the example shown in FIG. 28B, the CIDEr value calculated for the image captioning sentence y^b generated by the generator with greedy decoding and the probability value obtained by the discriminator for y^b are weighted and summed to obtain the reward r(y^b). Likewise, the CIDEr value calculated for the image captioning sentence y^s generated by the generator with Monte Carlo sampling and the probability value obtained by the discriminator for y^s are weighted and summed to obtain the reward r(y^s). The difference r(y^s) - r(y^b) is used as the final reward for adjusting the parameters of the generator. The reward r(y^s) is obtained in the same way as r(y^b) in the description of FIG. 28A; the principle is the same and the description is not repeated here.
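A small numeric sketch of the two rewards and their difference follows. The convex weighting `tau * D + (1 - tau) * C` is an assumption made for illustration, since the text only states that the two values are weighted and summed; the function names are likewise hypothetical.

```python
def mixed_reward(disc_prob, cider, tau):
    """Weighted sum of discriminator probability and CIDEr value.

    The convex form tau * D + (1 - tau) * C is an assumption.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return tau * disc_prob + (1.0 - tau) * cider

def self_critical_reward(sampled, greedy, tau):
    """Final reward r(y^s) - r(y^b).

    sampled and greedy are (disc_prob, cider) tuples for the Monte Carlo
    sample y^s and the greedy-decoded caption y^b respectively.
    """
    return mixed_reward(*sampled, tau) - mixed_reward(*greedy, tau)
```

A positive final reward means the sampled sentence scored better than the greedy baseline, so the generator is pushed toward producing it.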

In another embodiment of the present disclosure, to avoid duplicates in the generated augmented captioning information, the method may further include the following.

When duplicate augmented captioning information exists, the original captioning label corresponding to the duplicate augmented captioning information is input to the generator again, and the generator regenerates the augmented captioning information with a beam search method by adjusting the beam size.

Taking an image as an example, after training of the generator and the discriminator is complete, the trained generator can be used to generate, for each image captioning label of the image, the corresponding augmented image captioning information. These augmented image captions differ from the true image captioning labels, but they may duplicate one another.

To solve the problem that the augmented captioning information may contain duplicates, a beam search method may be adopted as an alternative to regenerate the augmented captioning information. By default, the generator produces augmented captioning information based on maximum probability, that is, the word with the highest predicted probability is output at each time step (equivalent to a beam size of 1). The beam search method can change the generator's output by changing the beam size (to 2, 3, and so on). For example, when two identical augmented captions exist, the true caption (that is, the original captioning label) corresponding to one of them is input to the generator, the beam size is set to 2, and the generator is used to generate different augmented captioning information. For example, the generator may output the two words with the top two probabilities at the first time step, say {a} and {b}. At the next time step, the two most probable words following each of {a} and {b} are output, say {a, c}, {a, d}, {b, e} and {b, f}. The two sequences with the highest probability, say {a, c} and {b, e}, are then selected from these four, and the following time steps proceed in the same way. As another example, when three identical augmented captions exist, the original captioning label corresponding to one of them is input to the generator, the beam size is set to 2, and the generator is used to generate different augmented captioning information. In addition, the true caption corresponding to one of the augmented captions may be input to the generator with the beam size set to 3, so that the generator produces further different augmented captioning information, and so on. In this way, different beam sizes are used to generate augmented captioning sentences and change the generated results, which solves the duplication problem.
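The beam search walk-through above can be reproduced with a short sketch. The callable `next_probs`, the toy table `TOY` and its probabilities are hypothetical stand-ins for the generator's per-step word distribution; the sketch expands every continuation and keeps the best `beam_size` sequences, which yields the same result as the example in the text.

```python
import math

def beam_search(next_probs, beam_size, steps):
    """Return the `beam_size` most probable sequences after `steps` tokens.

    next_probs(prefix) is a hypothetical callable returning a dict
    {token: probability} for the token that follows `prefix` (a tuple).
    beam_size=1 reduces to greedy decoding.
    """
    beams = [((), 0.0)]  # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in next_probs(seq).items():
                if p > 0.0:
                    candidates.append((seq + (token,), score + math.log(p)))
        # Keep only the best `beam_size` partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return [seq for seq, _ in beams]

# Toy distribution mirroring the {a}, {b} example in the text.
TOY = {
    (): {"a": 0.5, "b": 0.4, "z": 0.1},
    ("a",): {"c": 0.6, "d": 0.4},
    ("b",): {"e": 0.7, "f": 0.3},
    ("z",): {"q": 1.0},
}

def toy_next_probs(prefix):
    return TOY[prefix]
```

With a beam size of 1 the search returns only the greedy sequence (a, c); with a beam size of 2 it also returns (b, e), which can replace a duplicate caption.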

After the augmented captioning information is obtained with the method provided in an embodiment of the present disclosure, the original captioning labels of the multimedia data and the corresponding augmented captioning information may together be used as the captioning label information of the multimedia data. Multimedia data samples that carry more label information are used to train the initial multimedia data captioning model to obtain a better captioning model. Specifically, taking an image as an example, an image captioning model may be obtained by training in the following manner.

Training samples are obtained. Each sample image in the training samples has corresponding label information, and the label information includes at least one image captioning label of the sample image and the augmented image captioning information corresponding to each image captioning label.

The original image captioning model is trained on the sample images until a preset training termination condition is satisfied, yielding a trained image captioning model.

For each sample image, the augmented image captioning information corresponding to the sample image is obtained with the method for obtaining image captioning provided in any alternative embodiment of the present disclosure. The specific network structure of the image captioning model is not limited in the embodiments of the present disclosure. For example, it may be an image captioning model based on an encoder and a decoder, such as the encoder-decoder image captioning model shown in FIG. 22 or FIG. 23.

As an example, FIG. 29 shows a schematic flowchart of a method for training an image captioning model provided in an embodiment of the present disclosure. In this example, the image captioning model is based on an encoder and a decoder. As shown in the figure, the method may include the following steps.

Step S2701: For each training image (that is, sample image) in the training image dataset, first training is performed on the encoder and the decoder using a cross-entropy loss function.

Alternatively, the training image in this step may be a training image in a training image dataset, and the training image may have N predetermined image captioning labels, where N is a positive integer greater than or equal to 1. For example, the training image may be a training image with five image captioning labels from a training image dataset commonly used in the image captioning field (for example, the MS-COCO dataset).

Specifically, the captioning information corresponding to the training image may be obtained from the training image with reference to, or based on, the various alternative methods of FIG. 24, and the augmented image captioning information corresponding to the training image may be obtained with reference to the method shown in FIG. 27. Based on the obtained captioning information, the image captioning labels of the training image, and the augmented image captioning information, the encoder and the decoder are trained using a cross-entropy loss function. For example, they may be trained with the cross-entropy loss function of Equation 17.

Here, J_xe(θ) denotes the loss, θ denotes the parameters of the transformer encoder 302 and the transformer decoder 303, t denotes the current time step, T denotes the maximum time step, y_t denotes the word output at the current time step, y_{1:t-1} denotes the ground-truth words at the previous time steps, I denotes the current image, and p_θ denotes the probability that the output word is the ground-truth word. The first image captioning sentence is the combination of the words with the maximum output probability at each time step, and y_t at each time step can be obtained from the first image captioning sentence. In addition, the ground-truth word at each time step may be obtained from each of the image captioning labels and the augmented image captioning information of the training image.
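Equation 17 is not reproduced in this text. The standard sequence cross-entropy loss that matches the symbol definitions above is shown below; it is given as a reconstruction under that assumption, not as a verbatim copy of the original equation.

```latex
J_{xe}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left( y_{t} \mid y_{1:t-1}, I \right)
```

Minimizing this loss maximizes the log-probability that the decoder assigns to each ground-truth word, given the image and the preceding ground-truth words.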

Step S2702: When the first training of the encoder and the decoder is complete, second training is performed, for each training image in the training image dataset, on the encoder and the decoder obtained from the first training, using a policy gradient and/or a self-critical mechanism.

Specifically, the policy gradient is used because the optimization target of the cross-entropy loss differs from the metric used to evaluate the captioning (for example, the CIDEr value). To solve this problem, the policy gradient is used to optimize the CIDEr value directly. The formula is given by Equation 18.

Here, J(θ) is the loss, θ denotes the parameters of the encoder and the decoder, E denotes the expectation, y^s is a sampled image captioning sentence, r(y^s) is the reward, namely the CIDEr value, and y^s ~ p_θ denotes the set of image captioning sentences sampled under the current network parameters.
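Equation 18 is not reproduced in this text. The usual policy-gradient objective that matches the symbol definitions above is the negative expected reward, shown below as a reconstruction under that assumption.

```latex
J(\theta) = -\,\mathbb{E}_{y^{s} \sim p_{\theta}}\left[ r(y^{s}) \right]
```

Minimizing J(θ) therefore maximizes the expected CIDEr value of sentences sampled from the model.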

In the self-critical mechanism, the reward is set to the difference between the CIDEr value of the image captioning sentence obtained by Monte Carlo sampling and the CIDEr value of the image captioning sentence obtained by greedy decoding; that is, the result of greedy decoding is used as a baseline to constrain the reward, which gives a better effect. The formula is given by Equation 19.

Here, the baseline sentence in the reward difference is the image captioning sentence obtained by greedy decoding, y^s is the image captioning sentence obtained by Monte Carlo sampling, r is the calculated CIDEr value, ∇_θ L(θ) is the gradient of the loss, and p_θ(y^s) is the probability of sampling y^s.
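Equation 19 and the symbol for the greedy-decoded sentence are not reproduced in this text. The commonly used self-critical gradient estimate that matches the definitions above is shown below; writing the greedy-decoded sentence as ŷ is an assumption made for this reconstruction.

```latex
\nabla_{\theta} L(\theta) \approx -\left( r(y^{s}) - r(\hat{y}) \right) \nabla_{\theta} \log p_{\theta}(y^{s})
```

When the sampled sentence scores higher than the greedy baseline, the update raises the probability of the sampled sentence; when it scores lower, the update reduces it.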

Alternatively, when the second training is performed, a first image captioning sentence may be obtained by greedy decoding with reference to, or based on, the method shown in FIG. 24, and a second image captioning sentence may be obtained by Monte Carlo sampling with reference to, or based on, the method shown in FIG. 24. The second training is then performed on the first-trained encoder and decoder using the policy gradient and the self-critical mechanism. Specifically, the CIDEr value of the first image captioning sentence may be calculated based on the similarity between the first image captioning sentence and the N image captioning labels of the corresponding training image, and the CIDEr value of the second image captioning sentence may be calculated based on the similarity between the second image captioning sentence and the N image captioning labels of the corresponding training image. The difference between the CIDEr value of the first image captioning sentence and the CIDEr value of the second image captioning sentence is calculated to obtain the reward, and the parameters of the first-trained encoder and decoder are adjusted according to the obtained reward.
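For intuition, the self-critical update can be worked through for a single categorical decision over a small vocabulary. This is a sketch under assumptions: a real captioning model sums such terms over all time steps and backpropagates through the network, the function names are hypothetical, and the sign convention (sampled reward minus greedy reward) follows the description of the self-critical mechanism above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_critical_grad(logits, sampled_index, reward_sampled, reward_greedy):
    """Gradient of the self-critical loss w.r.t. the logits of one
    categorical decision.

    Uses d/dz log p(y) = onehot(y) - softmax(z), so the loss gradient is
    -(r_sampled - r_greedy) * (onehot(y) - softmax(z)).
    """
    probs = softmax(logits)
    advantage = reward_sampled - reward_greedy
    return [-advantage * ((1.0 if i == sampled_index else 0.0) - p)
            for i, p in enumerate(probs)]
```

With two equally likely words and a sampled reward above the greedy reward, the gradient on the sampled word's logit is negative, so a gradient-descent step increases that logit and makes the sampled word more likely.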

It should be understood that the training method of the image captioning model provided in the embodiments of the present disclosure is only one alternative training method. As long as the augmented captioning information obtained by the approach for obtaining augmented image captioning information provided in the embodiments of the present disclosure is used as label information of the sample image, the amount and diversity of the label data of the sample image can be increased. Training the image captioning model on training data that includes such sample images can effectively improve the performance of the model. In addition, it is apparent to those skilled in the art that the solution applicable to images is also applicable to video, the principle being the same.

In addition, to improve the expressiveness of the captioning labels of the training samples, in an alternative embodiment of the present disclosure, a generative adversarial network may be used to perform captioning data augmentation, and the samples produced by the captioning data augmentation may be applied to the training of the captioning model, thereby increasing the diversity of the samples and further improving the effect of the video captioning model or the image captioning model of the present disclosure.

The following describes the method for generating captioning information of multimedia data provided by the present disclosure with reference to two schematic diagrams.

Taking a video as an example, FIG. 30 shows a schematic flowchart of a method for generating video captioning information according to the present disclosure. As shown in the figure, for a given video, frames of the video may be selected through a frame selection step. The regional encoder shown in the figure is used to extract local visual features of the respective target regions in each of the selected frames. Alternatively, the regional encoder may include a regional feature extraction network, a relationship detector (i.e., a relationship prediction network), an attribute detector (i.e., an attribute prediction network), and an action detector (i.e., an action classifier).

To obtain a trained video captioning model, the model with the encoder-decoder structure shown in this example may be trained in a semi-supervised learning (i.e., semi-supervised training) manner, as shown in FIG. 30, before it is used for video captioning. The model may be trained using labeled videos (i.e., videos with captioning labels) and unlabeled videos (i.e., videos without captioning labels). During training, multiple frames of each video may be used. An unlabeled video may be augmented by data augmentation processing to obtain an augmented video. At least one video caption may be obtained for the unlabeled video, and a score of each video caption may be obtained based on the augmented video, so as to obtain the value of a second loss function. For a labeled video, the value of a first loss function may be obtained based on the target captioning information of the video output by the model and the corresponding label information. The value of the total loss function of the model is obtained based on the value of the first loss function and the value of the second loss function, and the training of the model is guided by this value until the total loss function of the model converges to a minimum.

In addition, the part shown in the figure for obtaining augmented captioning information is used to obtain a larger amount of more diverse label information when training the model. Based on the original captioning label (i.e., the ground-truth captioning information) of a labeled video, augmented captioning information corresponding to that captioning label may be generated through the generator. Both the original captioning label and the augmented captioning label are used as captioning label information of the sample video data during training, thereby increasing the amount and diversity of the captioning label information. Using more label information together with the captioning information predicted by the decoder of the model to guide model training can further improve the stability of the model and the accuracy of the generated captioning information.

In the example shown in FIG. 30, when video processing is performed based on the trained video captioning model, the regional encoder may extract the local visual features, relationship features, attribute features, and the like of the respective target regions in each frame of the video, and a scene graph may be constructed for each frame based on the extracted features. In this example, the scene graph may be a spatio-temporal scene graph that incorporates temporal information, and the corresponding updated features (i.e., graph convolution features) may be obtained through a graph convolution network. Correspondingly, for the decoder part of the model, this example may use a self-attention-based intra-decoder and a self-attention-based inter-decoder. When generating the captioning information, information about the captioning information that the user expects to be generated may also be obtained, so that a text caption of the video that better meets the user's requirements can be generated. For example, while the user is driving, real-time video of the scene in front of the user's line of sight may be collected and analyzed; when the user needs a prompt, such as when there is a potential danger, the captioning information generated for the video may be analyzed to provide the user with a corresponding reminder, or the captioning information may be played to the user.

As another example, FIG. 31 shows a schematic flowchart of a method for generating video captioning information according to the present disclosure. The three-dimensional (3D) visual feature encoder (i.e., a spatio-temporal feature extraction network), the regional encoder, and the semantic encoder (i.e., a semantic prediction network) shown in the figure are encoders for extracting, respectively, the spatio-temporal visual features of the video, the local visual features (the local features shown in the figure) of the respective target regions in each frame of the video, and the semantic features of the video. Based on the local visual features, a spatio-temporal scene graph may be constructed for each frame. The graph convolution features (the updated local features shown in the figure) may be obtained through the graph convolution network. For the obtained spatio-temporal visual features, semantic features, and graph convolution features, the feature selection network performs feature selection on the various features, that is, it determines the weight of each feature. A weighted integration of the features may then be performed based on the weight of each feature to obtain an integrated feature. The decoder bank (composed of several decoders, as shown in the figure) performs decoding according to the integrated feature and the length information of the desired captioning information, and the final captioning information (the output captioning information shown in the figure) is obtained according to the results of the respective decoders. For example, the decoding results of the respective decoders may be averaged, and the final captioning information may be obtained based on the averaged result. Alternatively, the decoder bank may include a self-attention-based intra-decoder, and the length information of the desired captioning information may be input to the decoder so that the decoder controls the length of the finally generated captioning information.
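The weighted integration of features and the averaging of decoder results can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the per-feature scores would come from the feature selection network, but here they are plain input numbers; normalizing them with a softmax is an assumption; all feature vectors are assumed to share one dimension; and averaging per-step word distributions is one possible reading of "averaging the decoding results". The function names `softmax`, `fuse_features`, and `average_decoders` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-feature scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, scores):
    """Weighted sum of equally sized feature vectors, e.g. the
    spatio-temporal, semantic, and graph convolution features."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def average_decoders(distributions):
    """Average the word distributions that several decoders produce
    for the same decoding step."""
    count = len(distributions)
    vocab = len(distributions[0])
    return [sum(d[i] for d in distributions) / count for i in range(vocab)]


fused = fuse_features([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
averaged = average_decoders([[0.2, 0.8], [0.6, 0.4]])
next_word = max(range(len(averaged)), key=lambda i: averaged[i])
```

With equal scores, both features receive the same weight, and the word chosen at the step is the one with the highest averaged probability.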

Similarly, to obtain a trained video captioning model, the model with the encoder-decoder structure shown in this example may be trained in a semi-supervised learning (i.e., semi-supervised training) manner, with adversarial training also used for the training, before it is used for video captioning. For details of the training process, refer to the description of model training given above for FIG. 30, which is not repeated here.

In addition, in practical applications, after the characteristic information of the video is obtained through the encoding part shown in FIG. 30 and FIG. 31 and before decoding by the decoder, it is also possible to encode the extracted characteristic information by using the self-attention-based intra-frame encoder and the self-attention-based inter-frame encoder, and to input the encoded features to the decoder.

Based on the same principle as the method for generating captioning information of multimedia data provided in the embodiments of the present disclosure, the present disclosure also provides an apparatus for generating captioning information of multimedia data. As shown in FIG. 32, the apparatus 100 for generating captioning information may include a characteristic information extraction module 110 and a captioning information generating module 120.

The characteristic information extraction module 110 is configured to extract characteristic information of multimedia data to be processed, wherein the multimedia data includes a video or an image.

The captioning information generating module 120 is configured to generate a text caption of the multimedia data based on the extracted characteristic information.

Alternatively, the captioning information generating module 120 is specifically configured to perform at least one of the following: extracting local visual features of the targets included in the respective target regions of each image in the multimedia data; extracting semantic features of the multimedia data; extracting spatio-temporal visual features of the multimedia data when the multimedia data is a video; extracting global visual features of the multimedia data; extracting attribute features of the targets included in the respective target regions of each image in the multimedia data; and extracting global attribute features of each image in the multimedia data.

In addition, the characteristic information includes local visual features of the targets included in the respective target regions of each image of the multimedia data, and the captioning information generating module 120 is specifically configured to: obtain relationship features between the targets based on the local visual features of each target in the image; construct a scene graph of the image based on the local visual features and the relationship features; obtain graph convolution features of the image based on the scene graph of the image; and generate the text caption of the multimedia data based on the graph convolution features of each image of the multimedia data.

Alternatively, the scene graph includes a plurality of nodes and a plurality of edges, wherein one node represents the local visual feature of one target, and each of the plurality of edges represents the relationship feature between the two nodes it connects.

Alternatively, the characteristic information includes attribute features of the targets included in the respective target regions of each image in the multimedia data. When constructing the scene graph of the image, the captioning information generating module 120 is specifically configured to construct the scene graph of the image based on the local visual feature of each target, the relationship features between the targets, and the attribute feature of each target, wherein one node of the scene graph represents the local visual feature or the attribute feature of one target corresponding to a target region.

Alternatively, if the multimedia data is a video, the images of the multimedia data are a plurality of frames selected from the video, and if the target regions of two adjacent frames include the same target, the scene graphs of the two adjacent frames have a temporal edge between the nodes corresponding to that same target.
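The node, edge, and temporal-edge structure described above can be pictured with a small Python sketch. It rests on assumptions made for illustration only: each frame is given as a dictionary of target features and pairwise relationship features, and a target is recognized as "the same" in adjacent frames simply because it carries the same identifier. How targets are actually matched across frames is not specified here, and the names `SceneGraph` and `build_graph` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # (frame_index, target_id) -> local visual feature of that target
    nodes: dict = field(default_factory=dict)
    # ((frame, a), (frame, b)) -> relationship feature between two targets
    spatial_edges: dict = field(default_factory=dict)
    # pairs of nodes for the same target in two adjacent frames
    temporal_edges: list = field(default_factory=list)


def build_graph(frames):
    """frames: list of {"targets": {id: feature}, "relations": {(a, b): feature}}."""
    graph = SceneGraph()
    for t, frame in enumerate(frames):
        for target_id, feature in frame["targets"].items():
            graph.nodes[(t, target_id)] = feature
        for (a, b), relation in frame["relations"].items():
            graph.spatial_edges[((t, a), (t, b))] = relation
        if t > 0:
            previous = frames[t - 1]["targets"]
            for target_id in frame["targets"]:
                if target_id in previous:  # same target in adjacent frames
                    graph.temporal_edges.append(((t - 1, target_id), (t, target_id)))
    return graph


frames = [
    {"targets": {"child": [1.0], "mop": [0.5]}, "relations": {("child", "mop"): [0.9]}},
    {"targets": {"child": [1.1], "floor": [0.2]}, "relations": {}},
]
graph = build_graph(frames)
```

Here only the "child" target appears in both frames, so the graph has exactly one temporal edge, linking its two nodes.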

Alternatively, when obtaining the graph convolution features of an image according to the scene graph of the image, the captioning information generating module 120 is configured to: encode the nodes and edges in the scene graph to obtain feature vectors of a target dimension; and obtain the graph convolution features by using a graph convolution network based on the obtained feature vectors.
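As a rough illustration of this step, the Python sketch below applies one simplified graph convolution layer to node and edge vectors that are assumed to have already been encoded to the same target dimension. The aggregation rule (mean over the node itself and its incoming neighbors, with each neighbor message formed by adding the edge vector) and the names `linear`, `relu`, and `graph_conv` are assumptions chosen for brevity; they are not the specific graph convolution formula of the present disclosure.

```python
def linear(weight, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def relu(vector):
    return [max(0.0, x) for x in vector]


def graph_conv(node_feats, edges, weight):
    """One simplified graph convolution layer.

    node_feats: {node: vector}; edges: {(src, dst): vector}.
    Every vector is assumed to share the same target dimension.
    """
    updated = {}
    for node, feature in node_feats.items():
        messages = [feature]  # self-loop keeps the node's own feature
        for (src, dst), edge_vec in edges.items():
            if dst == node:
                # message from a neighbor: its feature plus the edge feature
                messages.append([a + b for a, b in zip(node_feats[src], edge_vec)])
        mean = [sum(column) / len(messages) for column in zip(*messages)]
        updated[node] = relu(linear(weight, mean))
    return updated


identity = [[1.0, 0.0], [0.0, 1.0]]
node_feats = {"child": [1.0, 0.0], "mop": [0.0, 1.0]}
edges = {("child", "mop"): [1.0, 1.0]}
updated = graph_conv(node_feats, edges, identity)
```

In this toy graph, the "mop" node is updated using both its own feature and the message from "child", while "child" has no incoming edge and keeps its feature.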

Alternatively, if the characteristic information of the multimedia data includes at least two of local visual features, semantic features, spatio-temporal visual features, and global features, the captioning information generating module 120 may be configured to: determine the weight of each piece of characteristic information; weight each piece of characteristic information based on its weight; and generate the text caption of the multimedia data based on the weighted characteristic information.

Alternatively, the captioning information generating module 120 is configured to: encode the obtained characteristic information by using a self-attention-based encoder; and input the encoded characteristic information to a decoder to generate the text caption of the multimedia data, wherein if the multimedia data is an image, the self-attention-based encoder is a self-attention-based intra-frame encoder, and if the multimedia data is a video, the self-attention-based encoder includes a self-attention-based intra-frame encoder and/or a self-attention-based inter-frame encoder.

Alternatively, the captioning information generating module 120 may be configured to: input the extracted characteristic information to each of a plurality of decoders; and generate the text caption of the multimedia data based on the decoding results of the decoders.

Alternatively, the captioning information generating module 120 may be configured to: obtain length information of the text caption to be generated; and generate the text caption of the video based on the length information and the extracted characteristic information.

Alternatively, the captioning information generating module 120 may specifically obtain the text caption of the multimedia data by using a multimedia data captioning model, and the multimedia data captioning model is obtained through training performed by a model training apparatus, wherein the model training apparatus includes: a sample acquisition module configured to acquire training samples, the training samples including first sample multimedia data having captioning labels; and a model training module configured to train an original captioning model based on the first sample multimedia data until a model loss function converges, and to take the trained captioning model as the multimedia data captioning model.

Alternatively, the training samples further include second sample multimedia data without captioning labels, and the model loss function includes a first loss function and a second loss function. The model training module may be configured to: train a preset captioning model based on the first sample multimedia data to obtain the value of the first loss function, and train the captioning model based on the second sample multimedia data to obtain the value of the second loss function; obtain the value of a final loss function based on the value of the first loss function and the value of the second loss function; and train the captioning model based on the value of the final loss function until the final loss function converges.
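One simple way to picture how the two loss values might be combined is the Python sketch below. The text above only states that the final loss is obtained based on both values; the weighted sum, the `unlabeled_weight` parameter, and the convergence test in `has_converged` (small changes over several consecutive steps) are all assumptions made for illustration.

```python
def final_loss(first_loss, second_loss, unlabeled_weight=1.0):
    """Combine the supervised (first) and unsupervised (second) loss values.
    The weighted sum is an assumed form, not one stated in the source."""
    return first_loss + unlabeled_weight * second_loss


def has_converged(loss_history, tol=1e-4, patience=3):
    """Assumed stopping rule: the last `patience` changes are all below `tol`."""
    if len(loss_history) < patience + 1:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(a - b) < tol for a, b in zip(recent, recent[1:]))


value = final_loss(first_loss=0.8, second_loss=0.4, unlabeled_weight=0.5)
stable = has_converged([1.0, 0.5, 0.30001, 0.3, 0.30001, 0.3])
too_short = has_converged([1.0, 0.5, 0.3])
```

A training loop would compute `final_loss` at each step from a labeled batch and an unlabeled batch, and stop once `has_converged` returns true.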

Alternatively, when training the captioning model based on the second sample multimedia data to obtain the value of the second loss function, the model training module may specifically be configured to: perform data augmentation at least once on the second sample multimedia data to obtain third sample multimedia data; input the second sample multimedia data into the captioning model to obtain at least one multimedia caption; determine a score of each multimedia caption based on the second sample multimedia data and the third sample multimedia data; and obtain the value of the second loss function based on the scores of the multimedia captions.

Alternatively, the captioning labels of the first sample multimedia data include at least one original captioning label of the first sample multimedia data and an augmented captioning label corresponding to each original captioning label, wherein the augmented captioning labels are obtained by generating, based on each original captioning label of the first sample multimedia data, an augmented captioning label corresponding to that original captioning label.

Based on the same principle as the method and apparatus provided in the embodiments of the present disclosure, an embodiment of the present disclosure further provides an electronic device. The electronic device includes a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, the processor can perform the method shown in any alternative embodiment of the present disclosure.

An embodiment of the present disclosure further provides a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, can perform the method shown in any alternative embodiment of the present disclosure.

As an example, FIG. 33 illustrates a schematic structural diagram of an electronic device applicable to the embodiments of the present disclosure. As shown in FIG. 33, the electronic device 4000 includes a processor 4001 and a memory 4003. The processor 4001 and the memory 4003 are connected, for example, via a bus 4002. Alternatively, the electronic device 4000 may further include a transceiver 4004. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present disclosure.

The processor 4001 may be a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 4001 may also be a combination that realizes a computing function, for example, a combination including one or more microprocessors, a combination of a DSP and a microprocessor, and the like.

The bus 4002 may include a path for transferring information between the aforementioned components. The bus 4002 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, the bus is represented by only one bold line in FIG. 33, but this does not mean that there is only one bus or only one type of bus.

The memory 4003 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions; a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions; an Electrically Erasable Programmable Read-Only Memory (EEPROM); a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or another magnetic storage device; or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.

The memory 4003 is configured to store application program code for executing the solution of the present disclosure, and execution is controlled by the processor 4001. The processor 4001 is configured to execute the application program code stored in the memory 4003 to implement the content shown in any one of the foregoing method embodiments.

It can be understood that the methods and models (such as the video captioning model and the image captioning model) provided in the alternative embodiments of the present application may run on any terminal (which may be a user terminal, a server, or the like) that needs to generate video captioning information and image captioning information. Alternatively, the terminal may have the following advantages.

(1) In terms of the hardware system, the device has a central processing unit, a memory, input components, and output components; that is, the device is often a microcomputer device with a communication function. In addition, it may have multiple input modes, such as a keyboard, a mouse, a touch screen, a microphone, and a camera, which can be adjusted as needed. At the same time, the device often has multiple output modes, such as a receiver and a display screen, which can also be adjusted as needed.

(2) In terms of the software system, the device must have an operating system, such as Windows Mobile, Symbian, Palm, Android, or iOS. At the same time, these operating systems are becoming increasingly open, and personalized applications based on these open operating system platforms, such as address books, calendars, notepads, calculators, and various games, are emerging in an endless stream, which greatly meets the needs of individual users.

(3) In terms of communication capability, the device has flexible access methods and high-bandwidth communication performance, and it can automatically adjust the selected communication method according to the selected service and environment, thereby making the device easier to use. The device can support Global System for Mobile Communication (GSM), Wideband Code Division Multiple Access (WCDMA), Code Division Multiple Access (CDMA2000), Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), Wi-Fi, Worldwide Interoperability for Microwave Access (WiMAX) and the like, so as to adapt to networks of various standards that support not only voice services but also multiple wireless data services.

(4) In terms of functionality, the device pays more attention to humanization, personalization and multifunctionality. With the development of computer technology, the device has moved from a "device-centric" mode to a "person-centric" mode, integrating embedded computing, control technology, artificial intelligence technology and biometric authentication technology, which fully reflects the person-centric purpose. Owing to the development of software technology, the configuration of the device can be adjusted according to individual needs, making it more personalized. At the same time, the device itself integrates a great deal of software and hardware, and its functions are becoming more and more powerful.

In addition, the apparatus or method for generating captioning information of multimedia data according to the disclosed embodiments may be provided as a computer program product. The computer program product may be traded as a commodity between a seller and a buyer.

The computer program product may include a software program and a computer-readable storage medium in which the software program is stored. For example, the computer program product may include a product in the form of a software program (e.g., a downloadable app) distributed electronically by the manufacturer of an electronic device or through an electronic market (e.g., Google Play Store, AppStore). For electronic distribution, at least a portion of the software program may be stored on a storage medium or generated temporarily. In this case, the storage medium may be a storage medium of a server of the manufacturer, a server of the electronic market, or a relay server that temporarily stores the software program.

The computer program product may include a storage medium of the server or a storage medium of the client device in a system including the server and the client device. Alternatively, if there is a third device (e.g., a smartphone) that communicates with the server or the client device, the computer program product may include a storage medium of the third device. Alternatively, the computer program product may include the software program itself, which is transmitted from the server to the client device or the third device, or from the third device to the client device.

In this case, one of the server, the client device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments. Alternatively, at least two of the server, the client device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments in a distributed manner.

For example, a server (e.g., a cloud server or an artificial intelligence server) may execute a computer program product stored on the server to control a client device that communicates with the server so that the client device performs the method according to the disclosed embodiments.

Although the steps in the flowcharts of the drawings are displayed sequentially in the direction of the arrows, it should be understood that these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be performed in a different order. Moreover, at least some of the steps in the flowcharts of the drawings may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily performed at the same time and may be performed at different times. Nor are they necessarily performed one after another; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

The above description covers only some implementations of the present disclosure. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present disclosure, and such improvements and modifications should also be considered to fall within the protection scope of the present disclosure.

Claims

A method for generating captioning information of multimedia data, the method comprising:
extracting characteristic information of multimedia data to be processed, the multimedia data including a video or an image; and
generating a text caption of the multimedia data based on the extracted characteristic information.

The method of claim 1, wherein extracting characteristic information of the multimedia data to be processed comprises:
extracting local visual features of targets included in respective target regions of each image in the multimedia data;
extracting semantic features of the multimedia data;
when the multimedia data is a video, extracting spatial-temporal visual features of the multimedia data;
extracting global visual features of the multimedia data;
extracting attribute characteristics of the targets included in respective target regions of each image in the multimedia data; and
at least one of extracting global attribute features of each image in the multimedia data.

The method of claim 2, wherein the characteristic information includes local visual features of the targets included in respective target regions in each image of the multimedia data, and generating a text caption of the multimedia data based on the extracted characteristic information comprises:
obtaining relationship features between targets based on the local visual features of each target in the image;
building a scene graph of the image based on the local visual features and the relationship features;
obtaining graph convolution features of the image based on the scene graph of the image; and
generating a text caption of the multimedia data based on the graph convolution features of each image of the multimedia data.

The method of claim 3, wherein the scene graph comprises a plurality of nodes and a plurality of edges, wherein one node represents a local visual feature of one target, and each of the plurality of edges represents a relationship feature between the two nodes it connects.

The method of claim 3, wherein the characteristic information includes attribute features of the targets included in respective target regions of each image in the multimedia data, and
building a scene graph of the image based on the local visual features and the relationship features comprises:
building a scene graph of the image based on the local visual features of each target, the relationship features between the targets, and the attribute features of each target, wherein one node in the scene graph represents a local visual feature or an attribute feature of one target.

The method of claim 3, wherein,
when the multimedia data is a video, the images of the multimedia data are a plurality of frames selected from the video, and when target regions of two adjacent frames include the same target, the scene graphs of the two adjacent frames have a temporal edge between the nodes that correspond to the same target.
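As an illustration only, the scene graph structure recited above (nodes for targets, edges for relationships, and temporal edges between adjacent frames) can be sketched in Python as follows. The `SceneGraph` class, the `temporal_edges` helper, the `"same_target"` label and the sample values are hypothetical, and the sketch assumes that identical targets in adjacent frames already share an identifier; how targets are matched across frames is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Scene graph of one frame (hypothetical structure for illustration)."""
    nodes: dict = field(default_factory=dict)  # target id -> local visual feature
    edges: list = field(default_factory=list)  # (subject id, relation, object id)


def temporal_edges(prev: SceneGraph, curr: SceneGraph) -> list:
    """Link nodes of two adjacent frames that correspond to the same target.

    Assumption: the same target carries the same id in both frames.
    """
    shared = sorted(set(prev.nodes) & set(curr.nodes))
    return [(target_id, "same_target", target_id) for target_id in shared]


frame1 = SceneGraph(nodes={"child": [0.1, 0.9], "mop": [0.4, 0.2]},
                    edges=[("child", "holding", "mop")])
frame2 = SceneGraph(nodes={"child": [0.2, 0.8], "floor": [0.7, 0.3]},
                    edges=[("child", "standing_on", "floor")])

# Only "child" appears in both frames, so it receives the single temporal edge.
links = temporal_edges(frame1, frame2)
```

Here `links` connects the "child" node of the first frame to the "child" node of the second frame, while "mop" and "floor" receive no temporal edge.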

The method of claim 3, wherein obtaining graph convolution features of the image based on a scene graph of the image comprises:
encoding nodes and edges in the scene graph to obtain feature vectors of a target dimension; and
obtaining the graph convolution features by using a graph convolution network based on the obtained feature vectors.
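The graph convolution step can be illustrated with a minimal, dependency-free sketch. It assumes one mean-aggregation layer with self-loops and a ReLU activation, treats edges as undirected, and ignores edge feature vectors for brevity; none of these choices is stated in the claims, and the function name `graph_convolution` and the sample values are hypothetical.

```python
def graph_convolution(features, edges, weight):
    """One mean-aggregation graph convolution layer with self-loops and ReLU.

    features: node feature vectors, already encoded to a common dimension
    edges:    (i, j) node index pairs, treated as undirected in this sketch
    weight:   projection matrix of shape (in_dim, out_dim) as nested lists
    """
    n = len(features)
    neighbors = [{i} for i in range(n)]  # every node keeps a self-loop
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    in_dim, out_dim = len(weight), len(weight[0])
    out = []
    for i in range(n):
        # Average the features of node i and its neighbors.
        agg = [sum(features[k][d] for k in neighbors[i]) / len(neighbors[i])
               for d in range(in_dim)]
        # Linear projection followed by ReLU.
        out.append([max(0.0, sum(agg[d] * weight[d][o] for d in range(in_dim)))
                    for o in range(out_dim)])
    return out


node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
graph_edges = [(0, 1)]                  # nodes 0 and 1 are related; node 2 is isolated
identity = [[1.0, 0.0], [0.0, 1.0]]     # identity projection keeps the numbers readable
conv = graph_convolution(node_features, graph_edges, identity)
```

With the identity projection, the two connected nodes end up with the average of their features, and the isolated node keeps its own feature.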

The method of claim 2, wherein, when the characteristic information of the multimedia data includes at least two of the local visual features, the semantic features, the spatial-temporal visual features, and the global features, generating a text caption of the multimedia data based on the extracted characteristic information comprises:
determining a weight of each piece of characteristic information;
weighting each piece of characteristic information based on its weight; and
generating a text caption of the multimedia data based on the weighted characteristic information.
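One possible reading of the weighting steps is sketched below. It assumes that the weights come from a softmax over per-feature relevance scores, which the claims do not specify; the function names, feature names and scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_features(features, scores):
    """Determine a weight per feature type and scale each feature by it.

    features: mapping of feature name -> feature vector
    scores:   one relevance score per feature, in the same order (assumed given)
    """
    weights = softmax(scores)
    weighted = {name: [w * x for x in vec]
                for (name, vec), w in zip(features.items(), weights)}
    return weights, weighted


features = {"local_visual": [1.0, 2.0], "semantic": [3.0, 4.0]}
# Equal scores give equal weights of 0.5 each.
weights, weighted = weight_features(features, scores=[0.0, 0.0])
```

The weighted vectors would then be passed on to caption generation; in a trained model the scores would typically be learned rather than supplied by hand.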

The method of claim 2, wherein generating a text caption of the multimedia data based on the extracted characteristic information comprises:
encoding the acquired characteristic information using a self-attention-based encoder; and
inputting the encoded characteristic information to a decoder to generate a text caption of the multimedia data,
wherein, when the multimedia data is an image, the self-attention-based encoder is a self-attention-based intra-frame encoder, and when the multimedia data is a video, the self-attention-based encoder comprises a self-attention-based intra-frame encoder and/or a self-attention-based inter-frame encoder.
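The distinction between intra-frame and inter-frame encoding can be illustrated with a minimal sketch. It assumes scaled dot-product self-attention with identity query, key and value projections, and mean pooling of region features into one vector per frame before inter-frame attention; both are simplifying assumptions, and `self_attention` and `encode_video` are hypothetical names.

```python
import math


def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        # Each output is the attention-weighted sum of all input vectors.
        out.append([sum(w * v[i] for w, v in zip(attn, x)) for i in range(d)])
    return out


def encode_video(frames):
    """frames: list of frames, each a list of region feature vectors."""
    # Intra-frame encoding: attend over the regions within each frame.
    intra = [self_attention(regions) for regions in frames]
    # Assumption: mean-pool the regions to get one vector per frame.
    pooled = [[sum(r[i] for r in f) / len(f) for i in range(len(f[0]))]
              for f in intra]
    # Inter-frame encoding: attend across the frame-level vectors.
    return self_attention(pooled)


video = [
    [[1.0, 0.0], [0.0, 1.0]],   # frame 1: two region features
    [[0.5, 0.5], [1.0, 1.0]],   # frame 2: two region features
]
encoded = encode_video(video)
```

For a single image, only the intra-frame step would apply, which corresponds to calling `self_attention` on the region features of that image.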

The method of claim 1, wherein generating a text caption of the multimedia data based on the extracted characteristic information comprises:
inputting the extracted characteristic information to a plurality of decoders, respectively; and
generating a text caption of the multimedia data based on decoding results of the decoders.

The method of claim 1, wherein generating a text caption of the multimedia data based on the extracted characteristic information comprises:
obtaining length information of the text caption to be generated; and
generating a text caption of the video based on the length information and the extracted characteristic information.

The method of claim 1, wherein the text caption of the multimedia data is generated through a multimedia data captioning model, and the multimedia data captioning model is obtained by:
obtaining training samples comprising first sample multimedia data having captioning labels;
training an initial captioning model based on the first sample multimedia data until the model loss function converges; and
taking the trained captioning model as the multimedia data captioning model.

The method of claim 12, wherein the training samples further comprise second sample multimedia data without the captioning labels, and the model loss function comprises a first loss function and a second loss function, and
training the initial captioning model based on the first sample multimedia data until the model loss function converges comprises:
training a preset captioning model based on the first sample multimedia data to obtain a value of the first loss function, and training the captioning model based on the second sample multimedia data to obtain a value of the second loss function;
obtaining a value of a final loss function based on the value of the first loss function and the value of the second loss function; and
training the captioning model based on the value of the final loss function until the final loss function converges.
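How the two loss values might be combined can be sketched as follows. The sketch assumes a cross-entropy loss for the labeled samples, a KL-divergence consistency loss for the unlabeled samples, and a weighted sum as the final loss; the claims only state that the final loss is based on both values, so these choices, the `second_weight` parameter and the sample probabilities are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """First loss (assumed): negative log-likelihood of the labeled word."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """Second loss (assumed): divergence between two predictions on unlabeled data."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def final_loss(first, second, second_weight=1.0):
    """Final loss as a weighted sum of the two loss values (assumed form)."""
    return first + second_weight * second


# Labeled sample: the model assigns probability 0.75 to the correct word.
first = cross_entropy([0.25, 0.75], label=1)
# Unlabeled sample: two predictions that agree exactly, so the divergence is zero.
second = kl_divergence([0.5, 0.5], [0.5, 0.5])
total = final_loss(first, second, second_weight=0.5)
```

In a training loop, `total` would be minimized until it converges, so that both the labeled and the unlabeled samples influence the captioning model.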

A device for generating captioning information of multimedia data, the device comprising:
a memory storing one or more instructions; and
a processor configured to execute the one or more instructions stored in the memory to:
extract characteristic information of multimedia data to be processed, the multimedia data including a video or an image; and
generate a text caption of the multimedia data based on the extracted characteristic information.

A computer program product comprising a non-transitory computer-readable storage medium, the storage medium storing a computer program that, when executed by a processor, performs the method of claim 1.