WO2020159036A1

WO2020159036A1 - Electronic device generating caption information for video sequence and operation method thereof

Info

Publication number: WO2020159036A1
Application number: PCT/KR2019/013609
Authority: WO
Inventors: 김경수; 김준모; 김병주; 박민석; 이시행; 이예강; 이재영
Original assignee: 삼성전자주식회사; 한국과학기술원
Priority date: 2019-01-30
Filing date: 2019-10-16
Publication date: 2020-08-06

Abstract

The present disclosure relates to an artificial intelligence (AI) system mimicking functions like cognition, determination, etc. of human brains by using a machine learning algorithm such as deep learning and the like, and an application of the AI system. Disclosed is a method for generating caption information for a video sequence in an electronic device, the method comprising the steps of: obtaining a plurality of videos included in a video sequence; extracting pieces of characteristic information of each of the plurality of videos; obtaining a first characteristic information on a characteristic of the video sequence by sequentially processing the extracted pieces of characteristic information according to the order of the plurality of videos; obtaining a second characteristic information on a characteristic of the video sequence that was determined on the basis of at least one similarity among the extracted pieces of characteristic information; and generating caption information of the video sequence on the basis of the first characteristic information and the second characteristic information.

Description

Electronic device generating caption information for image sequence and operation method thereof

The present disclosure relates to an electronic device generating caption information for a video sequence and a method of operating the same.

An artificial intelligence (AI) system is a computer system that realizes human-level intelligence. Unlike an existing rule-based smart system, it is a system in which the machine learns and makes judgments by itself and becomes more intelligent. The more an AI system is used, the more its recognition rate improves and the more accurately it can understand a user's taste, so existing rule-based smart systems are gradually being replaced by deep learning-based AI systems.

Artificial intelligence technology is composed of machine learning (deep learning) and element technologies that utilize machine learning.

Machine learning is an algorithm technology that classifies and learns the characteristics of input data by itself. Element technologies simulate functions of the human brain, such as cognition and judgment, by using machine learning algorithms such as deep learning, and span technical fields such as linguistic understanding, visual understanding, reasoning/prediction, knowledge expression, and motion control.

The various fields in which artificial intelligence technology is applied are as follows. Linguistic understanding is a technology that recognizes and applies/processes human language and characters, and includes natural language processing, machine translation, conversation systems, question answering, speech recognition/synthesis, and the like. Visual understanding is a technology that recognizes and processes objects as human vision does, and includes object recognition, object tracking, image search, human recognition, scene understanding, spatial understanding, and image improvement. Inference/prediction is a technology that evaluates information and logically infers and predicts from it, and includes knowledge/probability-based reasoning, optimization prediction, preference-based planning, and recommendation. Knowledge expression is a technology that automatically processes human experience information into knowledge data, and includes knowledge building (data generation/classification), knowledge management (data utilization), and the like. Motion control is a technology for controlling autonomous driving of a vehicle and movement of a robot, and includes movement control (navigation, collision, driving), operation control (behavior control), and the like.

The video captioning technique is a technique for generating sentences describing scenes of an image sequence. According to the video captioning technique, based on the above-described artificial intelligence system, an optimal sentence describing scenes of an image sequence can be generated.

Through the sentences generated by the video captioning technique, a user can easily grasp the contents of a video sequence of considerable length without viewing it directly. Further, because the generated text compactly represents the contents of the image sequence, it may be utilized in various fields, such as classifying or recognizing an image sequence.

Accordingly, there is a need for a video captioning technique for generating text that properly and clearly reflects the contents of an image sequence.

An object of the present disclosure is to address the above-described need by providing an electronic device generating caption information for an image sequence and an operation method thereof.

Another object is to provide a computer program product comprising a computer-readable recording medium on which a program for executing the method on a computer is recorded. The technical problems to be solved are not limited to those described above, and other technical problems may exist.

FIG. 1 is a diagram illustrating an example of generating caption information of an image sequence according to an embodiment.

FIG. 2 is a block diagram illustrating an example of an electronic device 1000 that generates caption information for an image sequence according to an embodiment.

FIG. 3 is a diagram illustrating an example of a method in which a non-regional feature extraction unit acquires second feature information according to an embodiment.

FIG. 4 is a block diagram illustrating an example of an electronic device 1000 that generates caption information for an image sequence according to an embodiment.

FIG. 5 is a block diagram illustrating an internal configuration of an electronic device according to an embodiment.

FIG. 6 is a block diagram illustrating an internal configuration of an electronic device according to an embodiment.

FIG. 7 is a flowchart illustrating a method of generating caption information for an image sequence according to an embodiment.

As a technical means for achieving the above technical problem, a first aspect of the present disclosure provides a method of generating caption information for an image sequence in an electronic device, the method comprising: obtaining a plurality of images included in the image sequence; extracting feature information for each of the plurality of images; obtaining first feature information regarding features of the image sequence by sequentially processing the extracted feature information according to the order of the plurality of images; obtaining second feature information regarding features of the image sequence, determined based on at least one similarity between the extracted feature information; and generating caption information for the image sequence based on the first feature information and the second feature information.

In addition, a second aspect of the present disclosure provides an electronic device generating caption information for an image sequence, the electronic device comprising: a memory storing a plurality of images included in the image sequence; at least one processor configured to extract feature information for each of the plurality of images, obtain first feature information regarding features of the image sequence by sequentially processing the extracted feature information according to the order of the plurality of images, obtain second feature information regarding features of the image sequence determined based on at least one similarity between the extracted feature information, and generate caption information for the image sequence based on the first feature information and the second feature information; and an output unit that outputs information based on the generated caption information.

Further, a third aspect of the present disclosure may provide a computer program product including a recording medium in which a program for performing the method of the first aspect or the second aspect is stored.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art to which the present invention pertains may easily practice it. However, the present invention can be implemented in many different forms and is not limited to the embodiments described herein. In addition, in order to clearly describe the present invention, parts irrelevant to the description are omitted from the drawings, and like reference numerals are assigned to similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another element in between. Also, when a part "includes" a certain component, this means that other components may be further included, rather than excluded, unless otherwise specified.

AI-related functions according to the present disclosure are operated through a processor and a memory. The processor may be composed of one or more processors. At this time, the one or a plurality of processors may be a general-purpose processor such as a CPU, an AP, or a digital signal processor (DSP), a graphic processor such as a GPU or a vision processing unit (VPU), or an artificial intelligence processor such as an NPU. One or a plurality of processors are controlled to process input data according to predefined operation rules or artificial intelligence models stored in the memory. Alternatively, when one or more processors are AI-only processors, the AI-only processors may be designed with a hardware structure specialized for processing a specific AI model.

The predefined operation rules or artificial intelligence models are characterized by being created through learning. Here, being created through learning means that a basic artificial intelligence model is trained on a plurality of training data by a learning algorithm, so that a predefined operation rule or artificial intelligence model set to perform a desired characteristic (or purpose) is created. Such learning may be performed on the device on which artificial intelligence according to the present disclosure is performed, or may be performed through a separate server and/or system. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, but are not limited to the examples described above.

The artificial intelligence model may be composed of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values and performs a neural network operation through calculation between the calculation result of the previous layer and the plurality of weights. The plurality of weights of the plurality of neural network layers may be optimized by the learning result of the artificial intelligence model. For example, the plurality of weights may be updated such that a loss value or a cost value obtained from the artificial intelligence model is reduced or minimized during the learning process. The artificial neural network may include a deep neural network (DNN), for example, a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), or deep Q-networks, but is not limited to the above-described examples.
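As a non-limiting illustration of the layer operation and weight update described above, the following minimal sketch computes one layer from the result of a previous layer and applies one gradient step that reduces a squared-error loss. The function names `dense_layer`, `squared_error`, and `sgd_step`, the linear (activation-free) layer, the loss function, and the learning rate are all assumptions made for illustration and are not taken from the present disclosure.

```python
def dense_layer(prev_output, weights, biases):
    """Compute one layer: each output is a weighted sum of the previous layer's result."""
    return [
        sum(w * x for w, x in zip(row, prev_output)) + b
        for row, b in zip(weights, biases)
    ]


def squared_error(output, target):
    """Loss value: sum of squared differences between output and target."""
    return sum((o - t) ** 2 for o, t in zip(output, target))


def sgd_step(prev_output, weights, biases, target, lr=0.1):
    """One gradient-descent update of a linear layer under squared-error loss."""
    output = dense_layer(prev_output, weights, biases)
    grads = [2.0 * (o - t) for o, t in zip(output, target)]  # dLoss/dOutput
    new_weights = [
        [w - lr * g * x for w, x in zip(row, prev_output)]
        for row, g in zip(weights, grads)
    ]
    new_biases = [b - lr * g for b, g in zip(biases, grads)]
    return new_weights, new_biases


# Hypothetical values chosen only to show that the update lowers the loss.
x = [1.0, 2.0]
W = [[0.5, -0.5], [0.1, 0.2]]
b = [0.0, 0.0]
target = [1.0, 0.0]

loss_before = squared_error(dense_layer(x, W, b), target)
W, b = sgd_step(x, W, b, target)
loss_after = squared_error(dense_layer(x, W, b), target)
```

In this sketch the loss after the update is smaller than before, which corresponds to the weights being updated so that the loss value is reduced.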

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

According to an embodiment, caption information 120 may be generated for an image sequence 110 including a plurality of images 111, 112, 113, 114, 115, and 116. The caption information 120 according to an embodiment may include information describing a scene of images included in the image sequence 110.

For example, the caption information 120 may include information indicating one topic, encompassing images included in the image sequence 110.

According to an embodiment, caption information 120 generated for the image sequence 110 may be provided to the user so that the user can determine the approximate contents of the images included in the image sequence 110. For example, the caption information 120 generated according to an embodiment may be displayed on the electronic device 1000.

Further, the caption information 120 according to an embodiment may be used to perform various processes related to the image sequence 110, such as classifying and recognizing the image sequence 110.

The caption information 120 of the image sequence 110 according to an embodiment may be generated based on information on features of the image sequence 110. Information about the features of the image sequence 110 may be generated based on the features of the images included in the image sequence 110.

According to an embodiment, information on the features of the image sequence 110 may be obtained by a learning model using feature information of images included in the image sequence 110 as an input. For example, the learning model described above may be trained so that feature information of the image sequence 110 suitable for generating the caption information 120 can be obtained from feature information of the images included in the image sequence 110.

In addition, according to an embodiment, information on the features of the image sequence 110 may be obtained not only by the method using the above-described learning model but also through various other methods based on feature information of the images included in the image sequence 110.

The caption information 120 of the image sequence 110 according to an embodiment may be generated based on feature information of each of the plurality of images 111, 112, 113, 114, 115, and 116 included in the image sequence 110. The feature information of an image is information representing the visual characteristics of the image, and may include, for example, histogram information, edge information, brightness information, color distribution information, and shape information. The feature information of the image is not limited to the above-described examples and may include various information representing the visual features of the image.
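By way of a hedged example, feature information of the kind listed above (histogram information and brightness information) could be computed for a single still image roughly as follows. The grayscale row-of-integers image representation, the bin count, and the names `brightness_histogram`, `mean_brightness`, and `extract_features` are assumptions introduced only for this sketch.

```python
def brightness_histogram(image, bins=4, max_value=255):
    """Count pixels per brightness bin for a grayscale image given as rows of ints."""
    counts = [0] * bins
    for row in image:
        for pixel in row:
            index = min(pixel * bins // (max_value + 1), bins - 1)
            counts[index] += 1
    return counts


def mean_brightness(image):
    """Average pixel value over the whole image."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def extract_features(image):
    """Concatenate a normalized histogram with the normalized mean brightness."""
    hist = brightness_histogram(image)
    total = sum(hist)
    return [c / total for c in hist] + [mean_brightness(image) / 255]


# Hypothetical 2 x 3 grayscale image.
image = [[0, 64, 128], [192, 255, 255]]
features = extract_features(image)
```

The resulting list is one possible form of per-image feature information; edge, color distribution, or shape information could be appended in the same manner.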

In addition, the feature information of the image according to an embodiment may include a result of recognizing the image, obtained by inputting various information about the image, for example, various information representing the visual characteristics of the image, to a data recognition model. The result of recognizing the image may include, for example, information on an object recognized in the image, information on a position of the object recognized in the image, information on the motion of the object recognized in the image, and the like. For example, when the image input to the data recognition model is an image of a cat, the recognition result output by the data recognition model for the image may include "cat". Accordingly, the feature information of the image may include "cat", which is the result of recognizing the image.

The results of the image recognition by the data recognition model are not limited to the above-described examples, and may include various information representing characteristics of the image.

In one embodiment, a data recognition model that can be used to obtain feature information of an image may be a convolutional neural network (CNN) used to classify and detect objects in the image. Not limited to the above-described examples, in one embodiment, various types of data recognition models based on neural networks that can be used to acquire feature information of an image may be used.

Accordingly, the feature information of the images included in the image sequence 110 according to an embodiment may include at least one of various information representing visual characteristics of each image and information about a result of recognizing each image by inputting the information representing the visual characteristics into a data recognition model.

The image sequence 110 according to an embodiment may include a plurality of image frames, arranged in chronological order. For example, one video file may include a plurality of image sequences divided by scene or subject, and each image sequence may include a plurality of images. Also, a plurality of images included in the image sequence 110 may be still images.

According to an embodiment, the plurality of images 111, 112, 113, 114, 115, and 116 used to generate the caption information 120 may be still images corresponding to respective time points set according to a predetermined time interval. The plurality of images 111, 112, 113, 114, 115, and 116 are not limited to the above-described example and may include a plurality of still images selected, by various criteria or methods, from among the still images included in the image sequence 110.
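The selection of still images at time points set according to a predetermined time interval can be sketched as follows. The function name `sample_frame_indices` and its parameters (frame count, frame rate, interval in seconds) are hypothetical and serve only to illustrate one possible selection criterion.

```python
def sample_frame_indices(num_frames, fps, interval_seconds):
    """Return indices of frames at time points spaced interval_seconds apart."""
    if num_frames <= 0 or fps <= 0 or interval_seconds <= 0:
        raise ValueError("num_frames, fps and interval_seconds must be positive")
    step = max(1, round(fps * interval_seconds))  # frames between selected time points
    return list(range(0, num_frames, step))


# Hypothetical 6-second sequence at 30 frames per second, sampled once per second.
indices = sample_frame_indices(num_frames=180, fps=30, interval_seconds=1.0)
```

With these assumed values, six still images are selected, one for each second of the sequence.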

FIG. 2 is a block diagram illustrating an example of an electronic device 1000 that generates caption information for an image sequence 210 according to an embodiment.

The electronic device 1000 according to an embodiment may be implemented as various types of devices capable of generating caption information 120 for the image sequence 110. For example, the electronic device 1000 described in the present specification may be a digital camera, a smartphone, a laptop computer, a tablet PC, an electronic book terminal, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a navigation device, an MP3 player, or the like, but is not limited thereto. The electronic device 1000 described in this specification may be a wearable device. Wearable devices may include accessory-type devices (e.g., watches, rings, cuff bands, ankle bands, necklaces, glasses, contact lenses), head-mounted devices (HMD), fabric- or garment-integrated devices (e.g., electronic clothing), body-attached devices (e.g., a skin pad), or bio-implantable devices (e.g., an implantable circuit).

According to an embodiment, the image sequence 210 for generating caption information may include images 1 to 4 (231, 232, 233, 234). The electronic device 1000 according to an embodiment may generate caption information for the image sequence 210 based on the images 1 to 4 (231, 232, 233, and 234).

Images 1 to 4 (231, 232, 233, 234) according to an embodiment may be selected from among the plurality of images included in the image sequence 210 according to a predetermined criterion or method, in order to determine the images used to generate caption information. Further, the present disclosure is not limited to the above-described example, and all still images included in the image sequence 210, or still images randomly selected from all the still images, may be used as images for generating caption information.

According to an embodiment, the number of images is not limited to the four shown in FIG. 2, and a different number of images may be used to generate caption information of the image sequence 210 depending on the image sequence 210. For example, as the length of the image sequence 210 increases, a larger number of images included in the image sequence 210 may be used to generate caption information of the image sequence 210.

Referring to FIG. 2, the electronic device 1000 may include, as a configuration for generating caption information of the image sequence 210 according to an embodiment, a local (regional) feature acquisition unit 220, a non-local (non-regional) feature acquisition unit 230, a combining unit 240, and a caption generating unit 250.

According to an embodiment, the feature information extracted from each of the plurality of images included in the image sequence 210 is transmitted to the regional feature acquisition unit 220 and the non-regional feature acquisition unit 230, so that the first feature information and the second feature information regarding the features of the image sequence 210 may be respectively obtained.

The feature information of the images delivered to the regional feature acquisition unit 220 and the non-regional feature acquisition unit 230 according to an embodiment may include at least one of various information representing visual characteristics of each image and information about a result of recognizing each image by inputting the information representing the visual characteristics into a data recognition model. The feature information for an image is not limited to the above-described examples and may include various types of information obtained from each image.

According to an embodiment, the regional feature acquisition unit 220 and the non-regional feature acquisition unit 230 may acquire first feature information and second feature information, respectively. The first feature information and the second feature information according to an embodiment are information representing features of the image sequence 210, acquired in different ways by the regional feature acquisition unit 220 and the non-regional feature acquisition unit 230, respectively.

According to an embodiment, caption information may be generated based on the first characteristic information and the second characteristic information, which represent characteristics of the image sequence 210 obtained by different methods. Accordingly, according to an embodiment, more suitable caption information may be generated than when the caption information is generated based on the feature information of the image sequence 210 obtained in only one way.

The regional feature acquiring unit 220 according to an embodiment may acquire first feature information regarding features of the image sequence 210 by sequentially processing feature information of each image in the order of the images. According to an embodiment, the feature information of each image may be sequentially processed in the image order by the regional feature acquiring unit 220, so that processing for acquiring features of the image sequence 210 may be performed.

The regional feature acquisition unit 220 according to an embodiment may include feature acquisition units 1 to 4 (221, 222, 223, 224), as illustrated in FIG. 2. In addition, the regional feature acquisition unit 220 may obtain, through the feature acquisition units 1 to 4 (221, 222, 223, 224), first feature information about the image sequence 210 based on the feature information extracted from images 1 to 4 (231, 232, 233, 234).

According to an embodiment, the feature information extracted from images 1 to 4 (231, 232, 233, 234) is sequentially input to and processed by the feature acquisition units 1 to 4 (221, 222, 223, 224), respectively, according to the order of the images, so that first feature information about the image sequence 210 may be obtained. Accordingly, the feature information of each image may be processed sequentially from image 1 to image 4, and the result output by feature acquisition unit 4 (224) may be input to the combining unit 240 as the first feature information for the image sequence 210.

The feature acquisition units 1 to 4 (221, 222, 223, 224) according to an embodiment may use a data recognition model for acquiring feature information of the image sequence 210, which includes a plurality of images, by processing the feature information of the plurality of images in consideration of their order or temporal aspects. For example, the data recognition model that can be used in the feature acquisition units 1 to 4 (221, 222, 223, 224) may be a recurrent neural network (RNN), a long short-term memory (LSTM) network, or the like. The data recognition model may be, for example, a learning model for obtaining feature information of the image sequence 210 in consideration of the order of the images as the feature information of each image is sequentially input. The data recognition model that can be used in the feature acquisition units 1 to 4 (221, 222, 223, 224) is not limited to the above-described examples, and may be various types of learning models.

According to an embodiment, the regional feature acquisition unit 220, including the feature acquisition units 1 to 4 (221, 222, 223, 224), may output first feature information indicating features of the image sequence 210, obtained in consideration of the order from image 1 to image 4. For example, the first feature information may include a result determined as feature information for the image sequence 210 including images 1 to 4 when the feature information of each of images 1 to 4 is sequentially input to the data recognition model.

According to an embodiment, in feature acquisition unit 1 (221), the result of processing the feature information for image 1 by the data recognition model (e.g., LSTM) may be passed to the input of feature acquisition unit 2 (222). Feature acquisition unit 2 (222) may receive the result of feature acquisition unit 1 (221) and the feature information on image 2 as inputs, and output a result value through the data recognition model. Also, feature acquisition unit 3 (223) may receive the result of feature acquisition unit 2 (222) and the feature information on image 3 as inputs, and output a result value through the data recognition model. Feature acquisition unit 4 (224) may receive the result of feature acquisition unit 3 (223) and the feature information on image 4 as inputs, and output a result value through the data recognition model. In addition, the result value output by feature acquisition unit 4 (224) may be transmitted to the combining unit 240 as the first feature information.
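The chained processing described above, in which each feature acquisition unit receives the result of the preceding unit together with the feature information of its own image, can be sketched as follows. This is a minimal illustration only: a simple tanh recurrence with fixed scalar weights stands in for the LSTM-type data recognition model, the initial state is assumed to be zero, and the names `recurrent_unit` and `first_feature_information` are hypothetical.

```python
import math


def recurrent_unit(prev_state, feature, w_state=0.5, w_input=0.5):
    """One feature acquisition unit: mix the previous result with the current image feature."""
    return [
        math.tanh(w_state * h + w_input * x)
        for h, x in zip(prev_state, feature)
    ]


def first_feature_information(image_features):
    """Process image features in order; the output of the last unit is returned."""
    state = [0.0] * len(image_features[0])  # assumed zero initial state
    for feature in image_features:  # image 1, image 2, ... in sequence order
        state = recurrent_unit(state, feature)
    return state


# Hypothetical two-dimensional feature information for images 1 to 4.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
h_k = first_feature_information(features)
reversed_h_k = first_feature_information(features[::-1])
```

Because the result depends on the order in which the feature information is supplied, processing the same images in reverse order yields a different output, which reflects the order-dependent nature of the first feature information.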

Accordingly, according to an embodiment, the feature information of images 1 to 4 may be sequentially processed by the regional feature acquisition unit 220 according to the order of the images, and, as a result of the processing, the first feature information indicating features of the image sequence 210 may be output.

However, because the first feature information is obtained as processing is performed sequentially in the feature acquisition units 1 to 4 (221, 222, 223, 224), the feature information of image 1 (231), located at the front of the image sequence 210, may be reflected relatively little in the first feature information. On the other hand, the feature information of image 4 (234), which is processed last, may be reflected relatively strongly in the first feature information.

For example, if the image sequence 210 is quite long and many images are used to generate its caption information, the sequential processing described above may be performed in the regional feature acquisition unit 220 as many times as the number of images used to generate the caption information. Because of this repeated processing, the feature information of the images located in the front portion of the image sequence 210 may hardly be reflected in the first feature information.
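The fading influence of early images can be illustrated with a deliberately simplified linear recurrence. This recurrence, with scalar coefficients a and b and inputs x_t, is an assumption made here for illustration and is not the data recognition model of the present disclosure.

```latex
h_t = a\,h_{t-1} + b\,x_t, \quad h_0 = 0
\;\Longrightarrow\;
h_k = b \sum_{t=1}^{k} a^{\,k-t}\, x_t
```

Under this assumption, the feature information x_1 of the first image enters h_k with weight b a^{k-1}, while the last image enters with weight b. For |a| < 1 the weight of the first image shrinks geometrically with the number of images k; for instance, a = 0.5 and k = 4 give a weight of 0.125 b for image 1 compared with b for image 4.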

Accordingly, according to an embodiment, not only the first feature information but also the second feature information, obtained without considering the order of the images, is used, so that feature information of the image sequence 210 in which the feature information of images located in the front portion of the image sequence 210 is properly reflected may be obtained.

To obtain the second feature information according to an embodiment, the non-regional feature acquisition unit 230 may include a non-regional feature extraction unit 231 and a conversion unit 232. The non-regional feature acquisition unit 230 is not limited to the above-described example, and may include only the non-regional feature extraction unit 231 without the conversion unit 232. Unlike the regional feature acquisition unit 220, the non-regional feature acquisition unit 230 according to an embodiment may obtain second feature information regarding the features of the image sequence 210 based on the feature information of each image, without considering the order of images 1 to 4. According to an embodiment, the non-regional feature extraction unit 231 may extract feature information for the image sequence 210 based on the similarity between the feature information of the images, without considering the order of images 1 to 4.

According to an embodiment, the non-regional feature acquisition unit 230 may acquire similarity values between the feature information of image 1 (231) and the feature information of the other images, and obtain a weighted sum of the acquired similarity values. Also, for images 2 (232) to 4 (234), as for image 1 (231), similarity values with respect to the feature information of the other images may be acquired, and a weighted sum of the acquired similarity values may be obtained. According to an embodiment, the weight values applied to each similarity value may be determined as optimal values by learning.

Accordingly, according to an embodiment, a weighted sum of similarity values may be obtained for each of images 1 (231) to 4 (234).

According to an embodiment, based on weighted sums obtained for each image, feature information for each corresponding image may be corrected. Accordingly, feature information for each image may be corrected according to a similarity value with other images.
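One possible reading of the similarity-based correction described above is sketched below. In this sketch the similarity is a dot product, the weights are obtained by a fixed softmax normalization rather than by learning, and the correction adds the similarity-weighted sum of the other images' feature information to each image's own feature information. All of these choices, and the names `dot`, `softmax`, and `correct_features`, are assumptions for illustration; the disclosure states only that the weights are determined by learning.

```python
import math


def dot(a, b):
    """Dot-product similarity between two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(values):
    """Normalize similarity values into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def correct_features(features):
    """For each image, add a similarity-weighted sum of the other images' features."""
    corrected = []
    for i, own in enumerate(features):
        others = [f for j, f in enumerate(features) if j != i]
        weights = softmax([dot(own, other) for other in others])
        weighted_sum = [
            sum(w * other[d] for w, other in zip(weights, others))
            for d in range(len(own))
        ]
        corrected.append([o + s for o, s in zip(own, weighted_sum)])
    return corrected


# Hypothetical feature information for images 1 to 4.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
corrected = correct_features(features)
# Correcting the reversed sequence and reversing back gives the same result,
# since the correction does not depend on the order of the images.
shuffled = correct_features(features[::-1])[::-1]
```

Unlike the sequential processing of the regional feature acquisition unit, the correction for image 1 here depends only on which other images are present and how similar they are, not on their positions in the sequence.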

Further, according to an embodiment, the more frequently a feature value included in the feature information of each image appears across the images, the higher the importance of that feature value may be. Accordingly, according to an embodiment, the feature information for each image may be corrected based on the frequency of feature values among the plurality of images and the similarity of the feature information.

According to an embodiment, the second feature information may be obtained based on the corrected feature information for each of the plurality of images. For example, the second feature information may be obtained by combining, through a concatenation operation, the feature information corrected for each of the plurality of images based on similarity.

For example, a representative value (e.g., average value, median value, etc.) for feature values included in the corrected feature information may be determined, and feature information including the determined representative value may be obtained as the second feature information. For example, a representative value may be determined for feature values corresponding to each other among feature values included in the first feature information and the second feature information. The second feature information according to an exemplary embodiment is not limited to the above-described examples, and may be obtained through various methods based on the feature information corrected according to similarity for each of the plurality of images.
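The two examples given above for obtaining the second feature information, concatenation and a representative value, can be sketched as follows. The names `concatenate` and `representative` and the sample values are hypothetical; the sketch assumes that every image has corrected feature information of the same length.

```python
from statistics import mean, median


def concatenate(corrected_features):
    """Join the corrected feature vectors of all images into one long vector."""
    return [value for feature in corrected_features for value in feature]


def representative(corrected_features, reducer=mean):
    """Reduce corresponding feature values across images to one value per position."""
    return [reducer(values) for values in zip(*corrected_features)]


# Hypothetical corrected feature information for three images.
corrected = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
n_out_concat = concatenate(corrected)
n_out_mean = representative(corrected)
n_out_median = representative(corrected, reducer=median)
```

Concatenation preserves every value at the cost of a length that grows with the number of images, whereas a representative value keeps the length fixed regardless of how many images are used.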

Also, according to an embodiment, the conversion unit 232 may convert the second feature information obtained by the non-regional feature extraction unit 231 into a form that can be combined with the first feature information by the combining unit 240. For example, the conversion unit 232 may adjust the order of the feature values included in the second feature information, or may add a new feature value to the second feature information, so that feature values corresponding to each other in the first feature information and the second feature information can be combined. The conversion unit 232 is not limited to the above-described example and may convert the second feature information through various methods so that the first feature information and the second feature information can be combined.

According to an embodiment of the present disclosure, the combining unit 240 may combine the first characteristic information and the second characteristic information obtained from the regional feature acquisition unit 220 and the non-regional feature acquisition unit 230, respectively, to finally obtain the characteristic information for the image sequence 210.

For example, the combining unit 240 may combine the first characteristic information and the second characteristic information with each other according to Equation 1 below.

In Equation 1, h_k and F(n_out) denote the first characteristic information and the second characteristic information, respectively, and n' denotes the characteristic information for the image sequence 210 that is finally obtained by the combining unit 240. In addition, n_out is a value obtained by the non-regional feature extraction unit 231, and F(n_out), which is the result of processing n_out by the conversion unit 232, may be transferred to the combining unit 240. In Equation 1, a bold character denotes a vector and may indicate a quantity having multiple values, such as a matrix.

For example, a representative value (e.g., average value, median value, etc.) for feature values corresponding to each other among feature values included in the first feature information and the second feature information may be determined by the combining unit 240, and feature information including the determined representative value may be obtained as the feature information for the image sequence 210. Without being limited to the above-described example, the characteristic information for the image sequence 210 may be finally determined through various methods based on the first characteristic information and the second characteristic information.
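The conversion and combination steps described above could be sketched as follows. This is a minimal sketch under stated assumptions: the conversion is modeled as a reordering followed by zero padding, and the combination as an element-wise average. The function names `convert` and `combine`, the `order` argument, and the padding value are hypothetical and are not taken from the disclosure.

```python
def convert(n_out, order, target_len, fill=0.0):
    """Hypothetical conversion: reorder values by `order`, then pad to `target_len`."""
    reordered = [n_out[i] for i in order]
    return reordered + [fill] * (target_len - len(reordered))

def combine(h_k, f_n_out):
    """Average the values at corresponding positions of the two vectors."""
    if len(h_k) != len(f_n_out):
        raise ValueError("vectors must have the same length to be combined")
    return [(a + b) / 2 for a, b in zip(h_k, f_n_out)]

h_k = [0.2, 0.4, 0.6, 0.8]   # illustrative first characteristic information
n_out = [1.0, 3.0, 2.0]      # illustrative second characteristic information
f_n_out = convert(n_out, order=[0, 2, 1], target_len=len(h_k))
n_prime = combine(h_k, f_n_out)
```

The length check in `combine` reflects why a conversion step is needed at all: corresponding values can only be paired once both vectors have the same layout.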

Accordingly, according to an embodiment, even if the length of the image sequence 210 increases significantly, the feature information of images located in the front portion of the image sequence 210 can also be appropriately reflected in generating the caption information of the image sequence 210, according to the similarity with other images.

The caption generation unit 250 according to an embodiment may generate caption information for the image sequence 210 according to the feature information on the image sequence 210 determined by the combining unit 240. The caption generation unit 250 according to an embodiment may include a gated recurrent unit (GRU) that receives feature information of a given image as input and generates text describing the image. The caption generation unit 250 is not limited to the above-described example and may generate caption information for the image sequence 210 through various methods.
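For reference, a single state update of a gated recurrent unit could be sketched as follows. This is a generic GRU step in plain Python, not the caption generation unit of the disclosure. The parameter layout in `params` and the zero-valued weights are assumptions chosen only to keep the example small, and some libraries swap the roles of z and 1 - z in the final interpolation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, U, b, x, h):
    """Compute W x + U h + b for matrices given as lists of rows."""
    return [
        sum(w * a for w, a in zip(W_row, x))
        + sum(u * c for u, c in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def gru_step(x, h, params):
    """One GRU update: gates z and r, candidate state, then interpolation."""
    z = [sigmoid(v) for v in affine(*params["z"], x, h)]   # update gate
    r = [sigmoid(v) for v in affine(*params["r"], x, h)]   # reset gate
    gated_h = [r_k * h_k for r_k, h_k in zip(r, h)]
    cand = [math.tanh(v) for v in affine(*params["h"], x, gated_h)]
    return [(1 - z_k) * h_k + z_k * c_k for z_k, h_k, c_k in zip(z, h, cand)]

# Illustrative parameters only: all weights and biases are zero.
dim = 2
zero_mat = [[0.0] * dim for _ in range(dim)]
zero_vec = [0.0] * dim
params = {name: (zero_mat, zero_mat, zero_vec) for name in ("z", "r", "h")}
h_next = gru_step(x=[1.0, -1.0], h=[1.0, 2.0], params=params)
```

In a caption generator, a step of this kind would typically be repeated once per output word, with the hidden state carrying the sequence-level feature information forward.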

FIG. 3 is a diagram illustrating an example of a method by which the non-regional feature extraction unit 231 obtains second feature information according to an embodiment.

In Equation 1 above, n _out may mean second feature information obtained by the non-regional feature extraction unit 231.

Referring to FIG. 3, the non-regional feature extraction unit 231 may acquire x₀, x₁, x₂, and x₃ as the feature information of images 1 to 4 (231, 232, 233, 234), respectively. In addition, i and j denote identification information representing each image.

According to an embodiment, similarity between feature information of each image may be obtained as f(x _i , x _j ). f is a pairwise function for obtaining similarity, and may be defined in various forms.

For example, as shown in FIG. 3, in 231-1, f(x₂, x₀), f(x₂, x₁), f(x₂, x₂), and f(x₂, x₃) may be obtained as similarities of feature information between image 3 (233) and images 1, 2, and 4 (231, 232, 234), respectively. Similarity of feature information with other images may be obtained for the remaining images 1, 2, and 4 (231, 232, 234) as well as for image 3 (233).

A weight g(x_i), which may be determined differently for each image, may be applied to the similarity value obtained for each image, as in the illustrated example. Accordingly, as a result of performing the operation according to 231-1, y_2, which is a value calculated based on the similarity between images, may be obtained for image 3 (233).

According to an embodiment, the y_i values for images 1 to 4 (231, 232, 233, 234), which can be calculated as in 231-1, may be obtained according to Equation 2 below, based on the similarity values between the feature information of the images.

In Equation 2, f(x_i, x_j) denotes the similarity between the feature information of the images, and g denotes a weight value that can be applied differently for each image. y_i is a value obtained based on the similarity for image i, and n_i, which denotes the characteristic information about image i obtained based on the similarity, may be obtained from y_i according to Equation 5 below.

Also, the similarity function f and the weight function g of Equation 2 may be expressed as in Equation 3 below.

In addition, C(x) is a normalization factor, and when C(x) is set to [expression missing], Equation 2 may be modified as Equation 4 below.

Reference numeral 231-2 of FIG. 3 denotes a configuration for applying a residual connection to y_i, which is a value obtained according to Equation 2 or 4, and may be expressed as Equation 5 below.

W_g, W_θ, W_φ, and W_z included in Equations 3, 4, and 5 each denote a trainable weight matrix. Through the operation according to Equation 5, each weight value can be learned more effectively.
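One form that is consistent with the symbols defined above (f, g, C(x), W_g, W_θ, W_φ, and W_z) is the non-local operation known from the non-local neural network literature. The block below is offered as an assumption about the shape of Equations 2 to 5, in that order, and not as the wording of the disclosure.

```latex
\begin{align}
\mathbf{y}_i &= \frac{1}{C(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j) \\
g(\mathbf{x}_j) &= W_g \mathbf{x}_j, \qquad
f(\mathbf{x}_i, \mathbf{x}_j) = e^{(W_\theta \mathbf{x}_i)^{\top} (W_\phi \mathbf{x}_j)} \\
C(\mathbf{x}) = \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)
\;&\Rightarrow\;
\mathbf{y} = \operatorname{softmax}\!\left(\mathbf{x}^{\top} W_\theta^{\top} W_\phi \mathbf{x}\right) g(\mathbf{x}) \\
\mathbf{n}_i &= W_z \mathbf{y}_i + \mathbf{x}_i
\end{align}
```

Under this assumed form, the last line is the residual connection: the original feature x_i is carried through unchanged and the similarity-based term W_z y_i is added to it.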

Accordingly, according to an embodiment, the x_i values, which are the characteristic information for images 1 to 4 (231, 232, 233, 234), may be modified to the n_i values according to Equation 5, based on the y_i values obtained from the similarity.
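The modification of x_i into n_i through a residual connection could be sketched as follows. This sketch assumes the residual form n_i = W_z y_i + x_i, which matches the description of a residual connection applied to y_i with a trainable matrix W_z, but the exact form of Equation 5 is an assumption here, and the numeric values are illustrative.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_update(x_i, y_i, W_z):
    """Return n_i = W_z y_i + x_i, keeping x_i as a shortcut path."""
    return [p + q for p, q in zip(matvec(W_z, y_i), x_i)]

W_z = [[0.5, 0.0], [0.0, 0.5]]   # illustrative values, not trained weights
n_2 = residual_update(x_i=[1.0, 1.0], y_i=[7.0, 3.0], W_z=W_z)
```

If W_z is all zeros, n_i equals x_i, so the similarity-based correction can start as a no-op and be learned gradually.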

n_i obtained by Equation 5 may be converted to n_out, which is the second feature information described above, according to Equation 6 below. n_out is not limited to the method according to Equation 6, and may be obtained by combining the modified feature information for images 1 to 4 (231, 232, 233, 234) through various methods.

[Equation 6]

n_out, which represents the second characteristic information for the image sequence, may be converted by the conversion unit 232 into the form F(n_out), transmitted to the combining unit 240, and combined with the first characteristic information.

FIG. 4 is a block diagram illustrating an example of an electronic device 1000 that generates caption information for an image sequence 210 according to an embodiment.

The non-regional feature acquisition unit 430 of FIG. 4 corresponds to the non-regional feature acquisition unit 230 of FIG. 2, but differs in that the value input to the non-regional feature acquisition unit 430 is feature information processed by the feature extraction units 1 to 4 (221, 222, 223, 224).

According to an embodiment, when the regional feature acquisition unit 220 sequentially processes the feature information of images 1 to 4 (231, 232, 233, 234) to obtain the first feature information, the non-regional feature acquisition unit 430 may obtain the second feature information based on the similarity between the pieces of information obtained from the feature extraction units 1 to 4 (221, 222, 223, 224).

In an embodiment, in the regional feature acquisition unit 220, the feature information of each image may be sequentially processed by the feature extraction units 1 to 4 (221, 222, 223, 224) in order to acquire the feature information of the image sequence 210.

For example, the feature extraction unit 1 (221) may output feature information of image 1 to the non-regional feature acquisition unit 430. Also, the feature extraction unit 2 (222) may output feature information of the image sequence 210, which is determined from the feature information of image 1 and the feature information of image 2, to the non-regional feature acquisition unit 430. Also, the feature extraction unit 3 (223) may output feature information of the image sequence 210, which is determined from the result of the feature extraction unit 2 (222) and the feature information of image 3, to the non-regional feature acquisition unit 430. Also, the feature extraction unit 4 (224) may output feature information of the image sequence 210, which is determined from the result of the feature extraction unit 3 (223) and the feature information of image 4, to the non-regional feature acquisition unit 430.
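The chain of feature extraction units described above could be sketched as follows. This is a minimal sketch in which each unit combines the previous unit's result with the current image's feature information and every intermediate result is kept so that it can be passed on. The update rule `blend` (a simple average) is a hypothetical stand-in, since the disclosure does not specify the update at this point.

```python
def sequential_states(image_features, step):
    """Return each unit's output: h_1 = x_1, then h_k = step(h_{k-1}, x_k)."""
    states = [image_features[0]]
    for x_k in image_features[1:]:
        states.append(step(states[-1], x_k))
    return states

def blend(h_prev, x_k):
    """Hypothetical update rule: average the running state with the new feature."""
    return [(a + b) / 2 for a, b in zip(h_prev, x_k)]

# Illustrative feature information for images 1 to 4.
x = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0], [1.0, 3.0]]
h = sequential_states(x, blend)
```

The list `h` holds one entry per feature extraction unit. In the arrangement of FIG. 4, these entries, rather than the raw per-image features, would be the inputs among which similarities are computed.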

The non-regional feature acquisition unit 430 according to an embodiment may obtain the second characteristic information based on the similarity between the pieces of feature information input from the feature extraction units 1 to 4 (221, 222, 223, 224), in the same manner as the operation of the non-regional feature acquisition unit 230 of FIG. 2.

FIG. 5 is a block diagram illustrating an internal configuration of the electronic device 1000 according to an embodiment.

FIG. 6 is a block diagram showing an internal configuration of the electronic device 1000 according to an embodiment.

Referring to FIG. 5, the electronic device 1000 may include a memory 1700, a processor 1300, and an output unit 1200. However, not all of the components illustrated in FIG. 5 are essential components of the electronic device 1000. The electronic device 1000 may be implemented by more components than those illustrated in FIG. 5, or the electronic device 1000 may be implemented by fewer components than those illustrated in FIG. 5.

For example, as illustrated in FIG. 6, the electronic device 1000 according to some embodiments may further include a user input unit 1100, a sensing unit 1400, a communication unit 1500, and an A/V input unit 1600, in addition to the memory 1700, the processor 1300, and the output unit 1200.

The user input unit 1100 refers to a means by which a user inputs data for controlling the electronic device 1000. For example, the user input unit 1100 may include a key pad, a dome switch, a touch pad (contact capacitive type, pressure resistive film type, infrared sensing type, surface ultrasonic conduction type, integral tension measurement type, piezo effect type, etc.), a jog wheel, a jog switch, and the like, but is not limited thereto.

According to an embodiment, the user input unit 1100 may receive a user input for generating caption information for an image sequence.

The output unit 1200 may output an audio signal, a video signal, or a vibration signal, and may include a display unit 1210, an audio output unit 1220, and a vibration motor 1230.

The output unit 1200 according to an embodiment may output information based on caption information generated for an image sequence. For example, the output unit 1200 may output text representing caption information of an image sequence, generated according to an embodiment. Further, the output unit 1200 may output information indicating a result of performing various operations based on text representing caption information of an image sequence, generated according to an embodiment.

The display unit 1210 displays and outputs information processed by the electronic device 1000. According to an embodiment, the display unit 1210 may display a result of generating caption information for an image sequence. Also, the display unit 1210 may display information indicating a result of performing various operations based on text representing caption information of an image sequence, generated according to an embodiment.

Meanwhile, when the display unit 1210 and the touch pad are configured as a touch screen by forming a layer structure, the display unit 1210 may be used as an input device in addition to an output device. The display unit 1210 may include at least one of a liquid crystal display, a thin film transistor-liquid crystal display, an organic light-emitting diode display, a flexible display, a three-dimensional (3D) display, and an electrophoretic display. Also, depending on the implementation form of the electronic device 1000, the electronic device 1000 may include two or more display units 1210.

The audio output unit 1220 outputs audio data received from the communication unit 1500 or stored in the memory 1700.

The vibration motor 1230 may output a vibration signal. Also, the vibration motor 1230 may output a vibration signal when a touch is input to the touch screen.

The processor 1300 typically controls the overall operation of the electronic device 1000. For example, the processor 1300 may control the user input unit 1100, the output unit 1200, the sensing unit 1400, the communication unit 1500, the A/V input unit 1600, and the like overall by executing programs stored in the memory 1700. The electronic device 1000 may include at least one processor 1300.

The processor 1300 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processor 1300 from the memory 1700, or may be received through the communication unit 1500 and provided to the processor 1300. For example, the processor 1300 may be configured to execute instructions according to program code stored in a recording device such as a memory.

The at least one processor 1300 according to an embodiment may perform an operation for generating caption information for an image sequence. The at least one processor 1300 according to an embodiment may acquire first characteristic information and second characteristic information regarding characteristics of an image sequence by using a plurality of images included in the image sequence, and may generate caption information for the image sequence based on the first characteristic information and the second characteristic information.

The first characteristic information according to an embodiment may include information on characteristics of an image sequence determined based on the feature information of the plurality of images sequentially processed according to an image order.

In addition, the second characteristic information according to an embodiment may include information on characteristics of the image sequence determined based on at least one similarity between the characteristic information of the plurality of images. For example, the second feature information may be obtained by combining feature information for each of a plurality of images corrected based on the at least one similarity value.

The sensing unit 1400 may detect a state of the electronic device 1000 or a state around the electronic device 1000 and transmit the sensed information to the processor 1300.

The sensing unit 1400 may include at least one of a magnetic sensor 1410, an acceleration sensor 1420, a temperature/humidity sensor 1430, an infrared sensor 1440, a gyroscope sensor 1450, a position sensor (e.g., GPS) 1460, an air pressure sensor 1470, a proximity sensor 1480, and an RGB sensor (illuminance sensor) 1490, but is not limited thereto.

The communication unit 1500 may include one or more components that allow the electronic device 1000 to communicate with a server (not shown) or an external device (not shown). For example, the communication unit 1500 may include a short-range communication unit 1510, a mobile communication unit 1520, and a broadcast reception unit 1530.

The communication unit 1500 according to an embodiment may receive information required to generate caption information for an image sequence from the outside. For example, the communication unit 1500 may receive an image sequence for generating caption information from the outside.

Also, the communication unit 1500 according to an embodiment may transmit caption information generated by at least one processor 1300 to the outside.

The short-range wireless communication unit 1510 may include a Bluetooth communication unit, a Bluetooth Low Energy (BLE) communication unit, a near field communication (NFC) unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an infrared (IrDA, Infrared Data Association) communication unit, a WFD (Wi-Fi Direct) communication unit, a UWB (ultra wideband) communication unit, an Ant+ communication unit, and the like, but is not limited thereto.

The mobile communication unit 1520 transmits and receives a wireless signal to and from at least one of a base station, an external terminal, and a server on a mobile communication network. Here, the wireless signal may include various types of data according to transmission and reception of a voice call signal, a video call signal, or a text/multimedia message.

The broadcast receiving unit 1530 receives a broadcast signal and/or broadcast related information from the outside through a broadcast channel. The broadcast channel may include a satellite channel and a terrestrial channel. Depending on the implementation example, the electronic device 1000 may not include the broadcast receiving unit 1530.

The A/V (Audio/Video) input unit 1600 is for inputting an audio signal or a video signal, and may include a camera 1610 and a microphone 1620. The camera 1610 may obtain an image frame, such as a still image or a moving image, through an image sensor in a video call mode or a shooting mode. The image captured through the image sensor may be processed by the processor 1300 or a separate image processing unit (not shown). The microphone 1620 receives external sound signals and processes them into electrical voice data.

According to an embodiment, an image sequence for which caption information may be generated may be obtained by capturing images with the A/V input unit 1600.

The memory 1700 may store a program for processing and control by the processor 1300, and may store data input to or output from the electronic device 1000.

The memory 1700 according to an embodiment may store one or more instructions, and the at least one processor 1300 of the above-described electronic device 1000 may perform an operation according to an embodiment by executing the one or more instructions stored in the memory 1700.

Also, the memory 1700 according to an embodiment may store information necessary to generate caption information of an image sequence. For example, the memory 1700 may store at least one image sequence for which caption information can be generated. The image sequence stored in the memory 1700 may be at least one of an image sequence obtained by the A/V input unit 1600 and an image sequence received from the outside.

The memory 1700 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk.

Programs stored in the memory 1700 may be classified into a plurality of modules according to their functions, for example, a UI module 1710, a touch screen module 1720, and a notification module 1730.

The UI module 1710 may provide a specialized UI, GUI, and the like interworking with the electronic device 1000 for each application. The touch screen module 1720 may detect a user's touch gesture on the touch screen and transfer information regarding the touch gesture to the processor 1300. The touch screen module 1720 according to some embodiments may recognize and analyze a touch code. The touch screen module 1720 may be configured as separate hardware including a controller.

Various sensors may be provided inside or near the touch screen to sense a touch or a proximity touch on the touch screen. A tactile sensor is an example of a sensor for sensing a touch on the touch screen. A tactile sensor is a sensor that senses the contact of a specific object to the degree that a human can feel, or to a greater degree. The tactile sensor can detect various information such as the roughness of the contact surface, the hardness of the contact object, and the temperature of the contact point.

The user's touch gesture may include tap, touch & hold, double tap, drag, pan, flick, drag and drop, swipe, and the like.

The notification module 1730 may generate a signal for notifying the occurrence of an event in the electronic device 1000.

Referring to FIG. 7, in operation 710, the electronic device 1000 may extract feature information for each of a plurality of images included in the image sequence. The feature information of the plurality of images according to an embodiment may include at least one of various information representing visual characteristics of each image and information regarding a result of recognizing each image by inputting the information representing the visual characteristics into a data recognition model.

In operation 720, the electronic device 1000 may obtain first characteristic information regarding the characteristics of the image sequence by sequentially processing the characteristic information extracted in operation 710 according to the order of the images. According to an embodiment of the present disclosure, the electronic device 1000 may obtain the first feature information as feature information for the image sequence by sequentially processing the feature information of each image according to the order of the images in the image sequence.

The electronic device 1000 according to an embodiment may acquire the first feature information by using a data learning model for obtaining feature information of an image sequence including a plurality of images from feature information for the plurality of images.

The first feature information according to an embodiment may include feature information about an image sequence obtained by considering the order of each image.

In operation 730, the electronic device 1000 may obtain the second characteristic information based on at least one similarity between the characteristic information extracted in operation 710.

The electronic device 1000 according to an embodiment may obtain the second feature information by using at least one similarity value between the feature information of a plurality of images, in order to obtain feature information of an image sequence including the plurality of images. For example, the second feature information may be obtained by combining feature information for each of a plurality of images corrected based on the at least one similarity value.

Unlike the first feature information, the second feature information according to an embodiment may include feature information about an image sequence, obtained without considering the order of each image.
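The contrast between the two kinds of feature information can be illustrated with a toy example. This is a minimal sketch assuming a running average for the sequential processing and an averaged similarity-weighted sum for the similarity-based processing. Both rules and the function names are illustrative assumptions, not the models of the disclosure.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def first_feature(xs):
    """Order-dependent: fold the images left to right with a running average."""
    h = xs[0]
    for x in xs[1:]:
        h = [(a + b) / 2 for a, b in zip(h, x)]
    return h

def second_feature(xs):
    """Order-independent: similarity-weighted sums, averaged over all images."""
    ys = [
        [sum(dot(x_i, x_j) * x_j[k] for x_j in xs) for k in range(len(x_i))]
        for x_i in xs
    ]
    return [sum(col) / len(ys) for col in zip(*ys)]

# Hypothetical feature information for three images, and the same in reverse order.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reversed_xs = xs[::-1]
```

Reversing the image order changes the output of `first_feature` but not that of `second_feature`. With a concatenation-based combination, the positions of the per-image blocks would follow the image order, so this invariance holds only for the averaging used in this sketch.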

In operation 740, the electronic device 1000 may generate caption information based on the first characteristic information and the second characteristic information obtained in operations 720 and 730.

According to an embodiment, the second characteristic information may be converted into a form that can be combined with the first characteristic information before being combined with the first characteristic information.

In addition, according to an embodiment, the first characteristic information and the second characteristic information may be combined with each other based on a representative value of feature values corresponding to each other among the feature values included in the first characteristic information and the second characteristic information. For example, the combined information may include the representative values of the corresponding feature values in the first characteristic information and the second characteristic information.

According to an embodiment of the present disclosure, the electronic device 1000 may obtain information combining the first characteristic information and the second characteristic information as characteristic information for the final image sequence. The electronic device 1000 may generate caption information based on the feature information for the final image sequence.

According to an embodiment, even in the case of a long image sequence, caption information may be generated in which characteristics of images at the front of the image sequence are properly reflected.

An embodiment may also be implemented in the form of a recording medium including instructions executable by a computer, such as a program module executed by a computer. Computer-readable media can be any available media that can be accessed by a computer, and include both volatile and nonvolatile media and both removable and non-removable media. In addition, computer-readable media may include both computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically include computer-readable instructions, data structures, or program modules, and include any information delivery media.

Also, in this specification, the “unit” may be a hardware component such as a processor or circuit, and/or a software component executed by a hardware component such as a processor.

The above description of the present invention is for illustration only, and those skilled in the art to which the present invention pertains will understand that the present invention can be easily modified into other specific forms without changing its technical spirit or essential features. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive. For example, each component described as a single type may be implemented in a distributed manner, and similarly, components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the following claims rather than the above detailed description, and all changes or modified forms derived from the meaning and scope of the claims and their equivalent concepts should be interpreted as being included in the scope of the present invention.

Claims

A method of generating caption information for an image sequence in an electronic device, the method comprising:

obtaining a plurality of images included in the image sequence;

extracting feature information for each of the plurality of images;

obtaining first feature information regarding features of the image sequence by sequentially processing the extracted feature information according to the order of the plurality of images;

obtaining second feature information regarding features of the image sequence, determined based on at least one similarity between the extracted feature information; and

generating caption information for the image sequence based on the first feature information and the second feature information.
The method of claim 1, wherein the obtaining of the second feature information comprises:

modifying feature information for each of the plurality of images based on the at least one similarity; and

obtaining the second feature information based on the modified feature information for each of the plurality of images.
The method of claim 2, wherein the second feature information is obtained by combining the feature information modified for each of the plurality of images through a concatenation operation.
The method of claim 1, wherein the generating of the caption information comprises:

combining the first feature information and the second feature information; and

generating caption information for the image sequence based on the combined information.
The method of claim 4, wherein the second feature information is converted into a form that can be combined with the first feature information and is then combined with the first feature information.
The method of claim 4, wherein the first feature information and the second feature information are combined based on representative values of feature values corresponding to each other among the feature values included in the first feature information and the second feature information.
The method of claim 1, wherein the at least one similarity is obtained based on, instead of the feature information of the plurality of images, information obtained each time the feature information of the plurality of images is sequentially processed in order to obtain the first feature information.
An electronic device for generating caption information for an image sequence, the electronic device comprising:

a memory storing a plurality of images included in the image sequence;

at least one processor configured to extract feature information for each of the plurality of images, obtain first feature information regarding features of the image sequence by sequentially processing the extracted feature information according to the order of the plurality of images, obtain second feature information regarding features of the image sequence determined based on at least one similarity between the extracted feature information, and generate caption information for the image sequence based on the first feature information and the second feature information; and

an output unit configured to output information based on the generated caption information.
The electronic device of claim 8, wherein the at least one processor is configured to modify feature information for each of the plurality of images based on the at least one similarity, and obtain the second feature information based on the modified feature information for each of the plurality of images.
The electronic device of claim 9, wherein the second feature information is obtained by combining the feature information modified for each of the plurality of images through a combination operation.
The electronic device of claim 8, wherein the at least one processor is configured to combine the first feature information and the second feature information, and generate caption information for the image sequence based on the combined information.
The electronic device of claim 11, wherein the second feature information is converted into a form that can be combined with the first feature information and is then combined with the first feature information.
The electronic device of claim 11, wherein the first feature information and the second feature information are combined based on representative values of feature values corresponding to each other among the feature values included in the first feature information and the second feature information.
The electronic device of claim 8, wherein the at least one similarity is obtained based on, instead of the feature information of the plurality of images, information obtained each time the feature information of the plurality of images is sequentially processed in order to obtain the first feature information.
A computer-readable recording medium in which a program for implementing the method of any one of claims 1 to 7 is recorded.