KR20220087359A - Method and electronic device for producing summary of video

Info

Publication number: KR20220087359A
Application number: KR1020210116222A
Authority: KR
Inventors: 파텔 밀란쿠마; 쿠마 자스왈 너닐; 티와리 사우랍; 칸트 데오 투사르; 쿠마 비제이안앤드
Original assignee: 삼성전자주식회사
Priority date: 2020-12-17
Filing date: 2021-09-01
Publication date: 2022-06-24

Abstract

Disclosed is a method by which an electronic device (100) generates a video summary. The method includes receiving a video comprising a plurality of frames. The method also includes determining a viewpoint of a user viewing the video. The method also includes determining, based on the user's viewpoint, whether at least one region of interest (ROI) of the user is available in the video. The method also includes, in response to determining that the at least one ROI is available in the video, identifying, from the plurality of frames, a set of frames that include the at least one ROI. The method also includes generating a video summary based on the identified set of frames.

Description

METHOD AND ELECTRONIC DEVICE FOR PRODUCING SUMMARY OF VIDEO

The disclosed invention relates to a video summary generation system and, more particularly, to a method and an electronic device for generating a video summary based on a viewpoint of a user viewing the video.

Many methods and systems have been proposed for generating video summaries. In existing methods and systems, a video summary is generated based on at least one of geometric interpretation for extracting regions of interest for key frame identification, camera angles for regions of interest, face color histograms, object information, and intelligent video thumbnail selection and generation. These existing methods and systems may have advantages and disadvantages in terms of power consumption, memory usage, robustness, reliability, integrity, operational dependence, time, cost, complexity, design, hardware components used, size, and so on. In addition, capturing a viewpoint from the user's perspective is difficult, and current neural networks cannot simulate behavior such as retaining past information. Furthermore, current deep learning systems such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are limited in capturing and generating a summary from a single video.

Accordingly, it is desirable to address the aforementioned disadvantages or other shortcomings, or at least to provide a useful alternative.

One aspect of the disclosed invention provides a method and an electronic device for generating a summary of a video based on a viewpoint of a user viewing the video.

One aspect of the disclosed invention provides a frame selection technique based on input excitation per frame or per frame sequence. "Input excitation" is a weighting parameter that is compared when key frames are selected from a viewpoint.

One aspect of the disclosed invention provides for determining a change in the user's viewpoint. A video summary can therefore be generated dynamically in a cost-effective manner. The viewpoint can help in the probabilistic determination that produces multiple video summaries for a single video.

One aspect of the disclosed invention provides for generating a video summary by capturing environmental inputs, the user's thoughts, and positive or negative ratings or reviews.

Accordingly, embodiments of the disclosed invention disclose a method by which an electronic device generates a video summary. The method may include receiving, by the electronic device, a video including a plurality of frames. The method may also include determining, by the electronic device, at least one viewpoint of a user viewing the video. The viewpoint may include at least one of a subjective viewpoint of the user, an objective viewpoint of the user, and a physical viewpoint of the user. The method may also include determining, by the electronic device, whether at least one region of interest (ROI) of the user is available in the video based on the at least one viewpoint of the user. The method may also include identifying, by the electronic device, a set of frames that include the at least one ROI from the plurality of frames, in response to determining that the at least one ROI is available in the video. The method may also include generating, by the electronic device, a video summary based on the identified set of frames. The method may also include storing, by the electronic device, the video summary.

According to an embodiment, determining, by the electronic device, the subjective viewpoint of the user viewing the video may include obtaining, by the electronic device, a plurality of subjective parameters associated with the user. The plurality of subjective parameters may include at least one of the user's occupation, the user's age, the user's preferences, an event associated with the user, and the user's activity on at least one social networking site. The method may also include determining, by the electronic device, subjective context information of the user based on the plurality of subjective parameters associated with the user. The method may also include determining, by the electronic device, the subjective viewpoint of the user based on the subjective context information of the user.

According to an embodiment, determining, by the electronic device, the objective viewpoint of the user viewing the video may include: obtaining, by the electronic device, a plurality of objective parameters associated with the user; determining, by the electronic device, objective context information of the user based on the plurality of objective parameters; and determining, by the electronic device, the objective viewpoint of the user based on the objective context information. The plurality of objective parameters may include at least one of the user's past history, the user's current goal, and the user's additional goals.

According to an embodiment, determining, by the electronic device, the physical viewpoint of the user viewing the video may include: obtaining, by the electronic device, a plurality of physical parameters associated with the user; determining, by the electronic device, physical context information of the user based on the plurality of physical parameters; and determining, by the electronic device, the physical viewpoint of the user based on the physical context information. The plurality of physical parameters may include at least one of an angle of a camera associated with the user, the user's location, the ambient light condition around the user, the weather condition around the user, and the user's privacy preferences.

According to an embodiment, identifying, by the electronic device, the set of frames that include the at least one ROI from the plurality of frames may include: determining, by the electronic device, an excitation level of each frame of the plurality of frames of the video based on a plurality of excitation parameters associated with each frame; extracting, by the electronic device, at least one of an audio parameter and a text parameter of each frame; determining, by the electronic device, relative context information of each frame of the plurality of frames based on the excitation level of the frame and at least one of the audio parameter and the text parameter; and identifying, by the electronic device, the set of frames that include the at least one ROI from the plurality of frames based on the relative context information of each frame. The plurality of excitation parameters may include at least one of the speed of the ROI in each frame, the intensity of the ROI in each frame, the frequency of appearance of the ROI in each frame, and the playback duration of each frame.
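
As a rough illustration of the frame identification step described above, the following Python sketch computes an excitation level per frame from the four excitation parameters, blends it with an audio or text parameter to obtain a relative context score, and keeps the frames that contain the ROI and reach a threshold. The `Frame` class, the equal weights, the blending factor `alpha`, the threshold, and the assumption that every parameter is normalized to the range 0 to 1 are hypothetical choices made for this example; the specification does not fix any of them.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Frame:
    """Hypothetical per-frame record; all scores assumed normalized to [0, 1]."""
    index: int
    roi_speed: float
    roi_intensity: float
    roi_frequency: float
    playback_duration: float
    audio_score: float = 0.0
    text_score: float = 0.0
    has_roi: bool = True


def excitation_level(frame: Frame,
                     weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of the four excitation parameters (weights are assumed)."""
    params = (frame.roi_speed, frame.roi_intensity,
              frame.roi_frequency, frame.playback_duration)
    return sum(w * p for w, p in zip(weights, params))


def relative_context(frame: Frame, alpha: float = 0.5) -> float:
    """Blend the excitation level with the stronger of the audio/text parameters."""
    side_channel = max(frame.audio_score, frame.text_score)
    return alpha * excitation_level(frame) + (1.0 - alpha) * side_channel


def identify_frame_set(frames: List[Frame], threshold: float = 0.5) -> List[Frame]:
    """Keep frames that contain the ROI and whose relative context reaches the threshold."""
    return [f for f in frames if f.has_roi and relative_context(f) >= threshold]
```

A linear blend is used here only because it is the simplest way to combine the listed parameters; any monotone scoring function would fit the description equally well.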

According to an embodiment, generating, by the electronic device, the summary of the video based on the identified set of frames may include: determining, by the electronic device, a weight for each frame of the identified set of frames based on the user's ROI and viewpoint; sequencing, by the electronic device, the frames of the identified set of frames based on the weight determined for each frame; and merging, by the electronic device, the sequenced set of frames to generate the video summary.
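
The weight, sequence, and merge steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: frames are represented by identifiers, `weight_fn` is a hypothetical callable standing in for the ROI- and viewpoint-based weighting, sequencing is taken to mean ordering by descending weight, and merging is reduced to returning the ordered list of identifiers rather than concatenating actual frame data.

```python
from typing import Callable, Hashable, List, Optional, Sequence


def generate_summary(frame_ids: Sequence[Hashable],
                     weight_fn: Callable[[Hashable], float],
                     max_frames: Optional[int] = None) -> List[Hashable]:
    """Weight each identified frame, sequence by weight, and merge into one list."""
    # Step 1: determine a weight for every frame in the identified set.
    weighted = [(weight_fn(fid), fid) for fid in frame_ids]
    # Step 2: sequence by descending weight; sorted() is stable, so frames
    # with equal weight keep their original relative order.
    sequenced = sorted(weighted, key=lambda pair: pair[0], reverse=True)
    if max_frames is not None:
        sequenced = sequenced[:max_frames]
    # Step 3: "merge" by emitting the ordered frame identifiers.
    return [fid for _, fid in sequenced]
```

For example, with weights `{"f1": 0.2, "f2": 0.9, "f3": 0.5}` supplied through `weights.get`, the function orders the frames as `f2`, `f3`, `f1`.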

According to an embodiment, determining, by the electronic device, the weight for each frame of the identified set of frames based on the user's ROI and viewpoint may include: obtaining, by the electronic device, a relationship parameter between the at least one viewpoint of the user and each of the identified frames, and a perspective angle of each of the identified frames; and determining, by the electronic device, the weight for the identified frame based on the obtained relationship parameter. The relationship parameter may identify at least one of an angle of the video based on the at least one viewpoint of the user and a perspective view of a scene in the identified frame.
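
One way to picture a weight derived from a relationship parameter is to compare the angle implied by the user's viewpoint with the perspective angle of a frame. The sketch below is only an assumed example: the function name, the use of degrees, and the choice of a clipped cosine as the weighting curve are not taken from the specification.

```python
import math


def relationship_weight(user_view_angle_deg: float,
                        frame_perspective_angle_deg: float) -> float:
    """Return a weight in [0, 1] that is highest when the two angles coincide."""
    # Angular difference between the user's viewpoint and the frame perspective.
    delta = math.radians(user_view_angle_deg - frame_perspective_angle_deg)
    # Clipped cosine: 1.0 for aligned angles, 0.0 once they differ by 90 degrees or more.
    return max(0.0, math.cos(delta))
```

Under this assumption a frame shot from the user's preferred angle receives the full weight, and frames shot from the opposite side receive none.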

According to an embodiment, identifying, by the electronic device, the set of frames that include the at least one ROI from the plurality of frames may include: determining, by the electronic device, an absolute completeness score of the video; determining, by the electronic device, absolute frame excitation information of the video based on the absolute completeness score; detecting, by the electronic device, co-reference information of the video based on the absolute frame excitation information; and determining, by the electronic device, a sequence excitation level of the video based on the co-reference information.

According to an embodiment, determining, by the electronic device, the absolute frame excitation information of the video based on the absolute completeness score may include: obtaining, by the electronic device, the speed of the ROI in each frame, the intensity of the ROI in each frame, the frequency of appearance of the ROI in each frame, and the playback duration of each frame; and determining, by the electronic device, the absolute frame excitation information of the video based on the obtained speed of the ROI in each frame, the obtained intensity of the ROI in each frame, the obtained frequency of appearance of the ROI in each frame, and the obtained playback duration of each frame.

According to an embodiment, the absolute completeness score may be determined by obtaining absolute frame information associated with the video, obtaining a completeness threshold associated with the video, and comparing the obtained absolute frame information with the obtained completeness threshold.
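
The comparison described above could be realized in many ways. The following Python sketch assumes, purely for illustration, that the absolute frame information and the completeness threshold are both non-negative scalars and that the score is their ratio clipped to the range 0 to 1; the specification does not state the form of the comparison.

```python
def absolute_completeness_score(absolute_frame_info: float,
                                completeness_threshold: float) -> float:
    """Compare frame information with a threshold and return a score in [0, 1]."""
    if completeness_threshold <= 0:
        raise ValueError("completeness_threshold must be positive")
    # Ratio of available information to the required amount, clipped to [0, 1].
    ratio = absolute_frame_info / completeness_threshold
    return min(1.0, max(0.0, ratio))
```

A score of 1.0 then means the frame information meets or exceeds the threshold, and lower values indicate how far short it falls.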

According to an embodiment, the absolute frame excitation information may be configured to drive a relative frame excitation associated with the set of frames in order to sequence the set of frames. The absolute frame excitation information may be captured independently and matched against segments of a reference excitation in the context information. The relative frame excitation may be defined by the excitation coverage per frame, in order. A reference frame excitation may be input, and the frame sequence may be adjusted accordingly to obtain the reference excitation level.
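
Adjusting a frame sequence so that it follows a reference excitation can be sketched as a greedy matching. In this hypothetical Python example, each frame has a single absolute excitation value, the reference excitation is a list of target levels in playback order, and each target is assigned the closest unused frame. The greedy strategy and all names are assumptions made for illustration, not the method of the specification.

```python
from typing import Dict, List, Sequence


def sequence_by_reference(frame_excitations: Dict[str, float],
                          reference: Sequence[float]) -> List[str]:
    """Order frames so their excitation tracks the reference levels in sequence."""
    remaining = dict(frame_excitations)  # copy so the caller's dict is untouched
    order: List[str] = []
    for target in reference:
        if not remaining:
            break
        # Pick the unused frame whose excitation is closest to this target;
        # ties are broken by frame identifier for determinism.
        best = min(remaining, key=lambda fid: (abs(remaining[fid] - target), fid))
        order.append(best)
        del remaining[best]
    return order
```

Given excitations `{"a": 0.2, "b": 0.9, "c": 0.5}` and a reference of `[0.5, 1.0, 0.1]`, the frames are placed in the order `c`, `b`, `a`.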

According to an embodiment, the co-reference information may be configured to maintain the sequence excitation level associated with the set of frames. The co-reference information may be determined by obtaining at least one of audio usage associated with the set of frames and semantic similarities associated with the set of frames, and determining the co-reference information based on at least one of the obtained audio usage and the obtained semantic similarities.
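
To make the combination of audio usage and semantic similarity concrete, the sketch below scores how strongly two frames refer to the same content. It assumes, hypothetically, that each frame has a feature embedding, that semantic similarity is measured by cosine similarity between embeddings, that audio usage is a value between 0 and 1, and that the two are mixed with a factor `beta`. None of these choices come from the specification.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def co_reference_score(audio_usage: float,
                       embedding_a: Sequence[float],
                       embedding_b: Sequence[float],
                       beta: float = 0.5) -> float:
    """Mix audio usage with the semantic similarity of two frame embeddings."""
    similarity = cosine_similarity(embedding_a, embedding_b)
    return beta * audio_usage + (1.0 - beta) * similarity
```

Consecutive frames with a high score could be kept together so that the sequence excitation level does not drop between them.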

According to an embodiment, the sequence excitation level may be configured to map similarities associated with the set of frames.

Accordingly, embodiments of the present invention disclose an electronic device for generating a video summary. The electronic device may include a video summary generation controller coupled to a memory and a processor. The viewpoint-based video summary generation controller may be configured to receive a video including a plurality of frames and to determine at least one viewpoint of a user viewing the video. The viewpoint may include at least one of a subjective viewpoint of the user, an objective viewpoint of the user, and a physical viewpoint of the user. The viewpoint-based video summary generation controller may be configured to determine whether at least one ROI of the user is available in the video based on the at least one viewpoint of the user. The viewpoint-based video summary generation controller may be configured to identify a set of frames that include the at least one ROI from the plurality of frames in response to determining that the at least one ROI is available in the video. The viewpoint-based video summary generation controller may be configured to generate a video summary based on the identified set of frames. The viewpoint-based video summary generation controller may be configured to store the video summary.

These and other aspects of the embodiments of the present invention will be more fully appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating preferred embodiments and numerous specific details thereof, is given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments of the present invention, and the embodiments herein include all such modifications.

According to one aspect of the disclosed invention, a method and an electronic device for generating a summary of a video based on a viewpoint of a user viewing the video can be provided. As a result, the user experience is improved.

According to one aspect of the disclosed invention, a frame selection technique based on input excitation per frame or per frame sequence can be provided. "Input excitation" is a weighting parameter that is compared when key frames are selected from a viewpoint.

According to one aspect of the disclosed invention, a change in the user's viewpoint can be determined. A video summary can therefore be generated dynamically in a cost-effective manner. The viewpoint can help in the probabilistic determination that produces multiple video summaries for a single video.

According to one aspect of the disclosed invention, a video summary can be generated by capturing environmental inputs, the user's thoughts, and positive or negative ratings or reviews.

FIG. 1 illustrates various hardware components of an electronic device for generating a video summary, according to an embodiment.
FIG. 2 is a flowchart illustrating a method of generating a video summary, according to an embodiment.
FIG. 3 is an exemplary diagram in which an electronic device generates a video summary based on an objective viewpoint of a user, according to an embodiment.
FIG. 4 is an exemplary diagram in which an electronic device generates a video summary based on a physical viewpoint of a user, according to an embodiment.
FIG. 5 is an exemplary diagram in which an electronic device generates a video summary based on a subjective viewpoint of a user, according to an embodiment.
FIGS. 6A to 6C are exemplary diagrams summarizing an entire soccer match from different viewpoints, according to an embodiment.
FIGS. 7A to 7C are exemplary diagrams summarizing a cake cutting ceremony from multiple frames of a video, according to an embodiment.
FIG. 8A is an exemplary diagram illustrating key frame selection, according to an embodiment.
FIG. 8B is an exemplary graph in which an excitation score and a completeness score are calculated based on the key frames selected in connection with FIG. 8A, according to an embodiment.
FIG. 9 is an exemplary diagram illustrating frame sequencing, according to an embodiment.
FIG. 10 is an exemplary diagram illustrating adjusted frame sequencing based on a reference excitation frame, according to an embodiment.
FIG. 11 is an exemplary diagram in which a video summary is generated based on multiple viewpoints, according to an embodiment.
FIG. 12 is an exemplary diagram in which a summary of a soccer match is generated based on multiple viewpoints, according to an embodiment.
FIG. 13 is an exemplary diagram illustrating video frame excitation based on text and audio, according to an embodiment.
FIGS. 14 and 15 are exemplary diagrams illustrating the absolute completeness of information in a video frame, according to an embodiment.
FIG. 16A is an exemplary diagram illustrating video frame excitation, according to an embodiment.
FIG. 16B is an exemplary diagram illustrating audio frame excitation, according to an embodiment.
FIG. 16C is an exemplary diagram illustrating text frame excitation, according to an embodiment.
FIG. 17 is an exemplary diagram of sequencing video frames using an input context information distribution based on excitation matching in the frames, according to an embodiment.
FIG. 18 is an exemplary diagram of sequencing video frames based on excitation matching in the frames, in connection with FIG. 17, according to an embodiment.
FIG. 19 is an exemplary diagram illustrating relative frame excitation, according to an embodiment.
FIG. 20 is an exemplary diagram of summarizing a single video frame based on relative frame excitation, according to an embodiment.
FIG. 21 is an exemplary diagram in which an electronic device captures the viewpoints of multiple subjects, according to an embodiment.
FIG. 22 is an exemplary diagram in which an electronic device captures a narrative viewpoint, according to an embodiment.
FIG. 23 is an exemplary diagram in which an electronic device captures highlights from a viewpoint, according to an embodiment.
FIG. 24 is an exemplary diagram in which an electronic device captures progressive excitation, according to an embodiment.
FIGS. 25 and 26 are exemplary diagrams in which an electronic device generates a plurality of video summaries by obtaining context information of multiple viewpoints, according to an embodiment.
FIG. 27 is an exemplary diagram in which an electronic device generates a plurality of video summaries by obtaining context information of multiple viewpoints using a bokeh effect, according to an embodiment.
FIG. 28 is an exemplary diagram in which an electronic device generates a plurality of video summaries by obtaining context information of multiple viewpoints using an excitation distribution in order, according to an embodiment.

The embodiments herein and their various features and advantageous details are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as not to obscure the embodiments herein unnecessarily. The various embodiments described herein are not necessarily mutually exclusive, since some embodiments can be combined with one or more other embodiments to form new embodiments. The term "or" as used herein is non-exclusive unless otherwise indicated. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein can be practiced and to enable those skilled in the art to practice them.

As is conventional in the field, embodiments may be described and illustrated in terms of blocks that carry out a described function or functions. These blocks, which may be referred to herein as units, modules, or the like, are physically implemented by analog or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, and hardwired circuits, and may optionally be driven by firmware. The circuits may, for example, be embodied in one or more semiconductor chips or on substrate supports such as printed circuit boards. The circuits constituting a block may be implemented by dedicated hardware, by a processor (for example, one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware that performs some functions of the block and a processor that performs other functions of the block. Each block of the embodiments may be physically separated into two or more interacting and discrete blocks without departing from the scope of the present invention. Likewise, the blocks of the embodiments may be physically combined into more complex blocks without departing from the scope of the present invention.

The accompanying drawings are used to help the reader understand various technical features easily, and it should be understood that the embodiments presented herein are not limited by the accompanying drawings. Accordingly, the present invention should be construed to extend to any alterations, equivalents, and substitutes in addition to those particularly set out in the accompanying drawings. Terms such as "first" and "second" may be used herein to describe various elements, but these elements should not be limited by these terms. The terms are generally used only to distinguish one element from another.

Accordingly, according to an embodiment, a method of generating a video summary by an electronic device is provided. The method may include receiving, by the electronic device, a video including a plurality of frames. The method may also include determining, by the electronic device, at least one viewpoint of a user viewing the video. The viewpoint may include at least one of a subjective viewpoint of the user, an objective viewpoint of the user, and a physical viewpoint of the user. The method may also include determining, by the electronic device, whether at least one region of interest (ROI) of the user is available in the video based on the at least one viewpoint of the user. The method may also include identifying, by the electronic device, a set of frames that include the at least one ROI from the plurality of frames, in response to determining that the at least one ROI is available in the video. The method may also include generating, by the electronic device, a video summary based on the identified set of frames. The method may also include storing, by the electronic device, the video summary.

The method may be used to summarize a video from the user's viewpoint. The method may also be used to detect a change in the user's viewpoint, so that video summaries can be generated dynamically and cost-effectively. The method may summarize a video from the user's viewpoint based on key frame sections. A key frame section may be determined by camera settings, depth context information of an object, subject context information, similar and dissimilar frames, and excitation parameters. The method may use reinforcement learning to capture the user's viewpoint in order to generate the video summary. To understand an initial bias, the method may capture environmental inputs, the user's thoughts, and positive or negative ratings or reviews, and use them to generate the video summary.

Current deep learning systems such as LSTM/GRU are limited in capturing and generating a summary from a single video, whereas viewpoints can support probabilistic decisions that yield multiple video summaries for the same video. Reinforcement learning is one way to address the limitations of current methods; it may be based on a deep generative network and may be further extended to capture viewpoints dynamically.

In the method, context information input may strengthen the reinforcement learning model used to generate the video summary. The reinforcement learning model may observe the environment by assigning new weights to frames based on their excitation levels. As a result, the video summary can be generated efficiently.

Referring now to the drawings, and in particular to FIGS. 1 to 28, preferred embodiments are shown.

FIG. 1 illustrates various hardware components of an electronic device 100 that generates a video summary, according to an embodiment. The electronic device 100 may be, for example, but is not limited to, a mobile phone, a smartphone, a personal digital assistant (PDA), a tablet computer, a laptop computer, an Internet of Things (IoT) device, an immersive system, a virtual reality device, a foldable device, or a flexible device. The electronic device 100 may include a processor 110, a communicator 120, a memory 130, an encoder 140, a decoder 150, a display 160, a camera 170, and a viewpoint-based video summary generation controller 180. The processor 110 may be connected to the communicator 120, the memory 130, the encoder 140, the decoder 150, the display 160, the camera 170, and the viewpoint-based video summary generation controller 180.

The camera 170 may capture a video including a plurality of frames. The captured video may be transmitted to the viewpoint-based video summary generation controller 180 via the encoder 140 and the decoder 150. The encoder 140 and the decoder 150 may normalize the video in a latent vector space using existing techniques. Normalizing parameters such as speed, intensity, frequency, and duration in a multidimensional space allows the viewpoint and the video to be evaluated on the same scale. So that the reinforcement learning model can interpret them, such parameters or variables may be scaled down, for example with a min-max scaler that normalizes the data to values between 0 and 1.
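
As a non-limiting illustration, the min-max scaling mentioned above might be sketched as follows. The function names, the dictionary layout of a frame, and the handling of a constant-valued parameter are assumptions made for this example and are not taken from the disclosure.

```python
from typing import Dict, List

# The four excitation parameters named in the description.
EXCITATION_KEYS = ("speed", "intensity", "frequency", "duration")


def min_max_scale(values: List[float]) -> List[float]:
    """Scale values linearly into [0, 1].

    Assumption: a list whose values are all equal maps to all zeros,
    since the range is zero and no ordering information exists.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_frames(frames: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Normalize each excitation parameter independently across all frames."""
    if not frames:
        return []
    scaled = {k: min_max_scale([f[k] for f in frames]) for k in EXCITATION_KEYS}
    return [{k: scaled[k][i] for k in EXCITATION_KEYS} for i in range(len(frames))]
```

Scaling each parameter separately keeps a parameter with a large raw range, such as duration, from dominating a parameter with a small range when the values are later combined.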

In addition, the viewpoint-based video summary generation controller 180 may receive the video. After receiving the video, the viewpoint-based video summary generation controller 180 may be configured to determine a viewpoint of the user viewing the video. The viewpoint may be, for example, but is not limited to, a subjective viewpoint of the user, an objective viewpoint of the user, or a physical viewpoint of the user.

The subjective viewpoint of the user may be determined by obtaining a plurality of subjective parameters associated with the user, determining subjective context information of the user based on the plurality of subjective parameters, and determining the subjective viewpoint of the user based on the subjective context information. The plurality of subjective parameters may include, for example, but are not limited to, the user's occupation, the user's age, the user's preferences, events associated with the user, and the user's activity on a social networking site. The user's activity on a social networking site may include, for example, likes, dislikes, and photo sharing on the social networking site. As an example, the electronic device 100 may generate a video summary based on the subjective viewpoint of the user, as shown in FIG. 5. For example, the user of the electronic device 100 may capture a video of a team meeting and may want the video summarized around the CEO's speech at the meeting. Based on this input, the electronic device 100 may generate a video based on the CEO's speech at the team meeting. As another example, in the case of a train accident, a railway authority, a police department, and medical personnel may each have a different viewpoint, and a video summary may be generated according to each of the different viewpoints.

The objective viewpoint of the user may be determined by obtaining a plurality of objective parameters associated with the user, determining objective context information of the user based on the plurality of objective parameters, and determining the objective viewpoint of the user based on the objective context information. The plurality of objective parameters may include, for example, but are not limited to, a past history of the user, a current goal of the user, and an additional goal of the user. A goal of the user may include the user's purpose or motive. The past history of the user may include past events of the user. As an example, the electronic device 100 may generate a video summary based on the objective viewpoint of the user, as shown in FIG. 3. In FIG. 3, the user of the electronic device 100 may capture videos of a newborn baby and may summarize the videos of the newborn baby. Based on the user's input, the user of the electronic device 100 may generate a video of the newborn baby covering the period from 1 month to 12 months.

The physical viewpoint of the user may be determined by obtaining a plurality of physical parameters, determining physical context information of the user based on the plurality of physical parameters associated with the user, and determining the physical viewpoint of the user based on the physical context information. The plurality of physical parameters may include, for example, but are not limited to, an angle of the camera 170, a location of the user, ambient light conditions around the user, weather conditions around the user, and privacy preferences of the user. As shown in FIG. 4, the electronic device 100 may generate a video summary of a soccer match based on the physical viewpoint of the user.

Based on the viewpoint of the user, the viewpoint-based video summary generation controller 180 may be configured to determine whether an ROI of the user is available in the video. After determining that the ROI is available in the video, the viewpoint-based video summary generation controller 180 may be configured to identify, from the plurality of frames, a set of frames including the ROI.
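
As a non-limiting illustration, the two-stage check described above (first whether the ROI is available anywhere in the video, then which frames contain it) might be sketched as follows. Representing each frame by a set of detected labels, and the ROI by a set of labels derived from the viewpoint, is an assumption of this example; the disclosure does not specify how an ROI is represented or detected.

```python
from typing import List, Set


def roi_available(frame_labels: List[Set[str]], roi_labels: Set[str]) -> bool:
    """Return True if at least one frame contains any of the ROI labels."""
    return any(labels & roi_labels for labels in frame_labels)


def identify_frame_set(frame_labels: List[Set[str]], roi_labels: Set[str]) -> List[int]:
    """Return the indices of frames that contain the ROI.

    The frame set is identified only in response to determining that the
    ROI is available; otherwise an empty list is returned.
    """
    if not roi_available(frame_labels, roi_labels):
        return []
    return [i for i, labels in enumerate(frame_labels) if labels & roi_labels]
```

For the team meeting example, an ROI of `{"ceo"}` applied to frames labeled `{"crowd"}`, `{"ceo", "stage"}`, and `{"ceo"}` would select the second and third frames.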

According to an embodiment, the viewpoint-based video summary generation controller 180 may be configured to determine an excitation level of each frame of the plurality of frames of the video based on a plurality of excitation parameters associated with that frame. The plurality of excitation parameters may include, but are not limited to, a speed of the ROI in each frame, an intensity of the ROI in each frame, a frequency of appearance of the ROI in each frame, and a playback duration of each frame. In addition, the viewpoint-based video summary generation controller 180 may be configured to extract an audio parameter and a text parameter of each frame, and to determine relative context information of each frame of the plurality of frames based on the excitation level, the audio parameter, and the text parameter of that frame. In addition, the viewpoint-based video summary generation controller 180 may be configured to identify, from the plurality of frames, the set of frames including the ROI based on the relative context information of each frame.

According to an embodiment, to identify the set of frames from the plurality of frames, the viewpoint-based video summary generation controller 180 may be configured to determine an absolute completeness score of the video, determine absolute frame excitation information of the video based on the absolute completeness score, detect joint reference information of the video based on the absolute frame excitation information, determine a sequence excitation level of the video based on the joint reference information, and identify the set of frames from the plurality of frames based on the sequence excitation level of the video. Examples relating to the absolute completeness score, the absolute frame excitation information, the joint reference information, and the sequence excitation level are described with reference to FIGS. 9 and 10.

The absolute frame excitation information of the video may be determined by obtaining the speed of the ROI in each frame, the intensity of the ROI in each frame, the frequency of appearance of the ROI in each frame, and the playback duration of each frame, and determining the absolute frame excitation information of the video based on the obtained speed, intensity, frequency of appearance, and playback duration. As an example, when capturing motion in a scene, the speed may include an absolute speed or a relative speed between subjects. As an example, when capturing the involvement of a subject or an object, the intensity may include emotion, an image heat map, sound intensity, a color heat map, color change, background, and animation. As an example, the frequency may include appearances of a subject, repetitions, and similar or dissimilar events of the subject. As an example, the duration may include a frame duration for capturing a sequence.

The absolute completeness score may be determined by obtaining absolute frame information associated with the video, obtaining a completeness threshold associated with the video, and comparing the obtained absolute frame information with the obtained completeness threshold.
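
As a non-limiting illustration, the comparison described above might be sketched as follows. The disclosure states only that absolute frame information is compared with a completeness threshold; expressing the result as the fraction of frames that meet the threshold, and representing the absolute frame information as one number per frame, are assumptions of this example.

```python
from typing import List


def absolute_completeness_score(frame_info: List[float], threshold: float) -> float:
    """Fraction of frames whose absolute frame information meets the threshold.

    Hypothetical scoring rule: 1.0 means every frame reaches the
    completeness threshold, 0.0 means none does (or the video is empty).
    """
    if not frame_info:
        return 0.0
    meeting = sum(1 for value in frame_info if value >= threshold)
    return meeting / len(frame_info)
```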

The absolute frame excitation information may be configured to drive a relative frame excitation associated with the set of frames in order to sequence the set of frames. The absolute frame excitation information may be captured independently and matched with a segment of a reference excitation in the context information. The relative frame excitation may be defined as the excitation coverage per frame in sequence order. A reference frame excitation may be provided as input, and the frame sequence may be adjusted accordingly to reach the reference excitation level.

The joint reference information may be configured to maintain the sequence excitation level associated with the set of frames. The joint reference information may be determined by obtaining a scene, including audio usage, associated with the set of frames and a semantic similarity associated with the set of frames, and determining the joint reference information based on the obtained scene and the obtained semantic similarity. The sequence excitation level may be configured to map similarities associated with the set of frames.

The viewpoint-based video summary generation controller 180 may be configured to generate the video summary based on the identified set of frames. According to an embodiment, the viewpoint-based video summary generation controller 180 may be configured to determine a weight for each frame of the identified set of frames based on the viewpoint of the user and the ROI. The weight for each frame may be determined by obtaining a relationship parameter between the viewpoint of the user and each of the identified frames, together with a perspective angle of each of the identified frames, and determining the weight for the identified frame based on the obtained relationship parameter. The relationship parameter may identify an angle of the video based on the viewpoint of the user and the perspective of the scene in the identified frame. In addition, the viewpoint-based video summary generation controller 180 may be configured to sequence the frames of the identified set based on the weight determined for each frame, and to merge the sequenced set of frames to generate the video summary.
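
As a non-limiting illustration, the weighting, sequencing, and merging described above might be sketched as follows. The `Frame` class, the use of an angular difference as the relationship parameter, the linear weight formula, and the ordering by descending weight are all assumptions of this example; the disclosure does not fix any of them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    index: int                # position of the frame in the source video
    perspective_angle: float  # perspective angle of the scene, in degrees


def frame_weight(view_angle: float, frame: Frame) -> float:
    """Hypothetical relationship parameter turned into a weight in [0, 1].

    The weight is 1.0 when the frame's perspective angle equals the user's
    viewing angle and falls linearly to 0.0 at a 180 degree difference.
    """
    diff = abs(view_angle - frame.perspective_angle) % 360.0
    if diff > 180.0:
        diff = 360.0 - diff  # take the shorter way around the circle
    return 1.0 - diff / 180.0


def summarize(view_angle: float, frames: List[Frame]) -> List[int]:
    """Sequence frames by descending weight and merge them into one list.

    Ties are broken by the original frame index so the result is stable.
    """
    ordered = sorted(frames, key=lambda f: (-frame_weight(view_angle, f), f.index))
    return [f.index for f in ordered]
```

An implementation that must preserve chronological order could instead filter frames by weight and keep them in index order; the description leaves this choice open.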

The viewpoint-based video summary generation controller 180 may be configured to store the video summary in the memory 130 and to display the video summary on the display 160. The display 160 may include, for example, but is not limited to, a liquid crystal display (LCD) and a light emitting diode (LED) display. The display 160 may be implemented using any other touch detection technology. The display 160 may provide modes for generating the video summary. The modes may include, for example, but are not limited to, a manual mode, a semi-automatic mode, and a fully automatic mode. The electronic device 100 may generate the video summary according to the selected mode. The manual mode may operate according to the user's input. The semi-automatic mode may operate based on one or more interests of the user, the user's search history, past camera usage, and currently ongoing context information. The fully automatic mode may operate based on scene analysis and environment analysis by the electronic device 100.

The viewpoint-based video summary generation controller 180 may be physically implemented by analog or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, and hardwired circuits, and may optionally be driven by firmware. The viewpoint-based video summary generation controller 180 may be implemented in one or more semiconductor chips or on a substrate support such as a printed circuit board. The circuits constituting a block may be implemented by dedicated hardware, by a processor (for example, one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware that performs some functions of the block and a processor that performs other functions of the block.

In addition, the memory 130 may store instructions to be executed by the processor 110. The memory 130 may include non-volatile storage elements. Examples of such non-volatile storage elements may include magnetic hard disks, optical disks, floppy disks, flash memory, and forms of electrically programmable memories (EPROM) or electrically erasable and programmable memories (EEPROM). In addition, the memory 130 may, in some examples, be considered a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that the memory 130 is non-movable. In some examples, the memory 130 may be configured to store larger amounts of information. In certain examples, a non-transitory storage medium may store data that can change over time (for example, in random access memory (RAM) or a cache).

The processor 110 may be configured to execute the instructions stored in the memory 130 and to perform various processes. The processor 110 may include one or more processors. The one or more processors may be a general-purpose processor such as a central processing unit (CPU) or an application processor (AP), a graphics-only processing unit such as a graphics processing unit (GPU) or a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The processor 110 may include multiple cores and may be configured to execute the instructions stored in the memory 130.

The communicator 120 may be configured to communicate internally between internal hardware components and with external devices via one or more networks. For example, the communicator 120 may include, but is not limited to, a Bluetooth communicator, a Wi-Fi (wireless fidelity) module, and a Li-Fi module.

As an example, the electronic device 100 may summarize an entire soccer match from different viewpoints, as shown in FIGS. 6A to 6C. The electronic device 100 may also summarize a cake cutting ceremony from a plurality of video frames, as shown in FIGS. 7A to 7C.

Although FIG. 1 illustrates various hardware components of the electronic device 100, it should be understood that other embodiments are not limited thereto. According to another embodiment, the electronic device 100 may include fewer components. In addition, the labels or names of the components are used for illustrative purposes only and do not limit the scope of the invention. One or more components may be combined to perform the same or a substantially similar function to generate the video summary.

FIG. 2 is a flowchart 200 illustrating a method of generating a video summary, according to an embodiment. Operations 202 to 226 may be performed by the viewpoint-based video summary generation controller 180.

The method may include receiving a video including a plurality of frames (202). The method may include obtaining a plurality of subjective parameters associated with the user (204a). The method may include determining subjective context information of the user based on the plurality of subjective parameters associated with the user (206a). The method may include determining the subjective viewpoint of the user based on the subjective context information of the user (208a).

The method may include obtaining a plurality of objective parameters associated with the user (204b). The method may include determining objective context information of the user based on the plurality of objective parameters associated with the user (206b). The method may include determining the objective viewpoint of the user based on the objective context information of the user (208b).

The method may include obtaining a plurality of physical parameters (204c). The method may include determining physical context information of the user based on the plurality of physical parameters associated with the user (206c). The method may include determining the physical viewpoint of the user based on the physical context information of the user (208c).

The method may include determining whether an ROI of the user is available in the video based on the viewpoint of the user (210). The method may include determining an excitation level of each frame of the plurality of frames of the video based on a plurality of excitation parameters associated with that frame (212). The method may include extracting an audio parameter and a text parameter of each frame (214). The method may include determining relative context information of each frame of the plurality of frames based on the excitation level, the audio parameter, and the text parameter of that frame (216).

The method may include identifying, from the plurality of frames, a set of frames including the ROI based on the relative context information of each frame (218). The method may include obtaining a relationship parameter between the viewpoint of the user and each of the identified frames, together with a perspective angle of each of the identified frames (220). The method may include determining a weight for each identified frame based on the obtained relationship parameter (222). The method may include sequencing the frames of the identified set based on the weight determined for each frame (224). The method may include merging the sequenced set of frames to generate the video summary (226).

In the method, reinforcement learning may be used to capture the viewpoint of the user in order to generate the video summary. The method may capture environmental inputs, the user's thoughts, and positive or negative ratings or reviews, and use them to generate the video summary.

Current deep learning systems such as LSTM/GRU are limited in capturing and generating a summary from a single video, whereas viewpoints can support probabilistic decisions that yield multiple video summaries for the same video. Reinforcement learning is only one way to address the limitations of current methods; it may be built on a deep network and may be further extended to capture viewpoints dynamically.

In the method, contextual input may strengthen the reinforcement learning model used to generate the video summary. The reinforcement learning model may observe the environment and assign new weights to frames according to their excitation levels. As a result, the video summary can be generated efficiently.

The various operations, acts, blocks, and steps of the flowchart 200 may be performed in the order presented, in a different order, or simultaneously. In addition, in some embodiments, some of the operations, acts, blocks, and steps may be omitted, added, modified, or skipped without departing from the scope of the invention.

FIG. 8A is an example diagram 800a illustrating key frame selection, according to an embodiment. As an example, four sets of frames are available in a video, and the electronic device 100 may identify a viewpoint of the video. Key frames for generating the video may be selected based on the viewpoint. FIG. 8B shows an example graph 800b in which an excitation score and a completeness score are calculated based on the selected key frames. Referring to FIG. 8B, the absolute frame score may be calculated as a weighted average that takes speed, frequency, intensity, and duration into account. The completeness score may be a distribution parameter such as a mean or a variance. The completeness score may be compared with the completeness score of the next frame to determine whether the frame fits the following sequence. The completeness score may also specify how much deviation from the viewpoint or from frame-completeness quality is acceptable, for example where a video frame is complete but some portion has been cropped to match the viewpoint excitation. Frame completeness can be understood in the literal sense: when one frame must delegate information to the next frame in order to align the frames, it indicates how much meaning is retained in the resulting frame.

A key frame may be selected based on an excitation evaluation of the absolute frame and a relative evaluation against the provided context information.

Excitation evaluation of the absolute frame: based on the video, the excitation may be calculated using four parameters: speed, intensity, frequency, and duration. The weights may be adjusted by modifying the excitation parameters so as to (a) adjust the required threshold according to the context information or (b) obtain completeness in the qualitative information of the frame. This supports pre-adjustment or post-adjustment of frames according to selection criteria, context matching, or thresholds.

Relative evaluation against the provided context information (text or audio): just as the excitation parameters are evaluated for the video frames, audio and text parameters may be mapped into the latent space in which the video frames reside. This helps in understanding the relative context information in a generalized way. Frame selection based on the context information may then be derived. In addition, to reach or satisfy an expected excitation level, the electronic device 100 may change the color, background, or foreground of a frame, remove an object from the frame, or replace an object in the frame.

FIG. 9 is an example diagram 900 illustrating frame sequencing, according to an embodiment. In FIG. 9, weak co-reference frames and strong co-reference frames are identified, and the sequence of frames may be changed based on the excitation co-reference with the text, audio, and context information.

FIG. 10 is an example diagram 1000 illustrating adjusted frame sequencing based on reference excitation frames, according to an embodiment. The adjusted frame sequence may be determined by receiving a reference excitation and the excitation of the currently selected frames. The excitation of a frame may be changed using the weights. An adjusted frame sequence may be generated according to the change in excitation. Frames of a specific or variable length may produce (a) an overall excitation level that matches the reference excitation or (b) a window-wise excitation level. For excitation matching, (a) frame characteristics (e.g., speed, intensity, frequency, or duration) may be adjusted using weight control, (b) the sequence may be changed if a matching frame exists for the selected sequence, or (c) a frame may be replaced with a new frame from the selection process.
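As a non-limiting illustration, the difference between an overall excitation level and a window-wise excitation level might be sketched as follows. The use of the arithmetic mean as the level, the non-overlapping windows, and the `tolerance` parameter are assumptions made for this sketch; the description does not define how a level is computed or when it is deemed to match the reference excitation.

```python
from statistics import fmean


def overall_level(excitations):
    """Overall excitation level of a frame sequence (assumed: arithmetic mean)."""
    return fmean(excitations)


def window_levels(excitations, window):
    """Window-wise excitation levels over consecutive, non-overlapping windows."""
    return [
        fmean(excitations[i:i + window])
        for i in range(0, len(excitations), window)
    ]


def matches(level, reference, tolerance=0.5):
    """Hypothetical matching rule: the level lies within a tolerance of the reference."""
    return abs(level - reference) <= tolerance


excitations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
reference = 3.5

overall_ok = matches(overall_level(excitations), reference)
window_ok = [matches(level, reference) for level in window_levels(excitations, 2)]
```

In this example the sequence matches the reference as a whole, while only the middle window matches it window-wise, which is the situation in which frame characteristics would be adjusted or frames replaced.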

FIGS. 11 and 12 are example diagrams 1100 and 1200 in which video summaries are generated based on multiple viewpoints, according to an embodiment.

As an example, the excitation of a user of the electronic device 100, captured with sensor data during a movie show, may be taken as an input for generating video summaries for movies of a similar genre (action 1 or action 2) or of dissimilar genres (action and romance). One video summary may be generated based on the movies of a similar genre, and a different video summary may be generated based on the movies of dissimilar genres.

As an example, two users of the electronic device 100 may rate a movie from different perspectives (i.e., one user may rate the movie based on its fight scenes, while another user may rate it based on its comedy scenes). Based on the different perspectives, two different video summaries may be generated.

The viewpoint context information may be captured in the latent space in the form of vectors, for example as a sequence distribution or an overall distribution described by a mean, a variance, and the like. The vectors in the latent space may be regarded as reference excitation parameters for the context information. Multiple videos may be generated when the context information changes; even a slight change in the context information may produce multiple videos. In another example, a summary of a soccer match may be generated based on multiple viewpoints, as shown in FIG. 12. In a soccer match, the viewpoints may include, but are not limited to, a goalpost-based viewpoint, a player-based viewpoint, a manager-based viewpoint, a midfielder-based viewpoint, a defender-based viewpoint, a striker-based viewpoint, a penalty viewpoint, and a jersey-number viewpoint.

FIG. 13 is an example diagram 1300 illustrating video frame excitation based on text and audio, according to an embodiment. The video frame excitation may be determined from speech excitation. The speech excitation may be determined by identifying words that are close to one another in word-vector space. The speech excitation may also be determined by identifying the sentence structure, the part-of-speech usage in the sentence structure, semantic similarities in the sentence structure, and cross-reference dependencies in the sentence structure. Similarly, audio excitation may be determined by identifying acoustic characteristics (e.g., speed, frequency, and the like) of the audio and music inputs.
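As a non-limiting illustration, the closeness of words in word-vector space might be measured with cosine similarity, as sketched below. Cosine similarity is a common choice and is an assumption here, since the description does not name a distance measure; the three-dimensional vectors are toy values invented for the sketch rather than the output of a trained embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Toy word vectors (hypothetical values for illustration only).
vectors = {
    "goal": [0.9, 0.1, 0.0],
    "score": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.2, 0.9],
}

close = cosine_similarity(vectors["goal"], vectors["score"])
far = cosine_similarity(vectors["goal"], vectors["weather"])
```

Here `close` is larger than `far`, so "goal" and "score" would be treated as words that lie close together when the speech excitation is evaluated.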

FIGS. 14 and 15 are example diagrams 1400 and 1500 illustrating the absolute completeness of information in a video frame, according to an embodiment.

The electronic device 100 may derive the completeness of a video frame using an absolute excitation threshold and weights. The absolute excitation threshold is used to assess completeness, and the frame sequence may be filled with frames whose excitation matches the threshold. The weights are adjusted dynamically so that the absolute video frame excitation is satisfied. Equation 1 may be used to compute the absolute excitation.

[Equation 1]

Absolute excitation = w1 * speed + w2 * intensity + w3 * frequency + w4 * duration

(where w1, w2, w3, and w4 are weights adjusted to reach the absolute excitation threshold).
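As a non-limiting illustration, Equation 1 might be evaluated as sketched below. The function names and the numeric values are hypothetical, and the uniform rescaling in `scale_weights_to_threshold` is only one assumed way of adjusting w1 to w4 to reach the absolute excitation threshold; the description does not specify the adjustment rule.

```python
def absolute_excitation(speed, intensity, frequency, duration, weights):
    """Equation 1: weighted sum of the four excitation parameters."""
    w1, w2, w3, w4 = weights
    return w1 * speed + w2 * intensity + w3 * frequency + w4 * duration


def scale_weights_to_threshold(params, weights, threshold):
    """Uniformly rescale the weights so that Equation 1 equals the threshold.

    Assumed adjustment rule, used for illustration only.
    """
    current = absolute_excitation(*params, weights)
    if current == 0:
        raise ValueError("cannot rescale weights when the excitation is zero")
    factor = threshold / current
    return tuple(w * factor for w in weights)


params = (1.0, 2.0, 3.0, 4.0)   # speed, intensity, frequency, duration
weights = (0.4, 0.3, 0.2, 0.1)  # w1, w2, w3, w4

score = absolute_excitation(*params, weights)
adjusted = scale_weights_to_threshold(params, weights, threshold=3.0)
```

With these values `score` is approximately 2.0, and applying the `adjusted` weights to the same parameters yields approximately the threshold of 3.0.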

FIG. 16A is an example diagram 1600a illustrating video frame excitation, according to an embodiment. The video frame excitation may be used to match the visual excitation of the video frames.

FIG. 16B is an example diagram 1600b illustrating audio frame excitation, according to an embodiment. For audio excitation, a similar excitation match may be obtained for the audio of the video frame context in order to support the frame sequence. Acoustic characteristics such as speed, rhythm, pitch, beat, and time, together with weights applied to the acoustic parameters for the overall excitation, may be used in combination.

FIG. 16C is an example diagram 1600c illustrating text frame excitation, according to an embodiment. For text excitation, characteristics that match the video excitation may be calculated, such as the speed at which the text can be spoken, word intensity, part-of-speech frequency, the time needed to speak a text sentence, and similarity weights for dynamic adjustment.

FIG. 17 is an example diagram 1700 of sequencing video frames using an input context distribution, based on excitation matching in the frames, according to an embodiment. The electronic device 100 may reselect frames in sequence for excitation matching and may adjust the weight parameters used to arrange the frames in sequence. Specifically, the electronic device 100 may change the weights at different frame positions for completeness and sequencing of the video.

FIG. 18 is an example diagram 1800 of sequencing video frames based on excitation matching in the frames, in connection with FIG. 17, according to an embodiment. In FIG. 18, strong and weak similarities may be identified, and the electronic device 100 may determine that frames are co-referenced according to the excitation sequence.
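As a non-limiting illustration, the identification of strong and weak similarities between frames might be sketched as a threshold test on pairwise similarity scores. The threshold values, the function names, and the frame labels below are hypothetical; the description does not state how strong and weak similarities are distinguished.

```python
def classify_similarity(similarity, strong=0.75, weak=0.4):
    """Label a similarity score as 'strong', 'weak', or 'none' (assumed thresholds)."""
    if similarity >= strong:
        return "strong"
    if similarity >= weak:
        return "weak"
    return "none"


def coreferenced_pairs(similarities):
    """Keep only the frame pairs that have a strong or weak similarity."""
    labels = {pair: classify_similarity(s) for pair, s in similarities.items()}
    return {pair: label for pair, label in labels.items() if label != "none"}


# Hypothetical pairwise similarity scores between frames A, B, C, and D.
similarities = {("A", "B"): 0.9, ("A", "C"): 0.5, ("B", "D"): 0.1}
pairs = coreferenced_pairs(similarities)
print(pairs)  # {('A', 'B'): 'strong', ('A', 'C'): 'weak'}
```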

FIG. 19 is an example diagram 1900 illustrating relative frame excitation, according to an embodiment. FIG. 19 shows four frames: warmth, a mask, protection from pollution or disease, and industrial pollution. Here, a strong similarity may be that industrial waste is the source of many diseases, and a weak similarity may be that global warming and pollution spread disease.

FIG. 20 is an example diagram 2000 of summarizing a single video frame based on relative frame excitation, according to an embodiment. The electronic device 100 may receive excitation distribution parameters. The excitation distribution parameters may include, but are not limited to, a mean, a median, a variance, a kurtosis, and a skewness. The electronic device 100 may summarize the frames based on the excitation distribution parameters. At ③, the electronic device 100 may determine the absolute frame excitation. At ④, the electronic device 100 may determine the relative frame excitation and the strong/weak excitation criteria based on the absolute frame excitation. At ⑤, the electronic device 100 may determine the co-reference frame excitation based on the determined relative frame excitation and the strong/weak excitation criteria. Based on the co-reference frame excitation, the electronic device 100 may generate multiple summarized video frames and may generate a single summarized video frame based on the multiple summarized video frames.
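As a non-limiting illustration, the excitation distribution parameters listed above might be computed from per-frame excitation levels as sketched below. The population (rather than sample) moments, the non-excess kurtosis, and the convention of returning zero skewness and kurtosis for a constant sequence are assumptions made for this sketch.

```python
from statistics import fmean, median, pvariance


def excitation_distribution(levels):
    """Mean, median, variance, skewness, and kurtosis of excitation levels."""
    mean = fmean(levels)
    var = pvariance(levels, mu=mean)
    if var == 0:
        # Assumed convention: a constant sequence has no skew or kurtosis.
        skew = kurt = 0.0
    else:
        n = len(levels)
        m3 = sum((x - mean) ** 3 for x in levels) / n  # third central moment
        m4 = sum((x - mean) ** 4 for x in levels) / n  # fourth central moment
        skew = m3 / var ** 1.5
        kurt = m4 / var ** 2
    return {
        "mean": mean,
        "median": median(levels),
        "variance": var,
        "skewness": skew,
        "kurtosis": kurt,
    }


stats = excitation_distribution([1.0, 2.0, 3.0, 4.0, 5.0])
```

For the evenly spaced sample levels, the mean and median are both 3.0, the variance is 2.0, the skewness is 0.0, and the kurtosis is 1.7.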

FIG. 21 is an example diagram 2100 in which the electronic device 100 captures the viewpoints of multiple subjects, according to an embodiment. In FIG. 21, a custom characteristic may be defined as a viewpoint. A video summary may be generated based on the driver's viewpoint. A video summary may be generated based on an eyewitness's viewpoint. A video summary may be generated based on a spectator's viewpoint. A video summary may be generated based on a traffic police officer's viewpoint.

FIG. 22 is an example diagram 2200 in which the electronic device 100 captures a narrative viewpoint, according to an embodiment. The user of the electronic device 100 may provide the narrative input "The source of global warming is industrial expansion." Based on the method, the electronic device 100 may generate, from the source video, a video summary related to "The source of global warming is industrial expansion."

FIG. 23 is an example diagram 2300 in which the electronic device 100 captures highlights from a viewpoint, according to an embodiment. Based on the method, the electronic device 100 may generate a video summary corresponding to the reopening of schools after closure. The video summary may include many frames showing student reactions, safety measures, political involvement, and parent reactions.

FIG. 24 is an example diagram 2400 in which the electronic device 100 captures progressive excitation, according to an embodiment. The video summary may include multiple frames related to talking, walking, arguing, fighting, and driving. The multiple frames may be arranged based on the method.

FIGS. 25 and 26 are example diagrams in which the electronic device 100 generates a plurality of video summaries by obtaining context information for multiple viewpoints, according to an embodiment.

Based on the video frames, the excitation of a video frame may be calculated using four principal parameters: speed, intensity, frequency, and duration. The weights may be adjusted by modifying the excitation parameters so as to (a) adjust the required threshold according to the context information or (b) obtain completeness in the qualitative information of the frame. This supports pre-adjustment or post-adjustment of frames according to selection criteria, context matching, or thresholds.

Relative evaluation against the provided context information (text or audio): just as the excitation parameters are evaluated for the video frames, audio and text parameters may be mapped into the latent space in which the video frames reside. This helps in understanding the relative context information in a generalized way. Frame selection based on the context information may then be derived.

In addition, to reach or satisfy an expected excitation level, the electronic device 100 may change the color, background, or foreground of a frame, remove an object from the frame, or replace an object in the frame.

FIG. 27 is an example diagram 2700 in which the electronic device 100 generates a plurality of video summaries by obtaining context information for multiple viewpoints using a bokeh effect, according to an embodiment. FIG. 28 is an example diagram 2800 in which the electronic device 100 generates a plurality of video summaries by obtaining context information for multiple viewpoints using the excitation distribution in sequence, according to an embodiment.

Apart from the frames discussed in the various examples above, the weight-based excitation parameters may be adjusted for the selected frames by creating effects that compensate for the viewpoint threshold. Further completeness of the frames is balanced through dynamic weight adjustment. Visual adjustments include blur, bokeh, boomerang, slow motion, background change, clothing color change, clothing replacement, segmentation or other image processing techniques, and the application of AR filters. Acoustic adjustments include volume, repetition of dialogue in a frame, and the addition of music effects. Video frame settings such as zoom, camera angle, depth, shutter, ISO, and aperture control may also be adjusted.

The embodiments disclosed herein may be implemented using network management functions running on at least one hardware device.

The foregoing description of the specific embodiments so fully reveals the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt such specific embodiments for various applications without departing from the generic concept; therefore, such adaptations and modifications should be, and are intended to be, comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the scope of the embodiments described herein.

Claims

A method for generating a video summary by an electronic device (100), the method comprising:
receiving, by the electronic device (100), a video including a plurality of frames;
determining, by the electronic device (100), at least one viewpoint of a user viewing the video;
determining, by the electronic device (100), whether at least one region of interest (ROI) of the user is available in the video based on the at least one viewpoint of the user;
identifying, by the electronic device (100), in response to determining that the at least one ROI is available in the video, a set of frames including the at least one ROI from the plurality of frames;
generating, by the electronic device (100), the video summary based on the identified set of frames; and
displaying, by the electronic device (100), the video summary.

The method of claim 1, wherein the at least one viewpoint comprises a subjective viewpoint of the user, and
wherein the method further comprises:
obtaining, by the electronic device (100), a plurality of subjective parameters related to the user, wherein the plurality of subjective parameters include at least one of an occupation of the user, an age of the user, a preference of the user, an event related to the user, and an activity of the user on a social networking site;
determining, by the electronic device (100), subjective context information of the user based on the plurality of subjective parameters related to the user; and
determining, by the electronic device (100), the subjective viewpoint of the user based on the subjective context information of the user.

The method of claim 1, wherein the at least one viewpoint comprises an objective viewpoint of the user, and
wherein the method further comprises:
obtaining, by the electronic device (100), a plurality of objective parameters related to the user, wherein the plurality of objective parameters include at least one of a past history of the user, a current goal of the user, and an additional goal of the user;
determining, by the electronic device (100), objective context information of the user based on the plurality of objective parameters related to the user; and
determining, by the electronic device (100), the objective viewpoint of the user based on the objective context information of the user.

The method of claim 1, wherein the at least one viewpoint comprises a physical viewpoint of the user, and
wherein the method further comprises:
obtaining, by the electronic device (100), a plurality of physical parameters related to the user, wherein the plurality of physical parameters include at least one of an angle of a camera related to the user, a location of the user, an ambient light condition around the user, a weather condition around the user, and a privacy preference of the user;
determining, by the electronic device (100), physical context information of the user based on the plurality of physical parameters related to the user; and
determining, by the electronic device (100), the physical viewpoint of the user based on the physical context information of the user.

The method of claim 1, wherein identifying, by the electronic device (100), the set of frames including the at least one ROI from the plurality of frames comprises:
determining, by the electronic device (100), an excitation level of each frame of the plurality of frames of the video based on a plurality of excitation parameters associated with each frame, wherein the plurality of excitation parameters include at least one of a speed of the ROI in each frame, an intensity of the ROI in each frame, an appearance frequency of the ROI in each frame, and a reproduction period of each frame;
extracting, by the electronic device (100), at least one of an audio parameter and a text parameter of each frame;
determining, by the electronic device (100), relative context information of each frame of the plurality of frames based on the excitation level of each frame and at least one of the audio parameter and the text parameter; and
identifying, by the electronic device (100), the set of frames including the at least one ROI from the plurality of frames based on the relative context information of each frame.

The method of claim 1, wherein generating, by the electronic device (100), the video summary based on the identified set of frames comprises:
determining, by the electronic device (100), a weight for each frame of the identified set of frames based on the ROI and the viewpoint of the user;
sequencing, by the electronic device (100), each frame of the identified set of frames based on the weight determined for each frame; and
generating, by the electronic device (100), the video summary by merging the sequenced set of frames.

The method of claim 6, wherein determining, by the electronic device (100), the weight for each frame of the identified set of frames based on the ROI and the viewpoint of the user comprises:
obtaining, by the electronic device (100), a relationship parameter between the at least one viewpoint of the user and each frame of the identified set of frames, and a perspective angle of each frame of the identified set of frames, wherein the relationship parameter comprises at least one of an angle of the video based on the at least one viewpoint of the user and a perspective view of a scene in the identified frame; and
determining, by the electronic device (100), the weight for the identified frame based on the obtained relationship parameter.

The method of claim 1, wherein identifying, by the electronic device (100), the set of frames including the at least one ROI from the plurality of frames comprises:
determining, by the electronic device (100), an absolute completeness score of the video;
determining, by the electronic device (100), absolute frame excitation information of the video based on the absolute completeness score;
detecting, by the electronic device (100), co-reference information of the video based on the absolute frame excitation information; and
determining, by the electronic device (100), a sequence excitation level of the video based on the co-reference information.

The method of claim 8, wherein determining, by the electronic device (100), the absolute frame excitation information of the video based on the absolute completeness score comprises:
obtaining, by the electronic device (100), a speed of the ROI in each frame, an intensity of the ROI in each frame, an appearance frequency of the ROI in each frame, and a reproduction period of each frame; and
determining, by the electronic device (100), the absolute frame excitation information of the video based on the obtained speed of the ROI in each frame, the obtained intensity of the ROI in each frame, the obtained appearance frequency of the ROI in each frame, and the obtained reproduction period of each frame.

The method of claim 8, wherein the absolute completeness score is determined by:
obtaining absolute frame information related to the video;
obtaining a completeness threshold associated with the video; and
comparing the obtained absolute frame information associated with the video with the obtained completeness threshold associated with the video.

The method of claim 8, wherein the absolute frame excitation information is configured to drive relative frame excitation associated with the set of frames for sequencing the set of frames.

The method of claim 8, wherein the co-reference information is configured to maintain the sequence excitation level associated with the set of frames, and wherein the co-reference information is determined by:
obtaining at least one of a scene comprising audio usage associated with the set of frames and semantic similarities associated with the set of frames; and
determining the co-reference information based on at least one of the obtained scene comprising the audio usage associated with the set of frames and the obtained semantic similarities associated with the set of frames.

The method of claim 8, wherein the sequence excitation level is configured to map a similarity associated with the set of frames.

An electronic device for generating a video summary, comprising:
a display; and
a controller connected to the display,
wherein the controller is configured to:
receive a video comprising a plurality of frames;
determine at least one viewpoint of a user viewing the video;
determine whether at least one ROI of the user is available in the video based on the at least one viewpoint of the user;
identify a set of frames including the at least one ROI from the plurality of frames in response to determining that the at least one ROI is available in the video;
generate the video summary based on the identified set of frames; and
present the video summary on the display.

The electronic device of claim 14, wherein the at least one viewpoint comprises a subjective viewpoint of the user, and
wherein the controller is configured to:
obtain a plurality of subjective parameters associated with the user, wherein the plurality of subjective parameters include at least one of an occupation of the user, an age of the user, a preference of the user, an event associated with the user, and an activity of the user on a social networking site;
determine subjective context information of the user based on the plurality of subjective parameters associated with the user; and
determine the subjective viewpoint of the user based on the subjective context information of the user.

The electronic device of claim 14, wherein the at least one viewpoint comprises an objective viewpoint of the user, and
wherein the controller is configured to:
obtain a plurality of objective parameters associated with the user, wherein the plurality of objective parameters include at least one of a past history of the user, a current goal of the user, and an additional goal of the user;
determine objective context information of the user based on the plurality of objective parameters associated with the user; and
determine the objective viewpoint of the user based on the objective context information of the user.

17. The electronic device of claim 14, wherein the at least one viewpoint comprises a physical viewpoint of the user, and
wherein the controller is configured to:
obtain a plurality of physical parameters associated with the user, the plurality of physical parameters comprising at least one of an angle of a camera 170 associated with the user, a location of the user, an ambient light condition around the user, a weather condition around the user, and a privacy preference of the user;
determine physical context information of the user based on the plurality of physical parameters associated with the user; and
determine the physical viewpoint of the user based on the physical context information of the user.

18. The electronic device of claim 14, wherein the controller is configured to:
determine an excitation level of each frame from the plurality of frames of the video based on a plurality of excitation parameters associated with each frame, the plurality of excitation parameters comprising at least one of a speed of the ROI in each frame, an intensity of the ROI in each frame, a frequency of appearance of the ROI in each frame, and a playback period of each frame;
extract at least one of an audio parameter and a text parameter of each frame;
determine relative context information of each frame from the plurality of frames based on the excitation level of each frame and the at least one of the audio parameter and the text parameter; and
identify the set of frames, from the plurality of frames, including the at least one ROI based on the relative context information of each frame.
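As a hedged illustration of the frame-identification steps above, the sketch below combines the four excitation parameters into a single level and then adds an audio or text cue to obtain a relative context value. The weighted sum, the equal weights, the use of `max` for the audio and text cues, and the selection threshold are all assumptions made for the example; the claims do not specify how the parameters are combined.

```python
def excitation_level(speed, intensity, frequency, playback_period,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    """Assumed combination: weighted sum of the four excitation parameters."""
    params = (speed, intensity, frequency, playback_period)
    return sum(w * p for w, p in zip(weights, params))


def relative_context(excitation, audio_score=0.0, text_score=0.0):
    """Assumed combination: excitation plus the stronger of the audio/text cues."""
    return excitation + max(audio_score, text_score)


def select_frames(frame_params, threshold=0.5):
    """Return indices of frames whose relative context meets the threshold."""
    selected = []
    for index, p in enumerate(frame_params):
        level = excitation_level(p["speed"], p["intensity"],
                                 p["frequency"], p["playback_period"])
        context = relative_context(level, p.get("audio", 0.0), p.get("text", 0.0))
        if context >= threshold:
            selected.append(index)
    return selected
```

All inputs are assumed to be normalized to the range 0 to 1 before being passed in.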

19. The electronic device of claim 14, wherein the controller is configured to:
determine a weight for each frame from the identified set of frames based on the ROI and the viewpoint of the user;
sequence each frame from the identified set of frames based on the weight determined for each frame; and
merge the sequenced set of frames to generate the video summary.
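The weighting, sequencing, and merging steps can be sketched as follows. The example assumes the weights have already been computed and that "sequencing" means ordering frames from highest to lowest weight; both points are assumptions, and `build_summary` is a hypothetical name.

```python
def build_summary(frames, weights):
    """Order frames by descending weight and merge them into one sequence.

    Python's sort is stable, so frames with equal weight keep their
    original relative order.
    """
    ranked = sorted(zip(frames, weights), key=lambda fw: fw[1], reverse=True)
    return [frame for frame, _ in ranked]
```

A different ordering rule, such as restoring chronological order after selecting the top-weighted frames, would fit the claim language equally well.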

20. The electronic device of claim 19, wherein the controller is configured to:
obtain a relationship parameter between the at least one viewpoint of the user and each frame from the identified set of frames, and a perspective angle of each frame from the identified set of frames, the relationship parameter comprising at least one of an angle of the video based on the at least one viewpoint of the user and a perspective view of a scene in the identified frame; and
determine the weight for the identified frame based on the obtained relationship parameter.

21. The electronic device of claim 14, wherein the controller is configured to:
determine an absolute completeness score of the video;
determine absolute frame excitation information of the video based on the absolute completeness score;
detect co-reference information of the video based on the absolute frame excitation information; and
determine a sequence excitation level of the video based on the co-reference information.

22. The electronic device of claim 21, wherein the controller is configured to:
obtain a speed of the ROI in each frame, an intensity of the ROI in each frame, a frequency of appearance of the ROI in each frame, and a playback period of each frame; and
determine the absolute frame excitation information of the video based on the obtained speed of the ROI in each frame, the obtained intensity of the ROI in each frame, the obtained frequency of appearance of the ROI in each frame, and the obtained playback period of each frame.

23. The electronic device of claim 21, wherein the controller is configured to:
obtain absolute frame information associated with the video;
obtain a completeness threshold associated with the video; and
determine the absolute completeness score by comparing the obtained absolute frame information associated with the video to the obtained completeness threshold associated with the video.
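One way to read the comparison step above is sketched below. It assumes, for illustration only, that the absolute frame information is a list of per-frame numeric values and that the score is the fraction of frames meeting the completeness threshold; the claims do not define the score this way, and `absolute_completeness_score` is a hypothetical name.

```python
def absolute_completeness_score(frame_values, completeness_threshold):
    """Assumed scoring: fraction of frames whose value meets the threshold."""
    if not frame_values:
        return 0.0
    meeting = sum(1 for v in frame_values if v >= completeness_threshold)
    return meeting / len(frame_values)
```

For example, with per-frame values 0.2, 0.8, 0.9, and 0.5 and a threshold of 0.5, three of the four frames meet the threshold.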

24. The electronic device of claim 21, wherein the controller is configured to drive relative frame excitation associated with the set of frames, based on the absolute frame excitation information, to sequence the set of frames.

25. The electronic device of claim 21, wherein the controller is configured to maintain the sequence excitation level associated with the set of frames based on the co-reference information, and
wherein the controller is configured to:
obtain at least one of a scene comprising audio usage associated with the set of frames and a semantic similarity associated with the set of frames; and
determine the co-reference information based on at least one of the obtained scene comprising the audio usage associated with the set of frames and the obtained semantic similarity associated with the set of frames.

26. The electronic device of claim 21, wherein the controller is configured to map a similarity associated with the set of frames based on the sequence excitation level.