KR102080315B1

KR102080315B1 - Method for providing video service and service server using the same

Info

Publication number: KR102080315B1
Application number: KR1020180063439A
Authority: KR
Inventors: 김진중; 우성섭
Original assignee: 네이버 주식회사; 라인 가부시키가이샤
Priority date: 2018-06-01
Filing date: 2018-06-01
Publication date: 2020-02-24
Also published as: JP2019212308A; KR20190137359A; JP6824332B2

Abstract

The present application relates to a method for providing a video service and a service server using the same. The method may include: a unit section separation step of separating a video into a plurality of unit sections based on a change in the characteristics of the voice included in the video; a script string generation step of recognizing the voice included in a unit section and generating a script string corresponding to the voice; a caption string generation step of recognizing a caption image included in the unit section and generating a caption string corresponding to the caption image; and a keyword generation step of applying natural language processing (NLP) to the script string and the caption string to generate a keyword corresponding to the unit section.

Description

Method for providing video service and service server using same {Method for providing vedio service and service server using the same}

The present application relates to a method for providing a video service and a service server using the same, and more particularly to a method and a service server capable of separating a video on a semantic basis and automatically generating a keyword for each unit section.

With the recent development of Internet technology, video services that provide videos over the Internet have come into wide use. A user who wants to watch a video over the Internet needs to find the desired video among the countless videos available online, and various video search methods have been proposed to make that search effective.

Recently, however, users are increasingly interested in only a part of a video rather than the whole, and want to watch only that part. For example, a user who wants to watch a soccer broadcast may want to see only the scenes in which a particular player scores, rather than the entire broadcast. Because general video search methods treat the entire soccer broadcast as the search target, it has been difficult for users to search for individual scenes within a video.

Korean Patent Registration No. 10-0721409

The present application seeks to provide a video service providing method, and a service server using the same, that can separate a video on a semantic basis and automatically generate a keyword for each unit section.

The present application seeks to provide a video service providing method, and a service server using the same, that can separate a video into a plurality of unit sections based on a change in the characteristics of the voice in the video.

The present application seeks to provide a video service providing method, and a service server using the same, that can apply voice recognition and caption recognition to each unit section of a separated video and automatically generate keywords reflecting the content of that unit section.

The present application seeks to provide a video service providing method, and a service server using the same, that can apply natural language processing based on machine learning to automatically generate keywords reflecting the content of each unit section of a video.

A video service providing method according to an embodiment of the present invention is a method in which a service server provides a video to a terminal device, and may include: a unit section separation step of separating the video into a plurality of unit sections based on a change in the characteristics of the voice included in the video; a script string generation step of recognizing the voice included in a unit section and generating a script string corresponding to the voice; a caption string generation step of recognizing a caption image included in the unit section and generating a caption string corresponding to the caption image; and a keyword generation step of applying natural language processing (NLP) to the script string and the caption string to generate a keyword corresponding to the unit section.

A service server according to an embodiment of the present invention may include: a unit section separator that separates a video into a plurality of unit sections based on a change in the characteristics of the voice included in the video; a script string generator that recognizes the voice included in a unit section and generates a script string corresponding to the voice; a caption string generator that recognizes a caption image included in the unit section and generates a caption string corresponding to the caption image; and a keyword generator that applies natural language processing (NLP) to the script string and the caption string to generate a keyword corresponding to the unit section.

A service server according to another embodiment of the present invention includes a processor and a memory coupled to the processor. The memory includes one or more modules configured to be executed by the processor, and the one or more modules may include instructions for: separating a video into a plurality of unit sections based on a change in the characteristics of the voice included in the video; recognizing the voice included in a unit section and generating a script string corresponding to the voice; recognizing a caption image included in the unit section and generating a caption string corresponding to the caption image; and applying natural language processing (NLP) to the script string and the caption string to generate a keyword corresponding to the unit section.

The means for solving the problem described above do not enumerate all of the features of the present invention. The various features of the present invention, and the advantages and effects that follow from them, may be understood in more detail with reference to the specific embodiments below.

According to the video service providing method and the service server using the same according to an embodiment of the present invention, the video is separated based on a change in the characteristics of the voice in the video, so the video can be separated without damaging its context or meaning.

According to the video service providing method and the service server using the same according to an embodiment of the present invention, voice recognition and caption recognition are applied to extract the content of each unit section, and that content is then used to set keywords for the unit section, so keywords can be set according to what the unit section actually contains.

According to the video service providing method and the service server using the same according to an embodiment of the present invention, a user can search for a specific scene in a video based on its content, and a summary video can be generated around a specific topic or content.

However, the effects achievable by the video service providing method and the service server using the same according to embodiments of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a schematic diagram illustrating a video service providing system according to an embodiment of the present invention.
FIGS. 2 and 3 are block diagrams illustrating a service server according to an embodiment of the present invention.
FIG. 4 is a schematic diagram illustrating the separation of a video into unit sections according to an embodiment of the present invention.
FIG. 5 is a schematic diagram illustrating the generation of a script string and a caption string according to an embodiment of the present invention.
FIG. 6 is a schematic diagram illustrating the detection of a caption image according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a video service providing method according to another embodiment of the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting, and do not in themselves have distinct meanings or roles. That is, the term "unit" as used in the present invention refers to software or to a hardware component such as an FPGA or an ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium, or may be configured to run on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functionality provided within the components and "units" may be combined into a smaller number of components and "units", or further separated into additional components and "units".

In describing the embodiments disclosed in this specification, a detailed description of related known technology is omitted where it is judged that it could obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments disclosed in this specification. They do not limit the technical idea disclosed herein, which should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

FIG. 1 is a schematic diagram illustrating a video service providing system according to an embodiment of the present invention.

Referring to FIG. 1, the video service providing system according to an embodiment of the present invention may include a terminal device 1 and a service server 100.

Hereinafter, the video service providing system according to an embodiment of the present invention is described with reference to FIG. 1.

The terminal device 1 may communicate with the service server 100 over a network and may receive the video service provided by the service server 100. The terminal device 1 may include a display unit, a speaker, and the like for presenting content such as video to the user visually or audibly, and may include an input unit that receives user input, a memory in which at least one program is stored, and a processor.

The terminal device 1 may be a mobile terminal such as a smartphone or tablet PC, or a stationary device such as a desktop. Depending on the embodiment, the terminal device 1 may be a mobile phone, a smartphone, a laptop computer, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a slate PC, a tablet PC, an ultrabook, or a wearable device (for example, a smartwatch, smart glasses, or a head mounted display (HMD)).

The network connecting the terminal device 1 and the service server 100 may include wired and wireless networks, and specifically may include various networks such as a local area network (LAN), a metropolitan area network (MAN), and a wide area network (WAN). The network may also include the well-known World Wide Web (WWW). However, the network according to the present invention is not limited to those listed above, and may include a known wireless data network, a known telephone network, a known wired or wireless television network, and the like.

The service server 100 may provide a video service to the terminal device 1 over the network. The service server 100 may store a plurality of video contents that can be provided to the terminal device 1, and may provide a video to the terminal device 1 at its request. For example, the service server 100 may stream content such as video in real time, or may make the content available for download.

When providing the video service, the service server 100 may additionally provide meta information about the video. That is, by setting meta information for the video itself, it can give the user additional information such as the video's cast, plot, and genre, and can use that information to offer the user video search or recommendation services.

Here, in addition to setting meta information for the video itself, the service server 100 according to an embodiment of the present invention can also set meta information for the content contained within the video. That is, the service server 100 separates the video into semantically based unit sections and then sets keywords for each unit section, which lets the user navigate to only the desired section of the entire video. It is also possible to collect the unit sections that share the same keyword and provide the user with a summary video that condenses the entire video.
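The per-section keywords described above can be pictured as a small data model. The following is a minimal sketch under stated assumptions, not the patent's implementation: the names `UnitSection`, `search_sections`, and `summary_ranges` are hypothetical, and times are assumed to be seconds from the start of the video.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class UnitSection:
    """One semantically separated section of a video (hypothetical model)."""
    start: float                      # section start, in seconds
    end: float                        # section end, in seconds
    keywords: Set[str] = field(default_factory=set)


def search_sections(sections: List[UnitSection], keyword: str) -> List[UnitSection]:
    """Return only the unit sections tagged with the given keyword."""
    return [s for s in sections if keyword in s.keywords]


def summary_ranges(sections: List[UnitSection], keyword: str) -> List[Tuple[float, float]]:
    """Return (start, end) ranges of matching sections in playback order.

    Concatenating these ranges would give a keyword-based summary video.
    """
    matches = sorted(search_sections(sections, keyword), key=lambda s: s.start)
    return [(s.start, s.end) for s in matches]
```

A player could seek directly to the first returned range for scene search, or play all returned ranges back to back as a summary.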

FIG. 2 is a block diagram illustrating a service server according to an embodiment of the present invention.

Referring to FIG. 2, the service server 100 according to an embodiment of the present invention may include a unit section separator 110, a script string generator 120, a caption string generator 130, a keyword generator 140, a search unit 150, and a summary video generator 160.

Hereinafter, the service server 100 according to an embodiment of the present invention is described with reference to FIG. 2.

The unit section separator 110 may separate a video into a plurality of unit sections. That is, the unit section separator 110 may load the target video and separate it into a plurality of unit sections based on a change in the characteristics of the voice included in the loaded video. Here, the change in voice characteristics may be a change in volume or sound quality, and depending on the embodiment may also include changes in pitch, timbre, and the like.

Specifically, the unit section separator 110 may track the volume in the video in order to identify changes in voice characteristics. For example, the volume may stay within a certain range for some period of the video and then suddenly leave that range, rising or falling sharply. The unit section separator 110 tracks the volume in the video and detects the points at which the volume changes. That is, the unit section separator 110 may use the amount of change in volume to detect points of sharp rise or fall.

Here, the amount of change in volume may be calculated relative to the average volume over a certain period of the video, or relative to the maximum or minimum volume observed in that period. That is, the unit section separator 110 may compare the measured volume against a reference such as the average to calculate how much it has changed, and may set a point where the volume has increased by at least a certain threshold as a rising point and a point where it has decreased by at least a certain threshold as a falling point. The thresholds for rising points and falling points may be set differently from each other, and may also be set differently for each video.
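The threshold comparison described above can be sketched as follows. This is an illustrative sketch only, assuming the reference is a moving average over the preceding `window` samples; the function name `find_change_points`, its parameters, and the choice to report only the first sample of each run are assumptions, not details from the source.

```python
from typing import List, Sequence, Tuple


def find_change_points(
    volumes: Sequence[float],
    window: int = 5,
    rise_threshold: float = 10.0,
    fall_threshold: float = 10.0,
) -> Tuple[List[int], List[int]]:
    """Detect rising and falling points in a per-frame volume series.

    Each sample is compared with the average of the `window` samples before it.
    Separate thresholds are used for rises and falls, as the text allows.
    Only the first sample of a run of consecutive rises (or falls) is reported.
    """
    rises: List[int] = []
    falls: List[int] = []
    previous = None  # kind of change detected at the previous sample
    for i in range(window, len(volumes)):
        baseline = sum(volumes[i - window:i]) / window
        delta = volumes[i] - baseline
        if delta >= rise_threshold:
            kind = "rise"
        elif -delta >= fall_threshold:
            kind = "fall"
        else:
            kind = None
        if kind == "rise" and previous != "rise":
            rises.append(i)
        elif kind == "fall" and previous != "fall":
            falls.append(i)
        previous = kind
    return rises, falls
```

Using the period's maximum or minimum as the reference instead of the average, which the text also permits, would only change how `baseline` is computed.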

The unit section separator 110 may separate the video into a plurality of unit sections at these rising or falling points. In this way it can, for example, detect a home run scene in a baseball game from the roar of the crowd when the batter hits a home run, or detect the brief pause an anchor takes in a news program before moving on to the next story.

Depending on the embodiment, the unit section separator 110 may also examine the sound quality in the video in order to identify changes in the characteristics of the voice. For example, the unit section separator 110 may detect a point where the sound, previously clean, suddenly contains a lot of noise, and may separate unit sections at the detected point. That is, the unit section separator 110 can detect the change in sound quality that occurs when the anchors in a news program hand over to an announcer in the field, and separate the video on that basis.

Additionally, when there are multiple speakers in the video, the unit section separator 110 may use timbre to distinguish the speakers and then separate unit sections by speaker. Beyond these, the unit section separator 110 may detect changes in voice characteristics in various other ways and separate unit sections accordingly.

In a news video, the anchor typically reads a script at a steady pace, pausing briefly at the end of one paragraph before continuing with the next. That is, the paragraphs read by the speaker in the video can be distinguished based on the change in the speaker's volume. Since a single paragraph generally covers a single topic, separating the video on this basis separates it semantically. Even for videos without a fixed script, separating the video based on the change in the speaker's volume is advantageous for preserving the context of what the speaker says. Accordingly, the unit section separator 110 may separate the video into a plurality of unit sections based on the change in volume of the voice included in the video.

For example, as shown in FIG. 4, using the change in the anchor's volume in a news video (V), the entire video can be divided into sections in which the anchor speaks (A) and stop sections in which the anchor has stopped speaking (B). Because the video is separated based on the change in the anchor's volume, a single unit section may contain multiple scene changes.

Since each section in which the anchor speaks (A) corresponds to a unit section, the unit sections can be separated by setting the stop sections (B) as cutting points. Here, a stop section (B) may be defined as a section in which the volume falls below a set value and remains below that value for at least a reference time. The length of the stop section (B) may be set differently for each video.
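The stop-section rule above (volume below a set value, held for at least a reference time) can be sketched as below. This is a minimal illustration under stated assumptions: the volume is assumed to be sampled at a fixed interval `frame_sec`, and the names `find_stop_sections` and `split_unit_sections` are hypothetical.

```python
from typing import List, Sequence, Tuple


def find_stop_sections(
    volumes: Sequence[float],
    frame_sec: float,
    silence_level: float,
    min_stop_sec: float,
) -> List[Tuple[float, float]]:
    """Return (start, end) times of runs where volume stays below `silence_level`
    for at least `min_stop_sec` seconds."""
    stops: List[Tuple[float, float]] = []
    run_start = None
    # A trailing sentinel above any threshold closes a quiet run at the end.
    for i, v in enumerate(list(volumes) + [float("inf")]):
        if v < silence_level:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * frame_sec >= min_stop_sec:
                stops.append((run_start * frame_sec, i * frame_sec))
            run_start = None
    return stops


def split_unit_sections(
    total_sec: float, stops: Sequence[Tuple[float, float]]
) -> List[Tuple[float, float]]:
    """Use the stop sections as cutting points; what remains are unit sections."""
    sections: List[Tuple[float, float]] = []
    cursor = 0.0
    for start, end in stops:
        if start > cursor:
            sections.append((cursor, start))
        cursor = end
    if cursor < total_sec:
        sections.append((cursor, total_sec))
    return sections
```

Because the text says the stop-section length may differ per video, `silence_level` and `min_stop_sec` are left as per-call parameters rather than constants.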

Accordingly, the unit section separator 110 may use the change in voice characteristics to identify the stop sections (B) in the video, and use them to separate the video into a plurality of unit sections.

The script string generator 120 may recognize the voice included in a unit section and generate a script string corresponding to the voice. Once the video has been separated into a plurality of unit sections, the content of each unit section needs to be recognized. To this end, the script string generator 120 may recognize the voice uttered by the speaker, convert it into characters, and combine the converted characters into a script string.

Depending on the embodiment, a separate voice recognition device may be provided in the service server 100, and the script string generator 120 may use it to convert voice into characters. For example, the voice included in a unit section may be represented as a voice pattern, which is an electrical signal, and a voice model database or the like may store a standard voice pattern corresponding to each character. In this case, the voice recognition device may compare the input voice patterns with the standard voice patterns stored in the voice model database and extract the standard voice pattern corresponding to each input pattern. The extracted standard voice patterns may then be converted into their corresponding characters, and the converted characters may be combined to generate the script string. That is, as shown in FIG. 5, the script string generator 120 may recognize the voice uttered by the speaker in the video and generate it as a script string (S1).
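The compare-and-look-up step described above can be illustrated with a toy nearest-template matcher. This is a sketch under strong assumptions: patterns are assumed to be already reduced to fixed-length numeric feature vectors, similarity is assumed to be Euclidean distance, and `nearest_template` and `patterns_to_string` are hypothetical names. Practical speech recognizers are considerably more involved.

```python
import math
from typing import Dict, Iterable, Sequence


def nearest_template(
    pattern: Sequence[float], templates: Dict[str, Sequence[float]]
) -> str:
    """Return the character whose stored standard pattern is closest to `pattern`."""
    return min(templates, key=lambda ch: math.dist(pattern, templates[ch]))


def patterns_to_string(
    patterns: Iterable[Sequence[float]], templates: Dict[str, Sequence[float]]
) -> str:
    """Convert each input pattern to its best-matching character and join them."""
    return "".join(nearest_template(p, templates) for p in patterns)
```

Here `templates` plays the role of the model database: a mapping from each character to its standard pattern.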

However, the way in which the script string generator 120 converts voice into characters is not limited to this, and the script string generator 120 may convert the voice included in the video into characters in various other ways.

The caption string generator 130 may recognize a caption image included in a unit section and generate a caption string corresponding to the caption image. A video may include caption images to emphasize what the speaker says or what the video is intended to convey. For example, as shown in FIG. 5, a news video may include a caption image (C) that summarizes the main content of the news.

Because the caption image displays a summary of the video's content in this way, the characters in the caption image need to be recognized in order to identify the content of each unit section. However, a caption image is perceived as a shape rather than as characters, so a character recognition algorithm or the like must be applied to recognize the characters it contains.

Depending on the embodiment, a separate character recognition device may be provided in the service server 100, and the caption string generator 130 may use it to convert the caption image into characters. For example, the caption image included in a unit section may be scanned so that the distribution of its pixel values is represented as a shape pattern, which is an electrical signal, and a character model database or the like may store a standard shape pattern corresponding to each character. In this case, the character recognition device may compare the input shape patterns with the standard shape patterns stored in the character model database and extract the standard shape pattern corresponding to each input pattern. Each extracted standard shape pattern may then be converted into its corresponding character to generate the caption string. That is, as shown in FIG. 5, the shapes contained in the caption image (C) in a video frame (f) may be converted into characters and extracted as a caption string (S2).

Meanwhile, for the caption string generation unit 130 to extract a caption string from a caption image, it must determine whether a caption image exists within the unit section and where the caption image is located within the video frame. That is, character recognition may be performed only on video frames that contain a caption image, and only on the region of the frame in which the caption image is located, which makes character recognition more efficient. This also prevents problems such as converting text in the video frame that is not part of the caption image. Therefore, before generating the caption string, the caption string generation unit 130 may first detect the video frames in the unit section that contain a caption image and specify the position of the caption image within those frames.

Specifically, the caption string generation unit 130 may set a plurality of landmarks in each video frame included in the unit section. That is, as shown in FIG. 6, the landmarks L may be placed uniformly within the video frame, and color, luminance, or the like may be measured at each landmark L. Specifically, the color, luminance, and so on of the pixel at the position of each landmark L may be read from that pixel.

Then, if the color or luminance measured at a landmark matches the reference color or reference luminance of the caption image, it may be determined that a caption image is located in that video frame. As shown in FIG. 6, the caption image C may be displayed so as to cover the original image D, and the caption image C may be set to have a reference color and a reference luminance. Because the reference color and reference luminance of the caption image C are chosen to be distinguishable from the original image D, the caption string generation unit 130 can identify the caption image by its color or luminance.

In addition, from among the plurality of landmarks uniformly distributed over the video frame, the caption string generation unit 130 may extract the landmarks at which the reference color or reference luminance of the caption image was measured, and may use the extracted landmarks to specify the position or size of the caption image. That is, the coordinates of each landmark within the video frame may be set in advance, and the caption string generation unit 130 may specify the position and size of the caption image from the coordinates of the landmarks at which the caption image was detected. In this case, the caption string generation unit 130 may restrict character recognition to the specified caption image region. Because the region subject to character recognition can be narrowed down within the full video frame, character recognition becomes more efficient.
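The landmark procedure described above can be sketched as follows. This is a minimal illustration, assuming a frame given as a grid of luminance values and a caption detected by closeness to a reference luminance; the names `grid_landmarks` and `locate_caption`, the grid spacing, and the tolerance are hypothetical choices, not values from the specification.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]          # frame[y][x] holds a luminance value, 0..255
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixel coordinates


def grid_landmarks(width: int, height: int, step: int) -> List[Tuple[int, int]]:
    """Place landmarks uniformly over the frame, one every `step` pixels."""
    return [(x, y)
            for y in range(step // 2, height, step)
            for x in range(step // 2, width, step)]


def locate_caption(frame: Frame, step: int = 4,
                   ref_luma: int = 235, tolerance: int = 10) -> Optional[Box]:
    """Return the box spanned by landmarks matching the reference luminance.

    Returns None when no landmark matches, i.e. the frame is treated as
    containing no caption image.
    """
    height, width = len(frame), len(frame[0])
    hits = [(x, y) for x, y in grid_landmarks(width, height, step)
            if abs(frame[y][x] - ref_luma) <= tolerance]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return (min(xs), min(ys), max(xs), max(ys))
```

The returned box is only as precise as the landmark spacing, so a caller might pad it by one `step` before passing the region to character recognition.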

Meanwhile, the caption string generation unit 130 may receive from the video producer feature information about the caption images used in each video, such as the reference color, the reference luminance, and the position or size within the video frame, and may use this information when extracting caption images. For example, when feature information about the position or size of the caption image is received, the landmarks need not be set uniformly over the entire video frame and may instead be confined to the region in which the caption image is expected to appear.

The keyword generation unit 140 may generate keywords corresponding to a unit section by applying natural language processing to the script string and the caption string. That is, instead of a user reviewing the content of each unit section and assigning keywords or annotations by hand, meaning-based keywords can be set automatically for each unit section. Various methods may be used for the natural language processing applied to the script string and caption string, and depending on the embodiment, machine learning techniques such as word2vec or LDA (Latent Dirichlet Allocation) may be applied.

According to one embodiment, the keyword generation unit 140 may build a word2vec model by performing word embedding with word2vec, set the words extracted from the caption string or script string as input words to the word2vec model, and extract the related words corresponding to each input word. The extracted related words may then be set as keywords for the unit section.

For example, when the videos provided by the service server 100 are news videos, the word2vec model may be built by word-embedding news articles from the last five years using word2vec. word2vec embeds each word in a vector space so that the word is represented as a vector, and related words are placed close to each other in that space. That is, the more frequently words appear near each other in the samples on which the word2vec model is trained, the closer together they are placed in the vector space. For example, if "nuclear" is frequently mentioned in connection with "North Korea" in the news articles used as samples, the vectors for "North Korea" and "nuclear" may be embedded close to each other, and the two words may be judged to be related.

However, because a script string contains many words, setting every related word extracted for every word in the script string as a keyword could produce too many keywords. To prevent this, the keyword generation unit 140 may compare each related word with its input word and set only related words with high similarity as keywords.

Specifically, the keyword generation unit 140 may calculate the similarity between the input word vector, which corresponds to the input word supplied to the word2vec model, and the related word vector, which corresponds to a related word, and may extract only the related words with high similarity and set them as keywords.

Through word embedding, words are distributed as vectors in a space, and words that are similar or related to each other in the training samples are located close together in the vector space. Therefore, the relationship between an input word and its related words can be determined by calculating the similarity between the input word vector and the related word vectors. The similarity between vectors may be calculated using cosine similarity, but is not limited thereto, and any measure capable of expressing the similarity between vectors may be applied.
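Cosine similarity between an input word vector and a related word vector can be computed as in the following minimal sketch. The function name and the handling of zero vectors are illustrative assumptions; the specification only names the measure.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of vector length.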

The keyword generation unit 140 may extract the related word vectors whose similarity to the input vector is greater than or equal to a limit value, and may set the related words corresponding to the extracted vectors as keywords. That is, only related words whose vectors have a similarity at or above the limit value are set as keywords. Alternatively, depending on the embodiment, a preset number of related word vectors may be extracted in descending order of similarity to the input vector, and the related words corresponding to those vectors may be set as keywords. For example, the 10 related word vectors with the highest similarity may be extracted, and the corresponding 10 related words may be set as keywords.
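The two selection rules, a limit value and a preset count, can be sketched as below. The sketch assumes the similarities have already been computed and are supplied as a mapping from related word to score; the function names and example scores are hypothetical.

```python
from typing import Dict, List


def keywords_above_limit(similarities: Dict[str, float], limit: float) -> List[str]:
    """Keep related words whose similarity is at or above the limit value."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [word for word, score in ranked if score >= limit]


def top_n_keywords(similarities: Dict[str, float], n: int) -> List[str]:
    """Keep the n related words with the highest similarity."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]
```

With illustrative scores `{"nuclear": 0.82, "summit": 0.64, "weather": 0.12}`, a limit of 0.6 keeps "nuclear" and "summit", while `n=1` keeps only "nuclear".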

Additionally, an embodiment is possible in which the keyword generation unit 140 sets keywords using real-time search term information. Real-time search term information may be information about those search terms, among the terms used in a search service provided by a portal site or the like, whose search volume has surged in real time. Because the search terms included in the real-time search term information relate to topics that are currently of public interest, the keyword generation unit 140 may preferentially set words related to real-time search terms as keywords. The real-time search term information may be received by the service server 100 from an external source and provided to the keyword generation unit 140.

Specifically, from among the related words extracted from the word2vec model, the keyword generation unit 140 may identify the related words that match search terms included in the real-time search term information and apply a weight to those related words when calculating similarity. That is, a related word that matches the real-time search term information may be set as a keyword because of the weight even if its similarity is relatively low. The weight given to a related word may also vary according to the real-time search ranking of the matching search term. For example, different weights may be set for the search term ranked first and the search term ranked fifth.
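One way to apply a rank-dependent weight is sketched below. The specification does not say how the weight is combined with the similarity, so the multiplicative boost, the `rank_weight` formula, and the default values of `max_boost` and `decay` are all assumptions made for illustration.

```python
from typing import Dict, List


def rank_weight(rank: int, max_boost: float = 0.5, decay: float = 0.1) -> float:
    """Hypothetical weight: rank 1 gets the largest boost, lower ranks get less.

    The weight never drops below 1.0, so a trending word is never penalized.
    """
    return max(1.0, 1.0 + max_boost - decay * (rank - 1))


def weighted_keywords(similarities: Dict[str, float],
                      trending_ranks: Dict[str, int],
                      limit: float) -> List[str]:
    """Boost related words that are real-time search terms, then apply the limit."""
    scored = {}
    for word, score in similarities.items():
        if word in trending_ranks:
            score *= rank_weight(trending_ranks[word])
        scored[word] = score
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [word for word, score in ranked if score >= limit]
```

Under these assumptions a word with similarity 0.5 that is the top-ranked search term is scored 0.75 and passes a limit of 0.6, whereas an equally similar word that is not trending is dropped.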

When real-time search term information is used in setting keywords, the keywords that the keyword generation unit 140 sets for each unit section may differ each time they are generated. That is, keywords can be set to reflect the current interests and demands of users, so that users can easily search for unit sections related to topics of current interest.

Meanwhile, depending on the embodiment, the keyword generation unit 140 may set keywords using LDA (Latent Dirichlet Allocation). That is, the script string and caption string may be applied to a machine learning model trained with LDA to extract topic words corresponding to the unit section, and the extracted topic words may then be set as keywords of that unit section.

LDA is a topic model, an unsupervised learning algorithm that uses a collection of documents to determine which topics are present in each document. Modeling with LDA yields, as output, the words associated with each topic and the topics contained in each document.

For example, when the videos provided by the service server 100 are news videos, a machine learning model may be built by training LDA on news articles from the last five years. In this case, topic words representing the topics contained in each article, and the set of words corresponding to each topic word, can be extracted. For example, an article on the denuclearization of the Korean Peninsula may be classified as containing the topics "North Korea", "politics", and "denuclearization", and words such as "inter-Korean", "North Korean nuclear", and "summit" may be assigned to the "denuclearization" topic. Therefore, when the script string and caption string extracted from a unit section of a news video are input to the machine learning model, it can be determined which topic words the words in the input strings belong to, and thus which topics the content of that unit section covers. The keyword generation unit 140 may then set the topic words extracted by the machine learning model as keywords for that unit section.
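The lookup from section words to topic words can be illustrated with a deliberately simplified sketch. It does not train or run LDA; it assumes a topic-to-words table of the kind a trained topic model might yield and counts matches against it. The `TOPIC_WORDS` table, the function name, and the `min_hits` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical result of a trained topic model: topic word -> words assigned to it.
TOPIC_WORDS: Dict[str, Set[str]] = {
    "denuclearization": {"inter-korean", "north korean nuclear", "summit"},
    "economy": {"interest rate", "exports", "inflation"},
}


def topics_for_section(words: List[str],
                       topic_words: Dict[str, Set[str]] = TOPIC_WORDS,
                       min_hits: int = 1) -> List[str]:
    """Return topic words matched by the section's words, most matches first."""
    hits: Counter = Counter()
    for word in words:
        for topic, vocabulary in topic_words.items():
            if word.lower() in vocabulary:
                hits[topic] += 1
    return [topic for topic, count in hits.most_common() if count >= min_hits]
```

An actual LDA model assigns probabilities rather than hard word sets, so a fuller implementation would rank topics by inferred topic proportion instead of match count.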

In addition, depending on the embodiment, the keyword generation unit 140 may generate keywords for the video as a whole. Specifically, natural language processing may be applied to the keywords set for the unit sections included in a video to generate keywords corresponding to that video. Machine learning techniques such as word2vec or LDA (Latent Dirichlet Allocation) may be used for this natural language processing. Because having keywords that describe the content of the entire video is convenient for users, the keyword generation unit 140 may also generate keywords for the video itself. To reflect the content of the video, the keywords of the video may be generated from the keywords of its unit sections.
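As a simple stand-in for the natural language processing step, video-level keywords can be derived by counting how many unit sections carry each keyword. This frequency count is a substituted technique chosen for illustration, not the word2vec or LDA processing named in the text, and the function name is hypothetical.

```python
from collections import Counter
from typing import List


def video_keywords(section_keywords: List[List[str]], n: int) -> List[str]:
    """Return the n keywords that appear in the most unit sections."""
    counts: Counter = Counter()
    for keywords in section_keywords:
        counts.update(set(keywords))  # count each keyword once per section
    return [keyword for keyword, _ in counts.most_common(n)]
```

For three sections tagged `["goal", "penalty"]`, `["goal", "corner"]`, and `["goal", "penalty"]`, asking for two keywords yields "goal" and "penalty".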

The search unit 150 may search for unit sections corresponding to a keyword entered by the user and provide the retrieved unit sections to the user. Because keywords are set for each unit section, the search unit 150 can find unit sections containing specific content and provide them to the user. In addition, because the search unit 150 searches at the level of the unit sections separated from a video, it can provide only the unit sections the user wants. The search unit 150 can therefore greatly improve user convenience in providing a video service.
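Keyword search over unit sections can be sketched with an inverted index, as below. The `UnitSection` record, its fields, and the case-insensitive exact-match lookup are assumptions for illustration; the specification does not describe the data structure used by the search unit.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class UnitSection:
    video_id: str
    start: float               # section start, in seconds
    end: float                 # section end, in seconds
    keywords: Tuple[str, ...]  # keywords set for this section


def build_index(sections: Iterable[UnitSection]) -> Dict[str, List[UnitSection]]:
    """Map each lower-cased keyword to the unit sections that carry it."""
    index: Dict[str, List[UnitSection]] = defaultdict(list)
    for section in sections:
        for keyword in section.keywords:
            index[keyword.lower()].append(section)
    return index


def search(index: Dict[str, List[UnitSection]], query: str) -> List[UnitSection]:
    """Return the unit sections whose keywords include the query."""
    return list(index.get(query.lower(), []))
```

Because the index stores sections rather than whole videos, a query returns only the matching time ranges, which a player could then seek to directly.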

The summary video generation unit 160 may extract, from a single video, the unit sections corresponding to a reference keyword and combine the extracted unit sections to generate a summary video of that video. The reference keyword may be preset by an administrator or the like, or may be entered by a user.

For example, for a soccer broadcast, setting the reference keyword to "goal", "score", or the like allows only the scoring scenes to be extracted from the unit sections to generate a summary video of all goals, and setting the reference keyword to the name of a specific player allows only the unit sections in which that player touches the ball to be extracted to generate a highlight video for that player. For news videos, a summary video of the economy section may be generated by setting the reference keyword to "economy", or news on a specific issue such as "virtual currency" may be collected into a single summary video. That is, summary videos can be generated easily and provided to users without any separate editing of the video.
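Selecting the unit sections for a summary video can be sketched as follows. The sketch assumes each unit section is a `(start, end, keywords)` tuple and returns the time ranges to keep in playback order; the tuple layout and function name are hypothetical, and the actual cutting and joining of video is left to whatever video tool the server uses.

```python
from typing import Iterable, List, Set, Tuple

Section = Tuple[float, float, Set[str]]  # (start_sec, end_sec, keywords)


def select_for_summary(sections: Iterable[Section],
                       reference_keywords: Iterable[str]) -> List[Tuple[float, float]]:
    """Return (start, end) ranges of sections matching any reference keyword."""
    wanted = {keyword.lower() for keyword in reference_keywords}
    picked = [
        (start, end)
        for start, end, keywords in sections
        if wanted & {keyword.lower() for keyword in keywords}
    ]
    return sorted(picked)  # chronological order for concatenation
```

For a match broadcast, passing `["goal", "score"]` as reference keywords would return only the ranges of the scoring scenes.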

Meanwhile, as shown in FIG. 3, the service server 100 according to an embodiment of the present invention may include physical components such as a processor 10 and a memory 40, and the memory 40 may contain one or more modules configured to be executed by the processor 10. Specifically, the one or more modules may include a unit section separation module, a script string generation module, a caption string generation module, a keyword generation module, a search module, and a summary video generation module.

The processor 10 may execute various software programs and the instruction sets stored in the memory 40 to perform various functions and process data. The peripheral interface unit 30 may connect the input/output peripherals of the computer device to the processor 10 and the memory 40, and the memory controller 20 may control memory access when the processor 10 or another component of the computer device accesses the memory 40. Depending on the embodiment, the processor 10, the memory controller 20, and the peripheral interface unit 30 may be implemented on a single chip or as separate chips.

The memory 40 may include high-speed random access memory and nonvolatile memory such as one or more magnetic disk storage devices or flash memory devices. The memory 40 may further include a storage device located remotely from the processor 10, or a network-attached storage device accessed through a communication network such as the Internet.

Meanwhile, as shown in FIG. 3, the service server 100 according to an embodiment of the present invention may hold in the memory 40 an operating system together with application programs, namely the unit section separation module, script string generation module, caption string generation module, keyword generation module, search module, and summary video generation module. Each module is a set of instructions for performing the functions described above and may be stored in the memory 40.

Accordingly, in the service server 100 according to an embodiment of the present invention, the processor 10 may access the memory 40 and execute the instructions corresponding to each module. Because the unit section separation module, script string generation module, caption string generation module, keyword generation module, search module, and summary video generation module correspond respectively to the unit section separation unit, script string generation unit, caption string generation unit, keyword generation unit, search unit, and summary video generation unit described above, a detailed description is omitted here.

FIG. 7 is a flowchart illustrating a video service providing method according to an embodiment of the present invention.

Referring to FIG. 7, the video service providing method according to an embodiment of the present invention may include a unit section separation step (S10), a script string generation step (S20), a caption string generation step (S30), a keyword generation step (S40), a search step (S50), and a summary video generation step (S60). The video service providing method according to an embodiment of the present invention may be executed by a service server.

Hereinafter, the video service providing method according to an embodiment of the present invention is described with reference to FIG. 7.

In the unit section separation step (S10), the video may be divided into a plurality of unit sections based on changes in the characteristics of the speech contained in the video. Changes in speech characteristics may include changes in volume or sound quality. Specifically, changes in speech characteristics may be used to extract pause sections in which the speaker in the video stops speaking, and the video may be divided by setting the pause sections as cutting points. For example, a pause section may be defined as a section in which the volume falls below a set value and remains below the set value for at least a reference time. That is, in view of context, the span up to the point where the speaker stops talking may be treated as one section, and the change in volume may be used for this purpose when separating unit sections.
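The pause-based separation can be sketched as follows. The sketch assumes the audio has already been reduced to one loudness value per fixed-length frame; the function name, the parameter names `set_value` and `reference_time`, and the choice to drop the pause itself from both neighboring sections are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def split_on_pauses(volumes: Sequence[float], frame_sec: float,
                    set_value: float, reference_time: float) -> List[Tuple[float, float]]:
    """Split a per-frame volume track into unit sections at long quiet runs.

    A pause is a run of frames with volume below `set_value` that lasts at
    least `reference_time` seconds. Returns (start_sec, end_sec) pairs.
    """
    min_frames = max(1, math.ceil(reference_time / frame_sec))
    sections: List[Tuple[float, float]] = []
    section_start = 0   # first frame of the section being built
    run_start = None    # first frame of the current quiet run, if any
    n = len(volumes)
    for i in range(n + 1):          # index n acts as a sentinel "loud" frame
        quiet = i < n and volumes[i] < set_value
        if quiet:
            if run_start is None:
                run_start = i
            continue
        if run_start is not None and i - run_start >= min_frames:
            # The quiet run was long enough to count as a pause: cut here.
            if run_start > section_start:
                sections.append((section_start * frame_sec, run_start * frame_sec))
            section_start = i
        run_start = None
    if section_start < n:
        sections.append((section_start * frame_sec, n * frame_sec))
    return sections
```

Short dips in volume, such as the gap between two words, stay inside a section because they do not reach the reference time.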

In the script string generation step (S20), the speech included in a unit section may be recognized to generate a script string corresponding to the speech. After the video is divided into a plurality of unit sections, the content of each unit section needs to be recognized. To this end, the speech uttered by the speaker may be recognized and converted into characters, and the converted characters may be combined into a script string.

Depending on the embodiment, a speech recognition device may be provided and used to convert speech into text. For example, the speech included in a unit section may be expressed as a speech pattern, which is an electrical signal, and a speech model database or the like may store a standard speech pattern for each character. In this case, the speech recognition device may compare the input speech patterns with the standard speech patterns stored in the speech model database and extract the standard speech pattern corresponding to each input speech pattern. The extracted standard speech patterns may then be converted into their corresponding characters, and the converted characters may be combined to generate the script string.

In the caption string generation step (S30), the caption image included in a unit section may be recognized to generate a caption string corresponding to the caption image. Because a caption image may summarize the content of the video, the characters in the caption image need to be recognized. However, a caption image is handled as a shape rather than as text, so a character recognition algorithm or the like must be applied to recognize the characters it contains. The caption string generation step (S30) may be executed concurrently with the script string generation step (S20), but is not limited thereto.

Depending on the embodiment, a separate character recognition device may be provided and used to convert the caption image into text. For example, the caption image included in a unit section may be scanned so that the distribution of its pixel values is expressed as a shape pattern, which is an electrical signal, and a character model database or the like may store a standard shape pattern for each character. In this case, the character recognition device may compare each input shape pattern with the standard shape patterns stored in the character model database and extract the standard shape pattern corresponding to each input shape pattern. The extracted standard shape patterns may then be converted into their corresponding characters to generate the caption string.

Meanwhile, to extract a caption string from a caption image, it is necessary to determine whether a caption image exists within the unit section and where it is located within the video frame. That is, before the caption string is generated, the video frames in the unit section that contain a caption image may first be detected, and the position of the caption image within those frames may be specified. Specifically, in the caption string generation step (S30), a plurality of landmarks may be set in the video frames included in the unit section, and the caption image may be detected by measuring color or luminance at the landmarks. The position of the caption image may be specified by distributing the landmarks uniformly over the video frame and then extracting the landmarks at which the reference color or reference luminance of the caption image was measured.

In the keyword generation step (S40), natural language processing may be applied to the script string and the caption string to generate keywords corresponding to the unit section. That is, instead of a user reviewing the content of each unit section and assigning keywords or annotations by hand, meaning-based keywords can be set automatically for each unit section. Various methods may be used for the natural language processing applied to the script string and caption string, and depending on the embodiment, machine learning techniques such as word2vec or LDA (Latent Dirichlet Allocation) may be applied.

According to one embodiment, in the keyword generation step (S40), a word2vec model may be built by word embedding using word2vec, and words extracted from the caption string or the script string may be set as input words for the word2vec model in order to extract related words corresponding to the input words. The extracted related words may then be set as keywords for the corresponding unit section.

Here, the keyword generation step (S40) may compare the related words with the input word and restrict the keywords to related words with high similarity. Specifically, the similarity between the input word vector corresponding to the input word fed to the word2vec model and the related word vector corresponding to each related word may be calculated, and only related words with high similarity may be extracted and set as keywords.

Through word embedding, each word may be vectorized and distributed in a vector space, and words that are similar or related to each other in the training samples are located close to each other in that space. Therefore, the relationship between the input word and the related words can be determined by calculating the similarity between the input word vector and the related word vectors. Here, the similarity between vectors may be calculated using cosine similarity.

Specifically, related word vectors whose similarity to the input vector is equal to or greater than a limit value may be extracted, and the related words corresponding to the extracted vectors may be set as keywords. That is, only related words whose vectors have a similarity at or above the limit value are set as keywords. Alternatively, depending on the embodiment, a preset number of related word vectors may be extracted in descending order of similarity to the input vector, and the related words corresponding to those vectors may be set as keywords. For example, the 10 related word vectors with the highest similarity may be extracted, and the corresponding 10 related words may be set as keywords.
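The two selection rules described above (a similarity limit value, or a preset number of top-ranked related words) can be sketched as follows. This is an illustrative sketch only: the function names and the `limit` and `top_n` parameters are hypothetical, and the vectors are assumed to come from an already trained word2vec model.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_keywords(
    input_vector: Sequence[float],
    related: Dict[str, Sequence[float]],  # related word -> its embedding vector
    limit: Optional[float] = None,        # keep words with similarity >= limit
    top_n: Optional[int] = None,          # keep at most this many, best first
) -> List[str]:
    """Rank related words by cosine similarity to the input word vector and
    apply the limit-value rule, the top-N rule, or both."""
    scored = sorted(
        ((cosine_similarity(input_vector, vec), word) for word, vec in related.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if limit is not None:
        scored = [(sim, word) for sim, word in scored if sim >= limit]
    if top_n is not None:
        scored = scored[:top_n]
    return [word for _, word in scored]
```

For example, calling `select_keywords(vec, related, top_n=10)` corresponds to the embodiment that keeps the 10 most similar related words.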

Additionally, in the keyword generation step (S40), an embodiment that sets keywords using real-time search term information is also possible. For example, among the related words extracted from the word2vec model, the related words that correspond to search terms included in the real-time search term information may be identified, and a weight may be added to those related words when the similarity is calculated. That is, even a related word with relatively low similarity may be set as a keyword because of the weight, provided that it corresponds to the real-time search term information. In this case, the weight given to a related word may also be varied according to the real-time search ranking of the corresponding search term.
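A minimal sketch of this weighting follows. The source does not specify how the weight is computed or combined with the similarity, so the additive bonus, the linear `rank_weight` formula, and the `max_boost` parameter are all assumptions made for illustration.

```python
from typing import Dict, List


def rank_weight(rank: int, list_size: int, max_boost: float = 0.3) -> float:
    """Hypothetical differential weight: rank 1 receives max_boost and the
    bonus decreases linearly down to the last rank in the trending list."""
    return max_boost * (list_size - rank + 1) / list_size


def boosted_keywords(
    similarities: Dict[str, float],  # related word -> similarity to the input word
    trending: List[str],             # real-time search terms, best rank first
    limit: float,                    # similarity limit value for keyword selection
) -> List[str]:
    """Add a rank-dependent bonus to related words that appear in the trending
    list, then keep the words whose boosted score reaches the limit value."""
    ranks = {term: position + 1 for position, term in enumerate(trending)}
    boosted = {}
    for word, similarity in similarities.items():
        bonus = rank_weight(ranks[word], len(trending)) if word in ranks else 0.0
        boosted[word] = similarity + bonus
    kept = [word for word, score in boosted.items() if score >= limit]
    return sorted(kept, key=lambda word: boosted[word], reverse=True)
```

With this scheme a word such as "election" with similarity 0.45 would fall below a limit of 0.6 on its own, but would be kept once it tops the trending list.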

Meanwhile, depending on the embodiment, keywords may also be set using LDA (Latent Dirichlet Allocation) in the keyword generation step (S40). That is, the script string and the caption string may be applied to a machine learning model trained with LDA to extract a topic word corresponding to the unit section, and the extracted topic word may then be set as a keyword of that unit section. Since setting keywords with a machine learning model trained using LDA has already been described above, the details are omitted here.

In addition, depending on the embodiment, keywords for the entire video may also be generated in the keyword generation step (S40). That is, natural language processing may be applied to the keywords set for the individual unit sections included in the video to generate keywords corresponding to the video as a whole. Here, machine learning techniques such as word2vec and LDA (Latent Dirichlet Allocation) may be used as the natural language processing technique.

In the search step (S50), unit sections corresponding to a keyword input by the user may be searched for, and the retrieved unit sections may be provided to the user. Because a keyword is set for each unit section, a unit section containing specific content can be found and provided to the user. In addition, because the search operates on the individual unit sections separated from the video, it is possible to provide only the unit sections the user wants. This can greatly improve user convenience when providing a video service.
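One simple way to realize such a search is an inverted index from keyword to unit sections. The sketch below is an assumption-laden illustration: the `UnitSection` record, its fields, and the case-insensitive exact matching are hypothetical and are not taken from the source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class UnitSection:
    """Hypothetical record for one unit section of a video."""
    video_id: str
    start: float               # start time in seconds
    end: float                 # end time in seconds
    keywords: FrozenSet[str]   # keywords set in the keyword generation step


def build_index(sections: Iterable[UnitSection]) -> Dict[str, List[UnitSection]]:
    """Map each lower-cased keyword to the unit sections tagged with it."""
    index: Dict[str, List[UnitSection]] = defaultdict(list)
    for section in sections:
        for keyword in section.keywords:
            index[keyword.lower()].append(section)
    return index


def search(index: Dict[str, List[UnitSection]], query: str) -> List[UnitSection]:
    """Return the unit sections whose keywords match the query exactly,
    ignoring case; an unknown query yields an empty list."""
    return index.get(query.lower(), [])
```

Each result carries the video identifier and the start and end times, which is enough for a player to jump directly to the matching unit section.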

In the summary video generation step (S60), the unit sections of a given video that correspond to a reference keyword may be extracted, and the extracted unit sections may be combined to generate a summary video of that video. Here, the reference keyword may be preset by an administrator or the like, or may be input by the user. In other words, a summary video can easily be generated and provided to the user without any separate editing of the video.
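The selection part of this step can be sketched as building a cut list. The tuple layout of a section and the merging of back-to-back sections are assumptions for illustration; actually concatenating the video segments would be left to a separate video tool and is outside this sketch.

```python
from typing import Iterable, List, Set, Tuple

# Assumed layout of one unit section: (start_seconds, end_seconds, keywords)
Section = Tuple[float, float, Set[str]]


def summary_cut_list(
    sections: Iterable[Section],
    reference_keywords: Set[str],
) -> List[Tuple[float, float]]:
    """Pick the unit sections of one video that share at least one keyword
    with the reference keywords, in chronological order, and merge sections
    that touch or overlap into a single time range."""
    picked = sorted(
        (start, end) for start, end, keywords in sections if keywords & reference_keywords
    )
    merged: List[Tuple[float, float]] = []
    for start, end in picked:
        if merged and start <= merged[-1][1]:
            # Contiguous with the previous range: extend it instead of cutting.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The resulting list of time ranges describes which parts of the original video would be joined to form the summary video.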

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium may store a computer-executable program persistently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software. Accordingly, the above detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention fall within the scope of the present invention.

The present invention is not limited by the embodiments described above or by the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the present invention pertains that the components of the present invention may be substituted, modified, and changed without departing from the technical spirit of the present invention.

1: terminal device 100: service server
10: processor 20: memory controller
30: peripheral interface unit 40: memory
110: unit section separation unit 120: script string generation unit
130: caption string generation unit 140: keyword generation unit
150: search unit 160: summary video generation unit

Claims

A video service providing method in which a service server provides a video to a terminal device, the method comprising:
a unit section separation step of separating the video into a plurality of unit sections based on a change in a characteristic of a voice included in the video;
a script string generation step of recognizing the voice included in the unit section and generating a script string corresponding to the voice;
a caption string generation step of recognizing a caption image included in the unit section and generating a caption string corresponding to the caption image; and
a keyword generation step of generating, for each unit section, a semantic-based keyword corresponding to the unit section by applying natural language processing to the script string and the caption string,
wherein the unit section separation step
extracts, using the change in the characteristic of the voice, a stop section in which a speaker's utterance in the video stops, and separates the video by setting the stop section as a cutting point.

(Deleted)

The method of claim 1, wherein the unit section separation step
determines a section to be the stop section if the volume of the voice is below a set value and remains below the set value for at least a reference time.

The method of claim 1, wherein the script string generation step
uses a speech recognition device to convert voice patterns extracted from the voice into corresponding characters, and combines the converted characters to generate the script string.

The method of claim 1, wherein the caption string generation step
uses a character recognition device to convert shape patterns extracted from the caption image into corresponding characters, and combines the converted characters to generate the caption string.

The method of claim 1, wherein the caption string generation step
sets a plurality of landmarks in a video frame included in the unit section, and detects the caption image using the color or luminance measured at the landmarks.

The method of claim 6, wherein the caption string generation step
extracts, from among the plurality of landmarks uniformly distributed over the video frame, the landmarks at which a reference color or reference luminance corresponding to the caption image is measured, and specifies the position of the caption image using the extracted landmarks.

The method of claim 1, wherein the keyword generation step
extracts related words by inputting input words extracted from the caption string or the script string into a word2vec model obtained by word embedding using word2vec, and sets the related words as the keyword.

A video service providing method in which a service server provides a video to a terminal device, the method comprising:
a unit section separation step of separating the video into a plurality of unit sections based on a change in a characteristic of a voice included in the video;
a script string generation step of recognizing the voice included in the unit section and generating a script string corresponding to the voice;
a caption string generation step of recognizing a caption image included in the unit section and generating a caption string corresponding to the caption image; and
a keyword generation step of generating a keyword corresponding to the unit section by applying natural language processing to the script string and the caption string,
wherein the keyword generation step
inputs input words extracted from the caption string or the script string into a word2vec model obtained by word embedding using word2vec to extract corresponding related words, and sets the related words as the keyword,
and wherein the keyword generation step comprises:
calculating a similarity between an input word vector corresponding to the input word input to the word2vec model and a related word vector corresponding to the related word;
extracting related word vectors whose similarity is equal to or greater than a limit value, or a preset number of related word vectors selected in descending order of similarity; and
setting the related words corresponding to the extracted related word vectors as the keyword.

The method of claim 9, wherein the calculating of the similarity
extracts a related word corresponding to a search term included in real-time search term information, and adds a weight to the extracted related word when calculating the similarity.

The method of claim 10, wherein the keyword generation step
varies the weight given to the related word corresponding to the search term according to the real-time search ranking of the search term.

The method of claim 1, wherein the keyword generation step
extracts a topic word corresponding to the unit section by applying the script string and the caption string to a machine learning model trained using LDA (Latent Dirichlet Allocation), and sets the topic word as the keyword.

The method of claim 1, wherein the keyword generation step
generates a keyword corresponding to the video by applying natural language processing to the keywords corresponding to the unit sections.

The method of claim 1,
further comprising a search step of searching for a unit section corresponding to a keyword input by a user and providing the retrieved unit section to the user.

The method of claim 1,
further comprising a summary video generation step of extracting, for the same video, the unit sections corresponding to a reference keyword and combining the extracted unit sections to generate a summary video.

16. A computer program stored in a medium in order to execute, in combination with hardware, the video service providing method of any one of claims 1 and 3.

A service server comprising:
a unit section separation unit configured to separate a video into a plurality of unit sections based on a change in a characteristic of a voice included in the video;
a script string generation unit configured to recognize the voice included in the unit section and generate a script string corresponding to the voice;
a caption string generation unit configured to recognize a caption image included in the unit section and generate a caption string corresponding to the caption image; and
a keyword generation unit configured to generate, for each unit section, a semantic-based keyword corresponding to the unit section by applying natural language processing to the script string and the caption string,
wherein the unit section separation unit
extracts, using the change in the characteristic of the voice, a stop section in which a speaker's utterance in the video stops, and separates the video by setting the stop section as a cutting point.

A service server comprising:
a processor; and
a memory coupled to the processor,
wherein the memory includes one or more modules configured to be executed by the processor,
and the one or more modules include instructions for:
separating a video into a plurality of unit sections based on a change in a characteristic of a voice included in the video;
recognizing the voice included in the unit section and generating a script string corresponding to the voice;
recognizing a caption image included in the unit section and generating a caption string corresponding to the caption image;
generating, for each unit section, a semantic-based keyword corresponding to the unit section by applying natural language processing to the script string and the caption string; and
extracting, using the change in the characteristic of the voice, a stop section in which a speaker's utterance in the video stops, and separating the video by setting the stop section as a cutting point.