KR102044540B1

KR102044540B1 - Method and apparatus for creating animation in video

Info

Publication number: KR102044540B1
Application number: KR1020180030927A
Authority: KR
Inventors: 박귀현
Original assignee: 박귀현
Priority date: 2018-03-16
Filing date: 2018-03-16
Publication date: 2019-11-13
Also published as: KR20190109054A

Abstract

The present invention relates to a method and apparatus for generating graphics in a video. To this end, the invention may provide: a subject representative phrase selection module that selects, from input voice data, a subject representative phrase representing the subject of the context; a motion representative phrase selection module that selects, from the voice data, a motion representative phrase representing the motion of the context; a matching module that, from among the effect phrases stored in an effect database storing effect phrases and the graphic effects corresponding to them, generates subject effect information, which is the graphic effect of the effect phrase matching the subject representative phrase, and motion effect information, which is the graphic effect of the effect phrase matching the motion representative phrase; an effect generation module that generates combined effect information based on the subject effect information and the motion effect information; and a video output module that combines the combined effect information with the video data to generate output video information.

Description

Method and apparatus for generating graphics in video {Method and apparatus for creating animation in video}

The present invention relates to a method and apparatus for generating graphics in a video.

Users no longer respond to simple images or text. This is the result of companies having long used provocative images and text for user acquisition, and of the point of contact with users shifting from the PC web to mobile. Users have already adapted to existing communication and media formats, so attracting them now requires a stimulus of a different paradigm. This is related to the recent decline of banner and search advertising.

For users, the new stimulus is video. Time spent watching video on Facebook and YouTube is growing faster than ever, and video viewing is dominant especially among younger users. In the past, YouTube was so unrivaled that the phrase 'video platform' brought only YouTube to mind, but Facebook's recent growth has been very steep.

Facebook posted net income exceeding $1.56 billion in the fourth quarter of 2015, with mobile advertising accounting for 80% of total revenue. In particular, of Facebook's 1.59 billion monthly active users (MAU) in the fourth quarter of 2015, 1.44 billion (90.6%) used Facebook on mobile.

Behind this growth is Facebook's video strategy. In December 2013, Facebook introduced an 'Auto Play' feature for videos uploaded directly by users or page operators. From a video producer's point of view, uploading a video natively to Facebook so that it plays automatically is far preferable to pasting the URL of a YouTube video into a Facebook post, because users end up watching the opening of the video without realizing it and are naturally drawn into it (traction). According to re/code, an edited video of the North American professional ice hockey (NHL) final drew only 1,200 views on YouTube, while the same video recorded 240,000 views on Facebook, thanks to autoplay.

Facebook also adjusted its News Feed algorithm so that videos uploaded directly to Facebook receive more News Feed exposure than link-based YouTube videos.

Through this process, Facebook succeeded in changing the user perception that 'web video = YouTube'. The figures make this easy to confirm. As of mid-2015, Facebook video was recording a total of 4 billion views per day. In September 2014 the total was about 1 billion. In roughly seven months, Facebook video views jumped from 1 billion to 4 billion.

By contrast, YouTube reached 4 billion views per day in early 2012. Having recorded 1 billion views in 2009, YouTube needed about four years to reach 4 billion. It can be inferred that News Feed algorithm adjustments, in addition to autoplay, helped Facebook raise its video views in a relatively short period. In March 2015, Facebook began offering an embed feature for videos, which made it possible for content to spread across the network through Facebook videos. With mobile growing into the main venue for media consumption, the embed feature contributes little to the spread of video. Still, the fact that videos uploaded to Facebook can be encountered in many places is attractive to producers.

In the same vein, the communication and media environment is shifting from chat-based services such as KakaoTalk, Line, WeChat, WhatsApp, and Facebook Messenger to video-based services such as Instagram, Snap, Kwai, and Snow.

Korean Patent Laid-Open Publication No. 10-2004-0100658, Mobile terminal using graphics and virtual video call method thereof, LG Electronics Inc.
Korean Registered Patent No. 10-0759364, Method for synthesizing user-responsive real-time graphics and high-quality graphic video, Korea Advanced Institute of Science and Technology
Korean Registered Patent No. 10-1029612, Method and system for simultaneous playback of graphics and video, Electronics and Telecommunications Research Institute
Korean Registered Patent No. 10-1373020, Method and system for generating graphic art effects from a static image, Samsung Electronics Co., Ltd.

Recent communication and media platforms such as Instagram, Snap, Kwai, and Snow use face recognition on video to apply natural-looking graphic effects to users' faces. However, these graphic effects have the problem that they are produced only when users deliberately select them.

Accordingly, an object of the present invention is to provide a method and apparatus for generating graphics in a video that extract voice data from a video input by a user and generate a graphic effect corresponding to that voice data.

Specific means for achieving the object of the present invention are described below.

The object of the present invention may be achieved by providing an apparatus for generating graphics in a video, comprising: a subject representative phrase selection module that selects, from input voice data, a subject representative phrase representing the subject of the context; a motion representative phrase selection module that selects, from the voice data, a motion representative phrase representing the motion of the context; a matching module that, from among the effect phrases stored in an effect database storing effect phrases and the graphic effects corresponding to them, generates subject effect information, which is the graphic effect of the effect phrase matching the subject representative phrase, and motion effect information, which is the graphic effect of the effect phrase matching the motion representative phrase; an effect generation module that generates combined effect information based on the subject effect information and the motion effect information; and a video output module that combines the combined effect information with the video data to generate output video information, wherein the effect generation module is optimized so that the static graphic of the combined effect information is output close to the subject effect information and so that the dynamic graphic of the combined effect information is output close to the motion effect information, and wherein, when specific video data is input, the apparatus generates specific combined effect information corresponding to the context of the specific video data and combines it with the specific video data to output specific output video information.

In addition, when specific voice data is input instead of specific video data, the apparatus may generate the combined effect information corresponding to the context of the voice data and use the combined effect information and the voice data to output the output video information, which is video information.

Another object of the present invention may be achieved by providing an apparatus for generating graphics in a video, comprising: a memory module that stores effect phrases and the graphic effects corresponding to them, and stores program code for generating graphics in a video; and a processing module operatively coupled to the memory module and executing the program code, wherein the program code comprises: a subject representative phrase selection step of selecting, from input voice data, a subject representative phrase representing the subject of the context; a motion representative phrase selection step of selecting, from the voice data, a motion representative phrase representing the motion of the context; a matching step of generating, from among the effect phrases, subject effect information, which is the graphic effect of the effect phrase matching the subject representative phrase, and motion effect information, which is the graphic effect of the effect phrase matching the motion representative phrase; an effect generation step of generating combined effect information based on the subject effect information and the motion effect information; and a video output step of combining the combined effect information with the video data to generate output video information, and wherein the effect generation step is performed after an optimization that causes the static graphic of the combined effect information to be output close to the subject effect information and an optimization that causes the dynamic graphic of the combined effect information to be output close to the motion effect information.

Another object of the present invention may be achieved by providing a method for generating graphics in a video, comprising: a subject representative phrase selection step in which a subject representative phrase selection module selects, from input voice data, a subject representative phrase representing the subject of the context; a motion representative phrase selection step in which a representative phrase selection module selects, from the voice data, a motion representative phrase representing the motion of the context; a matching step in which a matching module, from among the effect phrases stored in an effect database storing effect phrases and the graphic effects corresponding to them, generates subject effect information, which is the graphic effect of the effect phrase matching the subject representative phrase, and motion effect information, which is the graphic effect of the effect phrase matching the motion representative phrase; an effect generation step in which an effect generation module generates combined effect information based on the subject effect information and the motion effect information; and a video output step in which a video output module combines the combined effect information with the video data to generate output video information, wherein the effect generation module is optimized so that the static graphic of the combined effect information is output close to the subject effect information and so that the dynamic graphic of the combined effect information is output close to the motion effect information, and wherein, when specific video data is input, specific combined effect information corresponding to the context of the specific video data is generated and combined with the specific video data to output specific output video information.

Another object of the present invention may be achieved by providing an apparatus for generating a storytelling video, comprising: a subject representative phrase selection module that selects, from input voice data, a subject representative phrase representing the subject of the context; a motion representative phrase selection module that selects, from the voice data, a motion representative phrase representing the motion of the context; a detection module that detects a specific object in input storytelling image information and generates specific object image information; an image classification module that classifies the specific object image information and generates image classification information; a matching module that generates subject effect information, which is the graphic effect of the image classification information matching the subject representative phrase, and generates motion effect information, which is the graphic effect of the effect phrase matching the motion representative phrase from among the effect phrases stored in an effect database storing effect phrases and the graphic effects corresponding to them; an effect generation module that generates combined effect information based on the subject effect information and the motion effect information; and a video output module that combines the combined effect information with the voice data and the storytelling image information to generate output video information, wherein the effect generation module is optimized so that the static graphic of the combined effect information is output close to the subject effect information and so that the dynamic graphic of the combined effect information is output close to the motion effect information, and wherein, when specific storytelling image information and specific voice data are input, the apparatus generates specific combined effect information corresponding to the context of the specific voice data and combines it with the specific storytelling image information and the specific voice data to output specific output video information, which is a storytelling video.

Another object of the present invention may be achieved by providing an apparatus for generating a storytelling video, comprising: a memory module that stores effect phrases and the graphic effects corresponding to them, and stores program code for generating a storytelling video; and a processing module operatively coupled to the memory module and executing the program code, wherein the program code comprises: a subject representative phrase selection step of selecting, from input voice data, a subject representative phrase representing the subject of the context; a motion representative phrase selection step of selecting, from the voice data, a motion representative phrase representing the motion of the context; a detection step of detecting a specific object in input storytelling image information and generating specific object image information; an image classification step of classifying the specific object image information and generating image classification information; a matching step of generating subject effect information, which is the graphic effect of the image classification information matching the subject representative phrase, and generating motion effect information, which is the graphic effect of the effect phrase matching the motion representative phrase from among the effect phrases stored in an effect database storing effect phrases and the graphic effects corresponding to them; an effect generation step of generating combined effect information based on the subject effect information and the motion effect information; and a video output step of combining the combined effect information with the voice data and the storytelling image information to generate output video information, and wherein the effect generation step is performed after an optimization that causes the static graphic of the combined effect information to be output close to the subject effect information and an optimization that causes the dynamic graphic of the combined effect information to be output close to the motion effect information.

Another object of the present invention may be achieved by providing a method for generating a storytelling video, comprising: a subject representative phrase selection step in which a subject representative phrase selection module selects, from input voice data, a subject representative phrase representing the subject of the context; a motion representative phrase selection step in which a motion representative phrase selection module selects, from the voice data, a motion representative phrase representing the motion of the context; a detection step in which a detection module detects a specific object in input storytelling image information and generates specific object image information; an image classification step in which an image classification module classifies the specific object image information and generates image classification information; a matching step in which a matching module generates subject effect information, which is the graphic effect of the image classification information matching the subject representative phrase, and generates motion effect information, which is the graphic effect of the effect phrase matching the motion representative phrase from among the effect phrases stored in an effect database storing effect phrases and the graphic effects corresponding to them; an effect generation step in which an effect generation module generates combined effect information based on the subject effect information and the motion effect information; and a video output step in which a video output module combines the combined effect information with the voice data and the storytelling image information to generate output video information, wherein the effect generation module is optimized so that the static graphic of the combined effect information is output close to the subject effect information and so that the dynamic graphic of the combined effect information is output close to the motion effect information, and wherein, when specific storytelling image information and specific voice data are input, specific combined effect information corresponding to the context of the specific voice data is generated and combined with the specific storytelling image information and the specific voice data to output specific output video information, which is a storytelling video.

As described above, the present invention provides the following effects.

First, according to an embodiment of the present invention, graphic effects closely related to the context of a video are generated automatically on the videos that users upload.

Second, according to the method and apparatus for generating a storytelling video according to an embodiment of the present invention, a graphic effect is generated on the image of the part that the user is narrating.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description, serve to further the understanding of its technical idea; the present invention should therefore not be construed as limited to the matters shown in the drawings.
FIGS. 1 and 2 illustrate an apparatus for generating graphics in a video according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for generating graphics in a video according to an embodiment of the present invention;
FIGS. 4 and 5 illustrate an apparatus for generating a storytelling video according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating the specific-object detection and classification method of R-CNN;
FIG. 7 is a schematic diagram illustrating the network structure of YOLO;
FIG. 8 is a flowchart illustrating a method for generating a storytelling video according to an embodiment of the present invention.

Hereinafter, embodiments that allow a person of ordinary skill in the art to which the present invention pertains to easily practice the invention are described in detail with reference to the accompanying drawings. However, in describing the operating principles of the preferred embodiments in detail, detailed descriptions of related well-known functions or configurations are omitted where they could unnecessarily obscure the gist of the present invention.

In addition, the same reference numerals are used throughout the drawings for parts having similar functions and operations. Throughout the specification, when a part is said to be connected to another part, this includes not only direct connection but also indirect connection with another element in between. Furthermore, stating that a part includes a component means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

Method and apparatus for generating graphics in a video

FIGS. 1 and 2 illustrate an apparatus for generating graphics in a video according to an embodiment of the present invention. As shown in FIGS. 1 and 2, the apparatus (1) may include a representative phrase selection module (10), a matching module (11), an effect database (12), an effect generation module (13), and a video output module (14). As shown in FIG. 1, the apparatus basically selects a representative phrase from the voice data in video data provided by a user or by a specific algorithm (representative phrase selection module 10), matches the graphic effect information most similar to the selected representative phrase (matching module 11), optimizes the matched graphic effect information to fit the video data (effect generation module 13), and transmits output video information based on the optimized graphic effect information to a user client or server (video output module 14).
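The matching and combining flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the toy three-dimensional embeddings, the contents of `EFFECT_DB`, the effect identifiers, and the use of cosine similarity as the "most similar" criterion are all hypothetical, and the optimization performed by the effect generation module is reduced here to pairing one static and one dynamic effect.

```python
import math

# Hypothetical effect database: effect phrase -> (toy embedding, graphic effect id).
EFFECT_DB = {
    "dog":  ([1.0, 0.0, 0.0], "dog_sticker"),
    "cat":  ([0.0, 1.0, 0.0], "cat_sticker"),
    "jump": ([0.0, 0.0, 1.0], "bounce_animation"),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_effect(phrase_vec):
    """Return the (effect phrase, graphic effect) most similar to the phrase vector."""
    best = max(EFFECT_DB, key=lambda k: cosine(phrase_vec, EFFECT_DB[k][0]))
    return best, EFFECT_DB[best][1]

def generate_combined_effect(subject_vec, motion_vec):
    """Pair the subject effect (static graphic) with the motion effect (dynamic graphic)."""
    _, subject_effect = match_effect(subject_vec)
    _, motion_effect = match_effect(motion_vec)
    return {"static": subject_effect, "dynamic": motion_effect}

# Assumed embeddings for a subject phrase near "dog" and a motion phrase near "jump".
combined = generate_combined_effect([0.9, 0.1, 0.0], [0.1, 0.0, 0.8])
print(combined)  # {'static': 'dog_sticker', 'dynamic': 'bounce_animation'}
```

A video output stage would then overlay `combined` on the video frames; that step is omitted because the text does not specify how the overlay is rendered.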

The representative phrase selection module (10) extracts voice data from video data input by the user and selects, from the voice data, a representative phrase representing the context. According to an embodiment of the present invention, the representative phrase selection module (10) may include a voice extraction module (100), a text conversion module (101), a division module (102), and a subject and motion selection module (103). The representative phrase selection module (10) can select a subject representative phrase related to the subject of the context and a motion representative phrase related to the motion of the context.

The voice extraction module (100) receives the video data input by the user (a real-time stream or a pre-stored video file) and extracts the voice data from it.

The text conversion module (101) converts the extracted voice data into input text.

The division module (102) receives the input text from the text conversion module (101) as a plain character string (normal text) and uses an NLP module to divide that string into phrases such as entities and intents, thereby generating phrase information. The NLP module may specifically include functions such as morphological analysis, stemming, stop-word extraction, TF, and TF-IDF. The divided entities and intents may then be converted into vector values by a vectorization module (Sentence2vec, Word2vec, or SyntaxNet). Word2vec may be used for this vectorization; specifically, an n-gram model, a CBOW model that predicts a word from its context, or a Skip-gram model that predicts the context from a word may be used. In other words, the division module (102) generates phrase information including entities and intents, and the vectorization module expresses the phrase information as vector values (embedding vectors of the phrase information).
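
The difference between the CBOW and Skip-gram models mentioned above lies in which side of a (word, context) pair is predicted. The following minimal sketch only builds the training pairs each model would consume; the function names, the window size, and the toy token list are illustrative assumptions and are not taken from the embodiment.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 1) -> List[Tuple[List[str], str]]:
    """Build (context words, target word) pairs: CBOW predicts the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Build (center word, context word) pairs: Skip-gram predicts each context word from the center."""
    return [
        (center, ctx)
        for ctx_words, center in cbow_pairs(tokens, window)
        for ctx in ctx_words
    ]


# Toy input (hypothetical): tokens left after stop-word removal.
tokens = ["the", "lion", "appeared"]
print(cbow_pairs(tokens))
print(skipgram_pairs(tokens))
```

In practice an existing Word2vec implementation would be trained on such pairs to obtain the embedding vectors; the sketch stops at pair generation.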

The subject and motion selection module (103) analyzes the phrase information divided by the division module (102) with SyntaxNet and selects the representative phrases. According to an embodiment of the present invention, the representative phrases may be selected as a subject representative phrase, such as a noun phrase corresponding to a grammatical subject or object, and a motion representative phrase, such as a verb phrase or a modifier phrase.
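
As a rough illustration of this selection, the sketch below assumes that a syntactic parser has already attached a part-of-speech tag to each token. The tag names, the two tag sets, and the function name are hypothetical; the embodiment states only that noun phrases become subject candidates and verb or modifier phrases become motion candidates.

```python
from typing import Dict, List, Tuple

# Assumed tag sets (Universal-POS-style labels); not specified in the source.
SUBJECT_TAGS = {"NOUN", "PROPN"}
MOTION_TAGS = {"VERB", "ADV", "ADJ"}


def select_representative_phrases(tagged: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Split parser output into subject candidates and motion candidates."""
    result: Dict[str, List[str]] = {"subject": [], "motion": []}
    for word, tag in tagged:
        if tag in SUBJECT_TAGS:
            result["subject"].append(word)
        elif tag in MOTION_TAGS:
            result["motion"].append(word)
    return result


# Hypothetical parser output for the utterance "the lion appeared".
tagged = [("the", "DET"), ("lion", "NOUN"), ("appeared", "VERB")]
phrases = select_representative_phrases(tagged)
print(phrases)
```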

The matching module (11) matches the selected representative phrase (its embedding vector), taken from the phrase information vectorized by the division module (102), with the closest (most similar) effect phrase (its embedding vector) pre-stored in the effect database. That is, among the effect phrases stored in the effect database, it generates subject effect information, which is the graphic effect of the effect phrase matching the subject representative phrase, and motion effect information, which is the graphic effect of the effect phrase matching the motion representative phrase. According to an embodiment of the present invention, cosine similarity may be used to calculate the similarity between phrase information and an effect phrase. Cosine similarity measures the angle between two vectors. When the two vectors point in the same direction, that is, when the angle between them is 0 degrees, the similarity takes its maximum value of 1.0. When the angle between the two vectors is 90 degrees, the similarity is 0, which is treated here as the case of lowest similarity. For example, computing the cosine similarity between "Sweden" and "Norway" yields a high similarity of 0.760124. Accordingly, the matching module (11) matches an effect phrase to each specific subject representative phrase and each specific motion representative phrase, generating subject effect information corresponding to the subject representative phrase and motion effect information corresponding to the motion representative phrase. The subject effect information may in particular mean static graphic information for the subject representative phrase, and the motion effect information may in particular mean dynamic graphic information for the motion representative phrase.
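
The matching described above can be sketched as a nearest-neighbor search under cosine similarity. The three-dimensional vectors and the dictionary standing in for the effect database are toy assumptions made for illustration; real embedding vectors would come from the vectorization module.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """cos(theta) = (a . b) / (|a| |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_effect_phrase(query: List[float], effect_db: Dict[str, List[float]]) -> str:
    """Return the effect phrase whose embedding is most similar to the query embedding."""
    return max(effect_db, key=lambda phrase: cosine_similarity(query, effect_db[phrase]))


# Hypothetical effect database: effect phrase -> embedding vector.
effect_db = {"lion": [0.9, 0.1, 0.0], "rain": [0.0, 0.2, 0.9]}
best = match_effect_phrase([1.0, 0.0, 0.1], effect_db)
print(best)
```

The same lookup would be run twice, once with the subject representative phrase and once with the motion representative phrase.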

The effect database (12) stores the graphic effects corresponding to effect phrases. When the matching module (11) and the effect generation module (13) request (call) the graphic effect for a specific effect phrase, the effect database outputs the graphic effect corresponding to that effect phrase.

The effect generation module (13) generates combined effect information based on the subject effect information and the motion effect information. A generative model such as a VAE (Variational Auto-Encoder) or a GAN (Generative Adversarial Network) may be used to generate a combined effect from a plurality of graphic effects such as the subject effect information and the motion effect information. Training of the effect generation module (13) according to an embodiment of the present invention may be configured so that the static graphic of the combined effect information is optimized to be output close to the subject effect information and the dynamic graphic of the combined effect information is optimized to be output close to the motion effect information.

The effect generation module (13) according to an embodiment of the present invention may include an encoding module (130), an optimization module (131), and a generation module (132).

The encoding module (130) may be configured as a neural network that takes as input data the characteristic information of the subject effect information, of the motion effect information, and, during training, of the combined effect information from the previous epoch, and it encodes the characteristic information to generate a latent vector matrix. A convolutional neural network may be used in the encoding module (130) according to an embodiment of the present invention to generate the latent vector matrix. As for the characteristic information of each kind of effect information, the subject effect information may mean static graphic information and may therefore include edge information, composition information, color information, and the like, while the motion effect information may mean dynamic graphic information and may therefore include movement information, motion information, posture information, facial expression information, and the like.

The optimization module (131) may include a static graphic optimization module (1310) and a dynamic graphic optimization module (1311). It computes a loss function on the differences between the latent vector matrices or feature maps of the subject effect information, the motion effect information, and the combined effect information generated by the encoding module (130), and optimizes the weights of the generation module (132) based on the result. Softmax, cross entropy, or the like may be used as the loss function (cost function) of the optimization module (131) according to an embodiment of the present invention. The main purpose of the optimization module (131) is to generate the combined effect information by applying the motion of the motion effect information to the graphic form of the subject effect information.

The static graphic optimization module (1310) includes a static graphic loss function and minimizes the difference between the static graphic effect of the subject effect information and that of the combined effect information. The static graphic loss function may be configured so that the weights of the generation module (132) converge in the direction that minimizes the difference between the Gram matrix obtained by auto-correlating the feature maps up to layer m in the encoding of the subject effect information and the Gram matrix obtained by auto-correlating the feature maps up to layer m in the encoding of the combined effect information.
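
A Gram-matrix loss of this kind can be sketched in a few lines. The code below treats each layer as a list of flattened channels and uses a mean squared difference between Gram matrices; the squared-error form, the normalization, and the function names are assumptions, since the embodiment states only that the difference between the Gram matrices is minimized.

```python
from typing import List

FeatureMaps = List[List[float]]  # one layer: channels x flattened spatial positions


def gram_matrix(features: FeatureMaps) -> List[List[float]]:
    """G[i][j] = inner product of channel i and channel j (auto-correlation of the feature maps)."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]


def static_graphic_loss(subject_layers: List[FeatureMaps],
                        combined_layers: List[FeatureMaps]) -> float:
    """Sum, over layers 1..m, of the mean squared difference between Gram matrices."""
    loss = 0.0
    for fs, fc in zip(subject_layers, combined_layers):
        gs, gc = gram_matrix(fs), gram_matrix(fc)
        n = len(gs)
        loss += sum((gs[i][j] - gc[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)
    return loss


# Toy feature maps (hypothetical): one layer, two channels, two positions each.
subject_layers = [[[1.0, 0.0], [0.0, 1.0]]]
combined_layers = [[[1.0, 0.0], [0.0, 0.0]]]
loss = static_graphic_loss(subject_layers, combined_layers)
print(loss)
```

Because the Gram matrix discards spatial position and keeps only channel co-activation, a loss built on it compares overall appearance rather than exact layout.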

The dynamic graphic optimization module (1311) includes a dynamic graphic loss function and minimizes the difference between the dynamic graphic effect of the motion effect information and that of the combined effect information. The dynamic graphic loss function may be configured so that the weights of the generation module (132) converge in the direction that minimizes the difference between the feature map at layer n in the encoding of the motion effect information and the feature map at layer n in the encoding of the combined effect information.
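
A direct feature-map comparison at a single layer n might look like the following sketch. The mean squared error is an assumed choice of distance, and the function name and toy values are hypothetical; the embodiment specifies only that the difference between the two feature maps is minimized.

```python
from typing import List


def dynamic_graphic_loss(motion_map: List[List[float]],
                         combined_map: List[List[float]]) -> float:
    """Mean squared difference between two layer-n feature maps of identical shape."""
    flat_motion = [v for channel in motion_map for v in channel]
    flat_combined = [v for channel in combined_map for v in channel]
    if len(flat_motion) != len(flat_combined):
        raise ValueError("feature maps must have the same shape")
    return sum((a - b) ** 2 for a, b in zip(flat_motion, flat_combined)) / len(flat_motion)


# Toy layer-n feature maps (hypothetical): two channels, two positions each.
motion_map = [[0.0, 1.0], [2.0, 3.0]]
combined_map = [[0.0, 1.0], [2.0, 5.0]]
loss = dynamic_graphic_loss(motion_map, combined_map)
print(loss)
```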

The static graphic optimization module (1310) and the dynamic graphic optimization module (1311) may be configured to detect the facial region and the body region of the subject effect information, the motion effect information, and the combined effect information and to optimize each region separately. When the facial region and the body region are optimized separately according to an embodiment of the present invention, the generation module can generate the motion of the facial region and the motion of the body region more precisely.

The generation module (132) consists of a decoder that generates the combined effect information. Its weights are trained in the direction that minimizes the loss function of the optimization module (131), so that it gradually generates combined effect information whose characteristics approach those of the subject effect information and the motion effect information.

The video output module (14) decodes the combined effect information and combines it with the input video data to generate output video information. The video output module (14) overlays the combined effect information on the user's video data and outputs the result to the user client.
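
One simple way to overlay an effect on a video frame is per-pixel alpha blending. The sketch below is an assumption about how the overlay could be realized, since the embodiment says only that the combined effect is overlaid on the video data; the function name, the grayscale pixel values, and the alpha mask are illustrative.

```python
from typing import List

Image = List[List[float]]  # rows of grayscale pixel values


def overlay(frame: Image, effect: Image, alpha_mask: Image) -> Image:
    """Blend the effect onto the frame pixel by pixel: out = a * effect + (1 - a) * frame."""
    return [
        [a * e + (1.0 - a) * f for f, e, a in zip(frame_row, effect_row, alpha_row)]
        for frame_row, effect_row, alpha_row in zip(frame, effect, alpha_mask)
    ]


# Toy 1x2 frame (hypothetical): alpha 0.0 keeps the frame, 0.5 mixes both equally.
frame = [[0.0, 100.0]]
effect = [[200.0, 200.0]]
alpha_mask = [[0.0, 0.5]]
blended = overlay(frame, effect, alpha_mask)
print(blended)
```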

With the in-video graphic generation apparatus according to an embodiment of the present invention, sampling from the probability distribution of the generation module (132) increases the unpredictability of the generated graphics and thereby lowers the likelihood of overfitting, while the optimization against the subject effect information produces graphics whose overall characteristics and structure remain similar to the motion effect information.

FIG. 3 is a flowchart illustrating a method for generating graphics in a video according to an embodiment of the present invention. As shown in FIG. 3, the method may include a representative phrase selection step (S10), a matching step (S11), a combined effect generation step (S12), and a video output step (S13).

In the representative phrase selection step (S10), the representative phrase selection module (10) selects, from the user's video data, a subject representative phrase corresponding to the subject and a motion representative phrase corresponding to the motion.

In the matching step (S11), the matching module (11) matches the representative phrases with the highly similar graphic effects pre-stored in the effect database. In particular, the step may be configured to match separately a graphic effect corresponding to the "subject", such as a noun phrase, and a graphic effect corresponding to the "motion", such as a verb phrase or an adverb phrase. That is, among the effect phrases stored in the effect database, subject effect information, which is the graphic effect of the effect phrase matching the subject representative phrase, and motion effect information, which is the graphic effect of the effect phrase matching the motion representative phrase, can be generated.

In the combined effect generation step (S12), combined effect information is generated based on the subject effect information and the motion effect information.

In the video output step (S13), output video information is generated based on the combined effect information: the combined effect information is overlaid on the user's video data and the result is output to the user client.

With the in-video graphic generation method according to an embodiment of the present invention, when the user inputs a video containing the utterance "사자가 나타났어요" ("A lion appeared"), the graphic effect corresponding to "lion" and the graphic effect corresponding to "appeared" are combined, and an output graphic effect in which the lion appears on screen with a specific motion is overlaid on the user's video and output to the user client.
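
The flow of steps S10 to S13 for this example can be sketched end to end with stub functions. Everything below is a placeholder for illustration: the function names, the toy effect database, the exact-match lookup (standing in for embedding similarity), and the string that represents the combined effect are assumptions, not the generative model of the embodiment.

```python
from typing import Dict, Tuple

# Hypothetical effect database: effect phrase -> stored graphic effect.
EFFECT_DB: Dict[str, str] = {"lion": "lion_sprite", "appeared": "fade_in_animation"}


def select_representative_phrases(transcript: str) -> Tuple[str, str]:
    """S10 (stub): treat the first word as the subject and the second as the motion."""
    subject, motion = transcript.split()[:2]
    return subject, motion


def match_effects(subject: str, motion: str) -> Tuple[str, str]:
    """S11 (stub): exact lookup in place of nearest-embedding matching."""
    return EFFECT_DB[subject], EFFECT_DB[motion]


def generate_combined_effect(subject_effect: str, motion_effect: str) -> str:
    """S12 (stub): a real system would run a trained generator here."""
    return f"{subject_effect}+{motion_effect}"


def output_video(video: str, combined_effect: str) -> Dict[str, str]:
    """S13 (stub): record which effect is overlaid on which video."""
    return {"video": video, "overlay": combined_effect}


def run_pipeline(video: str, transcript: str) -> Dict[str, str]:
    subject, motion = select_representative_phrases(transcript)
    subject_effect, motion_effect = match_effects(subject, motion)
    return output_video(video, generate_combined_effect(subject_effect, motion_effect))


result = run_pipeline("user_video.mp4", "lion appeared")
print(result)
```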

Method and apparatus for generating a storytelling video

FIGS. 4 and 5 illustrate an apparatus for generating a storytelling video according to an embodiment of the present invention. As shown in FIGS. 4 and 5, the storytelling video generation apparatus (2) may include a detection module (20), an image classification module (21), a representative phrase selection module (22), a matching module (23), an effect database (24), an effect generation module (25), and a video output module (26). As shown in FIG. 4, the apparatus sets a bounding box around each region classified as a specific object in storybook image information provided by a user or by a specific algorithm (detection module, 20), classifies the bounding box to generate image tag information (image classification module, 21), selects a representative phrase from video data or voice data provided by a user or by a specific algorithm (representative phrase selection module, 22), matches the image tag information most similar to the selected representative phrase (matching module, 23), applies and generates a specific effect in the bounding box of the corresponding image, optimized for the output storytelling video data on the basis of the matched image tag information (effect generation module, 25), and transmits the output storytelling video data to a user client or a server (video output module, 26).

The detection module (20) and the image classification module (21) detect a specific object in the storybook image information to generate specific object image information, and classify the specific object image information to generate image classification information. FIG. 6 is a flowchart illustrating the object detection and classification method of R-CNN. FIG. 7 is a schematic diagram illustrating the network structure of YOLO. As shown in FIGS. 6 and 7, according to an embodiment of the present invention, an R-CNN family model (R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, etc.) or the YOLO (You Only Look Once) algorithm may be used for the detection module (20) and the image classification module (21).

The representative phrase selection module (22) selects a representative phrase that represents the context from voice data input by the user or by a specific algorithm. The representative phrase selection module (22) according to an embodiment of the present invention may include a text conversion module (220), a division module (221), and a subject and motion selection module (222).

The text conversion module (220) converts the voice data into input text.

The division module (221) receives the input text from the text conversion module (220) as a plain character string (normal text) and uses an NLP module to divide that string into phrases such as entities and intents, thereby generating phrase information. The NLP module may specifically include functions such as morphological analysis, stemming, stop-word extraction, TF, and TF-IDF. The divided entities and intents may then be converted into vector values by a vectorization module (Sentence2vec, Word2vec, or SyntaxNet). Word2vec may be used for this vectorization; specifically, an n-gram model, a CBOW model that predicts a word from its context, or a Skip-gram model that predicts the context from a word may be used. In other words, the division module (221) generates phrase information including entities and intents, and the vectorization module expresses the phrase information as vector values (embedding vectors of the phrase information).

The subject and motion selection module (222) analyzes the phrase information divided by the division module (221) with SyntaxNet and selects the representative phrases. According to an embodiment of the present invention, the representative phrases may be selected as a subject representative phrase, such as a noun phrase corresponding to a grammatical subject or object, and a motion representative phrase, such as a verb phrase, an adverb phrase, or an adjective phrase. In particular, the subject representative phrase may be selected from among the phrases that correspond to the image classification information detected by the detection module and classified by the image classification module. When the subject representative phrase is selected from the image classification information in this way, the effect can be applied, when generating the storytelling video, only to the image of the specific object that is the subject in the storybook image.
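
Restricting subject candidates to objects actually detected in the storybook image can be sketched as a simple filter. The detection dictionary, the (x, y, width, height) box format, and the case-insensitive exact match are assumptions for illustration; the embodiment could equally match by embedding similarity.

```python
from typing import Dict, List, Tuple

BoundingBox = Tuple[int, int, int, int]  # assumed format: (x, y, width, height)


def select_subjects_from_tags(candidate_nouns: List[str],
                              detections: Dict[str, BoundingBox]) -> Dict[str, BoundingBox]:
    """Keep only noun candidates that name a detected object, paired with that object's box."""
    tags = {tag.lower(): box for tag, box in detections.items()}
    return {noun: tags[noun.lower()] for noun in candidate_nouns if noun.lower() in tags}


# Hypothetical detector output: image tag -> bounding box.
detections = {"Lion": (40, 60, 200, 180), "tree": (300, 20, 120, 400)}
subjects = select_subjects_from_tags(["lion", "forest", "morning"], detections)
print(subjects)
```

Returning the bounding box together with the phrase makes it straightforward to apply the effect only inside the region of the matched object.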

The matching module (23) receives from the image classification module (21) the image classification information (including the corresponding specific object image information) that is closest (most similar) to the selected representative phrase (its embedding vector) among the phrase information vectorized by the division module (221), and matches the effect phrases (their embedding vectors) pre-stored in the effect database (24). That is, it generates subject effect information, which is the graphic effect (specific object image information) of the image classification information matching the subject representative phrase, and motion effect information, which is the graphic effect of the effect phrase, among those stored in the effect database, that matches the motion representative phrase. According to an embodiment of the present invention, cosine similarity may be used, for example, to calculate the similarity between the subject representative phrase and the image classification information (mainly static graphics) and between the motion representative phrase and the effect phrases of the effect database (mainly dynamic graphics). Cosine similarity measures the angle between two vectors. When the two vectors point in the same direction, that is, when the angle between them is 0 degrees, the similarity takes its maximum value of 1.0. When the angle between the two vectors is 90 degrees, the similarity is 0, which is treated here as the case of lowest similarity. Accordingly, the matching module (23) can generate subject effect information corresponding to a specific subject representative phrase and motion effect information corresponding to a specific motion representative phrase. The subject effect information may mean the static graphic information of the specific object image information corresponding to the image classification information close to the subject representative phrase (or of the graphic information corresponding to an effect phrase close to the image classification information), and the motion effect information may mean the dynamic graphic information of the graphic information corresponding to the effect phrase close to the motion representative phrase.

The effect database (24) stores the dynamic graphic effects corresponding to effect phrases. When the matching module (23) and the effect generation module (25) request (call) the dynamic graphic effect for a specific effect phrase, the effect database outputs the dynamic graphic effect corresponding to that effect phrase.

The effect generation module (25) generates combined effect information based on the subject effect information and the motion effect information. A generative model such as a VAE (Variational Auto-Encoder) or a GAN (Generative Adversarial Network) may be used to generate a combined effect from a plurality of graphic effects such as the subject effect information and the motion effect information. Training of the effect generation module (25) according to an embodiment of the present invention may be configured so that the static graphic of the combined effect information is optimized to be output close to the subject effect information and the dynamic graphic of the combined effect information is optimized to be output close to the motion effect information.

The effect generation module (25) according to an embodiment of the present invention may include an encoding module (250), an optimization module (251), and a generation module (252).

The encoding module (250) may be configured as a neural network that takes as input data the characteristic information of the subject effect information, of the motion effect information, and, during training, of the combined effect information from the previous epoch, and it encodes the characteristic information to generate a latent vector matrix. A convolutional neural network may be used in the encoding module (250) according to an embodiment of the present invention to generate the latent vector matrix. As for the characteristic information of each kind of effect information, the subject effect information may mean static graphic information and may therefore include edge information, composition information, color information, and the like, while the motion effect information may mean dynamic graphic information and may therefore include movement information, motion information, posture information, facial expression information, and the like.

The optimization module (251) may include a static graphic optimization module (2510) and a dynamic graphic optimization module (2511). It computes a loss function on the differences between the latent vector matrices or feature maps of the subject effect information, the motion effect information, and the combined effect information generated by the encoding module (250), and optimizes the weights of the generation module (252) based on the result. Softmax, cross entropy, or the like may be used as the loss function (cost function) of the optimization module (251) according to an embodiment of the present invention. The main purpose of the optimization module (251) is to generate the combined effect information by applying the motion of the motion effect information to the graphic form of the subject effect information.

The static graphic optimization module 2510 includes a static graphic loss function and optimizes the difference between the static graphic effect of the subject effect information and the static graphic effect of the combined effect information. The static graphic loss function may be configured so that the weights of the generation module 252 converge in the direction that minimizes the difference between (a) the Gram matrix obtained by auto-correlating the feature maps up to Layer m in the encoding information of the subject effect information and (b) the Gram matrix obtained by auto-correlating the feature maps up to Layer m in the encoding information of the combined effect information.
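The Gram-matrix comparison described above can be sketched as follows. This is a minimal illustration, not the implementation disclosed here: the function names `gram_matrix` and `static_graphic_loss`, the (channels, height, width) layout, the normalization by the number of spatial positions, and the use of a mean squared difference are all assumptions made for the example, and random arrays stand in for real encoder feature maps.

```python
import numpy as np

def gram_matrix(feature_map):
    """Auto-correlate one feature map of shape (channels, height, width)."""
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w)   # one row per channel
    return flat @ flat.T / (h * w)         # (channels, channels) correlations

def static_graphic_loss(subject_feats, combined_feats):
    """Sum, over layers 1..m, of the mean squared Gram-matrix difference."""
    total = 0.0
    for f_subject, f_combined in zip(subject_feats, combined_feats):
        diff = gram_matrix(f_subject) - gram_matrix(f_combined)
        total += float(np.mean(diff ** 2))
    return total

# Hypothetical feature maps for two layers (m = 2).
rng = np.random.default_rng(0)
subject = [rng.normal(size=(4, 8, 8)), rng.normal(size=(8, 4, 4))]
combined = [rng.normal(size=(4, 8, 8)), rng.normal(size=(8, 4, 4))]

loss_same = static_graphic_loss(subject, subject)    # identical inputs
loss_diff = static_graphic_loss(subject, combined)   # differing inputs
```

Because the Gram matrix discards spatial positions and keeps only channel-to-channel correlations, a loss of this kind compares appearance statistics rather than layout, which is consistent with its use for the static graphic effect.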

The dynamic graphic optimization module 2511 includes a dynamic graphic loss function and optimizes the difference between the dynamic graphic effect of the motion effect information and the dynamic graphic effect of the combined effect information. The dynamic graphic loss function may be configured so that the weights of the generation module 252 converge in the direction that minimizes the difference between the feature map at Layer n in the encoding information of the motion effect information and the feature map at Layer n in the encoding information of the combined effect information.
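A direct feature-map comparison at a single layer can be sketched as below. The mean squared difference, the function names, and the weighted sum with hypothetical weights `alpha` and `beta` are assumptions for illustration; the text does not specify how the static and dynamic terms are combined.

```python
import numpy as np

def dynamic_graphic_loss(motion_feat_n, combined_feat_n):
    """Mean squared difference between two Layer-n feature maps of equal shape."""
    if motion_feat_n.shape != combined_feat_n.shape:
        raise ValueError("feature maps must have the same shape")
    return float(np.mean((motion_feat_n - combined_feat_n) ** 2))

def total_loss(static_loss, dynamic_loss, alpha=1.0, beta=1.0):
    """Weighted sum of the two terms; alpha and beta are hypothetical weights."""
    return alpha * static_loss + beta * dynamic_loss

# Hypothetical Layer-n feature maps of shape (channels, height, width).
motion_n = np.zeros((2, 3, 3))
combined_n = np.ones((2, 3, 3))
loss_n = dynamic_graphic_loss(motion_n, combined_n)
```

Unlike the Gram-matrix term, this comparison keeps spatial positions, so it penalizes differences in where features occur as well as which features occur.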

The generation module 252 is a module configured as a decoder that generates the combined effect information. Its weights are learned in the direction that minimizes the loss function of the optimization module 251, so that the combined effect information it generates gradually approaches the characteristic information of the subject effect information and the motion effect information.

The image output module 26 decodes the combined effect information and generates output image information in which the storytelling image information and the combined effect information are combined. The output image information and the user voice information are overlapped on the storytelling image information, encoded into a video, and then output to a user client or a server. In an embodiment of the present invention, the image output module 26 may be configured to generate the output image information by overlapping the combined effect information on the image of the specific object detected in the storytelling image information. When the combined effect information is overlapped on the existing image in this way, only the specific object that is the contextual subject has a dynamic graphic effect on the existing storytelling image, which can strengthen the user's attention.
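One way to picture the overlap of an effect on the detected object region is per-pixel alpha blending of a single frame. The sketch below is an assumption made for illustration: the text only says the effect is overlapped, and the function name `overlay_effect`, the RGBA patch format, and the `top`/`left` placement parameters are hypothetical.

```python
import numpy as np

def overlay_effect(base_image, effect_rgba, top, left):
    """Alpha-blend an RGBA effect patch onto an RGB image at (top, left)."""
    out = base_image.astype(np.float64).copy()
    h, w = effect_rgba.shape[:2]
    if top < 0 or left < 0 or top + h > out.shape[0] or left + w > out.shape[1]:
        raise ValueError("effect patch must lie inside the base image")
    region = out[top:top + h, left:left + w]          # view into `out`
    alpha = effect_rgba[..., 3:4] / 255.0             # per-pixel opacity in [0, 1]
    region[:] = alpha * effect_rgba[..., :3] + (1.0 - alpha) * region
    return np.round(out).astype(np.uint8)

# A black 4x4 page and a fully opaque 2x2 grey effect patch.
page = np.zeros((4, 4, 3), dtype=np.uint8)
effect = np.full((2, 2, 4), 200, dtype=np.uint8)
effect[..., 3] = 255
frame = overlay_effect(page, effect, top=1, left=1)
```

Only the pixels under the patch change, which mirrors the point made above that the rest of the storytelling image stays static while the subject object carries the effect.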

According to the storytelling image generating apparatus of an embodiment of the present invention, sampling from the probability distribution of the generation module 252 increases the unpredictability of the generated graphics, which lowers the probability of overfitting. At the same time, optimization against the subject effect information produces graphics whose overall characteristics and structure are similar to those of the motion effect information.

FIG. 8 is a flowchart illustrating a storytelling image generation method according to an embodiment of the present invention. As shown in FIG. 8, the method may include a representative phrase selection step (S20), an object detection and classification step (S21), a matching step (S22), a combined effect generation step (S23), and an image output step (S24).

In the representative phrase selection step (S20), the representative phrase selection module 22 selects, from the voice data, the representative phrases corresponding to the subject and the motion.
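As a toy illustration of this step, the sketch below splits an already-transcribed sentence into a subject phrase and a motion phrase. It is not the selection method of the invention: the particle-stripping heuristic, the `SUBJECT_PARTICLES` list, and the function name `select_representative_phrases` are assumptions chosen only so that the example sentence used later in this description can be traced by hand.

```python
# Common Korean subject/topic particles (a deliberately tiny, hypothetical list).
SUBJECT_PARTICLES = ("이", "가", "은", "는")

def select_representative_phrases(text):
    """Return (subject, motion) from a whitespace-separated transcript."""
    subject, motion = None, None
    for token in text.split():
        if subject is None and len(token) > 1 and token.endswith(SUBJECT_PARTICLES):
            subject = token[:-1]   # strip the particle to get the bare noun
        else:
            motion = token         # keep the last remaining token as the predicate
    return subject, motion

subject, motion = select_representative_phrases("사자가 나타났어요")
```

A real system would rely on morphological analysis rather than suffix matching, since many nouns end in the same syllables as these particles.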

In the object detection and classification step (S21), a specific object is detected in the fairy-tale image information and classified by an artificial neural network to generate image classification information.

In the matching step (S22), subject effect information and motion effect information are generated by matching the image classification information that has high similarity to the subject representative phrase, and the graphic effect pre-stored in the effect database that has high similarity to the motion representative phrase.
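The matching step can be sketched with a simple string-similarity lookup. The similarity measure is not specified here, so `difflib.SequenceMatcher` from the Python standard library is used purely as a stand-in; the function names, the 0.5 threshold, and the dictionary contents (English labels and file or effect names) are hypothetical.

```python
from difflib import SequenceMatcher

def best_match(phrase, candidates, threshold=0.5):
    """Return the candidate most similar to `phrase`, or None below threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, phrase, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None

def match_effects(subject_phrase, motion_phrase, image_labels, effect_db):
    """Pick the subject effect from classified images and the motion effect from the database."""
    subject_key = best_match(subject_phrase, image_labels.keys())
    motion_key = best_match(motion_phrase, effect_db.keys())
    return image_labels.get(subject_key), effect_db.get(motion_key)

# Hypothetical classification results and effect database entries.
image_labels = {"lion": "lion_crop.png", "rabbit": "rabbit_crop.png"}
effect_db = {"appear": "fade_in", "run": "slide_right"}
subject_effect, motion_effect = match_effects("lion", "appeared", image_labels, effect_db)
```

The threshold keeps an unrelated phrase from being forced onto the nearest entry; in that case the lookup returns `None` and no effect is selected.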

In the combined effect generation step (S23), combined effect information is generated based on the subject effect information and the motion effect information.

In the image output step (S24), output image information is generated based on the combined effect information: the combined effect information is overlapped on the fairy-tale image information, encoded into a video, and output to a user client or a server.

According to the storytelling image generation method of an embodiment of the present invention, when a user inputs the utterance "사자가 나타났어요" ("The lion appeared") together with fairy-tale image information that contains a lion, the fairy-tale image corresponding to "lion" is detected and combined with the graphic effect corresponding to "appeared", so that the lion appears on the screen with a specific graphic motion. The fairy-tale image information, including this output graphic effect and the voice data, is then encoded into a video and provided to a user client or a server.

As described above, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention can be implemented in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

The features and advantages described in this specification are not all-inclusive; in particular, many additional features and advantages will be apparent to those skilled in the art in view of the drawings, the specification, and the claims. Moreover, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and may not have been selected to delineate or limit the subject matter of the present invention.

The foregoing description of the embodiments of the present invention has been presented for purposes of illustration. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Those skilled in the art will appreciate that many modifications and variations are possible in light of the above disclosure.

Therefore, the scope of the present invention is limited not by the detailed description but by the claims of any application based on it. Accordingly, the disclosure of the embodiments of the present invention is illustrative and does not limit the scope of the invention set forth in the claims below.

1: apparatus for generating graphics in an image
2: storytelling image generating apparatus
10: representative phrase selection module
11: matching module
12: effect database
13: effect generation module
14: image output module
20: detection module
21: image classification module
22: representative phrase selection module
23: matching module
24: effect database
25: effect generation module
26: image output module
100: voice extraction module
101: text conversion module
102: division module
103: subject and motion selection module
130: encoding module
131: optimization module
132: generation module
220: text conversion module
221: division module
222: subject and motion selection module
250: encoding module
251: optimization module
252: generation module
1310: static graphic optimization module
1311: dynamic graphic optimization module
2510: static graphic optimization module
2511: dynamic graphic optimization module

Claims

An apparatus for generating graphics in an image, comprising:
a subject representative phrase selection module for selecting, from input voice data, a subject representative phrase representing the subject of a context;
a motion representative phrase selection module for selecting, from the voice data, a motion representative phrase representing the motion of the context;
a matching module for generating subject effect information, which is the graphic effect of an effect phrase matching the subject representative phrase among effect phrases stored in an effect database that stores effect phrases and graphic effects corresponding to the effect phrases, and for generating motion effect information, which is the graphic effect of an effect phrase matching the motion representative phrase;
an effect generation module for generating combined effect information based on the subject effect information and the motion effect information; and
an image output module for generating output image information by combining the combined effect information with image data input together with the voice data,
wherein the effect generation module generates the combined effect information such that the difference between the static graphic effect of the subject effect information and the static graphic effect of the combined effect information is minimized, and the difference between the dynamic graphic effect of the motion effect information and the dynamic graphic effect of the combined effect information is minimized.

An apparatus for generating graphics in an image, comprising:
a memory module storing effect phrases and graphic effects corresponding to the effect phrases, and storing program code for generating graphics in an image; and
a processing module operatively coupled to the memory module and executing the program code,
wherein the program code causes the following steps to be performed on a computer:
a subject representative phrase selecting step of selecting, from input voice data, a subject representative phrase representing the subject of a context;
a motion representative phrase selecting step of selecting, from the voice data, a motion representative phrase representing the motion of the context;
a matching step of generating subject effect information, which is the graphic effect of an effect phrase matching the subject representative phrase among the effect phrases, and generating motion effect information, which is the graphic effect of an effect phrase matching the motion representative phrase;
an effect generating step of generating combined effect information based on the subject effect information and the motion effect information; and
an image output step of generating output image information by combining the combined effect information with image data input together with the voice data,
and wherein, in the effect generating step, the combined effect information is generated such that the difference between the static graphic effect of the subject effect information and the static graphic effect of the combined effect information is minimized, and the difference between the dynamic graphic effect of the motion effect information and the dynamic graphic effect of the combined effect information is minimized.

A method for generating graphics in an image, comprising:
a subject representative phrase selecting step in which a subject representative phrase selection module selects, from input voice data, a subject representative phrase representing the subject of a context;
a motion representative phrase selecting step in which a motion representative phrase selection module selects, from the voice data, a motion representative phrase representing the motion of the context;
a matching step in which a matching module generates subject effect information, which is the graphic effect of an effect phrase matching the subject representative phrase among effect phrases stored in an effect database that stores effect phrases and graphic effects corresponding to the effect phrases, and generates motion effect information, which is the graphic effect of an effect phrase matching the motion representative phrase;
an effect generating step in which an effect generation module generates combined effect information based on the subject effect information and the motion effect information; and
an image output step in which an image output module generates output image information by combining the combined effect information with image data input together with the voice data,
wherein the effect generation module generates the combined effect information such that the difference between the static graphic effect of the subject effect information and the static graphic effect of the combined effect information is minimized, and the difference between the dynamic graphic effect of the motion effect information and the dynamic graphic effect of the combined effect information is minimized.

A storytelling image generating apparatus, comprising:
a subject representative phrase selection module for selecting, from input voice data, a subject representative phrase representing the subject of a context;
a motion representative phrase selection module for selecting, from the voice data, a motion representative phrase representing the motion of the context;
a detection module for detecting a specific object in input storytelling image information and generating specific object image information;
an image classification module for classifying the specific object image information to generate image classification information;
a matching module for generating subject effect information, which is the graphic effect of the image classification information matching the subject representative phrase, and for generating motion effect information, which is the graphic effect of an effect phrase matching the motion representative phrase among effect phrases stored in an effect database that stores effect phrases and graphic effects corresponding to the effect phrases;
an effect generation module for generating combined effect information based on the subject effect information and the motion effect information; and
an image output module for generating output image information by combining the combined effect information with the voice data and the storytelling image information,
wherein the effect generation module generates the combined effect information such that the difference between the static graphic effect of the subject effect information and the static graphic effect of the combined effect information is minimized, and the difference between the dynamic graphic effect of the motion effect information and the dynamic graphic effect of the combined effect information is minimized.

A storytelling image generating apparatus, comprising:
a memory module storing effect phrases and graphic effects corresponding to the effect phrases, and storing program code for generating a storytelling image; and
a processing module operatively coupled to the memory module and executing the program code,
wherein the program code causes the following steps to be performed on a computer:
a subject representative phrase selecting step of selecting, from input voice data, a subject representative phrase representing the subject of a context;
a motion representative phrase selecting step of selecting, from the voice data, a motion representative phrase representing the motion of the context;
a detection step of detecting a specific object in input storytelling image information and generating specific object image information;
an image classification step of classifying the specific object image information to generate image classification information;
a matching step of generating subject effect information, which is the graphic effect of the image classification information matching the subject representative phrase, and generating motion effect information, which is the graphic effect of an effect phrase matching the motion representative phrase among effect phrases stored in an effect database that stores effect phrases and graphic effects corresponding to the effect phrases;
an effect generating step of generating combined effect information based on the subject effect information and the motion effect information; and
an image output step of generating output image information by combining the combined effect information with the voice data and the storytelling image information,
and wherein, in the effect generating step, the combined effect information is generated such that the difference between the static graphic effect of the subject effect information and the static graphic effect of the combined effect information is minimized, and the difference between the dynamic graphic effect of the motion effect information and the dynamic graphic effect of the combined effect information is minimized.

A storytelling image generation method, comprising:
a subject representative phrase selecting step in which a subject representative phrase selection module selects, from input voice data, a subject representative phrase representing the subject of a context;
a motion representative phrase selecting step in which a motion representative phrase selection module selects, from the voice data, a motion representative phrase representing the motion of the context;
a detection step in which a detection module detects a specific object in input storytelling image information and generates specific object image information;
an image classification step in which an image classification module classifies the specific object image information to generate image classification information;
a matching step in which a matching module generates subject effect information, which is the graphic effect of the image classification information matching the subject representative phrase, and generates motion effect information, which is the graphic effect of an effect phrase matching the motion representative phrase among effect phrases stored in an effect database that stores effect phrases and graphic effects corresponding to the effect phrases;
an effect generating step in which an effect generation module generates combined effect information based on the subject effect information and the motion effect information; and
an image output step in which an image output module generates output image information by combining the combined effect information with the voice data and the storytelling image information,
wherein the effect generation module generates the combined effect information such that the difference between the static graphic effect of the subject effect information and the static graphic effect of the combined effect information is minimized, and the difference between the dynamic graphic effect of the motion effect information and the dynamic graphic effect of the combined effect information is minimized.

(Deleted)