KR20110080712A

KR20110080712A - Method and system for searching moving picture by voice recognition of mobile communication terminal and apparatus for converting text of voice in moving picture

Info

Publication number: KR20110080712A
Application number: KR1020100001063A
Authority: KR
Inventors: 문영진
Original assignee: 주식회사 엘지유플러스
Priority date: 2010-01-07
Filing date: 2010-01-07
Publication date: 2011-07-13

Abstract

PURPOSE: A video search method using voice recognition in a mobile communication terminal, a system therefor, and an apparatus for converting video speech to text are provided, which search the text contained in each text file, find a video containing text that corresponds to a search word, and play that video. CONSTITUTION: A voice separation and extraction unit of a voice recognition engine extracts only the voice data portion from video data input through a video input unit (S12). A text generation unit converts the speech of each extracted syllable into text according to a speech-to-text conversion algorithm (S13). The text generation unit records a time stamp in the video section where the converted voice data is located (S14). The text generation unit stores the converted syllables in a voice text memory in the form of a text file (S15).

Description

Method and system for searching video by voice recognition of a mobile communication terminal, and apparatus for converting video speech to text {Method and System for Searching Moving Picture by Voice Recognition of Mobile Communication Terminal and Apparatus for Converting Text of Voice in Moving Picture}

The present invention relates to a method and system for searching video by voice recognition in a mobile communication terminal, and to an apparatus for converting video speech to text, which allow a user to find and play a desired video by entering a search word that is matched against the spoken content of videos recorded with the terminal's built-in camera.

In general, mobile communication terminals provide a variety of additional functions beyond basic ones such as voice calls, video calls, and text data communication. Typically, in view of their use as multimedia devices, they incorporate a camera module with performance equivalent to that of an ordinary digital camera, so that they can capture high-resolution still images as well as high-quality video.

In addition, a mobile communication terminal can receive multimedia information such as image data and video data from outside through a multimedia data communication network, such as the wireless Internet, and store it for use.

To store such high-volume multimedia data, mobile communication terminals either have a large-capacity built-in memory or are fitted with a memory card having a capacity on the order of gigabytes.

Meanwhile, when a user of such a terminal wants to play back a video file that was captured with the terminal's own camera or received from outside and stored in the large-capacity memory, the user must either call up the list of video files and look through it one by one, or search by entering, as a search word or keyword, the file name that was assigned arbitrarily when the video was captured or received and stored.

However, with the video search function of such conventional mobile communication terminals, it takes a long time for the user to check one by one the many video files stored in the large-capacity memory. Searching by file name or keyword is also limited, because the user has to remember the file name or keyword of every desired video, so searches for video files fail frequently.

Moreover, even when the desired video is found by browsing the video file list or by entering a file name or keyword, the video file has to be played from its beginning, which makes it difficult to quickly find and play only the desired portion of the video.

Accordingly, the present invention has been made to solve the above problems of the prior art, and an object thereof is to provide a method and system for searching video by voice recognition in a mobile communication terminal, and an apparatus for converting video speech to text, in which the spoken content of a video is converted into text and filed, and the text converted from the speech is used as a search word to find a desired video.

Another object of the present invention is to provide such a method, system, and apparatus in which a video found by using text as a search word can be played from the video section where that text is located.

To achieve the above objects, the method of the present invention provides a video search method using voice recognition in a mobile communication terminal, comprising: a step in which a voice recognition engine of the mobile communication terminal separates and extracts only the voice data from each of at least one item of video data stored in the terminal, converts the spoken content of the voice data into text, and stores the text in association with the respective video data; and a step in which, when a search word is entered to search for desired video data, a microprocessor of the mobile communication terminal extracts the video data whose spoken content includes text corresponding to the search word and causes it to be played.

To achieve the above objects, the system of the present invention provides a video search system using voice recognition in a mobile communication terminal equipped with a camera unit capable of capturing video, the system comprising: a microprocessor that controls conversion into text of the speech contained in video data captured by the camera unit or received through a network and, when a search word for finding a related video is entered, searches the converted text for text corresponding to the search word and controls playback of the video containing the speech of that text; a voice recognition engine that separates and extracts only the voice data from the video data captured by the camera unit or received through the network, converts the spoken content of the voice data into text, and causes the text to be stored in association with the respective video data; a memory in which the video data and the speech-to-text conversion data are stored by the microprocessor; and a display unit that, under the control of the microprocessor, displays on screen the playback state of the video containing the text corresponding to the entered search word.

To achieve the above objects, the apparatus for converting video speech to text of the present invention is provided in a mobile communication terminal that has a camera unit capable of capturing video and that can play video data captured by the camera unit or received through a network, the apparatus comprising: a voice separation and extraction unit that separates and extracts only the voice data from the video data; a sound wave analysis unit that analyzes the sound waves of the extracted voice data and separates only the voice data whose sound waves fall within the human voice band; a syllable extraction unit that extracts, from the voice data corresponding to human speech, only the voice data having certain syllables; and a text generation unit that converts the voice data having the extracted syllables into text characters, generates a text file for each video file from the converted syllables, records a time stamp (TS) in the video section synchronized with the speech of the text, and causes the recording information of each time stamp to be stored in memory together with the text file.

As described above, according to the present invention, when a video is captured with the camera built into a mobile communication terminal or received from outside through a network, the voice data of the video is analyzed, and useful human speech containing syllables is converted into text and stored as a text file. When the user then enters a specific search word to find a desired video, the text contained in each text file is searched to find and play the video having text that corresponds to the search word. This improves the accuracy of video search, allowing the user to find a desired video more accurately and conveniently, and also allows the user to locate and play precisely the desired section within the video that was found.

FIG. 1 is a diagram showing the configuration of a mobile communication terminal to which the video search system using voice recognition according to the present invention is applied.
FIG. 2 is a diagram showing in detail the configuration of the voice recognition engine shown in FIG. 1.
FIG. 3 is a flowchart illustrating the process of converting the speech portion of a video into text in the video search method using voice recognition of a mobile communication terminal according to the present invention.
FIG. 4 is a flowchart illustrating the process of selecting a desired video from the speech-converted text by entering a search word in the video search method using voice recognition of a mobile communication terminal according to the present invention.
FIGS. 5A to 5C are diagrams each showing an example of a screen displayed while searching for a video with desired content by entering a search word, according to a preferred embodiment of the present invention.

Hereinafter, the present invention configured as described above will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram showing the configuration of a mobile communication terminal to which the video search system using voice recognition according to the present invention is applied.

As shown in FIG. 1, the video search system using voice recognition according to the present invention comprises a key input unit 10, a camera unit 12, a program memory 14, a data memory 16, a voice text memory 18, a wireless signal transceiver 20, a wireless signal processor 22, a digital signal processor 24, a display unit 26, a microphone 28, a speaker 30, a microprocessor 32, and a voice recognition engine 34.

The key input unit 10 includes a plurality of numeric/character keys and menu keys for starting and stopping video capture by the camera unit 12, for receiving a video from the network, for entering the file name of a video file, and for entering a text search word.

The camera unit 12 is built into the mobile communication terminal with its lens exposed at a specific part of the terminal, and continuously captures the surroundings in response to a video capture command so that a video file can be generated.

The program memory 14 stores a system operating program that runs the overall system functions of the mobile communication terminal, and a control program that controls the conversion of the speech portion of a video into text and the video search function based on text search word input.

The data memory 16 stores video files captured by the camera unit 12 or received from the network. Either a large-capacity memory chip built into the mobile communication terminal or a large-capacity memory card may be used as this memory.

The voice text memory 18 stores the text data corresponding to the speech extracted from the audio of each video as a text file associated with that video file, and also stores the recording information of the time stamps (TS) that are recorded in the video in synchronization with the syllables (that is, words or idioms) of the text contained in the text file.
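
As a rough illustration of this layout, the contents of the voice text memory 18 could be modeled as one record per video file, each holding the recognized words together with their time stamps. This is a minimal sketch only: the names `TranscriptEntry` and `VideoTranscript`, the use of seconds as the time stamp unit, and the sample file name are assumptions made for illustration and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptEntry:
    """One recognized word or phrase and where it occurs in the video."""
    text: str
    timestamp_s: float  # assumed unit: seconds from the start of the video


@dataclass
class VideoTranscript:
    """Text file contents associated with a single video file."""
    video_file: str
    entries: List[TranscriptEntry] = field(default_factory=list)

    def add(self, text: str, timestamp_s: float) -> None:
        self.entries.append(TranscriptEntry(text, timestamp_s))


transcript = VideoTranscript("birthday.mp4")  # hypothetical file name
transcript.add("happy birthday", 12.5)
transcript.add("cake", 47.0)
```

Keeping the time stamp next to each piece of text is what later allows playback to begin at the matching section rather than at the start of the file.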

The wireless signal transceiver 20 transmits and receives RF signals by communicating wirelessly with a base station of the mobile communication network through an antenna. The wireless signal processor 22 converts an RF signal received through the wireless signal transceiver 20 into an intermediate-frequency signal and then into a digital information signal; it also converts a digital information signal from the microprocessor 32 and the digital signal processor 24 into an intermediate-frequency signal and then into an RF signal, which is transmitted through the wireless signal transceiver 20.

The digital signal processor 24 performs digital signal processing on the digital information signal from the wireless signal processor 22 so that it is displayed on the display unit 26 and output as sound through the speaker 30. During video capture by the camera unit 12, it also receives the ambient sound picked up by the microphone 28, processes it digitally, and supplies the result.

The microprocessor 32 causes the camera unit 12 to capture video in response to key input on the key input unit 10, causes the audio signal picked up by the microphone 28 and supplied by the digital signal processor 24 to be combined with the captured video as its audio, and performs control for receiving a video file desired by the user from a remote location over the wireless network and storing it.

The microprocessor 32 also causes the speech portion of video data captured by the camera unit 12, or of a video file received from the network, to be converted into text by the voice recognition engine 34, and causes the resulting text file to be stored in the voice text memory 18 together with the recording information of the time stamps recorded in the video sections.

When a search word for video search is entered on the key input unit 10, the microprocessor 32 searches the text of each text file stored in the voice text memory 18, retrieves the video file associated with any text file containing text that corresponds to the entered search word, displays it on screen as a search result, and plays the video selected by the user.

When playing a video found by entering a text search word, the microprocessor 32 refers to the time stamp recording information so that playback starts from the video section where the speech of the search word text is located.

Alternatively, even when the video section containing the speech of the search word text has been found, the microprocessor 32 can play the video from its beginning, according to a user setting.
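
The two playback behaviors described here (start at the matching section, or start from the beginning if the user prefers) can be sketched as a single lookup. The function name `playback_start`, the `from_beginning` flag, the `(text, timestamp_s)` pair layout, and plain substring matching are all illustrative assumptions rather than details taken from the specification.

```python
def playback_start(entries, query, from_beginning=False):
    """Return the position (in seconds) at which playback should begin.

    entries: list of (text, timestamp_s) pairs for one video.
    Returns None when the query does not occur in the transcript.
    """
    for text, timestamp_s in entries:
        if query in text:
            # The user setting overrides the time stamp of the match.
            return 0.0 if from_beginning else timestamp_s
    return None


entries = [("happy birthday", 12.5), ("cake", 47.0)]
start = playback_start(entries, "cake")
start_override = playback_start(entries, "cake", from_beginning=True)
```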

The voice recognition engine 34 converts human speech that is useful for text search, from the audio portion of a video captured by the camera unit 12, into text verbatim, and generates from the converted text data a text file corresponding to the video file.

The voice recognition engine 34 may be built into the mobile communication terminal as a hardware chip, or may equally be provided as a software program implementing each of its functions.

As shown in FIG. 2, the voice recognition engine 34 comprises a video input unit 40, a voice separation and extraction unit 42, a sound wave analysis unit 44, a syllable extraction unit 46, a syllable search unit 48, a text generation unit 50, and a buffer 52.

The video input unit 40 serves as the port that receives the video data, and the voice separation and extraction unit 42 extracts only the voice data portion from the video data.

If the microprocessor 32 separately extracts the audio signal from the digital signal processor 24 and supplies it, there is no need to extract the audio from the video again, so the voice separation and extraction unit 42 may be omitted.

The sound wave analysis unit 44 analyzes the sound waves of the voice data separated and extracted from the video, and separates out only the voice data whose sound waves fall within the human voice band.
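
The specification does not say how the sound waves are analyzed or which frequency range counts as the human voice band. Purely as a toy stand-in, the sketch below estimates the dominant frequency of an audio frame from its zero crossings and checks it against an assumed band of 300 to 3400 Hz (the range traditionally used in telephony). The function names, the band limits, and the zero-crossing technique itself are assumptions; a real engine would use a far more robust method.

```python
import math


def estimate_frequency(frame, sample_rate):
    """Estimate the dominant frequency (Hz) of a frame from its zero crossings."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    duration_s = len(frame) / sample_rate
    # A periodic signal crosses zero twice per cycle.
    return crossings / (2 * duration_s)


def in_voice_band(frame, sample_rate, low_hz=300.0, high_hz=3400.0):
    """True if the frame's estimated frequency lies in the assumed voice band."""
    return low_hz <= estimate_frequency(frame, sample_rate) <= high_hz


def sine(freq_hz, sample_rate=8000, n=800):
    """Generate a test tone of n samples."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
```

With these assumptions, a 440 Hz test tone is kept and a 50 Hz hum is rejected.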

The syllable extraction unit 46 extracts, from the voice data identified as human speech by the sound wave analysis unit 44, only the voice data having certain syllables, that is, meaningful words or idioms. The syllable search unit 48 continuously searches the voice data for such syllables.

To find the portions of voice data that correspond to meaningful words or idioms, the syllable search unit 48 may use a conventional syllable analysis algorithm or a syllable analysis database.

Syllables of meaningful words or idioms are sought in order to make text comparison faster and more convenient when a text search word is entered for video search, and to reduce the probability that the text search fails; for example, monosyllabic screams, cheers, and slang that is not in general use are excluded.
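
One simple way to picture this exclusion is a vocabulary lookup applied to the recognized tokens. The sketch below assumes the "syllable analysis database" can be approximated by a set of known words; the set contents, the name `keep_meaningful`, and the case-insensitive comparison are hypothetical choices, not the method the specification prescribes.

```python
# Hypothetical stand-in for the syllable analysis database: a small vocabulary.
VOCABULARY = {"happy", "birthday", "cake", "beach"}


def keep_meaningful(tokens, vocabulary=VOCABULARY):
    """Keep only recognized tokens that appear in the vocabulary."""
    return [t for t in tokens if t.lower() in vocabulary]


recognized = ["Aaah", "happy", "birthday", "woohoo", "cake"]
kept = keep_meaningful(recognized)  # screams and cheers are dropped
```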

The text generation unit 50 converts the voice data having the syllables extracted by the syllable extraction unit 46 into text characters, using a specific speech-to-text conversion algorithm.

When converting the voice data of each syllable into text, the text generation unit 50 causes a time stamp (TS) to be recorded in the section of the video data passing through the buffer 52 that is synchronized with that speech.

The buffer 52 buffers and outputs the video data so that it stays synchronized with the sound wave analysis, syllable extraction, and text generation performed on the voice data; under the control of the text generation unit 50, a time stamp is recorded in each video section containing the speech that corresponds to a piece of text before the video data is output.

The syllables converted into text by the text generation unit 50 are assembled into a text file (*.txt) for each video file, synchronized with the recording information of the time stamps recorded in the video sections within the buffer 52, and stored in the voice text memory 18.
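
The specification fixes only the `.txt` form of the file, not its internal layout. The sketch below assumes one tab-separated "time stamp, text" pair per line and derives the text file name from the video file name, in line with the later statement that the text file shares its file name with the video file. The helper names and the line format are illustrative assumptions.

```python
from pathlib import Path


def transcript_path(video_path):
    """Text file that shares the video's base name, e.g. clip.mp4 -> clip.txt."""
    return Path(video_path).with_suffix(".txt")


def format_transcript(entries):
    """Serialize (timestamp_s, text) pairs, one tab-separated pair per line."""
    return "".join(f"{ts:.1f}\t{text}\n" for ts, text in entries)


def parse_transcript(content):
    """Inverse of format_transcript."""
    entries = []
    for line in content.splitlines():
        ts, text = line.split("\t", 1)
        entries.append((float(ts), text))
    return entries


sample = [(12.5, "happy birthday"), (47.0, "cake")]
serialized = format_transcript(sample)
```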

Next, the process of converting the speech portion of a video into text in the present invention configured as described above will be described in detail with reference to the flowchart of FIG. 3.

First, when a video capture command is entered by user key input on the key input unit 10, the microprocessor 32 causes the camera unit 12 to capture the surroundings as video, and causes the ambient sound picked up by the microphone 28 to be digitally processed by the digital signal processor 24 and then combined with the captured video data (step S10).

In that state, the microprocessor 32 determines whether a key input for ending video capture has been made on the key input unit 10 (step S11). If it determines that such a key input has been made, it causes the voice recognition engine 34 to convert the speech in the video data captured by the camera unit 12 into text.

That is, the voice separation and extraction unit 42 of the voice recognition engine 34 extracts only the voice data portion from the video data input through the video input unit 40 (step S12).

Next, the sound wave analysis unit 44 extracts from the extracted voice data only the voice data in the human voice band, the syllable search unit 48 and the syllable extraction unit 46 extract the voice data portions having meaningful syllables, and the text generation unit 50 converts the speech of the syllables extracted by the syllable extraction unit 46 into text characters according to the speech-to-text conversion algorithm (step S13).

At this time, the text generation unit 50 causes a time stamp to be recorded in the section of the video data buffered and output through the buffer 52 where the voice data converted into text is located (step S14), and causes the text to be stored in the voice text memory 18 as a text file, synchronized with the recording information of that time stamp (step S15).
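
Steps S12 to S15 can be read as a small pipeline. The sketch below only shows the order of operations: the three callables passed in (`extract_audio`, `split_voice_segments`, `recognize`) are hypothetical placeholders for the units of the voice recognition engine 34, and the toy data at the bottom exists solely so the sketch runs.

```python
def convert_video_speech(video, extract_audio, split_voice_segments, recognize):
    """Run steps S12 to S15 on one video and return its transcript entries.

    extract_audio(video) -> audio                         (S12)
    split_voice_segments(audio) -> [(start_s, segment)]   (voice band, syllables)
    recognize(segment) -> text or None                    (S13)
    """
    audio = extract_audio(video)
    entries = []
    for start_s, segment in split_voice_segments(audio):
        text = recognize(segment)
        if text:  # skip segments that yield no meaningful words
            entries.append((start_s, text))  # S14: time stamp paired with text
    return entries  # S15: the caller stores this alongside the video


# Toy stand-ins so the sketch runs; a real engine would replace all three.
demo = convert_video_speech(
    {"audio": [(12.5, "seg-a"), (30.0, "seg-b"), (47.0, "seg-c")]},
    extract_audio=lambda v: v["audio"],
    split_voice_segments=lambda a: a,
    recognize={"seg-a": "happy birthday", "seg-c": "cake"}.get,
)
```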

Next, the process of selecting a desired video from the speech-converted text by entering a search word will be described in detail with reference to the flowchart of FIG. 4 and FIGS. 5A to 5C.

First, in response to user key input on the key input unit 10, a list of the videos stored in the data memory 16 and a menu for handling them (for example, 1. Search videos, 2. Move video, 3. Delete video) are displayed on the screen of the display unit 26, as shown in FIG. 5A (step S20).

In that state, when the user selects the video search menu and a screen for entering a search word is displayed on the display unit 26, as shown in FIG. 5B, the user enters text related to the desired video as a search word by key input on the key input unit 10 (step S21).

The microprocessor 32 then retrieves each of the text files stored in the voice text memory 18 for the respective video files, and compares the syllables of the text in each file with the search word to find a corresponding word or idiom (step S22).

The microprocessor 32 determines whether a word or idiom corresponding to the search word entered by the user exists in any text file (step S23). If none exists, it causes a search failure status to be displayed on the display unit 26 (step S24).
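
Steps S22 to S24 amount to scanning every stored transcript for the search word and reporting failure when nothing matches. The sketch below assumes transcripts are held as a dictionary keyed by video file name and that "corresponding" means a plain substring match; both are simplifying assumptions, as are the function and file names.

```python
def search_videos(transcripts, query):
    """Return {video_file: first matching timestamp} for transcripts containing query.

    transcripts: dict mapping video file name -> list of (timestamp_s, text).
    An empty result corresponds to the search failure branch (S24).
    """
    hits = {}
    for video_file, entries in transcripts.items():
        for timestamp_s, text in entries:
            if query in text:
                hits[video_file] = timestamp_s
                break  # the first match is enough to list this video
    return hits


transcripts = {
    "birthday.mp4": [(12.5, "happy birthday"), (47.0, "cake")],
    "beach.mp4": [(5.0, "look at the waves")],
}
hits = search_videos(transcripts, "cake")
misses = search_videos(transcripts, "snow")
```

Returning the time stamp with each hit gives the caller what it needs both to build a thumbnail of the matching section and to start playback there.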

If, however, the microprocessor 32 determines that a word or idiom corresponding to the search word exists in a text file, it loads from the data memory 16 the video file whose file name matches that of the text file containing the word or idiom (step S25). Referring to the time stamp recording information stored with each text file in the voice text memory 18, it converts the image of the video section where the speech of the search word is located into a thumbnail image, and displays the thumbnail images and the file names of the corresponding video files together as a list on the screen of the display unit 26, as shown in FIG. 5C (step S26).

In that state, the microprocessor 32 determines whether a command to play a specific video has been entered by user key input on the key input unit 10 (step S27). If it determines that such a key input has been made, it plays that video from the video section where the voice data of the text matching the search word is located (step S28).

Although specific embodiments of the present invention have been described and illustrated above, it is apparent that the present invention may be modified and practiced in various ways by those skilled in the art. Such modified embodiments should not be understood separately from the technical spirit or prospect of the present invention, and should be regarded as falling within the claims appended to the present invention.

10: key input unit, 12: camera unit,
14: program memory, 16: data memory,
18: voice text memory, 20: wireless signal transceiver,
22: wireless signal processing unit, 24: digital signal processing unit,
26: display unit, 28: microphone,
30: speaker, 32: microprocessor,
34: voice recognition engine, 42: voice separation extraction unit,
44: sound wave analysis unit, 46: syllable extraction unit,
48: syllable search unit, 50: text generation unit.

Claims

A video search method through voice recognition of a mobile communication terminal, comprising: a first step in which a voice recognition engine of the mobile communication terminal separates and extracts only the voice data from each of at least one piece of video data stored in the mobile communication terminal, converts the voice content of the voice data into text, and stores the text in correspondence with the corresponding video data; and
a second step in which, when a search word is input to search for desired video data, a microprocessor of the mobile communication terminal extracts and plays the video data whose voice content corresponds to the text matching the search word.

The method of claim 1, wherein the first step comprises:
separating and extracting only the voice data from the video data by a voice separation extraction unit of the voice recognition engine;
analyzing the sound waves of the extracted voice data by a sound wave analysis unit of the voice recognition engine to separate only the voice data whose sound waves correspond to the human voice band;
extracting, by a syllable extraction unit of the voice recognition engine, only the voice data having a predetermined syllable from the voice data corresponding to the human voice; and
converting, by a text generation unit of the voice recognition engine, the voice data having the extracted syllables into text characters, and generating and storing the text-converted syllables in the form of a text file for each video file.

The method of claim 2,
wherein the first step further comprises recording, by the text generation unit, a time stamp (TS) in the video section synchronized with the text-converted voice, and storing the recording information of each time stamp together with the text file.

The method of claim 2, wherein the second step comprises:
when a search word is input to search for desired video data, searching, by the microprocessor, the text file of each video for text corresponding to the search word to find the matching text file;
loading the video file corresponding to the text file that contains the search word and displaying it in a list; and
when a specific video is selected by the user, playing the selected video.

The method of claim 3, further comprising:
when displaying in a list the video file corresponding to the text file that contains the search word, displaying the video section in which the search word content is located as a thumbnail image with reference to the recording information of each time stamp; and
when a specific video is selected by the user, playing the video from the video section in which the search word content is located.

In a mobile communication terminal having a camera unit capable of capturing video:
a microprocessor that controls conversion into text of the voice included in video data captured by the camera unit or video data received through a network and that, when a search word for searching for a related video is input, searches the converted text content for text corresponding to the search word and controls playback of the video containing the text corresponding to the search word;
a voice recognition engine that separates and extracts only the voice data from the video data captured by the camera unit or the video data received through the network, converts the voice content of the voice data into text, and stores the text in correspondence with the video data;
a memory that stores the video data and the text conversion data of the voice under the control of the microprocessor; and
a display unit that displays, under the control of the microprocessor, the playback state of the video containing the text corresponding to the input search word.

According to claim 6,
wherein the voice recognition engine comprises: a voice separation extraction unit that separates and extracts only the voice data from the video data;
a sound wave analysis unit that analyzes the sound waves of the extracted voice data and separates only the voice data whose sound waves correspond to the human voice band;
a syllable extraction unit that extracts only the voice data having a predetermined syllable from the voice data corresponding to the human voice; and
a text generation unit that converts the voice data having the extracted syllables into text characters, generates the text-converted syllables in the form of a text file for each video file, records a time stamp (TS) in the video section synchronized with the text-converted voice, and stores the recording information of each time stamp in the memory together with the text file.

According to claim 7,
wherein the microprocessor displays on the display unit the video file corresponding to the text file that contains the search word, displays the video section in which the search word content is located as a thumbnail image with reference to the recording information of each time stamp, and, when a specific video is selected by the user, controls playback to start from the video section in which the search word content is located.

In a mobile communication terminal that has a camera unit capable of capturing video and that can play back video data captured by the camera unit or video data received through a network:
a voice separation extraction unit that separates and extracts only the voice data from the video data;
a sound wave analysis unit that analyzes the sound waves of the extracted voice data and separates only the voice data whose sound waves correspond to the human voice band;
a syllable extraction unit that extracts only the voice data having a predetermined syllable from the voice data corresponding to the human voice; and
a text generation unit that converts the voice data having the extracted syllables into text characters, generates the text-converted syllables in the form of a text file for each video file, records a time stamp (TS) in the video section synchronized with the text-converted voice, and stores the recording information of each time stamp in a memory together with the text file.