WO2008111760A1 - Method and apparatus for providing video synthesizing call service using voice recognition

Info

Publication number: WO2008111760A1
Application number: PCT/KR2008/001268
Authority: WO
Inventors: Gil-Soo Lee; Bong-Kyu Heo
Original assignee: Ti Square Technology Ltd.
Priority date: 2007-03-12
Filing date: 2008-03-06
Publication date: 2008-09-18

Abstract

The present invention relates to a method and apparatus for providing a video synthesis call service using voice recognition. The present invention provides a method comprising the steps of extracting a voice signal from a video call signal transmitted between the calling video terminal and the called video terminal; extracting one or more words contained in the extracted voice signal; searching for images or video corresponding to the extracted words; synthesizing found images or video with the video call signal transmitted between the calling video terminal and the called video terminal; and transmitting the synthesized video call signal to either or both of the calling video terminal and the called video terminal. The present invention also provides an apparatus using the method.

Description

METHOD AND APPARATUS FOR PROVIDING VIDEO SYNTHESIZING CALL SERVICE USING VOICE RECOGNITION

Technical Field

[1] The present invention relates, in general, to a method and apparatus for providing a video synthesis (video overlay) call service using voice recognition, and, more particularly, to a method and apparatus, which can analyze the content of a voice call of a video terminal user during a video call conversation, synthesize (overlay) images or video corresponding to words spoken by the user with a video call signal in real time, and provide a synthesized video call signal to a video terminal.

Background Art

[2] Recently, with the rapid development of mobile communication technology, a user can make a call while personally viewing the image of the other party, rather than merely making a voice call, in a Wideband Code Division Multiple Access (WCDMA) environment or the like, and various types of supplementary service using such a video call have been gradually developed and have been provided to users. However, a conventional video call service is disadvantageous in that, since video is generally transmitted around the face of a user, the content of a video call is simple and the call may be more unnatural than a voice call. In consideration of this fact, a video synthesis method of synthesizing background images has been proposed so as to provide a decorative effect to a screen. However, such a method is problematic in that, since still images are generally used, the limitation of the simplicity of a screen cannot be overcome, and a user must perform setting of a decorative effect every time in order to change such a simple screen. For example, there is inconvenience in that a user must store images or video for background images in his or her terminal, and determine whether to use the images or video during a video call conversation and which background image will be used.

Disclosure of Invention

Technical Problem

[3] Accordingly, the present invention has been made keeping in mind the above problems occurring in the prior art, and an object of the present invention is to provide a method and apparatus for providing a video synthesis call service, which analyze the content of voices input by a user during a video call conversation in real time, synthesize images or video corresponding to words contained in a voice signal, with a video call signal, transmitted between a calling video terminal and a called video terminal, in real time, and provide a synthesized signal to the terminal of the other party, thus increasing video call users' interest, and transmitting various types of video screens in real time to overcome the simplicity of a typical video call.

[4] Another object of the present invention is to provide a method and apparatus for providing a video synthesis call service, which automatically provide screens, synthesized with various types of stereographical video or images, to the video terminal of the other party during a call, without requiring the user of a video terminal to perform specific manipulation in each video call.

[5] A further object of the present invention is to provide a method and apparatus for providing a video synthesis call service, which download video or images corresponding to words, contained in voice content, from an external video or image provision device during a video call conversation, store the downloaded video or images, synthesize the video or images with a video call signal with respect to the same words when subsequently spoken, and transmit a synthesized video call signal, thus enabling various types of video or images to be synthesized.

Technical Solution

[6] In accordance with an aspect of the present invention to accomplish the above objects, there is provided a method of providing a video synthesis call service using voice recognition in an apparatus for providing a video synthesis call service using voice recognition, the apparatus being connected to a calling video terminal and a called video terminal through a video call network, comprising the steps of extracting a voice signal from a video call signal transmitted between the calling video terminal and the called video terminal; extracting one or more words from the extracted voice signal; searching for images or video corresponding to the extracted words; synthesizing the found images or video with the video call signal transmitted between the calling video terminal and the called video terminal; and transmitting the synthesized video call signal to at least one of the calling video terminal and the called video terminal.

[7] Preferably, the step of extracting the words from the extracted voice signal may be performed to convert the extracted voice signal into a sentence and to extract one or more words from the sentence by separating the sentence into one or more words.

[8] Further, at the step of searching for images or video corresponding to the extracted words, the images or video may be stored in advance to correspond to respective words.

[9] Further, the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal may be performed at regular periods.

[10] Further, the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal may be performed only when an amplitude of the voice signal is greater than a predetermined level.

[11] In accordance with another aspect of the present invention to accomplish the above objects, there is provided an apparatus for providing a video synthesis call service, the apparatus being connected to a calling video terminal and a called video terminal through a video call network and configured to provide the video synthesis call service, comprising a video call network cooperation unit for receiving a video call signal, transmitted between the calling video terminal and the called video terminal, in cooperation with the video call network; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call network cooperation unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence, converted by the voice recognition processing unit, by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting the images or video stored in advance to correspond to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal transmitted between the calling video terminal and the called video terminal, and transmitting the video call signal synthesized with the images or video to the video call network cooperation unit, wherein the video call network cooperation unit transmits the video call signal synthesized with the images or video to at least one of the calling video terminal and the called video terminal.

[12] Preferably, the apparatus may further comprise an image/video database for storing images or video corresponding to one or more words.

[13] In accordance with a further aspect of the present invention to accomplish the above objects, there is provided a video terminal for providing a video synthesis call service through a video call network, comprising a video call signal reception unit for receiving a video call signal; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call signal reception unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence output from the voice recognition processing unit by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting the images or video stored in advance to correspond to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal, and transmitting the video call signal synthesized with the images or video to a video/voice communication unit, wherein the video/voice communication unit receives the video call signal synthesized with the images or video from the image/video synthesis unit, and transmits the synthesized video call signal to a video terminal of another party.

[14] Preferably, the video terminal may further comprise an image/video database for storing images or video corresponding to one or more words.

[15] Further, the video terminal may further comprise a download processing unit for downloading images or video from an external image/video provision device, and transmitting the images or video to the image/video database.
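The download-and-store behavior described for the download processing unit and the image/video database can be illustrated with a minimal sketch. The class name `ImageVideoStore`, the `download` callable, and the byte-string media values are all hypothetical and do not appear in the disclosure; the sketch only assumes that media downloaded for a word is stored so that the same word, when subsequently spoken, is served from the store rather than downloaded again.

```python
from typing import Callable, Dict, Optional


class ImageVideoStore:
    """Hypothetical stand-in for the image/video database plus download unit."""

    def __init__(self, download: Callable[[str], Optional[bytes]]) -> None:
        # `download` models the external image/video provision device (assumption).
        self._download = download
        self._media: Dict[str, bytes] = {}

    def lookup(self, word: str) -> Optional[bytes]:
        """Return stored media for `word`, downloading and storing it on first use."""
        if word in self._media:
            return self._media[word]
        media = self._download(word)
        if media is not None:
            # Store the result so the same word is not downloaded again.
            self._media[word] = media
        return media


# Usage: a fake provision device that only knows the word "beer".
download_calls = []


def fake_download(word: str) -> Optional[bytes]:
    download_calls.append(word)
    return b"img-beer" if word == "beer" else None


store = ImageVideoStore(fake_download)
first = store.lookup("beer")
second = store.lookup("beer")   # served from the store, no second download
missing = store.lookup("today")
```

In this sketch a word with no available media is not stored, so a later lookup would query the external device again; whether that is desirable is a design choice the disclosure does not address.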

Advantageous Effects

[16] Accordingly, the present invention can provide a method and apparatus for providing a video synthesis call service, which analyze the content of voices input by a user during a video call conversation in real time, synthesize images or video corresponding to words contained in a voice signal, with a video call signal, transmitted between a calling video terminal and a called video terminal, in real time, and provide a synthesized signal to the terminal of the other party, thus increasing video call users' interest, and transmitting various types of video screens in real time to overcome the simplicity of a typical video call.

[17] Further, the present invention can provide a method and apparatus for providing a video synthesis call service, which automatically provide screens, synthesized with various types of stereographical video or images, to the video terminal of the other party during a call, without requiring the user of a video terminal to perform specific manipulation in each video call.

[18] Furthermore, the present invention can provide a method and apparatus for providing a video synthesis call service, which download video or images corresponding to words, contained in voice content, from an external video or image provision device during a video call conversation, store the downloaded video or images, synthesize the video or images with a video call signal with respect to the same words when subsequently spoken, and transmit a synthesized video call signal, thus enabling various types of video or images to be synthesized.

Brief Description of the Drawings

[19] FIG. 1 is a diagram of the entire construction showing the connection of an apparatus 30 for providing a video synthesis call service using voice recognition, a calling video terminal 10 and a called video terminal 20 according to the present invention;

[20] FIG. 2 is a block diagram showing the detailed construction of the video synthesis call service provision apparatus 30 of FIG. 1;

[21] FIG. 3 is a flowchart showing the procedure of the video synthesis call service of FIGS. 1 and 2;

[22] FIG. 4 is a diagram showing a method of extracting words contained in a voice signal in FIG. 3;

[23] FIG. 5 is a diagram showing a method of searching for images or video corresponding to the extracted words in FIG. 3; and

[24] FIG. 6 is a diagram showing the detailed construction of a video terminal 60 capable of providing a video synthesis call service using voice recognition according to another embodiment of the present invention.

Best Mode for Carrying Out the Invention

[25] Hereinafter, the construction of the present invention will be described in detail with reference to the attached drawings.

[26] FIG. 1 is a diagram of the entire construction showing the connection of an apparatus for providing a video synthesis (video overlay) call service using voice recognition, a calling video terminal and a called video terminal according to the present invention.

[27] Referring to FIG. 1, an apparatus 30 for providing a video synthesis call service according to the present invention is connected to a calling video terminal 10 and a called video terminal 20 through a video call network. The calling video terminal 10 and the called video terminal 20 perform a video call over the video call network, and are connected to the video synthesis call service provision apparatus 30 through the video call network, thus being provided with a video synthesis call service. The video synthesis call service provision apparatus 30 extracts and recognizes a voice signal from the video call signal of the user of the calling video terminal 10 or the called video terminal 20 in real time, detects words contained in the voice signal, searches for images or video corresponding to the detected words, synthesizes found images or video with the video call signal, and transmits the synthesized video call signal to either or both of the calling video terminal 10 and the called video terminal 20.

[28] FIG. 2 is a block diagram showing the detailed construction of the video synthesis call service provision apparatus 30 of FIG. 1.

[29] Referring to FIG. 2, the video synthesis call service provision apparatus 30 includes a video call network cooperation unit 31, a video call voice extraction unit 32, a voice recognition processing unit 33, a sentence word processing unit 34, an image/video search unit 35, an image/video synthesis unit 36 and an image/video database (DB) 37.

[30] The video call network cooperation unit 31 functions to receive a video call signal transmitted between the calling video terminal 10 and the called video terminal 20 in cooperation with the video call network and transmits the received video call signal to the video call voice extraction unit 32, and also functions to transmit a video call signal synthesized by the image/video synthesis unit 36 to the calling or called video terminal over the video call network.

[31] The video call voice extraction unit 32 functions to extract a voice signal from the video call signal received from the video call network cooperation unit 31 by separating the video call signal into a video signal and the voice signal, and to transmit the extracted voice signal to the voice recognition processing unit 33.

[32] The voice recognition processing unit 33 functions to convert the voice signal, received from the video call voice extraction unit 32, into a sentence and to transmit the sentence to the sentence word processing unit 34.

[33] The sentence word processing unit 34 functions to extract one or more words from the sentence, received from the voice recognition processing unit 33, by separating the sentence into one or more words, and to transmit the extracted words to the image/video search unit 35.

[34] The image/video search unit 35 functions to search the image/video DB 37 for images or video corresponding to the words received from the sentence word processing unit 34, and transmit found images or video corresponding to the extracted words to the image/video synthesis unit 36.

[35] The image/video synthesis unit 36 functions to synthesize the images or video, received from the image/video search unit 35, with the video call signal transmitted between the users of the calling video terminal and the called video terminal, and to transmit a video call signal synthesized with the images or video to the video call network cooperation unit 31. That is, the video call signal transmitted between the users of the calling video terminal 10 and the called video terminal 20 is synthesized with the found images or video, and thus the synthesized video call signal is transmitted to either or both of the calling video terminal 10 and the called video terminal 20.

[36] The image/video DB 37 stores in advance words frequently used by typical users (for example, "love", "drink", "meal", "home", etc.) and images or video corresponding to the words, thus enabling the image/video search unit 35 to search for the images or video corresponding to the words received from the sentence word processing unit 34.

[37] FIG. 3 is a flowchart showing the procedure of the video synthesis call service described with reference to FIGS. 1 and 2.

[38] Referring to FIG. 3, when the calling video terminal 10 makes a video call to the called video terminal 20 over the video call network, the video call voice extraction unit 32 of the video synthesis call service provision apparatus 30 extracts only a voice signal from a video call signal at step S301. When the extracted voice signal passes through the voice recognition processing unit 33 and the sentence word processing unit 34, words contained in the voice signal are extracted from the voice signal at step S303. The image/video search unit 35 searches the image/video DB 37 for images or video corresponding to the extracted words at step S305, and transmits the images or video, corresponding to the extracted words and found in the search, to the image/video synthesis unit 36. The image/video synthesis unit 36 synthesizes the received images or video with the video call signal transmitted between the calling video terminal 10 and the called video terminal 20 at step S307, and transmits the video call signal synthesized with the images or video to the video call network cooperation unit 31. The video call network cooperation unit 31 transmits the video call signal, synthesized with the images or video, to either or both of the calling video terminal 10 and the called video terminal 20 at step S309.
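The sequence of steps S301 to S309 may be sketched as follows, purely for illustration. The `VideoCallSignal` type, the `recognize` callable standing in for the voice recognition processing unit, the dictionary standing in for the image/video DB, and the media file names are assumptions that do not appear in the disclosure. Audio and video payloads are modeled as plain strings, and synthesis is modeled as attaching a list of overlays rather than modifying actual video frames.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class VideoCallSignal:
    video: str                      # placeholder for the video payload
    audio: str                      # placeholder for the voice payload
    overlays: List[str] = field(default_factory=list)


def provide_synthesis_service(
    signal: VideoCallSignal,
    recognize: Callable[[str], str],
    media_db: Dict[str, str],
) -> VideoCallSignal:
    voice = signal.audio                                     # S301: extract voice signal
    sentence = recognize(voice)                              # S303: voice -> sentence
    words = sentence.lower().rstrip(".!?").split()           # S303: sentence -> words
    found = [media_db[w] for w in words if w in media_db]    # S305: search stored media
    synthesized = VideoCallSignal(                           # S307: synthesize (overlay)
        signal.video, signal.audio, signal.overlays + found
    )
    return synthesized                                       # S309: caller transmits this


# Usage with a stub recognizer and a two-entry database (both hypothetical).
db = {"home": "home.png", "beer": "beer.gif"}
incoming = VideoCallSignal(video="<frames>", audio="<pcm>")
outgoing = provide_synthesis_service(
    incoming, lambda audio: "Let's drink beer at home today.", db
)
```

The function returns a new signal instead of mutating the received one, so the original video call signal remains available if synthesis is to be applied for only one of the two terminals.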

[39] FIG. 4 is a diagram showing a method of extracting words contained in the voice signal in FIG. 3.

[40] FIG. 4 illustrates the method (step S303) of extracting the words contained in the voice signal in FIG. 3. When the voice signal, extracted by the video call voice extraction unit 32, is transmitted to the voice recognition processing unit 33, the voice recognition processing unit 33 converts the extracted voice signal into a sentence at step S401, and transmits the sentence to the sentence word processing unit 34. Here, the term 'sentence' means text containing one or more words. Meanwhile, voice recognition technology, used in the procedure for converting a voice signal into a sentence in the present invention, may be implemented using conventional well-known technology. The present invention is not intended to propose voice recognition technology itself, and thus a detailed description thereof is omitted. The sentence word processing unit 34 extracts one or more words from the sentence received from the voice recognition processing unit 33 by separating the sentence into one or more words at step S403, and transmits the extracted words to the image/video search unit 35. Thereafter, step S305 and the subsequent steps described above are performed.
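Step S403 can be illustrated with a minimal sketch that separates a recognized sentence into words. The function name `extract_words` and the choice of a regular expression over lowercase letters and apostrophes are assumptions made for this example; the disclosure does not specify how the sentence is separated, and a practical system would likely need language-specific tokenization.

```python
import re
from typing import List


def extract_words(sentence: str) -> List[str]:
    """Separate a recognized sentence into lowercase words (step S403, sketched)."""
    # Runs of letters and apostrophes are treated as words; punctuation is dropped.
    return re.findall(r"[a-z']+", sentence.lower())


words = extract_words("Let's drink beer at home today.")
# words == ["let's", "drink", "beer", "at", "home", "today"]
```

All words are returned here, including ones such as "at" that are unlikely to have corresponding images; filtering is left to the subsequent search step.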

[41] Meanwhile, an example used to describe steps S401 and S403 is shown in the lower portion of FIG. 4. That is, when the voice signal is extracted and passes through the voice recognition processing unit 33, the voice signal is converted into the sentence "Let's drink beer at home today." In this case, the voice recognition processing unit 33 converts voices into the sentence using voice recognition technology. The sentence word processing unit 34 receives the sentence and separates the sentence into one or more words, thus extracting the words from the sentence. That is, in the above example, the sentence "Let's drink beer at home today" is separated into words, such as "today", "home", and "beer", and thus the words are extracted.

[42] FIG. 5 is a diagram showing a method of searching for images or video corresponding to the extracted words in FIG. 3.

[43] FIG. 5 illustrates a method (step S305) of searching for images or video corresponding to the extracted words in FIG. 3. When the words contained in the voice signal are extracted from the voice signal at step S303, the image/video search unit 35 searches the image/video DB 37 for the images or video corresponding to the extracted words at step S501. The images or video, corresponding to the extracted words and found in the search, are transmitted to the image/video synthesis unit 36 at step S503.

[44] Meanwhile, an example used to describe steps S501 and S503 is shown in the lower portion of FIG. 5. That is, when the extracted words, such as "today", "home", and "beer", are transmitted to the image/video search unit 35, the image/video search unit 35 searches the image/video DB 37 for images or video corresponding to respective words, such as "today", "home", and "beer." It is determined that images or video corresponding to the words "home" and "beer" exist in the image/video DB 37 shown in the lower portion of FIG. 5, but no image or video corresponding to the word "today" exists in the DB. Therefore, the image/video search unit 35 transmits images or video corresponding to the words "home" and "beer" to the image/video synthesis unit 36.
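The search of steps S501 and S503 can be sketched as a dictionary lookup, under the assumption that the image/video DB maps each stored word to one media item. The dictionary `IMAGE_VIDEO_DB`, the function name `search_media`, and the file names are hypothetical and used only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical contents of the image/video DB: word -> stored media identifier.
IMAGE_VIDEO_DB: Dict[str, str] = {
    "love": "love.gif",
    "home": "home.png",
    "beer": "beer.gif",
}


def search_media(words: List[str], db: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (word, media) pairs for the words that have media stored in the DB."""
    return [(word, db[word]) for word in words if word in db]


found = search_media(["today", "home", "beer"], IMAGE_VIDEO_DB)
# "today" has no entry, so only "home" and "beer" are passed on for synthesis.
```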

[45] Meanwhile, the image/video synthesis unit 36 synthesizes the received images or video with the video call signal transmitted between the calling video terminal 10 and the called video terminal 20, and transmits the synthesized video call signal to either or both of the calling video terminal 10 and the called video terminal 20.
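The synthesis (overlay) operation itself is not detailed in the disclosure. As one possible illustration, the following sketch pastes a small image onto a single video frame, with both modeled as nested lists of integer pixel values. The `overlay` function, its `top` and `left` placement parameters, and the opaque-paste behavior are assumptions; an actual implementation would operate on decoded video frames and might blend rather than replace pixels.

```python
from typing import List

Frame = List[List[int]]  # rows of pixel values (simplified, single channel)


def overlay(frame: Frame, image: Frame, top: int, left: int) -> Frame:
    """Return a copy of `frame` with `image` pasted at (top, left), clipped to the frame."""
    result = [row[:] for row in frame]  # copy so the input frame is not modified
    for dy, image_row in enumerate(image):
        for dx, pixel in enumerate(image_row):
            y, x = top + dy, left + dx
            if 0 <= y < len(result) and 0 <= x < len(result[y]):
                result[y][x] = pixel
    return result


frame = [[0] * 4 for _ in range(3)]
sticker = [[7, 7], [7, 7]]
synthesized = overlay(frame, sticker, top=1, left=2)
clipped = overlay(frame, sticker, top=2, left=3)
```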

[46] FIG. 6 is a diagram showing the detailed construction of a video terminal capable of providing a video synthesis call service using voice recognition according to another embodiment of the present invention. Unlike the above embodiment, described with reference to FIGS. 1 to 3, FIG. 6 illustrates a block diagram showing the construction of a video terminal capable of performing a video synthesis call service without requiring a separate video synthesis call service provision apparatus.

[47] Referring to FIG. 6, a video terminal 60 for performing a video synthesis call service using voice recognition includes a video call reception unit 61, a video call voice extraction unit 62, a voice recognition processing unit 63, a sentence word processing unit 64, an image/video search unit 65, an image/video synthesis unit 66, a video/voice communication unit 67, an image/video database (DB) 69, and a download processing unit 68.

[48] The video call reception unit 61 functions to receive a video call signal from the user of the video terminal and transmit the received video call signal to the video call voice extraction unit 62. The video call voice extraction unit 62 functions to extract only a voice signal from the video call signal, received from the video call reception unit 61, by separating the video call signal into a video signal and the voice signal, and to transmit the extracted voice signal to the voice recognition processing unit 63. The voice recognition processing unit 63 functions to convert the voice signal, received from the video call voice extraction unit 62, into a sentence, and to transmit the sentence to the sentence word processing unit 64. The sentence word processing unit 64 functions to extract one or more words from the sentence, received from the voice recognition processing unit 63, by separating the sentence into one or more words, and to transmit the extracted words to the image/video search unit 65. The image/video search unit 65 functions to search the image/video DB 69 for images or video corresponding to the words received from the sentence word processing unit 64, and to transmit found images or video corresponding to the extracted words to the image/video synthesis unit 66. The image/video synthesis unit 66 functions to synthesize the images or video, received from the image/video search unit 65, with the video call signal transmitted between the users of the calling video terminal and the called video terminal, and to transmit the video call signal synthesized with the images or video to the video/voice communication unit 67. The image/video DB 69 functions to store in advance words frequently used by typical users, and images or video corresponding to the words, thus enabling the image/video search unit 65 to search for images or video corresponding to the words received from the sentence word processing unit 64. The download processing unit 68 functions to download images or video from an external image/video provision device, and to transmit the images or video to the image/video DB 69. Here, the external image/video provision device may be a specific server, a Personal Computer (PC), etc.

[49] Meanwhile, when all call content is analyzed in real time from video call signals transmitted between the calling video terminal 10 and the called video terminal 20 and thus the search and comparison are performed on the call content with reference to the image/video DB 37 in the above embodiments, there may be problems in that the communication quality of a video call may be deteriorated or the results of video synthesis may be transmitted late due to the delay of processing time. In order to solve these problems, a method of extracting voice signals from video call signals at regular periods, rather than extracting voice signals from complete video call signals, may be used. For example, when the period is set to 5 or 10 seconds, the method can be designated to perform a series of procedures, such as procedures for extracting a voice signal, converting the voice signal into a sentence, and extracting words from the sentence at each period during the entire video call conversation. Further, a method of extracting only voice signals, the amplitudes of which are greater than a predetermined level, among voice signals contained in video call signals transmitted between the calling video terminal 10 and the called video terminal 20, may be used. This is used to prevent the above problems by synthesizing images or video corresponding to relevant words with video call signals only when the amplitudes of voice signals are greater than a predetermined level, because a talking person generally has a tendency to pronounce important words in a louder voice than typical words so as to emphasize the important words in content desired to be transmitted.

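The two load-reducing methods described above, periodic extraction and amplitude-based extraction, can be sketched together as follows. The disclosure presents them as separate options; combining them in one function is an assumption made for brevity. The function names, the `window_s` parameter (how much audio is examined at each period), and the use of peak amplitude as the measure compared against the predetermined level are likewise hypothetical.

```python
from typing import List, Sequence, Tuple


def peak_amplitude(samples: Sequence[float]) -> float:
    """Largest absolute sample value, or 0.0 for an empty window."""
    return max((abs(s) for s in samples), default=0.0)


def select_windows(
    samples: Sequence[float],
    sample_rate: int,
    period_s: float,
    window_s: float,
    threshold: float,
) -> List[Tuple[float, List[float]]]:
    """Pick one window per period and keep it only if it is loud enough.

    Returns (start_time_in_seconds, window_samples) pairs to pass on to
    voice recognition; all other audio is skipped.
    """
    step = int(period_s * sample_rate)
    width = int(window_s * sample_rate)
    if step <= 0 or width <= 0:
        raise ValueError("period_s and window_s must cover at least one sample")
    selected = []
    for start in range(0, len(samples), step):
        window = list(samples[start:start + width])
        if peak_amplitude(window) > threshold:
            selected.append((start / sample_rate, window))
    return selected


# Usage: 10 seconds of audio at 1 sample/second, checked every 5 seconds.
audio = [0.1, 0.1, 0.0, 0.0, 0.0, 0.9, 0.2, 0.0, 0.0, 0.0]
loud = select_windows(audio, sample_rate=1, period_s=5, window_s=2, threshold=0.5)
```

Only the window starting at 5 seconds exceeds the threshold in this example, so only that window would be converted into a sentence and searched.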
[50] Although the construction of the present invention has been disclosed with reference to preferred embodiments of the present invention, those skilled in the art will appreciate that the present invention is not limited to the embodiments, and that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims and the attached drawings.

Claims

[1] A method of providing a video synthesis (video overlay) call service using voice recognition in an apparatus for providing a video synthesis call service using voice recognition, the apparatus being connected to a calling video terminal and a called video terminal through a video call network, comprising the steps of: extracting a voice signal from a video call signal transmitted between the calling video terminal and the called video terminal; extracting one or more words from the extracted voice signal; searching for images or video corresponding to the extracted words; synthesizing (overlaying) the found images or video with the video call signal transmitted between the calling video terminal and the called video terminal; and transmitting the synthesized (overlaid) video call signal to at least one of the calling video terminal and the called video terminal.

[2] The method according to claim 1, wherein the step of extracting the words from the extracted voice signal is performed to convert the extracted voice signal into a sentence and to extract one or more words from the sentence by separating the sentence into one or more words.

[3] The method according to claim 1, wherein, at the step of searching for images or video corresponding to the extracted words, the images or video are stored in advance to correspond to respective words.

[4] The method according to claim 1, wherein the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal is performed at regular periods.

[5] The method according to claim 1, wherein the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal is performed only when an amplitude of the voice signal is greater than a predetermined level.

[6] An apparatus for providing a video synthesis call service, the apparatus being connected to a calling video terminal and a called video terminal through a video call network and configured to provide the video synthesis call service, comprising: a video call network cooperation unit for receiving a video call signal, transmitted between the calling video terminal and the called video terminal, in cooperation with the video call network; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call network cooperation unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence, converted by the voice recognition processing unit, by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting images or video corresponding to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal transmitted between the calling video terminal and the called video terminal, and transmitting the video call signal synthesized with the images or video to the video call network cooperation unit, wherein the video call network cooperation unit transmits the video call signal synthesized with the images or video to at least one of the calling video terminal and the called video terminal.

[7] The apparatus according to claim 6, further comprising an image/video database for storing images or video corresponding to one or more words.

[8] A video terminal for providing a video synthesis call service through a video call network, comprising: a video call signal reception unit for receiving a video call signal; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call signal reception unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence output from the voice recognition processing unit by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting images or video corresponding to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal, and transmitting the video call signal synthesized with the images or video to a video/voice communication unit, wherein the video/voice communication unit receives the video call signal synthesized with the images or video from the image/video synthesis unit, and transmits the synthesized video call signal to a video terminal of another party.

[9] The video terminal according to claim 8, further comprising an image/video database for storing images or video corresponding to one or more words.

[10] The video terminal according to claim 9, further comprising a download processing unit for downloading images or video from an external image/video provision device, and transmitting the images or video to the image/video database.